Creation of Pandas DataFrame in Python with Examples

Python is a general-purpose language that can work with various types of data structures and solve data-related problems with the help of community-built libraries. One such way of handling data in Python is to use DataFrames.

In this article, we will discuss what the DataFrame structure is, how to create a DataFrame in Python, and how to apply it so that you can make the most of this structure.

What is a DataFrame in Python?

A DataFrame in Python is a two-dimensional data structure that can store heterogeneous data.

You can consider it similar to an Excel spreadsheet or a table in an SQL database. It is a basic data form that contains rows, columns, and data values.

The figure below presents an example of a simple DataFrame.

On the left are the row labels, indexed from 0 to 4. Across the top are the column names. Each cell holds the value for a particular row in a particular column.

Figure 1 – Example of a Pandas DataFrame (Source)
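To make the rows, columns, and values described above concrete, here is a minimal sketch. The sample names, ages, and cities are hypothetical and are not the data shown in the figure.

```python
import pandas as pd

# Hypothetical sample data: five rows and three named columns.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara", "Dev", "Eve"],
    "age": [28, 34, 23, 41, 30],
    "city": ["Oslo", "Rome", "Lima", "Pune", "Kyiv"],
})

print(df.shape)          # (number of rows, number of columns)
print(list(df.index))    # default row labels, starting at 0
print(list(df.columns))  # column names
print(df.dtypes)         # each column keeps its own data type
```

Because every column carries its own data type, text and numbers can sit side by side in the same table.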

With DataFrames, you can query data by rows or columns. You can also select a subset of a DataFrame by using functions.

When working with larger data sets, especially web-based data, Python libraries alone may not be enough. In such cases, you can use tools dedicated to web data extraction. For instance, ZenRows provides a platform that simplifies web data extraction with dynamic web scraping in Python. With it, you can fetch the relevant data and structure it in your desired format, such as a DataFrame, for further analysis and manipulation.

Now that we have some idea of what a DataFrame in Python looks like, let us see how to create one.
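Returning to the idea of querying by rows or columns, here is a minimal sketch of a few common approaches. The product and price values are hypothetical, and these are only some of the selection tools Pandas offers.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "product": ["pen", "book", "lamp", "desk"],
    "price": [2, 12, 30, 150],
})

first_row = df.iloc[0]        # select a row by integer position
row_label_2 = df.loc[2]       # select a row by its index label
prices = df["price"]          # select a single column (a Series)
cheap = df[df["price"] < 20]  # keep only rows that satisfy a condition

print(first_row)
print(cheap)
```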

How to Install Pandas in Python

To work with our DataFrames in Python, we need to use a library called Pandas. It is a popular data processing library favored by millions of users. As per its official definition, “Pandas is a fast, powerful, flexible, and easy to use open-source data analysis and manipulation tool, built on top of the Python programming language.”

Pandas is written mainly in Python, with performance-critical parts in Cython and C, and it is free and open source. You can either download the library from here or install it with the Python package manager, pip. In this example, we are using Windows. However, you can use pip on macOS and Linux as well.

The following command will get it installed on your machine:

pip install pandas

When you run this command, pip installs Pandas along with the dependencies it requires, such as NumPy.

Figure 2 – Installing Pandas on the local machine
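One quick way to confirm that the installation worked is to import the libraries and print their versions. This is a small sketch. The version numbers you see depend on what pip installed on your machine.

```python
import numpy as np
import pandas as pd

# If both imports succeed, the installation worked.
print("pandas version:", pd.__version__)
print("NumPy version:", np.__version__)
```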

After installing the library, you can start using it. For demonstration purposes, I am going to use Jupyter Notebooks within Visual Studio. Setting up a Python notebook is outside this article’s scope, but you can follow this tutorial and configure Jupyter Notebooks on your machine.

Create Pandas DataFrame from List

Pandas provides the DataFrame constructor, which can create a DataFrame from a list of items. Let us see how to do this in practice.

Figure 3 – Creating a Pandas DataFrame from list

As you can see in the figure above, we have imported the Pandas library and provided an alias ‘pd’ to it. This is usually done so that the library can be referred to by the alias in the code blocks.

In the next step, we create a list in Python that stores some string values.

Finally, we pass the list of values to the Pandas DataFrame constructor, which converts it into a DataFrame, and print out the result.

Notice that the column name in the DataFrame is 0 because we have not provided any name yet. Also, pay attention to using the alias to call the Pandas constructor.
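Since the original code appears only in the figure, here is a minimal sketch of the same steps. The list values are hypothetical. The second call uses the constructor's `columns` parameter to give the column a name instead of the default 0.

```python
import pandas as pd

# Hypothetical list of string values.
fruits = ["apple", "banana", "cherry"]

# Without column names, the single column is labeled 0.
df = pd.DataFrame(fruits)
print(df)

# Passing columns= assigns a name to the column.
named = pd.DataFrame(fruits, columns=["fruit"])
print(named)
```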

Pandas DataFrame from JSON

To illustrate how to convert JSON to a Pandas DataFrame, we are going to define a dictionary in Python, a structure that closely mirrors a JSON object, containing multiple lists. Each list is associated with a key, and that key becomes the column name.

Figure 4 – Creating a DataFrame from a dictionary object in Python

We have created a dictionary object and provided the lists of students, subjects, and marks within it. These names get converted into column names. The values in the lists get converted into rows in the DataFrame.
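The steps above can be sketched as follows. The column names and values are assumptions chosen to match the description, not the exact data from the figure. The last two lines show one possible way to start from an actual JSON string by parsing it with the standard `json` module first.

```python
import json

import pandas as pd

# Hypothetical data: each key becomes a column, each list becomes its values.
data = {
    "Student": ["Asha", "Ben", "Chen"],
    "Subject": ["Math", "Physics", "Biology"],
    "Marks": [88, 92, 79],
}
df = pd.DataFrame(data)
print(df)

# Starting from a JSON string: parse it into a dictionary, then build the DataFrame.
json_text = json.dumps(data)
df_from_json = pd.DataFrame(json.loads(json_text))
```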

Additionally, Pandas allows you to select a subset of the data from a DataFrame. For example, the DataFrame has three columns, but you may choose to display only two of them. This is useful when you have a large DataFrame but need to show only part of it for data analysis purposes.

Figure 5 – Selecting specified columns from the dataset

As you can see in the figure above, we have specified the column names within double square brackets. The outer pair is the indexing operator and the inner pair is a Python list of column names, so Pandas returns only those two columns. This is how to turn JSON-like data into a DataFrame and select from it.
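Here is a minimal sketch of column selection, using hypothetical student data. It also shows the difference between passing a list of names and passing a single name.

```python
import pandas as pd

# Hypothetical data with three columns.
df = pd.DataFrame({
    "Student": ["Asha", "Ben", "Chen"],
    "Subject": ["Math", "Physics", "Biology"],
    "Marks": [88, 92, 79],
})

# A list of column names returns a DataFrame with just those columns.
subset = df[["Student", "Marks"]]
print(subset)

# A single column name returns a Series instead.
single = df["Student"]
print(type(single).__name__)
```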

Export Pandas DataFrame to CSV

One of the most common use cases with Python DataFrames is importing and exporting data to and from files (mostly CSV and TXT).

Once you have the data in your DataFrame, you can easily export that Python DataFrame to CSV. Alternatively, if you have data in a CSV file, you can read it and create a DataFrame from it.

Figure 6 – Writing data from a DataFrame to a CSV

We have exported data using the to_csv method of the DataFrame object. You call it on the DataFrame and provide the filename as a parameter. The second parameter, index, is optional. It controls whether the row index is written to the CSV file.

Similarly, using the read_csv function, you can read data from a CSV file and store it in a Pandas DataFrame.
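Here is a minimal sketch of the export step. The data and the file name `students.csv` are hypothetical, and the file is written to a temporary directory so the example does not clutter your working folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data to export.
df = pd.DataFrame({
    "Student": ["Asha", "Ben", "Chen"],
    "Marks": [88, 92, 79],
})

# Write to a temporary directory; index=False leaves out the row index.
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "students.csv")
df.to_csv(csv_path, index=False)

# Read the raw file back to inspect what was written.
with open(csv_path) as f:
    csv_text = f.read()
print(csv_text)
```

With `index=True`, which is the default, the file would contain an extra leading column holding the row labels.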

Figure 7 – Reading data from a CSV file
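Here is a minimal sketch of the import step. To keep it self-contained, the CSV content is a hypothetical in-memory string wrapped in `io.StringIO`. In practice, you would pass a file path to `read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical CSV content; a real script would pass a file path instead.
csv_text = (
    "Student,Subject,Marks\n"
    "Asha,Math,88\n"
    "Ben,Physics,92\n"
    "Chen,Biology,79\n"
)

# read_csv uses the first line as the header row by default.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
print(df.dtypes)  # Marks is parsed as a numeric column
```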

You can find the entire notebook on GitHub.

Conclusion

Python is an extremely popular general-purpose programming language that is also widely used for data science and engineering. It is one of the most common programming languages for working with data.

The specialized data structure called a DataFrame helps programmers work with tabular data in Python. You can create a DataFrame, apply transformations such as JOINs and filters, and combine two or more DataFrames vertically. All these operations on DataFrames are performed with the Pandas library.

To learn more about DataFrames and Pandas in general, refer to the official Pandas documentation. You can also read about how to connect Python to SQL Server.
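As a brief sketch of the operations mentioned here, the following uses `merge` for a join, a boolean condition for a filter, and `pd.concat` for vertical combination. The tables and column names are hypothetical.

```python
import pandas as pd

# Hypothetical tables that share a student_id column.
students = pd.DataFrame({"student_id": [1, 2, 3], "name": ["Asha", "Ben", "Chen"]})
marks = pd.DataFrame({"student_id": [1, 2, 3], "marks": [88, 92, 79]})

# JOIN: match rows from both tables on the shared key.
joined = students.merge(marks, on="student_id", how="inner")

# Filter: keep only the rows where the condition holds.
top = joined[joined["marks"] >= 85]

# Vertical combination: stack rows from a second DataFrame below the first.
more = pd.DataFrame({"student_id": [4], "name": ["Dana"]})
all_students = pd.concat([students, more], ignore_index=True)

print(joined)
print(top)
print(all_students)
```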

Last modified: August 29, 2023