Python is a general-purpose language that can work with various types of data structures and solve data-related problems with the help of community-built libraries. One such way of handling data in Python is using DataFrames.
In this article, we will discuss what the DataFrame structure is, how to create a DataFrame in Python, and how to apply it so that you can make the most of this structure.
What is DataFrame in Python?
DataFrame in Python is a two-dimensional data structure that can store heterogeneous data.
You can consider it similar to an Excel spreadsheet or a table in an SQL database. It is a basic data form that contains rows, columns, and data values.
The figure below presents an example of a simple DataFrame.
On the left, it has rows, indexed from 0 to 4. Across the top are the columns, each with its own name. Every cell holds the value for one particular row and column.
With DataFrames, you can query data by rows or columns. You can also select a subset of the DataFrame by using functions.
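As a rough sketch of what such a table looks like in code, the snippet below builds a five-row DataFrame with the Pandas library (introduced in the next section) and then selects data by column, by row, and by condition. The column names and values are made up for illustration and are not taken from the original figure.

```python
import pandas as pd

# Hypothetical sample data: five rows, indexed 0 to 4 by default.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee", "Eve"],
    "age": [23, 35, 41, 29, 52],
})

ages = df["age"]              # select one column (returned as a Series)
third_row = df.loc[2]         # select the row whose index label is 2
over_30 = df[df["age"] > 30]  # keep only the rows that satisfy a condition

print(df)
print(over_30)
```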
Now that we have some idea of what a DataFrame in Python looks like, let us see how we can create one.
How to Install Pandas in Python
To work with our DataFrames in Python, we need to use a library called Pandas. It is a popular data processing library favored by millions of users. As per its official definition, “Pandas is a fast, powerful, flexible, and easy to use open-source data analysis and manipulation tool, built on top of the Python programming language.”
It is written mostly in Python, with performance-critical parts in Cython and C, and it is free for everyone to use. You can either download the library from here or install it with the Python package manager, pip. In this example, we are dealing with Windows OS. However, you can use pip on macOS and Linux as well.
The following command will get it installed on your machine:
pip install pandas
As soon as you run this command, it will install Pandas and other dependencies required to handle it, e.g. NumPy.
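One quick way to confirm that the installation worked is to import both libraries and print their versions. This is only a sanity check; the version numbers you see will depend on your environment.

```python
import numpy as np
import pandas as pd

# If either import fails, the installation did not complete.
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
```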
After installing the library, you can start using it. For demonstration purposes, I am going to use Jupyter Notebooks within Visual Studio. Setting up a Python notebook is outside this article’s scope, but you can follow this tutorial and configure Jupyter Notebooks on your machine.
Create Pandas DataFrame from List
Pandas provides a DataFrame constructor that can create DataFrames from a list of items. Let us see how to do this in practice.
As you can see in the figure above, we have imported the Pandas library and provided an alias ‘pd’ to it. This is usually done so that the library can be referred to by the alias in the code blocks.
In the next step, we create a list in Python that stores some string values.
Finally, we pass the list of values to the Pandas DataFrame constructor, which converts it into a DataFrame, and print out the result.
Notice that the column name in the DataFrame is 0 because we have not provided any name yet. Also, pay attention to using the alias to call the Pandas constructor.
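The steps described above can be sketched roughly as follows. The original list values appear only in the figure, so the strings below are placeholders. The second call, which passes the `columns` argument, is an addition that shows one way to replace the default column name 0.

```python
import pandas as pd

# Placeholder values; any list of strings works the same way.
languages = ["Python", "Java", "Go"]

# Without column names, Pandas labels the single column 0.
df = pd.DataFrame(languages)
print(df)

# Passing `columns` gives the column a descriptive name instead.
named = pd.DataFrame(languages, columns=["language"])
print(named)
```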
Pandas DataFrame from JSON
To illustrate how to convert JSON-style data to a Pandas DataFrame, we are going to define a dictionary in Python that contains multiple lists. Each list has a key associated with it, and this key becomes the column name.
We have created a dictionary object and provided the lists of students, subjects, and marks within it. These names get converted into column names. The values in the lists get converted into rows in the DataFrame.
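A minimal sketch of this step is shown below. The keys match the ones named in the text, but the values are invented for illustration. The last part assumes the data starts out as a JSON string and uses the standard `json` module to parse it into a dictionary first.

```python
import json

import pandas as pd

# Each key becomes a column name; each list supplies that column's values.
data = {
    "students": ["Asha", "Ben", "Chen"],
    "subjects": ["Math", "Physics", "Biology"],
    "marks": [88, 92, 79],
}
df = pd.DataFrame(data)
print(df)

# A JSON string with the same shape can be parsed into a dictionary first.
json_text = '{"students": ["Asha", "Ben"], "marks": [88, 92]}'
df_from_json = pd.DataFrame(json.loads(json_text))
print(df_from_json)
```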
Additionally, Pandas allows selecting a subset of the data from the DataFrame. For example, the DataFrame has three columns, but you may choose to display only two of them. This is useful when you have a large DataFrame but need to show only a part of it for data analysis purposes.
As you can see in the figure above, we have specified the column names within double square brackets. This tells Pandas to select the data from those two columns and display it. This is how to turn JSON-style data into a DataFrame.
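The column selection can be sketched like this, again with invented values. The outer brackets are the indexing operator and the inner brackets form a list of column names, which is why the result is a DataFrame rather than a single Series.

```python
import pandas as pd

df = pd.DataFrame({
    "students": ["Asha", "Ben", "Chen"],
    "subjects": ["Math", "Physics", "Biology"],
    "marks": [88, 92, 79],
})

# A list of column names inside the indexing brackets returns a DataFrame.
subset = df[["students", "marks"]]
print(subset)

# A single name without the inner list returns a Series instead.
marks_only = df["marks"]
print(type(marks_only).__name__)
```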
Export Pandas DataFrame to CSV
One of the most common use cases with Python DataFrames is importing and exporting data to and from files (mostly CSV and TXT).
Once you have the data in your DataFrame, you can easily export that Python DataFrame to CSV. Alternatively, if you have data in a CSV file, you can read it and create a DataFrame from it.
We have exported data using the to_csv method provided by the Pandas library. You call it on the DataFrame object and provide the filename as a parameter. The second parameter, index, is optional. It controls whether the row indices are written to the CSV file.
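An export along these lines might look like the sketch below. The file name `students.csv` and the data are assumptions, and the file is written to the system's temporary directory so the example does not clutter your working folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    "students": ["Asha", "Ben", "Chen"],
    "subjects": ["Math", "Physics", "Biology"],
    "marks": [88, 92, 79],
})

# Hypothetical output location in the temporary directory.
out_path = Path(tempfile.gettempdir()) / "students.csv"

# index=False leaves the row indices (0, 1, 2) out of the file.
df.to_csv(out_path, index=False)

print(out_path.read_text())
```

With `index=True`, which is the default, the file would start with an extra unnamed column holding the row indices.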
Similarly, using the read_csv method, you can read data from a CSV file and store it in a Python Pandas DataFrame.
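Reading works in the opposite direction. To keep the sketch self-contained, the CSV content below is held in a string and wrapped in `io.StringIO`, which stands in for a file on disk. In practice you would usually pass a file path to `read_csv` instead.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a file on disk.
csv_text = "students,subjects,marks\nAsha,Math,88\nBen,Physics,92\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)  # the marks column is parsed as integers
```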
You can find the entire notebook on GitHub.
Python is an extremely popular general-purpose programming language that is also widely used for Data Science and Engineering. It is one of the most common programming languages for working with data.
The specialized data structure called DataFrame helps programmers work with tabular data in Python. You can create a DataFrame, apply transformations such as joins and filters, and also combine two or more DataFrames vertically. All these operations on DataFrames are performed with the Pandas library in Python.
Last modified: October 07, 2022