Python is a multi-purpose programming language that Guido van Rossum began developing in the late 1980s. Today, it is widely used in software engineering across many domains, including web development, application development, IoT, Data Science, Machine Learning, and Artificial Intelligence. One of the reasons why Python is so popular is how simple it is to write code in it.
Python also has a huge community of developers who create libraries, maintain projects, and write documentation. It is one of the most active programming communities, with a large amount of documentation available online. According to the Stack Overflow Developer Survey 2020, Python topped the list of the most wanted programming languages.
Being a general-purpose language, Python is also extensively used across the Data Science and Machine Learning domains alongside R (a statistical language). It has the capabilities to work with various types of data structures and solve data-related problems with the help of community-built libraries.
One way of handling data in Python is to use DataFrames. In this article, we will talk about what DataFrames are, how to create them, and how to apply them. Let’s get started.
What is a DataFrame in Python?
DataFrames in Python are two-dimensional data structures that can store heterogeneous data, meaning that each column can hold a different data type.
You can consider a DataFrame similar to an Excel spreadsheet or a table in an SQL database. It is a basic data form that contains rows, columns, and data values.
The figure below presents an example of a simple DataFrame.
On the left are the rows, indexed from 0 to 4. Across the top are the columns, each of which has a name. Each cell holds the value that belongs to a particular row and column.
With DataFrames, you can query data by rows or columns. You can also select a subset of the DataFrame by using functions.
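The structure described above can be sketched in a few lines. This is a minimal illustration, assuming Pandas is already installed; the column names and values are made up for the example and do not come from the original figure.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cara", "Dan", "Eve"],
        "age": [23, 31, 27, 45, 38],
        "city": ["Oslo", "Lima", "Rome", "Pune", "Kyiv"],
    }
)

print(df.index.tolist())    # row labels: [0, 1, 2, 3, 4]
print(df.columns.tolist())  # column names: ['name', 'age', 'city']

row = df.loc[2]              # one row, selected by its index label
ages = df["age"]             # one column, selected by its name
older = df[df["age"] > 30]   # subset of rows that satisfy a condition
print(older)
```

Here, df.loc selects by row label, square brackets with a column name select a column, and a boolean condition inside the brackets keeps only the matching rows.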
Now that we have some idea of what a DataFrame in Python looks like, let us see how we can create a Pandas DataFrame.
How to Install Pandas in Python
To work with our DataFrames in Python, we need to use a library called Pandas. It is a popular data processing library favored by millions of users. As per its official definition, “Pandas is a fast, powerful, flexible, and easy to use open-source data analysis and manipulation tool, built on top of the Python programming language.”
Pandas is written mainly in Python, with performance-critical parts implemented in Cython and C, and it is free and open source. You can either download the library from here or install it with the Python package manager, pip. The following command will get it installed on your machine:
pip install pandas
Running this command installs Pandas together with the dependencies it requires, such as NumPy.
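A quick way to check that the installation worked is to import both packages and print their versions. This is only a sanity check; the version numbers you see depend on your environment.

```python
import numpy as np
import pandas as pd

# If both imports succeed, Pandas and its NumPy dependency are installed.
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
```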
After installing the library, you can start using it. For demonstration purposes, I am going to use Jupyter Notebooks within Visual Studio. Setting up a Python notebook is outside this article’s scope, but you can follow this tutorial and configure Jupyter Notebooks on your machine.
Create Pandas DataFrame from List
There are multiple ways to create a DataFrame. The simplest is the default DataFrame constructor, which can create a Pandas DataFrame from a list of items. Let us see how to do this in practice.
As you can see in the figure above, we have imported the Pandas library and provided an alias ‘pd’ to it. This is usually done so that the library can be referred to by the alias in the code blocks.
In the next step, we create a list in Python that stores some string values.
Finally, we use the DataFrame constructor from the Pandas library and pass the list of values. This converts the list into a DataFrame and prints out the result.
Notice that the column name in the DataFrame is 0 because we have not provided any name yet. Also, pay attention to using the alias to call the Pandas constructor.
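The steps described above can be sketched as follows. The list contents are hypothetical placeholders, since the original figure is not reproduced here; the optional columns argument is shown as one way to give the column a name.

```python
import pandas as pd

# A plain Python list of string values (made-up sample data).
fruits = ["apple", "banana", "cherry"]

# Pass the list to the DataFrame constructor, called through the pd alias.
df = pd.DataFrame(fruits)
print(df)
print(df.columns.tolist())  # [0], because no column name was provided

# Optionally, name the column at creation time.
named = pd.DataFrame(fruits, columns=["fruit"])
print(named)
```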
Create Pandas DataFrame from Dictionary
Next, we are going to define a dictionary in Python that contains multiple lists. Each list is stored under a key, and that key becomes the name of the corresponding column.
We have created a dictionary object and provided the lists of students, subjects, and marks within it. These names get converted into column names. The values in the lists get converted into rows in the DataFrame.
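A minimal sketch of this approach is shown below. The keys students, subjects, and marks follow the description above, while the values themselves are invented for illustration.

```python
import pandas as pd

# Each key becomes a column name; each list supplies that column's values.
# All lists must have the same length.
data = {
    "students": ["Ann", "Bob", "Cara"],
    "subjects": ["Math", "Physics", "Chemistry"],
    "marks": [88, 92, 79],
}

df = pd.DataFrame(data)
print(df)
```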
Additionally, Pandas allows you to select a subset of the data in a DataFrame. For example, the DataFrame has three columns, but you may choose to display only two of them. This is useful when you have a large DataFrame but need only a part of it for data analysis.
As you can see in the figure above, we have specified the column names within double square brackets. The inner brackets form a list of column names, and the outer brackets tell Pandas to select the data from those two columns and display it.
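Column selection with double square brackets might look like the sketch below. The sample values are hypothetical; only the column names follow the example described earlier.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "students": ["Ann", "Bob", "Cara"],
        "subjects": ["Math", "Physics", "Chemistry"],
        "marks": [88, 92, 79],
    }
)

# The inner brackets are a Python list of column names;
# the outer brackets index the DataFrame with that list.
subset = df[["students", "marks"]]
print(subset)
```

Note that df[["students", "marks"]] returns a new DataFrame, whereas df["marks"] with single brackets returns a single column as a Series.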
Export Pandas DataFrame to CSV
One of the most common use cases with Python DataFrames is importing and exporting data to and from files (mostly CSV and TXT).
Once you have the data in your DataFrame, you can easily export it to a CSV file. Alternatively, if you have data in a CSV file, you can read it and create a Pandas DataFrame from it.
We have exported data using the to_csv method provided by the Pandas library. You call it on the DataFrame object and provide the filename as the first argument. The optional index keyword argument controls whether the row indices are written to the CSV file; it defaults to True.
Similarly, using the read_csv method, you can read data from a CSV file and store it in a DataFrame.
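The export and import round trip can be sketched as follows. The file name students.csv and the sample values are assumptions made for this example; the file is written to a temporary directory so that the sketch does not leave anything in your working folder.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {
        "students": ["Ann", "Bob", "Cara"],
        "marks": [88, 92, 79],
    }
)

# Hypothetical output location inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "students.csv")

# index=False leaves the row indices out of the file.
df.to_csv(path, index=False)

# Read the file back into a new DataFrame.
loaded = pd.read_csv(path)
print(loaded)
```

If you export with the default index=True, the row indices are written as an extra unnamed column, which then appears as a regular column when the file is read back unless you pass index_col=0 to read_csv.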
You can find the entire notebook on GitHub.
Python is an extremely popular general-purpose programming language that is also widely used for Data Science and Engineering. It is one of the most common programming languages for working with data.
The DataFrame, a specialized data structure provided by the Pandas library, helps programmers work with tabular data in Python. You can apply transformations such as JOINs and filters, and you can also combine two or more DataFrames vertically. All these operations are performed with the Pandas library.
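As a brief illustration of the operations mentioned above, the sketch below joins two DataFrames, filters rows, and stacks DataFrames vertically. The data and the choice of an inner join are assumptions made for the example, not something taken from the article's notebook.

```python
import pandas as pd

# Made-up sample data.
marks = pd.DataFrame({"student": ["Ann", "Bob", "Cara"], "marks": [88, 92, 79]})
cities = pd.DataFrame({"student": ["Ann", "Bob", "Dan"], "city": ["Oslo", "Lima", "Pune"]})

# JOIN: keep only the students that appear in both DataFrames.
joined = marks.merge(cities, on="student", how="inner")

# Filter: keep only the rows where the condition holds.
high = marks[marks["marks"] >= 85]

# Vertical combination: append the rows of one DataFrame below another.
more = pd.DataFrame({"student": ["Eve"], "marks": [95]})
stacked = pd.concat([marks, more], ignore_index=True)

print(joined)
print(high)
print(stacked)
```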