An Overview of DataFrames in Python

Python is a multi-purpose programming language created by Guido van Rossum, who began work on it in the late 1980s. Today, it is used worldwide across many domains of software engineering, including web development, application development, IoT, Data Science, Machine Learning, and Artificial Intelligence. One reason for its popularity is how simple Python code is to write.

Python also has a huge community of developers who create libraries, maintain projects, and more. It is one of the most active programming communities, with extensive online documentation. In the 2020 Stack Overflow Developer Survey, Python topped the list of the most wanted programming languages.

Figure 1 – Most wanted Programming Language; Survey by Stack Overflow (Source)

Being a general-purpose language, Python is also extensively used across the Data Science and Machine Learning domains alongside R (a statistical language). It has the capabilities to work with various types of data structures and solve data-related problems with the help of community-built libraries.

One way of handling data in Python is with DataFrames. In this article, we will look at what DataFrames are, how to create them, and how to use them. Let’s get started.

Understanding the Basic DataFrame Structure

A DataFrame in Python is a two-dimensional data structure that can store heterogeneous data, meaning that different columns can hold different data types.

You can consider a DataFrame similar to an Excel spreadsheet or a table in an SQL database. It is a basic data form that contains rows, columns, and data values.

The figure below presents an example of a simple DataFrame.

On the left are the rows, indexed from 0 to 4. Along the top are the column names. Each cell holds the value for a particular row in a particular column.

Figure 2 – Example of a Pandas DataFrame (Source)

With DataFrames, you can query data by rows or columns. You can also select a subset of the DataFrame by using functions.

Now that we have some idea of what a DataFrame in Python looks like, let us see how we can create one.

Installing the Pandas Library

To work with our DataFrames in Python, we need to use a library called Pandas. It is a popular data processing library favored by millions of users. As per its official definition, “Pandas is a fast, powerful, flexible, and easy to use open-source data analysis and manipulation tool, built on top of the Python programming language.”

It is written mainly in Python, with performance-critical parts in Cython and C, and it is open source and free for everyone to use. You can either download the library manually or install it with pip, the Python package manager. The following command will install it on your machine:

pip install pandas

When you run this command, it installs Pandas along with the dependencies it requires, such as NumPy.
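One quick way to confirm that the installation worked is to import the library and print its version. This is a minimal check that is not part of the original walkthrough, and the version numbers printed will depend on your environment.

```python
import numpy as np
import pandas as pd

# If both imports succeed, Pandas and its NumPy dependency are installed.
print("pandas version:", pd.__version__)
print("numpy version:", np.__version__)
```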

Figure 3 – Installing Pandas on the local machine

After installing the library, you can start using it. For demonstration purposes, I am going to use Jupyter Notebooks within Visual Studio. Setting up a Python notebook is outside this article’s scope, but you can follow this tutorial and configure Jupyter Notebooks on your machine.

Creating a Basic DataFrame in Python

There are multiple ways to create a DataFrame. Pandas provides a default DataFrame constructor that can create a DataFrame from a list of items. Let us see how to do this in practice.

Figure 4 – Creating a Pandas DataFrame from list

As you can see in the figure above, we have imported the Pandas library and provided an alias ‘pd’ to it. This is usually done so that the library can be referred to by the alias in the code blocks.

In the next step, we create a list in Python that stores some string values.

Finally, we use the DataFrame constructor from the Pandas library and pass the list of values. This converts the list into a DataFrame and prints out the result.

Notice that the column name in the DataFrame is 0 because we have not provided any name yet. Also, pay attention to using the alias to call the Pandas constructor.
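The steps described above can be sketched roughly as follows. The list values and the column name `fruit` are hypothetical placeholders, since the original figure's values are not reproduced here. The second DataFrame shows one way to name the column through the constructor's `columns` argument.

```python
import pandas as pd

# Hypothetical sample values; the figure's actual list is not reproduced here.
fruits = ["apple", "banana", "cherry"]

# Passing a flat list produces a single column, labelled 0 by default.
df = pd.DataFrame(fruits)
print(df)

# Supplying `columns` gives the column an explicit name instead.
named_df = pd.DataFrame(fruits, columns=["fruit"])
print(named_df)
```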

Creating a DataFrame from a Dictionary

We are going to define a dictionary in Python that contains multiple lists. Each list is stored under its own key, and that key becomes the column name in the DataFrame.

Figure 5 – Creating a DataFrame from a dictionary object in Python

We have created a dictionary object and provided the lists of students, subjects, and marks within it. These names get converted into column names. The values in the lists get converted into rows in the DataFrame.
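A minimal sketch of this approach might look like the following. The student names, subjects, and marks are made-up sample data, not the values from the original figure.

```python
import pandas as pd

# Hypothetical sample data: each key becomes a column name,
# and each list supplies that column's values, one per row.
data = {
    "student": ["Ann", "Ben", "Cara"],
    "subject": ["Math", "Physics", "Chemistry"],
    "marks": [88, 92, 79],
}

df = pd.DataFrame(data)
print(df)
```

Note that all lists must have the same length; otherwise the constructor raises an error.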

Additionally, Pandas allows you to select a subset of the data in a DataFrame. For example, the DataFrame has three columns, but you may choose to display only two of them. This is useful when you have a large DataFrame but need only part of it for your analysis.

Figure 6 – Selecting specified columns from the dataset

As you can see in the figure above, we have specified the column names within double square brackets. This tells Pandas to select the data from those two columns and display it.
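Column selection of this kind could be sketched as below, again using hypothetical sample data. The outer pair of brackets is the indexing operator and the inner pair is an ordinary Python list of column names. With a single pair of brackets and one name, Pandas returns a Series rather than a DataFrame.

```python
import pandas as pd

# Hypothetical sample data with three columns.
df = pd.DataFrame({
    "student": ["Ann", "Ben", "Cara"],
    "subject": ["Math", "Physics", "Chemistry"],
    "marks": [88, 92, 79],
})

# A list of column names inside the indexing brackets returns a DataFrame
# containing only those columns, in the order given.
subset = df[["student", "marks"]]
print(subset)

# A single column name returns a one-dimensional Series instead.
marks = df["marks"]
print(type(marks).__name__)
```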

Reading and Writing Data from the DataFrame to CSV

One of the most common use cases with Python DataFrames is importing and exporting data to and from files (mostly CSV and TXT).

Once you have the data in your DataFrame, you can easily export it to a CSV file. Conversely, if you have data in a CSV file, you can read it and create a DataFrame from it.

Figure 7 – Writing data from a DataFrame to a CSV file

We have exported the data using the to_csv method provided by the Pandas library. You call it on the DataFrame object and pass the filename as the first parameter. The second parameter, index, is optional. It controls whether the row index is written to the CSV file.
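The effect of the index parameter can be illustrated with a small sketch. The filename `students.csv` and the sample data are assumptions made for this example, and a temporary directory is used only to keep the example self-contained.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"student": ["Ann", "Ben"], "marks": [88, 92]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # "students.csv" is an illustrative filename.
    csv_path = os.path.join(tmp_dir, "students.csv")

    # index=False leaves the row index (0, 1, ...) out of the file.
    df.to_csv(csv_path, index=False)

    with open(csv_path) as f:
        csv_text = f.read()

print(csv_text)

# Without a path, to_csv returns the CSV as a string. With the default
# index=True, the row index is written as an unnamed first column.
with_index = df.to_csv()
print(with_index)
```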

Similarly, using the read_csv method, you can read data from a CSV file and store it in a DataFrame.

Figure 8 – Reading data from a CSV file
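As a rough sketch of reading CSV data, the example below parses a small in-memory CSV string. The CSV content is made up for illustration, and `io.StringIO` stands in for a real file so that the example runs without any file on disk. In practice you would pass a file path to read_csv instead.

```python
import io

import pandas as pd

# Hypothetical CSV content; the first line is used as the column names.
csv_text = "student,subject,marks\nAnn,Math,88\nBen,Physics,92\n"

# read_csv accepts a file path or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# Pandas infers a type for each column, so marks is read as numbers.
print(df.dtypes)
```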

You can find the entire notebook on GitHub.

Conclusion

Python is an extremely popular general-purpose programming language that is also widely used for Data Science and Engineering. It is one of the most common programming languages for working with data.

The DataFrame, a data structure provided by the Pandas library, helps programmers work with tabular data in Python. You can apply transformations such as joins and filters, and you can combine two or more DataFrames vertically. All these operations on DataFrames are performed with the Pandas library in Python.
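The operations mentioned here are not covered in detail in this article, but a brief sketch may help show what they look like. The tables, the column names, and the threshold of 80 are all hypothetical. The sketch uses `merge` for the join, a boolean mask for the filter, and `pd.concat` for the vertical combination.

```python
import pandas as pd

# Two hypothetical tables that share a student_id column.
students = pd.DataFrame({"student_id": [1, 2, 3],
                         "student": ["Ann", "Ben", "Cara"]})
marks = pd.DataFrame({"student_id": [1, 2, 3],
                      "marks": [88, 92, 79]})

# Join: match rows from both tables on student_id.
joined = students.merge(marks, on="student_id", how="inner")

# Filter: keep only the rows where marks are above 80.
high_scores = joined[joined["marks"] > 80]

# Vertical combination: stack another DataFrame with the same columns
# underneath, renumbering the row index from 0.
more = pd.DataFrame({"student_id": [4], "student": ["Dev"], "marks": [85]})
combined = pd.concat([joined, more], ignore_index=True)

print(joined)
print(high_scores)
print(combined)
```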

To learn more about DataFrames and Pandas in general, refer to the official documentation from Pandas.

Aveek Das

Aveek is an experienced Data and Analytics Engineer, currently working in Dublin, Ireland. His main areas of technical interest include SQL Server, SSIS/ETL, SSAS, Python, Big Data tools like Apache Spark and Kafka, and cloud technologies such as AWS/Amazon and Azure. He is a prolific author, with over 100 articles published on various technical blogs, including his own blog, and a frequent contributor to different technical forums. In his leisure time, he enjoys amateur photography, mostly street imagery and still life. Some glimpses of his work can be found on Instagram. You can also find him on LinkedIn.