The Pandas DataFrame is a Python data structure that can be used to create and manipulate tables. We explain how a DataFrame is structured and introduce its most important methods and properties.

How does a Pandas DataFrame work?

Pandas DataFrames are the core of the Python Pandas library and enable efficient and flexible data analysis in Python. A Pandas DataFrame is a two-dimensional tabular data structure with numbered rows and labelled columns. This structure allows data to be organised in an easily understandable and manipulable form, similar to spreadsheet programs such as Excel or LibreOffice. Each column in a DataFrame can contain a different data type, which means that a DataFrame can store heterogeneous data, for example numeric values, strings and booleans in a single table.
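
As a minimal sketch of this heterogeneity, the following snippet builds a small table whose columns hold strings, integers and booleans. The sample values and the column names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: each column holds a different kind of value
df = pd.DataFrame({
    "Name": ["Ahmed", "Beatrice", "Candice"],  # strings
    "Age": [34, 30, 55],                       # integers
    "Employed": [True, False, True],           # booleans
})

# Every column keeps its own data type; collect them as plain strings
dtypes = {column: str(dtype) for column, dtype in df.dtypes.items()}
print(dtypes)
```

The exact type names printed depend on the installed Pandas version, but each of the three columns reports its own type.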

Tip

Pandas DataFrames are based on NumPy arrays, which enables efficient handling of data and calculation of values. However, Pandas DataFrames differ from NumPy data structures in some respects: a DataFrame can hold a different data type in each column and always has two dimensions, whereas a NumPy array holds a single data type and can have any number of dimensions. For this reason, NumPy data structures are suitable for manipulating huge quantities of numerical values, while Pandas data structures are more suitable for general data manipulation.

Structure of Pandas DataFrames

A DataFrame has three main components: the data, the row index, and the column names. The row index (or simply index) identifies each row. By default, rows are indexed with integers, but these can be replaced with other labels such as strings. It’s important to note that the default index is zero-based, meaning it starts at 0.
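
The three components can be inspected separately. The following sketch assumes a small, made-up table and uses string labels for the index; the second DataFrame shows the default zero-based index.

```python
import pandas as pd

# Made-up data with string labels instead of the default numeric index
df = pd.DataFrame(
    {"Age": [34, 30], "Income": [75000.0, 60000.5]},
    index=["arthur", "bruno"],
)

print(df.index.tolist())    # row index: ['arthur', 'bruno']
print(df.columns.tolist())  # column names: ['Age', 'Income']
print(df.to_numpy())        # the data as a NumPy array

# Without an explicit index, rows are numbered starting at 0
default_df = pd.DataFrame({"Age": [34, 30]})
print(default_df.index.tolist())  # [0, 1]
```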

Image: The structure of a Pandas DataFrame
Pandas DataFrames have a tabular structure and are therefore very similar to Excel or SQL tables.

Note

While Pandas DataFrames are among the most popular and useful Python data structures, they are not part of the base language and must be imported separately. This is done using the line import pandas or from pandas import DataFrame at the beginning of your file. Alternatively, you can use import pandas as pd if you want to reference the module with a shorter name (in this case ‘pd’).

Use of Pandas DataFrames

Pandas DataFrames provide various techniques and methods for efficient data processing, analysis, and visualisation. Below, you’ll learn about key concepts and methods for data manipulation using Pandas DataFrames.

How to create a Pandas DataFrame

If you have already saved your desired data in a Python list or Python dictionary, you can easily create a DataFrame from it. Simply pass the existing data structure to the DataFrame constructor using pandas.DataFrame(data). How Pandas interprets your data will depend on the structure you provide. For example, you can create a Pandas DataFrame from a Python list as follows:

import pandas
names = ["Ahmed", "Beatrice", "Candice", "Donovan", "Elisabeth", "Frank"]
df = pandas.DataFrame(names)
print(df)
# Output:
#            0
# 0      Ahmed
# 1   Beatrice
# 2    Candice
# 3    Donovan
# 4  Elisabeth
# 5      Frank
python

As you can see in the example above, a simple list only produces a DataFrame with a single column that has no meaningful name (it is just labelled 0). For this reason, it is recommended to create DataFrames from dictionaries that contain lists. The keys are interpreted as column names and the lists as the associated data. The following example serves to illustrate this:

import pandas
data = {
    'Name': ['Arthur', 'Bruno', 'Christoph'],
    'Age': [34, 30, 55],
    'Income': [75000.0, 60000.5, 90000.3],
}
df = pandas.DataFrame(data)
print(df)
# Output:
#         Name  Age   Income
# 0     Arthur   34  75000.0
# 1      Bruno   30  60000.5
# 2  Christoph   55  90000.3
python

Using this method, the DataFrame immediately has the desired format and the desired headings. However, if you don’t want to rely on the built-in Python data structures, you can also load your data from an external source, such as a CSV file or an SQL database. Simply call the appropriate Pandas function:

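import pandas
from sqlalchemy import create_engine
# DataFrame from a CSV file:
csv = pandas.read_csv("csv-data/files.csv")
# DataFrame from an SQL query (credentials and database name are placeholders):
engine = create_engine('postgresql://username:password@localhost:5432/mydatabase')
# "table" is a reserved word in SQL, so the table name is quoted
sql = pandas.read_sql_query('SELECT * FROM "table"', engine)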
python

The DataFrames csv and sql in the above example now contain all the data from the file files.csv and the SQL table table. When creating a DataFrame from an external source, you can specify additional details, for example whether the numerical indices should be included in the DataFrame or not. Find out more about the additional arguments of the two functions on the official Pandas DataFrame documentation page.
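
One such additional argument is index_col of read_csv(). The following sketch reads the same CSV content twice, once with the default numeric index and once with a column from the file as the index. To stay self-contained, it uses an in-memory string with made-up contents instead of a real file.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the contents are made up for illustration
csv_text = "id,Name,Age\n10,Arthur,34\n20,Bruno,30\n"

# Default: Pandas adds its own numeric index (0, 1, ...)
default_index = pd.read_csv(io.StringIO(csv_text))

# index_col: use the "id" column from the file as the row index instead
id_index = pd.read_csv(io.StringIO(csv_text), index_col="id")

print(default_index.index.tolist())  # [0, 1]
print(id_index.index.tolist())       # [10, 20]
```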

Tip

To create a Pandas DataFrame from an SQL table, you must use Pandas in conjunction with a Python SQL module such as SQLAlchemy. Establish a connection to the database using your chosen SQL module and pass it to read_sql_query().
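
If no database server is at hand, the same workflow can be tried with Python’s built-in sqlite3 module, whose connections read_sql_query() also accepts. This is only a sketch: it swaps the SQLAlchemy engine for an in-memory SQLite database, and the table name people and its contents are made up.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database as a stand-in for a real database server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Arthur", 34), ("Bruno", 30)],
)
conn.commit()

# Pass the open connection to read_sql_query() together with the query
sql_df = pd.read_sql_query("SELECT name, age FROM people ORDER BY age", conn)
print(sql_df)

conn.close()
```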

How to display data in Pandas DataFrames

With Pandas DataFrames, you can display not only the entire table but also individual rows and columns. The following example illustrates how to display individual or multiple rows and columns:

# Print the row with index label 0
print(df.loc[0])
# Print the rows with labels 3 to 6 (loc includes both endpoints)
print(df.loc[3:6])
# Print the rows with labels 3 and 6
print(df.loc[[3, 6]])
# Print the "Occupation" column
print(df["Occupation"])
# Print the "Occupation" and "Age" columns
print(df[["Occupation", "Age"]])
# Select multiple rows and columns at once
print(df.loc[[3, 6], ['Occupation', 'Age']])
python

In the example, a column is referenced by putting its name in square brackets, similar to how you access values in Python dictionaries. In contrast, the loc attribute is used to reference rows by their index label. With loc you can also apply logical conditions to filter data. The following code block demonstrates how to output only the rows where the value for ‘Age’ is greater than 30:

print(df.loc[df['Age'] > 30])
python
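
The filter above can be tried on a small table as follows. The data is made up, and the combined condition with & is an addition that goes slightly beyond the one-line example: each condition needs its own parentheses.

```python
import pandas as pd

# Made-up sample data
df = pd.DataFrame({
    "Name": ["Arthur", "Bruno", "Christoph"],
    "Age": [34, 30, 55],
    "Occupation": ["Baker", "Pilot", "Nurse"],
})

# Rows where Age is greater than 30
older = df.loc[df["Age"] > 30]

# Combine conditions with & (and) or | (or)
older_not_nurse = df.loc[(df["Age"] > 30) & (df["Occupation"] != "Nurse")]

# Filter rows and select columns in one step
names_and_ages = df.loc[df["Age"] > 30, ["Name", "Age"]]

print(older)
print(older_not_nurse)
print(names_and_ages)
```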

However, you can also use the iloc attribute to select rows and columns based on their position in the DataFrame. Positions start at 0, so iloc[3, 4] refers to the cell in the fourth row and the fifth column:

print(df.iloc[3, 4]) 
# Output: 
# London
 
print(df.iloc[[3, 4, 6], 4]) 
# Output: 
# 3 London
# 4 Birmingham
# 6 Preston
python
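
The difference between loc and iloc becomes clearer when the index consists of strings rather than numbers. The following sketch assumes a made-up table with the index labels a, b and c.

```python
import pandas as pd

# Made-up data with string labels as the row index
df = pd.DataFrame(
    {"Age": [34, 30, 55], "City": ["London", "Birmingham", "Preston"]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "City"]  # label-based: row "b", column "City"
by_position = df.iloc[1, 1]     # position-based: second row, second column
print(by_label, by_position)    # both refer to the same cell

# Slices behave differently: loc includes the end label,
# iloc excludes the end position
loc_slice = df.loc["a":"b"]   # rows "a" and "b"
iloc_slice = df.iloc[0:1]     # only the first row
print(len(loc_slice), len(iloc_slice))
```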

How to iterate over rows with Pandas DataFrames

When processing data in Python, it’s often necessary to iterate over the rows of a Pandas DataFrame to apply the same operation to all data. Pandas provides two methods for this purpose: itertuples() and iterrows(). Each method has its own advantages and disadvantages concerning performance and user-friendliness.

The iterrows() method returns a tuple of index and Series for each row in the DataFrame. A Series is a one-dimensional, labelled Pandas data structure. It is similar to a Python list but is built on NumPy arrays, which makes numerical operations faster. You can access individual elements in the Series using the column name, which simplifies data handling.

While Pandas Series are more efficient than Python lists for calculations, creating a new Series for every row comes with some performance overhead. Therefore, the itertuples() method is particularly recommended for very large DataFrames. In contrast to iterrows(), itertuples() returns each row, including its index, as a named tuple, which is cheaper to create than a Series. With these tuples, you can access individual elements using dot notation, similar to accessing attributes of an object.

Another important difference between Series and tuples is that tuples are not mutable. So if you want to iterate over a DataFrame using itertuples() and change values, you have to reference the DataFrame with the at attribute and the index of the tuple. This attribute works very similarly to loc but accesses a single cell. The following example serves to illustrate the differences between iterrows() and itertuples():

import pandas
df = pandas.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Income': [70000.0, 80000.5, 90000.3]
})
for index, row in df.iterrows():
    row['Income'] += 1000  # Changes only the row copy, not the DataFrame
    print(f"Index: {index}, Age: {row['Age']}, Income: {row['Income']}")
for tup in df.itertuples():
    df.at[tup.Index, 'Income'] += 1000  # Change value directly in the DataFrame using at[]
    print(f"Index: {tup.Index}, Age: {tup.Age}, Income: {df.loc[tup.Index, 'Income']}")
# Both loops print the same output, but only the second one changes df
python
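
To see what each method actually yields, the following sketch looks only at the first row of a small, made-up table. It also checks that changing the Series from iterrows() leaves the DataFrame untouched and that a tuple from itertuples() rejects assignments.

```python
import pandas as pd

# Made-up sample data
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# iterrows() yields (index, Series) pairs
index, row = next(df.iterrows())
print(type(row).__name__)   # Series
row["Age"] = 99             # changes only this Series, not the DataFrame
print(df.at[0, "Age"])      # still 25

# itertuples() yields named tuples with an Index field
tup = next(df.itertuples())
print(tup.Index, tup.Name, tup.Age)

# Named tuples are immutable, so assigning to a field fails
rejected = False
try:
    tup.Age = 99
except AttributeError:
    rejected = True
print(rejected)  # True
```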