How to clean data in pandas with dropna()
The Python pandas DataFrame.dropna() function removes rows or columns that contain missing values (NaN) from a DataFrame. This makes it an essential tool for preparing and cleaning data.
What is the syntax for pandas dropna()?
The dropna() function accepts up to six parameters. Here’s its syntax:
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False, ignore_index=False)
Important parameters for dropna()
You can use parameters to influence the behaviour of the pandas DataFrame.dropna() function. Here’s an overview of the most important ones:
Parameter | Description | Default Value |
---|---|---|
axis | Determines whether rows (0 or 'index') or columns (1 or 'columns') are removed | 0 |
how | Specifies whether a row or column is removed as soon as one of its values is NaN ('any') or only when all of its values are NaN ('all') | 'any' |
thresh | Specifies the minimum number of non-NaN values a row or column must have to avoid being removed; cannot be combined with how | None (optional) |
subset | Specifies which labels on the other axis are checked for missing values; when removing rows, these are the columns to consider | None (optional) |
inplace | Determines whether the operation is performed on the original DataFrame instead of returning a new one | False |
ignore_index | If True, the remaining axis is labelled from 0 to n-1 | False |
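The how and inplace parameters from the table can be illustrated with a minimal sketch. The small DataFrame below is hypothetical sample data, chosen so that one row is only partly missing and another is entirely missing:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: row 1 is partly NaN, row 2 is entirely NaN
df = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [4, np.nan, np.nan],
})

# how='any' (the default) drops every row with at least one NaN
any_rows = df.dropna(how='any')

# how='all' drops only the rows in which every value is NaN
all_rows = df.dropna(how='all')

# inplace=True modifies df itself and returns None
result = df.dropna(how='all', inplace=True)

print(any_rows.index.tolist())  # [0]
print(all_rows.index.tolist())  # [0, 1]
print(result, df.shape)         # None (2, 2)
```

Because inplace=True returns None, assigning the result back to a variable (as done here purely for demonstration) would discard the DataFrame; the cleaned data lives in df itself.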
How to use pandas DataFrame.dropna()
Pandas dropna() is used to clean data before it’s analysed. The removal of rows or columns with missing values helps to prevent biases in statistical evaluations. Since missing values can also lead to problems with data visualisation, using the function is also advantageous when creating charts and reports.
Removing rows with missing values
In the following example, we’ll take a look at a DataFrame containing NaN values:
import pandas as pd
import numpy as np
# Creating a DataFrame with sample data
data = {
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': [9, 10, 11, 12]
}
df = pd.DataFrame(data)
print(df)
The DataFrame looks like this:
     A    B   C
0  1.0  5.0   9
1  2.0  NaN  10
2  NaN  NaN  11
3  4.0  8.0  12
Next, we’re going to apply the pandas dropna() function:
## Remove all rows that contain at least one NaN value
df_cleaned = df.dropna()
print(df_cleaned)
Running the code above produces the following result:
     A    B   C
0  1.0  5.0   9
3  4.0  8.0  12
Since rows 1 and 2 each contain at least one NaN value, only rows 0 and 3 remain. Note that the remaining rows keep their original index labels.
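If a gap-free index is preferred after rows have been removed, the ignore_index parameter from the table above can be used. The following is a minimal sketch that reuses the sample data and assumes pandas 2.0 or later, since older versions may not accept this parameter:

```python
import numpy as np
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12],
})

# Default behaviour: the surviving rows keep their original labels
kept_labels = df.dropna()

# ignore_index=True relabels the surviving rows from 0 to n-1
# (assumption: pandas 2.0 or later)
relabelled = df.dropna(ignore_index=True)

print(kept_labels.index.tolist())  # [0, 3]
print(relabelled.index.tolist())   # [0, 1]
```

On older pandas versions, chaining df.dropna().reset_index(drop=True) should give the same relabelled result.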
Removing columns with missing values
Similarly, you can remove columns with missing values by setting the axis parameter to 1:
## Remove all columns that contain at least one NaN value
df_cleaned_columns = df.dropna(axis=1)
print(df_cleaned_columns)
Column C is the only column that remains, since it’s the only one that doesn’t contain NaN values:
    C
0   9
1  10
2  11
3  12
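As the parameter table notes, axis also accepts the string aliases 'index' and 'columns'. A short sketch, reusing the same sample data, suggests that both spellings lead to the same result:

```python
import numpy as np
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12],
})

# axis=1 and axis='columns' are two spellings of the same option
by_number = df.dropna(axis=1)
by_name = df.dropna(axis='columns')

print(by_number.equals(by_name))  # True
print(by_name.columns.tolist())   # ['C']
```

The string form can make the intent easier to read in longer scripts, although which spelling to use is largely a matter of taste.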
Using thresh
If you want to remove rows that contain fewer than two non-NaN values, you can use the thresh parameter:
## Only keeps rows that have 2 or more non-NaN values
df_thresh = df.dropna(thresh=2)
print(df_thresh)
Running the code produces the following output:
     A    B   C
0  1.0  5.0   9
1  2.0  NaN  10
3  4.0  8.0  12
Row 1 is not removed from the output because it contains 2 non-NaN values (2.0 and 10). Row 2 is removed because it contains only one non-NaN value (11).
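The thresh parameter works on columns as well, and the threshold can be derived from the size of the DataFrame rather than hard-coded. The sketch below assumes an illustrative rule, keeping only columns that are at least 75% populated; the 75% figure is an arbitrary choice for demonstration:

```python
import math

import numpy as np
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12],
})

# Assumed rule: a column must be at least 75% populated to be kept
min_non_nan = math.ceil(0.75 * len(df))  # 3 of the 4 rows

# Column B has only 2 non-NaN values and is therefore removed
df_cols = df.dropna(axis=1, thresh=min_non_nan)

print(min_non_nan)               # 3
print(df_cols.columns.tolist())  # ['A', 'C']
```

Column A survives even though it contains a NaN value, because its three non-NaN values meet the threshold.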
Using subset
The subset parameter allows you to specify the columns in which dropna() should look for missing values. Only rows that contain missing values in the specified columns will be removed; NaN values in other columns are ignored.
## Removes all rows where column A contains a NaN value
df_subset = df.dropna(subset=['A'])
print(df_subset)
Here, only row 2 is removed. The NaN value in row 1 is ignored because it’s in column B, and the subset parameter only takes column A into consideration:
     A    B   C
0  1.0  5.0   9
1  2.0  NaN  10
3  4.0  8.0  12
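The subset parameter can also list several columns and be combined with how. As a minimal sketch using the same sample data, the following compares removing rows where any of columns A and B is missing with removing rows only where both are missing:

```python
import numpy as np
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12],
})

# Default how='any': drop rows with a NaN in column A or column B
df_any = df.dropna(subset=['A', 'B'])

# how='all': drop rows only if both A and B are NaN
df_all = df.dropna(subset=['A', 'B'], how='all')

print(df_any.index.tolist())  # [0, 3]
print(df_all.index.tolist())  # [0, 1, 3]
```

With how='all', row 1 is kept because column A still holds a value, whereas row 2 is removed because both of the listed columns are missing.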