What is the pandas DataFrame describe() method?
The pandas function DataFrame.describe() generates a statistical summary of the numerical columns in a DataFrame. This summary includes key statistical metrics such as the mean, standard deviation, minimum, maximum and selected percentiles.
What is the syntax for pandas’ describe() function?
The basic syntax of describe() for DataFrames is simple. It looks like this:
DataFrame.describe(percentiles=None, include=None, exclude=None)
Important parameters for pandas’ DataFrame.describe()
Using the following parameters, you can adjust the output of describe():
Parameter | Description | Default value |
---|---|---|
percentiles | Lists the percentiles to include in the summary, given as fractions between 0 and 1 | [.25, .5, .75] |
include | Specifies which data types to include in the description; possible values are 'all', None, or a list of data types such as numpy.number or object | None |
exclude | Specifies which data types to exclude from the description; accepts a list of data types, like the include parameter | None |

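To make the include and exclude parameters more concrete, here is a minimal sketch. The DataFrame and its column names are hypothetical sample data chosen for illustration, not part of the examples in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one text column and two numeric columns
df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'C', 'A'],
    'Quantity': [10, 20, 15, 5, 30],
    'Price': [100, 150, 200, 80, 120],
})

# include='all' summarises every column, regardless of data type
all_summary = df.describe(include='all')

# include=[np.number] restricts the summary to numeric columns
numeric_summary = df.describe(include=[np.number])

# exclude=[np.number] drops the numeric columns, leaving only the text column
text_summary = df.describe(exclude=[np.number])

print(all_summary)
print(numeric_summary)
print(text_summary)
```

For non-numeric columns, the summary reports count, unique, top (the most frequent value) and freq (how often it occurs) instead of the mean and percentiles.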
Statistical percentiles are values that divide a sorted dataset into equal parts, showing what percentage of data points fall below a specific threshold. These include metrics like the median (50th percentile), the 25th percentile and the 75th percentile. This information helps to provide a clearer picture of data distribution.
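As a rough illustration of how percentiles relate to the data, the sketch below computes a few of them directly with Series.quantile() and compares them with the rows that describe() reports. The sample values are an assumption made for this illustration.

```python
import pandas as pd

# Hypothetical sample values, already sorted for readability
quantity = pd.Series([5, 10, 15, 20, 30])

# Series.quantile() takes a fraction between 0 and 1
q25 = quantity.quantile(0.25)
median = quantity.quantile(0.5)
q75 = quantity.quantile(0.75)

# A percentile that falls between two data points is interpolated
# linearly by default
p10 = quantity.quantile(0.1)

# describe() reports the same values under the labels '25%', '50%' and '75%'
summary = quantity.describe()

print(q25, median, q75, p10)
print(summary[['25%', '50%', '75%']])
```

With only five values, the 10th percentile does not coincide with a data point, so pandas returns an interpolated value between 5 and 10.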
Examples of how to use pandas describe()
If you need a quick overview of the key statistical metrics of a dataset, the pandas DataFrame.describe() function is extremely useful.
Example 1: Statistical summary of numerical data
In the following example, we take a look at the DataFrame df, which contains different types of sales data.
import pandas as pd
import numpy as np
# Example DataFrame with sales data
data = {
    'Product': ['A', 'B', 'C', 'D', 'E'],
    'Quantity': [10, 20, 15, 5, 30],
    'Price': [100, 150, 200, 80, 120],
    'Revenue': [1000, 3000, 3000, 400, 3600]
}
df = pd.DataFrame(data)
print(df)
Now, you can use pandas describe() to get a statistical summary of the numerical data in the columns:
summary = df.describe()
print(summary)
The output of the pandas DataFrame.describe() function is as follows:
Quantity Price Revenue
count 5.000000 5.000000 5.000000
mean 16.000000 130.000000 2200.000000
std 9.617692 46.904158 1407.124728
min 5.000000 80.000000 400.000000
25% 10.000000 100.000000 1000.000000
50% 15.000000 120.000000 3000.000000
75% 20.000000 150.000000 3000.000000
max 30.000000 200.000000 3600.000000
The key metrics shown in the output are:
- count: Number of non-NaN (Not a Number) entries
- mean: Average of the values (also accessible via DataFrame.mean())
- std: Standard deviation of the values
- min, 25%, 50%, 75%, max: Minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum values
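Each of these metrics can also be computed on its own. The following sketch reuses the Quantity values from Example 1 and checks the individual results against the describe() summary. Note that pandas computes the sample standard deviation (ddof=1) by default, which is why std is 9.617692 here.

```python
import pandas as pd

# Same values as the Quantity column in Example 1
quantity = pd.Series([10, 20, 15, 5, 30], name='Quantity')

summary = quantity.describe()

# The same metrics, computed individually
count = quantity.count()   # number of non-NaN entries
mean = quantity.mean()     # arithmetic mean
std = quantity.std()       # sample standard deviation (ddof=1 by default)
minimum = quantity.min()
maximum = quantity.max()

print(summary)
print(count, mean, std, minimum, maximum)
```

Calling the individual methods is useful when you only need one or two values and do not want to build the full summary.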
Example 2: Customising percentiles
You can customise the percentiles in the pandas DataFrame.describe() output with the percentiles parameter:
# Statistical summary with custom percentiles
custom_summary = df.describe(percentiles=[0.1, 0.5, 0.9])
print(custom_summary)
This function call provides the following output:
Quantity Price Revenue
count 5.000000 5.000000 5.000000
mean 16.000000 130.000000 2200.000000
std 9.617692 46.904158 1407.124728
min 5.000000 80.000000 400.000000
10% 7.000000 88.000000 640.000000
50% 15.000000 120.000000 3000.000000
90% 26.000000 180.000000 3360.000000
max 30.000000 200.000000 3600.000000
In the output, the 10%, 50% and 90% rows replace the default 25%, 50% and 75% rows shown in the previous example.
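Because describe() returns a regular DataFrame, individual values can be read from the summary with .loc. The sketch below assumes a reduced version of the sales data with only the Quantity and Revenue columns; the variable names revenue_p90 and spread are illustrative.

```python
import pandas as pd

# Reduced version of the sales data from Example 1
df = pd.DataFrame({
    'Quantity': [10, 20, 15, 5, 30],
    'Revenue': [1000, 3000, 3000, 400, 3600],
})

custom_summary = df.describe(percentiles=[0.1, 0.5, 0.9])

# Row labels are strings such as '10%' and '90%'
revenue_p90 = custom_summary.loc['90%', 'Revenue']

# Difference between the 90th and 10th percentile for each column
spread = custom_summary.loc['90%'] - custom_summary.loc['10%']

print(revenue_p90)
print(spread)
```

The difference between two percentile rows gives a quick measure of how widely the middle part of each column is spread.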