📊
Pandas Statistics Data Analysis Aggregation

Topic 6.7: Descriptive Statistics with Pandas

Extracting statistical insights from data in one line of code.

📈 Complete Statistical Summary with describe()

📂 Dataset for this topic — 6_7_sales.csv: ⬇ Download 6_7_sales.csv

In Week 3, we calculated mean, median, and standard deviation by hand. We wrote loops, sorted arrays, and computed formulas step by step. Pandas condenses all that work into a single method call.

The describe() method generates a complete statistical profile of your DataFrame. Load a CSV file with read_csv(), call describe() on the DataFrame, and Pandas returns count, mean, standard deviation, minimum, maximum, and quartiles for every numeric column. This snapshot reveals the shape of your data instantly.

Python
import pandas as pd

# Load sales data from CSV
sales = pd.read_csv('6_7_sales.csv')

# Get complete statistical summary
print(sales.describe())
▶ Output
Price Units_Sold Revenue count 13.000000 13.000000 13.000000 mean 80.038462 223.615385 3407.692308 std 217.545250 195.707919 5128.224439 min 3.500000 10.000000 350.000000 25% 5.000000 30.000000 750.000000 50% 12.000000 200.000000 2400.000000 75% 40.000000 300.000000 2700.000000 max 800.000000 600.000000 20000.000000

Each row in the describe() output answers a different question about your data:

If you see a price of negative one hundred or an age of five thousand, describe() surfaces that anomaly right away. You scan the min and max rows, spot the outlier, and investigate before running further analysis. This method guards against corrupted data poisoning your results.

📐 Understanding Quartiles and Distribution

Quartiles split your data into four equal parts. Together, the 25%, 50%, and 75% quartiles sketch the distribution without plotting a single graph.

The 25% quartile means one quarter of the values fall below this number. The 50% quartile is the median, the middle value when data is sorted. The 75% quartile means three quarters of values sit below it.

Comparing the distances between quartiles can give insight into the shape of the distribution. If the gap between the 50th and 75th percentiles is noticeably larger than the gap between the 25th and 50th percentiles, the distribution may be right-skewed (toward higher values). If the lower gap is larger, it may be left-skewed. If the gaps are roughly equal, the distribution is roughly symmetric around the median. These patterns can be seen from the summary statistics before creating any visualization.

The describe() method works only on numeric columns. Text columns like Product or Category get ignored automatically. Pandas recognizes data types and applies statistics where they make sense. If you want to include non-numeric columns, pass include='all' as a parameter.

Python
# Include all columns, even non-numeric
print(sales.describe(include='all'))
▶ Output
Product Category Price Units_Sold Revenue City count 13 13 13.000000 13.000000 13.000000 13 unique 13 3 NaN NaN NaN 3 top Pen Stationery NaN NaN NaN Cairo freq 1 8 NaN NaN NaN 5 mean NaN NaN 80.038462 223.615385 3407.692308 NaN std NaN NaN 217.545250 195.707919 5128.224439 NaN min NaN NaN 3.500000 10.000000 350.000000 NaN 25% NaN NaN 5.000000 30.000000 750.000000 NaN 50% NaN NaN 12.000000 200.000000 2400.000000 NaN 75% NaN NaN 40.000000 300.000000 2700.000000 NaN max NaN NaN 800.000000 600.000000 20000.000000 NaN
🎯 Individual Statistical Methods

Sometimes you need just one statistic, not the full describe() output. Pandas provides dedicated methods for each measure. Call mean() on the DataFrame to get the average of every numeric column as a Series. Call median() to get the middle value. Call std() to compute standard deviation.

Each method returns a Series where the index is column names and the values are the computed statistics. These methods skip text columns by default. If you want to force numeric-only processing and avoid warnings, pass numeric_only=True.

Python
# Mean of all numeric columns
print(sales.mean(numeric_only=True))

# Median of numeric columns
print(sales.median(numeric_only=True))

# Standard deviation
print(sales.std(numeric_only=True))
▶ Output
Price 80.038462 Units_Sold 223.615385 Revenue 3407.692308 dtype: float64 Price 12.0 Units_Sold 200.0 Revenue 2400.0 dtype: float64 Price 217.545250 Units_Sold 195.707919 Revenue 5128.224439 dtype: float64

Additional aggregation methods:

All these methods operate column-wise. They collapse the DataFrame vertically, aggregating each column into a single number. If you have one thousand rows and five numeric columns, mean() returns five numbers, one per column.

Python
# Minimum and maximum values
print(sales.min(numeric_only=True))
print(sales.max(numeric_only=True))

# Sum of all numeric columns
print(sales.sum(numeric_only=True))

# Count non-null values per column
print(sales.count())
▶ Output
Price 3.5 Units_Sold 10.0 Revenue 350.0 dtype: float64 Price 800.0 Units_Sold 600.0 Revenue 20000.0 dtype: float64 Price 1040.5 Units_Sold 2907.0 Revenue 44300.0 dtype: float64 Product 13 Price 13 Units_Sold 13 Revenue 13 Category 13 City 13 dtype: int64
🔍 Targeting Specific Columns

To focus on a single column, use square bracket notation first, then call the method. Write sales['Revenue'].mean(), and Pandas calculates the average revenue across all products. This approach narrows the computation to one metric, returning a scalar instead of a Series.

You can store this result in a variable and use it for comparisons or thresholds. For example, compute the mean revenue, then filter for products above that mean to identify top performers.

Python
# Total revenue across all products
total_revenue = sales['Revenue'].sum()
print(f"Total Revenue: {total_revenue}")

# Average units sold
avg_units = sales['Units_Sold'].mean()
print(f"Average Units Sold: {avg_units:.1f}")

# Maximum price
max_price = sales['Price'].max()
print(f"Highest Price: {max_price}")
▶ Output
Total Revenue: 44300 Average Units Sold: 223.6 Highest Price: 800.0

You can chain these methods with other Pandas operations. Compute the median price, then count how many products exceed it. Compute the standard deviation of units sold, then flag products more than two standard deviations above the mean as outliers. Each method integrates seamlessly into your analysis pipeline.

🏷️ Categorical Counting with value_counts()

Not all analysis involves numbers. Text columns like Category or City contain categories, and we need to count how often each category appears. The value_counts() method does this.

Call it on a column, and Pandas returns a Series listing each unique value with its frequency, sorted from most to least common. If the Category column shows Stationery appeared eight times, Electronics three times, and Bags two times, you understand product distribution at a glance.

Python
# Count occurrences of each category
print(sales['Category'].value_counts())

# Count by city
print(sales['City'].value_counts())
▶ Output
Category Stationery 8 Electronics 3 Bags 2 Name: count, dtype: int64 City Cairo 5 Alexandria 4 Giza 4 Name: count, dtype: int64

This method reveals imbalances in your data. If ninety percent of sales come from one city, you might focus marketing there or investigate why other cities lag. Value_counts() turns categorical data into actionable frequency tables without writing a single loop.

By default, value_counts() sorts results in descending order by frequency. You can change this behavior by passing sort=False to preserve the original order, or by passing ascending=True to sort from least to most common.

🔀 Group Aggregation with groupby()

The groupby() method splits a DataFrame into groups based on a column, then applies an aggregation to each group. This pattern turns scattered data into group summaries.

Write sales.groupby('Category'), and Pandas divides rows into categories. Chain a column selection like ['Price'], then call mean(). Pandas computes the average price for each category separately, returning a Series where the index is category names and the values are the means.

Python
# Average price by category
print(sales.groupby('Category')['Price'].mean())

# Total revenue by category
print(sales.groupby('Category')['Revenue'].sum())

# Average units sold by city
print(sales.groupby('City')['Units_Sold'].mean())
▶ Output
Category Bags 45.000000 Electronics 298.333333 Stationery 6.937500 Name: Price, dtype: float64 Category Bags 1000 Electronics 26900 Stationery 16400 Name: Revenue, dtype: int64 City Alexandria 190.5 Cairo 221.0 Giza 260.0 Name: Units_Sold, dtype: float64

You can group by one column and aggregate another. Group by City, then sum the Revenue column to see total earnings per location. Group by Category, then count how many products fall into each group. Groupby answers questions like: which category has the highest average price? Which city generates the most total revenue?

For more sophisticated analysis, use the agg() method to apply multiple aggregations at once. This returns a DataFrame with one row per group and one column per aggregation function.

Python
# Multiple aggregations on grouped data
print(sales.groupby('Category')['Revenue'].agg(['mean', 'sum', 'count']))
▶ Output
mean sum count Category Bags 500.000000 1000 2 Electronics 8966.666667 26900 3 Stationery 2050.000000 16400 8

This single line computes the mean revenue, total revenue, and count of products for each category simultaneously. The output is a multi-column DataFrame that gives you a complete statistical breakdown by group.

Common Groupby Patterns

Groupby operations follow a consistent pattern: split, apply, combine. Pandas splits the DataFrame into groups based on your grouping column, applies an aggregation function to each group, then combines the results into a new DataFrame or Series.

Common patterns:

You can also group by multiple columns at once. Write groupby(['Category', 'City']), and Pandas creates groups for each unique combination of category and city. This multi-level grouping shows patterns within categories.

📊 Putting It Together

Descriptive statistics give you a complete picture of your data in a few lines of code. describe() surfaces the shape of every numeric column. Individual methods like mean(), sum(), and count() let you target specific measurements. value_counts() turns text columns into frequency tables. groupby() breaks data into groups and aggregates each one.

These tools work together in a natural sequence: load your data, call describe() to get a first overview, use groupby() to compare across categories, and use value_counts() on text columns to understand distribution. Each step narrows your focus from the whole dataset toward the specific patterns that matter.

The real value shows when you combine them. Compute mean revenue per category with groupby, then find which categories fall below the overall mean. Count how many products each city sells with value_counts, then compare that against total revenue per city from groupby. Pandas makes all of these possible without writing a single loop.

  • describe() generates a complete statistical summary: count, mean, std, min, quartiles, and max.
  • Quartiles (25%, 50%, 75%) reveal data distribution without plotting graphs.
  • Individual methods like mean(), median(), std(), min(), max(), sum(), and count() compute specific statistics.
  • Target specific columns with sales['Revenue'].mean() to get a single scalar value.
  • value_counts() counts occurrences of categorical values, sorted by frequency.
  • groupby() splits data into groups, applies aggregations, and combines results.
  • Use agg() with groupby() to apply multiple aggregation functions at once.
  • Descriptive statistics reveal patterns, but filtering extracts actionable subsets.
Quick Check
?
Which Pandas method returns a frequency count of unique values in a column?
?
You want to calculate the total Revenue for each City in a DataFrame. Which Pandas operation best achieves this?
?
What does df['Price'].median() return?
📚External Resources