Topic 6.7: Descriptive Statistics with Pandas
Extracting statistical insights from data in one line of code.
📂 Dataset for this topic — 6_7_sales.csv: ⬇ Download 6_7_sales.csv
In Week 3, we calculated mean, median, and standard deviation by hand. We wrote loops, sorted arrays, and computed formulas step by step. Pandas condenses all that work into a single method call.
The describe() method generates a complete statistical profile of your DataFrame. Load a CSV file with read_csv(), call describe() on the DataFrame, and Pandas returns count, mean, standard deviation, minimum, maximum, and quartiles for every numeric column. This snapshot reveals the shape of your data instantly.
import pandas as pd # Load sales data from CSV sales = pd.read_csv('6_7_sales.csv') # Get complete statistical summary print(sales.describe())
Each row in the describe() output answers a different question about your data:
If you see a price of negative one hundred or an age of five thousand, describe() surfaces that anomaly right away. You scan the min and max rows, spot the outlier, and investigate before running further analysis. This method guards against corrupted data poisoning your results.
Quartiles split your data into four equal parts. Together, the 25%, 50%, and 75% quartiles sketch the distribution without plotting a single graph.
The 25% quartile means one quarter of the values fall below this number. The 50% quartile is the median, the middle value when data is sorted. The 75% quartile means three quarters of values sit below it.
Comparing the distances between quartiles can give insight into the shape of the distribution. If the gap between the 50th and 75th percentiles is noticeably larger than the gap between the 25th and 50th percentiles, the distribution may be right-skewed (toward higher values). If the lower gap is larger, it may be left-skewed. If the gaps are roughly equal, the distribution is roughly symmetric around the median. These patterns can be seen from the summary statistics before creating any visualization.
The describe() method works only on numeric columns. Text columns like Product or Category get ignored automatically. Pandas recognizes data types and applies statistics where they make sense. If you want to include non-numeric columns, pass include='all' as a parameter.
# Include all columns, even non-numeric print(sales.describe(include='all'))
Sometimes you need just one statistic, not the full describe() output. Pandas provides dedicated methods for each measure. Call mean() on the DataFrame to get the average of every numeric column as a Series. Call median() to get the middle value. Call std() to compute standard deviation.
Each method returns a Series where the index is column names and the values are the computed statistics. These methods skip text columns by default. If you want to force numeric-only processing and avoid warnings, pass numeric_only=True.
# Mean of all numeric columns print(sales.mean(numeric_only=True)) # Median of numeric columns print(sales.median(numeric_only=True)) # Standard deviation print(sales.std(numeric_only=True))
Additional aggregation methods:
All these methods operate column-wise. They collapse the DataFrame vertically, aggregating each column into a single number. If you have one thousand rows and five numeric columns, mean() returns five numbers, one per column.
# Minimum and maximum values print(sales.min(numeric_only=True)) print(sales.max(numeric_only=True)) # Sum of all numeric columns print(sales.sum(numeric_only=True)) # Count non-null values per column print(sales.count())
To focus on a single column, use square bracket notation first, then call the method. Write sales['Revenue'].mean(), and Pandas calculates the average revenue across all products. This approach narrows the computation to one metric, returning a scalar instead of a Series.
You can store this result in a variable and use it for comparisons or thresholds. For example, compute the mean revenue, then filter for products above that mean to identify top performers.
# Total revenue across all products total_revenue = sales['Revenue'].sum() print(f"Total Revenue: {total_revenue}") # Average units sold avg_units = sales['Units_Sold'].mean() print(f"Average Units Sold: {avg_units:.1f}") # Maximum price max_price = sales['Price'].max() print(f"Highest Price: {max_price}")
You can chain these methods with other Pandas operations. Compute the median price, then count how many products exceed it. Compute the standard deviation of units sold, then flag products more than two standard deviations above the mean as outliers. Each method integrates seamlessly into your analysis pipeline.
Not all analysis involves numbers. Text columns like Category or City contain categories, and we need to count how often each category appears. The value_counts() method does this.
Call it on a column, and Pandas returns a Series listing each unique value with its frequency, sorted from most to least common. If the Category column shows Stationery appeared eight times, Electronics three times, and Bags two times, you understand product distribution at a glance.
# Count occurrences of each category print(sales['Category'].value_counts()) # Count by city print(sales['City'].value_counts())
This method reveals imbalances in your data. If ninety percent of sales come from one city, you might focus marketing there or investigate why other cities lag. Value_counts() turns categorical data into actionable frequency tables without writing a single loop.
By default, value_counts() sorts results in descending order by frequency. You can change this behavior by passing sort=False to preserve the original order, or by passing ascending=True to sort from least to most common.
The groupby() method splits a DataFrame into groups based on a column, then applies an aggregation to each group. This pattern turns scattered data into group summaries.
Write sales.groupby('Category'), and Pandas divides rows into categories. Chain a column selection like ['Price'], then call mean(). Pandas computes the average price for each category separately, returning a Series where the index is category names and the values are the means.
# Average price by category print(sales.groupby('Category')['Price'].mean()) # Total revenue by category print(sales.groupby('Category')['Revenue'].sum()) # Average units sold by city print(sales.groupby('City')['Units_Sold'].mean())
You can group by one column and aggregate another. Group by City, then sum the Revenue column to see total earnings per location. Group by Category, then count how many products fall into each group. Groupby answers questions like: which category has the highest average price? Which city generates the most total revenue?
For more sophisticated analysis, use the agg() method to apply multiple aggregations at once. This returns a DataFrame with one row per group and one column per aggregation function.
# Multiple aggregations on grouped data print(sales.groupby('Category')['Revenue'].agg(['mean', 'sum', 'count']))
This single line computes the mean revenue, total revenue, and count of products for each category simultaneously. The output is a multi-column DataFrame that gives you a complete statistical breakdown by group.
Groupby operations follow a consistent pattern: split, apply, combine. Pandas splits the DataFrame into groups based on your grouping column, applies an aggregation function to each group, then combines the results into a new DataFrame or Series.
Common patterns:
You can also group by multiple columns at once. Write groupby(['Category', 'City']), and Pandas creates groups for each unique combination of category and city. This multi-level grouping shows patterns within categories.
Descriptive statistics give you a complete picture of your data in a few lines of code. describe() surfaces the shape of every numeric column. Individual methods like mean(), sum(), and count() let you target specific measurements. value_counts() turns text columns into frequency tables. groupby() breaks data into groups and aggregates each one.
These tools work together in a natural sequence: load your data, call describe() to get a first overview, use groupby() to compare across categories, and use value_counts() on text columns to understand distribution. Each step narrows your focus from the whole dataset toward the specific patterns that matter.
The real value shows when you combine them. Compute mean revenue per category with groupby, then find which categories fall below the overall mean. Count how many products each city sells with value_counts, then compare that against total revenue per city from groupby. Pandas makes all of these possible without writing a single loop.
- describe() generates a complete statistical summary: count, mean, std, min, quartiles, and max.
- Quartiles (25%, 50%, 75%) reveal data distribution without plotting graphs.
- Individual methods like mean(), median(), std(), min(), max(), sum(), and count() compute specific statistics.
- Target specific columns with sales['Revenue'].mean() to get a single scalar value.
- value_counts() counts occurrences of categorical values, sorted by frequency.
- groupby() splits data into groups, applies aggregations, and combines results.
- Use agg() with groupby() to apply multiple aggregation functions at once.
- Descriptive statistics reveal patterns, but filtering extracts actionable subsets.
- ↗ Pandas Documentation: Descriptive Statistics
https://pandas.pydata.org/docs/user_guide/basics.html#descriptive-statistics - ↗ Pandas Documentation: GroupBy Operations
https://pandas.pydata.org/docs/user_guide/groupby.html - ↗ Real Python: Pandas GroupBy Tutorial
https://realpython.com/pandas-groupby/ - ↗ Khan Academy: Quartiles and Interquartile Range
https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/identifying-outliers-iqr-rule - ↗ Kaggle Learn: Pandas Course (Free Interactive Lessons)
https://www.kaggle.com/learn/pandas