Bar Chart | Histogram | Scatter Plot | Chart Selection | Week 7

Topic 7.2: The Right Chart for the Story

Choosing between bar, line, scatter, and histogram based on what your data is trying to say

🗞️ The Journalist's Rule: One Question, One Chart

A professional journalist never uses the same format for every story. A national election result needs a bar chart; a stock price over a year needs a line chart; the relationship between two variables needs a scatter plot; the distribution of responses needs a histogram. Each format is designed to answer a specific type of question.

The same principle applies to data visualization. Choosing the wrong chart type doesn't make the data wrong; it makes the message misleading or confusing. The right chart makes the answer to your question obvious within two seconds of looking at it.

ℹ️
The Student Results Dataset

All four charts in this topic use the same dataset of 30 students. It has six columns: name, math_score, science_score, english_score, study_hours (per day), and pass_fail.

Python
import matplotlib.pyplot as plt
import pandas as pd

# Load student results dataset
df = pd.read_csv('../Datasets/7_2_student_results.csv')

print("Student Results Dataset:")
print(df.head(10))
print(f"\nShape: {df.shape}")
▶ Output
Student Results Dataset:
            name  math_score  science_score  english_score  study_hours  pass_fail
0      Ahmed Ali          79             55             66          6.6       Pass
1   Sara Mohamed          87             59             63          6.9       Pass
2    Omar Hassan          67             48             53          4.9       Pass
3   Nour Ibrahim          50             31             36          3.1       Fail
4  Youssef Kamal          39             30             60          3.3       Fail
5   Mariam Fathy          76             53             70          7.1       Pass
6  Khaled Nasser          87             70             91          9.2       Pass
7      Hana Sami          79             59             50          6.3       Pass
8  Tarek Mahmoud          75             49             45          5.8       Pass
9      Dina Adel          66             62             55          6.4       Pass

Shape: (30, 6)
📊 Bar Chart: Comparing Categories

Use a bar chart when you want to compare discrete, named categories. The question it answers is: Which category is largest? Which is smallest? The bars make differences in magnitude visually obvious.

In the student dataset, we have three subjects: Math, Science, and English. To compare average performance across subjects, a bar chart is the natural choice. Each subject is a category; the average score is the value.

Python
# Bar Chart β€” compare average scores by subject
subjects = ['math_score', 'science_score', 'english_score']
averages = [df[s].mean() for s in subjects]
labels = ['Math', 'Science', 'English']

fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(labels, averages, color='steelblue')

ax.set_title('Average Subject Scores')
ax.set_xlabel('Subject')
ax.set_ylabel('Average Score')
ax.set_ylim(0, 100)

plt.tight_layout()
plt.show()
Chart output

The code works as follows. First, we build a list of column names (subjects) and compute the mean of each with a list comprehension: [df[s].mean() for s in subjects] iterates over the column names and returns one average per column. Next, ax.bar(categories, values) draws one vertical bar for each category, with height equal to the corresponding value. Finally, ax.set_ylim(0, 100) makes the y-axis run from 0 to 100. Starting at zero matters for bar charts: a y-axis that starts above zero exaggerates differences and can mislead the reader.
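As a quick check of the averaging step, here is a minimal sketch in plain Python that applies the same list-comprehension pattern to the ten preview rows printed earlier. The `preview` dictionary is a hypothetical stand-in for `df`, so these averages cover only those ten students and will differ from the full 30-student results.

```python
from statistics import mean

# Hypothetical stand-in for df: only the ten rows shown by df.head(10)
preview = {
    'math_score':    [79, 87, 67, 50, 39, 76, 87, 79, 75, 66],
    'science_score': [55, 59, 48, 31, 30, 53, 70, 59, 49, 62],
    'english_score': [66, 63, 53, 36, 60, 70, 91, 50, 45, 55],
}

subjects = ['math_score', 'science_score', 'english_score']

# Same pattern as the lesson code: one average per column name
averages = [mean(preview[s]) for s in subjects]

for subject, average in zip(subjects, averages):
    print(f"{subject}: {average:.1f}")
```

With the real DataFrame, `df[subjects].mean()` returns the same three averages in one call, as a pandas Series indexed by column name.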

⚠️
Always Start Bar Charts at Zero

Bar charts encode values as bar height. If the y-axis starts at 70 instead of 0, a bar at 80 looks twice as tall as one at 75 (10 units drawn versus 5), even though the values differ by only 5 points, or about 7%. Start bar charts at zero to keep comparisons honest. Use ax.set_ylim(0, max_value) explicitly.
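The arithmetic behind this warning can be checked directly. The sketch below compares the drawn heights of two bars for different axis starting points; `drawn_height_ratio` is a hypothetical helper written for this illustration, not part of Matplotlib.

```python
def drawn_height_ratio(value_a, value_b, axis_start):
    """Ratio of the drawn bar heights when the y-axis begins at axis_start."""
    return (value_a - axis_start) / (value_b - axis_start)

true_ratio = 80 / 75                                    # ratio of the actual values
honest = drawn_height_ratio(80, 75, axis_start=0)       # axis starts at zero
truncated = drawn_height_ratio(80, 75, axis_start=70)   # axis starts at 70

print(f"True ratio of values:      {true_ratio:.2f}")
print(f"Drawn ratio, axis from 0:  {honest:.2f}")
print(f"Drawn ratio, axis from 70: {truncated:.2f}")
```

Only the zero-based axis draws the bars in the same proportion as the values themselves.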

📉 Line Chart: Tracking a Sequence of Values

Use a line chart when your data follows a natural sequence, most commonly time. The question it answers is: How does this value change across a series of ordered points? The connected line reveals the direction and shape of the trend at a glance.

The student dataset has no time column, so here the sequence is the order in which the students are listed: we plot each student's math score in that order. The connected line makes it easy to spot which students score high or low and how scores vary across the class.

Python
# Line Chart β€” math score for each student in sequence
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(df['name'], df['math_score'])

ax.set_title('Math Score by Student')
ax.set_xlabel('Student')
ax.set_ylabel('Math Score')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()
Chart output
ℹ️
Line vs Bar: a one-question test

You have student scores listed in order. You want to show how scores vary from student to student across the class sequence. Which is more appropriate β€” a line chart or a bar chart?

Answer: A line chart, because the goal is to show how scores vary along the class sequence, and the connected line makes that variation easy to follow. A bar chart would treat each student as an independent category and lose the sense of sequence.

📈 Scatter Plot: Revealing Correlations

Use a scatter plot when you want to show the relationship between two continuous variables. The question it answers is: When variable A increases, does variable B also increase? Decrease? Show no pattern?

For the student dataset, a natural question is: do students who study more hours get higher math scores? Each student becomes a dot, with their study hours on the x-axis and their math score on the y-axis. If a correlation exists, the dots will form a visible diagonal pattern.

Python
# Scatter Plot β€” correlation between study hours and math score
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(df['study_hours'], df['math_score'],
           color='purple', alpha=0.7)

ax.set_title('Study Hours vs Math Score')
ax.set_xlabel('Study Hours per Day')
ax.set_ylabel('Math Score')

plt.tight_layout()
plt.show()
Chart output

ax.scatter(x, y) draws one dot for each data point, placing it at the x-coordinate (study hours) and y-coordinate (math score) for that student. If a positive trend exists, the dots will rise from left to right. With 30 students, some dots may overlap, which is why the call above passes alpha=0.7.
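A scatter plot shows the pattern visually; a correlation coefficient puts a number on it. The sketch below computes Pearson's r by hand for the ten preview rows printed earlier. The function name `pearson_r` is hypothetical, and the result describes only those ten students, not the full dataset.

```python
from math import sqrt

# Ten preview rows from df.head(10): study hours per day and math score
study_hours = [6.6, 6.9, 4.9, 3.1, 3.3, 7.1, 9.2, 6.3, 5.8, 6.4]
math_score = [79, 87, 67, 50, 39, 76, 87, 79, 75, 66]

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

r = pearson_r(study_hours, math_score)
print(f"Pearson r for the ten preview rows: {r:.2f}")
```

For these ten rows r comes out at roughly 0.89, a strong positive relationship that matches the upward slope of the dots. On the full DataFrame, `df['study_hours'].corr(df['math_score'])` computes the same statistic.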

💡
The alpha Parameter: Handling Overlap

alpha controls the transparency of each dot, ranging from 0 (fully transparent) to 1 (fully opaque). When multiple data points cluster together, making each dot slightly transparent (e.g., alpha=0.7) lets you see the density of overlapping points: darker regions mean more data concentrated there.
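To see why overlapping dots look darker, the sketch below estimates the combined opacity of several identical dots stacked on the same spot. It assumes standard "over" alpha compositing, where each new layer covers a fraction alpha of whatever still shows through. `stacked_opacity` is a hypothetical helper for illustration, not a Matplotlib function.

```python
def stacked_opacity(alpha, n_dots):
    """Combined opacity of n_dots identical dots drawn exactly on top of each other."""
    # Each dot lets (1 - alpha) of the background through; layers multiply
    return 1 - (1 - alpha) ** n_dots

for alpha in (0.7, 0.3):
    levels = [stacked_opacity(alpha, n) for n in (1, 2, 3)]
    formatted = ", ".join(f"{level:.3f}" for level in levels)
    print(f"alpha={alpha}: opacity for 1, 2, 3 stacked dots = {formatted}")
```

Under this assumption, alpha=0.7 is already about 91% opaque after two dots, so it separates "one point" from "several" but little beyond that. A lower alpha leaves more room to distinguish dense clusters, at the cost of fainter single dots.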

📦 Histogram: Understanding Distribution

Use a histogram when you want to understand how a single variable is distributed. The question it answers is: Where do most values cluster? Is the data symmetric, skewed, or spread out?

A histogram divides the full range of values into equal-width bins and counts how many data points fall into each bin. The height of each bar represents that count. Unlike a bar chart (which compares named categories), a histogram shows the frequency landscape of a continuous variable.

Python
# Histogram β€” distribution of math scores
fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(df['math_score'], bins=8, color='steelblue', edgecolor='white')

ax.set_title('Distribution of Math Scores')
ax.set_xlabel('Math Score')
ax.set_ylabel('Number of Students')

plt.tight_layout()
plt.show()
Chart output

ax.hist(data, bins=8) automatically divides the full range of math_score values into 8 equal-width intervals (bins) and counts how many students fall into each one. The height of each bar is that count. You do not need to group the data manually; Matplotlib handles all the counting for you.

The bins parameter controls how many equal-width intervals the range is divided into. Too few bins (e.g., 2) lose all the detail. Too many bins (e.g., 30 for 30 data points) leave most bars at a height of 0 or 1, which is equally useless. A good starting point for small datasets is 8–12 bins. The edgecolor='white' parameter draws a thin white outline around each bar, which visually separates neighboring bars.

Bins too few
Detail is lost; you can't see where values cluster
Bins too many
Noise replaces pattern; most bins hold 0 or 1 values
Bins just right
The shape of the distribution is clear and readable
Good default
8–12 bins for datasets of 30–100 rows
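The counting that a histogram performs can be sketched in a few lines of plain Python. The example below bins the ten preview math scores printed earlier into equal-width intervals and shows how the picture changes with the number of bins. `count_equal_width_bins` is a hypothetical helper, a simplified illustration rather than Matplotlib's actual implementation, and it assumes the values are not all identical.

```python
def count_equal_width_bins(values, bins):
    """Count values per equal-width bin spanning min(values) to max(values)."""
    low, high = min(values), max(values)
    width = (high - low) / bins   # assumes high > low
    counts = [0] * bins
    for value in values:
        # The maximum value is placed in the last bin instead of overflowing
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts

# Ten preview rows from df.head(10), not the full 30-student column
math_preview = [79, 87, 67, 50, 39, 76, 87, 79, 75, 66]

for bins in (2, 4, 10):
    print(f"bins={bins:>2}: {count_equal_width_bins(math_preview, bins)}")
```

With ten values, 2 bins collapse everything into "low" and "high", while 10 bins leave every bar at a count of 0, 1, or 2. The same trade-off applies, at larger bin counts, to the full 30-row column.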
🗺️ The Chart Selection Decision Guide

Every chart type is built to answer a specific kind of question. Before reaching for a chart, ask: What is my question? The answer determines the chart.

| Question Type | Chart Type | Example Question | Matplotlib Method |
| --- | --- | --- | --- |
| Compare categories | Bar Chart | Which city has the highest population? | ax.bar() |
| Show trend over time | Line Chart | How did temperature change month by month? | ax.plot() |
| Find correlation | Scatter Plot | Do students who study more score higher? | ax.scatter() |
| Show distribution | Histogram | How are math scores distributed across 30 students? | ax.hist() |
Rule 1
Categories → Bar Chart
When comparing named groups (cities, subjects, products), heights make differences obvious.
Rule 2
Time sequence → Line Chart
When values follow a natural order (months, years, steps), lines reveal trends and direction.
Rule 3
Two variables → Scatter Plot
When exploring relationships between two continuous measures, dots reveal correlation patterns.
Rule 4
One variable → Histogram
When you need to understand how a single variable spreads across its range, bins show the shape.
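The decision guide can also be written down as a small lookup, which makes the "question first, chart second" habit explicit. This is only an illustrative sketch: `CHART_GUIDE` and `suggest_chart` are hypothetical names, and the keys simply mirror the question types in the guide.

```python
# Hypothetical lookup table that mirrors the decision guide
CHART_GUIDE = {
    'compare categories': ('Bar Chart', 'ax.bar()'),
    'show trend over time': ('Line Chart', 'ax.plot()'),
    'find correlation': ('Scatter Plot', 'ax.scatter()'),
    'show distribution': ('Histogram', 'ax.hist()'),
}

def suggest_chart(question_type):
    """Return (chart type, Matplotlib method) for a question type from the guide."""
    key = question_type.strip().lower()
    if key not in CHART_GUIDE:
        raise ValueError(f"Unknown question type: {question_type!r}")
    return CHART_GUIDE[key]

chart, method = suggest_chart('Show distribution')
print(f"{chart} via {method}")
```

Raising an error for an unknown question type is a deliberate choice in this sketch: if the question does not fit one of the four types, it is better to rethink the question than to fall back on a default chart.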
?
A data analyst wants to show how the average daily temperature in Cairo changed month by month throughout 2024. Which chart type should they use?
  • The right chart type is determined by the question you are asking, not by personal preference.
  • Bar Chart (ax.bar()): Compare discrete named categories, and always start the y-axis at zero.
  • Line Chart (ax.plot()): Show trends across a natural sequence; ideal for time-series data.
  • Scatter Plot (ax.scatter()): Reveal correlation between two continuous variables; use alpha to handle overlapping points.
  • Histogram (ax.hist()): Show how a single variable is distributed; choose bins carefully (8–12 is a good default).
  • The edgecolor parameter in ax.hist() adds borders between bars for better readability.
  • Chart selection is a communication decision: the wrong chart type can hide the very pattern you are trying to show.