
Topic 5.5: Vectorization and Statistical Operations

Performing calculations on entire arrays at once and extracting meaningful insights

🚀 The Power of Vectorization

Imagine you need to convert a list of prices from dollars to Egyptian pounds. The exchange rate is 30.9 EGP per dollar. In traditional programming, you'd write a loop:

Python
# The slow way: loop through each price
prices_usd = [10, 25, 50, 100, 200]
prices_egp = []

for price in prices_usd:
    prices_egp.append(price * 30.9)

print(prices_egp)
▶ Output
[309.0, 772.5, 1545.0, 3090.0, 6180.0]

This works, but it's slow for large datasets because Python processes one price at a time. NumPy lets you express the same operation on the entire array at once:

Python
import numpy as np

# The fast way: vectorized operation
prices_usd = np.array([10, 25, 50, 100, 200])
prices_egp = prices_usd * 30.9

print(prices_egp)
▶ Output
[ 309.   772.5 1545.  3090.  6180. ]

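The * operator automatically applies to every element, with no explicit loop. This is vectorization: expressing an operation on an entire array at once. Behind the scenes, NumPy still loops over the elements, but it does so in compiled C code over a contiguous block of same-typed values, which avoids the per-element overhead of the Python interpreter.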
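To see the difference in practice, the sketch below times both approaches on a larger dataset. The array size (one million prices) and the helper names convert_with_loop and convert_vectorized are assumptions made up for this illustration. Timings vary from machine to machine, so no numbers are shown here.

```python
import timeit

import numpy as np

RATE = 30.9  # EGP per USD, the rate used in the example above

# Hypothetical larger dataset: one million prices (1 to 1,000,000 USD)
prices_list = list(range(1, 1_000_001))
prices_array = np.array(prices_list)


def convert_with_loop(prices):
    """Convert prices one at a time in a Python loop."""
    converted = []
    for price in prices:
        converted.append(price * RATE)
    return converted


def convert_vectorized(prices):
    """Convert all prices with a single array expression."""
    return prices * RATE


# Both approaches produce the same numbers
loop_result = convert_with_loop(prices_list)
vectorized_result = convert_vectorized(prices_array)
print(np.allclose(loop_result, vectorized_result))

# Time five runs of each; exact numbers depend on your machine
loop_time = timeit.timeit(lambda: convert_with_loop(prices_list), number=5)
vector_time = timeit.timeit(lambda: convert_vectorized(prices_array), number=5)
print(f"Loop:       {loop_time:.3f} s")
print(f"Vectorized: {vector_time:.3f} s")
```

On typical hardware the vectorized version should finish much sooner, but treat the measurement as a rough indication rather than a benchmark.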

Universal Functions (Ufuncs)

NumPy provides dozens of universal functions (ufuncs) that work element-wise on arrays. They're called "universal" because they work on arrays of any shape:

Python
# Temperature data in Celsius
temps_c = np.array([20, 25, 30, 35, 40])

# Convert to Fahrenheit: F = C × 9/5 + 32
temps_f = temps_c * 9/5 + 32
print(f"Fahrenheit: {temps_f}")

# Square each value
squares = temps_c ** 2
print(f"Squares: {squares}")

# Square root (rounded to 2 decimals for display)
roots = np.sqrt(temps_c)
print(f"Square roots: {np.round(roots, 2)}")

# Natural logarithm (rounded to 2 decimals for display)
logs = np.log(temps_c)
print(f"Natural log: {np.round(logs, 2)}")
▶ Output
Fahrenheit: [ 68.  77.  86.  95. 104.]
Squares: [ 400  625  900 1225 1600]
Square roots: [4.47 5.   5.48 5.92 6.32]
Natural log: [3.   3.22 3.4  3.56 3.69]

Each operation applies to every element automatically. You don't write loops or track indices; you just express the mathematical operation you want, and NumPy handles the rest.
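The arithmetic operators are shorthand for named ufuncs, which is why they behave the same way as np.sqrt and np.log. The short sketch below checks that equivalence and applies a ufunc to a 2-D array. The temps_grid array is made-up data, chosen only so the square roots come out as whole numbers.

```python
import numpy as np

temps_c = np.array([20, 25, 30, 35, 40])

# Operators dispatch to ufuncs: * is np.multiply, ** is np.power
print(isinstance(np.multiply, np.ufunc))
print(np.array_equal(temps_c * 2, np.multiply(temps_c, 2)))
print(np.array_equal(temps_c ** 2, np.power(temps_c, 2)))

# The same ufunc works unchanged on a 2-D array (illustrative data)
temps_grid = np.array([[16, 25],
                       [36, 49]])
roots_grid = np.sqrt(temps_grid)
print(roots_grid.shape)
print(roots_grid)
```

The result keeps the shape of the input, whether that input is one-dimensional or two-dimensional.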

📊 Statistical Aggregation

Data science is about finding patterns and summarizing information. NumPy provides a suite of functions that reduce arrays to single summary values:

Python
# Daily temperatures for a week
temps = np.array([22, 24, 28, 27, 25, 23, 21])

print(f"Minimum: {temps.min()}°C")
print(f"Maximum: {temps.max()}°C")
print(f"Average (mean): {temps.mean():.1f}°C")
print(f"Median (middle value): {np.median(temps)}°C")
print(f"Sum total: {temps.sum()}°C")
print(f"Standard deviation: {temps.std():.2f}°C")
▶ Output
Minimum: 21°C
Maximum: 28°C
Average (mean): 24.3°C
Median (middle value): 24.0°C
Sum total: 170°C
Standard deviation: 2.37°C

These aggregation functions collapse an entire array into a single number that tells you something meaningful about the data.
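You may have noticed that the median was computed with np.median(temps) rather than a method call. As a small illustration, the sketch below shows that most aggregations are available both as array methods and as np functions, while the median exists only in function form.

```python
import numpy as np

temps = np.array([22, 24, 28, 27, 25, 23, 21])

# Most aggregations exist both as array methods and as np.* functions
print(temps.mean() == np.mean(temps))
print(temps.sum() == np.sum(temps))

# The median is only available as a function, not as an array method
print(hasattr(temps, "median"))
print(np.median(temps))
```

Which form you use for the shared functions is mostly a matter of style; the results are identical.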

Understanding Standard Deviation

The standard deviation measures how spread out the data is. A small standard deviation means the numbers are clustered near the average. A large one means they're spread far apart.

Python
# Two datasets with the same mean but different spreads
consistent = np.array([23, 24, 24, 25, 24])
variable = np.array([10, 20, 24, 30, 36])

print(f"Consistent mean: {consistent.mean()}")
print(f"Consistent std: {consistent.std():.2f}")

print(f"\nVariable mean: {variable.mean()}")
print(f"Variable std: {variable.std():.2f}")
▶ Output
Consistent mean: 24.0
Consistent std: 0.63

Variable mean: 24.0
Variable std: 8.85

Both datasets average to 24, but the first has a standard deviation of 0.63 (very consistent), while the second has 8.85 (highly variable). This tells you the second dataset is much less predictable.
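To make the definition concrete, here is a minimal sketch that rebuilds the standard deviation step by step and compares it with std(). By default, NumPy divides by the number of values N (the population standard deviation); passing ddof=1 divides by N - 1 instead, which gives the sample standard deviation. The variable names manual_std and sample_std are illustrative.

```python
import numpy as np

variable = np.array([10, 20, 24, 30, 36])

# Step by step: deviations from the mean, squared, averaged, square-rooted
deviations = variable - variable.mean()
manual_std = np.sqrt((deviations ** 2).mean())

# The manual result matches NumPy's built-in std()
print(np.isclose(manual_std, variable.std()))

# ddof=1 divides by N - 1 instead of N (the sample standard deviation)
sample_std = variable.std(ddof=1)
print(f"Population std: {variable.std():.2f}")
print(f"Sample std:     {sample_std:.2f}")
```

The sample version is always slightly larger, and the gap shrinks as the dataset grows.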

📐 Operating Along Axes

When you have multi-dimensional data, you often want to aggregate along a specific direction. Do you want the average of each row, or the average of each column? That's where the axis parameter comes in.

Python
# Test scores: 3 students, 4 tests
scores = np.array([
    [85, 90, 88, 92],  # Student 0
    [78, 82, 80, 85],  # Student 1
    [92, 95, 93, 96]   # Student 2
])

print("Score table:")
print(scores)

# Average for each student (across tests)
student_avgs = scores.mean(axis=1)
print(f"\nStudent averages: {student_avgs}")

# Average for each test (across students)
test_avgs = scores.mean(axis=0)
print(f"Test averages: {test_avgs}")
▶ Output
Score table:
[[85 90 88 92]
 [78 82 80 85]
 [92 95 93 96]]

Student averages: [88.75 81.25 94.  ]
Test averages: [85. 89. 87. 91.]

When axis=1, you aggregate across columns (horizontally)—giving you one value per row. When axis=0, you aggregate down rows (vertically)—giving you one value per column.

Understanding Axes
axis=0 (down the rows)
  • Aggregates vertically
  • Result has same number of columns
  • Example: Average score per test
axis=1 (across the columns)
  • Aggregates horizontally
  • Result has same number of rows
  • Example: Average score per student

A helpful way to remember: the axis you specify is the one that disappears. If you use axis=0 on a (3,4) array, the 3 goes away and you're left with 4 values. If you use axis=1, the 4 goes away and you're left with 3 values.
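You can check the "disappearing axis" rule by printing shapes. The sketch below also uses the keepdims=True option, which keeps the reduced axis with length 1 so the result still lines up with the original rows. The diff_from_avg calculation is an illustrative extra, not part of the lesson's original example.

```python
import numpy as np

scores = np.array([
    [85, 90, 88, 92],  # Student 0
    [78, 82, 80, 85],  # Student 1
    [92, 95, 93, 96],  # Student 2
])

print(scores.shape)               # (3, 4)
print(scores.mean(axis=0).shape)  # (4,)  the 3 is gone
print(scores.mean(axis=1).shape)  # (3,)  the 4 is gone

# keepdims=True keeps the reduced axis with length 1
student_avgs = scores.mean(axis=1, keepdims=True)
print(student_avgs.shape)         # (3, 1)

# How far each score is from that student's own average
diff_from_avg = scores - student_avgs
print(diff_from_avg)
```

Each row of diff_from_avg sums to zero, because the values are measured against that row's own mean.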

🎯 Finding Extremes: argmin and argmax

Sometimes you don't just want to know the minimum or maximum value—you want to know where it is. The argmin() and argmax() functions return the index of the extreme value:

Python
# Prices of different products
prices = np.array([15.99, 8.50, 12.25, 6.75, 22.00])

# Find the cheapest and most expensive
cheapest_idx = prices.argmin()
most_expensive_idx = prices.argmax()

print(f"Cheapest price: {prices[cheapest_idx]} at index {cheapest_idx}")
print(f"Most expensive: {prices[most_expensive_idx]} at index {most_expensive_idx}")
▶ Output
Cheapest price: 6.75 at index 3
Most expensive: 22.0 at index 4

This is especially powerful in machine learning. When a model makes predictions, it outputs probabilities for each possible answer. You use argmax() to find which answer has the highest probability:

Python
# AI model predicting emotion from 4 possibilities
predictions = np.array([0.05, 0.12, 0.78, 0.05])
emotions = ['Happy', 'Sad', 'Angry', 'Surprised']

# Which emotion did the model predict?
predicted_idx = predictions.argmax()
confidence = predictions[predicted_idx]

print(f"Prediction: {emotions[predicted_idx]}")
print(f"Confidence: {confidence*100:.1f}%")
▶ Output
Prediction: Angry
Confidence: 78.0%

The model is 78% confident the emotion is "Angry" (index 2, the highest probability). This pattern—using argmax to pick the winner—is fundamental to how AI systems make decisions.
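In practice a model usually scores many inputs at once, so the probabilities arrive as a 2-D array with one row per input. The sketch below combines argmax with the axis parameter to pick a winner for every row. The batch_predictions values are invented for illustration and do not come from a real model.

```python
import numpy as np

emotions = ['Happy', 'Sad', 'Angry', 'Surprised']

# Hypothetical model outputs for three inputs; each row sums to 1
batch_predictions = np.array([
    [0.05, 0.12, 0.78, 0.05],
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.10, 0.40],
])

# axis=1 picks the winning column within each row
predicted_indices = batch_predictions.argmax(axis=1)
confidences = batch_predictions.max(axis=1)

for idx, conf in zip(predicted_indices, confidences):
    print(f"{emotions[idx]}: {conf * 100:.1f}%")

# With tied values, argmax returns the index of the first occurrence
print(np.array([0.5, 0.5]).argmax())
```

The tie-breaking rule is worth remembering: when two entries share the maximum, the earlier index wins.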

💡 Practical Example: Sales Analysis

Let's combine vectorization and statistics to analyze realistic sales data:

Python
# Weekly sales data: 4 products × 7 days
sales = np.array([
    [120, 135, 142, 128, 155, 168, 145],  # Product A
    [85, 92, 88, 95, 102, 98, 90],        # Product B
    [200, 215, 208, 225, 240, 235, 220],  # Product C
    [45, 52, 48, 55, 60, 58, 50]          # Product D
])

print(f"Sales data shape: {sales.shape}")

# Calculate total sales per product (sum across days)
product_totals = sales.sum(axis=1)
print(f"\nTotal sales per product: {product_totals}")

# Calculate average daily sales per product
product_averages = sales.mean(axis=1)
print(f"Average daily sales per product: {product_averages.round(2)}")

# Find best-selling product
best_product_idx = product_totals.argmax()
print(f"\nBest-selling product: Product {chr(65+best_product_idx)}")
print(f"Total units sold: {product_totals[best_product_idx]}")

# Calculate total sales per day (sum across products)
daily_totals = sales.sum(axis=0)
print(f"\nDaily totals: {daily_totals}")

# Find best sales day
best_day_idx = daily_totals.argmax()
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
print(f"Best sales day: {days[best_day_idx]} ({daily_totals[best_day_idx]} units)")

# Calculate consistency (standard deviation) for Product A
product_a_std = sales[0].std()
print(f"\nProduct A consistency (std): {product_a_std:.2f}")
▶ Output
Sales data shape: (4, 7)

Total sales per product: [ 993  650 1543  368]
Average daily sales per product: [141.86  92.86 220.43  52.57]

Best-selling product: Product C
Total units sold: 1543

Daily totals: [450 494 486 503 557 559 505]
Best sales day: Sat (559 units)

Product A consistency (std): 15.04

This analysis reveals that Product C is the top seller, Saturday has the highest sales, and Product A has fairly consistent daily sales (a standard deviation of 15.04 units). All of this takes just a few lines of NumPy code, with no manual loops.
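The consistency check does not have to be limited to one product. As a possible extension, the sketch below uses axis=1 to compute the standard deviation for every product at once, then argmin() to find the steadiest seller. The names product_stds and steadiest_idx are illustrative.

```python
import numpy as np

sales = np.array([
    [120, 135, 142, 128, 155, 168, 145],  # Product A
    [85, 92, 88, 95, 102, 98, 90],        # Product B
    [200, 215, 208, 225, 240, 235, 220],  # Product C
    [45, 52, 48, 55, 60, 58, 50],         # Product D
])

# Standard deviation of daily sales for every product at once
product_stds = sales.std(axis=1)
for i, std in enumerate(product_stds):
    print(f"Product {chr(65 + i)} std: {std:.2f}")

# The smallest std marks the product with the steadiest daily sales
steadiest_idx = int(product_stds.argmin())
print(f"Steadiest product: Product {chr(65 + steadiest_idx)}")
```

Keep in mind that a raw standard deviation favors low-volume products, so compare it against each product's average before drawing conclusions.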

  • Vectorization means applying operations to entire arrays at once—faster and more readable than loops.
  • Universal functions (ufuncs) like +, *, sqrt(), log() work element-wise on arrays of any shape automatically.
  • Aggregation functions (sum, mean, min, max, std) reduce arrays to summary statistics that reveal patterns.
  • Standard deviation measures data spread—low values mean consistent data, high values mean high variability.
  • The axis parameter controls aggregation direction: axis=0 aggregates down rows, axis=1 aggregates across columns.
  • argmin() and argmax() return the INDEX of minimum/maximum values—essential for finding locations, not just values.
  • These operations form the foundation of data analysis—every complex calculation builds on these basic aggregations.