Topic 5.4: Navigating Your Data

📏 Understanding Array Structure

Before you can work with data effectively, you need to understand its structure. Think of it like navigating a city—you need to know the layout before you can find specific addresses. NumPy gives you several tools to inspect an array's structure.

The Three Essential Attributes

Python

import numpy as np

# Create a sample 2D array (a small table)
grid = np.array([
    [10, 20, 30],
    [40, 50, 60]
])

print(f"Number of dimensions: {grid.ndim}")
print(f"Shape (rows, columns): {grid.shape}")
print(f"Total elements: {grid.size}")
print(f"Data type: {grid.dtype}")

▶ Output

Number of dimensions: 2
Shape (rows, columns): (2, 3)
Total elements: 6
Data type: int64

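The .ndim attribute tells you how many dimensions the array has. Our grid has 2 dimensions: it's a table with rows and columns. The .shape attribute gives the size of each dimension: (2, 3) means 2 rows and 3 columns. The .size attribute is the total element count: 2 × 3 = 6. The example also prints .dtype, the data type shared by every element; the default integer type is int64 on most platforms, though some setups (for example, older NumPy versions on Windows) report int32.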
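To see how these attributes change with dimensionality, here is a minimal sketch comparing a 1D, 2D, and 3D array. The array names (line, table, cube) are illustrative and not part of the lesson's running example.

```python
import numpy as np

# Hypothetical arrays used only to illustrate the attributes
line = np.arange(4)                    # 1D: 4 elements
table = np.arange(6).reshape(2, 3)     # 2D: 2 rows, 3 columns
cube = np.arange(24).reshape(2, 3, 4)  # 3D: 2 blocks, each 3 rows by 4 columns

for name, arr in [("line", line), ("table", table), ("cube", cube)]:
    print(f"{name}: ndim={arr.ndim}, shape={arr.shape}, size={arr.size}")
```

In each case .ndim equals the length of the .shape tuple, and .size equals the product of its entries.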

✂️ Slicing: Extracting Subsets

Often you don't need the entire array—you just need a piece of it. NumPy's slicing syntax lets you extract any rectangular section you want. It's like using a cookie cutter to get just the shape you need from a sheet of dough.

One-Dimensional Slicing

For 1D arrays, slicing uses the same start:stop:step syntax as Python lists:

Python

# A simple 1D array
temperatures = np.array([22, 24, 26, 25, 23, 21, 20])

# Get the first three days
first_three = temperatures[0:3]
print(f"First three: {first_three}")

# Get the last three days
last_three = temperatures[-3:]
print(f"Last three: {last_three}")

# Get every other day
every_other = temperatures[::2]
print(f"Every other: {every_other}")

▶ Output

First three: [22 24 26]
Last three: [23 21 20]
Every other: [22 26 23 20]
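The general slice form is start:stop:step, where stop is exclusive and any part can be omitted. The following sketch reuses the temperatures array from above; the variable names are illustrative.

```python
import numpy as np

temperatures = np.array([22, 24, 26, 25, 23, 21, 20])

# start=2, stop=5: indices 2, 3, 4 (the stop index is not included)
middle = temperatures[2:5]
print(f"Middle: {middle}")

# step=3: indices 0, 3, 6
every_third = temperatures[::3]
print(f"Every third: {every_third}")

# A negative step walks backwards through the array
reversed_week = temperatures[::-1]
print(f"Reversed: {reversed_week}")
```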

Two-Dimensional Slicing

This is where NumPy really shines. You can slice both dimensions at once by separating them with a comma:

Python

# Create a 4x4 grid of numbers
data = np.arange(16).reshape(4, 4)
print("Full array:")
print(data)

# Extract a 2x2 section from the top-right
top_right = data[0:2, 2:4]
print("\nTop-right 2x2:")
print(top_right)

# Get the entire second column
second_column = data[:, 1]
print(f"\nSecond column: {second_column}")

# Get the last row
last_row = data[-1, :]
print(f"Last row: {last_row}")

▶ Output

Full array:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]

Top-right 2x2:
[[2 3]
 [6 7]]

Second column: [ 1  5  9 13]
Last row: [12 13 14 15]

The syntax is array[row_slice, column_slice]. A colon : by itself means "all of them." So data[:, 1] means "all rows, second column." And data[-1, :] means "last row, all columns."
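One detail worth checking for yourself: indexing a dimension with a single integer drops that dimension, while a slice keeps it. The sketch below reuses the 4x4 data array and also shows a step in both dimensions; the variable names are illustrative.

```python
import numpy as np

data = np.arange(16).reshape(4, 4)

# An integer index removes the column dimension: the result is 1D
col_1d = data[:, 1]
print(f"data[:, 1] shape: {col_1d.shape}")

# A one-element slice keeps the column dimension: the result is 2D
col_2d = data[:, 1:2]
print(f"data[:, 1:2] shape: {col_2d.shape}")

# Steps work per dimension: rows 0 and 3, columns 0 and 3 (the four corners)
corners = data[::3, ::3]
print(corners)
```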

ℹ️ Slices Are Views, Not Copies

When you slice an array, NumPy doesn't create a copy of the data—it creates a view that points to the same memory. If you modify the slice, you're modifying the original array too. This is intentional for memory efficiency, but it can be surprising. If you want a true copy, use .copy().

Python

# Demonstrate view behavior
original = np.array([1, 2, 3, 4, 5])
slice_view = original[1:4]
slice_view[0] = 99

print(f"Original array: {original}")
# Original changed too!

# If you want a copy instead:
original2 = np.array([1, 2, 3, 4, 5])
slice_copy = original2[1:4].copy()
slice_copy[0] = 99
print(f"Original array: {original2}")
# Original unchanged

▶ Output

Original array: [ 1 99  3  4  5]
Original array: [1 2 3 4 5]
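If you are unsure whether two arrays share data, you can ask NumPy directly rather than modifying one and watching the other. A small sketch using np.shares_memory and the .base attribute (variable names are illustrative):

```python
import numpy as np

original = np.array([1, 2, 3, 4, 5])
view = original[1:4]            # basic slice: a view
copied = original[1:4].copy()   # explicit copy: independent data

# np.shares_memory reports whether two arrays overlap in memory
print(f"View shares memory: {np.shares_memory(original, view)}")
print(f"Copy shares memory: {np.shares_memory(original, copied)}")

# A view records the array it was derived from in .base
print(f"view.base is original: {view.base is original}")
```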

🎯 Boolean Masking: Filtering by Condition

One of NumPy's most powerful features is the ability to filter data using logical conditions. Instead of writing loops with if statements, you create a boolean mask—an array of True/False values—and use it to extract only the elements you want.

Creating a Boolean Mask

Python

# Temperature data for a week
temps = np.array([22, 28, 31, 27, 33, 29, 24])

# Create a mask: which days were hot (above 28°C)?
hot_days_mask = temps > 28
print(f"Mask: {hot_days_mask}")
print(f"Type: {hot_days_mask.dtype}")

▶ Output

Mask: [False False  True False  True  True False]
Type: bool

The comparison temps > 28 doesn't return a single True or False—it returns an array of booleans, one for each element. Each position is True if that element satisfies the condition, False otherwise.

Using the Mask to Filter

Python

# Extract only the hot days
hot_temps = temps[hot_days_mask]
print(f"Hot days (>28°C): {hot_temps}")

# You can combine the steps into one line:
really_hot = temps[temps > 30]
print(f"Really hot days (>30°C): {really_hot}")

# Count how many days were hot
num_hot_days = (temps > 28).sum()
print(f"Number of hot days: {num_hot_days}")

▶ Output

Hot days (>28°C): [31 33 29]
Really hot days (>30°C): [31 33]
Number of hot days: 3

When you use a boolean array as an index, NumPy returns only the elements where the mask is True. It's like a filter that lets through only what you want. The .sum() trick works because True counts as 1 and False as 0.
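For comparison, here is a sketch of the loop-based approach that masking replaces, next to the mask version. It also shows np.count_nonzero as an alternative to the .sum() trick. The variable names are illustrative.

```python
import numpy as np

temps = np.array([22, 28, 31, 27, 33, 29, 24])

# Loop version: check each element one at a time
hot_loop = []
for t in temps:
    if t > 28:
        hot_loop.append(t)

# Mask version: one expression, same elements in the same order
hot_mask = temps[temps > 28]

print(f"Loop result: {[int(t) for t in hot_loop]}")
print(f"Mask result: {hot_mask}")

# np.count_nonzero counts the True values in a boolean array
count = np.count_nonzero(temps > 28)
print(f"Number of hot days: {count}")
```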

Combining Conditions

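You can combine multiple conditions using & (and), | (or), and ~ (not). Note: you must use these symbols, not Python's and/or/not keywords. The keywords try to reduce the whole array to a single True or False, which raises a ValueError for arrays with more than one element: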

Python

# Find moderate days (between 25 and 30)
moderate = temps[(temps >= 25) & (temps <= 30)]
print(f"Moderate days (25-30°C): {moderate}")

# Find extreme days (very hot or very cold)
extreme = temps[(temps > 30) | (temps < 23)]
print(f"Extreme days: {extreme}")

# Find days that were NOT hot
not_hot = temps[~(temps > 28)]
print(f"Not hot days: {not_hot}")

▶ Output

Moderate days (25-30°C): [28 27 29]
Extreme days: [22 31 33]
Not hot days: [22 28 27 24]

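Notice the parentheses around each condition. They're required because & and | bind more tightly than comparison operators such as >= and <=. Without them, temps >= 25 & temps <= 30 is parsed as temps >= (25 & temps) <= 30, which raises a ValueError instead of filtering the array.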
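The two common mistakes can be reproduced safely inside try/except blocks. This is a minimal sketch; the flag variables (and_failed, no_parens_failed) exist only for illustration.

```python
import numpy as np

temps = np.array([22, 28, 31, 27, 33, 29, 24])

# Mistake 1: Python's `and` asks for one truth value for a whole array
try:
    temps[(temps >= 25) and (temps <= 30)]
    and_failed = False
except ValueError as err:
    and_failed = True
    print(f"Using 'and' fails: {err}")

# Mistake 2: without parentheses, & binds tighter than >= and <=,
# so Python evaluates 25 & temps first and then chains the comparisons
try:
    temps[temps >= 25 & temps <= 30]
    no_parens_failed = False
except ValueError as err:
    no_parens_failed = True
    print(f"Missing parentheses fails: {err}")

# Working form: parenthesize each condition and combine with &
moderate = temps[(temps >= 25) & (temps <= 30)]
print(f"Moderate days: {moderate}")
```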

🎲 Fancy Indexing: Selecting Specific Positions

Sometimes you want to extract elements at specific positions, but those positions aren't in a neat range. Fancy indexing lets you pass a list or array of indices to grab elements in any order you want.

Python

# Days of the week
days = np.array(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])

# Get weekend days (indices 5 and 6)
weekend = days[[5, 6]]
print(f"Weekend: {weekend}")

# Get Monday, Wednesday, Friday (indices 0, 2, 4)
mwf = days[[0, 2, 4]]
print(f"MWF: {mwf}")

# You can even repeat indices
repeated = days[[0, 0, 0, 6]]
print(f"Repeated: {repeated}")

▶ Output

Weekend: ['Sat' 'Sun']
MWF: ['Mon' 'Wed' 'Fri']
Repeated: ['Mon' 'Mon' 'Mon' 'Sun']

Fancy indexing creates a copy, not a view, unlike basic slicing. If you modify the result, the original array is unaffected. Boolean masking behaves the same way.
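A quick way to confirm the copy behavior is to modify the result and inspect the original. The sketch below reuses the days array; the placeholder value 'XXX' is arbitrary.

```python
import numpy as np

days = np.array(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])

# Fancy indexing returns a new array with its own data
picked = days[[5, 6]]
picked[0] = 'XXX'

print(f"Picked:   {picked}")
print(f"Original: {days}")  # 'Sat' is still in place
print(f"Shares memory: {np.shares_memory(days, picked)}")
```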

Fancy Indexing in 2D

Python

# Test scores: 4 students, 3 tests
scores = np.array([
    [85, 90, 88],  # Student 0
    [92, 88, 95],  # Student 1
    [78, 85, 80],  # Student 2
    [88, 92, 90]   # Student 3
])

# Get test scores for students 0 and 2
selected_students = scores[[0, 2]]
print("Students 0 and 2:")
print(selected_students)

# Get specific elements: (student 1, test 2) and (student 3, test 0)
specific_scores = scores[[1, 3], [2, 0]]
print(f"\nSpecific scores: {specific_scores}")

▶ Output

Students 0 and 2:
[[85 90 88]
 [78 85 80]]

Specific scores: [95 88]

When you provide two lists of indices for a 2D array, NumPy pairs them up: scores[[1, 3], [2, 0]] means "give me element (1,2) and element (3,0)." It doesn't give you a rectangular block—it gives you specific individual elements.
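If you do want the rectangular block formed by those rows and columns, one option is np.ix_, which builds index arrays that select every row/column combination. This sketch contrasts the two behaviors on the same scores array; np.ix_ is not covered in the lesson above and is shown here as one possible approach.

```python
import numpy as np

scores = np.array([
    [85, 90, 88],  # Student 0
    [92, 88, 95],  # Student 1
    [78, 85, 80],  # Student 2
    [88, 92, 90]   # Student 3
])

# Paired indexing: elements (1, 2) and (3, 0) only
paired = scores[[1, 3], [2, 0]]
print(f"Paired: {paired}")

# np.ix_: rows 1 and 3 crossed with columns 2 and 0, giving a 2x2 block
block = scores[np.ix_([1, 3], [2, 0])]
print("Block:")
print(block)
```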

🎯 Practical Example: Student Grade Analysis

Let's combine everything we've learned to analyze a realistic dataset:

Python

# Student test scores across 5 tests
all_scores = np.array([
    [85, 90, 88, 92, 87],  # Student 0
    [92, 88, 95, 89, 94],  # Student 1
    [78, 72, 80, 75, 77],  # Student 2
    [88, 92, 90, 93, 91],  # Student 3
    [65, 70, 68, 72, 69]   # Student 4
])

print(f"Dataset shape: {all_scores.shape}")
print(f"Total scores: {all_scores.size}")

# Calculate each student's average
student_averages = all_scores.mean(axis=1)
print(f"\nStudent averages: {student_averages}")

# Find students with average >= 85
strong_students_mask = student_averages >= 85
strong_student_ids = np.where(strong_students_mask)[0]
print(f"Strong students (avg >= 85): {strong_student_ids}")

# Get all scores for strong students
strong_student_scores = all_scores[strong_students_mask]
print("\nStrong students' scores:")
print(strong_student_scores)

# Find the highest score on test 3 (index 2)
test3_scores = all_scores[:, 2]
highest_on_test3 = test3_scores.max()
who_scored_highest = test3_scores.argmax()
print(f"\nHighest on test 3: {highest_on_test3} (Student {who_scored_highest})")

▶ Output

Dataset shape: (5, 5)
Total scores: 25

Student averages: [88.4 91.6 76.4 90.8 68.8]
Strong students (avg >= 85): [0 1 3]

Strong students' scores:
[[85 90 88 92 87]
 [92 88 95 89 94]
 [88 92 90 93 91]]

Highest on test 3: 95 (Student 1)

This example demonstrates real data analysis: checking structure, aggregating across axes, filtering with boolean masks, and extracting specific slices. These are the fundamental operations you'll use constantly in data science work.
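The same tools work along the other axis. As a hedged extension of the example, the sketch below averages per test (axis=0) instead of per student, and uses .any(axis=1) to flag students with at least one score under 70. The threshold of 70 and the variable names are assumptions made for illustration.

```python
import numpy as np

all_scores = np.array([
    [85, 90, 88, 92, 87],  # Student 0
    [92, 88, 95, 89, 94],  # Student 1
    [78, 72, 80, 75, 77],  # Student 2
    [88, 92, 90, 93, 91],  # Student 3
    [65, 70, 68, 72, 69]   # Student 4
])

# axis=0 collapses the rows, giving one average per test (column)
test_averages = all_scores.mean(axis=0)
print(f"Test averages: {test_averages}")

# Index of the test with the lowest class average
hardest_test = test_averages.argmin()
print(f"Lowest-average test index: {hardest_test}")

# Flag students with any score below 70 (illustrative threshold)
struggling_mask = (all_scores < 70).any(axis=1)
struggling_ids = np.where(struggling_mask)[0]
print(f"Students with a score below 70: {struggling_ids}")
```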

Use .ndim, .shape, and .size to understand array structure before manipulating data.
Slicing syntax array[rows, columns] extracts rectangular sections; slices are views (not copies) for memory efficiency.
Boolean masking filters data by condition: array[array > threshold] returns only elements meeting the condition.
Combine conditions with & (and), | (or), ~ (not)—must use these symbols, not Python's and/or keywords, and wrap each condition in parentheses.
Fancy indexing array[[indices]] extracts specific positions in any order—creates copies, not views.
Always check array.shape before slicing to avoid dimension errors, especially with multi-dimensional data.

📚 External Resources

NumPy Indexing and Slicing Guide
https://numpy.org/doc/stable/user/basics.indexing.html
Boolean Indexing in NumPy (Visual Guide)
https://jalammar.github.io/visual-numpy/
Advanced NumPy Indexing
https://scipy-lectures.org/intro/numpy/array_object.html#indexing-and-slicing