Topic 5.4: Navigating Your Data
Understanding array structure and extracting exactly what you need
Before you can work with data effectively, you need to understand its structure. Think of it like navigating a city—you need to know the layout before you can find specific addresses. NumPy gives you several tools to inspect an array's structure.
The Three Essential Attributes
import numpy as np

# Create a sample 2D array (a small table)
grid = np.array([
    [10, 20, 30],
    [40, 50, 60]
])

print(f"Number of dimensions: {grid.ndim}")
print(f"Shape (rows, columns): {grid.shape}")
print(f"Total elements: {grid.size}")
print(f"Data type: {grid.dtype}")

Number of dimensions: 2
Shape (rows, columns): (2, 3)
Total elements: 6
Data type: int64
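The .ndim attribute tells you how many dimensions the array has. Our grid has 2 dimensions: it is a table with rows and columns. The .shape attribute gives the size of each dimension: (2, 3) means 2 rows and 3 columns. The .size attribute is the total element count: 2 × 3 = 6. The .dtype attribute reports the element type; int64 is the usual default for integers, though the exact default can depend on your platform and NumPy version.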
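To see how these attributes relate on something larger than a table, here is a minimal sketch using a hypothetical 3D array (the name `cube` and its shape are illustrative assumptions, not part of the lesson's data). It checks that .ndim is the length of .shape and that .size is the product of the entries in .shape.

```python
import numpy as np

# Hypothetical 3D array: 2 blocks, each with 3 rows and 4 columns
cube = np.zeros((2, 3, 4))

print(cube.ndim)   # number of axes
print(cube.shape)  # length of each axis
print(cube.size)   # total number of elements

# .ndim is the length of .shape, and .size is the product of its entries
ndim_matches = cube.ndim == len(cube.shape)
size_matches = cube.size == int(np.prod(cube.shape))
print(ndim_matches, size_matches)
```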
Often you don't need the entire array—you just need a piece of it. NumPy's slicing syntax lets you extract any rectangular section you want. It's like using a cookie cutter to get just the shape you need from a sheet of dough.
One-Dimensional Slicing
For 1D arrays, slicing uses the same start:stop:step syntax as Python lists:
# A simple 1D array
temperatures = np.array([22, 24, 26, 25, 23, 21, 20])

# Get the first three days
first_three = temperatures[0:3]
print(f"First three: {first_three}")

# Get the last three days
last_three = temperatures[-3:]
print(f"Last three: {last_three}")

# Get every other day
every_other = temperatures[::2]
print(f"Every other: {every_other}")

First three: [22 24 26]
Last three: [23 21 20]
Every other: [22 26 23 20]
Two-Dimensional Slicing
This is where NumPy really shines. You can slice both dimensions at once by separating them with a comma:
# Create a 4x4 grid of numbers
data = np.arange(16).reshape(4, 4)
print("Full array:")
print(data)

# Extract a 2x2 section from the top-right
top_right = data[0:2, 2:4]
print("\nTop-right 2x2:")
print(top_right)

# Get the entire second column
second_column = data[:, 1]
print(f"\nSecond column: {second_column}")

# Get the last row
last_row = data[-1, :]
print(f"Last row: {last_row}")

Full array:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]

Top-right 2x2:
[[2 3]
 [6 7]]

Second column: [ 1  5  9 13]
Last row: [12 13 14 15]
The syntax is array[row_slice, column_slice]. A colon : by itself means "all of them." So data[:, 1] means "all rows, second column." And data[-1, :] means "last row, all columns."
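One detail of this syntax is easy to miss: an integer index drops its axis, while a slice keeps it. The sketch below reuses the 4x4 `data` array from above; the names `col_1d`, `col_2d`, and `corners` are illustrative and not part of the lesson.

```python
import numpy as np

data = np.arange(16).reshape(4, 4)

# An integer index removes that axis: the result is 1D
col_1d = data[:, 1]

# A length-one slice keeps the axis: the result is a 4x1 column
col_2d = data[:, 1:2]

# Slices also accept a step: every other row and every other column
corners = data[::2, ::2]

print(col_1d.shape)
print(col_2d.shape)
print(corners)
```

If later code expects a 2D array (for example, to stack columns side by side), the slice form is usually the safer choice.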
When you slice an array, NumPy doesn't create a copy of the data—it creates a view that points to the same memory. If you modify the slice, you're modifying the original array too. This is intentional for memory efficiency, but it can be surprising. If you want a true copy, use .copy().
# Demonstrate view behavior
original = np.array([1, 2, 3, 4, 5])
slice_view = original[1:4]
slice_view[0] = 99
print(f"Original array: {original}")  # Original changed too!

# If you want a copy instead:
original2 = np.array([1, 2, 3, 4, 5])
slice_copy = original2[1:4].copy()
slice_copy[0] = 99
print(f"Original array: {original2}")  # Original unchanged

Original array: [ 1 99  3  4  5]
Original array: [1 2 3 4 5]
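If you are unsure whether an array is a view or a copy, you can ask NumPy directly. The following is a small sketch using `np.shares_memory` and the `.base` attribute; the variable names are illustrative.

```python
import numpy as np

original = np.array([1, 2, 3, 4, 5])
view = original[1:4]
copy = original[1:4].copy()

# np.shares_memory reports whether two arrays overlap in memory
view_shares = np.shares_memory(original, view)
copy_shares = np.shares_memory(original, copy)
print(view_shares, copy_shares)

# A view's .base refers to the array that owns the data;
# an array that owns its own data has .base set to None
view_has_base = view.base is original
copy_owns_data = copy.base is None
print(view_has_base, copy_owns_data)
```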
One of NumPy's most powerful features is the ability to filter data using logical conditions. Instead of writing loops with if statements, you create a boolean mask—an array of True/False values—and use it to extract only the elements you want.
Creating a Boolean Mask
# Temperature data for a week
temps = np.array([22, 28, 31, 27, 33, 29, 24])

# Create a mask: which days were hot (above 28°C)?
hot_days_mask = temps > 28
print(f"Mask: {hot_days_mask}")
print(f"Type: {hot_days_mask.dtype}")

Mask: [False False  True False  True  True False]
Type: bool
The comparison temps > 28 doesn't return a single True or False—it returns an array of booleans, one for each element. Each position is True if that element satisfies the condition, False otherwise.
Using the Mask to Filter
# Extract only the hot days
hot_temps = temps[hot_days_mask]
print(f"Hot days (>28°C): {hot_temps}")

# You can combine the steps into one line:
really_hot = temps[temps > 30]
print(f"Really hot days (>30°C): {really_hot}")

# Count how many days were hot
num_hot_days = (temps > 28).sum()
print(f"Number of hot days: {num_hot_days}")

Hot days (>28°C): [31 33 29]
Really hot days (>30°C): [31 33]
Number of hot days: 3
When you use a boolean array as an index, NumPy returns only the elements where the mask is True. It's like a filter that lets through only what you want. The .sum() trick works because True counts as 1 and False as 0.
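Because True counts as 1 and False as 0, a mask supports a few other summaries besides .sum(). The sketch below shows some that may be useful, applied to the same `temps` data; the result names are illustrative.

```python
import numpy as np

temps = np.array([22, 28, 31, 27, 33, 29, 24])
hot = temps > 28

# Count the True values (same result as hot.sum())
count_hot = np.count_nonzero(hot)

# The mean of a mask is the fraction of elements that satisfy the condition
fraction_hot = hot.mean()

# any() and all() answer "at least one?" and "every one?"
any_hot = hot.any()
all_hot = hot.all()

print(count_hot, fraction_hot, any_hot, all_hot)
```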
Combining Conditions
You can combine multiple conditions using & (and), | (or), and ~ (not). Note: you must use these symbols, not Python's and/or keywords:
# Find moderate days (between 25 and 30)
moderate = temps[(temps >= 25) & (temps <= 30)]
print(f"Moderate days (25-30°C): {moderate}")

# Find extreme days (very hot or very cold)
extreme = temps[(temps > 30) | (temps < 23)]
print(f"Extreme days: {extreme}")

# Find days that were NOT hot
not_hot = temps[~(temps > 28)]
print(f"Not hot days: {not_hot}")

Moderate days (25-30°C): [28 27 29]
Extreme days: [22 31 33]
Not hot days: [22 28 27 24]
Notice the parentheses around each condition. They are required because & and | bind more tightly than comparison operators. Without them, temps >= 25 & temps <= 30 is parsed as temps >= (25 & temps) <= 30, which raises a ValueError instead of producing a mask.
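The following sketch shows what happens when the parentheses are left out or when the keyword `and` is used in place of &. The flag names `unparenthesized_failed` and `keyword_failed` are only there to record the outcome and are not part of the lesson.

```python
import numpy as np

temps = np.array([22, 28, 31, 27, 33, 29, 24])

# With parentheses: each comparison is evaluated first, then combined
moderate_mask = (temps >= 25) & (temps <= 30)
print(temps[moderate_mask])

# Without parentheses, & is applied before the comparisons, so Python sees
# temps >= (25 & temps) <= 30 and fails when it needs a single True/False
try:
    _ = temps >= 25 & temps <= 30
    unparenthesized_failed = False
except ValueError as err:
    unparenthesized_failed = True
    print(f"ValueError: {err}")

# The keyword `and` also needs a single True/False, so it fails the same way
try:
    _ = (temps >= 25) and (temps <= 30)
    keyword_failed = False
except ValueError:
    keyword_failed = True

print(unparenthesized_failed, keyword_failed)
```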
Sometimes you want to extract elements at specific positions, but those positions aren't in a neat range. Fancy indexing lets you pass a list or array of indices to grab elements in any order you want.
# Days of the week
days = np.array(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])

# Get weekend days (indices 5 and 6)
weekend = days[[5, 6]]
print(f"Weekend: {weekend}")

# Get Monday, Wednesday, Friday (indices 0, 2, 4)
mwf = days[[0, 2, 4]]
print(f"MWF: {mwf}")

# You can even repeat indices
repeated = days[[0, 0, 0, 6]]
print(f"Repeated: {repeated}")

Weekend: ['Sat' 'Sun']
MWF: ['Mon' 'Wed' 'Fri']
Repeated: ['Mon' 'Mon' 'Mon' 'Sun']
Fancy indexing creates a copy, not a view. This is different from regular slicing. If you modify the result, it won't affect the original array.
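A short sketch of the copy behavior follows. It also shows a related point worth knowing: although reading with a fancy index returns a copy, assigning through a fancy index does write into the original array. The `counts` array is a hypothetical per-day counter introduced only for this illustration.

```python
import numpy as np

days = np.array(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])

# Reading with fancy indexing produces an independent copy
weekend = days[[5, 6]]
weekend[0] = 'OFF'
print(days[5])  # the original entry is untouched

# Assigning through a fancy index modifies the original array in place
counts = np.zeros(7, dtype=int)  # hypothetical per-day counter
counts[[0, 2, 4]] = 1
print(counts)
```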
Fancy Indexing in 2D
# Test scores: 4 students, 3 tests
scores = np.array([
    [85, 90, 88],  # Student 0
    [92, 88, 95],  # Student 1
    [78, 85, 80],  # Student 2
    [88, 92, 90]   # Student 3
])

# Get test scores for students 0 and 2
selected_students = scores[[0, 2]]
print("Students 0 and 2:")
print(selected_students)

# Get specific elements: (student 1, test 2) and (student 3, test 0)
specific_scores = scores[[1, 3], [2, 0]]
print(f"\nSpecific scores: {specific_scores}")

Students 0 and 2:
[[85 90 88]
 [78 85 80]]

Specific scores: [95 88]
When you provide two lists of indices for a 2D array, NumPy pairs them up: scores[[1, 3], [2, 0]] means "give me element (1,2) and element (3,0)." It doesn't give you a rectangular block—it gives you specific individual elements.
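If you do want the rectangular block formed by crossing a set of rows with a set of columns, one option is `np.ix_`, which the lesson does not cover. The sketch below contrasts it with paired indexing; the names `paired`, `block`, and `block_two_step` are illustrative.

```python
import numpy as np

scores = np.array([
    [85, 90, 88],  # Student 0
    [92, 88, 95],  # Student 1
    [78, 85, 80],  # Student 2
    [88, 92, 90],  # Student 3
])

# Paired indexing: elements (1, 2) and (3, 0)
paired = scores[[1, 3], [2, 0]]

# np.ix_ crosses the two lists: rows 1 and 3 with columns 2 and 0
block = scores[np.ix_([1, 3], [2, 0])]

# Equivalent two-step form: pick the rows first, then the columns
block_two_step = scores[[1, 3]][:, [2, 0]]

print(paired)
print(block)
```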
Let's combine everything we've learned to analyze a realistic dataset:
# Student test scores across 5 tests
all_scores = np.array([
[85, 90, 88, 92, 87], # Student 0
[92, 88, 95, 89, 94], # Student 1
[78, 72, 80, 75, 77], # Student 2
[88, 92, 90, 93, 91], # Student 3
[65, 70, 68, 72, 69] # Student 4
])
print(f"Dataset shape: {all_scores.shape}")
print(f"Total scores: {all_scores.size}")
# Calculate each student's average
student_averages = all_scores.mean(axis=1)
print(f"\nStudent averages: {student_averages}")
# Find students with average >= 85
strong_students_mask = student_averages >= 85
strong_student_ids = np.where(strong_students_mask)[0]
print(f"Strong students (avg >= 85): {strong_student_ids}")
# Get all scores for strong students
strong_student_scores = all_scores[strong_students_mask]
print(f"\nStrong students' scores:")
print(strong_student_scores)
# Find the highest score on test 3 (index 2)
test3_scores = all_scores[:, 2]
highest_on_test3 = test3_scores.max()
who_scored_highest = test3_scores.argmax()
print(f"\nHighest on test 3: {highest_on_test3} (Student {who_scored_highest})")

Dataset shape: (5, 5)
Total scores: 25
Student averages: [88.4 91.6 76.4 90.8 68.8]
Strong students (avg >= 85): [0 1 3]

Strong students' scores:
[[85 90 88 92 87]
 [92 88 95 89 94]
 [88 92 90 93 91]]

Highest on test 3: 95 (Student 1)
This example demonstrates real data analysis: checking structure, aggregating across axes, filtering with boolean masks, and extracting specific slices. These are the fundamental operations you'll use constantly in data science work.
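The same tools also combine within a single index expression. As a further sketch on the same dataset, the code below applies a boolean mask to the rows and a slice to the columns in one step, and averages along the other axis to get per-test means. The names `last_two_tests` and `test_averages` are illustrative.

```python
import numpy as np

all_scores = np.array([
    [85, 90, 88, 92, 87],  # Student 0
    [92, 88, 95, 89, 94],  # Student 1
    [78, 72, 80, 75, 77],  # Student 2
    [88, 92, 90, 93, 91],  # Student 3
    [65, 70, 68, 72, 69],  # Student 4
])

strong_students_mask = all_scores.mean(axis=1) >= 85

# Boolean mask on rows, slice on columns: last two tests for strong students
last_two_tests = all_scores[strong_students_mask, -2:]
print(last_two_tests)

# axis=0 aggregates down the rows, giving one average per test
test_averages = all_scores.mean(axis=0)
print(test_averages)
```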
- Use .ndim, .shape, and .size to understand array structure before manipulating data.
- Slicing syntax array[rows, columns] extracts rectangular sections; slices are views (not copies) for memory efficiency.
- Boolean masking filters data by condition: array[array > threshold] returns only elements meeting the condition.
- Combine conditions with & (and), | (or), ~ (not)—must use these symbols, not Python's and/or keywords, and wrap each condition in parentheses.
- Fancy indexing array[[indices]] extracts specific positions in any order—creates copies, not views.
- Always check array.shape before slicing to avoid dimension errors, especially with multi-dimensional data.
- NumPy Indexing and Slicing Guide: https://numpy.org/doc/stable/user/basics.indexing.html
- Boolean Indexing in NumPy (Visual Guide): https://jalammar.github.io/visual-numpy/
- Advanced NumPy Indexing: https://scipy-lectures.org/intro/numpy/array_object.html#indexing-and-slicing