Topic 5.2: Building Your Data Containers

🛠️ From Lists to Arrays: Your First Conversion

You already know how to work with Python lists. Now it's time to learn how to convert those lists into NumPy arrays, unlocking all the performance benefits we discussed in the previous topic.

The simplest way to create a NumPy array is to convert an existing Python list using the np.array() function. This is perfect when you already have data in a list and want to take advantage of NumPy's speed.

Python

import numpy as np

# Creating a 1D array from a simple list
temperatures = [22, 24, 26, 25, 23, 21, 20]
temp_array = np.array(temperatures)
print(temp_array)
print(f"Type: {type(temp_array)}")

▶ Output

[22 24 26 25 23 21 20]
Type: <class 'numpy.ndarray'>

Notice that the output looks similar to a Python list, but without the commas. More importantly, the type is now numpy.ndarray (where 'ndarray' stands for N-dimensional array). This isn't just a cosmetic change—this object now has all of NumPy's powerful capabilities.
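To make "powerful capabilities" a little more concrete, here is a minimal sketch that converts the same readings in one expression. It assumes the temperatures are in Celsius (the example above doesn't say), and the name temps_fahrenheit is made up for this illustration.

```python
import numpy as np

temperatures = [22, 24, 26, 25, 23, 21, 20]
temp_array = np.array(temperatures)

# Arithmetic on an array applies to every element at once.
# (Assumes the readings are in Celsius; this is an illustration only.)
temps_fahrenheit = temp_array * 9 / 5 + 32
print(temps_fahrenheit)

# The same expression on the plain list raises a TypeError,
# because a list cannot be divided by a number.
try:
    temperatures * 9 / 5 + 32
except TypeError as err:
    print(f"List version fails: {err}")
```

With a plain list you would need a loop or a list comprehension to get the same result.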

Creating Multidimensional Arrays: Tables of Data

Most real-world data isn't just a simple list—it's tabular, like a spreadsheet. You have rows and columns. NumPy handles this by creating 2D arrays from nested lists.

Python

# Creating a 2D array from nested lists
student_grades = [
    [85, 90, 78],  # Student 1's scores
    [92, 88, 95],  # Student 2's scores
    [78, 85, 80]   # Student 3's scores
]

grades_array = np.array(student_grades)
print(grades_array)
print(f"\nShape: {grades_array.shape}")

▶ Output

[[85 90 78]
 [92 88 95]
 [78 85 80]]

Shape: (3, 3)

Here's what happened: each inner list became a row in the array. The shape (3, 3) tells us we have 3 rows and 3 columns. You can think of this as a table where each row represents a student and each column represents a different test.

ℹ️

Understanding Shape

The shape attribute returns a tuple. For 2D arrays, it's always (rows, columns). So (3, 3) means 3 rows and 3 columns. A shape of (100, 5) would mean 100 rows and 5 columns—perhaps 100 students with 5 test scores each.

This structure is fundamental to data science. Pandas DataFrames, which you'll work with later, follow the same rows-and-columns layout, so getting comfortable with rows and columns in NumPy arrays prepares you for most tabular data you'll meet.
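As a quick sketch of how the rows-and-columns idea plays out in code, the example below unpacks the shape and pulls out one row, one column, and one cell. The names n_students and n_tests are chosen for this illustration only.

```python
import numpy as np

grades_array = np.array([
    [85, 90, 78],  # Student 1's scores
    [92, 88, 95],  # Student 2's scores
    [78, 85, 80],  # Student 3's scores
])

# Unpack the shape tuple into rows and columns
n_students, n_tests = grades_array.shape
print(f"{n_students} students, {n_tests} tests")

# Index with [row, column]; both start at 0
print(grades_array[0])      # first row: all of Student 1's scores
print(grades_array[:, 0])   # first column: every student's score on test 1
print(grades_array[1, 2])   # Student 2's score on test 3
```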

🎨 Building Arrays from Scratch

Often you don't have existing data to convert. Instead, you need to create a new array and fill it later—perhaps with calculations, sensor readings, or data from a file. NumPy provides several convenient functions for this.

Arrays Filled with Zeros or Ones

The np.zeros() function creates an array filled entirely with zeros, while np.ones() creates an array filled with ones. These might seem trivial, but they're incredibly useful for initializing arrays before you populate them with actual values.

Think of it like setting up empty containers before you fill them. You know you need space for 100 temperature readings, so you create an array of 100 zeros. As data comes in, you replace those zeros with real values.

Python

# Create a 1D array of 10 zeros
zeros_array = np.zeros(10)
print("Zeros:")
print(zeros_array)

# Create a 2D array (3 rows, 4 columns) of ones
ones_array = np.ones((3, 4))
print("\nOnes (3x4):")
print(ones_array)

▶ Output

Zeros:
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

Ones (3x4):
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]

Notice the decimal points after the zeros and ones. np.zeros() and np.ones() produce floating-point numbers by default rather than integers. This is intentional: floating-point numbers are more flexible for most numerical work. You can change this with the dtype parameter, which is covered in the data types section later in this topic.

Also notice how we specified the shape for the 2D array: np.ones((3, 4)). The double parentheses might look odd, but it's because we're passing a tuple (3, 4) as a single argument. For 1D arrays, you can just write np.zeros(10) directly.
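The "empty containers" idea can be sketched like this. The incoming list stands in for readings that arrive over time; it and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical readings that arrive one at a time
incoming = [21.5, 22.0, 22.4, 21.9, 21.1]

# Reserve space up front, then overwrite each zero by position
readings = np.zeros(len(incoming))
for i, value in enumerate(incoming):
    readings[i] = value
print(readings)

# Passing dtype asks for integers instead of the default floats
counts = np.zeros(5, dtype=int)
print(counts, counts.dtype)
```

The exact integer type printed for counts depends on your platform and NumPy version.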

Arrays with Custom Fill Values

What if you want an array filled with a number other than zero or one? Use np.full():

Python

# Create an array of 8 elements, all filled with the number 5
fives = np.full(8, 5)
print(fives)

# Create a 2x3 array filled with 42
forty_twos = np.full((2, 3), 42)
print("\nForty-twos:")
print(forty_twos)

▶ Output

[5 5 5 5 5 5 5 5]

Forty-twos:
[[42 42 42]
 [42 42 42]]
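Unlike np.zeros() and np.ones(), the output above has no decimal points. That's because np.full() takes its data type from the fill value unless you say otherwise. A small sketch, with variable names chosen for illustration:

```python
import numpy as np

# An integer fill value gives an integer array
int_fill = np.full(4, 5)
print(int_fill, int_fill.dtype)

# A float fill value gives a float array
float_fill = np.full(4, 2.5)
print(float_fill, float_fill.dtype)

# An explicit dtype overrides the inference
forced_float = np.full(4, 5, dtype=np.float64)
print(forced_float, forced_float.dtype)
```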

📏 Creating Sequences of Numbers

Some of the most useful NumPy functions create arrays that are sequences of numbers. These are essential for creating ranges, generating test data, or preparing inputs for mathematical functions.

The arange Function: NumPy's Range

If you've used Python's built-in range() function, np.arange() will feel familiar. It works almost identically, but produces a NumPy array instead of a Python range object.

Python

# Create an array from 0 to 9
range_array = np.arange(10)
print("0 to 9:")
print(range_array)

# Create an array from 10 to 50 with a step of 5
step_array = np.arange(10, 51, 5)
print("\n10 to 50 (step 5):")
print(step_array)

# It works with floats too!
float_array = np.arange(0, 1, 0.1)
print("\n0 to 1 (step 0.1):")
print(float_array)

▶ Output

0 to 9:
[0 1 2 3 4 5 6 7 8 9]

10 to 50 (step 5):
[10 15 20 25 30 35 40 45 50]

0 to 1 (step 0.1):
[0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9]

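Just like Python's range(), arange() never includes the stop value. So arange(10) gives you 0 through 9, not 0 through 10, and the second example needs a stop of 51 to reach 50. With float steps, rounding can occasionally change how many elements you get, so if you need the endpoint or an exact count, there's a better function for that.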
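Float steps deserve a little caution. The sketch below compares arange() with range() for integers, then shows a float case where it's worth checking the length instead of assuming it. The integer-then-scale workaround is one possible approach, not the only one.

```python
import numpy as np

# With integer steps, arange matches range exactly
print(np.arange(2, 10, 2).tolist() == list(range(2, 10, 2)))

# With a float step, rounding error can change how many elements come back.
# 1.3 - 1 is not exactly 0.3 in binary floating point, so the element
# count is worth checking instead of assuming.
risky = np.arange(1, 1.3, 0.1)
print(len(risky), risky)

# One way to keep float steps predictable: count in integers, then scale
safe = np.arange(10, 13) / 10
print(len(safe), safe)
```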

The linspace Function: Precise Divisions

Sometimes you don't care about the step size—you care about how many numbers you want. For example, you want to divide the range from 0 to 100 into exactly 11 evenly spaced points. That's where np.linspace() shines.

Python

# Create exactly 5 evenly spaced numbers between 0 and 1
linear = np.linspace(0, 1, 5)
print("5 points from 0 to 1:")
print(linear)

# Create 11 evenly spaced numbers from 0 to 100
scores = np.linspace(0, 100, 11)
print("\n11 points from 0 to 100:")
print(scores)

▶ Output

5 points from 0 to 1:
[0.   0.25 0.5  0.75 1.  ]

11 points from 0 to 100:
[  0.  10.  20.  30.  40.  50.  60.  70.  80.  90. 100.]

Notice that linspace includes the endpoint by default. So linspace(0, 1, 5) actually gives you 0, 0.25, 0.5, 0.75, and 1—five numbers, with the last one being exactly 1.
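Two optional parameters of np.linspace() are worth knowing here: endpoint and retstep. The following is a short sketch of both; the variable names are illustrative.

```python
import numpy as np

# endpoint=False leaves the stop value out, as arange does
open_ended = np.linspace(0, 1, 5, endpoint=False)
print(open_ended)

# retstep=True also returns the spacing NumPy computed
scores, step = np.linspace(0, 100, 11, retstep=True)
print(f"Spacing between points: {step}")
```

Asking for the step back is a handy sanity check: 11 points from 0 to 100 leave 10 gaps, so the spacing is 10.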

arange vs linspace

np.arange(start, stop, step)

You specify the step size
Endpoint is NOT included
Good when you know the interval between values
Example: every 5 degrees from 0 to 100

np.linspace(start, stop, num)

You specify how many numbers you want
Endpoint IS included by default
Good when you need precise divisions
Example: 11 evenly spaced points from 0 to 100 (10 equal intervals)

For plotting graphs or dividing a range into equal parts, linspace is usually the better choice. For generating sequences with known steps, arange is more natural.
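Here is a minimal sketch of the "inputs for a mathematical function" use case: sampling the sine function at evenly spaced points. Using np.sin and five sample points is just an illustration; a real plot would typically use many more points.

```python
import numpy as np

# 5 sample points across half a period of the sine function
x = np.linspace(0, np.pi, 5)
y = np.sin(x)

for xi, yi in zip(x, y):
    print(f"sin({xi:.3f}) = {yi:.3f}")
```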

🔬 Understanding Data Types

Every NumPy array has a data type, accessed through the dtype attribute. This isn't just a label—it fundamentally affects how much memory your array uses and what operations you can perform.

When you create an array, NumPy makes an intelligent guess about what type of data you're storing. If you give it integers, it uses an integer type. If you give it decimals, it uses a float type.

Python

# NumPy automatically chooses appropriate types
int_array = np.array([1, 2, 3, 4])
print(f"Integer array dtype: {int_array.dtype}")

float_array = np.array([1.5, 2.7, 3.9])
print(f"Float array dtype: {float_array.dtype}")

mixed_array = np.array([1, 2.5, 3])
print(f"Mixed numbers dtype: {mixed_array.dtype}")

▶ Output

Integer array dtype: int64
Float array dtype: float64
Mixed numbers dtype: float64

Notice what happened with the mixed array. You gave it both integers (1, 3) and a float (2.5). A NumPy array stores a single data type for all of its elements, so NumPy promoted everything to float64. This is called upcasting: choosing a type that can represent all of the values, so that 2.5 isn't truncated to 2.
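To see upcasting from both directions, the sketch below asks NumPy which common type it would pick and then converts back to integers with astype(). The name as_ints is chosen for this example.

```python
import numpy as np

mixed_array = np.array([1, 2.5, 3])

# Every element is stored as a float, including the original integers
print(mixed_array)

# np.result_type reports the common type NumPy would choose
print(np.result_type(np.int64, np.float64))

# Converting back to integers is possible, but the fractional part
# is dropped (2.5 becomes 2), so information is lost
as_ints = mixed_array.astype(np.int64)
print(as_ints)
```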

The Meaning of int64 and float64

The numbers in type names like int64 and float32 tell you how many bits are used to store each number. More bits means more precision or a larger range, but also more memory usage.

int8: Integers from -128 to 127 (1 byte)
int32: Integers from about -2 billion to 2 billion (4 bytes)
int64: Huge integers, up to ±9 quintillion (8 bytes)
float32: Decimals with ~7 significant digits (4 bytes)
float64: Decimals with ~15 significant digits (8 bytes)

On most systems, NumPy defaults to int64 for integers and float64 for decimals (NumPy releases before 2.0 default to int32 for integers on Windows). These are safe choices: they have enough precision and range for almost any calculation you'll do.
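You don't have to memorize these ranges. As a sketch, np.iinfo() and np.finfo() report the limits of a type directly:

```python
import numpy as np

# np.iinfo reports the smallest and largest value an integer type can hold
for int_type in (np.int8, np.int32, np.int64):
    info = np.iinfo(int_type)
    dt = np.dtype(int_type)
    print(f"{dt.name}: {info.min} to {info.max} ({dt.itemsize} bytes)")

# np.finfo does the same for floats; eps is the gap between 1.0
# and the next representable number, a rough measure of precision
for float_type in (np.float32, np.float64):
    info = np.finfo(float_type)
    dt = np.dtype(float_type)
    print(f"{dt.name}: eps = {info.eps} ({dt.itemsize} bytes)")
```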

Controlling the Data Type

Sometimes you want to explicitly choose the data type. Maybe you know your numbers will always be small, so you can save memory by using a smaller type. You do this with the dtype parameter:

Python

# Force integers to be stored in just 1 byte each
small_ints = np.array([1, 2, 3], dtype=np.int8)
print(f"Small ints dtype: {small_ints.dtype}")
print(f"Memory per element: {small_ints.itemsize} byte")

# Force floats to use only 4 bytes instead of 8
small_floats = np.array([1.5, 2.5, 3.5], dtype=np.float32)
print(f"\nSmall floats dtype: {small_floats.dtype}")
print(f"Memory per element: {small_floats.itemsize} bytes")

▶ Output

Small ints dtype: int8
Memory per element: 1 byte

Small floats dtype: float32
Memory per element: 4 bytes

The itemsize attribute shows you how many bytes each element uses. By choosing float32 instead of float64, you cut memory usage in half. For a dataset with 10 million numbers, that's the difference between 80 MB and 40 MB of RAM.
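The arithmetic can be checked with the nbytes attribute, which gives the total size of an array's data. This sketch uses one million elements to keep the demo light and scales up by 10 at the end.

```python
import numpy as np

n = 1_000_000  # one million elements, to keep the demo light

as_float64 = np.zeros(n, dtype=np.float64)
as_float32 = np.zeros(n, dtype=np.float32)

# nbytes is the total size of the data: size * itemsize
print(f"float64: {as_float64.nbytes / 1e6:.0f} MB")
print(f"float32: {as_float32.nbytes / 1e6:.0f} MB")

# Scaling by 10 gives the figures for 10 million numbers
print(f"10 million float64 values: {10 * as_float64.nbytes / 1e6:.0f} MB")
```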

ℹ️

When to Optimize Data Types

For everyday work, stick with the defaults (int64 and float64). Only optimize data types when you're working with truly massive datasets that don't fit in memory, or when you're certain the smaller types won't cause overflow or precision issues.
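Here is a small sketch of what "overflow or precision issues" can look like in practice. The values are picked to cross the int8 limit on purpose.

```python
import numpy as np

# int8 holds -128 to 127. Results outside that range wrap around
# silently when computed on an array.
small = np.array([100, 120, 127], dtype=np.int8)
print(small + 10)

# float32 keeps fewer significant digits than float64
print(float(np.float32(0.1)))  # slightly off from 0.1
print(float(np.float64(0.1)))
```

The second and third results come out negative because 130 and 137 don't fit in one signed byte.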

📐 Dimensions: From Lines to Tables to Cubes

Arrays can have different numbers of dimensions, and understanding this structure is fundamental to working with NumPy effectively. The number of dimensions determines how your data is organized and what you can do with it.

One Dimension: Vectors

A 1D array is a simple sequence of numbers—like a list. In mathematical terms, this is called a vector. It has length, but no width or depth.

Python

vector = np.array([10, 20, 30, 40])
print(f"Vector: {vector}")
print(f"Shape: {vector.shape}")
print(f"Dimensions: {vector.ndim}")

▶ Output

Vector: [10 20 30 40]
Shape: (4,)
Dimensions: 1

The shape (4,) might look odd with that trailing comma, but it's Python's way of representing a tuple with one element. It means: one dimension with 4 elements.
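If the trailing comma still looks strange, this short sketch shows the difference it makes in plain Python and how the shape relates to len():

```python
import numpy as np

vector = np.array([10, 20, 30, 40])

# (4,) is a one-element tuple; (4) is just the integer 4 in parentheses
print(type((4,)), type((4)))

# The single entry of the shape is the length of the vector
print(vector.shape[0], len(vector))
```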

Two Dimensions: Matrices

A 2D array has rows and columns, like a table or spreadsheet. In mathematical terms, this is called a matrix. It has both length and width.

Python

matrix = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
print("Matrix:")
print(matrix)
print(f"Shape: {matrix.shape}")
print(f"Dimensions: {matrix.ndim}")
print(f"Total elements: {matrix.size}")

▶ Output

Matrix:
[[1 2 3]
 [4 5 6]]
Shape: (2, 3)
Dimensions: 2
Total elements: 6

Shape (2, 3) means 2 rows and 3 columns. The size attribute tells you the total number of elements: 2 × 3 = 6.
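The relationship between shape and size can be verified directly. This is a minimal sketch; math.prod comes from Python's standard library (3.8 and later).

```python
import math
import numpy as np

matrix = np.array([
    [1, 2, 3],
    [4, 5, 6],
])

rows, cols = matrix.shape
print(rows * cols == matrix.size)

# math.prod multiplies the entries of the shape tuple, so the same
# check works for any number of dimensions
print(math.prod(matrix.shape) == matrix.size)

# len() counts only the first dimension (rows), not every element
print(len(matrix))
```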

Three Dimensions and Beyond: Tensors

Arrays with three or more dimensions are often called tensors, especially in machine learning. A 3D array can be visualized as a stack of matrices, or a cube of numbers.

Python

tensor = np.array([
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]]
])
print("Tensor:")
print(tensor)
print(f"Shape: {tensor.shape}")
print(f"Dimensions: {tensor.ndim}")

▶ Output

Tensor:
[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]
Shape: (2, 2, 2)
Dimensions: 3

Shape (2, 2, 2) means 2 layers, each with 2 rows and 2 columns. Total elements: 2 × 2 × 2 = 8.

While 3D arrays might seem abstract now, they're essential when working with images. A color image is 3D: height, width, and color channels (red, green, blue). A batch of images used in machine learning is 4D: batch size, height, width, and channels.
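To make the image example concrete, here is a sketch that builds a tiny blank image and a batch of them. The 4x6 size, the batch of 10, the uint8 type, and the red-green-blue channel order are all assumptions made for this illustration.

```python
import numpy as np

# A hypothetical 4x6-pixel color image: height, width, 3 color channels.
# uint8 (0 to 255) is a common choice for pixel values.
image = np.zeros((4, 6, 3), dtype=np.uint8)
image[:, :, 0] = 255  # set the first (red) channel of every pixel to full intensity
print(image.shape, image.ndim)

# A hypothetical batch of 10 such images adds a leading dimension
batch = np.zeros((10, 4, 6, 3), dtype=np.uint8)
print(batch.shape, batch.ndim)
```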

ℹ️

Checking Array Properties

Three essential attributes describe an array's structure: .ndim (number of dimensions), .shape (size of each dimension), and .size (total number of elements). Get in the habit of checking these, along with .dtype, whenever you work with a new array. It will save you from many confusing errors.
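One way to build that habit is a tiny helper that prints all the properties at once. The describe function below is a hypothetical convenience written for this lesson, not part of NumPy.

```python
import numpy as np

def describe(arr):
    """Return a one-line summary of an array's structure (hypothetical helper)."""
    return f"ndim={arr.ndim}, shape={arr.shape}, size={arr.size}, dtype={arr.dtype}"

print(describe(np.arange(10)))
print(describe(np.ones((3, 4))))
print(describe(np.zeros((2, 2, 2), dtype=np.float32)))
```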

Convert existing Python lists to NumPy arrays using np.array()—the simplest way to start working with NumPy.
Create arrays from scratch using np.zeros(), np.ones(), or np.full() when you need to initialize containers before filling them with data.
Use np.arange() when you know the step size between values; use np.linspace() when you need a specific number of evenly-spaced points.
The dtype attribute determines data type and memory usage—int64 and float64 are safe defaults, but you can optimize for large datasets.
Arrays can be 1D (vectors), 2D (matrices/tables), or higher-dimensional (tensors)—the shape attribute tells you the structure.
Always check .ndim, .shape, and .size when working with new arrays to understand their structure and avoid dimension mismatches.

📚 External Resources

NumPy Array Creation Guide
https://numpy.org/doc/stable/user/basics.creation.html
Understanding NumPy Data Types (Interactive)
https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html
NumPy Quickstart Tutorial
https://numpy.org/doc/stable/user/quickstart.html