Topic 6.5: The DataFrame - The Relational Engine
Mastering Pandas' core data structure for organizing and analyzing tabular data.
📂 Dataset for this topic — 6_5_inventory.csv: ⬇ Download 6_5_inventory.csv
We learned about Series, the intelligent column with labels. But real data rarely comes as a single column. Usually it arrives as a complete table: students with multiple subjects, products with multiple attributes, inventory with multiple characteristics. A DataFrame is how Pandas handles this reality.
A DataFrame is a two-dimensional table with rows and columns, where each column is a Series. Think of it as a spreadsheet in Python, but with programming power built in. Excel has rows and columns, but a DataFrame adds analysis tools to filter, transform, and aggregate data with code instead of mouse clicks.
Imagine you have three separate Series: one contains product names (Koshari, Falafel, Mahshi), another contains prices (25, 15, 35), and a third contains stock quantities (120, 200, 80). Each Series alone is useful, but they are disconnected. You cannot see which price belongs to which product, or how much stock each item has. A DataFrame solves this by organizing all three Series into columns of a single table, where each row represents one complete product record.
DataFrames keep related data together. When you have product names, prices, stock levels, and categories in one DataFrame, you can ask complex questions: which products are low in stock? What is the average price by category? Which items should we reorder? The DataFrame structure makes these questions answerable with just a few lines of code.
Pandas provides three primary methods for creating DataFrames, each suited to different data sources and situations. Understanding when to use each method makes data loading efficient and intuitive.
The first method uses a dictionary of lists. The keys become the column names, and the values are lists containing the data for each column. This method is straightforward and readable - you define the structure by naming each column, then provide all the values for that column as a list.
import pandas as pd import numpy as np # Method 1: Dictionary of lists df1 = pd.DataFrame({ 'Product': ['Koshari', 'Falafel', 'Mahshi', 'Molokhia'], 'Price': [25, 15, 35, 30], 'Stock': [120, 200, 80, 95], 'Category': ['Main Dish', 'Appetizer', 'Main Dish', 'Main Dish'] }) print(df1) # Output: # Product Price Stock Category # 0 Koshari 25 120 Main Dish # 1 Falafel 15 200 Appetizer # 2 Mahshi 35 80 Main Dish # 3 Molokhia 30 95 Main Dish
Pandas automatically aligns the data so that the first item in each list forms the first row, the second item in each list forms the second row, and so on. All lists must have the same length, otherwise Pandas cannot create a proper rectangular table.
The second method uses a list of dictionaries. Each dictionary represents one complete row, where the keys are the column names and the values are the data points in that row.
# Method 2: List of dictionaries inventory = [ {'Product': 'Koshari', 'Price': 25, 'Stock': 120, 'Category': 'Main Dish'}, {'Product': 'Falafel', 'Price': 15, 'Stock': 200, 'Category': 'Appetizer'}, {'Product': 'Mahshi', 'Price': 35, 'Stock': 80, 'Category': 'Main Dish'}, {'Product': 'Molokhia', 'Price': 30, 'Stock': 95, 'Category': 'Main Dish'} ] df2 = pd.DataFrame(inventory) print(df2) # Output: # Product Price Stock Category # 0 Koshari 25 120 Main Dish # 1 Falafel 15 200 Appetizer # 2 Mahshi 35 80 Main Dish # 3 Molokhia 30 95 Main Dish
Pandas collects all the keys from all dictionaries to determine the column names, then fills in the values row by row. If some dictionaries have keys that others lack, Pandas fills those positions with NaN (Not a Number) to maintain the rectangular table structure.
The third method uses a NumPy array combined with explicit labels. This combines the computational speed of NumPy with the organizational clarity of Pandas. The array contains just raw numbers, without any labels or structure. By providing the index and columns parameters, you transform those numbers into a labeled DataFrame where each value has both a row name and a column name.
# Method 3: NumPy array with labels data = np.array([[25, 120], [15, 200], [35, 80], [30, 95]]) df3 = pd.DataFrame( data, index=['Koshari', 'Falafel', 'Mahshi', 'Molokhia'], columns=['Price', 'Stock'] ) print(df3) # Output: # Price Stock # Koshari 25 120 # Falafel 15 200 # Mahshi 35 80 # Molokhia 30 95
Use this method when you are working with numerical computations in NumPy and want to add meaningful labels for analysis. The array provides computational efficiency, while the DataFrame adds interpretability. This is common in machine learning workflows where you compute features with NumPy, then organize them in a DataFrame for exploration.
A DataFrame has several components that work together to organize data. Understanding these components helps you navigate and manipulate datasets effectively.
Every DataFrame has an index, columns, and values. The index identifies rows - by default, these are sequential integers starting from zero, but you can set them to meaningful labels like product names or dates. The columns are the names of each column, typically strings that describe what data that column contains. The values are the actual data stored in the table.
df = pd.DataFrame({ 'Product': ['Koshari', 'Falafel', 'Mahshi', 'Molokhia'], 'Price': [25, 15, 35, 30], 'Stock': [120, 200, 80, 95], 'Category': ['Main Dish', 'Appetizer', 'Main Dish', 'Main Dish'] }) # Access the index print(df.index) # Output: RangeIndex(start=0, stop=4, step=1) # Access the columns print(df.columns) # Output: Index(['Product', 'Price', 'Stock', 'Category'], dtype='object') # Access the values as a NumPy array print(df.values) # Output: # [['Koshari' 25 120 'Main Dish'] # ['Falafel' 15 200 'Appetizer'] # ['Mahshi' 35 80 'Main Dish'] # ['Molokhia' 30 95 'Main Dish']]
Each column in a DataFrame is a Series. When you extract a single column, you get a Series object with all the properties and methods we learned in the previous topic. This consistency makes working with DataFrames predictable - operations you learned on Series work the same way on DataFrame columns.
# Extract a column as a Series prices = df['Price'] print(type(prices)) # Output: <class 'pandas.core.series.Series'> print(prices) # Output: # 0 25 # 1 15 # 2 35 # 3 30 # Name: Price, dtype: int64
The dtypes attribute shows the data type of each column. This is critical information when working with data. If you expect a Price column to be numeric but dtypes shows it as object, that signals a problem - perhaps there are text values mixed in with numbers, which would prevent mathematical operations.
# Check data types for all columns print(df.dtypes) # Output: # Product object # Price int64 # Stock int64 # Category object # dtype: object
Here, Product and Category are object type, which means they contain strings. Price and Stock are int64, which means they are integers. This confirms the data is structured correctly for analysis - text columns contain text, numeric columns contain numbers.
When you open a large DataFrame with thousands of rows, printing the entire thing overwhelms your screen and provides little insight. Pandas provides methods for quick inspection that show you just enough to understand the data without flooding your terminal.
The head method returns the first five rows by default. The tail method returns the last five rows. You can pass a different number as an argument if you want more or fewer rows. For example, head(3) shows the first three rows, and tail(10) shows the last ten rows.
# Load a sample inventory dataset df = pd.read_csv('6_5_inventory.csv') # View first 3 rows print(df.head(3)) # Output: # Product Price Stock Category # 0 Koshari 25 120 Main Dish # 1 Falafel 15 200 Appetizer # 2 Mahshi 35 80 Main Dish # View last 2 rows print(df.tail(2)) # Output: # Product Price Stock Category # 8 Ful Medames 10 180 Appetizer # 9 Basbousa 20 70 Dessert
Head shows what the data looks like at the beginning, revealing the structure and column names. This is typically the first command you run when loading a new dataset. Tail shows what the data looks like at the end, which can reveal incomplete or unusual entries at the bottom of the file. Sometimes data collection processes fail partway through, leaving the last rows incomplete or corrupted.
Beyond quick views of the data itself, Pandas provides methods that summarize the DataFrame's structure and statistics. These methods answer questions like: how many rows are there? What data types are present? Are there missing values? What is the range of numerical values?
The info method gives you a comprehensive summary of the DataFrame: the number of rows, the number of columns, the data types of each column, and how many non-null values each column contains. This is one of the most important inspection methods because it reveals data quality issues at a glance.
# Get DataFrame structure and metadata print(df.info()) # Output: # <class 'pandas.core.frame.DataFrame'> # RangeIndex: 10 entries, 0 to 9 # Data columns (total 4 columns): # # Column Non-Null Count Dtype # --- ------ -------------- ----- # 0 Product 10 non-null object # 1 Price 10 non-null int64 # 2 Stock 9 non-null int64 # 3 Category 10 non-null object # dtypes: int64(2), object(2) # memory usage: 448.0+ bytes
Run info first when you open a new dataset. The output tells you that this DataFrame has 10 rows and 4 columns. It shows the data types for each column, confirming whether columns are numeric or text. Most importantly, it shows the non-null count. If a column has fewer non-null entries than total rows, that means there are missing values. In this example, the Stock column has only 9 non-null values out of 10 rows, indicating one missing value that needs investigation.
The describe method provides statistical summaries for all numerical columns: count, mean, standard deviation, minimum, quartiles, and maximum. This shows you the range of values, the average, and whether there are any extreme outliers.
# Get statistical summary print(df.describe()) # Output: # Price Stock # count 10.000000 9.000000 # mean 23.500000 117.777778 # std 9.438834 58.871603 # min 10.000000 50.000000 # 25% 17.500000 75.000000 # 50% 22.500000 100.000000 # 75% 30.000000 150.000000 # max 40.000000 200.000000
The describe output tells you that prices range from 10 to 40, with an average of 23.5. Stock levels range from 50 to 200, with an average of about 118. The standard deviation tells you how spread out the values are - a large standard deviation relative to the mean indicates high variability. The quartiles (25%, 50%, 75%) show the distribution of values, helping you identify whether data is concentrated or spread evenly.
The shape attribute returns a tuple containing the number of rows and columns. This is useful for quick size checks, especially when working with large datasets or when you need to verify that data operations produced the expected number of rows.
# Get dimensions print(df.shape) # Output: (10, 4) # This means 10 rows and 4 columns # Access rows and columns separately num_rows = df.shape[0] num_cols = df.shape[1] print(f"The dataset has {num_rows} rows and {num_cols} columns.") # Output: The dataset has 10 rows and 4 columns.
The columns attribute returns the column names as an Index object, which you can convert to a list. This is useful when you need to iterate over columns or check whether a specific column exists.
# Get column names print(df.columns) # Output: Index(['Product', 'Price', 'Stock', 'Category'], dtype='object') # Convert to list column_list = list(df.columns) print(column_list) # Output: ['Product', 'Price', 'Stock', 'Category'] # Check if a column exists if 'Price' in df.columns: print("Price column exists") # Output: Price column exists
The index attribute returns the row labels. For a default DataFrame, this is a RangeIndex starting at zero. For DataFrames with custom row labels, this shows those labels.
# Get row index print(df.index) # Output: RangeIndex(start=0, stop=10, step=1) # Get first 5 index values print(list(df.index[:5])) # Output: [0, 1, 2, 3, 4]
Inspecting your DataFrame before analysis is not optional. Skipping this step leads to errors, incorrect conclusions, and wasted time. Professional data scientists follow a consistent inspection workflow when loading new data.
First, run info to understand the structure. How many rows and columns are there? What are the data types? Are there missing values? This gives you the big picture.
Second, run head to see the actual data. Does it look reasonable? Are the column names clear? Are there any obvious errors in the first few rows?
Third, run describe to check the statistics. Are the ranges sensible? If you have a Price column and the maximum is 1 million, that might be an error. If the minimum is negative, that is definitely suspicious. Do the mean and median make sense for this domain?
Fourth, check the shape. Does the number of rows match your expectations? If you are loading a file that should have 1000 records but shape shows 950, something went wrong during the import.
This inspection workflow takes less than a minute but prevents hours of debugging later. It catches data quality issues early when they are easy to fix, rather than discovering them halfway through an analysis when they invalidate all your work.
- A DataFrame is a two-dimensional table where each column is a Series, combining multiple related data columns into one structure.
- Create DataFrames using dictionary of lists (column-oriented), list of dictionaries (row-oriented), or NumPy arrays with labels.
- Every DataFrame has an index (row labels), columns (column names), and values (the actual data).
- Use head() to view the first few rows and tail() to view the last few rows for quick data inspection.
- Use info() to see DataFrame structure, data types, and missing values - run this first on new datasets.
- Use describe() to get statistical summaries of all numerical columns, revealing ranges and distributions.
- The shape attribute gives you (rows, columns), columns gives column names, and dtypes shows data types.
- Proper inspection before analysis catches data quality issues early and prevents errors later.
- ↗ Pandas DataFrame Documentation - Official Reference
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html - ↗ Pandas User Guide - DataFrame Basics
https://pandas.pydata.org/docs/user_guide/dsintro.html#dataframe - ↗ Real Python - Pandas DataFrame: Working with Tabular Data
https://realpython.com/pandas-dataframe/