Topic 8.4: Initial Inspection
A full read-only audit of the W8 competition dataset using .shape, .dtypes, .info(), .isna(), .duplicated(), and .describe()
Before a structural engineer begins renovating an old building, they conduct a thorough survey. They walk through every room, check load-bearing walls, measure floor dimensions, test the electrics, and inspect the plumbing, all without moving a single brick. This survey produces a complete picture of what is there, what is missing, and what needs to be fixed. Only after the survey is finished do they begin any work. A renovation without a survey means working blind: walls are moved only to discover they were load-bearing, and problems in unexamined areas go unnoticed until they cause failures.
Data inspection follows the same logic. Before any cell in a dataset is modified, a complete read-only audit must be performed. The audit answers six questions: How big is the dataset? What columns exist and what type is each? Where are the missing values, and how many? Are there duplicate rows? What do the numeric distributions look like? Are there any immediately visible impossible values?
All inspection methods in this topic are read-only. They return information about the DataFrame without modifying it. This is intentional: the inspection phase is purely diagnostic. No cleaning decisions should be made, and no values should be changed, until the full diagnostic picture is available.
The .shape attribute returns a tuple (rows, columns). This is always the first inspection step: it establishes the baseline size of the dataset before any cleaning begins. After cleaning, comparing the new shape to the original shape confirms how many rows were removed.
    import pandas as pd

    df = pd.read_csv('8_4_competition_results.csv')

    # Step 1: Check dataset dimensions
    print("Dataset shape (rows, columns):")
    print(df.shape)
    print(f"\nTotal cells: {df.shape[0] * df.shape[1]:,}")
    print(f"Rows: {df.shape[0]}")
    print(f"Columns: {df.shape[1]}")
The dataset has 207 rows and 7 columns. This is the baseline: after removing duplicates (7 rows), we expect 200 rows. Any additional row changes during cleaning must be tracked against this baseline to ensure no unintended data is lost.
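The baseline comparison can be sketched on a small made-up DataFrame. The column names and values below are illustrative assumptions, not rows from the W8 file; the point is only that the shape recorded before cleaning is what later row counts are checked against.

```python
import pandas as pd

# Hypothetical toy data (not the W8 file): the last row repeats the first.
toy = pd.DataFrame({
    "name": ["Ali", "Sara", "Omar", "Ali"],
    "score": [88.0, 92.5, 75.0, 88.0],
})

baseline_shape = toy.shape              # record the size before any cleaning
cleaned = toy.drop_duplicates()         # returns a new DataFrame; toy is unchanged
rows_removed = baseline_shape[0] - cleaned.shape[0]

print(f"Baseline: {baseline_shape}, after cleaning: {cleaned.shape}")
print(f"Rows removed: {rows_removed}")
```

If the number of rows removed does not match the duplicate count found during inspection, something other than duplicates was dropped and should be investigated.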
The .dtypes attribute shows the data type that Pandas is using to store each column. This is the fastest way to identify type mismatches: a date column stored as object, a numeric column stored as object because it contains some text, or integers incorrectly stored as floats (which signals hidden NaN values).
    # Step 2: Inspect column data types
    print("Column data types:")
    print(df.dtypes)
Two immediate observations from this output. First: age is stored as float64 even though ages are whole numbers. This signals that the column contains at least one NaN value (Pandas converts integer columns to float when NaN is introduced). score is also float64, which is expected here because scores are decimals (e.g. 97.5). Second: date is stored as object instead of datetime64. This signals that the date values are still text, likely because multiple date formats prevented automatic parsing.
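The integer-to-float conversion is easy to reproduce. The sketch below uses made-up ages rather than values from the W8 file, and shows how a single NaN changes the dtype of an otherwise whole-number column.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, not taken from the W8 file.
complete_ages = pd.Series([18, 21, 35])
ages_with_gap = pd.Series([18, np.nan, 35])

print(complete_ages.dtype)   # an integer dtype: every value is present
print(ages_with_gap.dtype)   # float64: NaN forces the column to float
```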
The .info() method combines type information with non-null counts in one printout. The non-null count per column implicitly reveals missing values: if a DataFrame has 207 rows and a column shows 197 non-null, there are 10 missing values in that column.
    # Step 3: Full structural overview with .info()
    print("DataFrame structural overview:")
    df.info()
age shows 192 non-null out of 207 rows, confirming 15 missing values. city shows 199 non-null, confirming 8 missing values. score shows 197 non-null, confirming 10 missing values. All other columns are complete. The memory usage (11.4 KB) is also visible, which is useful when planning for large datasets.
To find missing values from .info() output: subtract Non-Null Count from total rows. If the RangeIndex says 207 entries and a column shows 197 non-null, then 207 - 197 = 10 values are missing. No extra calculation is needed: the difference is the gap.
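The same subtraction can be done in code. This is a minimal sketch on a made-up five-row DataFrame (the values are assumptions for illustration): .count() returns the per-column non-null counts that .info() prints, so subtracting it from the row count gives the missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the W8 file.
toy = pd.DataFrame({
    "age": [18.0, np.nan, 22.0, np.nan, 30.0],
    "city": ["Cairo", "Giza", None, "Alex", "Cairo"],
    "score": [97.5, 80.0, 77.4, 65.0, 70.0],
})

non_null = toy.count()            # per-column non-null counts, as shown by .info()
missing = len(toy) - non_null     # total rows minus non-null count
print(missing)
```

The result should agree with toy.isna().sum(), which is the more direct route used in the next step.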
The .isna().sum() chain produces the most precise read of missing values: exact counts per column. A complete audit also reports missingness as a percentage: a count of 12 means very different things in a 100-row dataset versus a 100,000-row dataset.
    # Step 4: Precise missing value audit
    print("Missing value counts per column:")
    missing = df.isna().sum()
    print(missing)

    print("\nMissing value percentage per column:")
    missing_pct = (missing / len(df) * 100).round(2)
    print(missing_pct)

    print(f"\nTotal missing cells: {missing.sum()} out of {df.size} ({missing.sum()/df.size*100:.2f}%)")
All three columns with missing data are well under the 20% threshold. age is missing in 7.25% of rows, city in 3.86%, and score in 4.83%: all moderate to low. The decision on how to handle each is made in Topic 8.6.
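A threshold check like this can also be automated. The sketch below is an assumption-laden illustration: the toy data and the name THRESHOLD_PCT are hypothetical, and only the 20% figure comes from the text. It flags any column whose missingness exceeds the threshold.

```python
import numpy as np
import pandas as pd

THRESHOLD_PCT = 20.0  # the 20% threshold mentioned in the text

# Hypothetical toy data: "age" is 10% missing, "notes" is 80% missing.
toy = pd.DataFrame({
    "age": [18.0, np.nan, 22.0, 25.0, 30.0, 19.0, 21.0, 24.0, 28.0, 33.0],
    "notes": [None, None, None, "ok", None, None, "late", None, None, None],
})

missing_pct = toy.isna().mean() * 100   # mean of True/False = fraction missing
over_threshold = missing_pct[missing_pct > THRESHOLD_PCT]

print(missing_pct.round(2))
print("Columns above threshold:", list(over_threshold.index))
```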
The .duplicated() method returns a Boolean Series: True for any row that is an exact copy of an earlier row, False otherwise. Chaining .sum() counts the total number of duplicate rows. In the inspection phase, the goal is to measure the scale of duplication and examine a sample of duplicated rows before any removal begins.
    # Step 5: Duplicate row audit
    print(f"Total duplicate rows: {df.duplicated().sum()}")

    # Show the actual duplicate rows (the later copies)
    print("\nDuplicated records:")
    print(df[df.duplicated()])
There are 7 duplicate rows. df.duplicated() marks each row that is an exact copy of an earlier row as True, so filtering with it shows the 7 later copies, each of which has an identical twin earlier in the dataset. A row is only flagged when every column matches exactly. Before removing anything, inspecting these rows confirms they are true duplicates and not coincidentally similar records. Removal happens in Topic 8.5.
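To see each later copy next to its earlier twin, .duplicated() accepts keep=False, which marks every member of a duplicate group. The sketch below uses made-up rows and a hypothetical student_id column, not the W8 file.

```python
import pandas as pd

# Hypothetical toy data: row 3 is an exact copy of row 0.
# Row 4 shares a score with row 1 but has a different id, so it is not a duplicate.
toy = pd.DataFrame({
    "student_id": [101, 102, 103, 101, 104],
    "score": [88.0, 92.5, 75.0, 88.0, 92.5],
})

later_copies = toy[toy.duplicated()]            # default keep='first': later copies only
all_copies = toy[toy.duplicated(keep=False)]    # originals and copies together

# Sorting places each copy next to its twin for side-by-side review.
print(all_copies.sort_values(list(toy.columns)))
```

This is still read-only: both filters return new DataFrames and leave toy untouched.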
The .describe() method returns summary statistics for every numeric column: count, mean, standard deviation, minimum, Q1, median (Q2), Q3, and maximum. This is the primary tool for detecting outliers and impossible values; the min and max values deserve immediate attention.
    # Step 6: Statistical distribution audit
    print("Numeric column statistics:")
    print(df.describe().round(2))

    print("\nNon-numeric column overview:")
    print(df.describe(include='object'))
Key findings from this output. For score: the mean (76.96) is close to the median (77.40), indicating no severe outlier distortion. The count is 197 (not 207), confirming the 10 missing values. For age: range 18 to 35, fully valid for a student competition, with a count of 192 confirming 15 missing values. For the text columns: city has more unique values than the real number of cities, confirming the formatting inconsistencies (CAIRO / cairo / Giza / giza / Alex) detected earlier. grade and level each have 3 unique values, and date appears in several different formats, confirming format variation.
After running all six inspection steps, the analyst has a complete diagnostic picture. Before any cleaning begins, this picture should be documented as an inspection report: a list of all identified problems with their scale, severity, and the cleaning action required.
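The mean-versus-median comparison can be illustrated with two made-up score lists, one clean and one containing a simulated data-entry error. The values are assumptions chosen for illustration, not W8 data.

```python
import pandas as pd

# Hypothetical scores; the second series contains a simulated typo (850 instead of 85).
scores = pd.Series([70.0, 75.0, 77.0, 80.0, 85.0])
scores_with_outlier = pd.Series([70.0, 75.0, 77.0, 80.0, 850.0])

gaps = {}
for label, s in [("clean", scores), ("with outlier", scores_with_outlier)]:
    stats = s.describe()
    gaps[label] = stats["mean"] - stats["50%"]   # "50%" is the median row in .describe()
    print(f"{label}: mean={stats['mean']:.2f}, median={stats['50%']:.2f}, "
          f"max={stats['max']:.2f}, gap={gaps[label]:.2f}")
```

The median is the same in both series, while the mean and max shift sharply in the second. That pattern is what a large mean-median gap in .describe() output points to.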
| Issue Found | Column(s) | Scale | Planned Action | Topic |
|---|---|---|---|---|
| Duplicate rows | All columns | 7 rows (3.4%) | drop_duplicates(keep='first') | 8.5 |
| Missing values | age | 15 rows (7.25%) | fillna(median) | 8.6 |
| Missing values | city | 8 rows (3.86%) | fillna(mode) | 8.6 |
| Missing values | score | 10 rows (4.83%) | fillna(median) | 8.6 |
| Formatting inconsistencies | city | multiple cases for the same cities | str.strip().str.title() + map variants | 8.7 |
| Type mismatch | date | object instead of datetime64 | pd.to_datetime(format='mixed') | 8.7 |
| Encoding needed | level, city | Text categories for ML | LabelEncoder / get_dummies | 8.8 |
| Scaling needed | score, age | Different numeric ranges | MinMaxScaler / StandardScaler | 8.9 |
All six steps above are read-only. The DataFrame df still has 207 rows and 7 columns in its original state. The inspection report above is the output of the diagnostic phase: a specification for the cleaning work that begins in Topic 8.5.
.describe() shows statistical problems (outliers, impossible values, suspicious distributions), but it does not show formatting inconsistencies, duplicate rows, or wrong data types. A complete audit requires all six steps. Using only .describe() gives a partial picture that misses structural and categorical problems entirely.
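The read-only claim can be checked directly. This is one possible sketch, using made-up data and a hypothetical snapshot variable: take a deep copy before inspecting, run the six calls, then compare.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the W8 file.
toy = pd.DataFrame({
    "age": [18.0, np.nan, 22.0, 18.0],
    "city": ["Cairo", "cairo", None, "Cairo"],
    "score": [97.5, 80.0, 77.4, 97.5],
})
snapshot = toy.copy(deep=True)   # independent copy taken before inspection

# Run the six inspection calls; none of them assigns anything back to toy.
toy.info()
results = [
    toy.shape,
    toy.dtypes,
    toy.isna().sum(),
    toy.duplicated().sum(),
    toy.describe(),
]

unchanged = toy.equals(snapshot)  # NaN in matching positions counts as equal
print(f"DataFrame unchanged after inspection: {unchanged}")
```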
A column stored as float64 may be numerically valid but contain sentinel values (-999, 9999) or logically impossible values (negative ages). The dtype tells you how the data is stored, not whether the stored values are correct. .describe() must also be run to detect value-level problems.
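A minimal sketch of that value-level check follows. The data and the VALID_MIN / VALID_MAX bounds are assumptions made for illustration; the right bounds depend on what the column is supposed to contain.

```python
import pandas as pd

# Hypothetical ages: -999 is a sentinel and -3 is logically impossible,
# yet the column is stored as a perfectly ordinary float64.
toy = pd.DataFrame({"age": [18.0, 22.0, -999.0, 35.0, -3.0, 27.0]})
print(toy["age"].dtype)

VALID_MIN, VALID_MAX = 18, 35   # assumed valid range for this illustration

# .between() is inclusive on both ends by default.
# Note: NaN is never "between", so missing values would also be selected by this mask.
out_of_range = toy[~toy["age"].between(VALID_MIN, VALID_MAX)]
print(out_of_range)
print(toy["age"].describe()[["min", "max"]])
```

The min row of .describe() exposes the sentinel immediately, while the range filter shows which rows are affected.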
Even a dataset that has been used before must be inspected when it is loaded fresh. Data sources change: new rows may have introduced formatting inconsistencies, a system update may have changed a column type, or a data-export script may have introduced duplicates. Inspection is not a one-time task: it should be the first step every time a dataset is loaded.
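One way to make inspection a habit is to wrap the per-column checks in a small helper that is called right after every load. The function name inspection_summary and the toy data below are hypothetical; this is a sketch of the idea, not part of the course materials.

```python
import numpy as np
import pandas as pd

def inspection_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Print size and duplicate count, return a per-column summary. Does not modify df."""
    print(f"Shape: {df.shape}")
    print(f"Duplicate rows: {df.duplicated().sum()}")
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.count(),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(2),
        "unique": df.nunique(),          # NaN is excluded from the unique count
    })

# Hypothetical toy data standing in for a freshly loaded file.
toy = pd.DataFrame({
    "age": [18.0, np.nan, 22.0, 18.0],
    "city": ["Cairo", "cairo", None, "Cairo"],
    "score": [97.5, 80.0, 77.4, 97.5],
})

report = inspection_summary(toy)
print(report)
```

With a real file, the same helper would be called on the result of pd.read_csv(...) before any other work. It complements .describe() and does not replace it.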
- .shape returns (rows, columns) and establishes the baseline size before cleaning begins. The W8 dataset: (207, 7).
- .dtypes reveals type mismatches: the W8 date column is stored as object instead of datetime64, and age is float64 even though ages are whole numbers (signalling NaN values).
- .info() combines type information with non-null counts in one view, confirming 15 missing ages, 8 missing cities, and 10 missing scores.
- .isna().sum() provides precise per-column missing value counts and, when divided by len(df), the missingness percentage.
- .duplicated().sum() counts exact duplicate rows: the W8 dataset has 7 duplicate rows representing 7 students entered twice.
- .describe() reveals statistical distributions: compare min/max against valid ranges, and mean vs. median to detect outlier distortion.
- All six steps are read-only. The inspection phase produces a complete diagnostic report; the cleaning phase (Topics 8.5 to 8.7) addresses each identified problem.
- Dataset: 8_4_competition_results.csv. The W8 competition dataset used in the code examples above.
- Practice Dataset: 8_4_students.csv. A second dataset to practice the six-step inspection routine on your own.
- Notebook: 8_4_Initial_Inspection.ipynb. Follow-along notebook for this topic's inspection steps.
- Pandas Documentation: DataFrame.info(): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html
- Pandas Documentation: DataFrame.describe(): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html
- Pandas Documentation: DataFrame.isna(): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html
- Pandas Documentation: DataFrame.duplicated(): https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html