πŸ”­
Data Audit shape dtypes info isna duplicated describe Pandas

Topic 8.4: Initial Inspection

A full read-only audit of the W8 competition dataset using .shape, .dtypes, .info(), .isna(), .duplicated(), and .describe()

πŸŒ‰ The Building Survey
β–Ό

Before a structural engineer begins renovating an old building, they conduct a thorough survey. They walk through every room, check load-bearing walls, measure floor dimensions, test the electrics, and inspect the plumbing β€” without moving a single brick. This survey produces a complete picture of what is there, what is missing, and what needs to be fixed. Only after the survey is finished do they begin any work. A renovation without a survey means working blind: walls are moved only to discover they were load-bearing, and problems in unexamined areas go unnoticed until they cause failures.

Data inspection follows the same logic. Before any cell in a dataset is modified, a complete read-only audit must be performed. The audit answers six questions: How big is the dataset? What columns exist and what type is each? Where are the missing values, and how many? Are there duplicate rows? What do the numeric distributions look like? Are there any immediately visible impossible values?

ℹ️
Read-Only by Design

All inspection methods in this topic are read-only. They return information about the DataFrame without modifying it. This is intentional: the inspection phase is purely diagnostic. No cleaning decisions should be made, and no values should be changed, until the full diagnostic picture is available.

πŸ“ Dataset Size
β–Ό

The .shape attribute returns a tuple (rows, columns). This is always the first inspection step β€” it establishes the baseline size of the dataset before any cleaning begins. After cleaning, comparing the new shape to the original shape confirms how many rows were removed.

Python
import pandas as pd

df = pd.read_csv('8_4_competition_results.csv')

# Step 1: Check dataset dimensions
print("Dataset shape (rows, columns):")
print(df.shape)

print(f"\nTotal cells: {df.shape[0] * df.shape[1]:,}")
print(f"Rows: {df.shape[0]}")
print(f"Columns: {df.shape[1]}")
β–Ά Output
Dataset shape (rows, columns): (207, 7) Total cells: 1,449 Rows: 207 Columns: 7

The dataset has 207 rows and 7 columns. This is the baseline: after removing duplicates (7 rows), we expect 200 rows. Any additional row changes during cleaning must be tracked against this baseline to ensure no unintended data is lost.

πŸ—‚οΈ Column Types
β–Ό

The .dtypes attribute shows the data type that Pandas is using to store each column. This is the fastest way to identify type mismatches: a date column stored as object, a numeric column stored as object because it contains some text, or integers incorrectly stored as floats (which signals hidden NaN values).

Python
# Step 2: Inspect column data types
print("Column data types:")
print(df.dtypes)
β–Ά Output
Column data types: name str age float64 city str score float64 grade str level str date str dtype: object

Two immediate observations from this output. First: age is stored as float64 even though ages are whole numbers β€” this signals that the column contains at least one NaN value (Pandas converts integer columns to float when NaN is introduced). score is also float64, which is expected here because scores are decimals (e.g. 97.5). Second: date is stored as str instead of datetime64 β€” this signals that the date values are still text, likely because multiple date formats prevented automatic parsing.

int64
Integer values β€” no missing values present in this column. If you expect integers but see float64 instead, there are hidden NaN values.
float64
Decimal numbers β€” or integers that contain at least one NaN. An age column showing float64 means some ages are missing.
object
Strings or mixed types β€” anything not purely numeric. A numeric column showing 'object' means text is mixed in (check for 'N/A', '-', or header rows).
datetime64[ns]
Correctly parsed date and time values. If a date column shows 'object', it was not parsed β€” must be converted with pd.to_datetime().
bool
True/False values. Check whether the column should be categorical instead.
πŸ“‹ One-Command Overview
β–Ό

The .info() method combines type information with non-null counts in one printout. The non-null count per column implicitly reveals missing values: if a DataFrame has 207 rows and a column shows 197 non-null, there are 10 missing values in that column.

Python
# Step 3: Full structural overview with .info()
print("DataFrame structural overview:")
df.info()
β–Ά Output
DataFrame structural overview: <class 'pandas.core.frame.DataFrame'> RangeIndex: 207 entries, 0 to 206 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 name 207 non-null str 1 age 192 non-null float64 2 city 199 non-null str 3 score 197 non-null float64 4 grade 207 non-null str 5 level 207 non-null str 6 date 207 non-null str dtypes: float64(2), str(5) memory usage: 11.4 KB

age shows 192 non-null out of 207 rows β€” confirming 15 missing values. city shows 199 non-null β€” confirming 8 missing values. score shows 197 non-null β€” confirming 10 missing values. All other columns are complete. The memory usage (11.4 KB) is also visible, useful for large dataset planning.

ℹ️
How to Read Non-Null Count as a Missing Value Signal

To find missing values from .info() output: subtract Non-Null Count from total rows. If the RangeIndex says 207 entries and a column shows 197 non-null, then 207 βˆ’ 197 = 10 values are missing. No extra calculation needed β€” the difference is the gap.

❓ Column-by-Column Gaps
β–Ό

The .isna().sum() chain produces the most precise read of missing values: exact counts per column. A complete audit also reports missingness as a percentage β€” a count of 12 means very different things in a 100-row dataset versus a 100,000-row dataset.

Python
# Step 4: Precise missing value audit
print("Missing value counts per column:")
missing = df.isna().sum()
print(missing)

print("\nMissing value percentage per column:")
missing_pct = (missing / len(df) * 100).round(2)
print(missing_pct)

print(f"\nTotal missing cells: {missing.sum()} out of {df.size} ({missing.sum()/df.size*100:.2f}%)")
β–Ά Output
Missing value counts per column: name 0 age 15 city 8 score 10 grade 0 level 0 date 0 dtype: int64 Missing value percentage per column: name 0.00 age 7.25 city 3.86 score 4.83 grade 0.00 level 0.00 date 0.00 dtype: float64 Total missing cells: 33 out of 1449 (2.28%)

All three columns with missing data are well under the 20% threshold. age is missing for 7.25% of students, city for 3.86%, and score for 4.83% β€” all moderate to low. The decision on how to handle each is made in Topic 8.6.

πŸ”„ Counting Duplicates
β–Ό

The .duplicated() method returns a Boolean Series β€” True for any row that is an exact copy of an earlier row, False otherwise. Chaining .sum() counts the total number of duplicate rows. In the inspection phase, the goal is to measure the scale of duplication and examine a sample of duplicated rows before any removal begins.

Python
# Step 5: Duplicate row audit
print(f"Total duplicate rows: {df.duplicated().sum()}")

# Show the actual duplicate rows (the later copies)
print("\nDuplicated records:")
print(df[df.duplicated()])
β–Ά Output
Total duplicate rows: 7 Duplicated records: name age city score grade level date 86 Noha Ibrahim 27.0 giza 91.2 A advanced 19-05-2024 116 Samira Hassan 22.0 CAIRO 84.0 A advanced 30-12-2023 157 Mido Gamal 27.0 Giza 81.0 A advanced 2024-04-14 174 Wedad Ali 23.0 Alex 56.7 C beginner 2023-11-26 187 Khaled Hassan 24.0 GIZA 83.9 A advanced 19/12/2023 198 Diaa Adel 22.0 Cairo 65.8 B intermediate 25-10-2023 199 Diaa Adel 31.0 Giza 100.0 A advanced 2024-04-21

There are 7 duplicate rows. df.duplicated() marks each row that is an exact copy of an earlier row as True, so filtering with it shows the 7 later copies β€” each of these has an identical twin earlier in the dataset. A row is only flagged when every column matches exactly. Before removing anything, inspecting these rows confirms they are true duplicates and not coincidentally similar records. Removal happens in Topic 8.5.

πŸ“Š Checking the Numbers
β–Ό

The .describe() method returns summary statistics for every numeric column: count, mean, standard deviation, minimum, Q1, median (Q3), Q3, and maximum. This is the primary tool for detecting outliers and impossible values β€” the min and max values deserve immediate attention.

Python
# Step 6: Statistical distribution audit
print("Numeric column statistics:")
print(df.describe().round(2))

print("\nNon-numeric column overview:")
print(df.describe(include='object'))
β–Ά Output
Numeric column statistics: age score count 192.00 197.00 mean 24.21 76.96 std 3.95 12.71 min 18.00 40.00 25% 21.00 67.60 50% 24.00 77.40 75% 27.00 85.70 max 35.00 100.00 Non-numeric column overview: name city grade level date count 207 199 207 207 207 unique 200 7 3 3 24 top ... CAIRO A advanced 2024-04-21 freq 2 58 104 121 14

Key findings from this output. For score: the mean (76.96) is close to the median (77.40), indicating no severe outlier distortion. The count is 197 (not 207), confirming the 10 missing values. For age: range 18–35, fully valid for a student competition, with a count of 192 confirming 15 missing values. For the text columns: city has more unique values than the real number of cities β€” confirming the formatting inconsistencies (CAIRO / cairo / Giza / giza / Alex) detected earlier. grade and level each have 3 unique values, and date appears in several different formats β€” confirming format variation.

count < total rows
The column has missing values. score shows count=197 (not 207), confirming 10 NaN values. Compare count to df.shape[0] to measure missingness.
mean β‰ˆ median (50%)
The distribution is roughly symmetric β€” no severe skew or outlier distortion. For score: mean=76.96, median=77.40. The small gap is acceptable.
mean >> median
The distribution is right-skewed β€” a few very high values are pulling the mean up. Investigate max and Q3 for possible outliers or data-entry errors.
unique >> expected
A text column has more categories than it should. city shows more unique values than the real number of cities, because the same city is written in several cases (CAIRO / cairo / Giza / giza) β€” indicates formatting inconsistencies to fix in Topic 8.7.
πŸ—ΊοΈ Before You Clean
β–Ό

After running all six inspection steps, the analyst has a complete diagnostic picture. Before any cleaning begins, this picture should be documented as an inspection report β€” a list of all identified problems with their scale, severity, and the cleaning action required.

Issue FoundColumn(s)ScalePlanned ActionTopic
Duplicate rowsAll columns7 rows (3.4%)drop_duplicates(keep='first')8.5
Missing valuesage15 rows (7.25%)fillna(median)8.6
Missing valuescity8 rows (3.86%)fillna(mode)8.6
Missing valuesscore10 rows (4.83%)fillna(median)8.6
Formatting inconsistenciescitymultiple cases for the same citiesstr.strip().str.title() + map variants8.7
Type mismatchdateobject instead of datetime64pd.to_datetime(format='mixed')8.7
Encoding neededlevel, cityText categories for MLLabelEncoder / get_dummies8.8
Scaling neededscore, ageDifferent numeric rangesMinMaxScaler / StandardScaler8.9
βœ…
Inspection Is Complete β€” No Values Were Changed

All six steps above are read-only. The DataFrame df still has 207 rows and 7 columns in its original state. The inspection report above is the output of the diagnostic phase β€” a specification for the cleaning work that begins in Topic 8.5.

⚠️ Inspection Myths
β–Ό
⚠️
Misconception: .describe() shows all data quality problems

.describe() shows statistical problems β€” outliers, impossible values, suspicious distributions β€” but it does not show formatting inconsistencies, duplicate rows, or wrong data types. A complete audit requires all six steps. Using only .describe() gives a partial picture that misses structural and categorical problems entirely.

⚠️
Misconception: If .info() shows the correct dtype, the column is clean

A column stored as float64 may be numerically valid but contain sentinel values (-999, 9999) or logically impossible values (negative ages). The dtype tells you how the data is stored, not whether the stored values are correct. .describe() must also be run to detect value-level problems.

⚠️
Misconception: Inspection can be skipped if you already know the dataset

Even a dataset that has been used before must be inspected when it is loaded fresh. Data sources change: new rows may have introduced formatting inconsistencies, a system update may have changed a column type, or a data-export script may have introduced duplicates. Inspection is not a one-time task β€” it should be the first step every time a dataset is loaded.

βœ… Audit This
β–Ό
?
After running df.info() on the W8 dataset, the 'age' column shows 'float64' dtype with 192 non-null count (out of 207 rows). What do these two observations together indicate?
?
An analyst runs the full 6-step inspection on the W8 dataset and produces an inspection report. At what point should they begin cleaning the data?
  • .shape returns (rows, columns) β€” establishes the baseline size before cleaning begins. The W8 dataset: (207, 7).
  • .dtypes reveals type mismatches β€” the W8 date column is stored as object instead of datetime64, and score is float64 (signalling NaN values).
  • .info() combines type information with non-null counts in one view β€” confirms 15 missing ages, 8 missing cities, and 10 missing scores.
  • .isna().sum() provides precise per-column missing value counts and, when divided by len(df), the missingness percentage.
  • .duplicated().sum() counts exact duplicate rows β€” the W8 dataset has 7 duplicate rows representing 7 students entered twice.
  • .describe() reveals statistical distributions β€” compare min/max against valid ranges, and mean vs. median to detect outlier distortion.
  • All six steps are read-only. The inspection phase produces a complete diagnostic report; the cleaning phase (Topics 8.5–8.7) addresses each identified problem.
πŸ“‚Dataset & Notebook
β–Ό
πŸ“šExternal Resources
β–Ό