Topic 8.1: Data Quality

🏥 Hospital Records Problem

A large hospital network collects patient records across twenty clinics over five years. Each clinic used a slightly different intake form — one clinic recorded age as a number, another as a date of birth, and a third skipped it entirely for some patients. Several patients appear twice because they visited two different branches. Blood pressure was sometimes recorded as a single average number rather than systolic/diastolic. Before any doctor could study health trends in this data, an entire team spent weeks just making the records usable.

This is not an unusual situation — it is the default. Real-world data is collected by humans, entered by hand, exported from systems that were never designed to work together, and accumulated over years without consistent standards. The result is almost never clean. It is raw: unverified, inconsistent, and full of gaps.

🔬

The 80/20 Rule of Data Science

A widely cited rule of thumb, supported by several industry surveys, holds that data scientists spend roughly 80% of their working time preparing and cleaning data and only about 20% building models or drawing insights. The exact split varies by survey and by project, but the ratio still surprises students who expect data science to be mostly about algorithms. Data preparation is where much of the professional value is created.

The hospital example maps onto almost every data science project. Before analysis can begin, someone must unify the age formats, remove duplicate patients, decide what to do about missing values, and verify that all recorded values fall within plausible ranges. That process, which is systematic, careful, and skilled, is what Week 8 is about.
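As a rough illustration of the hospital steps, the sketch below unifies two age formats and removes a repeated patient. The column names, the sample values, the reference date, and the idea that `patient_id` identifies the same person across clinics are all assumptions made for this example, not details from the hospital story.

```python
import pandas as pd

# Hypothetical intake records merged from several clinics.
records = pd.DataFrame({
    "patient_id": [101, 102, 103, 101],
    "age": [34, None, None, 34],                         # some clinics recorded age directly
    "date_of_birth": [None, "1980-06-15", None, None],   # others recorded date of birth
})

# Assumed reference date for turning a date of birth into an age.
as_of = pd.Timestamp("2024-01-01")

dob = pd.to_datetime(records["date_of_birth"], errors="coerce")

# Whole years elapsed: subtract one year if the birthday has not yet occurred.
had_birthday = (dob.dt.month < as_of.month) | (
    (dob.dt.month == as_of.month) & (dob.dt.day <= as_of.day)
)
age_from_dob = as_of.year - dob.dt.year - (~had_birthday).astype(int)

# Fill missing ages from date of birth where one exists.
records["age"] = records["age"].fillna(age_from_dob)

# Keep one row per patient (assumes patient_id is shared across clinics).
records = records.drop_duplicates(subset="patient_id")

print(records)
print("Ages still missing:", records["age"].isna().sum())
```

Patient 103 has neither an age nor a date of birth, so the gap remains and still needs an explicit decision.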

🎯 Good Enough Data

Data quality is not a single property — it is a combination of six measurable dimensions. A dataset can excel on one dimension and fail on another. Understanding each dimension separately allows an analyst to diagnose precisely what is wrong and choose the correct remediation strategy.

🎯

Accuracy

Values reflect the true real-world measurements they represent. An age of 250 or a score of -50 is inaccurate. Accuracy problems arise from data-entry errors, unit mismatches, and sensor calibration failures.

📋

Completeness

All required values are present — no critical cells are left empty. Completeness is measured as a percentage: a column with 10 missing values out of 100 rows has 90% completeness.
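Completeness per column can be measured in one line of pandas. The sketch below uses a small made-up frame; the column names and values are illustrative only.

```python
import pandas as pd

# Made-up sample: one missing score, two missing levels.
df = pd.DataFrame({
    "score": [88.0, None, 92.0, 75.0],
    "level": ["A", "B", None, None],
})

# Share of non-missing cells in each column, as a percentage.
completeness_pct = df.notna().mean() * 100
print(completeness_pct)
```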

🔄

Consistency

The same real-world entity is represented the same way everywhere in the dataset. 'Cairo', 'cairo', and 'CAIRO' all refer to the same city but are inconsistent representations that corrupt group-level analysis.
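The effect on group-level analysis is easy to see with a small made-up series. The sketch below counts groups before and after normalising whitespace and casing; the sample values are illustrative.

```python
import pandas as pd

cities = pd.Series(["Cairo", "cairo", "CAIRO", " Giza", "giza"])

# Before cleaning: every spelling is its own group.
raw_groups = cities.value_counts()

# Strip stray whitespace and apply Title Case so spellings collapse.
clean = cities.str.strip().str.title()
clean_groups = clean.value_counts()

print(len(raw_groups), "groups before cleaning")
print(clean_groups)
```

Casing rules alone would not merge an abbreviation such as 'Alex' with 'Alexandria'; that kind of inconsistency needs an explicit mapping to a canonical form.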

⏰

Timeliness

Data reflects the current state of the real world at the time of analysis. A customer address from five years ago may be outdated. Stale data produces conclusions that were once valid but are no longer.
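One simple way to flag stale records is to compare a last-updated date against a freshness threshold. In the sketch below, the column names, the analysis date, and the one-year threshold are all assumptions chosen for illustration; a real threshold depends on the question being asked.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["a", "b", "c"],
    "address_updated": pd.to_datetime(["2019-02-10", "2023-11-05", "2024-01-20"]),
})

as_of = pd.Timestamp("2024-06-01")   # assumed analysis date
max_age = pd.Timedelta(days=365)     # assumed freshness threshold

# A record is stale when its last update is older than the threshold.
df["stale"] = (as_of - df["address_updated"]) > max_age
print(df)
```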

✅

Validity

Values conform to the expected format, type, and range for their column. A date column containing the string 'N/A', or a score column containing the value 999 on a 100-point scale, fails validity.
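Validity checks are usually rule-based: parse against the expected format, then test the allowed range. The sketch below is one possible approach on made-up data; the `YYYY-MM-DD` format and the 0 to 100 range are assumptions for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-03-01", "N/A", "2024-13-01"],
    "score": [72, 999, 55],
})

# Values that do not match the expected format become NaT instead of raising.
parsed = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")
invalid_date = parsed.isna()

# Scores must fall inside the assumed 0 to 100 range (inclusive).
invalid_score = ~df["score"].between(0, 100)

print(df[invalid_date | invalid_score])
```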

🧬

Uniqueness

Each real-world entity appears exactly once in the dataset. A student whose record was exported twice inflates every count and every average that includes them.
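A minimal sketch of how a single repeated row shifts an average, using made-up student records:

```python
import pandas as pd

# Student 2 was exported twice.
df = pd.DataFrame({
    "student_id": [1, 2, 2, 3],
    "score": [60, 90, 90, 70],
})

# duplicated() marks the second and later copies of fully identical rows.
dup_mask = df.duplicated()

mean_with_dups = df["score"].mean()
mean_dedup = df.drop_duplicates()["score"].mean()

print("Duplicate rows:", dup_mask.sum())
print("Mean with duplicates:", mean_with_dups)
print("Mean after removing them:", round(mean_dedup, 2))
```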

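These six dimensions are not independent. A value can be complete (present) but inaccurate (wrong). A dataset can be consistent (all the same format) but invalid (all the same wrong format). A single bad value can also fail more than one dimension: a score of 850 on a 100-point exam is invalid because it falls outside the allowed range, and inaccurate because it cannot be the student's true score. A full quality assessment checks all six dimensions simultaneously.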

Dimension	Question It Answers	Example of Failure
Accuracy	Are the values correct?	A student's score recorded as 850 on a 100-point exam.
Completeness	Are all required values present?	The 'grade' column is empty for 40 out of 207 students.
Consistency	Is the same entity always written the same way?	'Alex', 'Alexandria', and 'ALEX' all appearing in the city column.
Timeliness	Is the data current enough to answer today's questions?	Competition results from 2019 used to evaluate 2024 students.
Validity	Does each value conform to its expected type, format, and range?	The date column contains '2024-13-01' — month 13 does not exist.
Uniqueness	Does each real-world entity appear only once?	The same student appears twice because a system export ran twice.

📦 Messy vs. Clean

The most important distinction to establish at the start of Week 8 is the difference between raw data and analysis-ready data. These are not just different stages — they represent entirely different fitness levels for the task at hand.

Raw Data

Data in its original, unprocessed form — as collected from sensors, forms, databases, or APIs. It may contain errors, gaps, redundancies, and inconsistencies across all six quality dimensions.

Analysis-Ready Data

Data that has been inspected, cleaned, and transformed so that every row represents one valid record, every column has a consistent type and format, and no critical information is missing or duplicated.

The Gap Between Them

The gap is filled by data preparation: a structured sequence of diagnostic and corrective steps performed before any analytical question is asked of the data.

A useful test: ask 'What would happen if I ran an analysis on this data right now?' If the answer involves plausible errors (averages pulled up by a typo, categories that should be the same counted as different, or a model that silently ignores half the rows), then the data is not analysis-ready. The job of data preparation is to close every one of those gaps.
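The test can be tried on a few made-up scores. In the sketch below, 850 stands in for a typo of 85 and one value is missing; both the data and the correction are assumptions for illustration.

```python
import pandas as pd

# 850 is assumed to be a typo for 85; None is a missing score.
scores = pd.Series([70, 80, 850, None, 90])

# pandas skips missing values by default, so the mean runs without complaint.
naive_mean = scores.mean()
rows_used = scores.count()

# After correcting the typo the average changes dramatically.
corrected = scores.replace(850, 85)
corrected_mean = corrected.mean()

print("Naive mean:", naive_mean, "from", rows_used, "of", len(scores), "rows")
print("Corrected mean:", corrected_mean)
```

Neither calculation raises an error, which is why the question has to be asked deliberately.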

What Makes Data 'Analysis-Ready'?

Complete

No critical missing values remain
Every row has enough information to be useful
Gaps are filled with justified estimates or intentionally removed

Consistent

Every column has one data type throughout
Text values follow a single format (e.g., Title Case for city names)
Dates are all in the same structure (e.g., YYYY-MM-DD)

Non-Redundant

No exact duplicate rows remain
Each record represents exactly one real-world entity
No columns that repeat the same information

Accurate and Valid

Values fall within plausible ranges for their domain
No obvious data-entry errors remain
Outliers have been investigated and either corrected or documented
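Part of this checklist can be turned into counts that should all reach zero. The function below is a hypothetical helper written for this sketch, not a pandas feature, and it covers only the criteria that are easy to test mechanically (missing required values, exact duplicates, range violations).

```python
import pandas as pd

def readiness_report(df, required_cols, ranges):
    """Count remaining problems; all zeros suggests the checked criteria are met."""
    report = {
        "missing_required": int(df[required_cols].isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }
    for col, (low, high) in ranges.items():
        values = df[col].dropna()
        report[f"{col}_out_of_range"] = int((~values.between(low, high)).sum())
    return report

# Made-up sample with one missing city, one duplicate row, and scores above 100.
df = pd.DataFrame({
    "city": ["Cairo", "Giza", "Giza", None],
    "score": [88, 105, 105, 40],
})

report = readiness_report(df, required_cols=["city", "score"], ranges={"score": (0, 100)})
print(report)
```

Items such as whether an outlier was investigated or whether an imputation was justified still require human judgment and documentation.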

🔬 Cleaning Is a Skill

It is tempting to view data preparation as a preliminary chore — something you get through before the 'real' analysis begins. This framing is inaccurate and dangerous. A model trained on dirty data does not produce dirty results; it produces confident results that are wrong, which is far more harmful than an obviously broken output.

Professional data preparation is a structured diagnostic and remediation process. It follows a consistent sequence: first inspect the data to understand what problems exist, then address each problem type in a planned order, then validate the result before passing the data to any downstream process.

Phase	What Happens	Professional Purpose
Inspection	Examine shape, types, missing counts, and summary statistics without changing anything.	Diagnose before treating — understand the problem completely before applying any fix.
Structural Repair	Fix column types, resolve schema inconsistencies, standardise formats.	Ensure the data has the correct shape and type for analysis.
Content Cleaning	Remove duplicates, handle missing values, correct formatting errors.	Ensure the values inside each cell are valid and consistent.
Transformation	Encode categories, scale numeric features, derive new variables.	Convert cleaned data into the form required by the analytical method.
Validation	Re-inspect the cleaned dataset to confirm all issues were resolved.	Quality-assure the output before it is used in analysis or modelling.
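The five phases in the table above can be sketched as a chain of small functions. Everything here is an assumption made for illustration: the function names, the sample data, the choice to drop rows with missing scores, and the division by 100 as a scaling step.

```python
import pandas as pd

def inspect(df):
    # Inspection / validation: measure without changing anything.
    return {
        "rows": len(df),
        "missing": int(df.isna().sum().sum()),
        "duplicates": int(df.duplicated().sum()),
    }

def repair_structure(df):
    # Structural repair: give the score column a numeric type.
    out = df.copy()
    out["score"] = pd.to_numeric(out["score"], errors="coerce")
    return out

def clean_content(df):
    # Content cleaning: drop duplicates, standardise city names, drop missing scores.
    out = df.drop_duplicates().copy()
    out["city"] = out["city"].str.strip().str.title()
    return out.dropna(subset=["score"])

def transform(df):
    # Transformation: derive a score on a 0 to 1 scale.
    out = df.copy()
    out["score_scaled"] = out["score"] / 100
    return out

raw = pd.DataFrame({
    "city": ["cairo", "cairo", "GIZA", "Giza"],
    "score": ["80", "80", "n/a", "95"],
})

before = inspect(raw)
ready = transform(clean_content(repair_structure(raw)))
after = inspect(ready)

print("Before:", before)
print("After: ", after)
```

The first inspection reports zero missing values because 'n/a' is stored as ordinary text; the gap only becomes visible once the column type is repaired, which is one reason validation repeats the inspection at the end.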

Each phase requires judgment, not just syntax. Knowing which method to use — and when — is the skill that separates a careful analyst from someone who merely runs code. The decisions made in data preparation directly determine the validity of every conclusion drawn from the data.

ℹ️

Data Quality Is a Discipline, Not a Single Step

In professional environments, data quality is an ongoing concern, not a one-time fix. Many organisations have dedicated data engineering teams whose job is to keep data pipelines clean, consistent, and validated before analysts ever see the data. Week 8 gives you the vocabulary and skills to take part in that kind of work.

📊 What's Wrong Here?

The W8 competition dataset is a concrete illustration of quality problems across multiple dimensions. Looking at the raw dataset through the lens of all six quality dimensions reveals exactly what needs to be fixed before any analysis or modelling can begin.

Column	Quality Dimension Violated	Observed Problem
city	Consistency	'Cairo', 'cairo', 'CAIRO', 'Alex', 'Alexandria', 'Giza', 'giza' — 7 spellings for 3 cities
score	Completeness	12 missing values (NaN) out of 207 rows — 5.8% incomplete
date	Validity + Consistency	Four different date format conventions in the same column; stored as text instead of datetime
All columns	Uniqueness	7 duplicate rows — the same student appears twice, inflating all counts
level	Completeness	5 missing values — 2.4% of level entries are absent

None of these problems raises a pandas error. The dataset loads, the columns are named correctly, and the data appears superficially reasonable. The problems only become visible when each quality dimension is explicitly audited, which is exactly what Topic 8.4 (Initial Inspection) covers.
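An explicit audit might look like the sketch below. The frame is a tiny stand-in built for this example, not the actual W8 file, so the counts it prints do not match the table above; only the style of check carries over.

```python
import pandas as pd

# Tiny stand-in for the competition data (values invented for illustration).
df = pd.DataFrame({
    "city": ["Cairo", "cairo", "CAIRO", "Alex", "Alexandria", "Giza", "giza", "Cairo"],
    "score": [90, 85, None, 70, 65, None, 88, 90],
    "date": ["2024-01-05", "05/01/2024", "Jan 5 2024", "2024-01-06",
             "2024-01-07", "2024-01-08", "2024-01-09", "2024-01-05"],
})

audit = {
    # Consistency: how many distinct spellings appear in the city column?
    "distinct_city_spellings": int(df["city"].nunique()),
    # Completeness: how many scores are missing?
    "missing_scores": int(df["score"].isna().sum()),
    # Uniqueness: how many rows are exact copies of an earlier row?
    "duplicate_rows": int(df.duplicated().sum()),
    # Validity: is the date column stored as a real datetime type?
    "date_is_datetime": bool(pd.api.types.is_datetime64_any_dtype(df["date"])),
}
print(audit)
```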

🔬

The W8 Dataset Is Intentionally Dirty

The competition results dataset used throughout Week 8 was constructed with realistic data quality problems: duplicate entries (representing a system export that ran twice), missing scores (representing students who did not complete the submission form), casing inconsistencies in city names (representing data entered by different operators), and mixed date formats (representing data merged from two different registration systems). These are not artificial edge cases — they are the kinds of problems found in real event registration, school record, and survey datasets every day.

The W8 dataset violates four of the six quality dimensions: completeness (missing scores and level values), consistency (seven city spellings for three cities and four date format conventions), validity (dates stored as text rather than as datetime values), and uniqueness (duplicate rows that overcount students). Accuracy and timeliness happen to be intact for this dataset: the scores are within the valid range and the competition data is current. Not every dataset violates all six dimensions at once, but the only way to know which ones are intact is to check each of them.

💥 The Cost of Skipping

Data quality problems are not abstract technical concerns — they translate directly into wrong decisions, wasted resources, and failed projects. Understanding what can go wrong when quality is ignored makes the case for rigorous preparation more concrete than any statistic.

Quality Failure	Real-World Consequence	Root Dimension
Student competition scores contain 7 duplicate rows	Class average is skewed toward the duplicated students; if top scorers are the ones counted twice, it is inflated	Uniqueness
City column has 7 spellings for 3 cities	A 'students per city' report shows 7 rows instead of 3 — Cairo data is split across three groups	Consistency
Date column stored as text strings	Time-series sorting fails; a chronological chart shows dates in alphabetical order instead of calendar order	Validity
Score column has 12 missing values that are ignored	Average score is calculated from only 195 of the 207 rows, so the reported class mean may not represent the whole class	Completeness
A model is trained on dirty data	The model produces high accuracy on the training set but fails in production because it learned patterns from corrupted values	All dimensions

Each row in the table above describes a consequence that looks like an analysis error but is actually a data quality error. The analyst runs the correct code, the code executes without errors, and the output is presented to decision-makers as factual. The damage is done before anyone realises the data was never trustworthy.
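The date-sorting failure is a good example of correct code producing a wrong result. The sketch below sorts three made-up dates first as text and then as parsed datetimes; the day/month/year format is an assumption for the example.

```python
import pandas as pd

# Dates stored as text in day/month/year form (assumed format).
dates = pd.Series(["10/02/2024", "02/03/2024", "25/01/2024"])

# Sorting text compares characters left to right, not calendar positions.
text_order = dates.sort_values().tolist()

# Parsing first gives a true chronological sort.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
calendar_order = parsed.sort_values().dt.strftime("%d/%m/%Y").tolist()

print("Text order:    ", text_order)      # 2 March, 10 February, 25 January
print("Calendar order:", calendar_order)  # 25 January, 10 February, 2 March
```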

⚠️

Garbage In, Garbage Out

The most dangerous category of data quality failure produces confident-looking wrong answers. A model that silently ignores NaN rows and trains on only 195 out of 207 students reports its accuracy on 195 students — but is deployed to classify all 207. The 12 students it was never trained on represent exactly those whose data was problematic to begin with. Their predictions are the least reliable — but the performance metric never reveals this.

Quality awareness is not negativity — it is the professional default. Treating every new dataset as likely to have quality issues (because real-world data almost always does) is what separates an analyst who produces trustworthy results from one who produces fast results. Speed without quality is not an advantage in data science.

Rule 1

Always measure quality before drawing conclusions.

Calling .info(), .describe(), and .isna().sum() takes only a few seconds and can reveal problems that would corrupt hours of downstream analysis.
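A minimal sketch of those three calls on a made-up frame (the values are invented so that each call has something to reveal):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Cairo", "cairo", None],
    "score": [88.0, None, 850.0],
})

# Column types and non-null counts.
df.info()

# Summary statistics for numeric columns; an implausible max stands out here.
summary = df.describe()
print(summary)

# Missing values per column.
missing = df.isna().sum()
print(missing)
```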

Rule 2

Treat the six quality dimensions as a checklist, not a suggestion.

A dataset that passes accuracy checks but fails consistency checks is still not analysis-ready. All six dimensions must be assessed.

Rule 3

Document every cleaning decision.

Recording why a value was imputed (not deleted), or why a particular canonical form was chosen, allows the analysis to be reproduced and audited. Undocumented cleaning decisions cannot be verified.
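One lightweight way to document decisions is to append an entry to a log each time a cleaning step runs. The `log_step` helper, the log fields, the sample data, and the stated reasons below are all hypothetical; teams often use notebooks, commit messages, or pipeline metadata for the same purpose.

```python
import pandas as pd

cleaning_log = []

def log_step(action, reason, rows_before, rows_after):
    # Record what was done, why, and how the row count changed.
    cleaning_log.append({
        "action": action,
        "reason": reason,
        "rows_before": rows_before,
        "rows_after": rows_after,
    })

df = pd.DataFrame({
    "student_id": [1, 2, 2, 3],
    "score": [60.0, 90.0, 90.0, None],
})

n = len(df)
df = df.drop_duplicates()
log_step("drop exact duplicate rows", "export assumed to have run twice", n, len(df))

median_score = df["score"].median()
df["score"] = df["score"].fillna(median_score)
log_step(f"impute missing score with median ({median_score})",
         "keep the student in the dataset rather than delete the row",
         len(df), len(df))

print(pd.DataFrame(cleaning_log))
```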

Rule 4

Quality assurance is iterative — repeat inspection after cleaning.

A cleaning step can introduce new problems: imputing a missing value with an incorrect method, or standardising a column in a way that produces a new invalid value. Re-inspect after every major cleaning step.
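The sketch below shows one way a standardisation step can create a new problem. Mapping city names through a dictionary of canonical forms silently turns any value that is not in the dictionary into a missing value. The mapping and the extra city 'Luxor' are invented for this example.

```python
import pandas as pd

cities = pd.Series(["Cairo", "cairo", "Alex", "Giza", "Luxor"])

# Hypothetical canonical forms; note that 'luxor' was not anticipated.
canonical = {
    "cairo": "Cairo",
    "alex": "Alexandria",
    "alexandria": "Alexandria",
    "giza": "Giza",
}

missing_before = int(cities.isna().sum())

# Series.map returns NaN for any key that is absent from the dictionary.
cleaned = cities.str.lower().map(canonical)

missing_after = int(cleaned.isna().sum())
unmapped = cities[cleaned.isna()].tolist()

print("Missing before:", missing_before)
print("Missing after: ", missing_after)
print("Values lost by the mapping:", unmapped)
```

Comparing the missing count before and after the step is what catches the problem.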

⚠️ Quality Myths

⚠️

Misconception: 'Clean data is the default'

Students accustomed to tutorial datasets assume that real data is normally clean and only occasionally has a few errors. In practice, the opposite is true. Every real dataset collected over time — from any source — will have quality issues. Treating data quality problems as exceptional leads analysts to skip the inspection phase entirely.

⚠️

Misconception: 'Just delete anything that looks wrong'

Deletion is the simplest fix for any data problem, but it is often the most damaging one. Deleting rows with missing values, outliers, or formatting errors can remove large proportions of a dataset — and the rows that are deleted may not be random. If missing values are concentrated in a particular subgroup, deleting those rows introduces a systematic bias that invalidates any conclusion drawn from the remaining data.
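The bias is easy to reproduce on made-up data in which the missing scores are concentrated in one city. The values below are invented for illustration.

```python
import pandas as pd

# Four students per city; three of the four Giza scores are missing.
df = pd.DataFrame({
    "city": ["Cairo"] * 4 + ["Giza"] * 4,
    "score": [80, 85, 90, 75, 60, None, None, None],
})

# Before deleting anything, check where the gaps are.
missing_rate_by_city = df["score"].isna().groupby(df["city"]).mean()

share_before = df["city"].value_counts(normalize=True)
kept = df.dropna(subset=["score"])
share_after = kept["city"].value_counts(normalize=True)

print(missing_rate_by_city)
print("Giza share before deletion:", share_before["Giza"])
print("Giza share after deletion: ", share_after["Giza"])
```

Giza drops from half of the dataset to a fifth of it, so any overall statistic computed afterwards mostly describes Cairo.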

⚠️

Misconception: 'Data preparation is a one-time step'

In real projects, data preparation is iterative. After an initial cleaning pass, exploration often reveals new problems — a distribution that looks wrong, a category that splits unexpectedly, a date field that works in most rows but fails in edge cases. Preparation and exploration alternate until the analyst is confident the data is trustworthy.

✅ Quality Check

A competition dataset has a 'score' column where one student's value is 850. The exam was out of 100 points. Which dimension of data quality does this violate?

A data science team receives a dataset and immediately begins training a machine learning model without any inspection. What is the most likely consequence?

Data quality has six measurable dimensions: Accuracy (values are correct), Completeness (values are present), Consistency (same entity is always written the same way), Uniqueness (each entity appears only once), Timeliness (data is current), and Validity (values conform to expected type, format, and range).
Raw data is data in its original, unprocessed state — almost always containing problems across one or more quality dimensions that make it unfit for direct analysis.
Analysis-ready data is complete, consistent, non-redundant, accurate, and valid — every row represents a valid record and every column has a uniform format and type.
By a widely cited rule of thumb, data scientists spend roughly 80% of their working time preparing data, and this is where most analytical errors are either prevented or introduced.
Data preparation follows a structured five-phase sequence: inspect → repair structure → clean content → transform → validate. Skipping the inspection phase produces confident-looking results built on flawed foundations.

📚 External Resources

InfoWorld: The 80/20 Data Science Dilemma
https://www.infoworld.com/article/2257148/the-80-20-data-science-dilemma.html
IBM: What Is Data Quality?
https://www.ibm.com/think/topics/data-quality
Collibra: The 6 Data Quality Dimensions with Examples
https://www.collibra.com/blog/the-6-dimensions-of-data-quality