Data Engineering & Analytics

The foundation everything else builds on. You'll learn to navigate a Linux system, create and manage files, control who can access what, and chain commands together to automate tasks. By the end, you'll be able to set up users, connect to servers over SSH, install software, and run your first Docker container.

Chain commands together to filter, sort, count, and search through data. Redirect output to files, combine streams, and build one-liners that do real work.

|, >, >>, 2>&1, tee, sort, uniq, wc, grep

T04 Processes & System (+Quiz)

See what's running on your system, monitor resource usage in real time, stop runaway processes, and manage background jobs.

ps, top, kill, bg, fg, jobs

Lab Disk Space Incident

A disk space alert fired overnight. Investigate the cause, find what's eating the space, and preserve the evidence.

Covers: df, ls -lh, tail, grep -c, output redirection

T05 Users & SSH (+Quiz)

Create and manage user accounts, configure sudo access, generate SSH keys, and connect to remote servers securely.

useradd, passwd, sudo, ssh, ssh-keygen, authorized_keys

T06 Networking (+Quiz)

Check your network interfaces, test connectivity, make HTTP requests from the terminal, understand how DNS resolves names, and identify what's listening on which ports.

ip, ss, ping, curl, DNS resolution, ports

T07 Package Management (+Quiz)

Install, update, and remove software. Manage repositories, resolve dependency issues, and keep your system current.

apt, dpkg, repositories, dependencies

T08 Docker Basics (+Quiz)

Pull images, run containers, map ports, and manage the container lifecycle. Your first step into the tool that changed how software gets deployed.

images, containers, run, ps, stop, rm, port mapping

Lab Shift Handoff

Your colleague left two issues to check before their shift ended. Read the handoff, investigate both, and close them out.

Covers: cat, ls, find, wc, free

Exam Level 1 Master Exam

Prove you've mastered the fundamentals. 10 questions covering everything from Level 1. Score 80% or higher to pass.

Covers: files, permissions, pipes, processes, users, SSH, networking, packages, docker

Cards Level 1 Flashcards

Drill every command and concept from Level 1. Cards pull from this level's quizzes and exam, and the ones you miss come back more often.

Master the data-engineering fundamentals that power every analytics job. You'll work with CSV, JSON, TSV, and SQLite — the formats you'll meet every day. You'll learn regex patterns, sed cleanup, awk aggregation, cut/paste/join column ops, jq for JSON, and SQL basics. By the end, you can inspect, clean, extract from, and query any tabular data file at the command line.

T09 Data Formats & Structure (+Quiz)

Recognize and inspect the formats you'll see every day. Read CSV, TSV, JSON, and fixed-width files, measure them with head and wc, and understand the shape of data before you process it.

CSV, TSV, JSON, fixed-width, head, wc, awk

T10 grep & Regex for Data (+Quiz)

Extract patterns from raw text. Master character classes, quantifiers, and anchors, then use grep -oE to pull IPs, dates, emails, and anything else matching a pattern out of messy files.

character classes, quantifiers, anchors, grep -oE, extracting patterns
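
As a rough illustration of the extraction idea in this topic, here is a minimal Python sketch that mirrors what `grep -oE` does: it prints only the matched text, not the whole line. The sample lines and the deliberately loose IP pattern are assumptions made for the example, not course material.

```python
import re

# Hypothetical sample lines; the layout is an assumption for illustration.
lines = [
    "2024-03-01 10.0.0.5 GET /index.html",
    "2024-03-01 192.168.1.20 POST /login",
    "no address on this line",
]

# Comparable to: grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}'
# The pattern is loose on purpose: it also accepts out-of-range octets such as 999.
ip_pattern = re.compile(r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}")

# findall returns every match on a line, so one line can contribute several values.
ips = [match for line in lines for match in ip_pattern.findall(line)]
print(ips)
```

The non-capturing group `(?:...)` matters here: with a capturing group, `findall` would return the group contents instead of the full match.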

T11 sed for Data Cleaning (+Quiz)

Clean up messy data at the command line. Do substitutions, strip whitespace, delete bad lines, and edit files in place using address ranges when you need surgical control.

substitution, whitespace cleanup, line deletion, in-place editing

T12 cut, paste & join (+Quiz)

Reshape tabular data without loading it into a spreadsheet. Pull specific columns with cut, stitch files side-by-side with paste, and merge datasets on a shared key with join.

column extraction, merging files, combining on shared keys

Lab Vendor Data Cleanup

Clean a messy vendor CSV: strip whitespace, remove blank lines, extract valid rows, and cut to the required columns.

Covers: grep, sed, cut, wc, head, regex, whitespace cleanup, column extraction

T13 Introduction to awk (+Quiz)

Your first real text-processing language. Split fields on any delimiter, filter rows by condition, reference fields by number, and run calculations like sum and average right from the command line.

field splitting, -F delimiter, $1/$NF, NR, filtering, calculations

T14 Introduction to JSON & jq (+Quiz)

Work with the format modern APIs actually return. Understand JSON structure, navigate it with jq's dot notation, extract specific fields, iterate through arrays, and filter results by condition.

JSON structure, dot notation, field extraction, array iteration

T15 Introduction to SQL (+Quiz)

Write your first real queries against a real database. Use sqlite3 to SELECT rows, filter with WHERE, sort with ORDER BY, aggregate with GROUP BY, and count, sum, and average anything you need.

sqlite3, SELECT, WHERE, ORDER BY, LIMIT, GROUP BY, aggregates
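
The clauses listed above fit together in a single query. The sketch below runs one through Python's built-in `sqlite3` module against an in-memory database; the `orders` table and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and data, used only to give the query something to read.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 30.0), ("west", 200.0), ("north", 15.0)],
)

query = """
    SELECT region, COUNT(*) AS n, SUM(amount) AS total
    FROM orders
    WHERE amount >= 20          -- filter rows before grouping
    GROUP BY region             -- one output row per region
    ORDER BY total DESC         -- largest total first
    LIMIT 2
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

Note the order of evaluation: WHERE removes the 15.0 row before GROUP BY runs, so the `north` region never appears in the result.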

T16 Mini Project — Data Pipeline (+Quiz)

Put everything from this level together. Chain sed, awk, sort, uniq, jq, and sqlite3 into a single end-to-end pipeline that takes raw input and produces clean, queryable output.

sed, awk, sort, uniq, jq, sqlite3 end-to-end

Lab Data Reconciliation

Reconcile conflicting revenue totals between a CSV and JSON export by tracing the gap to missing records, then verify against the authoritative database.

Covers: awk aggregation, jq extraction, wc/jq length, cut, grep -v, sqlite3 verification

Exam Level 2 Master Exam

Prove you've mastered data foundations. 10 questions covering data formats, regex, sed, awk, cut/paste/join, jq, and SQL. Score 80% or higher to pass.

Covers: CSV, JSON, regex, sed, awk, cut, paste, join, jq, SQL basics

Cards Level 2 Flashcards

Drill every command and concept from Level 2. Cards pull from this level's quizzes and exam, and the ones you miss come back more often.

Turn raw files into answers. You'll aggregate with awk arrays, reshape JSON with jq, inspect CSVs with csvkit, join SQL tables with sqlite3, chain ETL pipelines with tee and xargs, compute log analytics, generate formatted reports, and wield Python's csv/json/statistics/sqlite3 modules from one-liners. By the end, you can turn any CSV, JSON, or log file into a summary report.

T17 awk Deep Dive (+Quiz)

Unlock awk as a real programming language. Use BEGIN and END blocks for setup and teardown, write conditionals, build associative arrays, manipulate strings, and process multiple files at once.

BEGIN/END blocks, conditionals, arrays, string functions, printf

T18 jq & JSON (+Quiz)

Reshape any JSON document into anything else. Use select filters, build new objects, reduce arrays, group by any field, and convert JSON straight to CSV when downstream tools need it.

array iteration, select filters, reduce, group_by, CSV output

T19 csvkit (+Quiz)

Power tools purpose-built for CSVs. Preview files with csvlook, cut columns with csvcut, filter rows with csvgrep, run statistics with csvstat, and query CSV files directly with csvsql.

csvlook, csvcut, csvgrep, csvstat, csvsql

T20 sqlite3 (+Quiz)

Go beyond simple SELECTs. Join tables, filter groups with HAVING, write subqueries, use advanced aggregates, and create views that let you treat complex queries like single tables.

JOINs, GROUP BY HAVING, aggregates, subqueries, views

Lab Cross-Source Executive Brief

Build a five-section executive brief by pulling department payroll, headcount, hiring trends, error rates, and salary statistics from CSV, JSON, and log sources.

Covers: awk associative arrays and printf, jq reshaping, csvsql, log analytics, python3 statistics, tee

T21 Data Pipelines (+Quiz)

Build real multi-stage pipelines. Chain awk, jq, sort, and uniq into sequences that extract, transform, aggregate, and deliver results on a schedule with no babysitting.

awk, jq, sort, uniq chained into multi-stage processing

T22 Log Analytics (+Quiz)

Turn web and system logs into metrics. Parse access logs into structured fields, extract request counts and response times, run top-N analyses, and bucket events by time window.

access logs, metrics extraction, top-N analysis, time aggregation
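
To make the parse, count, and bucket steps concrete, here is a small Python sketch. It assumes a common-log-style layout and uses invented sample lines; real logs often differ, so treat the regular expression as a starting point.

```python
import re
from collections import Counter

# Hypothetical access-log lines in a common-log-like layout (an assumption).
log_lines = [
    '10.0.0.1 - - [01/Mar/2024:10:00:01 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Mar/2024:10:00:07 +0000] "GET /home HTTP/1.1" 200 498',
    '10.0.0.1 - - [01/Mar/2024:10:01:15 +0000] "POST /login HTTP/1.1" 401 120',
    '10.0.0.3 - - [01/Mar/2024:11:02:00 +0000] "GET /home HTTP/1.1" 500 0',
]

line_re = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+)$'
)

# Parse each line into a dict of named fields; lines that do not match are skipped.
records = [m.groupdict() for m in map(line_re.match, log_lines) if m]

# Top-N analysis: the two most requested paths.
top_paths = Counter(r["path"] for r in records).most_common(2)

# Time bucketing: the first 14 characters of the timestamp are day plus hour.
per_hour = Counter(r["ts"][:14] for r in records)

# Share of responses with a 5xx status.
error_rate = sum(r["status"].startswith("5") for r in records) / len(records)
print(top_paths, dict(per_hour), error_rate)
```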

T23 Report Generation (+Quiz)

Ship reports that actually look ready for leadership. Format output with awk, template sections with sed, and build repeatable report scripts your team can run on any schedule.

formatted output with awk, sed templates, automated scripts

T24 Python Data Tools (+Quiz)

Reach for Python when bash runs out of road. Write one-liners with python3, use the csv and json modules to read and write structured files, and start moving from shell scripts to real scripts.

python3 one-liners, csv module, json module, basic scripting

Lab Revenue Anomaly

Investigate an unexplained Q1 revenue spike by tracing it from quarterly totals through department breakdowns to the individual employee responsible.

Covers: awk associative arrays, awk conditionals, jq select and reshaping, csvsql, answer task, investigative analysis

Exam Level 3 Master Exam

Prove you've mastered data analysis and wrangling. 10 questions covering awk deep dive, jq, csvkit, sqlite3, pipelines, log analytics, report generation, and Python data tools. Score 80% or higher to pass.

Covers: awk BEGIN/END, jq select/group_by, csvsql, sqlite3 JOIN/HAVING, pipelines with tee/xargs, log analytics, Python data tools

Cards Level 3 Flashcards

Drill every command and concept from Level 3. Cards pull from this level's quizzes and exam, and the ones you miss come back more often.

Data engineering proper. You'll query CSV, JSON, and Parquet directly with duckdb, reshape DataFrames in pandas, author advanced awk and jq pipelines, write production-grade Python ETL scripts with argparse and logging, and orchestrate end-to-end pipelines with bash, cron, systemd timers, and make. By the end, you can ship an analytics pipeline from raw file to scheduled report.

T25 duckdb Basics (+Quiz)

Query CSV, JSON, and Parquet files directly with SQL, no import step required. Point duckdb at a file and start running analytical queries on millions of rows in seconds.

analytical SQL on CSV/JSON/Parquet, direct file queries

T26 duckdb Analytics (+Quiz)

Run serious analytics without a warehouse. Use window functions, build CTEs for readability, compute complex aggregations, and join across files as if they were tables in the same database.

window functions, CTEs, complex aggregations, cross-file joins

T27 pandas Basics (+Quiz)

Load data into DataFrames and start working with it. Use read_csv, filter and select rows, understand dtypes, and inspect your data with head, tail, and describe before you transform anything.

read_csv, DataFrames, filtering, dtypes, head/tail/describe

T28 pandas Transforms (+Quiz)

Reshape data the way analysts actually work. Group and aggregate with groupby, join tables with merge, pivot long to wide, apply custom functions, and chain methods to build readable pipelines.

groupby, merge, pivot_table, apply, method chaining, output

Lab Pipeline Build Sprint

Build a 5-stage ETL pipeline: extract with duckdb, reshape with jq, merge with pandas, format with awk, and validate totals end-to-end.

Covers: duckdb aggregation and export, jq reshaping and @csv, pandas merge and new columns, awk printf formatting, pipeline validation

T29 Advanced awk Pipelines (+Quiz)

Push awk further than most engineers ever do. Write multi-pass pipelines, exchange data with other processes through co-processing, and manipulate fields in ways that would take dozens of lines in Python.

multi-pass processing, co-processing, complex field manipulation

T30 Advanced jq Pipelines (+Quiz)

Transform deeply nested JSON into whatever shape you need. Chain nested transforms, reduce across arrays, group by any key, and reshape complex API responses without touching a Python script.

nested transforms, reduce, group_by, complex JSON reshaping

T31 Python ETL Scripts (+Quiz)

Write ETL scripts that survive production. Structure extract-transform-load as separate functions, handle CLI flags with argparse, manage file I/O safely, and catch errors before they corrupt the target.

extract-transform-load, argparse, file I/O, error handling
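
One possible shape for such a script is sketched below: extract, transform, and load are separate functions, argparse handles the flags, and bad rows are logged and skipped instead of being written. The `--input` and `--output` flag names and the `name`/`amount` columns are assumptions chosen for the example.

```python
import argparse
import csv
import io
import logging

log = logging.getLogger("etl_sketch")

def extract(handle):
    """Read CSV rows from an open file-like object into a list of dicts."""
    return list(csv.DictReader(handle))

def transform(rows):
    """Trim names and cast amounts; rows that cannot be cast are logged and skipped."""
    clean = []
    for row in rows:
        try:
            clean.append({"name": row["name"].strip(), "amount": float(row["amount"])})
        except (KeyError, TypeError, ValueError):
            log.warning("skipping bad row: %r", row)
    return clean

def load(rows, handle):
    """Write rows as CSV and return how many were written."""
    writer = csv.DictWriter(handle, fieldnames=["name", "amount"])
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Tiny ETL sketch")
    parser.add_argument("--input", required=True, help="source CSV path")
    parser.add_argument("--output", required=True, help="target CSV path")
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    with open(args.input, newline="") as src:
        rows = transform(extract(src))
    # The target is opened only after the transform succeeded.
    with open(args.output, "w", newline="") as dst:
        return load(rows, dst)

# Demo with in-memory files so the sketch runs without touching disk.
source = io.StringIO("name,amount\n alice ,10.5\nbob,oops\n")
target = io.StringIO()
loaded = load(transform(extract(source)), target)
print(loaded, repr(target.getvalue()))
```

Keeping the three stages as plain functions that take file-like objects makes each one testable without the command line; `main()` would normally be called from an `if __name__ == "__main__":` guard.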

T32 Pipeline Orchestration (+Quiz)

Schedule pipelines that run reliably without a human watching. Trigger runs with cron, chain dependent steps so each starts only after the previous one succeeds, prevent overlapping runs with lock files, log every step, and catch failures before data consumers do.

cron, dependency chains, lock files, logging, monitoring
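
The lock-file idea can be sketched in a few lines of Python. This version relies on the operating system refusing to create a file that already exists; the lock path is a hypothetical temporary location, and stale locks left behind by a crashed run are not handled here.

```python
import os
import tempfile

def acquire_lock(path):
    """Return a file descriptor if the lock was taken, or None if it is already held."""
    try:
        # O_CREAT | O_EXCL makes creation fail atomically when the file exists.
        return os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None

def release_lock(fd, path):
    """Close and delete the lock file so the next run can start."""
    os.close(fd)
    os.remove(path)

# Hypothetical lock location inside a fresh temporary directory.
lock_path = os.path.join(tempfile.mkdtemp(), "pipeline.lock")

first = acquire_lock(lock_path)    # the scheduled run takes the lock
second = acquire_lock(lock_path)   # an overlapping run is refused
release_lock(first, lock_path)
third = acquire_lock(lock_path)    # after release, a new run can proceed
release_lock(third, lock_path)
print(first is not None, second is None, third is not None)
```

A run that gets `None` back should exit quietly and leave the work to the run that holds the lock.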

Lab Pipeline Output Mismatch

Trace a silent data loss through a pipeline that exited successfully — count records at each stage, identify where rows were dropped, and discover the root cause.

Covers: duckdb record counting, wc -l, awk conditional filtering, answer tasks, pandas data cleaning, pipeline stage diagnosis

Exam Level 4 Master Exam

Prove you've mastered data engineering. 10 questions across duckdb, pandas, advanced awk/jq pipelines, Python ETL scripts, and orchestration. Score 80% or higher to pass.

Covers: duckdb SQL, pandas DataFrames, window functions, jq select/group_by, awk aggregation, Python ETL, bash orchestration, cron

Cards Level 4 Flashcards

Drill every command and concept from Level 4. Cards pull from this level's quizzes and exam, and the ones you miss come back more often.

Data analytics from first principles. You'll write SQL window functions for rankings and running totals, clean dirty datasets by finding nulls, duplicates, type mismatches, and outliers, compute descriptive statistics and correlations, extract structured data from text with regex, pull and paginate API responses with curl and jq, produce terminal and matplotlib charts, automate report pipelines with sed templates and cron, and tie it all together in an e-commerce analytics capstone.

T33 SQL Window Functions (+Quiz)

Rank, compare, and aggregate across rows without collapsing your result set. Use ROW_NUMBER, RANK, LAG, LEAD, and SUM() OVER with frame clauses to answer questions basic SQL can't.

ROW_NUMBER, RANK, LAG, LEAD, SUM() OVER, frame clauses
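
Here is a small sketch of three of these functions working side by side, run through Python's `sqlite3` module. It assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added; the `sales` table is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical monthly revenue figures.
conn.execute("CREATE TABLE sales (month TEXT, revenue INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2024-01", 100), ("2024-02", 150), ("2024-03", 120)],
)

query = """
    SELECT month,
           revenue,
           LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
           SUM(revenue) OVER (
               ORDER BY month
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total,
           RANK() OVER (ORDER BY revenue DESC) AS revenue_rank
    FROM sales
    ORDER BY month
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

Every input row is still present in the output. The first month has no previous row, so `LAG` returns NULL there, which Python shows as `None`.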

T34 Data Cleaning (+Quiz)

Detect disguised nulls (empty strings, "N/A", and other placeholders that stand in for missing values), deduplicate on composite keys, validate types, flag outliers, and normalize inconsistent values. The work that happens before every real analysis.

null detection, deduplication, type validation, outliers, normalization
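
A minimal sketch of three of these checks in plain Python follows. The list of placeholder strings treated as null, the `(customer, date)` composite key, and the sample rows are all assumptions made for the example.

```python
# Placeholder strings treated as missing values; this list is an assumption.
NULL_MARKERS = {"", "null", "none", "n/a", "na", "-"}

def normalize(value):
    """Trim whitespace and turn placeholder strings into a real None."""
    text = value.strip()
    return None if text.lower() in NULL_MARKERS else text

# Hypothetical raw rows, all values still strings as they would be after a CSV read.
raw = [
    {"customer": "acme", "date": "2024-03-01", "amount": "100"},
    {"customer": "acme", "date": "2024-03-01", "amount": "100"},   # duplicate key
    {"customer": "Bolt ", "date": "2024-03-02", "amount": "N/A"},  # disguised null
    {"customer": "cobra", "date": "2024-03-02", "amount": "12x"},  # wrong type
]

seen = set()
clean, rejected = [], []
for row in raw:
    row = {key: normalize(value) for key, value in row.items()}
    composite_key = (row["customer"], row["date"])
    if composite_key in seen:
        continue                      # keep only the first row per composite key
    seen.add(composite_key)
    try:
        row["amount"] = None if row["amount"] is None else float(row["amount"])
    except ValueError:
        rejected.append(row)          # type validation failed; set aside for review
        continue
    clean.append(row)

print(clean)
print(rejected)
```

Rejected rows are kept in their own list rather than dropped, so the count of clean plus rejected plus duplicate rows still adds up to the input.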

T35 Statistical Analysis (+Quiz)

Apply real statistics to your data. Compute descriptive stats, understand distributions, measure correlation between variables, and learn the hypothesis testing concepts behind every A/B test.

descriptive stats, distributions, correlation, hypothesis testing
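
For a feel of the arithmetic, the sketch below computes descriptive statistics with Python's `statistics` module and a Pearson correlation written out by hand. The two series are invented numbers, and the hand-written `pearson` helper is there to show the formula, not to replace a library routine.

```python
import math
import statistics

# Hypothetical paired observations.
ad_spend = [10.0, 20.0, 30.0, 40.0, 50.0]
revenue = [25.0, 41.0, 58.0, 85.0, 101.0]

def pearson(xs, ys):
    """Pearson r: covariance divided by the product of the standard deviations."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

summary = {
    "mean": statistics.mean(revenue),
    "median": statistics.median(revenue),
    "stdev": statistics.stdev(revenue),   # sample standard deviation (n - 1)
}
r = pearson(ad_spend, revenue)
print(summary, round(r, 4))
```

A value of r close to 1 says the two series rise together; it does not say that one causes the other.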

T36 Regex Deep Dive (+Quiz)

Master regex for real extraction work. Use character classes and quantifiers confidently, capture groups, apply lookahead and lookbehind, and know which dialect each tool expects.

character classes, quantifiers, groups, lookahead, sed/grep/awk

Lab Revenue Drop Investigation

Investigate a 22% March revenue drop by tracing the anomaly through monthly trends, regional breakdowns, and weekly record counts to discover a data collection gap.

Covers: SQL window functions (LAG), duckdb analytics, regional segmentation, data quality diagnosis, investigative analysis workflow

T37 API Data Collection (+Quiz)

Pull data from live APIs the right way. Call REST endpoints with curl, handle pagination across thousands of results, respect rate limits, and parse JSON responses into clean records.

REST APIs, curl, pagination, rate limiting, JSON parsing

T38 Data Visualization from CLI (+Quiz)

Turn numbers into pictures without leaving the terminal. Render bar charts with awk, draw real graphs with gnuplot, build inline sparklines, and generate histograms you can paste straight into a report.

awk charts, gnuplot basics, sparklines, histogram generation

T39 Report Automation (+Quiz)

Ship reports that land in inboxes on schedule. Template with sed, wire cron to trigger runs, format output for each channel, and understand the email delivery concepts behind automated briefings.

sed templates, cron scheduling, output formatting, email concepts

T40 Analytics Capstone (+Quiz)

Run a complete analysis end-to-end. Start from raw source data, clean it, run statistical checks, build visualizations, and deliver a formatted report that answers a real question.

end-to-end analysis from raw data to formatted report

Lab Dataset Handoff Audit

Audit an inherited dataset: find nulls, duplicates, and malformed records, then compute correlations and produce a trend visualization.

Covers: data cleaning (nulls, duplicates), regex validation, statistical correlation, awk bar charts, exploration

Exam Level 5 Master Exam

Prove you've mastered data analytics. 10 questions covering SQL window functions, data cleaning, statistics, regex, API collection, visualization, and report automation. Score 80% or higher to pass.

Covers: window functions, data cleaning, descriptive statistics, correlation, regex extraction, API pagination, chart selection, sed templates, cron scheduling

Cards Level 5 Flashcards

Drill every command and concept from Level 5. Cards pull from this level's quizzes and exam, and the ones you miss come back more often.

Database administration for production PostgreSQL. You'll move from sqlite3 to psql, design schemas with proper types (JSONB, TIMESTAMPTZ, NUMERIC), bulk-load data with \COPY and upserts, diagnose slow queries with EXPLAIN and index them, manage users/roles/permissions with GRANT and pg_hba.conf, use transactions and savepoints for ACID integrity, run pg_dump / pg_restore / VACUUM / ANALYZE for maintenance, and monitor a live database via pg_stat_activity, pg_stat_user_tables, pg_locks, and pg_stat_database.

T41 PostgreSQL Fundamentals (+Quiz)

Graduate from sqlite3 to a real client-server database. Connect with psql, learn the meta-commands that save your hands, work with PostgreSQL types, use RETURNING clauses, and wrap operations in transactions.

psql, meta-commands, types, RETURNING, transactions

T42 Schemas & Data Types (+Quiz)

Use PostgreSQL's type system to your advantage. Organize tables with schemas, query JSONB fields, store arrays natively, handle TIMESTAMPTZ for real time zones, and enforce rules with constraints.

namespaces, JSONB queries, arrays, TIMESTAMPTZ, constraints

T43 Data Loading Patterns (+Quiz)

Load data into PostgreSQL the way production systems do. Bulk-load with COPY, stage imports before they hit target tables, handle conflicts with upserts, and know when to TRUNCATE vs DELETE.

COPY, staging tables, ON CONFLICT upserts, TRUNCATE vs DELETE
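
The staging-then-upsert pattern looks roughly like the sketch below. It uses Python's `sqlite3` module as a stand-in for PostgreSQL, since SQLite 3.24 and newer accept a similar `ON CONFLICT ... DO UPDATE` clause; the table names and rows are hypothetical, and bulk loading with COPY is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical target and staging tables with two existing customers.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
    CREATE TABLE staging_customers (id INTEGER, email TEXT);
    INSERT INTO customers VALUES (1, 'old@example.com'), (2, 'keep@example.com');
""")

# Step 1: land the incoming batch in staging, where it can be inspected first.
conn.executemany(
    "INSERT INTO staging_customers VALUES (?, ?)",
    [(1, "new@example.com"), (3, "added@example.com")],
)

# Step 2: merge staging into the target in one transaction.
with conn:
    conn.execute("""
        INSERT INTO customers (id, email)
        SELECT id, email FROM staging_customers
        WHERE true   -- SQLite needs a WHERE clause here to parse the upsert
        ON CONFLICT (id) DO UPDATE SET email = excluded.email
    """)
    conn.execute("DELETE FROM staging_customers")

rows = conn.execute("SELECT id, email FROM customers ORDER BY id").fetchall()
print(rows)
conn.close()
```

Existing id 1 is updated, id 2 is left alone, and id 3 is inserted. Running the same batch twice would give the same final table, which is what makes the pattern safe to retry.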

T44 Indexing & EXPLAIN (+Quiz)

Make slow queries fast. Read EXPLAIN plans, design B-tree and partial indexes for real workloads, understand when the planner picks an index over a scan, and know when not to add one at all.

B-tree, partial indexes, query plans, cost analysis

Lab Database Fire Drill

Diagnose and fix a degraded production database: find and kill a runaway query, analyze the slow query log, and add a missing index.

Covers: pg_stat_activity, pg_terminate_backend, slow query log parsing, EXPLAIN, CREATE INDEX, query plan analysis

T45 Users, Roles & Permissions (+Quiz)

Control who can touch what in your database. Create roles, grant and revoke privileges, use role inheritance to avoid permission sprawl, and configure pg_hba.conf to lock down access.

CREATE ROLE, GRANT/REVOKE, inheritance, pg_hba.conf

T46 Transactions & Data Integrity (+Quiz)

Wrap risky operations in transactions that actually protect your data. Use BEGIN, COMMIT, and ROLLBACK correctly, set savepoints for partial rollbacks, pick the right isolation level, and avoid deadlocks.

BEGIN/COMMIT/ROLLBACK, SAVEPOINT, isolation, deadlocks
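
A savepoint lets you undo part of a transaction while keeping the rest. The sketch below shows the statement sequence using Python's `sqlite3` module as a stand-in for PostgreSQL; the `accounts` table is invented, and isolation levels and deadlocks are outside what this small example can show.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the explicit
# BEGIN / SAVEPOINT / COMMIT statements below are the only transaction control.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")

conn.execute("BEGIN")
# Both halves of the transfer belong to the same transaction.
conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")

conn.execute("SAVEPOINT bonus")
conn.execute("UPDATE accounts SET balance = balance + 1000 WHERE name = 'bob'")
conn.execute("ROLLBACK TO SAVEPOINT bonus")   # undo only the bonus update

conn.execute("COMMIT")                        # the transfer itself is kept

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
conn.close()
```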

T47 Backup, Restore & Maintenance (+Quiz)

Protect your database from the worst day. Take consistent backups with pg_dump, restore them with pg_restore, run VACUUM and ANALYZE to keep things fast, and schedule the work so it happens automatically.

pg_dump formats, pg_restore, VACUUM, ANALYZE, scheduling

T48 Production Database Operations (+Quiz)

Run a PostgreSQL database in production. Watch active connections with pg_stat_activity, find hot tables with pg_stat_user_tables, catch slow queries in the log, and understand why you'll want connection pooling.

pg_stat_activity, pg_stat_user_tables, slow queries, pooling

Lab Data Integrity Incident

Investigate a cross-customer data leak by tracing through role permissions, upsert logic, and transaction isolation to find the root causes.

Covers: role permissions, GRANT/REVOKE, least privilege, ON CONFLICT upsert diagnosis, transactions, VACUUM ANALYZE, answer tasks

Exam Level 6 Master Exam

Prove you've mastered PostgreSQL administration. 10 questions covering psql, schemas, data loading, indexing, users/roles, transactions, backup, and production operations. Score 80% or higher to pass.

Covers: psql, JSONB, TIMESTAMPTZ, \COPY, ON CONFLICT, indexes, EXPLAIN, GRANT/REVOKE, pg_hba.conf, transactions, SAVEPOINT, isolation, pg_dump, VACUUM, ANALYZE, pg_stat_*

Cards Level 6 Flashcards

Drill every command and concept from Level 6. Cards pull from this level's quizzes and exam, and the ones you miss come back more often.

Data pipeline engineering from first principles. You'll design idempotent pipelines with state management, extract from paginated APIs with retries and backoff, transform with validation and deduplication, load via staging → upsert transactions, handle errors with structured logs and dead letter queues, emit JSON-lines observability, assert data quality between stages, and recover from failures using precise error messages and dependency graphs.

T49 Pipeline Architecture (+Quiz)

Decide how your pipeline should actually work. Compare ETL vs ELT tradeoffs, design for idempotency so retries are safe, manage state between runs, and externalize configuration so environments don't drift.

ETL vs ELT, idempotency, state management, configuration

T50 Extraction Patterns (+Quiz)

Pull data from sources without breaking them. Paginate through APIs, respect rate limits, extract only new records since last run, and detect when a schema changes before it breaks your pipeline.

API pagination, rate limiting, incremental, schema change detection
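
Cursor pagination combined with incremental extraction might look like the sketch below. To stay self-contained it replaces the HTTP call with a hypothetical `fetch_page` function backed by canned pages; the `items`, `next`, and `updated_at` field names are assumptions, and rate limiting is left out.

```python
# Hypothetical API responses keyed by cursor; a real client would make HTTP calls.
FAKE_PAGES = {
    None: {
        "items": [
            {"id": 1, "updated_at": "2024-03-01"},
            {"id": 2, "updated_at": "2024-03-02"},
        ],
        "next": "p2",
    },
    "p2": {"items": [{"id": 3, "updated_at": "2024-03-03"}], "next": None},
}

def fetch_page(cursor):
    """Stand-in for one API request; returns a page of items and the next cursor."""
    return FAKE_PAGES[cursor]

def extract_since(watermark, max_pages=100):
    """Walk every page and keep only records newer than the stored watermark."""
    records, cursor = [], None
    for _ in range(max_pages):        # hard cap so a bad cursor cannot loop forever
        page = fetch_page(cursor)
        # ISO dates compare correctly as strings.
        records.extend(r for r in page["items"] if r["updated_at"] > watermark)
        cursor = page["next"]
        if cursor is None:
            break
    new_watermark = max((r["updated_at"] for r in records), default=watermark)
    return records, new_watermark

first_run, mark = extract_since("2024-03-01")
second_run, mark_again = extract_since(mark)
print([r["id"] for r in first_run], mark, second_run)
```

The second run returns nothing new because the watermark from the first run is passed back in. Persisting that watermark between runs is the state the pipeline has to manage.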

T51 Transformation Patterns (+Quiz)

Shape raw input into usable output. Apply validation rules, cast types safely, deduplicate on composite keys, denormalize for query speed, and know when to use Python vs SQL for each step.

validation rules, type casting, deduplication, denormalization

T52 Loading Strategies (+Quiz)

Land data in the target without corrupting it. Choose between full reloads, incremental appends, and merge/upsert patterns, and use staging tables to verify data before it hits production.

full reload, incremental append, merge/upsert, staging→target

Lab Inherited Pipeline Audit

Explore an undocumented pipeline, trace a successful run through the logs, and evaluate whether the pipeline is idempotent.

Covers: pipeline architecture, incremental extraction, load strategies, structured log analysis, row count assertions, idempotency evaluation, exploration

T53 Error Handling & Recovery (+Quiz)

Design pipelines that fail safely. Fail loud when the data is wrong, catch expected errors with try/except, retry transient failures with backoff, and route permanently bad records to a dead letter queue.

fail loud, try/except, retry with backoff, dead letter queues
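
The sketch below puts retry with exponential backoff and a dead letter queue side by side. `TransientError`, the demo `handler`, and the delay values are hypothetical; the point is the split between errors worth retrying and records that will never succeed.

```python
import time

class TransientError(Exception):
    """Hypothetical error type for failures that may succeed on retry."""

def retry(func, attempts=3, base_delay=0.01, sleep=time.sleep):
    """Call func, doubling the wait after each transient failure."""
    for attempt in range(attempts):
        try:
            return func()
        except TransientError:
            if attempt == attempts - 1:
                raise                          # out of attempts: fail loud
            sleep(base_delay * 2 ** attempt)

def process(records, handler, sleep=time.sleep):
    """Load good records; route permanently bad ones to a dead letter list."""
    loaded, dead_letters = [], []
    for record in records:
        try:
            loaded.append(retry(lambda: handler(record), sleep=sleep))
        except ValueError as exc:              # bad data will not improve on retry
            dead_letters.append({"record": record, "error": str(exc)})
    return loaded, dead_letters

# Demo handler: "flaky" fails twice before succeeding, "bad" always fails.
calls = {"flaky": 0}

def handler(record):
    if record == "flaky":
        calls["flaky"] += 1
        if calls["flaky"] < 3:
            raise TransientError("timeout")
    if record == "bad":
        raise ValueError("not a valid record")
    return record.upper()

delays = []   # record the requested waits instead of actually sleeping
loaded, dead = process(["ok", "flaky", "bad"], handler, sleep=delays.append)
print(loaded, dead, delays)
```

A transient error that survives every attempt is re-raised on purpose, so the run stops instead of silently losing a record.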

T54 Logging & Observability (+Quiz)

See what your pipeline is actually doing. Emit structured JSON-lines logs, use Python's logging module properly, expose metrics per stage, and wire alerts so someone knows when it breaks.

JSON-lines logging, Python logging, metrics, alerting
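
One way to get JSON-lines output from Python's `logging` module is a custom formatter, sketched below. The `fields` key passed through `extra` and the logger name are conventions invented for this example; a production formatter would usually add a timestamp as well.

```python
import io
import json
import logging

class JsonLinesFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # "fields" is a hypothetical convention for structured context.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

stream = io.StringIO()                 # stands in for a log file or stdout
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonLinesFormatter())

log = logging.getLogger("pipeline.sketch")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False                  # keep the demo output out of the root logger

log.info("stage finished", extra={"fields": {"stage": "extract", "rows": 1200}})
log.error("stage failed", extra={"fields": {"stage": "load"}})

# Each line parses on its own, which is what makes the format easy to query.
events = [json.loads(line) for line in stream.getvalue().splitlines()]
print(events)
```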

T55 Testing & Data Quality (+Quiz)

Verify the pipeline did what it claims. Assert expected row counts, check for nulls in required fields, enforce referential integrity across tables, and catch schema drift before downstream queries break.

row count assertions, null checks, referential integrity, schema drift
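
Three of these checks can be expressed as plain SQL counts, as in the sketch below. It uses Python's `sqlite3` module with invented `customers` and `orders` tables that deliberately contain one null email and one orphaned order; the expected row count of 3 is an assumption standing in for a value the upstream stage would report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables seeded with two deliberate data-quality problems.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'a@example.com'), (2, NULL);
    INSERT INTO orders VALUES (10, 1, 9.5), (11, 2, 20.0), (12, 99, 5.0);
""")

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

checks = {
    # Row count assertion: the load should have produced exactly 3 orders.
    "orders_row_count": scalar("SELECT COUNT(*) FROM orders") == 3,
    # Null check on a required field.
    "customer_email_not_null": scalar(
        "SELECT COUNT(*) FROM customers WHERE email IS NULL"
    ) == 0,
    # Referential integrity: every order must point at an existing customer.
    "orders_have_customers": scalar("""
        SELECT COUNT(*) FROM orders o
        LEFT JOIN customers c ON c.id = o.customer_id
        WHERE c.id IS NULL
    """) == 0,
}
failed = sorted(name for name, ok in checks.items() if not ok)
print(failed)
conn.close()
```

A pipeline stage would typically stop with a non-zero exit when `failed` is not empty.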

T56 Pipeline Capstone (+Quiz)

Trace a full pipeline from source to consumer. Extract, validate, transform, load, verify, log, and hand off a system that runs unattended and recovers from the failures you expect.

extract, validate, transform, load, verify, log

Lab Pipeline Recovery Decision

Diagnose a load-stage failure, verify the staged data is intact, then choose between a full pipeline replay or a targeted resume from the checkpoint.

Covers: structured log parsing, error diagnosis, dead letter queues, staging verification, choose-your-approach recovery, post-load row count assertion

Exam Level 7 Master Exam

Prove you've mastered production data pipeline engineering. 10 questions covering architecture, extraction, transformation, loading, error handling, observability, testing, and recovery. Score 80% or higher to pass.

Covers: ETL vs ELT, pipeline stages, idempotency, state management, pagination, retry with backoff, schema drift, validation rules, dead letter queues, staging→merge upsert, structured logging, metrics, row count assertions, referential integrity, failure classification

Cards Level 7 Flashcards

Drill every command and concept from Level 7. Cards pull from this level's quizzes and exam, and the ones you miss come back more often.

Machine learning from the command line with scikit-learn. You'll build features from CSV with pandas, apply one-hot and ordinal encoding, split train/test with random_state, train decision-tree and linear models via the sklearn fit/predict API, save models with joblib, evaluate with MAE/RMSE/R² for regression and accuracy/precision/recall/F1 for classification, read confusion matrices and feature importances, build batch prediction pipelines with feature-name reindexing, train TF-IDF + Naive Bayes text classifiers with Pipeline, and monitor production models for data and concept drift.

T57 ML Concepts for Data Engineers (+Quiz)

Learn the ideas behind every machine learning model. Understand when to use classification vs regression, what features and labels actually are, how training differs from inference, and why overfitting ruins otherwise good models.

classification vs regression, features, labels, train/test, overfitting

T58 Feature Engineering from CLI (+Quiz)

Turn raw columns into features a model can actually learn from. Encode categorical variables, derive new fields from existing ones, handle missing values, and scale numeric features so no single column dominates.

one-hot, ordinal encoding, derived features, nulls, scaling

T59 Training Models with scikit-learn (+Quiz)

Train your first real model. Use scikit-learn's fit/predict API, start with Decision Trees and Linear Regression, split your data for honest evaluation, and persist trained models to disk with joblib.

scikit-learn fit/predict, Decision Trees, Linear Regression, joblib

T60 Model Evaluation (+Quiz)

Measure whether your model is actually any good. Compute MAE, RMSE, and R² for regression; accuracy, precision, recall, and F1 for classification. Read a confusion matrix without squinting.

MAE, RMSE, R², accuracy, precision, recall, F1, confusion matrix
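
The formulas behind these metrics are short enough to write by hand, which the sketch below does in plain Python. The sample values are invented, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, RMSE, and R-squared from paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1 - ss_res / ss_tot,
    }

def classification_metrics(y_true, y_pred, positive=1):
    """Confusion-matrix counts plus accuracy, precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision, "recall": recall, "f1": f1,
    }

# Hypothetical predictions for a regression and a binary classification task.
reg = regression_metrics([3, 5, 7, 9], [2, 5, 8, 11])
clf = classification_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 0, 1, 0, 1, 0, 1, 0])
print(reg)
print(clf)
```

Precision asks how many predicted positives were right, and recall asks how many actual positives were found. The four confusion-matrix counts are enough to derive every classification metric listed above.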

Lab Model Performance Autopsy

Investigate a churn model whose recall dropped from 0.85 to 0.53 by tracing through the confusion matrix, feature importance, and drift detection to find the root cause.

Covers: model evaluation, classification metrics, confusion matrix interpretation, feature importance, data drift detection, answer tasks

T61 Prediction Pipelines (+Quiz)

Put trained models to work. Load saved models with joblib, run batch predictions on thousands of rows, write results to CSV or PostgreSQL, and version models so you know which one produced which output.

loading models, batch prediction, writing to CSV/PostgreSQL, versioning

T62 Text Processing for ML (+Quiz)

Turn raw text into something a model can classify. Use bag of words and TF-IDF to represent documents, train a Naive Bayes classifier, and tie it all together with scikit-learn's Pipeline.

bag of words, TF-IDF, Naive Bayes, scikit-learn Pipeline

T63 Model Monitoring (+Quiz)

Watch for the moment your model stops working. Detect data drift in incoming features, spot prediction drift in the outputs, track monitoring metrics over time, and know when it's time to retrain.

data drift, prediction drift, monitoring metrics, when to retrain

T64 ML Capstone (+Quiz)

Ship an end-to-end ML pipeline. Extract data, engineer features, train a model, evaluate it, run batch predictions in production, and monitor the system over time. Every stage, in sequence.

extract, engineer, train, evaluate, predict, monitor

Lab Deployment Threshold

Choose a deployment threshold for a churn model by comparing an aggressive approach (high recall, low precision) against a conservative approach (high precision, low recall).

Covers: batch prediction, classification metrics, precision-recall tradeoff, choose-your-approach, model monitoring setup

Exam Level 8 Master Exam

Prove you've mastered CLI-based machine learning. 10 questions covering ML concepts, feature engineering, scikit-learn training, evaluation metrics, prediction pipelines, text ML, and model monitoring. Score 80% or higher to pass.

Covers: classification vs regression, features vs labels, train/test split, one-hot vs ordinal, feature scaling, DecisionTree/LinearRegression API, fit/predict pattern, joblib persistence, MAE/RMSE/R², accuracy/precision/recall/F1, confusion matrix, feature importance, overfitting detection, feature_names_in_ reindex, batch prediction, TF-IDF + Naive Bayes + Pipeline, data/label/concept drift, drift thresholds, retrain triggers

Cards Level 8 Flashcards

Drill every command and concept from Level 8. Cards pull from this level's quizzes and exam, and the ones you miss come back more often.

Python for data engineering from first principles. You'll write modular multi-file Python projects with functions and docstrings, connect to databases with DB-API 2.0 (sqlite3/psycopg2), create matplotlib and seaborn visualizations with the Agg backend, ingest data from Excel with skiprows, handle encoding issues with Latin-1, paginate APIs with requests, scrape HTML with BeautifulSoup, write advanced pandas with method chains/pipe/chunks/memory optimization/reshaping, build automated analysis pipelines with argparse and logging, and generate self-contained HTML reports with base64-embedded PNG charts.

T65 Python Project Fundamentals (+Quiz)

Get comfortable in Python. Work with variables, data types, functions, conditionals, loops, and the core data structures (dicts and lists) that every data script depends on.

variables, data types, functions, conditionals, loops, dicts, lists

T66 Python for Databases (+Quiz)

Push pandas to production scale. Pick the right merge strategy, pivot complex tables, apply window functions, chain methods for readability, and understand where the performance cliffs are.

merge strategies, pivot tables, window functions, chaining, performance

T67 Data Visualization with matplotlib (+Quiz)

Connect Python to PostgreSQL and treat the database like any other resource. Manage connections with context managers, use parameterized queries to avoid SQL injection, and bulk-load thousands of rows in one call.

psycopg2, connection management, parameterized queries, bulk ops

T68 Data Visualization with seaborn (+Quiz)

Pull data from APIs in Python the right way. Handle pagination with the requests library, respect rate limits, retry transient errors, and extract clean records from noisy responses.

requests, pagination, rate limiting, error handling, extraction

Lab Automated Report Pipeline

Build an automated weekly report: query the database, compute monthly metrics, generate a correlation heatmap, render a dashboard chart, and produce a self-contained HTML report.

Covers: sqlite3 from Python, pandas read_sql and groupby, seaborn heatmap, matplotlib charts, base64-embedded HTML reports, argparse

T69 Real-World Data Ingestion (+Quiz)

Pull data from the messy sources real projects depend on. Read Excel files whose real header sits several rows down with skiprows, handle Latin-1 encoded files, paginate APIs with the requests library, and scrape HTML with BeautifulSoup.

read_excel with skiprows, Latin-1 encoding, API pagination with requests, BeautifulSoup
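
API pagination is one of the ingestion patterns this level covers. Here is a sketch of a cursor-following loop, written against an in-memory stand-in so it runs without a network. The page shape (an "items" list plus a "next" cursor) is an assumption; real APIs vary, and with requests the fetch function would typically wrap something like requests.get(url, params=..., timeout=...).json().

```python
def fetch_all(fetch_page, max_pages=100):
    """Collect items from a paginated source until no next cursor is returned.

    fetch_page takes a cursor (None for the first page) and returns a dict
    with an "items" list and a "next" cursor. max_pages guards against an
    API that never stops handing out cursors.
    """
    records = []
    cursor = None
    for _ in range(max_pages):
        page = fetch_page(cursor)
        records.extend(page["items"])
        cursor = page.get("next")
        if cursor is None:
            break  # last page reached
    return records


# Hypothetical stand-in for an HTTP API, invented for this sketch.
FAKE_PAGES = {
    None: {"items": [1, 2], "next": "p2"},
    "p2": {"items": [3, 4], "next": "p3"},
    "p3": {"items": [5], "next": None},
}


def fake_fetch(cursor):
    """Return the canned page for a cursor, as a real API call would."""
    return FAKE_PAGES[cursor]


print(fetch_all(fake_fetch))
```

Passing the fetch function in as an argument keeps the loop testable and leaves rate limiting or retries to the function that actually talks to the network.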

T70 Advanced pandas Patterns (+Quiz)

Push pandas to production scale. Chain methods and use pipe() for readability, read large files in pieces with chunksize, cut memory use with the category dtype and downcasting, and reshape tables with melt and pivot_table.

method chaining, pipe(), chunksize, category dtype, downcasting, melt, pivot_table

T71 Automated Analysis Pipeline (+Quiz)

Build an analysis pipeline that runs without hand-holding. Drive it from the command line with argparse, record each stage with logging, and finish by generating a self-contained HTML report with base64-embedded PNG charts.

argparse, logging, base64-embedded HTML reports
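
The level overview pairs automated pipelines with argparse and logging. A minimal skeleton of that combination might look like the following; the flag names, stage names, and default output path are hypothetical, and the stages are placeholders that only log.

```python
import argparse
import logging

log = logging.getLogger("pipeline")


def parse_args(argv=None):
    """Parse command-line flags; argv=None means read from sys.argv."""
    parser = argparse.ArgumentParser(description="Run the analysis pipeline.")
    parser.add_argument("--input", required=True, help="path to the source data")
    parser.add_argument("--output", default="report.html", help="report destination")
    parser.add_argument("--verbose", action="store_true", help="log at DEBUG level")
    return parser.parse_args(argv)


def run(args):
    """Walk through placeholder stages in order and return the ones completed."""
    completed = []
    for stage in ("extract", "analyze", "report"):
        log.info("starting stage %s (input=%s)", stage, args.input)
        completed.append(stage)  # a real stage would do its work here
    log.info("a real run would now write %s", args.output)
    return completed


def main(argv=None):
    """Configure logging from the flags, then run the pipeline."""
    args = parse_args(argv)
    logging.basicConfig(
        level=logging.DEBUG if args.verbose else logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    return run(args)


if __name__ == "__main__":
    # Explicit arguments keep the demo self-contained; a real script would call main().
    main(["--input", "sales.csv"])
```

Accepting argv as a parameter lets the same entry point be called from tests or from a scheduler without touching sys.argv.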

T72 Python Data Capstone (+Quiz)

Build a complete production data tool in Python. Connect every piece you've learned (extraction, validation, transformation, database loading, visualization, and scheduling) into a single deliverable.

production data pipeline tool in Python

Lab Report Discrepancy

Investigate a $201K revenue discrepancy between the automated report and finance's numbers by tracing duplicate records back to the Excel ingestion source.

Covers: pandas read_sql, read_excel, read_csv, duplicated, drop_duplicates, data reconciliation, answer tasks

Exam Level 9 Master Exam

Prove you've mastered Python for data engineering. 10 questions covering Python project structure, database access, visualization with matplotlib and seaborn, data ingestion from Excel/APIs/web, advanced pandas patterns, automated analysis pipelines, and HTML report generation. Score 80% or higher to pass.

Covers: __name__ == '__main__' guard, venv isolation, Counter/defaultdict, DB-API 2.0 parity, parameterized queries, pd.read_sql, transactions, matplotlib Agg + OO style, subplots, seaborn heatmap/boxplot/pairplot, read_excel with skiprows, Latin-1 encoding, API pagination, BeautifulSoup, method chaining, pipe(), chunksize for large files, category dtype + downcast, melt/pivot_table, argparse, logging, base64-embedded HTML reports

Cards Level 9 Flashcards

Drill every command and concept from Level 9. Cards pull from this level's quizzes and exam, and the ones you miss come back more often.

Cross-level master labs that draw on skills from all 9 levels. There are four bonus labs, with no tutorials, no quizzes, and no exam: Messy Data Challenge (clean a vendor dump end-to-end), Pipeline Debugging (fix a broken 3-stage ETL), Automated Intelligence Report (build a daily executive briefing from 90 days of data), and ML Model Showdown (train and compare three classifiers on loan default prediction). Progress is tracked independently.

Lab Cold Case: The Missing Records

Explore an undocumented data warehouse, trace missing December records through the pipeline stages, and identify whether the failure was a code bug or a source data quality issue.

Covers: exploration, duckdb/sqlite3 querying, wc -l, grep log analysis, awk aggregation, pipeline stage diagnosis, answer tasks

Lab Pipeline Resurrection

Resurrect a 3-stage ETL pipeline with cascading failures: fix the API endpoint, fix the date format parser, identify a missing verification step, and fix the upsert conflict target.

Covers: grep log analysis, sed -i config repair, error diagnosis, pipeline design evaluation, answer tasks, Fix-It
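
To illustrate what an upsert conflict target is, here is a small sqlite3 sketch. The table, columns, and rows are invented, and the lab's actual database and schema may differ. The point is that the columns named in ON CONFLICT must match a primary key or unique constraint. The syntax needs SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per (sale_date, store_id).
conn.execute(
    "CREATE TABLE daily_sales ("
    " sale_date TEXT, store_id INTEGER, revenue REAL,"
    " PRIMARY KEY (sale_date, store_id))"
)

# The conflict target names the same columns as the primary key.
UPSERT = """
INSERT INTO daily_sales (sale_date, store_id, revenue)
VALUES (?, ?, ?)
ON CONFLICT (sale_date, store_id) DO UPDATE SET revenue = excluded.revenue
"""


def load(conn, rows):
    """Insert new rows and overwrite revenue for keys that already exist."""
    with conn:  # one transaction per load
        conn.executemany(UPSERT, rows)


load(conn, [("2024-12-01", 1, 100.0), ("2024-12-01", 2, 50.0)])
load(conn, [("2024-12-01", 1, 120.0)])  # re-run with a corrected figure

rows = conn.execute(
    "SELECT sale_date, store_id, revenue FROM daily_sales ORDER BY store_id"
).fetchall()
print(rows)
```

If the conflict target named only sale_date, SQLite would reject the statement because no unique constraint covers that column alone, which is one way a wrong conflict target can surface.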

Lab Briefing Pipeline Breakdown

Explore a briefing pipeline, diagnose why the daily report stopped sending, and fix the cron path and data source configuration bugs.

Covers: exploration, grep log analysis, sed -i config repair, awk aggregation, sed template rendering, cron, Fix-It

Lab Model Degradation Postmortem

Conduct a full postmortem on a degraded production ML model: compare metrics against baseline, identify the drifted feature, determine whether it's data drift or concept drift, and decide whether to retrain or investigate first.

Covers: Python DB queries, classification metrics comparison, data drift detection, feature importance, concept drift vs data drift, answer tasks
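
One rough way to look for a drifted feature is to compare each feature's mean in recent data against the baseline, measured in baseline standard deviations. The sketch below is an illustration under that assumption, with invented numbers and an arbitrary threshold; it is not the lab's method, and a statistical test on the full distributions would be more robust. It checks data drift only (inputs changing). Concept drift (the relationship between inputs and the target changing) would need labeled outcomes to detect.

```python
from statistics import mean, stdev


def drift_score(baseline, current):
    """Return the shift in the mean, in units of the baseline standard deviation."""
    spread = stdev(baseline)
    shift = abs(mean(current) - mean(baseline))
    if spread == 0:
        return 0.0 if shift == 0 else float("inf")
    return shift / spread


def drifted_features(baseline, current, threshold=2.0):
    """Return the names of features whose drift score exceeds the threshold."""
    return [
        name
        for name in baseline
        if drift_score(baseline[name], current[name]) > threshold
    ]


# Hypothetical feature samples, invented for this sketch.
BASELINE = {"income": [50, 52, 48, 51, 49], "age": [30, 35, 40, 45, 50]}
CURRENT = {"income": [70, 72, 68, 71, 69], "age": [31, 36, 39, 44, 50]}

print(drifted_features(BASELINE, CURRENT))
```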
