Data Pro

Data Engineering & Analytics

From Linux basics to machine learning in 10 levels. Learn awk, jq, SQL, pandas, duckdb, PostgreSQL, ETL pipelines, scikit-learn, and Python for data automation.

72 Tutorials
22 Labs
72 Quizzes
9 Exams

Full Curriculum


The foundation everything else builds on. You'll learn to navigate a Linux system, create and manage files, control who can access what, and chain commands together to automate tasks. By the end, you'll be able to set up users, connect to servers over SSH, install software, and run your first Docker container.

T01 Files & Navigation (+Quiz)
Learn to move around a Linux filesystem, create and organize directories, copy, move, and remove files, and search for anything on the system.
ls, cd, pwd, mkdir, rm, cp, mv, find
T02 Viewing, Editing & Permissions (+Quiz)
Read files from the terminal, edit them with nano, and control exactly who can read, write, or execute every file on the system.
cat, head, tail, less, nano, chmod, chown
T03 Pipes & Redirection (+Quiz)
Chain commands together to filter, sort, count, and search through data. Redirect output to files, combine streams, and build one-liners that do real work.
|, >, >>, 2>&1, tee, sort, uniq, wc, grep
T04 Processes & System (+Quiz)
See what's running on your system, monitor resource usage in real time, stop runaway processes, and manage background jobs.
ps, top, kill, bg, fg, jobs
Lab 1A: Server Cleanup
Put your skills to the test. You're handed a messy server with oversized log files and runaway processes. Find them, clean them up, and bring the system back to health.
Covers: filesystem audit, log cleanup, process management, disk recovery, system health
T05 Users & SSH (+Quiz)
Create and manage user accounts, configure sudo access, generate SSH keys, and connect to remote servers securely.
useradd, passwd, sudo, ssh, ssh-keygen, authorized_keys
T06 Networking (+Quiz)
Check your network interfaces, test connectivity, make HTTP requests from the terminal, understand how DNS resolves names, and identify what's listening on which ports.
ip, ss, ping, curl, DNS resolution, ports
T07 Package Management (+Quiz)
Install, update, and remove software. Manage repositories, resolve dependency issues, and keep your system current.
apt, dpkg, repositories, dependencies
T08 Docker Basics (+Quiz)
Pull images, run containers, map ports, and manage the container lifecycle. Your first step into the tool that changed how software gets deployed.
images, containers, run, ps, stop, rm, port mapping
Lab 1B: New Server Checklist
You just got access to a fresh server. Set up users, configure SSH, install essential packages, and get it production-ready from scratch.
Covers: user setup, SSH configuration, package installation, server hardening, baseline config
Exam Level 1 Master Exam
Prove you've mastered the fundamentals. 10 questions covering everything from Level 1. Score 80% or higher to pass.
Covers: files, permissions, pipes, processes, users, SSH, networking, packages, docker

Your bridge from Linux to data work. Recognize and inspect data formats (CSV, JSON, TSV), use grep with regex for pattern extraction, clean data with sed, extract columns with cut/paste/join, get started with awk, parse JSON with jq, write basic SQL with sqlite3, and chain it all together in a mini pipeline.

T09 Data Formats & Structure (+Quiz)
Recognize and inspect the formats you'll see every day. Read CSV, TSV, JSON, and fixed-width files, measure them with head and wc, and understand the shape of data before you process it.
CSV, TSV, JSON, fixed-width, head, wc, awk
T10 grep & Regex for Data (+Quiz)
Extract patterns from raw text. Master character classes, quantifiers, and anchors, then use grep -oE to pull IPs, dates, emails, and anything else matching a pattern out of messy files.
character classes, quantifiers, anchors, grep -oE, extracting patterns
T11 sed for Data Cleaning (+Quiz)
Clean up messy data at the command line. Do substitutions, strip whitespace, delete bad lines, and edit files in place using address ranges when you need surgical control.
substitution, whitespace cleanup, line deletion, in-place editing
T12 cut, paste & join (+Quiz)
Reshape tabular data without loading it into a spreadsheet. Pull specific columns with cut, stitch files side-by-side with paste, and merge datasets on a shared key with join.
column extraction, merging files, combining on shared keys
Lab 2A: Clean the Export
A vendor hands you a messy CSV with mixed whitespace, inconsistent dates, and bad encoding. Clean it up using sed, cut, and grep until it's ready for the analytics team to import.
Covers: whitespace cleanup, date normalization, encoding fixes, column extraction, import validation
T13 Introduction to awk (+Quiz)
Your first real text-processing language. Split fields on any delimiter, filter rows by condition, reference fields by number, and run calculations like sum and average right from the command line.
field splitting, -F delimiter, $1/$NF, NR, filtering, calculations
T14 Introduction to JSON & jq (+Quiz)
Work with the format modern APIs actually return. Understand JSON structure, navigate it with jq's dot notation, extract specific fields, iterate through arrays, and filter results by condition.
JSON structure, dot notation, field extraction, array iteration
T15 Introduction to SQL (+Quiz)
Write your first real queries against a real database. Use sqlite3 to SELECT rows, filter with WHERE, sort with ORDER BY, aggregate with GROUP BY, and count, sum, and average anything you need.
sqlite3, SELECT, WHERE, ORDER BY, LIMIT, GROUP BY, aggregates
T16 Mini Data Pipeline (+Quiz)
Put everything from this level together. Chain sed, awk, sort, uniq, jq, and sqlite3 into a single end-to-end pipeline that takes raw input and produces clean queryable output.
sed, awk, sort, uniq, jq, sqlite3 end-to-end
Lab 2B: Pipeline From Scratch
Build a working data pipeline on your own. Start with raw CSV and JSON inputs, clean them, join them, load into sqlite3, and produce a summary report that answers a real business question.
Covers: CSV/JSON ingestion, data joining, sqlite3 loading, summary reporting, count verification
Exam Level 2 Master Exam
Prove you can handle data from the command line. 10 questions covering formats, regex, sed, cut, awk, jq, SQL, and pipelines. Score 80% or higher to pass.
Covers: data formats, regex, sed, cut, awk, jq, SQL, pipelines

Now you go deep. Advanced awk with BEGIN/END blocks, associative arrays, and string functions. Full jq mastery. csvkit power tools. Advanced SQL with JOINs and subqueries. Multi-stage pipelines. Log analytics. Report generation. Python data tools introduction.

T17 awk Deep Dive (+Quiz)
Unlock awk as a real programming language. Use BEGIN and END blocks for setup and teardown, write conditionals, build associative arrays, manipulate strings, and process multiple files at once.
BEGIN/END blocks, conditionals, arrays, string functions, printf
T18 jq & JSON Advanced (+Quiz)
Reshape any JSON document into anything else. Use select filters, build new objects, reduce arrays, group by any field, and convert JSON straight to CSV when downstream tools need it.
array iteration, select filters, reduce, group_by, CSV output
T19 csvkit (+Quiz)
Power tools purpose-built for CSVs. Preview files with csvlook, cut columns with csvcut, filter rows with csvgrep, run statistics with csvstat, and query CSV files directly with csvsql.
csvlook, csvcut, csvgrep, csvstat, csvsql
T20 sqlite3 Advanced (+Quiz)
Go beyond simple SELECTs. Join tables, filter groups with HAVING, write subqueries, use advanced aggregates, and create views that let you treat complex queries like single tables.
JOINs, GROUP BY HAVING, aggregates, subqueries, views
Lab 3A: Analyze the Logs
A week of raw web server logs just landed in your lap. Parse them with awk, aggregate with sort and uniq, and produce the top-N reports the ops team needs before the meeting starts.
Covers: log parsing, awk aggregation, top-N reporting, sort/uniq analysis, ops delivery
T21 Data Pipelines (+Quiz)
Build real multi-stage pipelines. Chain awk, jq, sort, and uniq into sequences that extract, transform, aggregate, and deliver results on a schedule with no babysitting.
awk, jq, sort, uniq chained into multi-stage processing
T22 Log Analytics (+Quiz)
Turn web and system logs into metrics. Parse access logs into structured fields, extract request counts and response times, run top-N analyses, and bucket events by time window.
access logs, metrics extraction, top-N analysis, time aggregation
T23 Report Generation (+Quiz)
Ship reports that actually look ready for leadership. Format output with awk, template sections with sed, and build repeatable report scripts your team can run on any schedule.
formatted output with awk, sed templates, automated scripts
T24 Python Data Tools Intro (+Quiz)
Reach for Python when bash runs out of road. Write one-liners with python3, use the csv and json modules to read and write structured files, and start moving from shell one-liners to full Python scripts.
python3 one-liners, csv module, json module, basic scripting
Lab 3B: Daily Metrics Report
Build an automated daily report from scratch. Pull yesterday's logs, aggregate metrics by endpoint, format the output, and save a versioned report that runs on cron every morning.
Covers: log extraction, endpoint aggregation, output formatting, cron scheduling, versioned reports
Exam Level 3 Master Exam
Prove you can analyze and wrangle real data. 10 questions covering advanced awk, jq, csvkit, SQL joins, pipelines, log analytics, and report generation. Score 80% or higher to pass.
Covers: advanced awk, jq, csvkit, SQL joins, pipelines, log analytics, reporting
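
For a taste of what T24 covers, here is a minimal sketch of reading CSV with the csv module and writing JSON with the json module. It is an illustration, not course material: the function name csv_to_json_lines and the sample rows are assumptions made up for this example.

```python
import csv
import io
import json


def csv_to_json_lines(csv_text):
    """Convert CSV text (with a header row) into one JSON string per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # sort_keys keeps the key order stable from run to run
    return [json.dumps(row, sort_keys=True) for row in reader]


# Hypothetical sample data, for illustration only.
sample = "endpoint,status\n/login,200\n/cart,500\n"
for line in csv_to_json_lines(sample):
    print(line)
```

Every value comes back as a string, because the csv module does not guess types. Casting is left to the caller.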

Bigger data, better tools. duckdb for analytical SQL directly on files. Pandas for DataFrame operations. Advanced awk and jq pipelines. Python ETL scripting. Pipeline orchestration with cron and dependency management.

T25 duckdb Basics (+Quiz)
Query CSV, JSON, and Parquet files directly with SQL, no import step required. Point duckdb at a file and start running analytical queries on millions of rows in seconds.
analytical SQL on CSV/JSON/Parquet, direct file queries
T26 duckdb Analytics (+Quiz)
Run serious analytics without a warehouse. Use window functions, build CTEs for readability, compute complex aggregations, and join across files as if they were tables in the same database.
window functions, CTEs, complex aggregations, cross-file joins
T27 pandas Basics (+Quiz)
Load data into DataFrames and start working with it. Use read_csv, filter and select rows, understand dtypes, and inspect your data with head, tail, and describe before you transform anything.
read_csv, DataFrames, filtering, dtypes, head/tail/describe
T28 pandas Transforms (+Quiz)
Reshape data the way analysts actually work. Group and aggregate with groupby, join tables with merge, pivot long to wide, apply custom functions, and chain methods to build readable pipelines.
groupby, merge, pivot_table, apply, method chaining, output
Lab 4A: Build an ETL
Write a real ETL pipeline end-to-end. Extract from a messy source CSV, transform with pandas, load results into sqlite3, and produce a summary report that verifies every row landed correctly.
Covers: source extraction, pandas transformation, sqlite3 loading, row verification, summary reporting
T29 Advanced awk Pipelines (+Quiz)
Push awk further than most engineers ever do. Write multi-pass pipelines, coordinate processes with co-processing, and manipulate fields in ways that would take dozens of lines in Python.
multi-pass processing, co-processing, complex field manipulation
T30 Advanced jq Pipelines (+Quiz)
Transform deeply nested JSON into whatever shape you need. Chain nested transforms, reduce across arrays, group by any key, and reshape complex API responses without touching a Python script.
nested transforms, reduce, group_by, complex JSON reshaping
T31 Python ETL Scripts (+Quiz)
Write ETL scripts that survive production. Structure extract-transform-load as separate functions, handle CLI flags with argparse, manage file I/O safely, and catch errors before they corrupt the target.
extract-transform-load, argparse, file I/O, error handling
T32 Pipeline Orchestration (+Quiz)
Schedule pipelines that run reliably without a human watching. Chain dependencies with cron, prevent overlapping runs with lock files, log every step, and catch failures before data consumers do.
cron, dependency chains, lock files, logging, monitoring
Lab 4B: Schedule the Job
Take an ETL script and turn it into a production job. Write the cron entry, add locking to prevent duplicate runs, wire up logging, and prove it recovers cleanly when something goes wrong.
Covers: cron configuration, lock files, logging, error recovery, production scheduling
Exam Level 4 Master Exam
Prove you can engineer data at scale. 10 questions covering duckdb, pandas, advanced awk and jq, Python ETL, and pipeline orchestration. Score 80% or higher to pass.
Covers: duckdb, pandas, advanced awk, advanced jq, Python ETL, orchestration
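
The structure T31 describes (extract, transform, and load as separate functions, with argparse for CLI flags) might look roughly like the sketch below. The sales table, the name and amount columns, and the --db flag are hypothetical choices for this example, and sqlite3 stands in for whatever target a real pipeline would use.

```python
import argparse
import csv
import sqlite3


def extract(source):
    """Read rows from an iterable of CSV lines (e.g. an open file)."""
    return list(csv.DictReader(source))


def transform(rows):
    """Strip whitespace and cast amount to float; raise on the first bad row."""
    clean = []
    for row in rows:
        try:
            clean.append((row["name"].strip(), float(row["amount"])))
        except (KeyError, ValueError, TypeError, AttributeError) as exc:
            # Fail before anything is written to the target.
            raise ValueError(f"bad row {row!r}: {exc}") from exc
    return clean


def load(conn, records):
    """Insert all records in one transaction; roll back on any error."""
    with conn:  # commits on success, rolls back on exception
        conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    return len(records)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Minimal ETL sketch")
    parser.add_argument("csv_path", help="source CSV with name,amount columns")
    parser.add_argument("--db", default=":memory:", help="sqlite3 database path")
    args = parser.parse_args(argv)
    with open(args.csv_path, newline="") as source:
        rows = extract(source)
    records = transform(rows)  # raises before the load step if data is bad
    conn = sqlite3.connect(args.db)
    try:
        print(load(conn, records), "rows loaded")
    finally:
        conn.close()
```

Calling main(["sales.csv", "--db", "sales.db"]) would run the three stages in order. Because transform finishes before load starts, a malformed row stops the run without touching the database.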

Analysis and insight. SQL window functions. Data cleaning strategies. Statistical analysis covering mean, median, correlation, and distributions. Regex deep dive. API data collection. CLI visualization. Report automation. An analytics capstone bringing it all together.

T33 SQL Window Functions (+Quiz)
Rank, compare, and aggregate across rows without collapsing your result set. Use ROW_NUMBER, RANK, LAG, LEAD, and SUM() OVER with frame clauses to answer questions basic SQL can't.
ROW_NUMBER, RANK, LAG, LEAD, SUM() OVER, frame clauses
T34 Data Cleaning (+Quiz)
Detect nulls that aren't really null, deduplicate on composite keys, validate types, flag outliers, and normalize inconsistent values. The work that happens before every real analysis.
null detection, deduplication, type validation, outliers, normalization
T35 Statistical Analysis (+Quiz)
Apply real statistics to your data. Compute descriptive stats, understand distributions, measure correlation between variables, and learn the hypothesis testing concepts behind every A/B test.
descriptive stats, distributions, correlation, hypothesis testing
T36 Regex Deep Dive (+Quiz)
Master regex for real extraction work. Use character classes and quantifiers confidently, capture groups, apply lookahead and lookbehind, and know which dialect each tool expects.
character classes, quantifiers, groups, lookahead, sed/grep/awk
Lab 5A: Find the Signal
You're given a dataset and a vague business question. Clean the data, run the stats, surface the real pattern, and deliver a finding your manager can take into a leadership meeting.
Covers: data cleaning, statistical analysis, pattern discovery, business findings, report delivery
T37 API Data Collection (+Quiz)
Pull data from live APIs the right way. Call REST endpoints with curl, handle pagination across thousands of results, respect rate limits, and parse JSON responses into clean records.
REST APIs, curl, pagination, rate limiting, JSON parsing
T38 CLI Visualization (+Quiz)
Turn numbers into pictures without leaving the terminal. Render bar charts with awk, draw real graphs with gnuplot, build inline sparklines, and generate histograms you can paste straight into a report.
awk charts, gnuplot basics, sparklines, histogram generation
T39 Report Automation (+Quiz)
Ship reports that land in inboxes on schedule. Template with sed, wire cron to trigger runs, format output for each channel, and understand the email delivery concepts behind automated briefings.
sed templates, cron scheduling, output formatting, email concepts
T40 Analytics Capstone (+Quiz)
Run a complete analysis end-to-end. Start from raw source data, clean it, run statistical checks, build visualizations, and deliver a formatted report that answers a real question.
end-to-end analysis from raw data to formatted report
Lab 5B: Ship the Dashboard
Build a real analytics dashboard end-to-end. Pull metrics from multiple sources, compute week-over-week trends, visualize the results from the CLI, and schedule it to regenerate every morning.
Covers: multi-source metrics, trend computation, CLI visualization, cron scheduling, automated delivery
Exam Level 5 Master Exam
Prove you can deliver real analytics. 10 questions covering window functions, data cleaning, statistics, regex, API collection, visualization, and automated reports. Score 80% or higher to pass.
Covers: window functions, data cleaning, statistics, regex, API collection, visualization

Production database skills. PostgreSQL fundamentals. Schemas and PostgreSQL-specific types (JSONB, arrays, timestamptz). Data loading patterns. Indexing and EXPLAIN. Users, roles, permissions. Transactions and integrity. Backup, restore, maintenance. Production operations with pg_stat views.

T41 PostgreSQL Fundamentals (+Quiz)
Graduate from sqlite3 to a real client-server database. Connect with psql, learn the meta-commands that save your hands, work with PostgreSQL types, use RETURNING clauses, and wrap operations in transactions.
psql, meta-commands, types, RETURNING, transactions
T42 Schemas & Data Types (+Quiz)
Use PostgreSQL's type system to your advantage. Organize tables with schemas, query JSONB fields, store arrays natively, handle TIMESTAMPTZ for real time zones, and enforce rules with constraints.
namespaces, JSONB queries, arrays, TIMESTAMPTZ, constraints
T43 Data Loading Patterns (+Quiz)
Load data into PostgreSQL the way production systems do. Bulk-load with COPY, stage imports before they hit target tables, handle conflicts with upserts, and know when to TRUNCATE vs DELETE.
COPY, staging tables, ON CONFLICT upserts, TRUNCATE vs DELETE
T44 Indexing & EXPLAIN (+Quiz)
Make slow queries fast. Read EXPLAIN plans, design B-tree and partial indexes for real workloads, understand when the planner picks an index over a scan, and know when not to add one at all.
B-tree, partial indexes, query plans, cost analysis
Lab 6A: Tune the Query
A production query is running slow and users are complaining. Read the plan, identify the bottleneck, add the right index, verify the new plan, and measure the improvement.
Covers: EXPLAIN analysis, bottleneck identification, index design, plan verification, performance measurement
T45 Users, Roles & Permissions (+Quiz)
Control who can touch what in your database. Create roles, grant and revoke privileges, use role inheritance to avoid permission sprawl, and configure pg_hba.conf to lock down access.
CREATE ROLE, GRANT/REVOKE, inheritance, pg_hba.conf
T46 Transactions & Data Integrity (+Quiz)
Wrap risky operations in transactions that actually protect your data. Use BEGIN, COMMIT, and ROLLBACK correctly, set savepoints for partial rollbacks, pick the right isolation level, and avoid deadlocks.
BEGIN/COMMIT/ROLLBACK, SAVEPOINT, isolation, deadlocks
T47 Backup, Restore & Maintenance (+Quiz)
Protect your database from the worst day. Take consistent backups with pg_dump, restore them with pg_restore, run VACUUM and ANALYZE to keep things fast, and schedule the work so it happens automatically.
pg_dump formats, pg_restore, VACUUM, ANALYZE, scheduling
T48 Production Operations (+Quiz)
Run a PostgreSQL database in production. Watch active connections with pg_stat_activity, find hot tables with pg_stat_user_tables, catch slow queries in the log, and understand why you'll want connection pooling.
pg_stat_activity, pg_stat_user_tables, slow queries, pooling
Lab 6B: Harden the Database
Take a default PostgreSQL install and make it production-ready. Lock down roles, configure pg_hba.conf, set up automated backups, and produce a runbook the next engineer can follow.
Covers: role lockdown, pg_hba.conf, automated backups, runbook creation, production readiness
Exam Level 6 Master Exam
Prove you can run a production database. 10 questions covering PostgreSQL fundamentals, schemas, loading, indexing, roles, transactions, backups, and production operations. Score 80% or higher to pass.
Covers: PostgreSQL, schemas, data loading, indexing, roles, transactions, backups

Building pipelines that run in production. ETL vs ELT architecture. Extraction patterns. Transformation patterns. Loading strategies. Error handling and recovery. Logging and observability. Data quality testing. Pipeline capstone.

T49 Pipeline Architecture (+Quiz)
Decide how your pipeline should actually work. Compare ETL vs ELT tradeoffs, design for idempotency so retries are safe, manage state between runs, and externalize configuration so environments don't drift.
ETL vs ELT, idempotency, state management, configuration
T50 Extraction Patterns (+Quiz)
Pull data from sources without breaking them. Paginate through APIs, respect rate limits, extract only new records since last run, and detect when a schema changes before it breaks your pipeline.
API pagination, rate limiting, incremental, schema change detection
T51 Transformation Patterns (+Quiz)
Shape raw input into usable output. Apply validation rules, cast types safely, deduplicate on composite keys, denormalize for query speed, and know when to use Python vs SQL for each step.
validation rules, type casting, deduplication, denormalization
T52 Loading Strategies (+Quiz)
Land data in the target without corrupting it. Choose between full reloads, incremental appends, and merge/upsert patterns, and use staging tables to verify data before it hits production.
full reload, incremental append, merge/upsert, staging→target
Lab 7A: End-to-End Pipeline
Architect and build a complete pipeline. Extract from an API, validate and transform the records, load into PostgreSQL using staging tables, and verify every count before the data goes live.
Covers: API extraction, data validation, staging tables, PostgreSQL loading, count verification
T53 Error Handling & Recovery (+Quiz)
Design pipelines that fail safely. Fail loud when the data is wrong, catch expected errors with try/except, retry transient failures with backoff, and route permanently bad records to a dead letter queue.
fail loud, try/except, retry with backoff, dead letter queues
T54 Logging & Observability (+Quiz)
See what your pipeline is actually doing. Emit structured JSON-lines logs, use Python's logging module properly, expose metrics per stage, and wire alerts so someone knows when it breaks.
JSON-lines logging, Python logging, metrics, alerting
T55 Testing & Data Quality (+Quiz)
Verify the pipeline did what it claims. Assert expected row counts, check for nulls in required fields, enforce referential integrity across tables, and catch schema drift before downstream queries break.
row count assertions, null checks, referential integrity, schema drift
T56 Pipeline Capstone (+Quiz)
Trace a full pipeline from source to consumer. Extract, validate, transform, load, verify, log, and hand off a system that runs unattended and recovers from the failures you expect.
extract, validate, transform, load, verify, log
Lab 7B: Pipeline in Production
Take a working pipeline and make it production-grade. Add retries, dead letter handling, structured logging, and data quality checks that stop bad data before it reaches the warehouse.
Covers: retry logic, dead letter handling, structured logging, data quality checks, production hardening
Exam Level 7 Master Exam
Prove you can engineer production pipelines. 10 questions covering architecture, extraction, transformation, loading, error handling, logging, and data quality. Score 80% or higher to pass.
Covers: pipeline architecture, extraction, transformation, loading, error handling, data quality
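
The T53 ideas of retrying transient failures with backoff and routing permanently bad records to a dead letter queue can be sketched in a few lines. This is one possible shape under stated assumptions: TransientError, retry_with_backoff, and process_all are hypothetical names, and a plain list stands in for a real dead letter queue.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a timeout)."""


def retry_with_backoff(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(); on TransientError wait base_delay * 2**attempt, then retry."""
    for attempt in range(attempts):
        try:
            return func()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of retries: fail loud
            sleep(base_delay * 2 ** attempt)


def process_all(records, handler, **retry_kwargs):
    """Process each record; collect the ones that still fail as dead letters."""
    done, dead_letters = [], []
    for record in records:
        try:
            done.append(retry_with_backoff(lambda: handler(record), **retry_kwargs))
        except Exception as exc:
            # Keep the record and the reason so it can be inspected and replayed.
            dead_letters.append({"record": record, "error": repr(exc)})
    return done, dead_letters
```

Passing sleep as a parameter is a design choice that keeps the backoff testable: a test can record the requested delays instead of actually waiting.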

Machine learning for data engineers. ML concepts. Feature engineering with awk and pandas. Training models with scikit-learn. Model evaluation. Prediction pipelines. Text processing for ML. Model monitoring. ML capstone.

T57 ML Concepts (+Quiz)
Learn the ideas behind every machine learning model. Understand when to use classification vs regression, what features and labels actually are, how training differs from inference, and why overfitting ruins otherwise good models.
classification vs regression, features, labels, train/test, overfitting
T58 Feature Engineering (+Quiz)
Turn raw columns into features a model can actually learn from. Encode categorical variables, derive new fields from existing ones, handle missing values, and scale numeric features so no single column dominates.
one-hot, ordinal encoding, derived features, nulls, scaling
T59 Training Models (+Quiz)
Train your first real model. Use scikit-learn's fit/predict API, start with Decision Trees and Linear Regression, split your data for honest evaluation, and persist trained models to disk with joblib.
scikit-learn fit/predict, Decision Trees, Linear Regression, joblib
T60 Model Evaluation (+Quiz)
Measure whether your model is actually any good. Compute MAE, RMSE, and R² for regression; accuracy, precision, recall, and F1 for classification. Read a confusion matrix without squinting.
MAE, RMSE, R², accuracy, precision, recall, F1, confusion matrix
Lab 8A: Train the Model
You're handed a dataset and a business question. Engineer features, train a model end-to-end, evaluate it against multiple metrics, and write up whether it's actually ready to ship.
Covers: feature engineering, model training, multi-metric evaluation, ship readiness, write-up
T61 Prediction Pipelines (+Quiz)
Put trained models to work. Load saved models with joblib, run batch predictions on thousands of rows, write results to CSV or PostgreSQL, and version models so you know which one produced which output.
loading models, batch prediction, writing to CSV/PostgreSQL, versioning
T62 Text Processing for ML (+Quiz)
Turn raw text into something a model can classify. Use bag of words and TF-IDF to represent documents, train a Naive Bayes classifier, and tie it all together with scikit-learn's Pipeline.
bag of words, TF-IDF, Naive Bayes, scikit-learn Pipeline
T63 Model Monitoring (+Quiz)
Watch for the moment your model stops working. Detect data drift in incoming features, spot prediction drift in the outputs, track monitoring metrics over time, and know when it's time to retrain.
data drift, prediction drift, monitoring metrics, when to retrain
T64 ML Capstone (+Quiz)
Ship an end-to-end ML pipeline. Extract data, engineer features, train a model, evaluate it, run batch predictions in production, and monitor the system over time. Every stage, in sequence.
extract, engineer, train, evaluate, predict, monitor
Lab 8B: Ship to Production
Take your trained model and get it serving real predictions. Wire up a batch prediction job, load results into PostgreSQL, add monitoring for drift, and document how to retrain when the time comes.
Covers: batch prediction, PostgreSQL output, model versioning, drift monitoring, retraining docs
Exam Level 8 Master Exam
Prove you can build and ship machine learning models. 10 questions covering ML concepts, feature engineering, training, evaluation, prediction, and monitoring. Score 80% or higher to pass.
Covers: ML concepts, feature engineering, training, evaluation, prediction, monitoring
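
The metrics named in T60 are simple enough to compute by hand, which helps when reading what a library reports. The sketch below uses plain Python rather than scikit-learn so the arithmetic is visible. The function names are hypothetical, and the classification version assumes binary labels.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2). Assumes y_true is not constant."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return mae, rmse, 1 - ss_res / ss_tot


def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, F1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```

The counts tp, fp, and fn are three of the four cells of a binary confusion matrix. Precision and recall are ratios read off its positive column and positive row.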

Python as a data engineering power tool. Scripting fundamentals, data manipulation with pandas, database connectivity with psycopg2, API data collection, data validation and testing, visualization with matplotlib, workflow automation, and building production data tools.

T65 Python Fundamentals (+Quiz)
Get comfortable in Python. Work with variables, data types, functions, conditionals, loops, and the core data structures (dicts and lists) that every data script depends on.
variables, data types, functions, conditionals, loops, dicts, lists
T66 pandas Advanced (+Quiz)
Push pandas to production scale. Pick the right merge strategy, pivot complex tables, apply window functions, chain methods for readability, and understand where the performance cliffs are.
merge strategies, pivot tables, window functions, chaining, performance
T67 Database Connectivity (+Quiz)
Connect Python to PostgreSQL and treat the database like any other resource. Manage connections with context managers, use parameterized queries to avoid SQL injection, and bulk-load thousands of rows in one call.
psycopg2, connection management, parameterized queries, bulk ops
T68 API Data Collection (+Quiz)
Pull data from APIs in Python the right way. Handle pagination with the requests library, respect rate limits, retry transient errors, and extract clean records from noisy responses.
requests, pagination, rate limiting, error handling, extraction
Lab 9A: API to Database
Build a complete ingestion script. Pull paginated results from a live API, transform and validate each record in Python, and bulk-load them into PostgreSQL with idempotency guarantees.
Covers: paginated extraction, record validation, bulk loading, idempotency, Python automation
T69 Data Validation (+Quiz)
Stop bad data at the gate. Run schema checks, use assertion frameworks that fail loudly, score data quality numerically, and flag anomalous records before they corrupt downstream tables.
schema checks, assertion frameworks, quality scoring, anomaly detection
T70 Visualization (+Quiz)
Turn Python DataFrames into charts stakeholders will actually read. Build bar, line, and scatter plots with matplotlib, style them with seaborn, and generate publication-ready graphics for automated reports.
matplotlib charts, seaborn, automated report graphics
T71 Workflow Automation (+Quiz)
Coordinate multi-step Python jobs that run themselves. Schedule work, manage dependencies between stages, send notifications on completion and failure, and add retry logic that handles transient errors.
scheduling, dependency management, notifications, retry logic
T72 Data Automation Capstone (+Quiz)
Build a complete production data tool in Python. Connect every piece you've learned (extraction, validation, transformation, database loading, visualization, and scheduling) into a single deliverable.
production data pipeline tool in Python
Lab 9B: Ship the Tool
Package your capstone tool as production software. Add CLI flags with argparse, log structured JSON, write a setup script, and produce a runbook another engineer could use to deploy and operate it.
Covers: CLI packaging, argparse flags, structured logging, setup scripting, operations runbook
Exam Level 9 Master Exam
Prove you can build production data tools in Python. 10 questions covering fundamentals, pandas, database connectivity, API collection, validation, visualization, and workflow automation. Score 80% or higher to pass.
Covers: Python fundamentals, pandas, database connectivity, APIs, validation, visualization

Four capstone labs that combine everything you've learned across all levels.

Lab Messy Data Challenge
A dataset arrives with every kind of problem at once: mixed encodings, inconsistent schemas, broken dates, silent duplicates. Clean it, normalize it, and produce a loadable file the rest of the team can trust.
Covers: encoding repair, schema normalization, deduplication, date standardization, trust verification
Lab Pipeline Debugging
A production pipeline is failing silently and the data team is blind. Find the stage that's dropping records, reproduce the failure locally, write the test that catches it, and ship the fix with a postmortem.
Covers: silent failure detection, local reproduction, test writing, fix deployment, postmortem
Lab Automated Intelligence Report
Leadership wants a weekly briefing with real findings, not screenshots. Build an automated pipeline that pulls from multiple sources, computes week-over-week changes, and delivers a formatted report every Monday.
Covers: multi-source extraction, week-over-week metrics, formatted reporting, Monday scheduling, automation
Lab ML Model Showdown
Three candidate models, one business problem. Train all three, evaluate them on the same metrics, defend your choice with numbers, and ship the winner to production with monitoring in place.
Covers: multi-model training, comparative evaluation, defended recommendation, production deployment, monitoring

Where these skills take you

Real job titles that use the tools taught in this course.

Entry-level
$55K – $75K
  • Junior Data Engineer
  • Data Analyst
  • Junior ETL Developer
  • Business Intelligence Analyst
  • Database Administrator (Junior)
  • Reporting Analyst

2 Years Experience
$95K – $140K
  • Data Engineer
  • Analytics Engineer
  • ETL/Pipeline Developer
  • Data Platform Engineer
  • Machine Learning Engineer
  • Database Reliability Engineer

4+ Years Experience
$150K – $210K+
  • Senior Data Engineer
  • Staff Analytics Engineer
  • Data Infrastructure Lead
  • Principal Data Engineer
  • Engineering Manager (Data)
  • Data Platform Architect

Salary ranges based on 2025-2026 US market data. The first role in each column is the most common entry point from this course.

Start building data skills

One purchase. Lifetime access. No subscription.
