The foundation everything else builds on. You'll learn to navigate a Linux system, create and manage files, control who can access what, and chain commands together to automate tasks. By the end, you'll be able to set up users, connect to servers over SSH, install software, and run your first Docker container.
Your bridge from Linux to data work. Recognize and inspect data formats (CSV, JSON, TSV), use grep with regex for pattern extraction, clean data with sed, extract columns with cut/paste/join, get started with awk, parse JSON with jq, write basic SQL with sqlite3, and chain it all together in a mini pipeline.
Now you go deep. Advanced awk with BEGIN/END blocks, associative arrays, and string functions. Full jq mastery. csvkit power tools. Advanced SQL with JOINs and subqueries. Multi-stage pipelines. Log analytics. Report generation. An introduction to Python data tools.
Bigger data, better tools. DuckDB for analytical SQL directly on files. DataFrame operations with pandas. Advanced awk and jq pipelines. Python ETL scripting. Pipeline orchestration with cron and dependency management.
Analysis and insight. SQL window functions. Data cleaning strategies. Statistical analysis covering mean, median, correlation, and distributions. Regex deep dive. API data collection. CLI visualization. Report automation. An analytics capstone bringing it all together.
Production database skills. PostgreSQL fundamentals. Schemas and PostgreSQL-specific types (JSONB, arrays, timestamptz). Data loading patterns. Indexing and EXPLAIN. Users, roles, permissions. Transactions and integrity. Backup, restore, maintenance. Production operations with pg_stat views.
Building pipelines that run in production. ETL vs ELT architecture. Extraction patterns. Transformation patterns. Loading strategies. Error handling and recovery. Logging and observability. Data quality testing. Pipeline capstone.
Machine learning for data engineers. ML concepts. Feature engineering with awk and pandas. Training models with scikit-learn. Model evaluation. Prediction pipelines. Text processing for ML. Model monitoring. ML capstone.
Python as a data engineering power tool. Scripting fundamentals, data manipulation with pandas, database connectivity with psycopg2, API data collection, data validation and testing, visualization with matplotlib, workflow automation, and building production data tools.
Four capstone labs that combine everything you've learned across all levels.
Real job titles for roles that use the tools taught in this course.
Salary ranges based on 2025-2026 US market data. The first role in each column is the most common entry point from this course.