When I started learning data engineering, I wanted to work with a real-world dataset instead of just “toy” examples. So I picked the India Road Accident Dataset from Kaggle and built a complete ETL pipeline using PySpark and Delta Lake.
Note: This project is a sample ETL pipeline I built for learning and practice. It’s not production-ready, but it’s a great way to understand how raw data becomes analytics-ready data step by step.
In this post, I’ll walk you through how I designed the pipeline using the Medallion Architecture (Bronze → Silver → Gold). Don’t worry if the terms sound heavy; I’ll explain everything in plain English.
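To make the three layers a bit more concrete before any PySpark shows up, here is a minimal plain-Python sketch of the idea. The rows, the column names (`state`, `year`, `accidents`), and the helper functions `to_silver` and `to_gold` are all made up for illustration; they are not taken from the Kaggle dataset or from the actual pipeline.

```python
from collections import defaultdict

# Bronze: raw records stored exactly as they arrived.
# Hypothetical rows for illustration only, not real accident data.
bronze = [
    {"state": " Kerala ", "year": "2021", "accidents": "120"},
    {"state": "kerala", "year": "2021", "accidents": "80"},
    {"state": "Goa", "year": "2021", "accidents": None},  # missing value
    {"state": "Goa", "year": "2021", "accidents": "30"},
]


def to_silver(rows):
    """Silver: clean and type the raw rows, dropping rows with no count."""
    cleaned = []
    for row in rows:
        if row["accidents"] is None:
            continue  # skip records that cannot be used
        cleaned.append({
            "state": row["state"].strip().title(),  # normalize spelling
            "year": int(row["year"]),
            "accidents": int(row["accidents"]),
        })
    return cleaned


def to_gold(rows):
    """Gold: aggregate cleaned rows into totals per (state, year)."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row["state"], row["year"])] += row["accidents"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Each layer only reads from the one before it: Bronze keeps the messy input untouched, Silver holds the cleaned and typed version, and Gold holds the summary you would actually query. In a Spark pipeline the same three steps would typically be DataFrame transformations written to separate tables rather than Python lists.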