CineETL: Movie Insights Data Pipeline

  • Tech Stack: AWS Redshift, Airflow, PySpark, AWS Glue, Amazon S3, Docker, Python
  • GitHub URL: Project Link

CineETL is a data pipeline that extracts, transforms, and loads movie data to support analysis of the film industry.

Built an ETL pipeline that loads 26M user ratings and 45K movies into AWS Redshift at an ingestion rate of 10K records/minute.
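
For a rough sense of scale, the figures above can be turned into an estimated full-load time. This is a back-of-the-envelope sketch only: it assumes the 10K records/minute rate is sustained and applies equally to ratings and movies, which the project description does not state. The function and constant names are illustrative.

```python
# Back-of-the-envelope load-time estimate from the stated figures.
# Assumption: the ingestion rate is constant and the same for both tables.
RATINGS = 26_000_000        # user ratings
MOVIES = 45_000             # movies
RATE_PER_MINUTE = 10_000    # records ingested per minute


def minutes_to_load(records: int, rate_per_minute: int = RATE_PER_MINUTE) -> int:
    """Return whole minutes needed to load `records`, rounding up."""
    return -(-records // rate_per_minute)  # ceiling division


ratings_minutes = minutes_to_load(RATINGS)
movies_minutes = minutes_to_load(MOVIES)
total_hours = (ratings_minutes + movies_minutes) / 60

print(f"ratings: {ratings_minutes} min, movies: {movies_minutes} min")
print(f"full load: about {total_hours:.1f} hours")
```

Under these assumptions the ratings dominate the load time, so incremental daily loads would be far shorter than a full backfill.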

Normalized the data model, automated data quality checks, and orchestrated the pipeline with Airflow, achieving a 99% daily ETL cycle success rate.
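
An automated data quality check of the kind described above might look like the following. This is a minimal sketch in plain Python, not the project's actual checks: the column names (`user_id`, `movie_id`, `rating`), the 0.5 to 5.0 rating range, and the duplicate rule are all assumptions made for illustration. In a pipeline like this, such a function would typically run as its own task after the load step and fail the run when the pass rate drops below a threshold.

```python
from dataclasses import dataclass
from typing import Iterable, Mapping


@dataclass
class QualityReport:
    total: int
    failed: int

    @property
    def pass_rate(self) -> float:
        """Fraction of rows that passed every check (1.0 for empty input)."""
        if self.total == 0:
            return 1.0
        return (self.total - self.failed) / self.total


def check_ratings(
    rows: Iterable[Mapping],
    min_rating: float = 0.5,   # assumed lower bound of the rating scale
    max_rating: float = 5.0,   # assumed upper bound of the rating scale
) -> QualityReport:
    """Validate rating rows: required keys, rating range, no duplicate pairs."""
    seen = set()
    total = 0
    failed = 0
    for row in rows:
        total += 1
        key = (row.get("user_id"), row.get("movie_id"))
        rating = row.get("rating")
        ok = (
            None not in key                          # both ids present
            and isinstance(rating, (int, float))     # rating is numeric
            and min_rating <= rating <= max_rating   # rating within range
            and key not in seen                      # one rating per user/movie
        )
        if ok:
            seen.add(key)
        else:
            failed += 1
    return QualityReport(total=total, failed=failed)


sample = [
    {"user_id": 1, "movie_id": 10, "rating": 4.0},    # valid
    {"user_id": 1, "movie_id": 10, "rating": 3.0},    # duplicate pair
    {"user_id": 2, "movie_id": None, "rating": 5.0},  # missing movie_id
    {"user_id": 3, "movie_id": 11, "rating": 7.5},    # out of range
]
report = check_ratings(sample)
print(report.total, report.failed, report.pass_rate)
```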