End‑to‑End FMCG Data Engineering Project on Databricks Free Edition


Source: YouTube video by codebasics (video ID: U6ZUKWdfSLY)


Introduction

Today we built a complete data‑engineering pipeline for a merged FMCG company using Databricks Free Edition. The stack includes Python, SQL, AWS S3, the Medallion architecture (bronze‑silver‑gold), Databricks dashboards, and the AI assistant Genie. No credit card is required, making it ideal for students and practitioners.

Business Context

  • Parent company: Atlon, a large sports‑equipment manufacturer with a mature OLTP/OLAP system.
  • Acquired company: Sports Bar, a fast‑growing nutrition‑bar startup whose data lives in spreadsheets, cloud drives, WhatsApp exports and ad‑hoc APIs.
  • Problem: Mismatched data formats, missing months, and inconsistent reporting hinder unified supply‑chain forecasting.
  • Goal (set by COO Bruce):
      • Provide a single reliable dashboard aggregating both companies.
      • Keep the learning curve shallow for future hires.
      • Build a scalable, long‑term solution.

Technical Architecture

  • Medallion layers – Bronze (raw), Silver (cleaned), Gold (BI‑ready).
  • Parent pipeline already exists (bronze → silver → gold). We only create the child pipeline and then merge child gold tables into the parent gold tables.
  • Data timeline: Historical back‑fill of 5 months (July 1 – Nov 30) from Sports Bar, then daily incremental loads starting Dec 1.

Data Model

  • Star schema with one fact table (orders) and three dimension tables (dim_customers, dim_products, dim_gross_price).
  • Parent and child schemas differ (e.g., column names, case, missing fields). The pipeline normalises these differences.
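
Because the parent and child schemas differ, the pipeline needs an explicit column mapping. A minimal plain‑Python sketch of the idea follows; the column names in the mapping are illustrative assumptions, not the actual schemas:

```python
# Illustrative child-to-parent column mapping; these names are assumptions.
CHILD_TO_PARENT = {
    "cust_id": "customer_code",
    "Product Name": "product_name",
    "qty": "sold_quantity",
}

def normalise_row(row: dict) -> dict:
    """Rename child columns to the parent's convention,
    lower-casing any column without an explicit mapping."""
    return {CHILD_TO_PARENT.get(k, k.lower()): v for k, v in row.items()}
```

In the actual notebooks the same idea would be expressed as Spark withColumnRenamed/select calls, but the mapping dictionary is the part that encodes the business knowledge.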

Step‑by‑Step Implementation

1. Databricks Account & Catalog Setup

  • Sign‑in with Google, create a FMCG catalog, and three schemas: bronze, silver, gold.
  • Create a workspace folder consolidated_pipeline and notebooks for setup, utilities, and each dimension.

2. Load Parent Data (Gold Only)

  • Import CSVs for customers, products, gross price, and orders directly into the gold schema.
  • Create a date dimension table programmatically (Jan 1 2024 – Dec 31 2025).
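
The date dimension needs no source file at all. A plain‑Python sketch of the generation logic (in a notebook this would more typically use Spark SQL's sequence/explode; the exact attribute set here is an assumption):

```python
from datetime import date, timedelta

def build_date_dim(start: date, end: date) -> list[dict]:
    """Generate one row per calendar day with the attributes
    a BI layer typically needs (month start, quarter, year)."""
    rows = []
    d = start
    while d <= end:
        rows.append({
            "date": d,
            "month_start": d.replace(day=1),
            "quarter": (d.month - 1) // 3 + 1,
            "year": d.year,
        })
        d += timedelta(days=1)
    return rows

# Covers the project's reporting window: Jan 1 2024 - Dec 31 2025.
dim = build_date_dim(date(2024, 1, 1), date(2025, 12, 31))
```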

3. Ingest Child Data into S3

  • Create an S3 bucket (e.g., sports-bar-dp).
  • Upload the child CSVs (full load under full_load/, incremental under incremental_load/).
  • Establish an external Databricks connection to the bucket.

4. Bronze Layer (Raw)

  • Read each CSV with Spark (spark.read.format('csv').option('header','true').load(path)).
  • Add metadata columns: read_timestamp, file_name, file_size.
  • Write to Delta tables in bronze using mode('overwrite') for the initial load.

5. Silver Layer (Cleaning & Transformation)

  • Customers:
  • Drop duplicates, trim spaces, standardise city names, fix case, replace unknown cities, concatenate customer_name + city, add static columns market='India', platform='SportsBar', channel='Acquisition'.
  • Products:
  • Remove duplicate rows, title‑case category, split product_name into product and variant, correct spelling errors (e.g., 'protein'), map categories to divisions, generate a surrogate product_code using SHA‑256, replace non‑numeric IDs with 9999.
  • Gross Price:
  • Uniform month format via to_date with multiple patterns, convert negative prices to positive, replace unknown values with 0, join with the product table to attach product_code.
  • Orders (Fact):
  • Append raw rows to bronze, move processed files to a processed/ folder, clean dates (remove weekday text), convert null quantities, filter non‑numeric customer IDs, join with product table for product_code, drop duplicates.
  • Write each cleaned dataframe to the corresponding silver tables (mode('overwrite') for full load, mode('append') for incremental).
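
Several of these cleaning rules are plain string/hash logic and can be sketched outside Spark. The helpers below illustrate the surrogate‑key, date‑normalisation, and ID‑fallback rules in pure Python; the date formats and the 16‑character key length are assumptions, and in the notebooks the same logic runs as PySpark column expressions (e.g., sha2, to_date):

```python
import hashlib
import re
from datetime import datetime

def surrogate_product_code(product: str, variant: str) -> str:
    """Deterministic surrogate key from the natural-key columns
    (the notebook does the equivalent with Spark's sha2 function)."""
    return hashlib.sha256(f"{product}|{variant}".encode()).hexdigest()[:16]

def parse_month(raw: str):
    """Try several month formats, mirroring to_date with multiple
    patterns; the format list here is an assumption."""
    for fmt in ("%Y-%m-%d", "%d-%m-%Y", "%b %Y", "%Y/%m/%d"):
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    return None

def clean_order_date(raw: str) -> str:
    """Strip weekday text such as 'Monday, 2024-12-02' -> '2024-12-02'."""
    return re.sub(r"^[A-Za-z]+,\s*", "", raw.strip())

def numeric_id_or_default(raw, default: int = 9999) -> int:
    """Replace non-numeric IDs with the agreed sentinel value."""
    s = str(raw).strip()
    return int(s) if s.isdigit() else default
```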

6. Gold Layer (BI‑Ready)

  • Select only the required columns from silver tables, rename customer_id → customer_code, and write to gold.
  • Merging: Use Delta Lake upsert (MERGE INTO parent_gold USING child_gold ON product_code) to update existing rows or insert new ones.
  • For the fact table, aggregate daily child orders to monthly granularity before upserting into the parent fact table.
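
Combining the last two bullets, the monthly aggregation plus upsert might look like the following Delta Lake SQL; all table and column names are illustrative assumptions:

```sql
-- Aggregate daily child orders to monthly grain, then upsert into the
-- parent fact table. Names are assumptions, not the actual schema.
MERGE INTO gold.fact_orders AS parent
USING (
  SELECT customer_code,
         product_code,
         DATE_TRUNC('MONTH', order_date) AS order_month,
         SUM(sold_quantity)              AS sold_quantity
  FROM   gold.child_fact_orders
  GROUP  BY customer_code, product_code, DATE_TRUNC('MONTH', order_date)
) AS child
ON  parent.customer_code = child.customer_code
AND parent.product_code  = child.product_code
AND parent.order_month   = child.order_month
WHEN MATCHED THEN UPDATE SET parent.sold_quantity = child.sold_quantity
WHEN NOT MATCHED THEN INSERT *
```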

7. Incremental Load Workflow

  • New daily CSV lands in S3 landing/.
  • Create a staging bronze table (append only the new file).
  • Apply the same silver transformations, write to staging silver, then upsert into child gold.
  • Re‑aggregate monthly totals and upsert into the parent fact table.
  • Move the processed file to processed/.
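
The file‑movement step at the end of the run can be sketched with local paths; on Databricks the same move would target the S3 locations (e.g., via dbutils.fs.mv), and the folder names here follow the layout above:

```python
from pathlib import Path
import shutil

def archive_processed(landing: Path, processed: Path) -> list[str]:
    """Move every CSV that has been loaded out of landing/ so the next
    incremental run does not pick it up again."""
    processed.mkdir(parents=True, exist_ok=True)
    moved = []
    for csv in sorted(landing.glob("*.csv")):
        shutil.move(str(csv), str(processed / csv.name))
        moved.append(csv.name)
    return moved
```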

8. Orchestration (Jobs & Pipelines)

  • Define Databricks jobs for each notebook (customers → products → price → orders).
  • Set dependencies so tasks run in the correct order.
  • Schedule the pipeline to run nightly (e.g., 23:00 UTC) after business hours.
  • Configure email notifications for success/failure.
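
A Databricks Jobs configuration along these lines could express the dependency chain, nightly schedule, and notifications; the task names, notebook paths, and email address are illustrative assumptions:

```json
{
  "name": "consolidated_pipeline_nightly",
  "schedule": {
    "quartz_cron_expression": "0 0 23 * * ?",
    "timezone_id": "UTC"
  },
  "email_notifications": { "on_failure": ["data-team@example.com"] },
  "tasks": [
    { "task_key": "customers",
      "notebook_task": { "notebook_path": "/consolidated_pipeline/customers" } },
    { "task_key": "products", "depends_on": [ { "task_key": "customers" } ],
      "notebook_task": { "notebook_path": "/consolidated_pipeline/products" } },
    { "task_key": "price", "depends_on": [ { "task_key": "products" } ],
      "notebook_task": { "notebook_path": "/consolidated_pipeline/price" } },
    { "task_key": "orders", "depends_on": [ { "task_key": "price" } ],
      "notebook_task": { "notebook_path": "/consolidated_pipeline/orders" } }
  ]
}
```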

9. BI Dashboard & Genie AI Assistant

  • Create a denormalised view joining fact and all dimensions (date, product, customer attributes) in the gold schema.
  • Use Databricks Genie to ask natural‑language questions (e.g., total revenue by quarter, top 5 customers).
  • Build a visual dashboard with KPIs, bar charts, pie charts, and filters for year, quarter, month, channel, and category.
  • Export the dashboard or embed it in presentations.
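
The denormalised view behind the dashboard might look like this; every table and column name is an assumption based on the star schema described above:

```sql
-- Illustrative reporting view joining the fact table to all dimensions.
CREATE OR REPLACE VIEW gold.v_orders_reporting AS
SELECT f.order_month,
       d.quarter,
       d.year,
       c.customer_name,
       c.channel,
       p.product_name,
       p.category,
       f.sold_quantity,
       f.sold_quantity * g.gross_price AS gross_revenue
FROM   gold.fact_orders f
JOIN   gold.dim_date        d ON d.month_start   = f.order_month
JOIN   gold.dim_customers   c ON c.customer_code = f.customer_code
JOIN   gold.dim_products    p ON p.product_code  = f.product_code
JOIN   gold.dim_gross_price g ON g.product_code  = f.product_code
                             AND g.month         = f.order_month
```

Genie and the dashboard widgets can then query this single view instead of re‑joining the star schema for every question.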

Project Outcome

  • A unified, reliable data layer covering both Atlon and Sports Bar.
  • Scalable architecture that can handle future system changes for the child company.
  • A ready‑to‑use dashboard that answers key business questions.
  • All work performed in the free Databricks edition, making the project résumé‑ready for students and professionals.

By leveraging Databricks Free Edition, AWS S3, and the Medallion architecture, we turned chaotic, heterogeneous data from a recent acquisition into a clean, scalable, and BI‑ready data lake. The pipeline delivers a single, reliable dashboard for the merged FMCG company, meets the three success criteria, and provides a résumé‑worthy end‑to‑end data‑engineering showcase.
