End‑to‑End FMCG Data Engineering Project on Databricks Free Edition


Source: YouTube video by codebasics (video ID: U6ZUKWdfSLY)


Introduction

Today we built a complete data‑engineering pipeline for a merged FMCG company using Databricks Free Edition. The stack includes Python, SQL, AWS S3, the Medallion architecture (bronze‑silver‑gold), Databricks dashboards, and the AI assistant Genie. No credit card is required, making it ideal for students and practitioners.

Business Context

  • Parent company: Atlon, a large sports‑equipment manufacturer with a mature OLTP/OLAP system.
  • Acquired company: Sports Bar, a fast‑growing nutrition‑bar startup whose data lives in spreadsheets, cloud drives, WhatsApp exports and ad‑hoc APIs.
  • Problem: Mismatched data formats, missing months, and inconsistent reporting hinder unified supply‑chain forecasting.
  • Goal (set by COO Bruce):
      • Provide a single reliable dashboard aggregating both companies.
      • Keep the learning curve shallow for future hires.
      • Build a scalable, long‑term solution.

Technical Architecture

  • Medallion layers – Bronze (raw), Silver (cleaned), Gold (BI‑ready).
  • Parent pipeline already exists (bronze → silver → gold). We only create the child pipeline and then merge child gold tables into the parent gold tables.
  • Data timeline: Historical back‑fill of 5 months (July 1 – Nov 30) from Sports Bar, then daily incremental loads starting Dec 1.

Data Model

  • Star schema with one fact table (orders) and three dimension tables (dim_customers, dim_products, dim_gross_price).
  • Parent and child schemas differ (e.g., column names, case, missing fields). The pipeline normalises these differences.
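
Because the parent and child schemas differ, the pipeline needs an explicit column mapping. A minimal plain‑Python sketch of the idea follows; the column names in the mapping are illustrative assumptions, not the actual schemas:

```python
# Illustrative child-to-parent column mapping; these names are assumptions.
CHILD_TO_PARENT = {
    "cust_id": "customer_code",
    "Product Name": "product_name",
    "qty": "sold_quantity",
}

def normalise_row(row: dict) -> dict:
    """Rename child columns to the parent's convention,
    lower-casing any column without an explicit mapping."""
    return {CHILD_TO_PARENT.get(k, k.lower()): v for k, v in row.items()}
```

In the actual notebooks the same idea would be expressed as Spark withColumnRenamed/select calls, but the mapping dictionary is the part that encodes the business knowledge.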

Step‑by‑Step Implementation

1. Databricks Account & Catalog Setup

  • Sign‑in with Google, create a FMCG catalog, and three schemas: bronze, silver, gold.
  • Create a workspace folder consolidated_pipeline and notebooks for setup, utilities, and each dimension.

2. Load Parent Data (Gold Only)

  • Import CSVs for customers, products, gross price, and orders directly into the gold schema.
  • Create a date dimension table programmatically (Jan 1 2024 – Dec 31 2025).
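
The date dimension needs no source file at all. A plain‑Python sketch of the generation logic (in a notebook this would more typically use Spark SQL's sequence/explode; the exact attribute set here is an assumption):

```python
from datetime import date, timedelta

def build_date_dim(start: date, end: date) -> list[dict]:
    """Generate one row per calendar day with the attributes
    a BI layer typically needs (month start, quarter, year)."""
    rows = []
    d = start
    while d <= end:
        rows.append({
            "date": d,
            "month_start": d.replace(day=1),
            "quarter": (d.month - 1) // 3 + 1,
            "year": d.year,
        })
        d += timedelta(days=1)
    return rows

# Covers the project's reporting window: Jan 1 2024 - Dec 31 2025.
dim = build_date_dim(date(2024, 1, 1), date(2025, 12, 31))
```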

3. Ingest Child Data into S3

  • Create an S3 bucket (e.g., sports-bar-dp).
  • Upload the child CSVs (full load under full_load/, incremental under incremental_load/).
  • Establish an external Databricks connection to the bucket.

4. Bronze Layer (Raw)

  • Read each CSV with Spark (spark.read.format('csv').option('header','true').load(path)).
  • Add metadata columns: read_timestamp, file_name, file_size.
  • Write to Delta tables in bronze using mode('overwrite') for the initial load.

5. Silver Layer (Cleaning & Transformation)

  • Customers:
  • Drop duplicates, trim spaces, standardise city names, fix case, replace unknown cities, concatenate customer_name + city, add static columns market='India', platform='SportsBar', channel='Acquisition'.
  • Products:
  • Remove duplicate rows, title‑case category, split product_name into product and variant, correct spelling errors (e.g., 'protein'), map categories to divisions, generate a surrogate product_code using SHA‑256, replace non‑numeric IDs with 9999.
  • Gross Price:
  • Uniform month format via to_date with multiple patterns, convert negative prices to positive, replace unknown values with 0, join with the product table to attach product_code.
  • Orders (Fact):
  • Append raw rows to bronze, move processed files to a processed/ folder, clean dates (remove weekday text), convert null quantities, filter non‑numeric customer IDs, join with product table for product_code, drop duplicates.
  • Write each cleaned dataframe to the corresponding silver tables (mode('overwrite') for full load, mode('append') for incremental).
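
Several of these cleaning rules are plain string/hash logic and can be sketched outside Spark. The helpers below illustrate the surrogate‑key, date‑normalisation, and ID‑fallback rules in pure Python; the date formats and the 16‑character key length are assumptions, and in the notebooks the same logic runs as PySpark column expressions (e.g., sha2, to_date):

```python
import hashlib
import re
from datetime import datetime

def surrogate_product_code(product: str, variant: str) -> str:
    """Deterministic surrogate key from the natural-key columns
    (the notebook does the equivalent with Spark's sha2 function)."""
    return hashlib.sha256(f"{product}|{variant}".encode()).hexdigest()[:16]

def parse_month(raw: str):
    """Try several month formats, mirroring to_date with multiple
    patterns; the format list here is an assumption."""
    for fmt in ("%Y-%m-%d", "%d-%m-%Y", "%b %Y", "%Y/%m/%d"):
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    return None

def clean_order_date(raw: str) -> str:
    """Strip weekday text such as 'Monday, 2024-12-02' -> '2024-12-02'."""
    return re.sub(r"^[A-Za-z]+,\s*", "", raw.strip())

def numeric_id_or_default(raw, default: int = 9999) -> int:
    """Replace non-numeric IDs with the agreed sentinel value."""
    s = str(raw).strip()
    return int(s) if s.isdigit() else default
```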

6. Gold Layer (BI‑Ready)

  • Select only the required columns from silver tables, rename customer_id → customer_code, and write to gold.
  • Merging: Use Delta Lake upsert (MERGE INTO parent_gold USING child_gold ON product_code) to update existing rows or insert new ones.
  • For the fact table, aggregate daily child orders to monthly granularity before upserting into the parent fact table.
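
Combining the last two bullets, the monthly aggregation plus upsert might look like the following Delta Lake SQL; all table and column names are illustrative assumptions:

```sql
-- Aggregate daily child orders to monthly grain, then upsert into the
-- parent fact table. Names are assumptions, not the actual schema.
MERGE INTO gold.fact_orders AS parent
USING (
  SELECT customer_code,
         product_code,
         DATE_TRUNC('MONTH', order_date) AS order_month,
         SUM(sold_quantity)              AS sold_quantity
  FROM   gold.child_fact_orders
  GROUP  BY customer_code, product_code, DATE_TRUNC('MONTH', order_date)
) AS child
ON  parent.customer_code = child.customer_code
AND parent.product_code  = child.product_code
AND parent.order_month   = child.order_month
WHEN MATCHED THEN UPDATE SET parent.sold_quantity = child.sold_quantity
WHEN NOT MATCHED THEN INSERT *
```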

7. Incremental Load Workflow

  • New daily CSV lands in S3 landing/.
  • Create a staging bronze table (append only the new file).
  • Apply the same silver transformations, write to staging silver, then upsert into child gold.
  • Re‑aggregate monthly totals and upsert into the parent fact table.
  • Move the processed file to processed/.
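
The file‑movement step at the end of the run can be sketched with local paths; on Databricks the same move would target the S3 locations (e.g., via dbutils.fs.mv), and the folder names here follow the layout above:

```python
from pathlib import Path
import shutil

def archive_processed(landing: Path, processed: Path) -> list[str]:
    """Move every CSV that has been loaded out of landing/ so the next
    incremental run does not pick it up again."""
    processed.mkdir(parents=True, exist_ok=True)
    moved = []
    for csv in sorted(landing.glob("*.csv")):
        shutil.move(str(csv), str(processed / csv.name))
        moved.append(csv.name)
    return moved
```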

8. Orchestration (Jobs & Pipelines)

  • Define Databricks jobs for each notebook (customers → products → price → orders).
  • Set dependencies so tasks run in the correct order.
  • Schedule the pipeline to run nightly (e.g., 23:00 UTC) after business hours.
  • Configure email notifications for success/failure.
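
A Databricks Jobs configuration along these lines could express the dependency chain, nightly schedule, and notifications; the task names, notebook paths, and email address are illustrative assumptions:

```json
{
  "name": "consolidated_pipeline_nightly",
  "schedule": {
    "quartz_cron_expression": "0 0 23 * * ?",
    "timezone_id": "UTC"
  },
  "email_notifications": { "on_failure": ["data-team@example.com"] },
  "tasks": [
    { "task_key": "customers",
      "notebook_task": { "notebook_path": "/consolidated_pipeline/customers" } },
    { "task_key": "products", "depends_on": [ { "task_key": "customers" } ],
      "notebook_task": { "notebook_path": "/consolidated_pipeline/products" } },
    { "task_key": "price", "depends_on": [ { "task_key": "products" } ],
      "notebook_task": { "notebook_path": "/consolidated_pipeline/price" } },
    { "task_key": "orders", "depends_on": [ { "task_key": "price" } ],
      "notebook_task": { "notebook_path": "/consolidated_pipeline/orders" } }
  ]
}
```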

9. BI Dashboard & Genie AI Assistant

  • Create a denormalised view joining fact and all dimensions (date, product, customer attributes) in the gold schema.
  • Use Databricks Genie to ask natural‑language questions (e.g., total revenue by quarter, top 5 customers).
  • Build a visual dashboard with KPIs, bar charts, pie charts, and filters for year, quarter, month, channel, and category.
  • Export the dashboard or embed it in presentations.
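
The denormalised view behind the dashboard might look like this; every table and column name is an assumption based on the star schema described above:

```sql
-- Illustrative reporting view joining the fact table to all dimensions.
CREATE OR REPLACE VIEW gold.v_orders_reporting AS
SELECT f.order_month,
       d.quarter,
       d.year,
       c.customer_name,
       c.channel,
       p.product_name,
       p.category,
       f.sold_quantity,
       f.sold_quantity * g.gross_price AS gross_revenue
FROM   gold.fact_orders f
JOIN   gold.dim_date        d ON d.month_start   = f.order_month
JOIN   gold.dim_customers   c ON c.customer_code = f.customer_code
JOIN   gold.dim_products    p ON p.product_code  = f.product_code
JOIN   gold.dim_gross_price g ON g.product_code  = f.product_code
                             AND g.month         = f.order_month
```

Genie and the dashboard widgets can then query this single view instead of re‑joining the star schema for every question.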

Project Outcome

  • A unified, reliable data layer covering both Atlon and Sports Bar.
  • Scalable architecture that can handle future system changes for the child company.
  • A ready‑to‑use dashboard that answers key business questions.
  • All work performed in the free Databricks edition, making the project résumé‑ready for students and professionals.

By leveraging Databricks Free Edition, AWS S3, and the Medallion architecture, we turned chaotic, heterogeneous data from a recent acquisition into a clean, scalable, and BI‑ready data lake. The pipeline delivers a single, reliable dashboard for the merged FMCG company, meets the three success criteria, and provides a résumé‑worthy end‑to‑end data‑engineering showcase.
