Transitioning from Power BI to Microsoft Fabric: A Complete Guide


Source: YouTube video by Learn Microsoft Fabric with Will


Introduction

This article distills a 3‑hour video series that walks Power BI developers through every major decision when moving to Microsoft Fabric. It covers environment setup, capacities, workspaces, security, data ingestion, storage options, semantic modeling, data validation, end‑to‑end migration, and career pathways.

Why Move to Fabric?

  • Simplified governance – unified access control and documentation.
  • Self‑service analytics – easier for business users to discover and use data.
  • Robust data quality – built‑in validation checkpoints reduce errors.
  • AI/ML readiness – Fabric is positioned as the data platform for the era of large language models and Copilot.

Planning the Fabric Environment

  1. Capacities
    • Decide how many capacities you need (usually one, but more may be required for):
      • Data residency (GDPR) – separate capacities per region.
      • Cost‑center alignment – separate capacities per department.
      • Workload segregation – intensive data‑engineering vs. reporting workloads.
  2. Workspaces
    • Organize by personas (data engineers, data scientists, analysts) or by architecture layer (bronze, silver, gold).
    • Use naming conventions to keep dev/test/prod workspaces distinct and avoid duplicates.
    • Prefer Entra ID or Microsoft 365 groups over individual user assignments.
  3. Access Control & Roles
    • Four workspace roles: Admin, Member, Contributor, Viewer.
    • Understand role capabilities per item type (e.g., a Viewer can only query the SQL endpoint of a Lakehouse).
    • Combine workspace‑level sharing with item‑level sharing to reduce workspace sprawl.
    • Remember that some items (data pipelines, dataflows, event streams) cannot be shared individually.
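Group-based role assignment can also be scripted. As a hedged sketch (the endpoint shape and payload follow the public Fabric REST API docs, but verify against the current reference before relying on it), a helper might build the role-assignment request like this; the IDs and role names here are illustrative:

```python
# Hypothetical sketch: assigning a Microsoft 365 group to a workspace role
# via the Fabric REST API (POST /v1/workspaces/{id}/roleAssignments).
# Endpoint path and payload shape are assumptions based on public docs.

FABRIC_API = "https://api.fabric.microsoft.com/v1"
VALID_ROLES = {"Admin", "Member", "Contributor", "Viewer"}

def build_role_assignment(workspace_id: str, group_id: str, role: str):
    """Return the (url, payload) pair for a workspace role assignment."""
    if role not in VALID_ROLES:
        raise ValueError(f"Unknown workspace role: {role}")
    url = f"{FABRIC_API}/workspaces/{workspace_id}/roleAssignments"
    payload = {
        # Assigning a group (not an individual user), per the guidance above.
        "principal": {"id": group_id, "type": "Group"},
        "role": role,
    }
    return url, payload
```

Sending the payload with an authenticated HTTP client is then a one-liner; keeping the builder separate makes the role logic easy to unit-test.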

Getting Data Into Fabric

| Method | When to Use | Key Points |
| --- | --- | --- |
| Data ingestion (ETL/ELT) – Dataflows, Data Pipelines, Fabric Notebooks | Large or complex transformations, need for custom code, on‑premises gateways | Dataflows: 300+ connectors, low‑code Power Query UI, can reach on‑premises sources via a gateway. Data Pipelines: best for massive copy jobs, orchestration, and control‑flow logic; no native transforms, so embed notebooks or dataflows for that. Fabric Notebooks: Python/Scala/Spark; ideal for API calls, custom libraries, data‑science tasks. |
| Shortcuts (OneLake) | Near‑real‑time sync of files stored in ADLS, Amazon S3, Dataverse, etc. | No ETL; automatic incremental sync. Works for both files and tables. Beware of cross‑region egress fees. |
| Database Mirroring (preview) | Real‑time replication of Snowflake, Cosmos DB, Azure SQL, etc. | Creates a Delta‑format replica using change data capture. Still in private preview (expected Q1 2024). |
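The notebook route in the table above usually boils down to calling an API and flattening the response into rows before landing them in a Lakehouse table. A minimal sketch of that flattening step, with illustrative field names that are assumptions rather than a real API's schema:

```python
# Sketch of the notebook-ingestion pattern: flatten a nested API payload
# into flat row dicts ready to land in a Lakehouse table.
# Field names ("reviews", "business", "rating", ...) are illustrative only.

def flatten_reviews(api_payload: dict) -> list[dict]:
    """Flatten nested review records into one flat dict per row."""
    rows = []
    for item in api_payload.get("reviews", []):
        rows.append({
            "review_id": item["id"],
            "business_id": item["business"]["id"],
            "stars": float(item["rating"]),
            "text": item.get("text", ""),
        })
    return rows

# In a Fabric notebook you would then typically write the rows as Delta, e.g.:
#   spark.createDataFrame(rows).write.mode("append").saveAsTable("bronze_reviews")
```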

Choosing the Right Data Store

  • Lakehouse – Stores structured tables and files (CSV, JSON, Parquet, images). Ideal for bronze layer, data‑science, and unstructured data.
  • Data Warehouse – Purely structured, T‑SQL based. Best for silver/gold layers, star‑schemas, aggregations, and row/column‑level security.
  • KQL Database – Optimized for streaming and real‑time analytics. Use for event‑stream data or time‑series workloads.

Typical Medallion Architecture

  1. Bronze – Raw files & tables in a Lakehouse.
  2. Silver – Cleaned/validated data, often still in a Lakehouse (or moved to a second Lakehouse).
  3. Gold – Aggregated, business‑ready tables in a Data Warehouse; optional KQL layer for real‑time feeds.
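The bronze-to-silver hop is typically a cleaning pass: reject rows missing keys, normalize fields, and deduplicate. A sketch of that step using plain Python dicts (in Fabric this would normally run over a Spark DataFrame in a notebook; the column names are illustrative):

```python
# Sketch of a bronze-to-silver cleaning step in the medallion pattern:
# drop rows missing the business key, trim text, and deduplicate on the key.

def bronze_to_silver(rows: list[dict], key: str = "review_id") -> list[dict]:
    seen = set()
    silver = []
    for row in rows:
        if not row.get(key):      # reject rows missing the business key
            continue
        if row[key] in seen:      # deduplicate on the key
            continue
        seen.add(row[key])
        cleaned = dict(row)
        if isinstance(cleaned.get("text"), str):
            cleaned["text"] = cleaned["text"].strip()
        silver.append(cleaned)
    return silver
```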

Building Semantic Models

  • Every Lakehouse gets a default semantic model, but it does not auto‑sync with new tables; create a new semantic model and add tables manually.
  • Direct Lake mode combines the speed of Import mode with the freshness of DirectQuery:
    • Queries Delta tables directly, loading only the needed rows (on‑demand loading).
    • Provides near‑real‑time results without full refreshes.
  • Configure fallback behavior (Automatic, Direct Lake only, DirectQuery only) to control how Power BI reacts when a query cannot be served by Direct Lake.
  • Use the "Keep your Direct Lake data up to date" toggle to decide whether the semantic model should refresh automatically after each pipeline run.

Data Validation & Quality Assurance

Three validation layers are recommended:

  1. Schema Validation – verify incoming files (CSV, JSON, Parquet) for correct column names, data types, and loadability.
  2. Table/Data‑frame Validation – after each transformation step, check for nulls, range violations, duplicate keys, etc.
  3. Semantic Model Validation – ensure relationships, DAX measures, and calculated columns return expected results.
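The first layer can be as simple as checking an incoming file's header and basic types against an expected contract before loading. A minimal sketch using only the standard library; the expected schema below is an illustrative assumption:

```python
# Sketch of schema validation (layer 1): check an incoming CSV's columns
# and a basic type rule against an expected contract before loading it.

import csv
import io

EXPECTED_COLUMNS = ["review_id", "business_id", "stars"]

def validate_csv_schema(csv_text: str) -> list[str]:
    """Return a list of schema problems; an empty list means the file passes."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems
    for i, row in enumerate(reader, start=1):
        try:
            float(row["stars"])   # stars must be numeric
        except (TypeError, ValueError):
            problems.append(f"row {i}: non-numeric stars {row['stars']!r}")
    return problems
```

Libraries like Great Expectations automate this pattern at scale, but the checkpoint idea is the same: validate before you load.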

Tooling

  • Great Expectations – Rich library (~400 built‑in expectations), supports Pandas and Spark data frames, generates Data Docs for documentation.
  • Pandera / Pydantic – Simpler, Python‑only validation; good for quick checks but limited enterprise features.
  • DBT – SQL‑based validation and transformation; ideal for Data Warehouse pipelines.
  • Semantic Link – Fabric feature to read tables and evaluate DAX from notebooks, enabling automated semantic‑model tests.
  • Centralize validation results in a dedicated Lakehouse; build Power BI dashboards to monitor data‑quality trends across the organization.
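Centralizing results works best when every check emits a uniformly shaped record that can be appended to a Lakehouse table and charted in Power BI. The record shape below is an assumption, not a Fabric-defined schema:

```python
# Sketch of centralizing validation results: each check emits a uniform
# record destined for a dedicated Lakehouse table, plus a roll-up for a
# data-quality dashboard. The record fields are illustrative assumptions.

from datetime import datetime, timezone

def validation_record(table: str, check: str, passed: bool, detail: str = "") -> dict:
    return {
        "run_at": datetime.now(timezone.utc).isoformat(),
        "table": table,
        "check": check,
        "passed": passed,
        "detail": detail,
    }

def summarize(records: list[dict]) -> dict:
    """Roll up pass/fail counts for a quality dashboard."""
    passed = sum(1 for r in records if r["passed"])
    return {"total": len(records), "passed": passed, "failed": len(records) - passed}
```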

End‑to‑End Migration Example

The series concludes with a hands‑on project that:

  • Sets up capacities and workspaces.
  • Ingests Yelp review data via a Python notebook.
  • Stores raw files in a Lakehouse, validates schema with Great Expectations, cleans the data, validates tables, and writes to a silver layer.
  • Moves the final gold tables to a Data Warehouse, builds a Direct Lake semantic model, and creates a Power BI report that demonstrates fast, up‑to‑date visualizations.
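The shape of that project is a sequence of gated steps: each stage runs only if the previous validation passed. A minimal, illustrative sketch of that orchestration idea (the step functions here are stand-ins, not the actual project code):

```python
# Sketch of an end-to-end flow as (transform, validate) pairs run in order,
# mirroring the bronze -> silver -> gold checkpoints. Steps are stubs.

def run_pipeline(raw_rows, steps):
    """Run (name, transform, validate) triples in order; stop at the first failure."""
    data = raw_rows
    for name, transform, validate in steps:
        data = transform(data)
        if not validate(data):
            raise RuntimeError(f"Validation failed after step: {name}")
    return data
```

In Fabric, the same gating is usually expressed with Data Pipeline control flow or notebook exit values, but the checkpoint-per-stage idea carries over directly.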

Career Pathways in the Fabric Era

| Role | Core Skills | How Fabric Enhances the Role |
| --- | --- | --- |
| Power BI Developer | DAX, Power Query, report design | Create and manage Lakehouse shortcuts, understand data governance, take on modest data‑engineering tasks. |
| Analytics Engineer (DP‑600) | T‑SQL, basic Python, data modeling | Designs end‑to‑end pipelines, builds semantic models, enforces data quality, bridges BI and engineering. |
| Data Engineer | Spark, Python/Scala, orchestration (Data Pipelines) | Deep work on Lakehouse tables, large‑scale ETL, performance tuning, automation. |
| Data Scientist | Python, ML libraries, statistical modeling | Unified data access in the Lakehouse, trains models directly on Delta tables, leverages Fabric's AI integrations. |

Next Steps

  • Review the 36 decision‑making questions (capacity, workspace design, security, ingestion method, store choice, validation strategy).
  • Join the Learn Microsoft Fabric community on School.com for notebooks, docs, and support.
  • Consider the DP‑600 certification to formalize analytics‑engineer skills.
  • Start small: pick a single data source, ingest with a Dataflow or Notebook, validate with Great Expectations, and build a Direct Lake model.

Conclusion

Transitioning from Power BI to Microsoft Fabric is not a single‑step migration but a series of strategic choices—capacity planning, workspace organization, ingestion method, storage technology, semantic modeling, and rigorous data validation. By following the framework outlined above, developers can build trustworthy, scalable analytics solutions, unlock AI/ML capabilities, and future‑proof their careers in a data‑centric world.

Master the fundamentals of capacity, workspace, and security design; choose the right ingestion tool and data store for each workload; embed schema, table, and semantic validation using Great Expectations or DBT; and leverage Direct Lake for fast, up‑to‑date reporting. Following this roadmap lets you migrate Power BI solutions to Fabric with confidence and positions you for new roles such as analytics engineer, data engineer, or data scientist.
