Introduction to Strata and Clusters

Name: Classroom: Cluster Sampling
Uploaded: 2026-03-03T12:26:50.916567+00:00
Channel: Chisquares

YouTube video ID: Kx9-awI3cew

Source: YouTube video by Chisquares

Strata are internally homogeneous but externally heterogeneous, meaning items within a stratum are similar while different strata differ from each other. In contrast, clusters are internally heterogeneous but externally homogeneous; members of a cluster vary, yet clusters resemble one another. When stratifying, each stratum must be sampled to preserve representativeness, and a person can belong to only one stratum.

Systematic and Stratified Sampling Rationale

Systematic sampling is the one design whose results depend on how the frame is sorted, so the sorting variable should be strongly correlated with the outcome of interest. The same principle applies to stratification: the stratifying variable must be highly correlated with the outcome. Stratifying on an irrelevant variable, such as favorite color in a smoking study, does not improve precision. Stratification can also guarantee representation of specific subgroups, for example minority groups that a simple random sample might miss in a smoking prevalence survey.
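The effect of choosing a relevant versus an irrelevant stratifying variable can be illustrated with a small simulation. This is a minimal sketch on a made-up population: the field names `group` and `color`, the outcome values, and the proportional allocation rule are all assumptions for illustration, not data or methods from the video.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population: the outcome depends strongly on `group`
# and not at all on `color`.
population = []
for _ in range(10_000):
    group = rng.choice(["A", "B"])
    color = rng.choice(["red", "blue"])
    outcome = rng.gauss(10 if group == "A" else 20, 2)
    population.append({"group": group, "color": color, "outcome": outcome})


def build_strata(key):
    """Assign every unit to exactly one stratum defined by `key`."""
    strata = {}
    for unit in population:
        strata.setdefault(unit[key], []).append(unit["outcome"])
    return strata


def stratified_mean(strata, n):
    """Sample from every stratum (proportional allocation) and weight the means."""
    total = sum(len(values) for values in strata.values())
    estimate = 0.0
    for values in strata.values():
        share = len(values) / total
        n_h = max(1, round(n * share))
        estimate += share * statistics.mean(rng.sample(values, n_h))
    return estimate


def estimator_variance(key, n=100, reps=300):
    """Variance of the stratified estimate over repeated samples."""
    strata = build_strata(key)
    return statistics.variance([stratified_mean(strata, n) for _ in range(reps)])


var_group = estimator_variance("group")   # stratifier correlated with outcome
var_color = estimator_variance("color")   # irrelevant stratifier
print(f"variance when stratifying on group: {var_group:.4f}")
print(f"variance when stratifying on color: {var_color:.4f}")
```

In this setup the estimate stratified on `group` varies far less from sample to sample than the one stratified on `color`, which behaves much like an unstratified sample.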

Implications of Stratification

Stratification increases precision, but at a dramatically higher cost, because each stratum is treated as a separate population that must be sampled in its own right. Adding multiple stratifying variables (e.g., gender, race, population density) can generate dozens of strata; a design with 24 strata, a design effect of 2, and a 50% response rate requires about 36,879 participants. Incentives of $30 and recruitment fees of $120 per participant alone total over $5.5 million, and the full study can easily run into tens of millions of dollars.
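The participant and cost figures can be reproduced with a short calculation. The summary does not state the per-stratum base sample size, so this sketch assumes the common textbook setting of 95% confidence, a 5-point margin of error, and a prevalence of 0.5 (about 384 per stratum); that assumption happens to reproduce the quoted total. The function names are illustrative.

```python
Z = 1.96        # assumed: 95% confidence
P = 0.5         # assumed: most conservative prevalence
MARGIN = 0.05   # assumed: +/- 5 percentage points


def base_sample_size(z=Z, p=P, margin=MARGIN):
    """Sample size for one population under simple random sampling."""
    return z**2 * p * (1 - p) / margin**2


def total_sample_size(n_strata, deff, response_rate):
    """Inflate the base size for strata, design effect, and non-response."""
    return n_strata * base_sample_size() * deff / response_rate


n_total = round(total_sample_size(n_strata=24, deff=2, response_rate=0.5))
cost = n_total * (30 + 120)   # $30 incentive + $120 recruitment per participant

print(f"participants: {n_total:,}")
print(f"incentives + recruitment: ${cost:,}")
```

Because every stratum is its own population, the total scales linearly with the number of strata: halving the strata to 12 would halve both the sample and this part of the budget.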

Cluster Sampling Rationale and Characteristics

Clusters are often groups of people living close together, such as households, schools, or neighborhoods. They can be defined arbitrarily, for example by drawing boundaries on a map, and do not need to be naturally occurring. Because clusters are assumed to resemble one another, only a sample of clusters needs to be drawn, and that is the primary advantage of cluster sampling: it saves cost and time. The trade-off is that “clustering” (physical proximity) leads to “clustering” (statistical similarity among members), which must be accounted for through the design effect when calculating sample size. Real‑world clusters also vary in size and composition, which can introduce bias if not managed. Strategies to address this variability include splitting large clusters, combining small ones, stratifying clusters by size, or using Probability Proportional to Size (PPS) sampling.
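A rough sketch of first-stage PPS selection follows. It uses systematic PPS, which is one common way to implement the idea and is not necessarily what the video's software does; the cluster names and sizes are hypothetical. It also flags "certainty clusters" (the transcript's term for clusters so large that they are selected with probability 1) and shows how splitting a large cluster removes them.

```python
import random


def sampling_interval(sizes, n_select):
    """Total population divided by the number of clusters to select."""
    return sum(sizes.values()) / n_select


def certainty_clusters(sizes, n_select):
    """Clusters at least as large as the interval are always selected."""
    interval = sampling_interval(sizes, n_select)
    return [name for name, size in sizes.items() if size >= interval]


def pps_systematic(sizes, n_select, rng):
    """Systematic PPS: step through the cumulative sizes at a fixed interval."""
    interval = sampling_interval(sizes, n_select)
    start = rng.random() * interval   # random start in [0, interval)
    points = [start + i * interval for i in range(n_select)]
    chosen, cumulative, idx = [], 0.0, 0
    for name, size in sizes.items():
        cumulative += size
        # A cluster is picked once for every selection point inside its range.
        while idx < n_select and points[idx] < cumulative:
            chosen.append(name)
            idx += 1
    return chosen


# Hypothetical cluster sizes; "D" dominates the frame.
sizes = {"A": 50, "B": 60, "C": 40, "D": 450, "E": 55, "F": 45}
picked = pps_systematic(sizes, 3, random.Random(1))
print("certainty clusters:", certainty_clusters(sizes, 3))
print("selected:", picked)

# Splitting the large cluster into three parts removes the certainty cluster.
split = {"A": 50, "B": 60, "C": 40, "D1": 150, "D2": 150, "D3": 150, "E": 55, "F": 45}
print("certainty clusters after split:", certainty_clusters(split, 3))
```

With a total of 700 people and three selections, the interval is about 233, so the 450-person cluster is always hit at least once; after the split no cluster reaches the interval.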

Sampling Techniques and Their Impact

Simple Random Sampling (SRS) draws individuals directly from the entire frame and gave the lowest variance of the designs compared here.
Single‑Stage Cluster Sampling selects whole clusters at random and includes every member, saving time but increasing variance.
Two‑Stage PPS Cluster Sampling first selects clusters with probability proportional to their size, then samples a fixed number of individuals within each selected cluster.
Stratified Two‑Stage Cluster Sampling adds a stratification layer, sampling clusters within each stratum before selecting individuals.

Across these designs, cluster‑based methods generally produce larger variance and wider confidence intervals than SRS.
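The size of this variance penalty is usually summarized by the design effect. The formula below is the standard textbook approximation for clusters of equal size, not something stated in the video; the symbols $m$ (people sampled per cluster) and $\rho$ (intraclass correlation, i.e. how similar members of the same cluster are) are introduced here for illustration.

```latex
\mathrm{deff}
  = \frac{\operatorname{Var}_{\text{cluster}}(\hat{\theta})}
         {\operatorname{Var}_{\text{SRS}}(\hat{\theta})}
  \approx 1 + (m - 1)\,\rho ,
\qquad
\text{CI half-width}_{\text{cluster}}
  \approx \sqrt{\mathrm{deff}} \times \text{CI half-width}_{\text{SRS}}
```

For example, a design effect of 2 means the cluster design needs twice the sample size of an SRS to reach the same precision, or equivalently that its confidence intervals are about $\sqrt{2} \approx 1.41$ times wider at the same sample size.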

Confidence Intervals and Variance

A confidence interval is a range around a point estimate within which the true population parameter is likely to fall. A 95% confidence interval means that if the study were repeated an infinite number of times, 95% of the resulting intervals would contain the parameter. Variance directly determines interval width: higher variance yields wider confidence intervals. Because cluster sampling inflates variance relative to SRS, it also widens confidence intervals, reducing the precision of estimates.
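The repeated-sampling interpretation can be checked by simulation. The sketch below assumes a true smoking prevalence of 25% and simple random samples of 400, and uses the simple normal-approximation (Wald) interval; all of these are illustrative choices, and 2,000 repetitions stand in for the "infinite number of times" in the definition.

```python
import math
import random


def wald_interval(successes, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def coverage(true_p, n, reps, rng):
    """Share of repeated-sample intervals that contain the true parameter."""
    hits = 0
    for _ in range(reps):
        successes = sum(rng.random() < true_p for _ in range(n))
        low, high = wald_interval(successes, n)
        hits += low <= true_p <= high
    return hits / reps


rate = coverage(true_p=0.25, n=400, reps=2000, rng=random.Random(42))
print(f"share of intervals containing the parameter: {rate:.3f}")
```

The printed share should land close to 0.95. Any single interval either contains the parameter or does not; the 95% describes the procedure over many repetitions, not one particular result.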

Practical Exercise Using the K‑qu Platform

Using the K‑qu software, several sampling designs were applied to estimate the mean of Metabolite X:

SRS on the combined frame produced the smallest variance.
Single‑Stage Cluster Sampling selected clusters randomly and included all members, resulting in higher variance.
Systematic Sampling sorted clusters by population size before selection, also showing increased variance.
Two‑Stage PPS Cluster Sampling chose clusters proportionally to size and then sampled individuals, further widening confidence intervals.
Stratified Two‑Stage Cluster Sampling stratified clusters, then sampled within strata, delivering the greatest variance among the methods tested.

These results illustrate how sampling design choices affect both the point estimate and its associated uncertainty.
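A comparison of this kind can be imitated on synthetic data. The sketch below does not use the platform or the Metabolite X data from the exercise: the population, the cluster sizes, and the variance components are invented so that people in the same cluster resemble one another. It compares SRS with single-stage cluster sampling at the same total sample size.

```python
import random
import statistics

rng = random.Random(7)

# Synthetic population (hypothetical values): 200 clusters of 20 people.
# Each cluster shares a random shift, so its members resemble each other.
clusters = []
for _ in range(200):
    shift = rng.gauss(0, 3)
    clusters.append([50 + shift + rng.gauss(0, 3) for _ in range(20)])
everyone = [value for cluster in clusters for value in cluster]


def srs_mean(n):
    """Mean of a simple random sample of n individuals."""
    return statistics.mean(rng.sample(everyone, n))


def cluster_mean(n_clusters):
    """Mean over all members of randomly selected whole clusters."""
    chosen = rng.sample(clusters, n_clusters)
    return statistics.mean([value for cluster in chosen for value in cluster])


reps = 1000
# Both designs observe 200 people per sample.
var_srs = statistics.variance([srs_mean(200) for _ in range(reps)])
var_cluster = statistics.variance([cluster_mean(10) for _ in range(reps)])

print(f"variance of the mean, SRS:     {var_srs:.3f}")
print(f"variance of the mean, cluster: {var_cluster:.3f}")
print(f"empirical design effect:       {var_cluster / var_srs:.1f}")
```

With the same number of people observed, the cluster design's estimate varies much more from sample to sample, which is what produces the wider confidence intervals.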

Takeaways

Strata are internally homogeneous but externally heterogeneous, so every stratum must be sampled; clusters are internally heterogeneous but externally homogeneous, so a sample of clusters is enough.
Stratification improves precision but can increase survey costs dramatically, especially with many strata and high participant incentives.
Cluster sampling saves time and money but typically raises variance, leading to wider confidence intervals compared with simple random sampling.
Probability Proportional to Size (PPS) and stratifying clusters by size are effective strategies to manage variability among clusters.
Using the K‑qu platform, practical exercises show that cluster‑based designs consistently produce larger variance and confidence intervals than simple random sampling.

Full Transcript

so which are strata and which are
clusters let's
say this is a and this is B right ignore
ignore the text below so um at this
strata or this cluster some someone
should explain
why any given answer is
correct so nobody
or you can read what's there and tell me
why and explain it to me your own words
why um how how what you understand from
from that designation strata
cluster am I am I on
mute we can hear you okay I know that's
a trick question I know you can hear me
very
well
I I think this is the difference here is
in the characteristics of the the items
within the cluster or the strata so for
strata they
are more likely homogeneous and in
clusters they are
heterogeneous exactly so I want someone
to just describe what they can see
forget about technical terms and you can
the current Speaker can go on like what
can you literally see on the screen
without any technical
terms yeah in in in in
Strutter each strum has
similar um similar items or similar
people and although different stru have
different items but each stru has um
similar similar similar people similar
items exactly and here then in clusters
they they are different each Str is
different as in each Str sorry each
cluster each cluster is
different but
within within within cluster there are
different uh kind of
items okay one when you compare one
cluster to the other what do you
see
recording oh okay
um
so the there are different items the
Clusters actually
similar compare different
clusters So in theory when we compare
two clusters are they the same or not
all the same yeah in theory they may
possess certain
characteristics generally but there may be
some
differences so in clusters in
simple terms when we compare two
clusters oh they're the same in strata
when we compare two strata they are
very
different so in technical you know
jargon we say that um strata are
internally homogeneous
but externally heterogeneous meaning
that when we take one one stratum to
compare with another stratum they are
worlds different all right this one has
only brown people this one has only
green people this one has only black
people one has only um you know white
people I don't know what that color is
so from one stratum to another stratum
to another stratum they are very very
different right and so if you want to
sample and get the representative sample
of people that are black and white and
brown and green what what does that
imply then that you must
do from what you can see there what is
the apparent implication of the fact
that the strata are completely different
if you want to get a
representative sample of everybody then we
must do what draw from each exactly so
once you stratify you must you must draw
from each stratum each each of a divided
strata
right can one person belong to two
strata just from the figure this is not
this not a trick
question can you belong to strata no no
you can't everybody belongs only to one
stratum and one stratum alone so the
rules for stratification are pretty
simple one person can only belong to one
Straton right and if you divide people
if you break into strata you must sample
from each stratum otherwise your sample
will not be
representative so that's that's the
that's the gist of that's all you need
to know about strata really all right um
we talked last week about uh
systematic sampling and we said that
that is the only sampling that is
influenced by what you sort on and who
can remember what was said is the
guiding rule for
determining what variable we choose to
sort our our data frame on when we do
systematic sampling that that certain
variable must be what in relation to the
outcome who remembers
I think I have an idea okay uh the sorting
characteristic must have a correlation
with the outcome we are looking for what
kind of correlation mild strong
moderate strong strong it must be strongly
correlated with the outcome the same
thing applies here with stratification
right because think of it right you can
stratify on many variables you can
choose to stratify on gender or race or
gender and race right whatever you're
stratifying on must be highly correlated
with
outcome you can't just say oh I'm going
to stratify on I'm doing a study of
smoking and so I'm going to stratify on
people's favorite color it's like okay
how is that related to smoking the
outcome it's not right but then you
might say oh you know studies show that
race and ethnicity are associated with
smoking and you know certain minority groups
that are very important for our study we
don't have enough population we don't
have enough sample we might not sample
them if we just did a random sample so
let us stratify on race so this is
the thinking that you have to go through
um when when when deciding right and so
if if you ever do get into you know
consulting for large surveys well people
will give you very fantastic ideas on
what you should stratify on it's
your job as the expert to say okay
thanks a lot but that doesn't
really make sense you know there are
there are implications to stratifying
right the caveat here is that of course
you have increased Precision but you
have also drastically increased cost
because once you stratify that means
that you are taking each
of the strata as a different population
of its own so instead of having just one
big population now you have four of them
you have to worry about right so there's
there's actual cost so um so when I
speak to clients it's very simple I
just tell them do you know there are real
world implications to this suggestion
you're making you know it boils down to
how much money you have to pay right
there are cost
implications to these things we don't
just do them you know you can't just
suggest them and once you tell them
the cost implication they're like
let's just go with an unstratified
sample right or let's reduce the number
of strata because you know um
when you present a deal to people
that has a way of bringing them back
to reality
so that that's the issue about
stratification um from what you see here
right from Common
Sense we said here that when you
stratify we have four strata we must
select we must sample from each of them
just from Common Sense first principles
do you think that we must sample all
these
clusters why or why
not and I'm not asking for a technical
explanation just from Common Sense do we
need to sample from all the
Clusters think
no the Clusters must be homogeneous
sir that's a technical answer just a simple
answer from what you see on the
screen do you think it makes sense or is
there a need for a requirement that
we sample all the clusters yes or no
and why no because they are
similar okay so if we go by the principle that they
are similar what might yeah sorry um
Jimmy you were saying
something all right maybe not so good
evening sir I was
saying
characteristics okay they have similar
characteristics so um so if if we went
by the principle of okay we have a big
pot of soup and we want to take a sample
and we argue that if we stir the soup
well enough all parts of the soup are
similar and that's the basis upon which
you can just take a small sample right
and say okay this soup sucks or this
soup is fantastic right so on the
principle that all the Clusters are
similar what is the basis of the
argument so why would you then argue or
what would be the approach in this case
then the premise is that all the
clusters are similar therefore
what's the
conclusion pick one one is
representative of the others okay maybe not
necessarily maybe not just
one sample clusters sample clusters and
sample from
within exactly so you see that's the idea
right the clusters are similar to each other
therefore we don't have to go and Sample
all the Clusters we can just take a
sample of them and then that should be
enough so that's the whole idea behind
clusters you know and stratification so
um and and why why why why what is the
basis for saying let's just take a
sample in the real world why why why are
you saying
that we do things for certain reasons
right so what's the base for saying okay
let's not go and take everything let's
just take a sample
why sorry
we don't have enough money for exactly
it's not complicated it's because we
want to save time want to save money we
don't have the money nobody has the
money to go and take all the all the
clusters
right so these things are not like
they're not some esoteric stuff they're
just applied stuff in day-to-day life
that okay what's the budget okay we
don't have that kind of money so what do
we do how do we of course we want
scientific validity but also within the
budget we have right and so that's it's
just it's not nothing nothing um out of
the sky it's just common sense so that's
why we take um that that kind of
sampling all right
um all right so clusters you know
clusters are generally people living you
know in close proximity clusters could
be anything it could be
schools it could be um a cluster of
households it could be um any
context where people are together is a
cluster so um you can define your
cluster however you
want it doesn't even have to be it
doesn't have to be natural
clusters I can take a map and say okay
we're going to Define these clusters
this way so they don't have to be like
naturally naturally occurring we can we
can however you can Define the Clusters
yeah that's fine you can Define however
you want but it's just a way to build in
efficiency in your
sampling again we've talked about this
we do that because of you know costs and
time we've talked about design effects
um the fact that people live in the
cluster means that they are similar to
each other um and so the
inherent joke in there is
that the people in the cluster
because they're in a cluster there's
clustering you know
those two words are quite
different right um cluster in that they
live together but then
clustering as a statistical consequence
of that physical clustering I don't know
if that makes sense to you but I find
that very
funny okay that's just a statistical
joke if if you got it great if not
that's fine too all right so we have to
account for that um that design effect
when we when we calculate our sample
size for our
study all right now here in this picture
we made a very fine cute argument that
um the Clusters here are the same size
and everybody here is exactly you have
two brown people two black people uh one
white person and two green people right
every cluster is exactly the same do you
think that happens in the real world
like are you ever going to find clusters
like
this no what do you
think
likely almost almost impossible in fact
not only that each cluster has one two
three seven people right these are
perfectly sized clusters they're
all exactly the same in both size and
composition of course you're never going
to find anything like that in real world
this is how real world clusters look
like some are very right and in fact I
should actually have changed the color a
little bit but you know but this is more
like the reality in the real world that
clusters are very varying in size they
look nothing like each other in terms of
um you know composition think of like if
you're looking at um counties
or or neighborhoods right of course not
all not all neighborhoods are the same
some neighborhoods are very rich um in
some parts of the country you just cross
one main road and it's like you live in
a different world like you just crossed
from a very very wealthy neighborhood to
a very very poor neighborhood obviously
those people not the same right and even
in size you might have very large
clusters as well as very small clusters
so in this case just taking a random
sample of the clusters may not necessarily give
us results that are unbiased that's what
we're saying and and the reason is
because those clusters do not match what
we expect in the ideal context in the
ideal context we say the Clusters are
almost like clones of each other but
this ones are not clones of each other
they are they are very very you know
heterogeneous in terms of how they look
right and so that
variability between clusters might
introduce some bias
if we were just to take a random sample
of the of the Clusters so what then do
we have to
do um we can we can so common sense you
have this picture
right the Clusters are not the same and
we don't we don't want that right we
want to bring it as close to the ideal
as
possible just from Common Sense what do
you think you can do
here you are given the license to do
anything you want
from Common Sense what can you
do again this is not a trick question
it's just like think of it I think of it
just from simple Common Sense what what
could be
done and choose a number of
clusters say like three out of
seven okay um okay that's okay that's
one okay what else could be
done your task you're given balls and
you're told you can do whatever you want
there's no limit to what you can do your
job is to make these balls about the
same yeah I
sample the the same number of people
from each
cluster so that um the larger clusters
the participant have
um lesser probability of being selected
compared to the smaller clusters okay
thank you now you're all giving us
technical answers
I just want I'm looking for something
very simple here yeah so here I I think
like the bigger ball here we can
actually reduce
it or we choose or we choose these two
in the middle that look almost
equal okay I I prefer the the previous
example I prefer earlier route you were
going so let's build on that one
right so let's retrace so can you can
you repeat the the first part of what
you said yeah yeah so I said that bigger
ball to the right can we can try attempt
reducing it so that uh it's smaller
maybe similar to this in the middle or
we go with this two in the middle okay
so um but we go with the two in the
middle is that
representative uh not really okay not
really so let's stick with that first
idea right and let's build on that one
that this ball is big how about we split
it into two
balls that's great that's one idea okay
what other idea could we
do
combine exactly how about these
three guys here right how about we
combine them into one ball
right that could work too right so we
have come up with two excellent
suggestions here we take the very big
balls and split them right because um
if we were selecting with probability
proportional to
size what is the problem with this ball
here if if the probability of selection
is proportional to size as IIA suggested
right what problem do we run into with
this big ball where I oops
okay if we are saying that if you are a
bigger ball you have your probability of
being selected is higher what do you
think is the most likely consequence
with this big ball
represent
representation okay and somebody else
was speaking yeah you're on the right
track somebody else was saying
something yeah more more people will be
selected from the big ball compared to the
other smaller ones okay so now let's not
worry about the people inside inside the
ball right we're just saying we want to
pick a cluster first and then we select
people in the cluster so let's just
focus on the first part if we are
selecting clusters themselves with
probability proportional to size what is
the what will most likely happen
here more likelihood of selection being
selected okay do you think we're still
in the world of likelihood at that
point oh yeah yeah no so if it's not if
we're not talking about Pro so things in
life are either probability or a
what non
probability okay so what do we
call things that are
non
probabilities
chance um
[Music]
now okay what are some things that are
not you can look at in life what are
some outcomes in life that you say
it's not an issue of chance they
will definitely
happen
sorry gr growth well maybe growth is
that's
that but say death right so we say death
is a
what
yeah no no I had I
certainty
right events
in that person is giving me a headache
literally please can you mute yourself
if there's background noise around you
please if you don't
mind thank you so if we say that the
probability of this cluster is no longer
a probability right because it's so big
what do you think is the name we give to
this kind of things in statistics if a
cluster is so
big that we know that it will be
selected 10 times out of 10 we call it a
what a sure
event yeah but about clusters now we'
already talked about the name
already what do we call certainty
what it's a certainty cluster it's not
you know it's not a trick question so
those are called certainty clusters or
certainty PSUs PSU simply means primary
sampling unit which is the first selection
stage so the first selection stage is
clusters so that's why clusters are also
called PSUs primary sampling units so we
call these certainty PSUs do you think
that's a good thing bad
thing is bad why is it
bad because there already
by yeah there's no probability of
selection right we said you know random
selection is where everybody has a non-zero
probability of selection right
probabilities in general sampling
must fall between zero and one right but
generally not include zero and
one I
agreed so a certainty PSU has a probability of
one and that's not a good thing
we want probabilities that are between zero
and one not inclusive of zero
and one so that is why this is a problem
for us so let's go back to the ideas we
had we said we have clusters of varying size we
will break down this
cluster we'll combine these small
clusters so that they are about the same
right so that now we are back to around
this ideal structure um ju please okay
youve muted yourself already so that's
that's one suggest the two suggestions
we had what other suggestions could we
could we follow
along that is obvious remember our goal
is just to make sure we sample in a way
that is representative of the population
right
so out of a box like what can you think
about think of let's say you have um
let's let's use the case of soccer for
example you have under 17 World Cup you
have I don't know the other ones I know
there's under 17 and then you have the full
World Cup why do you think they divide
it into those kind of um
leagues I don't know soccer fans in the house
like you guys know
about
sorry so to accommodate the age range
different age groups okay so
building on that principle what else
could we do
here I think we can group them exactly
like we can group the balls and select from
each group so we can group them right
and then select so we group them first
and then select those initial groupings
are called
what is it
stratification excellent so the initial
groups first all stratify
them into maybe small medium and
large and then within those strata we
can then
sample does that make
sense yes so that's one option we've
talk about many options now right said
that sampling is both a a sign and an
art this is where the art comes in like
you know what however you want to do it
there's no wrong not it's not wrong it's
like what's what's your preference but
you have to know the all the possible
options at your disposal that you could
leverage so we talk about Sir go ahead I
just know best
practices this is all science so it
depends on what the best practice will
be determined by what you have on
ground plus the word best you know best
practice is also a word actually I'm not
a fan of that word what what does best
practice
mean best practice and best evidence the
same as the same
thing uh best practice can be subjective
but it's agreeable to a large number of
experts so do you think does that does
that make it acceptable or does that
make it
valid it can be valid over a period of
time if he has
tested
reliable so when B when best practice is
subjected to testing and then it's
accept it's proven to be valid it moves
from a category of best practice to
what evidence based the best evidence
exactly right but as far as it's still
best practice right is it valid in of
itself what do you think
[Music]
it's valid pending the
time moving goalposts like that it's
acceptable it's acceptable but that
doesn't necessarily mean that it is valid we
don't know right so my point is that
sure and I'm not against best practice
but I'm saying best practice is just a
bunch of people coming together and
expressing their opinions and saying oh
let's this is how we've always done it
so um what should drive it your choice
should be the realities on the ground
what you're comfortable with you know um
because they're all acceptable all
of these things are scientifically valid so
that you know at that point it's like
okay what do I how do I want to do it no
no two surveys are the same in terms of
sampling or approach so my point is that
um you you want to you you should be
open-minded Right In terms as far as
sampling is concerned um and realize
that when you read a piece of work and
somebody say we sample that way um there
shouldn't be a knee-jerk reaction to say oh
that is wrong um because at the end
of the day it's like sure
there's an art form to it too
that's just a point I'm trying to make
that you know you bring 10
statisticians and ask them to sample from
the same population you'll see 100
different methods they're all right you
know of course as far as they
follow certain principles um so in this
case we said we can break big clusters
down we can combine smaller clusters we
can break them into into segments of
strata and then sample from each strata
and what else could we
do I think mentioned the other one which
was PPS we can we can do sampling with
probability proportional to size um
which um is another approach that we can
do so um so all of this this is just
what we're saying here right that some
of these things may sound
technical but they aren't it's just common
sense this is all that we've discussed
in one slide here how do you reduce
variability in size among clusters you
can stratify clusters by size which
we've discussed right
small medium large you can combine
smaller clusters into larger ones or
break larger clusters into smaller ones
or you can use PPS
sampling
um all right so we'll talk about PPS sampling
which is what is implemented on the
kqu platform all right let's talk about
stratification and what it entails and
then I any questions let me see I see
some
um okay um Peter says will they produce
the same results H that's one of the
questions in your
homework take a random take convenience
um do repeated cluster sampling
and the question is do they produce the same result
so please
can Danel can you mute yourself
please Danel can you
pleas please we really do not want to
listen to your private
life okay can mute everybody and then
remove maybe
you I try to
get Daniel please can you mute
yourself all right thank you
okay what was I
saying um okay stratified sampling so
that's a good question right will they
all produce the same results now it
depends on what you mean by the same
results can you get the same results
from any two sampling even if it's
simple random is it
possible what do you think is it
possible to get the exact same results
whenever probability is
involved the chances are slim to get the
same
results yeah it's almost zero right and
that's why we need what and that's why
we present our point estimates
with
what confidence interval
sorry confidence interval excellent
that's why we present results with confidence
intervals right to account for the fact
that yeah you know it's
it's this is probability involved and
what is what what what does confidence
intervals mean let's you know um you
guys have been taking these classes for
long enough to know these things like
the back of your hand what is the
confidence interval what does it
mean oh
explain to me as if you explain to your
grandmother what do confidence
intervals in the context of surveys
mean the chances that other factors are
responsible for the outcome of
the okay of the study um no another
trial thanks good
it means that I am sure of this results
maybe up to 90%
sure okay the other 10 I'm not I can't
vote for
it okay okay maybe if the the test is
repeated I'm 95% sure that you get the
same
result I'm 95% sure that
what is it indicating strength of
Association okay Association associ okay
that that's a good trial but in this Cas
we're not looking at Association if I
said the smoking prevalence is 50% right
I'm not I'm not NE looking at
associations I'm just saying this is the
prevalence of smoking in this population
50% so a confidence interval so you you
are half correct in the context of yes
go ahead a confidence interval um so in
95 5% confidence interval shows that if
the study is repeated 100 times
following the same procedure
95% all the time the result is going to
fall within that range the result what
do you mean by the result you you guys
all have some elements of Truth so if
I'm go
ahead okay uh I
think from my point of view I think
confidence interval can be
in which we get from the data where we
just
estimation
parameter yeah you mentioned a very good
word I wanted to hear parameter now
somebody should take so all of you have
said something that has some element of
truth but nobody is completely true I
want somebody to now take this different
answers and fuse it into one correct
answer for all of
us Point
estimate
okay now some of you said today is that
we're 95 so here is the wrong part of
what you said I'm going to point out the
wrong aspect of the different comments that
were
made so one person said that it is the
measure of
association the wrong part of
that answer is that we're just looking
at a point estimate we're not
comparing two variables we say the prevalence
of smoking was 25%
with a confidence interval between 20 and
30 so we're not saying what is smoking
and lung cancer so that is why that
answer is wrong because it's not it has
nothing to do with
associations the other answer was wrong
that said if we took this survey 100
times
then 95% of those times the result will
fall between this and that the reason
why that is wrong is that the assumption
is not taking it 100 times the
assumption is that if we took this
sample an infinite number of times which
means if we spend the rest of our lives
taking this sample over and over and
over so it's not it's not taking a
sample 100 times it's taking a sample an
infinite number of times that is why
that answer is
wrong the other person said that it is
the likelihood of other or
alternative explanations accounting for
our result again that is also wrong
because we're not comparing other
variables yet we're just looking at this
point estimate right um somebody else
said that it is that we are 95% sure
that our result is accurate um that is a
bit close
but it doesn't
contain a keyword which was supplied by
the last you to find
it is it talking through
us
okay all right so now that so I'm
looking so whatever answer you give must
contain this words it must contain the
word
parameter it must contain the words an
infinite number of times
right it must contain the word estimate
so your answer must contain
estimate parameter infinite number of
times and
95% so those four things I want them I
want somebody to put them all together
and give us an explanation of what the
confidence interval is and again that's
we're just trying to answer this
question it was a short question the
question was are the results going to be
the same this is rather a long answer
but you really need these are the basics
of you really need to understand this
fundamentals so that's why we're giving
a very long answer to a short
question so who wants to volunteer and
take all of those four words we said I
didn't even
know
sorry okay so who wants to
bite the bullet and give us what
confidence intervals mean in the context of
surveys I've already given you the hint
give me the four keywords I want to
see Let me let me give it a try
excellent yeah so uh a 95% confidence
interval means if the the sample is
taken an infinite number of times that
we're 95% sure that the point estimate
or 95% of the times the the parameter
estimate will fall within uh that range
you are 99% correct the only wrong part
is where you used the word
parameter estimate there is no such
thing you are combining two different
things in
one okay let me try Okay go ahead we are
95% sure that if an estimate is repeated
infinite number of time the
parameter would
the parameter of interest will remain
the
same okay um you are a little further
from what Moses was Moses Moses you were
almost there you're also right you know but
Moses was closer to the truth Moses can
you reframe it and this time just make
sure you don't say parameter estimate please
estimate of the
parameters okay
so who is putting it together for
us
can go ahead
Henry okay we're saying that if the study
is conducted an infinite number of
times um the the parameter is going to
fall within the um estimated range of
the
um of the
of the estimate um the range range of
the okay all right those are all good
attempts so what what we're saying is
simple right let's just put it on here
so that we all so we have an estimate of
9.0% with confidence intervals of 2.0
to let's say
15.0 that is what we have
right let me make this
bigger can you all see my screen and
this is the point
estimate so the point estimate here is
9.0 right and the confidence intervals
are 2.0 to 15.0 right so we're saying
that if we took the study if we
conducted our sample an infinite number
of
times
right what what is the parameter in this
case
who can tell us what the parameter
is 9.0
no 9.0 is a
what 9.0 is a point estimate the
parameter is what 9.0 represents so
what's the
parameter don't know excellent who who
is the only person that
knows I know what the parameter
is what we do know is that if we spend
the rest of our lives an infinite number
of times taking the sample over and over
and over and over again 95% of the
times the parameter will fall between
what the estimates no give me the number
I want to make sure you know talk about
2.0 to 15.0 exactly so that's all we're
saying that's what confidence intervals are so if
we took this an infinite number of times
over and over not 100 times an infinite
number of times then 95% of those times
the parameter which we don't know we are
just sure that it's going to fall
between 2.0 and
15.0 right so that is what that is what
the confidence intervals mean so again we all
use confidence intervals every day you have
to know what these things mean it's it's
it's unforgivable not to know
them how how do we use the word
estimate sorry how do you use the word
estimate no that that was that was just
a trick
question
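The "repeat the survey over and over" idea can be tried out in a short simulation. This is a minimal sketch with made-up numbers: the true prevalence of 25%, the sample size of 400, and the function name `coverage` are all assumptions chosen for illustration, and a computer can only stand in for "an infinite number of times" with a large number of repeats. In the standard frequentist reading, the parameter stays fixed and it is the interval that changes from sample to sample, so the sketch counts how often the computed interval contains the fixed parameter.

```python
import random

def coverage(true_p=0.25, n=400, repeats=2000, seed=1):
    """Share of repeated surveys whose 95% interval contains the true prevalence.

    true_p is the parameter. In real life nobody knows it; here we set it
    ourselves so that we can check the intervals against it.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        # One survey: ask n people, count the smokers.
        smokers = sum(rng.random() < true_p for _ in range(n))
        p_hat = smokers / n                      # point estimate
        se = (p_hat * (1 - p_hat) / n) ** 0.5    # standard error of a proportion
        lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
        hits += lo <= true_p <= hi               # did this interval capture the parameter?
    return hits / repeats

rate = coverage()
print(f"intervals containing the parameter: {rate:.1%}")
```

The printed share should sit close to 95%, and it gets closer to the nominal value as the number of repeats grows.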
okay I have a trick question there
somewhere yeah um how do you have a
question your hand is up yeah uh I have
a question can an example of a parameter
be like a P value or I mean I'm just
trying to visualize like
how yeah this could def so let's say
smoking prevalence that's what we're
trying to measure
right that's what that's what we our
study is interested in the percentage of
people who
smoke the parameter maybe God knows that
parameter the parameter is 25% that's
what but only God knows that we don't
know that value and that's why we we
call our own um smoking prevalence
estimate
it's an estimate because it's trying to
estimate the parameter which we don't
know so the the parameter is the the
actual value of this construct we're
trying to measure this outcome we're
trying to measure but we don't know what
it is nobody knows what it is that's why
we have to resort to the estimate
because the estimate gives us a crude
value of what the parameter is but we
want some certainty about that okay how
good is this estimate is it is it any
good well that's why we need confidence
intervals are we all clear on
that yes I I want to take you back a
little bit sir
sure okay this
design effect and our confidence
intervals
the design effect and the
relationship between that effect and the confidence intervals
and how we arrive at the
figures okay um all right so oh okay I
think I hear I think I understand what
you're saying you're saying what's the
relationship between the design effect
and the confidence
intervals if I heard it
right so
one of the
consequences of using cluster
sampling is that it increases the
variance right so and you know what hold
that thought we're going to that's one
of the that's exactly the question one
of the questions that actually in the
homework you asking it back to me that's
the same question as so we're going to
talk about this in moment so let's hold
that question but let me write it down
so that you can see instead of me
explaining it we're going to see how it
comes out in real you know you know in
real time but the question is what is
the effect or what is the
relationship between the design
effects and the confidence
intervals okay and so like I said the
key word here you have to
watch out for is variance so if I
forget this if I don't mention this at
the end please drag me back and say hey
we forgot to talk about variance so
that's what we're going that's what's
going to
um that's going that's the key to unlock
all of that so um going back to the
slides okay we've talked about SRS and
stratification already but the reason
why we do stratification is to allow for
adequate sample sizes it's purely an
efficiency move we want to make sure we
have adequate sample sizes among the
groups we want to capture that is why we
use stratification in surveys and it
also allows for flexibility for sampling
design so once you stratify oh you are
free to do whatever you want in any
given stratum if I've stratified this
population here I can do one type of
sampling here I can do another sampling
approach there it allows
flexibility for you to do whatever you
want within any particular um strata or
stratum so again this is just a table to
show you the difference between um
stratification and clustering in
terms of internal structure
strata are internally homogeneous
clusters are internally heterogeneous um
and then the mutuality for
stratification they're mutually
heterogeneous while in cluster
sampling they're mutually
homogeneous then representation every
stratum must be represented in the sample
but in cluster sampling not every
cluster is represented the impact on
variance
um with with stratification you have a
decrease again this is very key to the
last question I was asked the question
was how does this impact confidence
intervals well this is your answer here
the impact on variance now from common
sense when we say
variance what what do you just forget
about statistics now right if you say
variance what does that mean what what's
the first thing that come to your mind
difference variation variation so if we
say large
variance can you link that to confidence
intervals large variance will from
common sense right large variance will
result in
what a large confidence interval
if you have a large variance it results in a
large confidence interval if you have a
small variance it results in a narrow one so
we said that cluster sampling
increases the variance relative to SRS
so um can you then answer your own
question right yes is what will cluster
what what what's the impact of cluster
sampling on on confidence intervals will
it increase the confidence interval or
will it decrease confidence
interval it will increase it
why because it results in an
increased variance and we're going to
see um part of the homework
was for you to demonstrate to me how
and why cluster sampling increases
variance relative to uh SRS and it also
ties in very nicely with the question
somebody asked the question was
this um let me see the question that was
a very good question the question this
was question by Peter he says will they
produce the same results that is a very
very important question right will they
produce the same results so what Peter
is saying is is the what is the variance
we
expect if if to going back to Peter's
question if they don't produce the same
results what's another way of saying
that with this with the word variance in
it if we took the sample over and over
with cluster sampling and each time we're
getting a result that is drastically
different we can say the results have
what deviation I want the word variance
in it high variance high variance they vary a
lot this is not complicated it's just
simple English right if we took a simple
random sample and every time we get results
that are very close to each other we say
they have
what low variance exactly so that's what
we're going to demonstrate now any
questions on this we're done
with our slides I believe okay
um well this we can
ignore this all right any any questions
before before we move on
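The link between variance, the design effect and the width of the confidence interval can be written down in a few lines. This is a rough sketch, not the platform's calculation: it assumes the usual reading of the design effect as the factor by which a design inflates the variance relative to SRS, and the prevalence of 25%, the sample size of 1,000 and the helper names `ci_95` and `width` are illustrative assumptions.

```python
import math

def ci_95(p_hat, n, deff=1.0):
    """Approximate 95% interval for a proportion, with the variance
    multiplied by the design effect (deff = 1 means simple random sampling)."""
    se = math.sqrt(deff * p_hat * (1 - p_hat) / n)
    return p_hat - 1.96 * se, p_hat + 1.96 * se

def width(ci):
    return ci[1] - ci[0]

srs = ci_95(0.25, 1000)                  # simple random sample
clustered = ci_95(0.25, 1000, deff=2.0)  # cluster sample with a design effect of 2

print(f"SRS:     {srs[0]:.3f} to {srs[1]:.3f}")
print(f"cluster: {clustered[0]:.3f} to {clustered[1]:.3f}")
print(f"width ratio: {width(clustered) / width(srs):.3f}")
```

Because the standard error is the square root of the variance, doubling the variance widens the interval by a factor of the square root of two, not by a factor of two.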
so now we just want to go and tackle the
homework how many of you tried
it I
did only the simple random
sampling question too okay fair
enough all righty so let's let's go back
to that and let's try it
together okay um where do I find this
now okay competency
tasks can you see my
screen yes can all right great L is
gonna cover the second part L over to
you
please no comment that's
surprising all right
um so you are a consultant
um to conduct a survey and we are
supposed to fill in
this this table so
let's
sorry okay so let's take the okay since
you have done it before let's just take
your your um well let me just do it let
me not let let let me not be lazy I
wanted to just take the results from
from if since i' already already done it
um
what what results you get if let's let's
just take your
results are you still
there all right let's just do it let's
just do our own sampling then since
he's not here so for the
simple random sample the question is
um from the combined sampling
frame which is this one
we are supposed to take a sample so
let's pull the data out into a new
spreadsheet and let's save it as a
different sampling
frame
um so let's save this
somewhere as um sampling frame
combined all right now let's go to the K
platform and do the sampling for those
three um groups we
need and let's look at the issue of
variance so
sampling that's per probability sampling
on the sampling
frame let me up
this uh lecture
Series so if um what did you get for
your results
so simple random sampling we want to
select um 100 people so we we've
uploaded our sampling frame first we
have
29,769 rows so first we want to select
100 people we get
sample with simple random sampling these
are our people here let's just um let me
just download this we will download that
result all right it's downloaded now
let's um open
it so we are supposed to find the mean
of metabolite X so let's just take a
simple mean in
Excel the average of metabolite
X so the first one gave us a value of
33.712183
let's get
sample all right let's download
this as
well all right
downloaded let's open that
five the result from this second one is
35 can you guys see my
screen yes yes okay
great
35.528
and then let's sample
three unfortunately there's no other way
to learn these Concepts than to do it um
so if you want to learn them you just
have to do it um the next one was how
many people uh let's see
10,000
people so go back there sampling we
resample this time we want 10,000
people and we get
sample all right something is complete
let's download this too
all
right
six again we take the
average of metabolite
X answer is that's 4.2
69 and then um last is sampling with
20,000
people
resample with 20,000
people is anyone who confused by what
we're
doing is it very
clear well I'll take that it's clear
download all right let's
seven and this
average
um of this to
this so we have 34.4
329 all right so we we're done with the
mean value of this based on simp sample
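The same exercise can be repeated outside the platform. The sketch below is only an illustration: the frame is synthetic (29,769 made-up metabolite X values drawn around 34), so the means it prints will not match the ones read off the screen in class, and the names `frame` and `srs_mean` are hypothetical.

```python
import random
import statistics

rng = random.Random(2026)

# Hypothetical sampling frame: 29,769 synthetic metabolite X values.
# These are NOT the class data, only a stand-in with the same number of rows.
frame = [rng.gauss(34, 10) for _ in range(29_769)]
population_mean = statistics.mean(frame)

def srs_mean(n):
    """Mean of a simple random sample of n rows, drawn without replacement."""
    return statistics.mean(rng.sample(frame, n))

# Same sequence as the homework: 100 people twice, then 10,000, then 20,000.
results = {
    "first 100": srs_mean(100),
    "second 100": srs_mean(100),
    "10,000": srs_mean(10_000),
    "20,000": srs_mean(20_000),
}

for label, mean in results.items():
    print(f"{label:>10}: {mean:.3f}   (frame mean {population_mean:.3f})")
```

The two samples of 100 will usually differ from each other more than the two large samples differ from the frame mean, which is the variance point the homework is driving at.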
then we're told to do a single stage
cluster sampling with simple random selection
of five
clusters single stage single stage
selection as the name implies means that
there's just one stage involved when we
talk about stages we mean the stages
where sampling is involved so if you now
get to a point a stage where you're not
you're no longer
sampling what what is the opposite of
sample if it's not a sample it is a what
if it's not a if you're not taking a
sample you're taking
what so life is either all or none
or a sample right all or none or a sample so
if it's all what do we call
that population census a
census exactly right so if you are
sampling and you get to the
point in your
selection where you now take a census
that's no longer sampling
that's no longer a stage so the
stages are the stages where sampling was
involved so a single stage cluster
sampling means that in the first stage
we sampled clusters and then in the
second step we took
everybody within the clusters that were
selected does that make sense so step
one we select clusters step two we
select
everybody so here the task says single
stage cluster sampling with simple random
selection of five clusters so it says we
should select five clusters how many
clusters do we have in this in this
community we have let me delete some of
this we don't need them anym don't
save look that any questions in the chat
box no there just an answer 10 exactly
we have 10 of them right um so and we
we're supposed to select a simple random
sample of how many
clusters five of them that's what the
question says right so let us say that
they um the the ones we need to select
are let me write down the list of
the
um jeez Tinyville so we have Tinyville
so in simple random sampling you have to first of
all write out all the clusters you're
selecting from so in this case we
have Tinyville we have Giggleshire
we have Metropolis we
have Snickerburg
we have Jumboville and ChatGPT was the
one that gave me those names by the way in
case you're
wondering we have Modico we have Largeland
these are the kind of
things I use ChatGPT for you know um
it's great at coming up with
ridiculous
stuff Pintsizeburg
we have
Colossal that's one
L Colossal Cove so this is I believe 10 of
them is it oh nine I'm missing
one which one are we missing Giggleshire Metropolis
Snickerburg
Jumboville Modico Largeland Pintsizeburg let's see
1 2 3 4 five six
7 8 nine oh Smallington
I missed that
one where was I right typing this just
now this is annoying
ah my computer is acting up now all
right
great let's
see but the life of me I can't figure
out where I was just typing this thing
now let's undo one change and
see I think it's right
here don't I have this name somewhere
okay all righty
um I can't believe
this okay um well let me just R type it
but this is really annoying let me see
if I can find it if not I'll just type
it
again all right I will just type it
again all right so
Smallington Tinyville
we
have Giggleshire
Metropolis
oops next time I should type the names
in
that um homework
assignment but I always want you guys to
suffer now I'm suffering too
um
Jumboville
Modico
Largeland Pintsize
burg so this is the first thing when
you're doing simple random
sampling you have to first of all write
out all the names of the members you
create a sampling frame that's what we call
it
right and was
last
that's okay 10 of them great so create a
sampling
frame so this is our sampling
frame for the populations of Interest
and then we now have to select at
random you know any five of those
right um you can do that with
any app of course you can
use the K platform
to do that um in which case we can
call
them counties
let's save
this so we have the list of
10 counties there so we can then we're
supposed to select five of those at
random in which which case we have let's
go to sampling let's
resample by uploading our
new and we
want simple random sampling of course we
want just five of
them let's get a
sample so these are the five ones that
have been selected at random right
Smallington Snickerburg Jumboville Modico and
Largeland all selected at random so the
question the the task said we should
do a where's the
task a single stage cluster sampling
with simple random sampling of five
clusters so what we just did is the
first part select five clusters at
random with simple random sampling and
then of those five do a single stage
cluster that means that in those five we
are going to select all members and
include them in our
survey right so these are the ones that
have been included so
let's now do the single stage
cluster sampling let's go to the
ones we selected Smallington so we take
everybody in
Smallington so this is the metabolite X
value everybody we take that we include
it here
right that is for Smallington next we
have to do the same thing for the other
ones so next in line is Snickerburg
so we go to
Snickerburg and we select
everybody so this is the value of
the metabolite
X for Snickerburg everybody there we grabbed
all their values we selected everybody
in that
community and we add the value to our
database the next place was Jumboville
again single stage cluster
sampling Jumboville we select everybody in
that
community right and that's why it's not
a stage because everybody's been
included and so we add that to our
database so if this were real life
you are going there and you're surveying
every single person you see then
the next is Modico we go to Modico which
was selected at random and again we
draw everybody
there and we add that to our
database and lastly we go to
Largeland where's Largeland
this and we select
everybody and again we add that to
database so what we have just done is a
single stage cluster
sampling we first selected clusters at
random five of them out of 10 and then
we took everybody we selected everybody
within the selected cluster so now let's
take the value of the average value of
metabolite X
is 33.7
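To see why the homework expects cluster sampling to vary more than SRS, the two procedures can be repeated many times on a toy population. This is a sketch under loud assumptions: the frame is synthetic (10 equal clusters of 200 people whose average metabolite X level differs from cluster to cluster), and the names `frame`, `single_stage_cluster_mean` and `srs_mean` are made up for illustration.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical frame: 10 clusters of 200 people each. Cluster averages are
# deliberately different (20, 22, ..., 38) so that clusters do not resemble
# one another perfectly, as happens with real-world clusters.
cluster_means = [20 + 2 * i for i in range(10)]
frame = {c: [rng.gauss(mu, 2) for _ in range(200)]
         for c, mu in enumerate(cluster_means)}
everyone = [x for values in frame.values() for x in values]

def single_stage_cluster_mean(n_clusters=5):
    """Select clusters by simple random sampling, then take everybody in them."""
    chosen = rng.sample(sorted(frame), n_clusters)   # the only sampling stage
    values = [x for c in chosen for x in frame[c]]   # census inside each cluster
    return statistics.mean(values)

def srs_mean(n=1000):
    """Simple random sample of the same total size, ignoring clusters."""
    return statistics.mean(rng.sample(everyone, n))

# Repeat each design 200 times and compare how much the estimates move around.
cluster_var = statistics.variance([single_stage_cluster_mean() for _ in range(200)])
srs_var = statistics.variance([srs_mean() for _ in range(200)])

print(f"variance of cluster-sample means: {cluster_var:.3f}")
print(f"variance of SRS means:            {srs_var:.3f}")
```

Both designs interview 1,000 people each time, yet the cluster-sample means spread out far more, because which five clusters happen to be drawn matters a great deal.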
so we go to our results here we type
33.7 so this is a single stage cluster
sampling with simple random selection then the
other one was single stage cluster
sampling with systematic selection
of five
clusters so systematic selection of five
clusters um what can we so in this case
we still have the five the counties here
but now remember for systematic
selection we must sort on something what
do you think makes the most sense for us
to sort
on we have to select five clusters here
right what makes the most sense for us
to sort those clusters
on what do you
think we can't do systematic sampling if
we don't have something to sort on
so what are we going to sort them
on well we can sort on the population
sizes right so we get a flavor of
everybody in the
community the hypothesis might be that
you know the outbreak of vibranium fever
is related to the density of people
in that area so that in populations that are more
dense the outbreak spreads faster and so
it makes sense that we get a flavor of
you know different populations along that
spectrum that is our theory that's our
argument for again sampling is all about
making a decision and justifying why
that makes sense right so
you have to
provide a defense for why you have
sorted by this variable and not that one
for example so let us now list the
populations for the different ones so
Smallington has a total number of
516
people so
516 Tinyville has a total
number
of let's
see Tinyville has 278
people Giggleshire has a total number of
116 Metropolis has a total
of
2,891 Snickerburg
has
122 Jumboville has
2,831 Modico
has
4,219 Largeland
has
9,547 including
the the header like yeah yeah I'm aware
of that it's just like this this is just
a mock practice but you know you will do
better when you're doing your own work
you know the idea is that if it's non
differential that's fine but yeah um
2499 but you should definitely be more
precise when you're doing your actual
work and then lastly Colossal A why do I
keep calling it Colossal A Colossal Cove has
3,944 so this is
this is the population we have
for our systematic sampling so let's
save this counties and pop
size file
Savers oh let's just let's just save and
replace actually just
for so let's just save and replace and
now let's let's say we have to select
now what does the question say
five clusters
still so we go back to the app and this
time around we do a systematic
sampling let's
resample but now okay let's we have to
upload a new data
frame list of
10 that's uploaded now we have to change
it to systematic
sampling and
select the Sorting variable is
population size
one select five we get
sample now these are the ones that have
been selected so you can see um with
systematic sampling the whole idea behind
systematic sampling is that we want to get
a flavor of everything from the smallest
to the biggest so it sorts everything by
the size and then we now have to do that
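The sort-then-skip logic of systematic selection can be sketched as follows. The population sizes are the ones read out in class (counted loosely, header row included in places), the spellings of the county names are guesses from the captions, and attaching 2,499 to the ninth county is an inference. The function `systematic_sample` is an illustrative helper, not the platform's code, and it assumes the list length is a multiple of the number wanted.

```python
import random

# County sizes as read out in class; name spellings are assumed.
population = {
    "Smallington": 516, "Tinyville": 278, "Giggleshire": 116,
    "Metropolis": 2891, "Snickerburg": 122, "Jumboville": 2831,
    "Modico": 4219, "Largeland": 9547, "Pintsizeburg": 2499,
    "Colossal Cove": 3944,
}

def systematic_sample(items, n, start=None):
    """Take every k-th item from an already sorted list, k = len(items) // n.

    start is the random starting position within the first interval;
    pass it explicitly to make the selection reproducible.
    """
    k = len(items) // n
    if start is None:
        start = random.randrange(k)
    return items[start::k][:n]

# Sort the counties from smallest to biggest, then step through the list.
ordered = sorted(population, key=population.get)
picked = systematic_sample(ordered, 5, start=1)
print(picked)
```

With 10 counties and 5 to select, the interval is 2, so only two different samples are possible. A start of 1 gives Snickerburg, Smallington, Jumboville, Colossal Cove and Largeland, spread from the small end of the list to the big end.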
so the question was
to select five of them and
after selecting the five do a single
stage cluster sampling right so again
this time around too we're doing single
stage the only difference is that the
first time we did simple random sampling
of the five clusters now we're doing
a single stage but with
systematic selection in that
single stage so let us do that um
so the selected ones
were so I want to see how this might
influence the results now so
Smallington so everybody in Smallington
we'll select their
results then next
was
Snickerburg so again everyone in Snickerburg we'll
select their results
then next
Jumboville select everybody there
you see how this homework was
just easy
right Colossal
Cove select everybody
there
then Largeland
where is
Largeland select everybody
there all right now our task is to
take the average of this
so the result is
36.32 so um between
these two for example just
looking at them which one
has more variance in the
results from what you can
see I mean it's not very dramatic but
which one has slightly more
variance
sorry the answer cluster sampling yeah
cluster exactly yeah that had more
variance so the next one was for you
to do a
two stage probability
proportional to size cluster sampling so
that one we're simply asking you
to
um we're simply saying you where is our
data all right um
where's my spreadsheet okay right here
so for you to use a
two-stage PPS sampling this is how
you have to prepare your data set you
have to have the counties in
one column and the
population size in
another column so the way this
is set up is perfect for us to use PPS sampling
right so now let us just say just to
keep things fast we're going to select
three of those clusters so let's go back
to our um to the platform and
resample let me move this things out of
the way so we resample but this time
we're not using systematic sampling
we're using a two-stage PPS sampling
right the measure of size is population
size um
there are 10
clusters let's say we want to select
three of those right and from those
three we want to select a total of let's
say um 2,000 people right so now the
platform tells us that okay we have to
go to Jumboville we have to go to Modico
we have to go to Colossal Cove and we have
to sample
2,001 people from those three
places right so we sample 667
persons from Jumboville 667 people from
Modico and 667 from Colossal Cove right
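A two-stage PPS plan of this kind can be sketched in a few lines. The exact selection algorithm behind the platform is not described in the class, so the first stage here uses successive weighted draws without replacement, which is one simple way to approximate probability proportional to size and is an assumption of this sketch. The county sizes come from the class, the name spellings are guesses from the captions, and `pps_select` is a hypothetical helper.

```python
import math
import random

# County sizes as read out in class; name spellings are assumed.
sizes = {
    "Smallington": 516, "Tinyville": 278, "Giggleshire": 116,
    "Metropolis": 2891, "Snickerburg": 122, "Jumboville": 2831,
    "Modico": 4219, "Largeland": 9547, "Pintsizeburg": 2499,
    "Colossal Cove": 3944,
}

def pps_select(sizes, n_clusters, rng):
    """Stage 1: draw clusters one at a time, each draw weighted by
    population size, removing a cluster once it has been chosen."""
    remaining = dict(sizes)
    chosen = []
    for _ in range(n_clusters):
        names = list(remaining)
        weights = [remaining[name] for name in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]
    return chosen

rng = random.Random(42)
chosen = pps_select(sizes, 3, rng)

# Stage 2: the same fixed number of people from every selected cluster,
# rounded up so the target of 2,000 is reached.
per_cluster = math.ceil(2000 / len(chosen))
total = per_cluster * len(chosen)

print(chosen)
print(f"{per_cluster} people per cluster, {total} in total")
```

Rounding 2,000 / 3 up gives 667 per cluster, which is why the plan ends up with 2,001 people and not exactly 2,000. Bigger counties are more likely to be drawn in stage one, and the fixed take per county in stage two balances that out for the individuals.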
so that's what we have to do now we will
have to go and sample from those from
those Three Counties or those three
clusters the numberers specified here so
let's do jumbo Ville first of
all so jumbo Ville is
here
let's put its result here and the reason why
I make this homework iterative is that I
need you to do it over and over all
right this is the only way you can
really understand these things
when you do one thing many many
times so that's why the homework is
designed to be iterative so you
practice it over and over and over so
let's call this
Jumboville so we're supposed to select um
667 people from Jumboville let's also select
um
the next one which is
modico
Modico so this is exactly how
this is done in the field just that instead
of playing with mock data sets we are
actually doing it with you know um
actual surveys and you know it's a bit
more complicated but it's the exact same
principle um so if you can figure
it out then at least you understand
it in principle to be able to do it so
save as that's save
modal and lastly let's save colossal
Cove so you see why I was disappointed
that nobody got my
money see
yeah give very simple
homework all right so remember we have
to sample 667 don't forget that number
667 why do I put that there okay all
right now let's let's sample the first
one which
was jumbo bill so long please you're
going to remind me of this three as I'm
going this screen is going to exit jumbo
modle and colossal and
667 so let's resample so we select our
first one colossal
Cove we need to sample with simple let's
do simple random sample we need to
select 667
people of that population
these are the people that have been
selected um so we would let's download
the
data sampling
8 and let's open
it so I'm going to
just copy jumbo View's
results and put it some
somewhere or let me just put it there in
highlight next we
have
um let's resample
again this time we need
Modo and again with the simple random
sample 667 people
all right that's those are the people
there let's download
that all right this is modore let's
and then
lastly let's
resample and let's get the third
population jumbo
bu 667 people again let's get that let's
download this
um so now we've done
our two stage PPS sample
which means in the first stage we
select clusters with probability
proportional to size the second stage we
took a fixed number of people from each
cluster 667 people we have our combined
population now we can
take the average levels of metabolite X
here so let's say
average
of this
everything and that is
32.4 and then the last question was
saying you should take a stratified two
stage cluster sampling where you sample
chiefdoms in the first stage and
hospitals in the second stage so there
are two things we have to do now we have
to divide into strata we already have
the strata here in our spreadsheet which
are the yellow guys and the green guys
those are the two strata so first of all
we have to draw a simple random sample
right so for the sake of time
I'm just going to
select one cluster in each of the strata
so we
have clusters one
two three and four we want to select one
cluster at random here and we will select one
cluster at random here so that is the
stratification then it says the
two-stage sampling select clusters and
then within the Clusters we select
hospitals remember that people went to
the hospitals so each hospital is also a
cluster first stage we select clusters
second stage you select hospitals and
then in
each selected hospital we
select everybody
so first stage is we selected the
Clusters in the second stage we selected
the hospitals within each cluster so if
we have
um let's take a random number between
one and
five or one and four actually it's four so
if we're selecting at random from
one two three and four that is Giggleshire
so in the yellow ones in the
yellow stratum we selected Giggleshire
then in the green ones so
for
orange equals Giggleshire
for the green ones we have six of
them so let's select one at random
between one and
six that is four so here one two three
four um green we've selected Largeland
this is the first stage of
selection first stage we've selected the
clusters now in the second stage we now
have to go and select hospitals
so we'll go to Giggleshire and we'll see
okay what are the
um hospitals there right how many are
there and you know how do we select them
so let's first of all sort all of these
people by the
hospital that's column
d by
hospital right so we now have all
these hospitals here we can now select
at random or however we want to maybe there
are six hospitals in
Giggleshire okay of these six
how many can we go to say okay let's
select a random sample of three we
select three hospitals and in those
three we select everybody in those
selected ones all right and then for the
selected cluster in the green stratum
we do that same selection we go to
select hospitals and within those
hospitals we select everybody in
the hospitals that were selected and
that's how we draw a sample so
that's your homework right you're supposed
to complete that last part but the point
we're trying to make here is that
results from cluster sampling as you can
see are associated with more variance right
that's why their
confidence intervals will
also be wider compared to simple random
sampling of course all right um
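The stratified two-stage selection left as homework can be laid out in code. Everything in this sketch is a stand-in: the two strata mirror the yellow group of four clusters and the green group of six, but the cluster labels, the six hospitals per cluster, the 50 patients per hospital and the metabolite X values are all invented, and `stratified_two_stage` is a hypothetical helper.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical layout: a yellow stratum with 4 clusters and a green one with 6.
strata = {
    "yellow": [f"Y{i}" for i in range(1, 5)],
    "green": [f"G{i}" for i in range(1, 7)],
}

# Each cluster has 6 hospitals, each with 50 synthetic metabolite X values.
hospitals = {}
for clusters in strata.values():
    for cluster in clusters:
        hospitals[cluster] = {
            f"{cluster}-H{h}": [rng.gauss(34, 5) for _ in range(50)]
            for h in range(1, 7)
        }

def stratified_two_stage(n_clusters=1, n_hospitals=3):
    """Within every stratum: stage 1 samples clusters, stage 2 samples
    hospitals inside each chosen cluster, then everybody in them is taken."""
    picked, sample = {}, []
    for stratum, clusters in strata.items():
        for cluster in rng.sample(clusters, n_clusters):                   # stage 1
            chosen = rng.sample(sorted(hospitals[cluster]), n_hospitals)   # stage 2
            picked[stratum] = (cluster, chosen)
            for hospital in chosen:
                sample.extend(hospitals[cluster][hospital])  # census, not a stage
    return picked, sample

picked, sample = stratified_two_stage()
for stratum, (cluster, chosen) in picked.items():
    print(stratum, cluster, chosen)
print(f"people selected: {len(sample)}, mean: {statistics.mean(sample):.2f}")
```

Every stratum contributes to the sample, which is what makes it stratified, while only some clusters and some hospitals are visited, which is what makes it two-stage cluster sampling.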
moving on to question so yeah you you
guys will complete the remaining ones by
yourself I think you know we've made the point clear
point clear
enough all right second question was
very easy based on the fantastic job you
did in the Kingdom of Zamunda you were hired
as a consultant to conduct a national
survey for the Kingdom of Wakanda right
they want to stratify by these four
variables how many strata will you
have that was a very easy
question how many
what explain your
answer. You're absolutely right, by the
way. Okay, so what I, hey, don't mind my
background, let me shoot a bit. So what I
actually did was I checked, um, if you
look at gender, you see that you have
male and
female. For race we have three
groups, like majority Black and non-
Black. Then what I did was just multiply
out the different groups we have. Exactly.
So what did you get? 24? Yeah,
I got 24. Yeah, that's the correct
way of doing it. So that means that each
population will have
male, majority Black, lowlands, who are in
the low population density; male, majority
non-Black, lowlands; like that. You are going
to have 24 distinct groups. So you can
see why, if you are doing a survey and
you're talking to clients or whoever
that is, and they're telling you, oh, we want
to stratify by this variable and that
variable, it's like, do you really understand
the implications of what you're
saying? The costs are enormous, right? So
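[Editor's sketch: the "multiply out the groups" step from the answer above. Only the two gender groups, the three race groups, and the total of 24 come from the discussion; the race labels are placeholders, and the two-level splits for the remaining two variables are an assumption chosen so the product comes to 24.]

```python
from itertools import product

levels = {
    "gender": ["male", "female"],
    "race": ["group_1", "group_2", "group_3"],  # three groups; labels are placeholders
    "region": ["lowlands", "other"],            # assumed to have 2 levels
    "density": ["low", "high"],                 # assumed to have 2 levels
}

# Every combination of one level per variable is its own stratum.
strata = list(product(*levels.values()))
n_strata = len(strata)  # 2 * 3 * 2 * 2 = 24
```

[Each added stratifying variable multiplies the number of strata by its number of levels, which is why the count grows so quickly.]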
let's look at the sample size
calculation for this. Let's go
back to the app and look at what it will
take. This is what you have to let your
clients know, right? Like doing a large
national
survey. The kind of surveys I conduct are
typically
international, and yeah, clients have
all these expectations, especially
multinational companies, like, yeah, we
want to have, in each country we want to
have, and it's like, that is impossible. I know
your company has a lot of money, but
not even you have that kind of money.
So let's come back to earth, okay? Let's
talk about things that are possible. So
let's look at the sample size for that
kind of study. In that case, go to
sample size, select cross-sectional
survey with percentage as outcome. Let's
just keep all of this fixed, and
we say cluster sampling, yes, it's
going to be performed. Let's say a
design effect of two, and you have
24
strata, right? And let's say you have a
half-and-half, you know, 50% response rate. How
many people will you need? You need
36,879 people to be enrolled in my study
with the number of strata at
24. Now think of, okay, what does that mean,
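[Editor's sketch: the app's formula is not shown in the class, so this is only one standard calculation that happens to land on the same figure. It assumes the settings left "fixed" were an expected proportion of 50%, a 5% margin of error, and 95% confidence (z = 1.96); the function and parameter names are hypothetical.]

```python
def survey_sample_size(p=0.5, margin=0.05, z=1.96,
                       deff=2.0, n_strata=24, response_rate=0.5):
    # Sample size for one simple random sample estimating a proportion.
    base = z**2 * p * (1 - p) / margin**2
    # Inflate for clustering, repeat per stratum, then adjust for nonresponse.
    return base * deff * n_strata / response_rate

n_required = round(survey_sample_size())  # about 36,879 under these assumptions
```

[Under these assumptions the per-stratum requirement of about 384 is doubled by the design effect, multiplied by 24 strata, and doubled again for the 50% response rate.]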
right? Let's say you're doing a study and
you're
providing
incentives in your study. Typically, let's
say you have $30 of incentives for each
participant. That is, from incentives
alone, that has got to be this times
30. And then you have the recruitment fee.
In general, for most large
clinical studies, you're talking about
a recruitment fee, depending on
the country, obviously, and depending on
how rare it is to find the
participant. Recruitment fees could be up
to like $120 per participant if it's for,
for example, a rare disorder. So you're
talking about this times 120,
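[Editor's sketch: the arithmetic for the two per-participant costs just mentioned, using the 36,879 participants from the sample size step. Variable names are illustrative.]

```python
n_participants = 36_879
incentive_per_person = 30      # dollars
recruitment_per_person = 120   # dollars

incentive_total = n_participants * incentive_per_person      # 1,106,370
recruitment_total = n_participants * recruitment_per_person  # 4,425,480
total = incentive_total + recruitment_total                  # 5,531,850
```

[That is roughly $1.1 million in incentives plus $4.4 million in recruitment fees, before any other study costs.]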
right? So just, you know, we haven't talked
about any of those other things yet. Just in
terms of administering the survey,
finding the participants and
administering the survey, we are already
talking about a total
of, you know, this is half a million
dollars, right, one, two, three, one, two, this
is $5.5 million already, right? And we've
not talked about anything yet. We've not
talked about, you know, major costs
associated with writing the
protocol and statistical analysis plan
and fielding the study, in terms of
a whole bunch of other logistical costs.
This study is easily going to run into
$10 million, easily, right?
And that does not even include
the cost of publication, the cost of, all
right, after we've gotten the data, how
are we going to generate results from
the study, publish them, and all of that.
You could easily run into tens of
millions of dollars with just that kind
of study. So when
people tell you they want to stratify by this
and that and stuff, I can assure you
that that's not going to
happen, because wait till I give you the
bill for the
study. All right, that's the end of our
class. Any
questions? All right,
no? Okay, well, so you can
try and redo the homework by
yourself is