Introduction to Regression Analysis

Name: Classroom: Regression Analysis Bootcamp (Part 1)_1 of 5
Uploaded: 2026-03-08T15:54:59.972215+00:00
Channel: Chisquares
YouTube video ID: SQmb5OLq6BU
Source: YouTube video by Chisquares
Regression analysis is presented as a tool that boosts scientific productivity by encouraging researchers to “step backwards” and view the bigger picture rather than focusing on individual data points. This backward step mirrors the literal meaning of the word “regression” and helps uncover overall patterns that would be missed when counting each element separately.
Forward or Backward?

A series of pop‑quiz examples—counting flower petals, tallying fruit in a basket, viewing the Earth from space, and observing a forest—illustrate the choice between moving forward (counting each item) and moving backward (understanding the larger phenomenon). The decision depends on the research objective: detailed enumeration versus holistic insight.
The Elephant Analogy

When observers stand too close to an elephant, they see only a fragment and draw incomplete conclusions. Stepping back provides a full view of the animal, just as stepping back in regression reveals the overall relationship among variables.
Core Components of Regression

Regression models consist of two essential parts: the outcome (dependent variable, left‑hand side) and the predictor (independent variable, right‑hand side). These components are the building blocks for any regression analysis.
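As a rough illustration of the outcome/predictor split, the sketch below fits a straight line by ordinary least squares using only the Python standard library. The data and the names `fit_line`, `predictor`, and `outcome` are invented for this example and do not come from the lecture.

```python
def fit_line(predictor, outcome):
    """Ordinary least squares for outcome = intercept + slope * predictor."""
    n = len(predictor)
    mean_x = sum(predictor) / n
    mean_y = sum(outcome) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictor, outcome))
    sxx = sum((x - mean_x) ** 2 for x in predictor)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: the outcome sits on the left-hand side of the equation,
# the predictor on the right-hand side.
predictor = [1, 2, 3, 4, 5]
outcome = [3, 5, 7, 9, 11]

intercept, slope = fit_line(predictor, outcome)
print(f"outcome = {intercept:.2f} + {slope:.2f} * predictor")
# prints "outcome = 1.00 + 2.00 * predictor"
```

The fitted line summarizes all five points with two numbers, which is the "step back" the lecture describes: the individual points are set aside in favor of the overall pattern.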
Exploratory vs. Confirmatory Analysis

Two research approaches are distinguished.
Exploratory analysis is likened to a “fishing expedition,” where researchers probe data without a pre‑specified hypothesis.
Confirmatory analysis resembles aiming a telescope at a known star to test a specific hypothesis about its brightness.
A second pop‑quiz classifies research questions—such as factors influencing customer satisfaction or drug dosage effects—into exploratory or confirmatory categories based on their specificity.
Models as Simplified Representations

Regression produces models that simplify reality. Like a picture of an apple that cannot be eaten but still conveys essential features, a regression model helps explain and predict phenomena when direct data are unavailable. The classic maxim “All models are wrong, but some are useful” underscores this point.
Foundational Statistical Concepts

Key concepts supporting regression include hypothesis testing, P‑values, Type I (false positive) and Type II (false negative) errors, bias, validity, statistical power, and sample size. Understanding these ideas is crucial for correctly interpreting regression results.
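To make Type I error, Type II error, and power concrete, here is a small simulation sketch. It assumes a simplified setting that is not taken from the lecture: two equal groups, normally distributed outcomes with a known standard deviation, and a two-sided z-test. The function names and the effect size of 0.5 are illustrative choices.

```python
import random
from statistics import NormalDist, mean

def rejects_null(n, true_diff, alpha=0.05, sd=1.0):
    """Simulate one two-group study and run a two-sided z-test (sd assumed known)."""
    group_a = [random.gauss(0.0, sd) for _ in range(n)]
    group_b = [random.gauss(true_diff, sd) for _ in range(n)]
    standard_error = sd * (2 / n) ** 0.5
    z = (mean(group_b) - mean(group_a)) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha

def rejection_rate(n, true_diff, sims=2000):
    """Share of simulated studies that reject the null hypothesis."""
    return sum(rejects_null(n, true_diff) for _ in range(sims)) / sims

random.seed(1)
type1_rate = rejection_rate(n=20, true_diff=0.0)    # false positives: no real effect
power_small = rejection_rate(n=20, true_diff=0.5)   # power with 20 per group
power_large = rejection_rate(n=100, true_diff=0.5)  # power with 100 per group
type2_small = 1 - power_small                       # false negatives with 20 per group

print(f"Type I error rate (no effect):     {type1_rate:.3f}")
print(f"Power, n=20 per group:             {power_small:.3f}")
print(f"Type II error rate, n=20 per group: {type2_small:.3f}")
print(f"Power, n=100 per group:            {power_large:.3f}")
```

With no true effect, the rejection rate should sit near the chosen alpha; with a true effect, the larger sample should miss it far less often than the smaller one.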
Truth, Chance, and Bias

Observed outcomes can be attributed to three forces:
Truth (validity) – the actual effect researchers aim to uncover.
Chance – random error that introduces variability.
Bias – systematic error that consistently skews results.
Common biases include confounding, selection, and measurement bias. Regression analysis seeks to isolate truth by controlling for chance (through P‑values) and bias (through proper design and adjustment).
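The difference between chance and bias can be sketched with a toy measurement model. Everything here is hypothetical: `TRUTH`, `BIAS`, and the noise level are made-up values chosen only to show that random error shrinks as the sample grows while systematic error does not.

```python
import random
from statistics import mean

random.seed(42)

TRUTH = 10.0   # hypothetical true value
BIAS = 1.5     # hypothetical systematic error added to every measurement

def measure(n, bias=0.0, noise_sd=2.0):
    """Return n measurements, each = truth + systematic bias + random error."""
    return [TRUTH + bias + random.gauss(0.0, noise_sd) for _ in range(n)]

small_unbiased = mean(measure(10))
large_unbiased = mean(measure(10_000))
large_biased = mean(measure(10_000, bias=BIAS))

print(f"n=10, no bias:      {small_unbiased:.2f}")
print(f"n=10000, no bias:   {large_unbiased:.2f}")
print(f"n=10000, with bias: {large_biased:.2f}")
```

A larger sample pulls the unbiased average toward the truth, but the biased average settles around truth plus bias no matter how many measurements are taken. That is why bias has to be handled through design and adjustment rather than sample size.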
Distinguishing Univariate, Bivariate, Multivariable, and Multivariate Analyses

Univariate – analysis of a single variable.
Bivariate – analysis of two variables and their relationship.
Multivariable – one dependent variable with two or more independent predictors.
Multivariate – two or more dependent variables.
The lecture notes that “multivariate” is frequently misused when “multivariable” is intended.
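One way to keep the four terms apart is to count dependent and independent variables. The helper below is only a mnemonic sketch; `classify_analysis` and the example analyses are invented for illustration, and it treats "univariate" as one variable with nothing to compare it against.

```python
def classify_analysis(n_dependent, n_independent):
    """Name an analysis from its number of dependent and independent variables."""
    if n_dependent < 1 or n_independent < 0:
        raise ValueError("need at least one dependent variable")
    if n_dependent >= 2:
        return "multivariate"      # two or more outcomes
    if n_independent == 0:
        return "univariate"        # a single variable on its own
    if n_independent == 1:
        return "bivariate"         # one outcome, one predictor
    return "multivariable"         # one outcome, several predictors

# Hypothetical analyses: (number of dependent, number of independent variables)
examples = {
    "mean age overall": (1, 0),
    "mean age by group": (1, 1),
    "blood pressure ~ age + sex + smoking": (1, 3),
    "(systolic, diastolic) ~ age + sex": (2, 2),
}
for description, counts in examples.items():
    print(f"{description}: {classify_analysis(*counts)}")
```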
Adjusting for Variables

An analogy of passengers on an airplane illustrates adjustment: to study one passenger’s movement, the others are strapped in (held constant). In regression, this translates to holding other predictors constant—often using reference groups—to isolate the effect of the variable of interest.
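The "strapped-in passengers" idea can be sketched with simulated data in which a third variable drives both the predictor and the outcome. The variable names, the true effect of 1.0, and the data-generating model are all assumptions made for this example; the two-predictor fit uses the textbook normal equations rather than any particular library.

```python
import random

def cov(a, b):
    """Population covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def crude_slope(x, y):
    """Slope of y on x alone (no adjustment)."""
    return cov(x, y) / cov(x, x)

def adjusted_slope(x, z, y):
    """Coefficient of x in y = b0 + bx*x + bz*z, i.e. holding z constant."""
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    sxy, szy = cov(x, y), cov(z, y)
    return (szz * sxy - sxz * szy) / (sxx * szz - sxz ** 2)

random.seed(7)
n = 5000
TRUE_EFFECT = 1.0  # hypothetical true effect of x on y

z = [random.gauss(0, 1) for _ in range(n)]             # confounder
x = [zi + random.gauss(0, 1) for zi in z]              # exposure, driven partly by z
y = [TRUE_EFFECT * xi + 2.0 * zi + random.gauss(0, 1)  # outcome, driven by x and z
     for xi, zi in zip(x, z)]

crude = crude_slope(x, y)
adjusted = adjusted_slope(x, z, y)
print(f"crude slope:    {crude:.2f}")
print(f"adjusted slope: {adjusted:.2f}")
```

Under this setup the crude slope mixes the effect of `x` with the effect of `z`, while the adjusted slope should land close to `TRUE_EFFECT` because `z` is held constant in the model.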
Sample Size, Power, and Precision

Adequate sample size is essential for statistical power—the ability to detect true differences. Small samples increase the risk of Type II errors. Sample‑size calculations differ for exploratory versus confirmatory studies.
The K‑Quest platform is highlighted as a tool for calculating sample sizes, requiring inputs such as outcome prevalence, effect size, confidence level, desired power, control‑to‑case ratio, and anticipated response rate. Example calculations include:
Confirmatory study: 393 participants per group (total 786).
Case‑control study: 56 cases and 221 controls for a 1:4 ratio.
Cohort studies depend on outcome prevalence, competing risks, and follow‑up duration. Structural Equation Modeling (SEM) follows standard sample‑size methods but places extra emphasis on measurement error and bias.
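To show how inputs such as effect size, confidence level, power, and response rate feed into a sample-size figure, here is a sketch based on a common normal-approximation formula for comparing two proportions. It is not the K‑Quest algorithm and is not meant to reproduce the numbers quoted above; the function names and the example inputs (50% vs. 40%, 80% response) are assumptions.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per group for comparing two proportions
    (two-sided test, equal group sizes, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

def inflate_for_response(n, response_rate):
    """Number of people to invite so that about n actually respond."""
    return ceil(n / response_rate)

n = n_per_group(p1=0.50, p2=0.40)
invited = inflate_for_response(n, response_rate=0.80)
n_high_power = n_per_group(p1=0.50, p2=0.40, power=0.90)

print(f"per group at 80% power: {n}")
print(f"to invite per group at 80% response: {invited}")
print(f"per group at 90% power: {n_high_power}")
```

Raising the desired power or shrinking the difference between `p1` and `p2` increases the required sample, which matches the lecture's point that small samples raise the risk of Type II errors.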
Types of Bias in Research

Bias is explored in depth:
Information bias – misclassification (differential or non‑differential).
Confounding bias – involving colliders and causal pathways.
Measurement bias arises from how variables are measured, including questionnaire design and participant‑investigator interactions. Social desirability bias reflects participants tailoring responses to please the researcher.
Research Designs and Advanced Techniques

Various designs are discussed: surveys, clinical trials, time‑series data, and joinpoint regression for detecting trend changes. Repeated surveys collect independent samples over time, while longitudinal surveys follow the same individuals. Joinpoint regression is suited for population‑level time‑varying data.
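The trend-change technique mentioned above is usually called joinpoint regression. Its core idea can be sketched as a grid search for the single point where the slope of a trend changes. This is a deliberately simplified illustration with made-up data: it assumes exactly one joinpoint at an integer year and picks it by smallest squared error, whereas dedicated joinpoint software does considerably more, such as choosing how many joinpoints to use.

```python
def fit_two_predictors(x1, x2, y):
    """OLS for y = b0 + b1*x1 + b2*x2; returns (b0, b1, b2, sse)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    sse = sum((c - (b0 + b1 * a + b2 * b)) ** 2 for a, b, c in zip(x1, x2, y))
    return b0, b1, b2, sse

def best_single_joinpoint(years, rates, candidates):
    """Grid-search one joinpoint: the candidate with the smallest SSE wins."""
    best = None
    for k in candidates:
        hinge = [max(t - k, 0) for t in years]  # 0 up to k, rises after k
        b0, b1, b2, sse = fit_two_predictors(years, hinge, rates)
        if best is None or sse < best["sse"]:
            best = {"joinpoint": k, "slope_before": b1,
                    "slope_after": b1 + b2, "sse": sse}
    return best

# Hypothetical series: rises by 2 per year until year 5, then falls by 1 per year.
years = list(range(11))
rates = [1 + 2 * t if t <= 5 else 11 - (t - 5) for t in years]

result = best_single_joinpoint(years, rates, candidates=range(1, 10))
print(result["joinpoint"], round(result["slope_before"], 2), round(result["slope_after"], 2))
```

The hinge term keeps the two line segments connected at the joinpoint, so the fit describes one continuous trend whose slope changes rather than two unrelated lines.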
Key Distinctions

Precision vs. Power – precision concerns the width of confidence intervals; power concerns the ability to detect true effects.
Precision vs. Validity – a precise estimate can still be invalid if it lacks external generalizability.
Exploratory vs. Confirmatory – exploratory work generates hypotheses and emphasizes precise estimates; confirmatory work tests predefined hypotheses and emphasizes power.
Repeated vs. Longitudinal Surveys – repeated surveys use new samples each wave; longitudinal surveys track the same participants.
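The precision side of these distinctions can be made concrete by looking at how confidence-interval width shrinks with sample size. The sketch below uses the simple Wald interval for a proportion; the prevalence of 30% and the sample sizes are arbitrary example values, and `ci_width` is a name chosen for this illustration.

```python
from math import sqrt
from statistics import NormalDist

def ci_width(p, n, confidence=0.95):
    """Width of a Wald confidence interval for a proportion p with sample size n."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return 2 * z * sqrt(p * (1 - p) / n)

widths = {n: ci_width(0.30, n) for n in (100, 400, 1600)}
for n, width in widths.items():
    print(f"n={n}: 95% CI width = {width:.3f}")
```

Because the width scales with one over the square root of the sample size, quadrupling the sample only halves the interval. Precision is therefore a statement about interval width, separate from power, which is about detecting a true effect.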
Takeaways

Regression analysis encourages researchers to step backwards and view the bigger picture rather than focusing on individual data points.
Exploratory analysis is a data‑driven fishing expedition, while confirmatory analysis tests a pre‑specified hypothesis with statistical power.
Models produced by regression are simplified representations of reality that are useful even though they are not perfect.
Adequate sample size and proper power calculations, such as those provided by the K‑Quest platform, are essential to avoid Type II errors.
Bias, chance, and validity are the three forces explaining observed results, and proper adjustment techniques help isolate the true effect.
Full Transcript

very likely to be to be cited it's very
likely to be to be to be read by our our
peers in the scientific community so in
a nutshell this is why you should be
concerned about regression analysis so if
you're a scientist and you want to
increase your productivity well you have
to learn how to do regression analysis
and so this boot camp is designed to
give you the skills to be able to do it
and interpret it very
well so the whole of our lecture today
is going to be framed around the concept
of we either move forward or we
move backward and this this this is not
a mystery in most most cases in Life or
most scenarios in life we sometimes have
to take a step forward or we have to
take steps backward and if we understand
this concept that means we understand
what regression analysis is so in the
next few screens I'm going to show you
some scenarios and I we're going to do a
pop quiz where you will have to tell me
whether what we need to do at that point
is to move forward or to move backward
so let's let's start
so here your assignment is to count the
number of flower petals on the ground do
you move forward or do you move
backward what are your
thoughts let me see can I even see what
people are typing
um no I can't um look any comments in
the
chat you have to be my eyes
now there are flower petals on the
ground you have to do you have do you
move forward or do you move
forward backward forward okay who said
forward we need one person who said
forward like why why why do you move
forward or the person who said backward
Why move
backward and it why okay let me
good evening hello am I audible yes you
are yes you are okay so I I think the
answer is backward because you you're
trying to you are trying to make a
comment you're trying to draw a
conclusion from what is already there
what has already been you know should I
say piled up for you to come and now
observe okay thank you I don't want you
to overthink this don't don't don't over
complicate it's just it's I I promise
you no tricks it's just we want to count
okay let me move to another example all
right you want to count the number of
fruits and vegetables in the basket do
you move forward or do you move backward
it so again I promise you there are no
trick questions here it's just practical
Common Sense do you move forward or do
you move all the way back if you want to
count the number of of of fruits in the
basket so uh looking at this Prof I you
think that I have to move forward per
the first one that is showed I I'm
supposed to move backward but looking at
this one I'll start from the bigger one
then I move forward towards uh I don't
know whether it is orange or the last
one so I'll move forward okay very good
you move forward and the reason why you
move forward is because you need to
count the number of individual items I
mean think of it right if you're a mile away how
are you supposed to count these items if
you are standing a mile away you can't see them
right because your interest now is in
the individual items in the basket all
right let's move to next one now you're
on the ground right but your your
interest is to view the entire Earth not
just the ground on which you stand do
you move forward or do you move all the
way
backwards sorry
um
Jonathan uh I would probably move
forwards but only because again I'm an
English speaker so I read left to right
so I would probably go from the bottom
left corner upwards but you're on the
standing on the ground already you want
to see the whole you want to see the
picture of the whole globe right
not just the ground on which you're
standing so you do you move away from
the earth or towards the Earth you
standing on the ground already but you
want to see see the whole earth not just
the ground on which you're
standing towards the Earth Final
Approach
okay I beg to differ but let's move on
let's take the next example you want to
view the entire Forest not just the tree
by which you standing do you move away
from that tree or towards the tree again
your interest to see the whole Forest
not just the
tree you move away or do you move
towards it
I will move away my name is
Esther from Nigeria I'll move because it
will give me a better view of what I
want to get absolutely right now I'm
sure you've heard of the analogy of the
elephant right where different folks
were asked okay they were all
blindfolded and then they said okay what
are you what do you think this is they
said it's a fan it's a spear it's a
snake like who is correct and who is
wrong and if they are wrong why is there
so much what is the reason for the
systematic error
here I think it's because they are
viewing it from their own perspective
different people viewing it from their
own different are so they see what they
see looking at it from where they stand
very good from where they stand and
that's because they standing too what
too close to the object excellent so
from that means that for you to see the
whole object in its entirety what must
you do you move backwards yeah you have
to move backwards right and that is what
regression is the literal meaning of the
word regression means to step backwards
right the etymology of regression from
Latin means the act of passing backwards
the act of going returning backwards
right and so you might be wondering why
would we want to go backwards well it's
because we're not we're necess
interested in the fruits in the basket
we we're interested in the bigger
picture we're not interested in seeing
the trunk of the elephant or or the the
of elephant or the tail we want to see
the entire elephant in its entirety
right so we must we must move backwards
otherwise we won't see a big picture so
regression analysis is all about moving
backward so that we don't see just the
individual Parts but we focus on the
bigger picture so we we can see the
entire picture that is what regression
analysis is right so again these are the
answers to the pop quiz wow you've heard
of the expression someone can't see the
the forest for the trees that means you
are looking too intently at the
individual data points that you fail to
see the big picture so if you want to
see the whole Forest well you got you
got to move far away from a forest right
maybe go take an helicopter and move
like hundreds of feet above the air now
you can see the whole Forest that that's
the only way you can see the whole
Forest By ignoring the individual points
so you have to give something to get
something by moving away from a forest
you you no longer see the individual
trees but now you can see the whole
Forest
the same thing with the Earth the the
the question of okay you're standing on
the Earth but you want to see the whole
earth well how do you see that well you
have to go into space all the way into
space right then you can now see the
whole earth and of course when when into
space you won't see your house you can't
say oh that's my house on 56 you know um
Ethiopian Road right you you won't you
won't see that that level of detail
anymore so you lose the Nuance but you
gain a bigger picture right so that is
how you can see the whole Earth by moving
away so that in essence is what regression
analysis is it's all about stepping
backwards away from the data so that we
don't necessarily care about the
individual data points we don't care
about Barbara's data or you know Ilia's
data point or Esther's data point as
individuals we want to see how the whole
population fits how how the whole you
know the model behind the whole data
right the bigger picture not necessarily the
individual data points so this is in a
nutshell what regression is we are we are
staring intently at this bigger picture
trying to understand to make sense of
what everything
is so in regression analysis as the
picture implies you have two things in
regression here if we were the folks
here staring into telescope and this was
the moon then you there are always two
things in the regression analysis the
thing that we are studying the thing
being studied in this case the moon but
in our study that could be the the
outcome it could be smoking it could be
tuberculosis it could be anything so you
have that and then secondly you
have the the thing doing the studying
right so there are always those two
components to every regression analysis
the thing that is being studied which we
call the outcome and the thing doing the
studying which we call the predictor all
right the the depending on the field or
on the context those two things have
very different names too we also call
the outcome
what what what what's another name for
the outcome
variable any you can type it in the chat
or you can what what else do we call
what are other other names popular names
by which we call the outcome
variable okay is it the dependent
variable or is it the independent
variable let's put it that way now is it
dependent or
independent so we have
dependent
dependent dependent all right now in
another context especially social sciences
sometimes they are referred to as the left
hand side variable or the right hand
side variable is the outcome the left
hand side variable or the right hand
side
variable what do you think
right left hand left hand right okay it
is not it is it is a left hand because
think of it as an equation say we are
saying y equals you know x plus you
know plus z the y is the
outcome so it's on the left hand side
you say y equals so since it is the one
that we're interested in it is a left
hand side variable all right so
they say a picture is worth a
thousand words so if you were to
describe regression this is it this is a
perfect picture for what regression is
we are standing far away from a moon we
are looking into the moon to see a
picture of the shape of a moon in its
entirety we might not see the individual
pieces of rocks on the moon but we get a
big picture right we have the thing
being studied which is the outcome and
we have a thing doing the studying which
is a
predictor now in terms of the approach
to how we do regression analysis there
are also two ways in general so imagine
this this young child here says okay I'm
just going to swing my telescope to the
Milky Way galaxy and see whatever it is
I can find in this starry night right
he's more on a fishing expedition he
doesn't really know what he wants to do
per se he's just exploring and so we
call that kind of analysis exploratory
analysis so you might see you might see
things like we did an exploratory
regression analysis to explore factors
associated with X Y Z all right so that
gives some context as to why we call it
exploratory because we don't have
anything in mind a priori we just going
with a blank mind and saying well let's
just see what we find now the opposite
of an exploratory analysis will be
more of confirmatory so in this case
this young man says let me point my
telescope at a specific star to test my
hypothesis about its brightness so he
has a specific agenda in mind he's not
just coming with a blank mind and saying
oh let me just see what I can find no no
no he's coming with a very specific
agenda in mind and and that's why it's
confirmatory so it's either the
hypothesis will be proved or disproved
so any regression analysis you run
regardless of whether it's you know
logistic or linear whatever could be
exploratory or it could be confirmatory
in that
context so here is a pop quiz for you
there are five research questions here so your
job is to determine whether it is exploratory
or whether it's confirmatory the first
one is the research question is what
factors most influence customer
satisfaction in online retail
environment is that an e for exploratory
or is that a c for
confirmatory so is that e or
c e all right great does increasing
dosage for always get in my way does
increasing dosage of drug a reduce
anxiety symptoms more effectively than
the standard
dose
e well it's not e right we have a very
specific thing in mind we're looking at
drug a versus Dos dosage a versus dosage
B for this drug a right we are comparing
two doses to see how that affects the
you know outcome which is anxiety
symptoms so because we have some we're
not coming with a blank mind and saying
oh let us just see what factors are
associated with anxiety symptoms no we
came with something very specific in
mind which was how does dose a the first
dose and the second dose compare in
terms of this outcome so that is why it
is not exploratory all right the SEC the
third one how do social media usage
patterns vary across different age
groups is that an e or a
c
e okay that is correct it's
exploratory what are the emerging Trends
in remote work preferences post covid-19
is an e or a c
M
excellent C how is that a
c I would like some justification for
those of you who said c why why do you
conclude is a
c okay well okay go ahead please
yes uh because I'm not seeing any
explanatory variable
so the the okay the the question is what
trends right what are it's very
open-ended question what are the
emerging trends like Trends in what
exactly are you talking about Trends in
in X is it Trends in coming to Tor is it
times in you know it could be anything
under the sun right we don't even know
yes so it's just open-ended it's
exploratory okay do you do
you yes go ahead
I also think that when you the sentence
begins with what I'm also getting clue
there when it begins with what it means
that you are not being specific well
that gives you D it means that you are
focusing know something house just so it
me that it's
narrowing well that I like I like your
effort but that may not work in all
cases right I could say what is the
association between you know use of e
cigarettes and smoking cessation
I'm trying to measure that association
between a very specific exposure and a
very specific outcome it's
confirmatory so um so don't just rely on
whether the word starts with what try
and understand it at least
philosophically and I think that would
be a better way to go we good yes all
right last question is that an e or a c
a c c excellent all
right so what do we get from regression
analysis they give us models to explain
reality right so that's why you hear
people talk about a regression model or
a logistic regression model because that
is what we are getting from regression
analysis gives us a model so my question
this is a pop quiz what do you see
on the
screen
Apple that's
wrong that is wrong Red Apple that is
wrong
a red apple and a leaf that is very
wrong a picture of an apple sorry a
picture of an apple excellent or let's
be more scientific I I want you to frame
frame that you answer in a more
scientific way what what do you see of
an apple image an image of an apple let's not go
with image let's go with model okay it's a model
now why is it a model
why were the other answers
wrong why were thew
wrong because it's not reality but just
actually looking at it from
afar something that you can touch you
can't you can't eat you cannot eat this
apple right you cannot hand you can't
smell it so if it is not if we cannot
eat it if we cannot taste it if we
cannot hold in our hands then of what
use is it then why do we need a model if
we cannot eat it if cannot hold it if
you cannot do all of those things we
mentioned why in the world do we then
need the model of an apple who can think
of a reason why then we might need it to
help us understand what an apple is
explain probably how like give us an
idea how the reality is excellent
imagine somebody you know there are some
parts of the world where they
don't have apples right like there are
many parts of the world where they may
never have seen an apple right how then
do you communicate to this individual
what an apple
is see
that's so we use model so we use models
when there is no data right when there
no when we don't have actual real world
data then we use model if we have an
apple we don't need a model all right if
we have an actual Apple there's no need
to draw a picture of an apple right so
when we have data we don't need models
we we we need models because we need to
simulate some aspect of reality that we
don't have data for right and that is so
that's how you need to understand
regression analysis in that context that
they giving us a simplified version of
reality this is simplified because this
is 2D a real apple is in 3D we cannot
eat this you know whereas a real Apple
we can eat that so in in relation to the
real Apple this is a very hyper
simplified model but it's it's useful
too because it still helps us to explain
the world to somebody who might not who
might not otherwise see an Apple so that
when that guy finally sees an apple he
like oh I see I see I think I know what
this is this is an apple because I have
seen a model of an Apple so that's why
we do we fit regression models so that
we have a simplified version of reality
so that we can explain the world people
so they can they can understand too now
you must also understand that all models
are wrong but some are useful so don't
be so
confident in your your regression model
right it is wrong you know like you know
George Box famously said it might have
some use but all models are wrong so you
have to understand that going in that
you know it's it serves some purpose but
of course it's you know it's that's why
we try to measure uncertainty with our
regression
analysis now for us to understand
regression we need to understand certain
things because from next week we'll be
talking about a lot of things so we all
need to be on the same page about some
Concepts
that are actually basic but are sometimes
misunderstood so the first on that list
is hypothesis testing and P values
everybody talks about P values but
nobody knows what P values are so we
need to clarify
I can't just assume you all know it
so I just have to go over it so um if
you if you're an expert in P values and
you are feeling bored just hang in
there we'll get over it fast
we also have to talk about some terms
that are again widely used but
misused as well univariate bivariate multivariate
and multivariable what do these things
mean and how do we use them we need to
understand bias and validity we need to
talk about types one and two error and
we need to talk about power and sample
size requirements because
obviously regression analysis is part of
inferential statistics that means that
we want to make inferences and for us
to make those inferences that assumes
that we have enough power to make those
inferences imagine you know a
research being like a spacecraft going
to the moon and the the sample size is
like the fuel in that rocket ship
obviously if you don't have enough fuel
you can't get to the moon right so if
you don't have enough power to fuel your
hypothesis you won't get the results you
desire and you might end up coming with
a wrong conclusion and misleading all of
us so talk about sample size
requirements because you cannot talk
about regression without talking about
all of these things so while I know you
guys would love to jump into logistic
regression and you know you we need to
build on the foundation first and so
that's why that's the purpose of this
first lecture so that we all on the same
page on these basic um foundational
Concepts all right let's talk about the
the first set of things we want to talk
about which is um truth now every single
thing you you see in the world can be
explained by one of three phenomena it's
either the truth it's either chance or
it's bias so everything you see in the
world um you know whether it's whether
it's something in Media or whether it's
um any observation can be explained by
one of those three things it's it's the
truth or it's chance or it's bias um
truth is what we desire we want to find
validity another name for truth is
validity right um and that's when you do
a research what you are actually doing
is you're trying to find the truth right
that's a very simple way to express it
so when you say what is the relationship
between smoking and lung cancer and
another way to say that simply is what
is really the truth between this smoking
and lung cancer like what is the truth
about it right so any research you do
you are trying to identify the
truth but the challenge is that truth is
not just hanging in space where you just
go and pluck it like a mango tree truth
is buried and intertwined with chance and
bias so for you to find out the truth
you have to dig and extricate it from
bias and from chance right and that's
where regression comes in but you need
to understand these three things right
that okay there's truth but then there's
chance and there's bias and and the
whole point of regression is to separate
the truth from bias and from chance now
bias we talk about bias quite a bit in
this class because that is the whole
reason why we do regression bias means
systematic error
right and bias is different from error
because bias tends to be systematic
whereas error is at random so
you can think of
error like an honest mistake
whereas bias means you know it's that's
not it's not an honest mistake it's just
it's a pattern to it you are just being
deliberately deceptive that's what bias
you can think of it that way so error
means it's at random whereas bias means
it's
systematic now in epidemiology there are
three main types of bias we worry about
confounding bias selection bias and
measurement bias we'll talk about these
things in the next few slides but you
need to understand these things at a
very very basic level to to be able to
do regression and interpret them
well and then so truth again is what we
are trying to find bias is systematic
error chance it's now random error and
right and that's why we use P values so
why we use P values to account for the
for chance to control for chance so um
again we'll talk about these principles
um extensively in next few slides so so
when you're doing your your research you
really want to ask yourself so what what
truth am I offering in my study is it
the whole truth or is it the hole truth
the truth with holes right so and this
is the problem with um this is why
journals will not necessarily be so
enthusiastic um publishing your paper
that just has descriptive analysis
because it is the hole truth the truth
with holes right um because there are so
many alternative explanations if you
said for example that um smokers are
more likely to smokers you know smoking
for
example is associated with hypatos right
so you you might end up saying oh
smokers have have um have healthier
mouths it's like okay is that the whole
truth or is that the truth with holes
what are alternative explanations so the
problem with with descriptive analysis
that don't have regression in it is that
you just report the superficial truth
you don't really dig deep to see H are
there alternative explanations right
have we adjusted for so and so and by
the way we'll explain what adjusting
means in a minute right so but the
bottom line is that I want you to know
that your study could either be the
first one which is the whole truth and
nothing else but the truth or it could
be a truth with holes right which is no
Truth at all and and so that is why
again we do regression
analysis now let's talk about some terms
that we use a lot in regression or
analysis to make sure that we all on the
same
page so when you have one dependent
variable and you you analyze it that's
called a univariate analysis and this is
this is um something that you
find humorously quite quite
interestingly mistaken even in top
journals just just this morning I have
to show you guys this just when I was I
was looking at a paper in um in jamama
and actually saw a paper let me let me
show you this I just have to show you
this uh and
I let's see where where was this paper
let's look at it so that you can see how
even in the topmost journals Sometimes
some of these things slip through the
cracks and why we don't want your paper
to be
um if I can find this let's
see
home what what's this
paper um it was quite hilarious so
um I'm like okay this these authors need
to have a basic basic instruction about
difference between byar Univar if I if I
cannot find it now I'll definitely look
for it afterwards and share with you let
me see if it's if it's this
one um but the authors essentially
completely mixed up the meanings of the
word univariates and and B varites and
and let's see if it's this one if it's
not I would definitely look for it and
share with you guys after the the class
just to save time um let's
see
so ah I found
it
So, this Table 2 says "Participant
characteristics by comment tone." Forget
about the context; what we want to focus on is the
table itself. What they're doing is
taking each characteristic and
comparing it across three groups:
a negative group, a neutral group, and a positive group.
Then they have a p-value for those
comparisons. Interestingly enough,
they call this univariate analysis. There
are a lot of other things that are
just not right in terms of the
statistical approach, but what we
just saw there is not univariate; it is
bivariate. Univariate, as the name implies, is when you are
looking at one single variable. For example,
if I look at overall means, overall
prevalence, or medians in the overall
population, that is univariate: I'm
looking at just one variable. It is impossible
to have a univariate analysis with p-values,
because a p-value means you're
comparing two things. Once p-values
enter the picture, it's not univariate;
it's at least bivariate or
multivariable. So please don't
make that mistake. "Uni" means one;
"bi" means two. In a bivariate
analysis you are stratifying, for example,
prevalence, like the example we just saw
where they were looking at the
mean of age across the three groups of
that other variable. That is
bivariate. Stratified prevalence,
stratified means, or unadjusted
regression analysis: all of those are
bivariate analyses. In that case you have
one dependent variable and one
independent variable. In univariate you have
one dependent variable and no
independent variable. Now let's go on.
Multivariable is when you have one
dependent variable and two or more
independent variables: one outcome and
two or more predictors. That is
multivariable.
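The counting rule described above can be written down as a small helper. This is only an illustrative sketch: the function name `classify_analysis` and its arguments are hypothetical, not something shown in the lecture, and the branch for two or more outcomes anticipates the "multivariate" case that the lecture turns to next.

```python
def classify_analysis(n_outcomes: int, n_predictors: int) -> str:
    """Label an analysis by counting outcome and predictor variables.

    Hypothetical helper that encodes the lecture's rule of thumb:
      - 1 outcome, 0 predictors   -> univariate
      - 1 outcome, 1 predictor    -> bivariate
      - 1 outcome, 2+ predictors  -> multivariable
      - 2+ outcomes               -> multivariate
    """
    if n_outcomes < 1 or n_predictors < 0:
        raise ValueError("need at least one outcome and a non-negative predictor count")
    if n_outcomes >= 2:
        return "multivariate"
    if n_predictors == 0:
        return "univariate"
    if n_predictors == 1:
        return "bivariate"
    return "multivariable"


# Overall mean age: one variable, nothing to compare it against.
print(classify_analysis(n_outcomes=1, n_predictors=0))
# Mean age compared across comment-tone groups (the Table 2 example).
print(classify_analysis(n_outcomes=1, n_predictors=1))
# One outcome modelled with several predictors at once.
print(classify_analysis(n_outcomes=1, n_predictors=4))
```

Note that the number of rows, groups, or results on the page never enters the function; only the count of outcome and predictor variables does.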
Now, this is where a lot of researchers
get it wrong. Researchers love to use the word
"multivariate," and 95% of the time
they're using it wrong. What they actually mean to
say is "multivariable logistic regression,"
but they end up saying
"multivariate logistic regression." So
let's explain what those terms
mean and the differences between them.
Multivariate could mean one of two things. It
could mean you have two or
more outcomes (take note: in multivariable
you have only one outcome variable; in
multivariate you have two or more
outcome variables) and one or
more independent variables. For example,
the multivariate ANOVA (MANOVA) is an example of a
multivariate analysis. It could also mean that you
have two or more outcome variables and
no predictor variable. That means that all the
variables in the analysis are on an equal footing:
there is no one variable that is the outcome
with the others predicting it; they are
all outcomes. Think of things like
factor analysis or principal component
analysis. In those cases you don't
have a predictor per se; they are all
outcomes, and you could have 10
or 15 of them, but definitely
two or more. The bottom line is that when
you have one outcome and multiple
predictors, it is not multivariate; it is
multivariable. So let's please use the
proper language when we're writing in
our
papers. All right, let's use an analogy to
explain univariate, bivariate, and multivariable.
You can think of a boiled egg as
univariate: there is just the egg.
Nothing is added to it; it is just a
plain hard-boiled egg. That is
univariate. Now, if we take that egg and
add pepper to it, it's not
just the egg anymore: we have the pepper plus the
egg. That is bivariate. So when we're
eating that egg, we're eating a
"bivariate" egg, if such a term exists,
because it's the egg plus one seasoning,
which in this case is black pepper. Now,
what if we decided to go to town and
added a whole bunch of stuff:
we have toast, we have guac, and
all manner of things. Now that is
multivariable, because there's a whole
bunch of stuff in that dish beyond
just the egg. And why are we adding a
lot of things? To make it nice,
interesting, and nutritious. So
that's one way you can
conceptualize the difference between
univariate, bivariate, and
multivariable. All right, here is
quiz time. This is your time to show
that you understand these concepts. I'm
going to show you a series of screens, and for each
one you're going to tell me whether the
analysis on that screen is
univariate, bivariate,
multivariable, or multivariate. Here, this
is showing menthol product prices across
different products:
you have prices for cigarettes, for
cigars, for smokeless tobacco, and for
roll-your-own tobacco. What kind of analysis is
this?
"Univariate."
Okay, univariate, excellent. It is univariate. Let me ask
you why. Somebody
might say we have one, two, three,
four products, so that is multivariable.
Why would that be wrong?
"I was thinking it is bivariate."
It's not. Why do you say bivariate?
"Because I'm seeing here we have a price and
some products, so I was thinking..."
Okay, but it's just
the price for each. Imagine each of these
items as a row in your data set:
each tobacco product is a row,
and the price is the
column. How many columns do you have?
"Four. We have four columns."
Each of the tobacco products is a row, and
their price is in a column.
The price is a variable. How many columns?
"Sorry, two columns: one for cigarettes, one for price."
No, you're not hearing me; let me
explain myself again.
I think what is confusing you
here is that we're looking at tobacco
products, but imagine these are
the units of observation.
Each tobacco
product is a row in your Excel
spreadsheet, a row of its own,
and the price is a
variable of its own. How many variables
do you have in that data set?
"One."
Okay, all right, let's do this then.
You have only one
variable, which is the price.
In that case you can
only find the mean and standard deviation.
So we have cigarettes,
we have cigars, we have smokeless
tobacco, and we have roll-your-own
tobacco. These are the different products
we're assessing in our study. And here you have
price. The price for cigarettes (I'm
just making this up) is
12.25, the one for cigars is
10.21, the one for smokeless tobacco is
3.44, and the one for roll-your-own is 5.44. So, ladies and
gentlemen, how many variables do we have?
"Two." "One variable." "One."
I am baffled that some people
are saying it's two variables.
"How do you account for the items listed? It's two variables."
Items are the units of observation.
"We have four variables, which
are the things you listed: the cigarettes, the cigars, the
smokeless tobacco, and the roll-your-own."
Okay, let me explain. I think there's
a problem with understanding what a
variable is.
The columns are what we call the
variables. The rows are the units
of observation. The rows could be people;
in this case they are tobacco products,
but they could well
be Johnny and Jack and John Doe and
Jane Doe. Typically we're used to the
rows being individuals, but the rows
could be anything. In
this case our units of observation are
tobacco products, which are the rows.
A data set is made up of rows and
columns, and in this case the outcome
is price, which is one column. That is
why this is called univariate:
there's only one variable we're
analyzing across the different products. I
hope that is clear.
"Sure." "Yes. And
when you relate it to matrix
dimensions, it's still the
same: one dimension."
All right, so as a matrix this is also one-dimensional,
which is what you are supposed to have.
All right, moving on. This is
looking at trends in
menthol cigarette smoking over time. Is
this univariate, bivariate, multivariable, or
multivariate?
"Bivariate." "Univariate."
Okay. What do you have on the y-axis?
"Prevalence."
Prevalence of what?
"Smoking."
Okay. And the x-axis represents what?
So how many variables are we
looking at now?
If we're looking at two variables,
what kind of analysis is that?
So this is a bivariate
analysis. All right, so here I'm looking
at how the percentage using e-cigarettes
and any flavored non-cigarette tobacco products differs
by menthol status. We're
comparing menthol users and non-menthol
users in their use of e-cigarettes. The
green represents menthol
cigarette smokers and the red
represents non-menthol cigarette smokers,
and this is showing ever use of
e-cigarettes, current use of e-cigarettes,
and current use of any flavored
non-cigarette tobacco products.
What kind of analysis is this?
"Multivariable."
This is a trick question, and the
reason I made it a trick question is
that when people see a lot of results on
the page they just say, "Oh, multivariable."
But seeing a lot of
results on the page does not mean it's
multivariable. You have to look at each
unit of analysis, each individual result.
"Is it bivariate?"
Exactly. It is bivariate, because each result
describes one outcome by menthol status. Okay,
this one is looking at support for bans on
characterizing flavors among all adults
in the US. The
variable is support for bans; it's
presented overall and broken down by
gender, by age, and by everything else. Is
it univariate, bivariate, multivariable, or
multivariate?
Excellent. Again, this was meant to trick you by
displaying a whole lot of things on the
screen and hoping you would be
confused, but you weren't.
All righty, now what about this? This is
showing odds of menthol use among current
smokers in grades 6 to 12, and
it is looking at all these variables.
What kind of analysis is this?
I can hear "multivariable."
Why is it multivariable and not
multivariate?
"It is multivariable
because the predictor variables are
many but the outcome variable is just
one."
Excellent. As long as you have only
one outcome, that's
multivariable. Once you have two or
more outcomes, that's multivariate.
The distinction is quite
clear, so please don't mix up
those. So what does it mean to adjust?
If you read any scientific
literature you've come across that term: "we
adjusted for X, Y, and Z," "we controlled for
X, Y, and Z." What does it mean? This is
the best way I can think of to
explain what adjusting means. Who
can describe to us what you see in
this picture? That's not a trick question;
truly, what do you see in the picture?
"It's boarded passengers on a plane."
Okay.
"A flight attendant serving
refreshments on the flight."
Okay. So how many people are moving at the time?
"One person."
One person at a time. And every
other person is what?
"Sitting."
And they are strapped in their
seats. Why are we strapping them in their seats?
So that they don't move.
This is what we mean by adjusting:
only one person
moves at a time. If
the pilot says, "Let's see
how the
airplane behaves when people are moving,"
or when, for example, drinks are being served,
then for us to observe that behavior we need
everybody else to be in their seats. So we are
adjusting for every other person's movement by
keeping them strapped in. In other words,
we're trying to remove them from the
picture completely and zoom in on
what we want. That is what we call
adjusting for something. Adjusting
means we keep them strapped down
in their seats; we hold them
out of the picture so that we
focus only on the outcome. So
let's go back to that previous
regression analysis. In this analysis the table is
looking at factors associated with
menthol use among current smokers. They
took a whole group of cigarette smokers and
asked them, "Is the kind of cigarette you smoke
menthol or non-menthol?" So we're trying
to see what factors are associated
with menthol use. Let's
walk through this and really
understand the concept of
adjusting for factors. Here we say
that, compared to males,
females had 1.17 times the odds (okay, this
one is not significant; let's just focus on the results
that are significant). The reference group
here is White. So we're saying that, compared
to White respondents, Black respondents had 3.19 times the odds of
reporting menthol use. That is, the
odds of smoking menthol cigarettes
were 3.19 times as high among Black compared
to White respondents, holding all other variables
constant. That means that when we are
looking at the relationship
between Black and White, every other
thing in this table is held constant.
And what do we mean by held
constant? We mean that we are strapping them
in their seats: we're
keeping them all at a constant value.
When we're looking at
race, we are assuming everybody is
male, because male is the reference group,
so everybody in that data set is
male at that time. Everybody in that data
set is in middle school at that time. So
there are no differences in sex, no
differences in school level.
Everybody is at the reference level, "smokes at
1+ hour"; everybody
is without a medical condition. So
everybody is held constant at the level
of the reference
group. If it's a continuous variable,
we likewise hold it at a fixed value.
So that's what we mean by
holding things constant. Again, back to
that picture of the plane: we are looking
at the relationship between Black and
White, and we want complete silence. We
want every other variable to be silent
so that we can focus on just that
relationship. So
we hold everybody constant at a
particular value, and that value is
supplied by us. That is why we
use reference groups: the whole point
of a reference group is that it is the
point at which we hold everybody
constant. So again, when we look at the
relationship between Black and White,
everybody is male at that point,
because we're
not looking at gender at that time.
All the participants in our study are
held at a fixed gender, so that the
differences we observe
can't be because of differences in
gender; everybody is the same
gender at that point, at least
statistically. That is what we mean by
adjusting or controlling for things, and
that's what regression analysis does: it
allows us to see the independent effect
of something while we hold every other
thing constant. If you don't
understand this now, don't worry. We have
four weeks to really go into this, and we're
going to cover all of it extensively,
so you'll be pretty sure by
that time what that means.
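To make "holding something constant" concrete, here is a minimal sketch with made-up counts. It does not fit a regression model; it swaps in stratification with a Mantel-Haenszel summary odds ratio, a simpler relative of regression adjustment, so the arithmetic can be followed by hand. All counts, the exposure, and the single adjustment variable (sex) are hypothetical and are not taken from the table discussed in the lecture.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for one 2x2 table.

    a, b = exposed with / without the outcome
    c, d = unexposed with / without the outcome
    """
    return (a * d) / (b * c)


def mantel_haenszel_or(tables):
    """Summary odds ratio across strata (Mantel-Haenszel estimator)."""
    tables = list(tables)
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


# Hypothetical counts: (exposed+outcome, exposed no outcome,
#                       unexposed+outcome, unexposed no outcome)
strata = {
    "male": (80, 20, 40, 10),    # outcome is common, exposure is common
    "female": (10, 40, 20, 80),  # outcome is rare, exposure is rare
}

# Crude analysis: ignore sex and pool everyone into one table.
crude_table = tuple(sum(cells) for cells in zip(*strata.values()))
crude = odds_ratio(*crude_table)

# "Strapped in their seats": compare like with like inside each sex,
# then summarise across the strata.
within = {sex: odds_ratio(*cells) for sex, cells in strata.items()}
adjusted = mantel_haenszel_or(strata.values())

print("crude OR:", crude)
print("within-stratum ORs:", within)
print("adjusted (Mantel-Haenszel) OR:", adjusted)
```

In these invented numbers the pooled comparison suggests an association, while inside each sex there is none: the crude result came entirely from sex moving around in the background. A multivariable regression pursues the same idea for many variables at once.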
So let's talk about truth and how it
affects our results.
Sorry, Lion King, you have
your hand up; do you have a question?
"Yes, Prof. Going back to the table you
presented, the investigators didn't
actually report the p-value. Is that
accepted in a journal like this?"
They have confidence intervals. You don't have
to show both confidence intervals and p-values;
the confidence interval suffices.
"Wow, this is my first time
hearing this. Okay, great."
All right, so let's talk about
those three elements, and then we'll
proceed with providing
context. All of this is to provide you
context so that when we start fitting
regression models and analyzing them you
won't be lost. If you've been to
our previous classes you should be
familiar with what we call the cycle of
inference: we take a sample from a
population, and from that sample we make
inferences. That is the cycle of
inference. External validity has to do
with how well we can take results from
that sample and generalize them back to
that population. Internal validity has
to do with the extent to which our study is
protected from bias. Remember, we
talked about three types of bias:
confounding bias, measurement bias, and
selection bias. To what extent is our
study protected from those biases? That
is what we mean by internal
validity. Now let's talk about bias,
validity now let's talk about buyas
since we're talking about the
relationship between buyas and internal
validity so let's think of a a study as
people plus the answers so when you take
a bunch of people plus the answers that
they give us to whatever question we
asked that is what we mean by study so a
study is equal to people plus the
answers now the three types of biases
can be mapped to those those that
framework we talked about when there's a
problem with the people that's a
selection bias so remember we said the
study is equal to people plus the
answers well if there's a problem with
how we selected the people that is
selection Biers if there's a problem
with how we collected their answers that
is measurement bias and confounding bias
is a problem where we cannot know who
said what there's so much noise in the
air that it's hard for us to know was it
this person that said this it's hard for
us to know for sure right so confounding
bias creates a scenario whereby we have
a whole bunch of alternative explanation
there's just so much noise in the air
that we can't really say that H this
relationship we are looking is it truly
coming from this exposure or is it just
coming from background noise right so
again simply put a study is people and
answers problems with how we selected
the people is selection buyers problem
with how we fetched or collected the
answers is measurement buyers and
background noise that distorts the truth
is what we call confounding buyers so
again this is another way of visualizing
selection bias how did people enter the
study and how did they exit if there are
differences between the study arms then
we that gives us selection buyas you can
think of confounding buyas as the nosy
neighbor right you know it's poking
their head in where wherever they not
wanted like dude you're not invited to
the party why are you here right so the
that's why we call confounding variables
the Nuance variable in epidemiology
because they are unwanted but they just
want to be there by all
means so regression analysis is all
about removing this guy this this this
nosy guy from our analysis so if you
were to summarize regression in one word
that is it yeah we want to get rid of of
this guy so that we can we can get the
big picture of of the relationships
without interference without any noise
right so you know that's a key objective
of analysis to to control for
confounding or to remove those nosy
neighbors from our
picture. All right, this is a test of
your understanding of bias.
Let's see how well you understand this. A
national household survey that uses
landline phones systematically
underrepresents youth and people in rural
areas. What kind of bias is that?
"Selection bias."
Excellent. Many people refuse to participate in a
government study because of concerns
about deportation.
Remember, we said a study is people and
their answers. A problem with how people
were selected, regardless of whether it
involved how they came into the study or
how they exited the study, gives
selection bias. So this is a selection
bias issue.
In a study examining gender
differences in tobacco use, pregnant
women who smoke are very likely to deny
being smokers.
"Measurement bias."
In a study of a very aged and largely
retired population, the study found that
the risk of dying is much higher among
young people.
"Confounding."
Confounding: some noise in that data.
All right, now let's go back to the third
pin, which is chance. Remember we said
every phenomenon in the universe can be
explained by truth, chance, or bias. You've
got to understand those things
to even do regression
analysis; otherwise you might just
be putting out garbage. So let's talk
about
chance. For us to understand chance, we
need to understand the concept of
decision-making and error rates. If
we took all the people in the world and
put them into categories, we could say
there are two types of people: alpha
males or alpha females, and beta males
or beta females. Alphas just rush
in. They're fearless and
very brash, so they
rush into any scenario: there's a
burning house, they're rushing in; the
building is collapsing, they're rushing in.
The problem here is that
the risk is you might lose
everything. Sure,
you can get accolades, but there is a
risk of running into trouble with
that brash approach to things.
The beta approach is a very timid
approach that refuses to engage.
So alpha is very
frontal, rushing into
everything; beta is very reserved, very
conservative. And by being beta you
end up losing out on opportunities,
because maybe that was an opportunity
for you to shine and you stayed in
the background and did not shine.
You end up losing out on some
things because you were not bold
enough. Those are the two risks, and that's
how you want to think about statistics in
general, because
when we analyze data we are
actually making decisions, and all our
decisions with data fall into those two
groups: either we are alpha,
rushing into decisions and
conclusions, or we are beta,
very timid, refusing to
make a conclusion because, for
whatever reason, we're
afraid. Now, this is the Greek alphabet, and
I just wanted to show you that alpha is
the first letter in the Greek alphabet
and beta is the second.
Issues of alpha are
called type I errors because alpha is the
first letter of the Greek
alphabet, and issues of beta are called
type II errors because beta is the
second letter.
Now let's explain
the concept of hypothesis testing,
because you cannot talk about regression
without talking about hypothesis
testing. When we test hypotheses we
usually have a null hypothesis and
an alternative hypothesis. The null
hypothesis is that nothing interesting
is happening here. That's why you see the
guy yawning: "Geez, this is so
boring, nothing is happening here." The
alternative hypothesis is, "Oh,
there's a party happening here; something
very interesting is happening here."
That's how you want to
understand the null and the alternative hypothesis.
When you read
papers, you see that they say
"statistical significance was set at p
less than 0.05" or "the alpha
threshold was at the 5% level." What does
that mean? You can simply
understand an alpha error rate in this context:
of 100 people you helped,
how many did you actually
harm? Say we
compare two drugs, drug A and drug
B, and we conclude that drug A is
completely useless, it does not save
anybody, and drug B saves 100 lives.
Of those 100 people we saved, how
many did we actually harm? In
other words, what is the false positive rate
associated with that study? If we say
that the p-value threshold was 0.01, an
alpha of 1%, we are willing to accept
that for every 100 people we helped, we
actually harmed one
person. If the alpha rate is 5%, we
accept up front that
for every 100 people we
helped, we actually harmed five of them.
And by "harm" I mean
false positives. That is the
concept of alpha. If
it's a 10% alpha, you are willing to
acknowledge that of every 100 people you helped,
10 were actually harmed or
not helped at all.
Now, this is a quiz for you. In a
very critical situation, do you want to
have a large alpha or a very small
alpha? If it's a life-or-death situation,
do you want a big alpha or a small
alpha for your threshold?
I see some hands up; you
can unmute and speak.
"Very small."
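In more formal terms, alpha is the probability of declaring a difference when the null hypothesis is actually true. One hedged way to see this is to simulate many "boring" studies in which both groups come from the same distribution and count how often a test still comes out significant. The sketch below uses only the Python standard library and a two-sided z-test with a known standard deviation of 1; the function name, group size, and number of simulated studies are illustrative assumptions, not anything specified in the lecture.

```python
import random
from statistics import NormalDist, fmean


def false_positive_rate(alpha, n_studies=2000, n_per_group=30, seed=1):
    """Share of simulated null studies that reach p < alpha.

    Both groups are drawn from the same Normal(0, 1) distribution, so the
    null hypothesis is true in every simulated study and every
    "significant" result is a false positive (a type I error).
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard error of a difference in means when sd = 1 in both groups.
    se = (2 / n_per_group) ** 0.5
    rejections = 0
    for _ in range(n_studies):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (fmean(group_a) - fmean(group_b)) / se
        p_value = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided p-value
        if p_value < alpha:
            rejections += 1
    return rejections / n_studies


for alpha in (0.10, 0.05, 0.01):
    print(alpha, false_positive_rate(alpha))
```

Each printed rate should land close to the alpha that was chosen, which is the sense in which a smaller alpha buys fewer false alarms.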
So if it's a life-and-death matter,
obviously you want it as small
as possible, so that the risk of
harming people is very low. One
of the take-home messages I really want
you to get from this class today is that
people somehow believe that this p-value
of 0.05 was sent to us by God and
delivered by angels. No. It is set
contextually in your study,
depending on a lot of things.
I can choose to have an alpha
level of 0.1 or
0.01; you're not forced to use 0.05 as a
threshold. This is why you need to
understand these fundamental concepts,
because if you don't understand them, how
do you even begin to apply them
in your research? That is why we
have to go over this concept, so that
we are all on the same page as to what
it means. Here is another way of
thinking about the p-value:
you can think of it as
a lie detector test. When you take a
lie detector test (I have never taken
one, but I've watched them in movies) they ask
you some opening questions first, like:
Is your name Israel Agaku? Is today
Friday, August 9th? Are you sitting? Are
you wearing shoes right now?
These are standard
questions, and the whole
reason for asking those baseline
questions is to establish the truth. So
when they now ask you, "So, did you steal
that money?" and you give an answer,
the response under the test is quite
different from the baseline we've
established from the truth. For
each question you are asked and
answer, the lie detector measures
your biomarker levels and how
you respond, so that when something
strays from that truth, it is flagged as
not consistent
with what was observed under the truth. So
the null hypothesis is this part here:
"Oh, this is so boring, nothing is
happening here." But when you start
lying, something is happening here,
because the responses we are seeing
are quite different compared to when
nothing was happening. The bottom line
is that the
p-value is a
conditional probability:
we're always comparing the response
to when nothing was happening.
That is how you want to understand a p-value.
The question is: what is the probability
of us seeing these kinds of very
strange responses if indeed
nothing was happening? And if
that probability is very small,
we say, "You're lying. You stole that
money, didn't you?" That is how
you want to think of p-values:
we are always
comparing whatever we see to the
scenario when nothing was
happening. So we've talked about univariate,
bivariate, and multivariable. We're
going to discuss much
of this in the next few
sessions, but I'll touch on
each of them just a
little. Univariate could mean: what is the
mean value of X? What is the median of X?
What is the count of X? What
percentage of people reported X? All of
those are
univariate. Bivariate could be: what is the
difference in the mean value of X
between groups Y and Z? Now
we're relating two variables, and that is
bivariate. Multivariable: what is the
difference in the mean value of X
between groups Y and Z after holding A, B,
and C constant? Again, we're holding all
the other variables constant and trying to see
the independent effect of that exposure
group. We're going to cover all of
this. Over the next four
weeks we're going to talk a lot about
the different types of regression
analysis, binary logistic and so on. We
won't go
faster than our shadow now; again,
this is just to give you a glimpse
of what we're going to be covering over
the next few weeks. We'll go over each
type of regression in detail, we will
fit it in Stata and R, and we will learn how
to interpret the results and how to do
possible sensitivity analyses as
well. All right, very important
things to take note of when doing
regression analysis: you want to make
sure you consider issues of type
I and type II error. The biggest
issue with type I and type II error is
small sample sizes, so we cannot talk
about baseline knowledge of regression
without talking about sample size and
power. Here is a pop quiz for
you. Let's see the relationship
between effect size and power. The
pop quiz is: the more powerful a
microscope is, the [blank] the size of
things it can detect. What is the blank?
"Smaller."
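The microscope analogy has a direct numerical counterpart: the smaller the difference a study is meant to detect, the more participants it needs. The sketch below uses the standard normal-approximation formula for comparing two group means, n per group = 2 * ((z_(1 - alpha/2) + z_power) * sd / difference)^2. The function name and the example numbers are hypothetical, and this is not a description of how any particular sample size tool performs its calculation.

```python
import math
from statistics import NormalDist


def n_per_group(difference, sd=1.0, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-sided, two-group
    comparison of means (normal approximation, equal group sizes)."""
    if difference == 0:
        raise ValueError("the difference to detect must be non-zero")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    n = 2 * ((z_alpha + z_power) * sd / difference) ** 2
    return math.ceil(n)


# Shrinking the difference we want to detect inflates the required sample.
for difference in (0.8, 0.5, 0.2):
    print(difference, n_per_group(difference))
```

With the default settings, halving the detectable difference roughly quadruples the required sample size; for example, a difference of 0.5 standard deviations needs about 63 people per group, while 0.2 needs about 393.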
Exactly. If we want
to detect very
small differences, that means we need a
very powerful study. The
smaller the differences we want to detect,
the more powerful our
study must be, or the bigger our
study will have to be. That is
something we have to keep in
mind. Now, sample size calculation has to be done
precisely, because there is such a thing
as too much sample size. I
love referring back to this paper
because it's quite alarming.
The bottom line of the paper is
that of 215 randomized trials
published in some of the biggest
journals, only 34%
correctly calculated sample size. If
that is not alarming, I don't know what
is. And these were some of
the biggest journals, like The Lancet, the New
England Journal of Medicine, and so on.
So the bottom line is that you
have to determine sample size correctly,
and you also have to know
how to calculate sample size for when
your analysis is exploratory versus when
it's confirmatory. When we're
doing exploratory analysis, the
most important thing we're
concerned about is precision. Precision
is essentially how narrow or how
wide our confidence intervals are.
But when
we are calculating sample size for
confirmatory analysis, we are worried
about power. Power means the ability to
detect differences where they exist. So
sample size calculation is quite
different for exploratory versus
confirmatory analysis. We're going to
use the rest of this time to see how we
can set up our sample size calculation
for power, for a
confirmatory study, as well as for an
exploratory study, so that we can keep
that in mind and avoid type II error.
Type II error simply means that
there's a difference in
nature but our study fails to detect
that difference because our sample size
was too small. All right, so now I'll go
ahead and show you how to calculate
sample size for a
confirmatory study as well as for a purely
descriptive analysis, and then I will
take
questions um all right let's go over
here so we're going to use the K
platform to calculate our sample size
for
our
Okay, so we're going to use the Chisquares platform to calculate sample size under the scenario of a confirmatory study as well as a more exploratory analysis, and we can see the kind of things we emphasize the most when we are looking at a confirmatory study versus an exploratory one. For those of you who don't know what the Chisquares platform is, it's a one-stop shop for scientific research. The platform allows you to design your study: to calculate sample size, to sample, to design surveys, to collect data, and so many other things. We're not interested in those other elements today. What we're interested in is how we calculate sample size for the study so that we ensure that our results are adequately powered. So let me log in.
Sir, how do you get into the Chisquares platform? How do you have access?
L will share the link with you. L, can you kindly share a link to the platform with everyone, please? I appreciate it. Okay, great, thanks.
So, on the panel on the left you see some of the functionalities of the platform. Survey design is where you design your survey. Sampling is where you draw a representative sample. Survey bank is a repository of thousands of questions. But what we're interested in now is sample size: how do we calculate sample size for our study?
Let's start with a confirmatory study. A confirmatory study will be a comparative study with two arms. Remember those examples we looked at earlier, where we're looking at a specific exposure. For example, we asked how dose one compares with dose two in reducing symptoms of anxiety. The exposure has two arms: the first arm is dose one, the second arm is dose two. Our outcome is symptoms of anxiety. So in that case we're looking at a comparative study, comparative because we're comparing dose one against dose two.
So we select comparative study with two arms. Why two arms? Because there are only two groups in the study, dose one and dose two. If we were comparing three doses, dose one, dose two and dose three, then we would go to comparative study with three or more arms. But for the sake of this illustration, we have this one.
The first thing we have to do is enter the prevalence of the outcome among the control group. The control group was the standard dose. If you remember, the question was how this particular dose compares with the standard dose. The standard dose is the standard of care, the current treatment that is being given to patients. So what we're asking here is: what percentage of patients who are on the standard of care report symptoms of anxiety? Let us say that percentage is 15%. We don't know what it is; I'm just making it up. How do you know what that number is? You go and do a lit review, and that's how you know. So we provided 15% there. Now we have to provide one of four values here to
determine effect size. Effect size is...
Sir, please, hello boss. With the 15%, are you...
I'm having difficulty hearing you. Please repeat yourself.
Okay. Hello?
Yes, go ahead.
Hello? Hello?
Yes, we can hear you. Go ahead.
I mean the 15...
Okay, you know what, can you type your question in the chat box? I think you are really struggling with your internet.
The 15%, where do you get it from? Is it from an article or from your raw data?
Okay, the 15%. It can't be from your raw data, because you're trying to calculate sample size to do the study, so you don't have data yet. In this case I'm saying you should go and do a literature review to see what other studies found: of the patients on this treatment, what percentage have anxiety symptoms? It doesn't have to be from your community, because we can argue that biological responses are likely to be the same across populations, so the way one population responds might be the same as another. That assumption can be made. But it's something you can get from a lit review. Now, if you cannot find it at all in the lit review, you can convene a group of experts and ask them, "What do you think?" One of the things you have to realize is that sample size calculation is a scientific guess, because we're just trying to get an estimate. It's not 100% foolproof. But any educated guess is better than nothing at all; it's better than just coming up with a random number from your head. For the sake of this demo I am just making up the numbers, so I just want to be clear.
Thank you.
All right, great. Now you have
to choose a quantity below and enter its hypothesized value to assess effect size. Remember that effect size is the difference we want to detect between drug one and drug two. The smaller the difference we want to detect, the larger the sample size we will need. Remember, we said that a powerful microscope is needed to detect small things. So if we want to detect very small differences, we need a much bigger study. And there are many ways you can express the difference you want to detect: as a prevalence ratio, as an odds ratio, or as an absolute prevalence difference. Let's say I hypothesize that this new drug is extremely good and that it's going to reduce the prevalence of anxiety among my patients. I think that among patients who are on this new treatment, the prevalence of anxiety symptoms will be 10%. So I select that indicator and I enter my value. Next I provide a level of confidence.
So in order to become a professional Forex trader you will need...
Mr. Forex Trader, can you please mute yourself? Daniel, please, can you mute yourself? We have zero interest in your Forex ventures. Thank you.
So, level of confidence desired: I provide 95. Is the test one-sided or two-sided? You always want to go with a two-sided test. A two-sided test is like crossing the road. From when we're young we are told: look left, look right before you cross the road. Don't just look left and assume that there are no cars coming to kill you from the other direction. Now, a two-sided test will obviously cost more in sample size. Think of it from common sense: if you look to the left of a road and then to the right, that takes more time than if you had looked at one side only. But the idea is that research means we don't know what we don't know, so we can't assume that the relationship will go one way. We have to acknowledge that it might go in that direction or this direction, and that's why your test has to be two-sided.
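The point about a two-sided test costing more in sample size can be sketched numerically. The block below is a minimal illustration using a common textbook normal-approximation formula for comparing two proportions; the platform's own formula and settings are not shown in the transcript, so the function name `n_per_arm`, the 80% power default, and the resulting numbers are assumptions and need not match the platform's output. It reuses the made-up 15% and 10% prevalences from the demo.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p_control, p_treatment, confidence=0.95, power=0.80, two_sided=True):
    """Per-arm sample size for comparing two proportions (normal approximation).

    Hypothetical helper for illustration; not the platform's implementation.
    """
    alpha = 1 - confidence
    # A two-sided test splits alpha across both tails, which raises the critical value.
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the binomial variances in the two arms (unpooled form).
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    delta = p_control - p_treatment  # effect size as an absolute prevalence difference
    return ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)


two_sided_n = n_per_arm(0.15, 0.10, two_sided=True)
one_sided_n = n_per_arm(0.15, 0.10, two_sided=False)
print("two-sided per arm:", two_sided_n)
print("one-sided per arm:", one_sided_n)
```

Under these assumptions the two-sided design needs more participants per arm than the one-sided one, which is the "looking both ways costs more" trade-off described above.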
The power desired: you should typically enter values between 80 and 90. Anything lower than that means your study might not be adequately powered to detect differences. I provide a ratio of controls to cases of one, since my ratio is one to one, and the anticipated response rate I provided as 50, and I calculate sample size. This tells me that I need 393 people in the treatment group and 393 people in the control group, or a total of 786 participants in my study. I can show the reference here to get the formula that was used for the sample size.
Now, very important things to note. One: statistical power is driven by the smaller arm, so you always want to achieve some kind of balance between the two arms. It doesn't help you if you have 786 people in your study but you have 86 people in one arm and 700 in the other. No, the study is not well powered, because power is driven by the smaller arm. Take two studies that both have 786 people. Study one is balanced; it's well powered. Study two has the exact same number of people but is not balanced: it has 700 people in one arm and 86 in the other. That is not a good study. So you want to strive for balance, because again, the take-home lesson is that power is driven by the smaller arm.
Now I can use this formula. If I'm writing a grant, I can download the formula, and I can also copy that citation and include it in my work. So that is how I would calculate sample size for a confirmatory study. Now, if the study is just exploratory, that is the same as calculating sample size for a descriptive study. Sorry, your hand is up, Lion King? Okay, maybe that's an old hand. All right, so that's how you calculate sample size for a confirmatory study. If you have questions, you can ask before I leave that.
Yes, Prof. My
question here is: how do we know the algorithm or the formula behind these calculations? And is it possible, for academic purposes, for the software to show the step-by-step calculation before it arrives at the final answer, like 7...
Well, okay. The step-by-step calculation is what's in the formula. That's why you have the formula there, so that if you want to do it manually, you can go step by step. And these are standard formulas in science, so regardless of the tool you're using, it's the same formula. The calculation is not being done by one guy sitting behind the scenes and writing it out; it's done by code, and that's why you have this formula there. In case you want to do it by hand, you can, because science has to be reproducible, meaning that if I generate a result, you too should be able to use the same formula and get the same result. That's why this is provided, for purposes of reproducibility. I hope that is clear. Any other questions?
Okay sir, thank you
for the lecture. You said something about power, that the power is driven by the smaller arm, and you also talked about accuracy in that connection. And furthermore, you also mentioned that when you are dealing with grants and the like, these are the things they pay attention to. So I would appreciate it if you could expand on them.
On which one exactly?
On how we get the power, because I'm kind of confused about the concept behind it.
How do you get the power? The power what?
Yes sir, you said the power is driven by the smaller arm.
Oh, okay. Power is not a physical construct. It's the same way we talk about the power of a microscope. When you say a microscope is powerful, you are simply saying that the microscope can detect even the smallest things. So when we talk about power, we're not talking about power in the sense of, wow, you have a lot of muscles. In simple terms, power means that you have a big study. And you don't have to manually try and figure out what the power is. That's why you have tools like the Chisquares platform that allow you to do that. Here in this calculation we have this "power desired" field. Just know that a good power should be between 80 and 90%; that's what you should strive for. So you don't have to master a lot of things; just know the basics. The platform also provides you with hints. For example, let's look at the hint for power. It says that just like a powerful microscope allows us to detect small things, a powerful study allows us to detect small differences. Power is calculated as one minus the probability of a false negative and is expressed as a percentage. We recommend using a power of between 80% and 90%. So everything you need is already provided for you. You can just read the hint and follow what it says. Now, if you want to go and do something else, of course you're free to do that, but the recommendations are there for you to follow and use as suggested. I hope that clarifies.
Understood, sir.
All right, great.
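The claim that power is driven by the smaller arm can be checked with a short calculation. The sketch below uses a standard normal approximation for the power of a two-sided test comparing two proportions; the function name `power_two_proportions` and the prevalences of 30% and 20% are made up for illustration and are not taken from the platform. Only the arm sizes (393/393 versus 86/700, both totalling 786) come from the discussion above, and the absolute power values depend on the approximation used; the comparison between the two designs is the point.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p1, p2, n1, n2, confidence=0.95):
    """Approximate power of a two-sided test comparing two proportions.

    Power = 1 - P(false negative). The negligible chance of rejecting in the
    wrong direction is ignored. Hypothetical helper for illustration only.
    """
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Standard error of the difference; the smaller arm dominates this term.
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return NormalDist().cdf(abs(p1 - p2) / se - z_alpha)


# Same total of 786 participants, split two different ways.
balanced = power_two_proportions(0.30, 0.20, 393, 393)
unbalanced = power_two_proportions(0.30, 0.20, 86, 700)
print(f"balanced 393/393:  power = {balanced:.2f}")
print(f"unbalanced 86/700: power = {unbalanced:.2f}")
```

With the total held fixed, the unbalanced split has a larger standard error because the 86-person arm contributes most of the variance, so its power comes out lower.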
Any other questions?
A quick question from me. I was going to ask about the effect size, which is a bit different from the power. What if you are designing a case-control study and you have an anticipated effect size, say you want to detect at least a 15% difference between the cases and the controls? How do you fit that into this equation? And you may also want a case-to-control ratio of maybe 1 to 3 or 4, and the higher the ratio, the more powerful the study would be. How do you account for that here?
Okay, so let's go back to the microscope analogy. I like simple stuff. The effect size is the size of the small organism under the microscope. The power is the lens: how well can the lens detect this tiny microbe? In the formula we present the effect size as delta. On the app we provide you with a lot of ways to capture the effect size. Let me go with the example you mentioned, the case-control study. In a case-control study, what we calculate is odds ratios, so it will make sense for me to use odds ratios.
Please, can you kindly mute yourself if you're not interacting with the class?
What was I saying? Okay, case-control studies, yes. In that case I may want to have the odds ratio as the measure of effect size, and I might say I want to detect at least an odds ratio of 1.5 between the cases and controls.
Mute yourself, please.
All right, so we provide an effect size of 1.5, and here we say we want a ratio of controls to cases of four, per the scenario you described, and we click on get results. That tells me that I need 56 cases and 221 controls. In a case-control study, having up to four controls per case increases the power, but beyond that it doesn't help you. If you have five controls to a case, or six, it doesn't help you beyond four. That's why we typically cap the ratio of controls to cases at four in a case-control study. But you can play with this. Regardless of the parameter you supply here, all roads lead to Rome, as they say. Whatever parameter you supply is used to calculate the delta, which is then fed into the formula. The many options are just there to give you variety. There are mathematical relationships between all of these quantities, so regardless of what you supply, we use the mathematical formula that relates them to convert the parameter to what we need. So you can use whatever parameter or option best suits what you're trying to capture. I hope that makes sense to you.
Yeah, it does. Thank you.
You're welcome. Any other questions?
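The "all roads lead to Rome" idea, that any of the effect-size inputs can be converted to the same delta, can be sketched with the standard relationships between an odds ratio, a prevalence ratio and an absolute prevalence difference. The function names below are hypothetical, and the 15% control prevalence is reused from the earlier made-up demo value; how the platform performs the conversion internally is not shown in the transcript.

```python
def prevalence_from_odds_ratio(p_control, odds_ratio):
    """Prevalence in the comparison group implied by an odds ratio."""
    odds_control = p_control / (1 - p_control)
    odds_other = odds_ratio * odds_control
    return odds_other / (1 + odds_other)


def prevalence_from_ratio(p_control, prevalence_ratio):
    """Prevalence in the comparison group implied by a prevalence ratio."""
    return p_control * prevalence_ratio


def prevalence_from_difference(p_control, difference):
    """Prevalence in the comparison group implied by an absolute difference."""
    return p_control + difference


p_control = 0.15  # made-up value, as in the demo
p_other = prevalence_from_odds_ratio(p_control, 1.5)
delta = p_other - p_control  # the quantity fed into the sample size formula

print(f"comparison-group prevalence: {p_other:.4f}")
print(f"delta (absolute difference): {delta:.4f}")
# Supplying the equivalent prevalence ratio or difference gives the same delta.
print(f"equivalent prevalence ratio: {p_other / p_control:.4f}")
```

Whichever quantity is supplied, the same pair of prevalences (and therefore the same delta) results, which is why the choice of input is a matter of convenience.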
Yeah, Prof, what about those of us who are doing quasi-experimental studies? For instance, I want to reduce malaria burden from 51%, about a 10% reduction. Which of the indicators should I use? Should I use the prevalence of outcome?
It doesn't matter. You can use any of them; any one of those is fine. All roads lead to Rome, remember. Use the one that you're most comfortable with. It's all about your comfort and what you have on hand. We don't want a scenario where you've got a result from another study as an odds ratio of 1.5, there is no allowance for odds ratios in the app, and you now have to work out how to convert an odds ratio of 1.5 to a prevalence. That is unnecessarily complex and makes the app user-unfriendly. That is why you have all the options. You can just fill in whatever you have on hand or whatever you're most comfortable with. I hope that makes sense.
Yes, please.
Okay, great. Any other
questions?
Yeah, thank you very much, Professor aaku. I just want to follow on from the question the last-but-one person asked about case-control studies. Having calculated your sample size, would you say a study that has a 1:4 ratio of cases to controls would be more powerful than one that has a 1:1 ratio?
Yes,
and that's the whole reason why we do it: to increase the statistical power so that we can detect differences. But anything beyond that doesn't add any advantage. Now, that arbitrariness in the number of controls per case is also the reason why we can never use case-control studies to calculate prevalence for the population, because the denominator is arbitrarily changed. Whether we select one control, two controls or three controls per case, the denominator of the study will change arbitrarily, so with one selection you might have a prevalence of 22 and with another selection a different figure. That is why we can never use case-control studies to generate population prevalence. But yes, the reason why we do that is to enhance the power of a
study.
Okay, thank you. But in the case we just did, where the sample size you needed was about 700 and something and you had a 1:1 ratio, is it possible to calculate it with a 1:2 or 1:3 ratio, just to increase the power of the study?
Yeah. In general, whenever you're doing sample size calculation, it's not a very good idea to have one single number and just give it to the grant reviewers and say, "Hey, this is the sample size calculation." It's always better to have a grid, a table where you show the differences in assumptions: in this case we're using a ratio of 2:1 or a ratio of 1:1, or we're assuming power of 80% or power of 90%. That way the reviewer gets to see the whole range of possibilities and the potential impact on the study, because at the end of the day the study's feasibility is driven not only by statistical considerations but also by logistics and, frankly, money. That's why you want to have a wide range of assumptions, so that you look at every possible scenario. Then, should your resources take you only so far, you can know that it's not that bad given the other options you had; it's pretty close to the other assumptions. So it's always good practice to come up with a grid of calculations under varying assumptions.
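A grid of calculations like the one recommended above could be produced along these lines. This is a minimal sketch using a textbook normal-approximation formula for two proportions with unequal allocation; the helper `n_smaller_arm`, the chosen ratios and power levels, and the 10% versus 15% prevalences (the made-up demo values) are all assumptions, and the numbers are not expected to reproduce the platform's output.

```python
from math import ceil
from statistics import NormalDist


def n_smaller_arm(p1, p2, ratio=1.0, power=0.80, confidence=0.95):
    """Size of arm 1 when arm 2 is `ratio` times as large (two-sided test).

    Hypothetical helper based on the unpooled normal approximation.
    """
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Arm 2 is larger by `ratio`, so its variance contribution shrinks accordingly.
    variance = p1 * (1 - p1) + p2 * (1 - p2) / ratio
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# One row per combination of assumptions: (power, ratio, arm 1, arm 2, total).
grid = []
for power in (0.80, 0.90):
    for ratio in (1, 2, 3, 4):
        n1 = n_smaller_arm(0.10, 0.15, ratio=ratio, power=power)
        n2 = n1 * ratio
        grid.append((power, ratio, n1, n2, n1 + n2))

print("power  ratio  arm1  arm2  total")
for power, ratio, n1, n2, total in grid:
    print(f"{power:>5.2f}  {ratio:>5}  {n1:>4}  {n2:>4}  {total:>5}")
```

Reading down the table shows the trade-off a reviewer would want to see: a higher allocation ratio shrinks the smaller arm but increases the total, and moving from 80% to 90% power raises every row.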
Okay, thank you, much appreciated.
Definitely, you're welcome.
I was gonna... okay, sorry, go ahead.
Go ahead.
Okay. I was going to ask if the same assumption holds for cohort studies as well. Last month I reviewed a paper, from Sweden, on a nationwide cohort where the entire country is within the cohort study. That particular study used a ratio of one exposed to 10 unexposed; if you want to put it in case-control terms, it's like 1 case to 10. On the methodological component, we argued, in fact with my professor, that it is a very powerful study because they could have 1 to 10: the higher the ratio for a cohort study, the more powerful the whole analysis would be. So my question is, could the same assumption be different with cohort studies, even though some professors are arguing that the higher the ratio, the more powerful your sample would be? I just want to know your opinion on that.
So ultimately it depends on the outcome. In the case of a cohort study, epidemiologically, following 10 people for one year is the same as following one person for 10 years. Either way we have 10 person-years. So the question then becomes, what should we do? The answer is it depends on how rare the outcome is. If the outcome is very rare, then even if your study had a million people and you followed them and not one case of the outcome was detected, your study is completely useless. We cannot calculate anything from the study. So some of these things are more nuanced, because we then have to take into account the outcome: how prevalent is the outcome, what's the rate of the outcome, and how many cases do we anticipate across the groups? Go ahead.
Okay, so the study was
actually over a 20-year period. The main exposure was body dysmorphic disorder, BDD, and they had two outcomes they were measuring, intentional self-harm and suicidality. Within the span of 20 years the number of cases was not much: about 17 cases of mortality in the suicide-intent category and about 27 in the other category. So it was basically a 20-year period. Those diagnosed with body dysmorphic disorder were the exposed, the unexposed were those not diagnosed, and they were matched by age, sex and address, more like county.
Okay, and so you
can see, without reading much of the study, that one of the major considerations for the authors must have been the issue of competing risks. There's a risk in your study that you want to capture, but competing risks are other things that take individuals away and make them censored. Censoring, epidemiologically, means that the person disappears before we get to observe the outcome of interest. If, say, suicidal ideation defines an exposure group, maybe that person will have killed themselves before we ever got a chance to study the outcome of interest, so we've lost that person to competing risks. The question is how we then ensure that we have enough of those people to actually capture the outcomes of interest. Those are some of the considerations that might make you decide, "I want to have 10 of those ones to one of those, because those ones are more prone to be lost to competing risks." So some of these things are not necessarily black and white. In general, I think there's an art form to things like this; it's both a science and an art. So I generally wouldn't be too worried about whether this is more powerful or less powerful, as opposed to careful consideration of things like competing risks, ensuring that we have the ability to capture those people over time, and that we control attrition or loss to follow-up. In this case it's quite clear that you will most likely have differential loss to follow-up between the two groups, and most likely also differences in mortality not related to the outcome of interest between the two groups. All of those things have to be considered to make sure that at the end of the day you have rates of the outcome in both groups that allow you to capture the differences you want to detect. So I think a careful consideration of those kinds of matters is more important than a simplistic view of which is more powerful. I think it's more complicated than that. Especially with outcomes like these that are very prone to competing risks, you have to make sure that you have enough people and that you can follow them for long enough to observe the outcomes of interest. So it's not black and white; it's a gray area that is subject to a lot of considerations and nuance.
Yeah, thank you.
You're
welcome. Any other questions there?
Hi there, a quick one. Prof, can you hear me clearly?
Yes, I can hear you.
Yeah, I just wanted to find out. I've actually designed two surveys on the Chisquares platform which I'm yet to launch, because I'm also yet to calculate the sample size. What I'm eventually going to use is a structural equation modeling framework, and sometimes for that you need, in addition to the statistical power, the number of latent variables, the number of observed variables, and things like that. Are you able to customize the Chisquares sample size calculator for specific studies like that?
So, the reason we call them latent variables is that they are hidden. Latent variables are not observed. You design your instruments... What's your outcome?
How do you mean, sir?
I mean, what are you trying to measure?
We're trying to see the effect of artificial intelligence mediating emotional intelligence in engineering management.
Okay. Emotional intelligence is a very hard construct to measure, because there are so many indicators that might feed into this construct called emotional intelligence.
There are five of them.
Okay, great. I'm not an expert in that field, but you could definitely look at several indicators that capture what emotional intelligence is, and then from those observed variables you can derive the latent variable. So that is more of a statistical construct; it's not necessarily a sample size consideration per se. It has more to do with the quality of the variables. The bigger issue here is measurement error: how am I capturing the variables, how am I designing the questionnaire? Like you said, there are five items to that construct, but are you sure? There could be more.
There are five of them, broken down, from which you can break out the questions, which become the observed variables.
Okay. So my point is that in your case the bigger issue will be measurement error and measurement bias, as opposed to sample size. Your sample size calculation will still follow standard calculations, but you would really want to pay careful attention to issues of measurement bias, especially in those constructs, and ensure that you really are following standardized constructs so that your study is comparable with others.
Okay, that's fine. All right, thank you for now.
You're welcome. Okay, any other questions?
I was going to take you back to the bias section, where you mentioned measurement bias. Could that be grouped within the bigger picture of information bias, where we have misclassification, differential and non-differential, and all of that embedded within it? I've seen a couple of textbooks that categorize measurement bias within information bias. And at the far end of confounding bias we have collider stratification, you know, with the conditioning and blocking and so on. So I don't know whether those causal pathway explorations may be something you'll be covering in this course.
Yeah, we've taught causal inference, and we can have another course on that. But my experience is that when we teach such advanced material, the audience is very uncomfortable. We can definitely have more methods sessions that are more in depth and look at some of the more complex issues. But going back to your original point, different people call things differently, and that's why I always encourage...
Yes sir, can we just finish?
Sorry. That's why in general I recommend understanding the fundamentals as opposed to semantics. What it's called is generally less important than understanding, at the heart of it, what the issue is and how we address it. Whether we call it information bias or measurement bias could depend on the discipline or on different experts with different opinions, but essentially we're still saying the same thing. So I definitely agree with you that there's quite a lot of variability in how things are called. At the end of the day, measurement bias has to do with how we measure and what the sources of error are, and we can classify those errors in different ways. We can classify them as coming from sampling sources or non-sampling sources, where non-sampling sources could be how we construct our measurements. The unique thing about human research is that as you are studying the participant, the participant is also studying you. So in a nutshell: what are the biases that come from me studying you, and what are the biases that come from you studying me as the investigator? Me studying you has to do with how I construct my questionnaire: whether it's leading, whether it's biased, whether it's double-barreled, and so on. You studying me has to do with social desirability bias and all the other things that have to do with social judgment and cognitive heuristics. As long as we understand it at a fundamental level, whatever I call it is really immaterial. But yes, we're definitely going to have sessions on causal inference and colliders and all of that, for people who are interested in those more advanced topics.
any other questions please Pro please
can we go
on yeah no we no I think that we can uh
look at more details about this Prof do
this formula take
can can this formula take consideration
of attrition
rate
Prospect yeah yeah sorry I interrupted
you please go ahead oh um for for
cohost so yeah that's why we have respon
taking
back can we have some measure
of you know we can all we can sing
together but we can't talk together so
please can you mute
yourself thank
you everyone
[Music]
okay um I have completely lost my
thoughts
um sorry what was the question
again oh your attrition rates yes
yes please yeah um yes the platform
adjusts for um attrition rates and so
the opposite of attrition is response so
they are complements so you
oops
um
well I'm completely disoriented now all
right um but
so the attrition rate and the response rate are
complements of each other
if I
responded that means I did not fail to
respond right or if I
suffered from attrition that means I did
not respond so you provide the response
rate which is the opposite of attrition
so yes it accounts for it in a nutshell
that's the
answer does that make sense to
you okay I take silence as consent
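One common way to build a response rate into a sample size is to divide the number of completed responses you need by the expected response rate (one minus the attrition rate). The sketch below assumes that approach; it is not a description of how the platform computes it, and the function name and example numbers are hypothetical.

```python
import math

def inflate_for_nonresponse(required_n: int, response_rate: float) -> int:
    """People to approach so that roughly `required_n` of them respond."""
    if not 0 < response_rate <= 1:
        raise ValueError("response_rate must be in (0, 1]")
    # round() removes floating-point noise before rounding up
    return math.ceil(round(required_n / response_rate, 6))

# Hypothetical example: 385 completed responses needed, 20% expected attrition
attrition_rate = 0.20
response_rate = 1 - attrition_rate  # attrition and response are complements
to_recruit = inflate_for_nonresponse(385, response_rate)
print(to_recruit)
```

The lower the expected response rate, the more people have to be approached to end up with the same number of usable responses.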
um so
any other
questions so we have actually come to
the end of the session today today's
session was just to provide us with a high
level overview of what um the key
topics we need to know are so again these
are the topics these are the key
elements we covered in terms of
um establishing a common
knowledge base we've talked about
hypothesis testing P values you know univariate
bivariate multivariate and multivariable analysis
we've talked about issues of bias and
validity and and then power and sample
size requirements so next week we will
build on those things and
dive into the first regression
on our list which is um binary logistic
regression so please read ahead of
class and read all materials you can
find on binary logistic regression
so that when we have a session next week
you'll have questions to ask and and any
concerns that you you experience to be
answered all right that's that's all
from me unless there are other questions
I need to
answer um sir please you you were
showing us an example on
how you could use confirmat
the you wanted to make an
example question kept coming I don't
know whether you could oh I see you are
you yeah okay all right let's do that
um all
righty all right so for exploratory
right so remember that there are two
Ps precision and power two very different
things precision has to do with how wide
or how narrow are our confidence
intervals right when we're doing
exploratory analysis that's what we
worry
about um whereas in um
confirmatory analysis we worry
about power so please know the
difference those two are not the same
thing precision and power very different
things precision and validity very
different things validity means the
truth you can have a result that is very
precise but completely wrong you know
precise simply means you have very tight
confidence intervals that's what a
precise result means so imagine you have
a result from a very large
convenience sample you have one million
people who participated in that
convenience sample of course you're
going to have very tight confidence
intervals but that doesn't mean that the
results have external validity because
they are from a non-probability-based sample so I
just wanted us to establish the
difference between validity and
precision and between precision and
power power only applies when you have a
comparative an analytical study so if
you have a cross-sectional study this is
a common mistake you see people talking
about power in the context of a survey
like that doesn't make sense right you
know like power only exists in the
context of an analytical study where
you're trying to compare two or more
groups so just to make sure we are
not making any assumptions about
knowledge so in that kind of case again
you just have one large sample there are
no two groups being compared and
exploratory analysis is part of
descriptive analysis so the same way you
will address calculating sample size for
your survey overall is how you'll also
address calculating sample size for your
exploratory analysis but you will just
make sure you pay attention to the
margin of error so that's where you want
to pay attention if you're also going to
do subgroup analysis you might want to
pay attention to that carefully but in
this case let's go ahead and just do a a
quick run
through um and just to add for those of
you who might end up working in CROs like
contract research
organizations um we actually
don't do sample size calculations for
exploratory analysis why because the
data we get is what it is because let's
say I want to do a study of patients
who are on this new drug let's
sayra and there are patients who are
from multiple centers I am getting all
the data I can get from them right
so all the data I get is what
I get it is what it is I don't have the
luxury of calculating sample size so um
just to expose you a bit to the
commercial aspect of scientific research
that we typically don't calculate
sample size in those kind of contexts
where we're just getting all the data we
can because it's exploratory analysis
anyway right so in that case what we
just worry about is the Precision of the
estimates like how wide are the
confidence intervals and and so that's
again something I think it's important
to mention so in this case let's say
that I want to calculate the sample size
for a survey and I want to use that survey
to do you know just explore factors
associated with smoking so sample size
calculation for the survey will be the
same as the sample size calculation for
the exploratory analysis but with
particular attention to a few parameters
here so here for example I say the
outcome is categorical I provide a level
of confidence as 95 the prevalence of
outcome is 50 and 50 again is where we
get the maximum variance that's why we
use 50 in case you're wondering what's
so special about 50 if we say that the
country is highly polarized it means
that 50% of the population support this
President and 50% support the other guy
that is the maximum polarity you can get
when the country is evenly split into
opposite directions so that should help
you realize that variance is maximized
at the prevalence of 50 right that's why
we say if you don't know what the
prevalence is just go ahead and use 50%
because what happens is that that
guarantees you the maximum sample size
possible so it's like aim for the
moon if you miss you hit the
birds so we aim for the maximum sample
size if we don't know what the
prevalence is so that
you know we know we are covered
regardless of what happens that is the
idea behind using
50% next we provide the population size
the population size doesn't really
matter so some people really
get themselves
into a frenzy about oh my gosh what's
the size of my population like I need to
know that it's this exact number it
doesn't really matter especially when
the population is big if
your population is 100,000 or more you
don't even have to worry just put any
number above that it's because sample
size is agnostic of population size what
does that mean it means that the sample
size I need to calculate the prevalence
of smoking in the city of Boston is the
same I will need for the state of
Massachusetts or for the entire US right
because it doesn't matter I will
demonstrate that in a second so it only
matters when the population is small
so if I have a
very small population let's say I'm
trying to look at prevalence of smoking
among let's say doctors in a local
government area or doctors in a county
well in that case that is not a really
large population right then the the
population matters then and the reason
why it matters is because of a
mathematical construct which we call the
finite population
correction the finite population correction or
FPC simply exists because the formula is based on a premise that
the sample we have is a very very very
small proportion of the larger
population so we say the sample is
infinitesimally smaller than the parent
population but the question is what if
it is not what if the sample is actually
a huge chunk of the parent population well
that means that
that assumption is violated that is the only
time we need to worry about
the population size because if you look
at the formula which I'm going to
show you in a second population size
does not even appear in the formula for
the sample size calculation for a survey
but we need it to just adjust for the
FPC which is the finite population
correction factor and then you indicate
whether um cluster sampling will be
performed or not um and this is very
important in in the context of
regression analysis even if you don't
realize it when we use cluster sampling
that increases the width of our
confidence intervals in general right so
cluster sampling in general
is associated with um you know wider
confidence intervals because there's
just greater variability in
the estimates right um and
so that would definitely lead to
much you know wider confidence intervals in your
results but let's say we just do a
simple random sample we provide a margin of
error as five the margin of error is
half of the width of our
confidence interval so the width of the confidence
interval divided by two and that's what
the margin of error is so
that implies that if you want confidence
intervals that are very tight you're
going to have a smaller margin of error
but if you don't have enough
money if I'm doing a budget
if I'm applying for a grant and I
see that okay the cost because
I have to account for the cost of
recruiting the patient I have to account
for the cost of incentivizing the
patient maybe I'm giving each patient
$20 right and and so many other costs
that are unit cost per patient obviously
the budget for the grant is fixed then I
now start looking for ways to reduce my
sample size so that it falls within a
reasonable limit right that's when I can
now say okay all right with a margin of
error of five geez this sample size is
quite high this is way beyond my budget
how about I make a compromise here and
just accept wider confidence intervals
by increasing the margin of error from 5
to 7 so that my sample size now drops
it's now 392 okay now I can afford this
right so you now you provide a
justification and and and so that's how
you do it right and so this would be how
you would calculate the sample size for
that kind of context where you're just
doing exploratory analysis it is the same
as for a descriptive analysis which is
in this case a
survey um like before the references
here show you the formula that was used
which you can download it
also explains what each of those
parameters means in the
formula any questions on
that
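The survey calculation walked through above can be sketched with the standard Cochran-style formula for a proportion, n0 = z² · p(1 − p) / e², followed by the finite population correction n = n0 / (1 + (n0 − 1) / N). This is an assumption about the general method, not necessarily the exact formula in the platform's references, and it models simple random sampling only. It does not include cluster sampling or a response-rate adjustment, so figures quoted from the platform (such as the 392 at a margin of error of 7) may reflect settings this sketch leaves out. The function name and the population sizes are illustrative.

```python
import math
from statistics import NormalDist

def survey_sample_size(prevalence, margin_of_error, confidence=0.95, population=None):
    """Sample size for estimating a proportion under simple random sampling."""
    # z-value for a two-sided confidence level (about 1.96 for 95%)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Population size does not appear in this base formula
    n0 = z**2 * prevalence * (1 - prevalence) / margin_of_error**2
    if population is not None:
        # Finite population correction: only matters when n0 is a big share of N
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

# A prevalence of 50% gives the largest sample size
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, survey_sample_size(p, 0.05))

# Widening the margin of error from 5 to 7 points shrinks the sample
print(survey_sample_size(0.5, 0.05), survey_sample_size(0.5, 0.07))

# Rough, illustrative population sizes: city, state, country, small group
for N in (650_000, 7_000_000, 330_000_000, 500):
    print(N, survey_sample_size(0.5, 0.05, population=N))
```

For the three large populations the result barely moves, while for the population of 500 the correction cuts the required sample substantially.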
see Hello thank you very much sir you're
welcome okay um from the explanation you
made mention you talked about the
exploratory analysis whereby we need to
consider the sample size well um you
said it is peculiar to surveys so one of
the things I noticed is that
from the discussion from what you displayed
um I was seeing something called
cross-sectional analysis so
I believe that cross-sectional analysis
is just a kind of survey that is done
over a period of time so
does this also affect this um concept
about exploratory
analysis yeah exploratory analysis is
not only limited to surveys just to be
clear you could have exploratory
analysis even within a clinical trial
where for example
um a researcher wants to um let me pull
this up and show you I think this will
answer your question but before I do
that right imagine you have a clinical
trial you took clinical trial data and
you're now looking at just um a whole
bunch of subgroup analyses that were not
originally accounted for in the original
trial that is also exploratory analysis
right um let me pull up a study and just
demonstrate what I'm
trying to do
um so it is not necessarily
restricted to
all right let's see
this so here what we did was we took
um so as the very first words as
you can see here are exploratory analysis
of clinical trial data this is clinical
trials so that alone
should tell you that you can do exploratory
analysis even within the context of
um trials so what we're trying to answer
here is what are the dangers that come
from when you do
excessive you know how would I call it
now when you excessively beat the
data for lack of a better word right and so the
reason for this study was because the
health technology assessment HTA agencies
generally will require so there are two
agencies in you know in many countries
that are related to the regulation and
approval of medications you have the
regulatory agency like the FDA or the
you know European medicines agency that
would review the results of the clinical
trial and approve that yes this drug is
efficacious and safe those are the two
things we look out for in clinical
trials is it is it efficacious and is it
safe now you have HTA agencies which are
now like responsible for
approving the medication to be used
among patients and paying for the
coverage and all of that HTA agencies
would generally sometimes require that
after the clinical trial has been
done we take the clinical trial
data and do a whole bunch of analyses
right and sometimes it runs into
hundreds or thousands of analyses of
small subgroups so in this
case we're saying that there is a lot of
danger in that so if you look at this
table for example this is showing you
just some of the
exploratory analyses right you have as
small as 6.9% of the
original sample size being used so imagine
if the original sample size was calculated as
100 people that were needed and here you
come and say oh we have 6.9 or seven
people we are going to use for this subgroup
analysis well that's very
dangerous why because you have you are
almost guaranteed to have type two error
which is that you are unable to detect
differences not that differences don't
exist but because the sample size is way
too small so the bottom line of what I'm
trying to say is that exploratory
analysis is not a feature that is
restricted to surveys you can have
exploratory analysis of
anything I I hope that is
clear thank you sir you're welcome
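To make the type II error point concrete, here is a rough sketch of how power collapses when a subgroup uses only about 7 of an original 100 participants. It uses a normal approximation for comparing two proportions, which is crude at such tiny sample sizes, and the 50% versus 20% effect size is a hypothetical assumption, not a figure from the study shown in the session.

```python
from statistics import NormalDist

def approx_power_two_proportions(p1, p2, total_n, alpha=0.05):
    """Approximate power of a two-sided test comparing two equal-sized groups."""
    norm = NormalDist()
    n_per_group = total_n / 2
    # Standard error of the difference between the two sample proportions
    se = (p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group) ** 0.5
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Probability of exceeding the critical value (far tail ignored)
    return norm.cdf(abs(p1 - p2) / se - z_crit)

# Hypothetical effect: 50% vs 20% response in the two arms
power_full = approx_power_two_proportions(0.5, 0.2, total_n=100)
power_subgroup = approx_power_two_proportions(0.5, 0.2, total_n=7)

print(f"power with 100 people: {power_full:.2f}")
print(f"power with 7 people: {power_subgroup:.2f}")
print(f"type II error risk with 7 people: {1 - power_subgroup:.2f}")
```

Under these assumptions the same true difference that the full trial would usually detect is missed most of the time in the seven-person subgroup.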
yeah any other questions
yeah hello sir I have a question
yes okay my question is regarding like
maybe I have time series data
must I calculate the sample size too
or can I just use as much
data as I can get so what do
you mean by time series data because
some of the terms are used by people
with very very widely varied meaning so
I want I need to understand what exactly
you mean by when you mean okay okay for
instance let's say I'm trying to look
into the trend of unemployment rate in
Nigeria and I got a data set from like
maybe 2000 to 2000 like a monthly data
set like they recorded it on a monthly
basis from maybe 2000 to
2024 something like that do I still have
to consider maybe um probably I'm using
too much data set or lesser data
sample should I do I have to be afraid
of maybe it will affect my result or not
in that kind of
scenario okay so in that case what you
what you have are um so in general
when you talk about time-varying
data collections there are
two major buckets we have
repeated surveys right and then we have
longitudinal surveys they
sound similar but they're very different
in repeated surveys which sounds
like what you're
describing can you kindly mute yourself
please don't forget to send me
my
um I know what we can do in the future
to have an interactive session yet one
that is not disruptive but we'll have to
figure that
out um see I've lost my train of thought
uh
um oh yeah we're talking about you are
talking about
longitudinal yeah yeah longitudinal and
repeated so what you're describing
sounds to be a repeated survey a repeated
survey is like in Nigeria I come
this month um I take a sample of the
country I come the next period I take
another sample now there may or may not
be some people that were captured the
first time that may also be included
the second time but the samples
themselves are independent of each
other the next time I come I take
another repeated sample that way right
um that is just standard surveillance
really um and that is
what we use to calculate trends over
time each of those repeated data
collections is assumed to be
independent and so the sample size
calculation for each wave or each unique
iteration of data collection will be the
way you would calculate for any standard
survey all right and then
you know um we can then
calculate the sample size like that now
sometimes you want to calculate sample
size and you want to make sure that you
have enough sample for some subgroup
right so let's say you are very
interested in a particular minority
group that is let's
say 30% of the population or 20% or 10%
of the population well that means that you
have to increase your sample size by
that factor so you are going to take the
sample size you calculated for the whole
nation and divide it by 0.1 that is
obviously going to increase the sample
size tremendously so there are huge
implications when
you know clients or government say oh by
the way when you're calculating the sample
size we really want to make sure we
have enough sample for this group well
well in that case I mean this is
when you need to know the subject matter
very well because then you have to
contemplate what are my options in that
case the options are that I could
stratify the sample so that I make sure
that that group
is captured as a separate stratum right and
then I might sample disproportionately
from that stratum I might oversample that
group to make sure that they have enough
representation or I mean there are
a million and one ways you can
think but this is why you need to know
the subject matter to a very deep extent
so that you can on the fly
implement adaptive sampling or adaptive
measures to correct or to address
concerns that key stakeholders have but
to cut the long story
short you yes you can use that sample
you can look at Trends there's nothing
wrong with that just make sure that the
sample size calculation for each wave is
adequate you know so
that you have
enough samples for precise estimates
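The subgroup adjustment mentioned in this answer (divide the calculated sample size by the subgroup's share of the population) can be sketched as follows. It assumes a proportionate simple random sample with no oversampling; the function name and the starting figure of 385 are illustrative assumptions.

```python
import math

def total_n_for_subgroup(subgroup_n: int, subgroup_share: float) -> int:
    """Total sample needed so a subgroup of the given share yields `subgroup_n` people."""
    if not 0 < subgroup_share <= 1:
        raise ValueError("subgroup_share must be in (0, 1]")
    # round() removes floating-point noise before rounding up
    return math.ceil(round(subgroup_n / subgroup_share, 6))

# Hypothetical: 385 respondents needed within the subgroup itself
for share in (0.30, 0.20, 0.10):
    print(f"{share:.0%} of population -> total sample {total_n_for_subgroup(385, share)}")
```

A tenfold increase for a 10% subgroup is why stratifying and oversampling that group is often the more affordable option.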
okay I think you actually did not get
what I was trying to explain okay please
can you can you repeat then yes what I
was trying to say is okay I already have
maybe a data set with um that has
monthly like okay it has months like
January
February let's say January it's 01 2000
like I have 01 2000 observations like
that like January 2000 February 2000
like that's my observation and I have
another column that has unemployment
rates
like each
month now oh I see I see your point oh okay
I see your point what you should do is a joinpoint
analysis because the unit of
analysis is at the population level it's not
at the individual level
so you should look
into using joinpoint analysis for that
now if you have a if you have if how
many indicators are you looking at is it
just unemployment or you have a whole
bunch of other indicators you're looking
at that indicators not unate like other
yeah I understand like how many of them
do you
have um like like let's say five okay
that's fine but if you if you were to
have a lot of estimates a lot of
parameters like imagine you had
like 100 indicators you're going to look at
trends over time you have unemployment
you have you know um you know et cetera
like 100 of them that's a lot
that is what we call a type
one error just waiting to happen so in
that case some practical common sense
things you can do will be to change the
confidence level from 95%
to
99% so that your estimates are reported with 99%
confidence intervals right to account
for the fact that you are doing way too
many comparisons you're looking at too
much right but yes you can definitely go
ahead and use your data and so
look into joinpoint regression that
would be a nice way
to analyze that particular data you have joinpoint
regression is
a very easy technique and you know
you can you can definitely explore that
with your particular
question answer your
question yes thank you so much sir okay
okay
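The "type one error just waiting to happen" point can be checked with simple arithmetic. The sketch below assumes the 100 indicator tests are independent, which real indicators rarely are, so treat the numbers as a rough illustration rather than an exact rate; the function name is hypothetical.

```python
def familywise_error(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

n_indicators = 100
for confidence in (0.95, 0.99):
    alpha = 1 - confidence
    expected_false_positives = alpha * n_indicators
    print(
        f"{confidence:.0%} confidence: "
        f"P(at least one false positive) = {familywise_error(alpha, n_indicators):.2f}, "
        f"expected false positives = {expected_false_positives:.0f}"
    )
```

Moving from 95% to 99% confidence reduces the expected number of spurious findings, although with 100 comparisons the chance of at least one remains high.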
good any other question
so um all righty
um if there are no other questions we will call
it a day
then yeah thank
you thought
please no no no thank you so much
everyone let's meet again next week and
please share the the the link the
invitation invite those in your circle
so they can learn with
us please please let's practice more
let's prce next time please if you if
you not
speaking you're not speaking just kindly
mute yourself
please that's for next time please good
yeah just yeah Prof I was just going
with