Day 3 of Advanced Statistics and Experimental Design Training – A Complete Overview

Name: Advanced Statistics and Experimental design Day 3
Uploaded: 2026-01-14T15:44:59.214508+00:00
Channel: RUFORUMNetwork
Description: Summary and key takeaways from Day 3 of the Advanced Statistics and Experimental Design training.
YouTube video ID: I01Y0RMrYaE
Source: YouTube video by RUFORUMNetwork — Watch original video
Introduction

The session opened with a warm welcome from the professor and organizers, thanking participants for punctuality and noting the increasing attendance.
Participants were reminded to submit email and phone contacts for the World Bank’s follow‑up and to expect certificates for previous trainings.
Instructions were given to use the Q&A box for questions and to avoid posting them in the chat.
Purpose of the Training

Provide a solid grounding in experimental design, analysis of variance (ANOVA), and regression techniques.
Equip researchers across disciplines (agriculture, biology, social sciences, economics) with practical tools for designing unbiased experiments and interpreting statistical results.
Core Concepts of Experimental Design

Fundamental Principles: Replication, Blocking (local control), and Randomization.
Experimental Material: Varies by field – plants, animals, humans, or laboratory samples. Correct identification of experimental units is crucial.
Treatments: Defined clearly; ambiguous treatment definitions lead to analysis problems.
Replication:
Repeating a treatment on independent experimental units.
Prevents pseudo‑replication and allows estimation of experimental error variance.
Increases precision and protects against loss of the entire experiment.
Blocking:
Groups homogeneous experimental units to reduce non‑treatment variation.
Examples: soil fertility strips, animal weight classes, greenhouse light zones.
Randomization:
Assigns treatments to units by chance, ensuring each unit has an equal probability of receiving any treatment.
Essential for the validity of statistical tests.
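The precision gain from replication can be shown with a small calculation: the variance of a treatment mean is the sample variance divided by the number of replicates. The sketch below is a minimal illustration in plain Python rather than the R workflow used in the training; the reaction-time readings and the helper name `variance_of_mean` are assumptions made up for the example.

```python
import statistics


def variance_of_mean(readings):
    """Return the sample variance of the readings divided by the number of replicates."""
    return statistics.variance(readings) / len(readings)


# Hypothetical reaction times (seconds) for rats that all received the same treatment.
four_rats = [9.0, 11.0, 10.0, 9.0]
eight_rats = four_rats + [10.0, 11.0, 9.0, 10.0]

for label, readings in (("4 replicates", four_rats), ("8 replicates", eight_rats)):
    print(label, "variance of the mean:", round(variance_of_mean(readings), 3))
```

With these made-up readings the eight-replicate mean has the smaller variance, which is the sense in which more replication gives a more precise estimate.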
Standard Experimental Designs

Completely Randomized Design (CRD)
Assumes homogeneous experimental units; rarely used in field work.
Randomized Complete Block Design (RCBD)
Accounts for known heterogeneity by blocking; treatments are randomized within each block.
Latin Square Design
Controls variation in two directions (e.g., soil fertility gradient and shading).
Incomplete Block Designs
Used when the number of treatments exceeds the block size; includes Balanced and Partially Balanced designs.
Alpha (Lattice) Design
Suited for plant‑breeding trials with many varieties; flexible block size, creates super‑blocks.
Using R for Design Generation

The agricolae package provides functions such as design.rcbd, design.lsd, and design.bib to create randomization plans.
Setting a seed ensures reproducibility of the randomization.
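The idea behind a seeded RCBD randomization plan can be sketched without any package. This is a plain-Python illustration, not the agricolae implementation; the function name `rcbd_layout`, the treatment labels, and the seed value are all hypothetical.

```python
import random


def rcbd_layout(treatments, n_blocks, seed):
    """Return {block number: treatments in randomized plot order} for an RCBD."""
    rng = random.Random(seed)  # a fixed seed makes the plan reproducible
    layout = {}
    for block in range(1, n_blocks + 1):
        order = list(treatments)
        rng.shuffle(order)  # independent randomization inside each block
        layout[block] = order
    return layout


# Hypothetical treatment labels; every block receives each treatment exactly once.
plan = rcbd_layout(["T1", "T2", "T3", "T4"], n_blocks=3, seed=2026)
for block, order in plan.items():
    print("Block", block, order)
```

Re-running with the same seed returns the same plan, while a different seed gives a fresh randomization.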
Analysis of Variance (ANOVA) and Interaction

ANOVA partitions total variation into treatment, block, and error components.
Interaction occurs when the effect of one factor depends on the level of another (e.g., gender × alcohol consumption).
Example: A two‑way ANOVA on a sociological study of alcohol’s effect on attractiveness showed a significant interaction; males’ ratings changed dramatically after four drinks, while females’ ratings remained stable.
Contrasts allow targeted comparisons (e.g., no alcohol vs. any alcohol, two bottles vs. four bottles, male vs. female). Coefficients must sum to zero.
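The sum-to-zero rule for contrasts is easy to check numerically. The sketch below uses plain Python; the treatment means are invented for illustration, the helper names are hypothetical, and the orthogonality check assumes equal replication across treatments.

```python
def is_contrast(coefficients, tol=1e-9):
    """A valid contrast has coefficients that sum to zero."""
    return abs(sum(coefficients)) < tol


def are_orthogonal(c1, c2, tol=1e-9):
    """With equal replication, two contrasts are orthogonal if their products sum to zero."""
    return abs(sum(a * b for a, b in zip(c1, c2))) < tol


def contrast_estimate(coefficients, means):
    """Estimate a contrast as the weighted sum of the treatment means."""
    return sum(c * m for c, m in zip(coefficients, means))


# Treatment order: no alcohol, two bottles, four bottles.
none_vs_any = [2, -1, -1]   # no alcohol vs. the average of the two alcohol levels
two_vs_four = [0, 1, -1]    # two bottles vs. four bottles

# Hypothetical treatment means, made up for this example.
means = [60.0, 62.0, 50.0]

print(is_contrast(none_vs_any), is_contrast(two_vs_four))
print(are_orthogonal(none_vs_any, two_vs_four))
print(contrast_estimate(none_vs_any, means), contrast_estimate(two_vs_four, means))
```

A coefficient set such as [1, 1, -1] fails the check because it sums to 1, so it does not define a contrast.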
Correlation

Pearson correlation coefficient (r) measures linear association ranging from –1 to +1.
Interpretation:
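|r| close to 1 → strong linear relationship (|r| = 1 means a perfectly linear relationship).
|r| close to 0 → little or no linear relationship (a non‑linear relationship may still exist).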
Correlation does not imply causation; experimental studies are required to establish causal links.
In R: cor(x, y, method = "pearson") and cor.test() provide the coefficient and significance.
A correlation matrix (via the Hmisc package) can explore relationships among multiple variables.
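To see what the Pearson coefficient computes, it can be written out from its definition: the sum of cross‑products of deviations divided by the square root of the product of the two sums of squares. This is a minimal plain-Python sketch, not a replacement for R's cor(); the data values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(round(pearson_r(x, y), 3))
```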
Simple Linear Regression

Models the relationship Y = β₀ + β₁X + ε.
β₁ (slope) indicates change in Y per unit change in X; β₀ (intercept) is the predicted Y when X = 0.
Estimated by Ordinary Least Squares (OLS), which minimizes the sum of squared residuals.
Residuals (e, the sample estimates of the errors ε) are the differences between observed and fitted values; they are used to assess model assumptions.
R² (coefficient of determination) quantifies the proportion of variance explained by the model; Adjusted R² corrects for the number of predictors.
Hypothesis test for the slope:
H₀: β₁ = 0 (no linear relationship)
H₁: β₁ ≠ 0
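Conducted via a t‑test or F‑test (the two are equivalent in simple regression); a p‑value below the chosen significance level, conventionally 0.05, leads to rejecting H₀.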
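The OLS estimates for simple regression have closed forms: the slope is Sxy / Sxx, the intercept is the mean of Y minus the slope times the mean of X, and R² is 1 minus the residual sum of squares over the total sum of squares. The sketch below is a plain-Python illustration of those formulas, roughly what lm(y ~ x) reports in R; the function name and the data are hypothetical.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope, r_squared) for the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residuals are observed minus fitted values.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(e ** 2 for e in residuals)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot


# Hypothetical data, made up for this example.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, r2 = fit_simple_ols(x, y)
print("intercept:", round(b0, 3), "slope:", round(b1, 3), "R squared:", round(r2, 4))
```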
Regression Assumptions and Diagnostics

Assumption	Diagnostic Plot
Normality of errors	Histogram or Q‑Q plot of residuals
Homoscedasticity (constant variance)	Residuals vs. fitted values – random scatter indicates validity
Linearity	Scatter plot of Y vs. X
Independence	Residuals vs. time or order; autocorrelation indicates violation
Violations guide remedial actions: log‑transformation for funnel‑shaped variance, polynomial terms for curvature, logistic regression for binary outcomes, or time‑series models for autocorrelation.
Multiple Linear Regression

Extends simple regression to Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε.
Same assumptions as simple regression plus low multicollinearity among predictors.
Multicollinearity can be addressed by:
Removing/re‑coding correlated variables,
Centering variables (subtracting the mean), which mainly helps when the collinearity comes from polynomial or interaction terms,
Using principal component or ridge regression.
Interpretation is done by holding other predictors constant (partial effect).
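Multicollinearity between two predictors can be quantified with the variance inflation factor, which in the two-predictor case is 1 / (1 − r²), where r is their correlation. The sketch below is a plain-Python illustration with hypothetical data; it also shows the standard case where centering helps, namely a predictor and its own square.

```python
import math


def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """Variance inflation factor when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical predictors.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
print("VIF:", round(vif_two_predictors(x1, x2), 2))

# A predictor and its square are strongly correlated until the predictor is centered.
x_sq = [v ** 2 for v in x1]
mean_x1 = sum(x1) / len(x1)
x_centered = [v - mean_x1 for v in x1]
x_centered_sq = [v ** 2 for v in x_centered]
print("r(x, x^2) raw:", round(pearson_r(x1, x_sq), 3))
print("r(x, x^2) centered:", round(pearson_r(x_centered, x_centered_sq), 3))
```

Centering removes the correlation here only because the values are symmetric about their mean; with real data it usually reduces the correlation rather than eliminating it, and it does nothing for two genuinely different predictors that move together.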
Practical Take‑aways for Participants

Always define experimental units and treatments clearly before data collection.
Use replication to obtain unbiased error estimates; avoid pseudo‑replication.
Apply blocking to control known sources of variation; randomize within blocks.
Choose an appropriate design (CRD, RCBD, Latin Square, etc.) based on the heterogeneity of your material and the number of treatments.
Perform ANOVA to test main effects and interactions; use contrasts for focused hypotheses.
When exploring relationships, start with correlation, then move to regression to model how the response changes with the predictors; neither establishes causation without a designed experiment.
Validate regression models with residual diagnostics; transform or change the model when assumptions are breached.
Report Adjusted R² for multiple regression to reflect model complexity.
Resources and Next Steps

All scripts, PowerPoint slides, and data sets are available on the training’s YouTube channel and the shared Google Drive folder.
Participants are encouraged to practice the R commands demonstrated (e.g., design.rcbd, cor.test, lm, anova).
The next session will focus on hands‑on regression analysis (simple, multiple, and model building) and will include a Q&A segment.
Organizers will forward the participant contact list to the World Bank, which is expected to begin contacting participants in about three weeks.
Acknowledgements

The Food and Nutrition Institute and the World Bank funded the training; RUFORUM (the Regional Universities Forum for Capacity Building in Agriculture) provided the venue, internet, and webinar platform.
Special thanks to the facilitators, especially Prof. Regario, Dr. Thomas, and the technical team for ensuring smooth delivery.
Effective experimental design—grounded in replication, blocking, and randomization—combined with rigorous ANOVA and regression analysis, equips researchers to draw unbiased, reliable conclusions and to communicate their findings confidently to stakeholders such as the World Bank.
Full Transcript YouTube

Good afternoon everyone, good afternoon participants.
Good afternoon.
Yes, Professor, nice to see you here today.
Thank you.
Okay, today is day three of our training, and so far we are doing well. I want to continue to thank the participants, who always try to join on time. I see the number is increasing, and in the interest of time we have to start exactly at two as usual. So you are most welcome to day three of the training, and we ask for your participation and concentration during the training. We have Prof. Regario in the meeting with us today, so he can give a word or two before we start. Prof, you are welcome.
Your camera seems to be off. I think you need to check the camera.
Yes, now we can see you.
Thank you very much, Selma. Good afternoon, particularly to today's facilitators, and thank you for everything you have been doing to make sure we have this wonderful training on advanced statistics and experimental design. Good afternoon to the Secretariat. Welcome to this third day of the training on advanced statistics and experimental design.
One thing I would like to ask you regarding this training is to make sure you provide your contacts, in terms of email and telephone number. The World Bank has started requesting that we submit the list of participants of the first training that we organized, which was the R training, and then the second, on proposal writing. So please provide us your email and contact numbers, because we need that information to give to the World Bank. The World Bank will start contacting you in three weeks' time.
Be aware that the centre organizing this training is the Centre of Excellence in Agri-Food Systems and Nutrition of Eduardo Mondlane University. With that, I would like to invite you to enjoy this training and take advantage of it for the good of all Africa. Thank you very much.
Thank you very much, Professor.
Professor, let me also assure the participants at the beginning of this training that we are aware that some of you have not received certificates for the previous trainings. We are going to reconcile the lists, check for the names that were left out, and issue the certificates. So please be assured that the certificates will come. Professor, we shall provide the list of contacts. We have been collecting contact details on the forms; we just need to reconcile them and send them to you.
So please do not put your names in the chat box. The registration link is going to be shared, and you register as usual. Also, please do not put your questions in the chat box, otherwise the facilitators will not be able to see them. Put your questions in the Q&A. With that, thank you very much for joining, and thank you to the facilitators. Professor Susan, I hand over to you and your team. Thank you very much.
Can you hear me?
Yes, we can hear you well.
Okay, thank you very much, Selma and Professor Regario. We are happy and excited to come to day three of our training. We would like to thank the participants for logging in in large numbers. We would also like to extend our thanks to the Food and Nutrition Institute for organizing and funding this training, and to thank the World Bank for funding the training. We would also like to appreciate RUFORUM for organizing and for providing us with space, internet, and the webinar platform to conduct the training. I'll hand over to Dr. Thomas to complete what he has prepared for you, and after Dr. Thomas we will change and go on to something else. Thank you very much. Over to you, Dr. Thomas.
Hello, good afternoon everybody. You're welcome to the third day of our training. Time is never our best ally, but we would rather cover a little and have people leave with knowledge than cover everything and leave most people behind. So I'm going to give a brief overview of experimental design. There are a few data sets that have already been provided to you with the scripts, so some of these we are going to ask you to analyze yourselves, and then we can follow on from there. Let me give my overview, probably one to two hours, and then I'll hand over to my colleagues to continue.
Okay. An experiment, we know, is the only way to establish a cause-and-effect relationship. When we talk about experimental design, there are basic principles that cut across all fields; it doesn't matter which field you are in. So everybody should feel comfortable here, whether you're doing agriculture, biology, ecology, social science, economics, and so on. The only thing is that sometimes we give an example just to illustrate the point, and you can then try to put your own situation into perspective.
Of course, those who have gone to the field know that experiments are laid out in the field. From here you can see there's the issue of shading, and there is also a patch that seems not to be doing very well. These are issues that need to be taken into consideration when designing an experiment. You also have experiments in the lab. Those who do food science, food technology, engineering, and so on do a number of experiments in the lab. Things like test tubes and Petri dishes are a lot more uniform, and that is a simplification as far as doing your experiment is concerned.
So when we talk about experimental design, there are basically two things that come to mind. The first is the experimental material; you could look at it in terms of the type or the nature of your experiment. This is where variation comes in. If you're an animal scientist, animals may be your experimental material; if you are a social scientist, human beings.
Is my screen not showing? Okay, sorry, my screen is not showing. Hold on a bit; let me first sort out my screen and then I'll get back. I'm going to share my screen, but I'll continue talking.
No, I don't use Windows, I use a Mac.
Now I'm trying to... I don't want to have it in this view. Can I have it as full screen? Okay, let me try to share the slides. I want to first go to the first page, and then let's see. Can you check and confirm?
Okay, good.
Sorry for that. Let's keep monitoring to ensure that what I'm showing you is what you are actually seeing.
Okay, so we can do experiments in the lab. The point I was trying to emphasize is that, depending on the field you are in, we use different experimental material. If you are in agriculture, in crops, the experimental material is mainly land; you can also use pots, basins, and so on. If you're an animal scientist you can use animals, like we're seeing here; different animals may constitute your experimental material. And if you're a social scientist, you may use human beings as your experimental material.
And then the other thing is the treatment, the treatment set. These are the things whose effects you're trying to test. Here you can control the type of variation that is caused by the treatment. If you look at my picture here, I have what I call treatments on this side and the experimental material on this side. If you're dealing with animal drugs, for example, then you need the animals that are supposed to be your experimental material. For a pot experiment, the pot with the soil and the plant, the maize plant here, will constitute the experimental unit. Then you could have things like fertilizer, pesticide, or variety, and that could be a treatment.
The treatment has to be defined very clearly. If your treatment is not well defined, you will not be able to do any data analysis. Remember, when we were doing analysis of variance we looked at the issue of what to convert to factors, so that needs to be very clear. You also have to be very clear about what you're measuring: are you measuring plant height, are you measuring [unclear], leaf length? This also has to be very clear. Sometimes you find that somebody is writing a thesis but cannot differentiate between what is the treatment and what are the variables being measured.
Sorry, I apologize for these pictures if you get offended; they are just to illustrate my point. There are basically two things that a good experiment should give you. First, a good experiment should be able to estimate the differences between the treatments that you're comparing. I have two examples here. I'm testing two energy drinks, so these are my treatments, and human beings in this case are my experimental units. In the first case, we can see that there is variation in our experimental material: we have a small guy fighting with a big guy, so that may pose a bit of a challenge. And here you have experimental units that are kind of similar. The whole idea behind designing an experiment is that you want to design it in such a way that you have unbiased estimates of the differences between the treatments. So what could be wrong with the first case? Can somebody in the chat tell us what could be the issue with experiment one?
Okay, somebody said there's going to be some bias. Unbalanced. Okay.
Yes, there's certainly going to be bias. Bias means one of the treatments is going to be favoured and the other one disfavoured. If, for example, this young man somehow succeeds in putting the big man down, we will be confused as to whether it is because he took Red Bull or because he's smaller and was able to move quickly and dodge the big man. So here the interest is that we need to compare the effect of the two energy drinks, and we need unbiased estimates of the differences.
A good experiment should also provide what we call an unbiased estimate of the experimental error variance. Here, for example, let's say we have four rats; these are our experimental units, and we give them the same treatment. You are certainly not going to get uniform readings; there is going to be some slight variation. This variation between similar experimental units receiving the same treatment is what we call experimental error. The experimental error variance is very important because, when you're doing any statistical test, you need the estimate of the experimental error variance in order to calculate statistics such as the t statistic or the F statistic.
Now let's look at the second case. In this case the treatments are all the same, and you can see we have a lot more variability. This is time taken, let's say, or whatever it is. You can see that the smaller dogs here have a score of 50 and the bigger dogs have a score of 500. Remember, they're all receiving the same treatment. So when you are designing your experiment and you make a wrong assumption about the experimental units, it's likely to cause a lot of noise, a lot of variability, in your analysis, and you may not be able to detect the differences between the treatments. In this case we cannot assume that all the variation is due to experimental error; there is also variation due to the differences between the animals.
I know this is a review, a kind of repetition for most of you. Basically there are three principles of experimental design. One is replication. The second is blocking, sometimes called local control and sometimes called restriction. And then we have what is called randomization.
So what is replication? Replication is simply repeating a treatment under identical conditions: each treatment is applied independently to two or more experimental units. Here, for example, we have a situation where each rat receives the same treatment, but independently of the others, so each of them constitutes a replicate. But on the left-hand side the rats are not receiving the treatment independently; they are sharing it. So all of this together is actually one experimental unit, and each rat becomes a sampling unit. If you decide to take each rat as a replicate, this is what we call pseudo-replication, which is not true replication. So you need to be aware of pseudo-replication.
So why is replication important? Without replication you will not be able to estimate the error variance. If we have these readings, 9, 11, and 9, you can run this in R, and for this group the variability is about 0.72 or 0.73. For the case of the dogs, the variability is up to 42,358. So you can see there's a lot more variability here. It would be extremely naive to consider all this variation to be experimental error, because this is an over-inflated experimental error: the effect of animal size is also included in the variability that we have measured there.
We also all know that replication increases the precision of the parameter you're trying to estimate. Let's say, for example, you're interested in estimating the reaction time of a rat to a poison: the rat is given a poison and then it takes some time to react. Say you have a smaller experiment with four rats and a bigger experiment with eight rats, and the variance is 0.78. The variance of the mean is the variance that you calculated divided by the number of replications. So you can see that with eight rats the variability of the mean is much smaller than if you're dealing with only four rats. The smaller the variability of the estimate, the more precise the estimate. So with eight replicates you get higher precision and lower variability compared to four.
Other roles of replication include avoiding the loss of the whole experiment. For example, if you have your experiment somewhere in the village, on a farmer's field, there may be goats running around. If you have only one plot, the goats might eat up the whole experiment. If you have five plots, they can eat two and you still have three, so that can save you as a student.
Replication also increases the range of validity of the results: you observe results over a wider sample space, maybe soils, maybe farms, and so on. This one always reminds me of a story of an old lady and an old man in my village. One time the lady claimed that she could actually wrestle the man down. The man said no, that's not possible. They were arguing in the house, so they said, why not, let's try. The woman put the man down. The man said, no, no, you know your house very well, let's go outside. When they moved to one side of the house, the man was put down again. When they moved under the tree, the man was put down again. They moved a little way to the neighbour's garden, and the man was put down again. So with this we have validated our result over a wide range of situations, and we are confident that whatever result has come out is actually valid. I talked about pseudo-replication already.
For those who work with samples, you need to be aware of what we call biological versus technical replication. Biological replicates work like this: if you take one leaf, extract one sample from that leaf, and test it, that is one biological replicate. But you can extract two samples from the same leaf and have two replicates, like A1 and A2. A1 and A2 in this case are what we call technical replicates. Those who run experiments in the lab know that when you're doing soil analysis, for example, you may need to run three or more replicates from the same sample. Those are technical replicates, as opposed to biological replicates. So you need to be aware of this: technical replicates are just like the pseudo-replication that we talked about.
Just briefly: in agricultural experimentation people always say, "I've used three replications." Three replications has become a kind of trademark for anybody doing experiments in agriculture, and if you ask people why, no one can explain. But the number of replications will depend, first, on how much variability there is in your experimental material. If you look at the animals on the left-hand side and compare them with the animals on the right-hand side, which group would you consider easier to represent, the one on the left or the one on the right?
Okay, the left. Why the left?
Okay, so the animals on the left are more uniform. I hope your left is the same as my left. The animals on the left are more uniform, so it's easy to pick one animal to represent them. If you're dealing with material as variable as that on the right, you may need a lot more replication in order to represent that group well. My colleague [name unclear] is going to talk about sampling at some point, and you'll be able to see how that comes into play.
The other thing is the size of the difference that you want to detect. I have two examples here. Let's say we want to set up an experiment to compare the noise made by a rat and the noise made by a lion. Do we really need to do a lot of replication? Do we even need to do an experiment?
Okay. So you can see that the difference between the noise made by the lion and the noise made by the rat is so big that you probably need only one replicate. But when you're trying to compare two rats of different colours, they probably make a similar amount of noise, and in this case, if you have to detect a difference, you need a lot of replication.
For example, let's say you're comparing two varieties of maize, the old variety with a new one that has come up. The question one would always want to answer is how much increase in yield is acceptable, because you could set that as the margin that you want to detect. If the difference that you want to detect is very small, you need a lot more replication; but if the difference you want to detect is very big, you don't need a lot of replication.
Of course, there is also how confident you want to be, that is, the level of significance. Yesterday we talked briefly about alpha, which is what we call the type I error rate. If you want to be 99% confident, you may need a lot more replication compared to somebody who wants to be 90% confident. And then of course, as a student or as a researcher, there's always the issue of resources. You may have the money, but the experimental material itself may not be available, so that can also limit the number of replications you can make. This material is going to be shared at some point, so don't worry.
The other thing I want to talk about briefly is blocking. With blocking we are simply trying to control or minimize non-treatment variation in an experiment. Non-treatment variation can confound the effect of the treatment, just like where we started from. Let me go back to this. You can see from here that the non-treatment variation is, for example, the variation between the two gentlemen who are wrestling. If this young man somehow wrestles down the big one, you cannot tell whether it is because of his size or because of the energy drink that he has taken. That's what we call confounding. So we always need to be very careful when we are designing our experiment. Non-treatment variation increases the noise in an experiment and makes it difficult to detect treatment differences.
Remember, we calculated the variance in this group and in that group, and we found that there's a lot more variability in this group compared to the first group. Part of the variation could be experimental error, but part is due to the differences between the animals. So we need to take the differences between the animals into consideration when designing our experiment, so that we remove the part of the variability that is due to the animals and are left with only the variation that is due to chance, the experimental error variance.
Here, for example, you can also see a situation where part of the experiment is shaded by these trees, and you can see that the soil on the right-hand side of the trees seems to be more brown compared to that on the left-hand side. These are likely to have an effect on the treatments that you're going to compare. So at some point you need to come and look at your experimental area and take into consideration the variation you see in the experimental material. One way of handling such variation is through what we call blocking in an experimental design.
yeah of course
let me try to check
say the YouTube link is not working okay
So the non-treatment variation needs to be controlled. One way of controlling it is to use homogeneous material: in that case I can use only rats that are similar, and so avoid the problem. Then there is also the issue of management.
Yes, in the chat? Okay. Yes, there are standard formulas for calculating the number of replications, and they take into consideration the things that I have talked about. We cannot go through them at this point because we don't have time, but the information is available.
So there is the issue of homogeneous management: things within your experiment should be managed uniformly. If you are going to take a measurement, you need to take it in such a way that the people you use to take the measurements do not introduce a lot of extra variability. So there has to be careful measurement as well as homogeneous management.
But what we are going to talk about is what is called blocking. Blocking allows us to remove or control the effect of variation in the experimental material. So what is a block? In the simplest terms, a block is a group of experimental units that are expected to be similar.
If we are doing an experiment with dogs, we can put our dogs into groups that are similar. For example, we can form block one with these dogs here, which look similar; then we go to block two, block three and block four. Examples of blocks include plots with similar soil or animals of the same breed, as we are seeing in this particular case. So a block is simply a group of experimental units that are expected to be similar.
Now, similarity in this case has to be defined in relation to some characteristic. When you are dealing with soil, for example, the similarity may be in terms of soil fertility; with animals, you may be interested in things like the initial weight or the breed of the animal, and so on. This depends entirely on the type of experiment you are trying to run.
In the field we look at something like this. You could, for example, consider each of these strips as a block, but a block could also be made up of two strips. Here we assume that within the strip there is uniformity, and this uniformity could be in terms of soil, water, shading and so on. We always assume that the experimenter knows the experimental material they are dealing with and is therefore able to group the experimental units so that there is some kind of uniformity within the group. That group is what we call a block.
The simplest of the three principles of experimental design is what we call randomization. Randomization is simply the process of assigning treatments to the different experimental units using probability, or chance. The emphasis here is that we want each treatment to have an equal chance of ending up on any of the experimental units that we are using in our experiment.
In medical studies, for example, we may have a new drug that we are comparing with a placebo or with a standard drug. Once we recruit patients who qualify for our study, we can flip a coin: if heads comes up, you are assigned to the treatment group; if tails comes up, you are assigned to the control group. After that you give them the treatments, follow them up, and compare your results. The good thing, as we are going to see, is that R can randomize your treatments in the field for you very nicely, so that you don't need to worry about anything else.
If this is what we have in the field, with block one, block two and block three, then we do the randomization within each block. The randomization process is done within the block.
I think it is really important that we do randomization, because most of the statistical analysis that we do is based on the assumption that randomization has taken place. If you don't randomize, your results are not going to be valid. Assume that you are interested in comparing the heights of males and females, and someone says, "Let me select tall males and short females and compare the heights." You can see from the start that there is no way you are going to come up with a valid result, because you did not select your sample in a random manner.
Okay, so those are the three principles of experimental design. I am going to introduce you to three standard experimental designs, and then I will talk briefly about split-plot and incomplete block designs. Then I will show you some results, and we will later ask you to try to generate those results yourselves. Maybe tomorrow morning we will ask one person to illustrate that for us. Okay.
The first design, the simplest of all the designs that we use in experiments, is what we call the completely randomized design. The completely randomized design has one main assumption: that the experimental units are homogeneous. As you can see, all these rats are similar, and therefore I don't need to group them.
Okay, so here the process of randomization happens all at once: you can randomize everything in one go. This is just a simple way of using R to randomize your treatments in a completely randomized design. Let's say we are starting with this: we need to label each of these rats as one, two, three, up to 12.
You have the script with you, so always try to read the comments. Here, for example, we use the command set.seed and set the number to, say, 500. This allows you to get the same randomization, so if you want to repeat the process you get the same result. For now I don't want you to worry too much about this; you can easily pick up this script and run it yourself.
Finally, what all of this generated for us is a plan. You can see that rat number one will get treatment B, rat number two will get treatment C, rat number three will get another treatment, and so on. So as a result of this assignment we are now able to assign our treatments to the different experimental units, and this assignment is done in R. What I am showing here is a rather complicated way of doing it, because I was just trying to illustrate a point, but there is a much easier function in the agricolae package that can help you generate either a CRD or an RCBD, as we are going to see.
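The all-at-once randomization described above can be sketched as follows. This is not the script from the session, which is written in R; it is a minimal Python illustration that uses only the standard library. The function name crd_plan is hypothetical, and the choice of three treatments (A, B, C) with four rats each is an assumption, since the talk only says that 12 rats are labelled 1 to 12. The seed value 500 is the one mentioned in the talk, but a Python seed will not reproduce the plan that R produces.

```python
import random

def crd_plan(units, treatments, seed):
    """Completely randomized design: shuffle a balanced list of
    treatment labels over all experimental units in one go."""
    if len(units) % len(treatments) != 0:
        raise ValueError("units must be a multiple of the number of treatments")
    reps = len(units) // len(treatments)
    labels = list(treatments) * reps   # e.g. four copies each of A, B, C
    rng = random.Random(seed)          # fixed seed -> the same plan on every run
    rng.shuffle(labels)
    return dict(zip(units, labels))

rats = list(range(1, 13))              # rats labelled 1..12
plan = crd_plan(rats, ["A", "B", "C"], seed=500)
for rat, trt in plan.items():
    print(f"rat {rat:2d} -> treatment {trt}")
```

Calling crd_plan again with the same seed returns the same plan, which is the point of fixing the seed.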
In a completely randomized design we assume that all experimental units are similar with respect to some characteristic of interest. It is mainly used in the laboratory and in screen houses; application in the field is very rare, although it is sometimes possible.
The reason the CRD cannot be used very frequently is the assumption that the experimental units are homogeneous. If you make that assumption, look here: compare this experimental error with that experimental error. You can see that the experimental error in this particular case is too large, so we need a way of removing the variation that is due to the different types of animals.
This brings us to the second type of design, which we call the randomized complete block design. We know that experimental material is not always homogeneous. If you are an animal scientist, you may have all these animals available to you for your experiment. Let's say we are interested in weight gain and we have new feeds that we want to test on the different animals. Of course this animal, which is highly emaciated, is going to respond differently from this animal with a good body score. Unless you take this into consideration, you are going to end up with a very messed-up experiment. This is where the issue of blocking comes in.
The randomized complete block design is useful especially when the experimental material is heterogeneous, that is, when there is a predictable trend in the variability of the experimental material.
My advice here is: don't design your experiment in the office. If you are going to do a field experiment, don't design it in the office, because you will go to the field and find a completely different situation there. You need to go physically to the field and look at the area where the experiment is supposed to be set up. Look at things like: is there shading? Is there an issue of slope? Is there any indication of variability in soil fertility or soil type? You may well find that one area is rocky while another has deeper soil, so you need to take that into consideration. Knowledge of the experimental material is very important as far as blocking is concerned.
So you create your blocks: block one, block two, block three, block four. Instead of assuming that these dogs are uniform, you group them, and then I can test my treatments within this group, within this group, within this group and within this other group.
In the field, for example, you could create blocks like the ones I showed you earlier; let me go back a bit. Yes, so these could be your blocks. If we look here, this is your block number one, this is block number two, block number three, block number four, block number five and block number six. You can see from the shading that the blocks are different, but within each block there is some uniformity. If your block runs upwards instead, you are going to have a problem, because remember that we want the uniformity to be within the block. So here we have our blocks one up to six. This is also going to be in the script that I gave you.
I generated a randomized complete block design using this, and you will run it the way we were doing yesterday. Here you are just creating a vector, then another vector, and later you combine them and end up with a good plan.
The key things here are that treatments are randomized within each block separately, all treatments appear in each and every block, and each block serves as a replicate, so the number of blocks is equal to the number of replications. That is the whole idea. So the first thing is that we do the randomization block by block. This is the easier way of using R to get a randomized complete block design field plan.
It is in the agricolae library. The first thing we do is create our treatments; say we have six treatments. Then we use the function design.rcbd. You specify the treatments, you specify the number of replications, or number of blocks, and then you put in a seed number. Now, I want to talk about the seed number.
So, how do you block in a greenhouse? In a greenhouse the key blocking issue could be the direction of the sunlight, or sometimes it may be hotter in the middle of the greenhouse and cooler at the edges. So you can use position within the greenhouse to block your experiment. You could block for light intensity, depending on where the light is coming from, and if there is a possibility that moisture is coming from one direction, you could also use that for your blocking.
Okay, so here we have our design function, and we create an object which we call out.design. This object has several things inside it; one of them is what we call a book. The book is literally the field book, the plan you actually work from. So from the result of the RCBD function we created this object, and from this object we are picking out the book.
Then we do some labelling here. What we have in the book as the block we want to label as block one, block two, block three, block four, block five and block six. And the treatments, which here are one, two, three, four, five, six, we want to label as A, B, C, D, E and F. Finally, this is what you end up with. You can see that in block one, plot one will receive D, plot two will receive F, plot three will receive B, and then you have C, E and A; and then you go to the next block.
Setting seed equal to 11 allows you to generate the same design again and again. If you put seed equal to 50, it will generate a different design; if you put seed equal to 1000, it will give you yet another design. So, strictly speaking, this is not a truly random process: it is pseudo-random. It picks one design for you from the different designs that are possible, and the seed determines which one is picked. So you can change the seed.
But if you want to show your supervisor that you generated your design using R, don't forget to keep the same seed, because if you change the seed you are going to show a different design and they will say you are lying.
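The block-by-block randomization and the role of the seed can be sketched as below. This is a Python stand-in for illustration, not the agricolae function used in the session: rcbd_plan is a hypothetical helper, and although it takes the seed 11 mentioned in the talk, it will not reproduce the layout that R generates (so plot one of block one need not receive D here). The six treatments A to F and six blocks follow the example in the talk.

```python
import random

def rcbd_plan(treatments, n_blocks, seed):
    """Randomized complete block design: every block receives every
    treatment, and the order is shuffled separately inside each block."""
    rng = random.Random(seed)           # same seed -> same field book
    book = []
    for block in range(1, n_blocks + 1):
        order = list(treatments)
        rng.shuffle(order)              # independent randomization per block
        for plot, trt in enumerate(order, start=1):
            book.append((block, plot, trt))
    return book

treatments = ["A", "B", "C", "D", "E", "F"]
book = rcbd_plan(treatments, n_blocks=6, seed=11)

# Print the field book one block per line.
for block in range(1, 7):
    row = [trt for b, plot, trt in book if b == block]
    print(f"block {block}: {' '.join(row)}")
```

Each printed line is one block and contains all six treatments exactly once, so the number of blocks equals the number of replications.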
Okay, the other design is what we call the Latin square design. With a randomized complete block design we are controlling variation in one direction. You can see from here that the intensity of the colour changes as we go down: the first block has a lighter colour, and as you go down the plots get darker and darker. This could represent the soil fertility status of the different areas of the field. Now, you could also have a situation where, in addition to the change in soil fertility, there are some trees on the left-hand side, and those trees provide shade that moves in a certain direction. So you may want to control both the fertility gradient and the shading.
Here, then, you may need to block in two different directions: you can block for the fertility gradient and you can also block for the slope. So the Latin square allows you to block in two directions, and in each direction you actually have a randomized complete block design. If I look only at the slope, I have a randomized complete block design; if I look only at the fertility gradient, I also have a randomized complete block design. That is what we have as a Latin square design. So you could have shading in one direction and your fertility gradient in the other. This is the same field, but we are using different colours to show the two directions.
The Latin square can also be very useful in the lab, for example if you are dealing with machines. Let's say we have machines one, two, three and four, and operators or technicians one, two, three and four.
Here you can see that you need to block using the operator, because each operator may have something unique about them. So you block by operator: this operator is testing all the treatments, this operator is testing all the treatments, this one is testing all the treatments, and this one too. If we ignore the machine, we have a randomized complete block design blocked by operator. If we ignore the operator, we have a randomized complete block design blocked by machine. So here you have two randomized complete block designs that are superimposed on each other.
It is also very useful in nutrition studies in animal science. For example, when you are dealing with different lactation periods, you have different animals, and of course the different animals will differ in how they perform and how they behave biologically; and then the time periods also differ. So this animal is different from that animal, which is different from the next one, and so on. The weeks are also different: week one is different from week two, from week three and from week four. So I have my treatments here and I need to control variation in two different directions, animals and weeks. That is what we have as our Latin square design. I'm not going to spend time on this.
You can also generate a Latin square design using the agricolae package. Here the function is design.lsd, that is, design dot Latin square design. The other one was design.rcbd, and the one for the CRD is design.crd. So this is what it looks like after you have done your design.
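To make the two-way blocking concrete, here is a small Python sketch that builds a Latin square and checks that every treatment appears exactly once in each row and each column. It is an illustration only, not the agricolae routine from the session: latin_square and is_latin are hypothetical names, the square is obtained by shuffling the rows and columns of a cyclic square (which does not sample uniformly from all possible Latin squares), and the four treatments A to D are assumed to match the four machines and four operators in the example.

```python
import random

def latin_square(treatments, seed):
    """Build a Latin square by permuting the rows and columns of a cyclic square."""
    n = len(treatments)
    rng = random.Random(seed)
    # Cyclic start: row i is the treatment list shifted by i positions.
    square = [[treatments[(i + j) % n] for j in range(n)] for i in range(n)]
    rng.shuffle(square)                          # permute rows
    cols = list(range(n))
    rng.shuffle(cols)                            # permute columns
    return [[row[c] for c in cols] for row in square]

def is_latin(square):
    """True if every symbol occurs exactly once in each row and each column."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({square[i][j] for i in range(n)} == symbols for j in range(n))
    return len(symbols) == n and rows_ok and cols_ok

square = latin_square(["A", "B", "C", "D"], seed=11)
for operator, row in enumerate(square, start=1):   # rows: operators, columns: machines
    print(f"operator {operator}: {' '.join(row)}")
print("valid Latin square:", is_latin(square))
```

Reading the rows alone gives a complete block design by operator, and reading the columns alone gives one by machine, which is the property described above.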
I would like to skip this part. I just want to talk briefly about incomplete block designs and then present some additional information on analysis of variance, and then I will hand over to my colleague.
When you are dealing with a randomized complete block design, the assumption is that each block can accommodate all the treatments that you are trying to study. If we go back here, my block has six plots: one, two, three, four, five, six. So I can only test six treatments. Sometimes you may have more than six treatments, and then a block may not contain all the treatments: it will be incomplete as far as the treatments are concerned.
Okay, so we talked about the importance of blocking, and we said that blocking reduces experimental error and gives you more precision. When you do your analysis, you always want to estimate your parameters with a high level of precision, and that is why blocking is important. Blocking also allows us to make comparisons within the different conditions. Okay.
There are challenges that can come from blocking. Sometimes you have a large number of treatments, which requires larger blocks. But as the block gets larger, the conditions within it become more heterogeneous. If you try to force as many treatments as possible into a block, you will end up with a very big block that is no longer uniform.
I have this example here: let's say I am interested in studying puppies. I have nine treatments but only four puppies. The question is, can I borrow the neighbour's puppies? The neighbour's puppies could be the puppies of a wild dog. If I bring in five of those puppies, I will have a block of size nine that takes all my treatments, but you see that I have also brought in variability. So sometimes it may be necessary to have a smaller block rather than a block that takes care of all the treatments.
Sometimes blocking is also based on natural groupings. For example, you can use each mother's set of piglets, each litter, as your block. In this case, this litter has fewer piglets than that one, which means that one can take more treatments and this one can take fewer. So here your block is likely to be incomplete, while there the block may be complete.
Are you getting me? Am I clear? Yes, you're very clear. Okay, thank you; I think the majority are getting me clearly. Okay.
So in this case it may also be necessary to have blocks that are smaller. Sometimes it is therefore necessary to put the experimental units into smaller groups, but then you have to assemble these smaller groups in order to get a complete set of all your treatments.
When we talk about incomplete blocks, we can classify them by how many blocking factors we have. You can have one blocking factor, as in the randomized complete block design, which gives you incomplete block designs derived from the randomized complete block design. But you can also have incomplete block designs derived from the Latin square; remember that the Latin square has two blocking factors.
The other way of looking at incomplete block designs is in terms of the precision of the comparisons between the different treatments. Here we have what is called the balanced incomplete block design versus the partially balanced incomplete block design. I am going to explain this one and then move on. So we have randomized incomplete block designs, which are based on one blocking factor, and incomplete Latin squares, which are based on two blocking factors.
For the balanced incomplete block design, each pair of treatments occurs together an equal number of times. That means that if A and B occur together two times, A and C should also occur together two times, and A and D should also occur together two times. So each pair of treatments should occur an equal number of times in the same block. In a partially balanced incomplete block design, pairs of treatments do not occur together equally often. Let me just show you an example.
Here we have an experiment in a balanced incomplete block design. Somebody is interested in tomato seed germination and wants to look at the effect of temperature on seed germination. They had four treatments: seeds subjected to 25 degrees centigrade, seeds subjected to 30 degrees centigrade, seeds subjected to 35 degrees centigrade, and seeds subjected to 40 degrees centigrade. These are our four treatments.
The design here is such that a germination chamber is used as the experimental unit, and each run of the experiment constitutes a block. Let's say, for example, that we set up an experiment today. We have chambers, and each chamber is set to a certain condition, say 25, 30, 35 or 40 degrees. This set of chambers is what we call run number one, and run number one is our first block.
For us to have a randomized complete block design, all four treatments should be included within each run, so in this case we would need to have all four temperatures tested in every run.
But you are told that in this case only three chambers are available to the scientist. You have four treatments but only three chambers for doing the experiment. That means each time you run the experiment, one of the treatments will not be tested. Each run will therefore be incomplete, because you are only testing three out of the four treatments.
This is an illustration. Run one has only 25, 30 and 40 degrees, so 35 is missing from run one. When you go to run two, 40 is missing. If you go to run three, 30 is missing, and if you go to run four, you find that 25 is missing.
The question then is: is this balanced? To check, let's look at the pair 25 and 30. They occur together once in this block and once in that block, but not in the other two blocks. So 25 and 30 occur together in run one and in run two. When you come to 25 and 40, they occur together in run one and in run three. And 30 and 40 occur together here and also there. So we can see that each pair of treatments occurs together in the same block two times, and that gives us what we call a balanced incomplete block design. With a partially balanced incomplete block design, the pairs occur together in a block a different number of times. That is the difference between the two.
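The pair-counting argument above can be checked mechanically. The sketch below is a Python illustration, not part of the session's R script. The four runs are written out as described in the talk (run three is taken to contain 25, 35 and 40, which is what the pair counts in the talk imply), and the helper names pair_counts and is_balanced are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Runs (blocks) from the germination example: three chambers per run.
runs = {
    1: (25, 30, 40),   # 35 missing
    2: (25, 30, 35),   # 40 missing
    3: (25, 35, 40),   # 30 missing
    4: (30, 35, 40),   # 25 missing
}

def pair_counts(blocks):
    """Count how often each pair of treatments shares a block."""
    counts = Counter()
    for members in blocks:
        for pair in combinations(sorted(members), 2):
            counts[pair] += 1
    return counts

def is_balanced(blocks):
    """Balanced if every possible pair of treatments co-occurs equally often."""
    blocks = list(blocks)
    treatments = sorted({t for members in blocks for t in members})
    counts = pair_counts(blocks)
    return len({counts[p] for p in combinations(treatments, 2)}) == 1

counts = pair_counts(runs.values())
for pair, n in sorted(counts.items()):
    print(pair, n)                      # every pair is printed with the count 2
print("balanced:", is_balanced(runs.values()))   # balanced: True
```

The same count follows from the standard identity for balanced incomplete block designs, lambda = r(k - 1)/(t - 1): with t = 4 treatments, block size k = 3 and each temperature appearing in r = 3 runs, lambda = 3 x 2 / 3 = 2.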
You can also use R to generate the different types of designs, still with the same agricolae package. You can use design.bib, that is, balanced incomplete block design; you fill in a few of these arguments and then you run it. But you need to be clear about what you need here. For trt, you need to specify the treatments that you have. You also need the size of the block; remember that here our block takes only three treatments instead of four. And r is the number of replications: how many replications do you want? In this case you don't need to fill in the number of replications, because the program generates the design in such a way that all the pairs occur together. Then you can go ahead, and this is what will be generated for you.
I have already talked about the partially balanced incomplete block design. In some cases the number of replications required may be too prohibitive for you to run a balanced incomplete block design. So a partially balanced incomplete block design, which requires fewer replications, may be constructed. As I already explained, with a partially balanced incomplete block design the pairs occur together in the same block a different number of times. In some cases you may want some pairs to occur together more frequently than others, for reasons that are up to you as a researcher. I am not going to dwell on this.
Okay, let me just say something briefly about the alpha design. The alpha lattice design is used very frequently in plant breeding. It was developed by Patterson and Williams in 1976. In the UK, the number of varieties and the number of replicates were fixed by law: if you were doing an experiment to come up with new varieties, you were required to have a given number of replications. Okay.
Okay. First of all, I want you to run the scripts that you have; then tomorrow I can answer the questions. By now you should already be in a position to look for some of this information yourselves. Okay.
So the requirement at the time was that you needed a fixed number of replications and a fixed number of varieties before you could go to trial. The large number of varieties necessitated the use of incomplete block designs, but the challenge was that there were not enough designs to accommodate the numbers of treatments that were involved. So these researchers came up with the alpha lattice design. It has no limit on block size: you can specify any block size you want, and in most cases you deal with blocks that are much smaller than the number of treatments that you have.
The only constraint in this case is that the number of treatments should be a multiple of the block size: the number of treatments should be equal to the number of blocks that you require multiplied by the size of the block. What you want is to be able to take your smaller blocks and combine them to create a bigger block, which we call a super block. The super block is then used as your replication. That is the main thing here.
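The size constraint can be written as a short check. The sketch below only does the bookkeeping (treatments = blocks per super block x block size); it does not construct an actual alpha lattice, which requires generating arrays and is what the R package takes care of. The function name alpha_layout and the numbers used (30 varieties, blocks of 5, 3 replicates) are hypothetical and chosen only for illustration.

```python
def alpha_layout(n_treatments, block_size, n_reps):
    """Bookkeeping for an alpha design layout: how many incomplete blocks
    make up one super block (replicate), and how many plots are needed."""
    if n_treatments % block_size != 0:
        raise ValueError("number of treatments must be a multiple of the block size")
    blocks_per_rep = n_treatments // block_size    # incomplete blocks per super block
    return {
        "blocks_per_rep": blocks_per_rep,
        "total_blocks": blocks_per_rep * n_reps,
        "total_plots": n_treatments * n_reps,
    }

# Hypothetical trial: 30 varieties, blocks of 5 plots, 3 replicates.
layout = alpha_layout(n_treatments=30, block_size=5, n_reps=3)
print(layout)   # {'blocks_per_rep': 6, 'total_blocks': 18, 'total_plots': 90}
```

With these numbers, six incomplete blocks of five plots together hold all 30 varieties once, so each group of six blocks is one super block, that is, one replicate.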
We have a nice way of looking at it here, but I want you to run the script first, and later we can have a small discussion about it. Now I just want to look at one particular analysis with you, the two-way analysis, and then I will hand over to my colleague. There are two analyses here and I am not going to go into the first one: I did an illustration using a Latin square to show the different things that happen. I want you first to get the script and the PowerPoint and go through them. You have all the scripts with you, so you should be able to generate the results. That one you will read first.
Okay, so I want to talk about a factorial experiment. For this one I am going to slow down a bit; I was running a bit fast over the other one.
Let me first show you something before I continue; not a video, just hold on a bit. I have a rather interesting example, just to lighten up the afternoon; I know the week is coming to an end.
We are going to talk about what they call the beer goggles effect. They say that when you and your friend go to the bar to drink, this is how the waitress looks before you take your six beers, and this is how the waitress looks after you have taken your beers. Some sociologists wanted to test the effect of alcohol on the choice of mate in a pub, the idea being that alcohol clouds our vision and our judgment, and you are likely to end up with a completely different story. This one I picked up from somebody who was sharing it around. Okay, let me get to my presentation now. This is not the presentation I am sharing with you; I just wanted to start with it.
There is a simple example here. An anthropologist is interested in the effect of beer on mate selection in a nightclub. The rationale is that after alcohol consumption, subjective perception of physical attractiveness becomes more inaccurate; this is the well-known beer goggles effect. Now the question is: does the beer goggles effect depend on sex?
So this is the experiment that was designed. The researcher picked six females, each of whom was given alcohol-free lager; another six got two bottles of beer, and another six got four bottles of beer. The same was done for the males. Then they took a photograph of the person that each participant was chatting with, and got a pool of independent judges to assess the attractiveness of that person on a certain scale.
Okay, so we start the analysis by looking at our descriptive statistics. The first thing we do is compare the genders, male versus female. You can see here that in terms of the median there is actually no difference between male and female if you ignore the alcohol effect. So males and females seem to make choices in a similar way, although there is a lot more variability among the males than among the females.
Now, when you ignore gender, you can see that there is actually no difference between no beer and two bottles of beer, but there is a drop in the attractiveness of the choice when you go to four bottles of beer. Now, if you were to go to six bottles of beer, you might go to zero. Okay.
The point here is that we have two factors: we have gender, and we also have alcohol, the number of bottles of beer. We then need to look at the two of them together. If we look at the females alone, you can see that beer doesn't seem to have an effect on their choice of mate, on the attractiveness rating. But with the males, once they take four bottles of beer, I think literally everyone looks like the picture I showed you before. This is what we call an interaction: the interaction between the two factors that we are looking at. And this is what we call a two-way analysis of variance, because we have two factors that we are trying to look at.
Of course you need to do your summary statistics first. Here we take
the data and summarize by gender, so you get the minimum and the
maximum for females and for males. The minimum score for females was
50 and the maximum was 70. The minimum score for males was 20 and the
maximum was 85. Remember we saw there was a lot more variability with
the males compared to the females, so you can already see it there.
You can also look at it from the alcohol point of view: this is for no
alcohol, this is for two bottles and this is for four bottles. You can
see the minimum and the maximum; the person who got a score of 20 took
four bottles.
It is always good to put the parts of your analysis side by side. So
now we look at gender and alcohol together: female with none, this is
how the scores were; male with none; female with two bottles; male
with two bottles; and then four bottles, and so on. It's important to
do these statistics because they help you see whether there is an
error in the data or whether what you have makes sense. For example,
suppose we get a negative score. Remember the scores are supposed to
be from 0 to 100, so a negative score means something is not right
with that observation in the data given to you, and you may need to
cross-check it. When you look at the males with four bottles of beer,
you can see how small the values are.
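The side-by-side summaries described above can be sketched in a few lines. The session itself used R; this is a minimal Python sketch, and the scores are made up for illustration (they are not the data shown on the slides). It summarizes each gender-by-alcohol cell and flags any score outside the 0 to 100 scale.

```python
from statistics import mean

# Made-up attractiveness scores (0-100), two per cell. NOT the session's data.
scores = {
    ("female", "none"): [60, 70],
    ("female", "two"): [60, 70],
    ("female", "four"): [55, 65],
    ("male", "none"): [60, 70],
    ("male", "two"): [65, 75],
    ("male", "four"): [30, 40],
}

def summarize(values):
    """Return the basic descriptive statistics for one cell."""
    return {
        "n": len(values),
        "min": min(values),
        "mean": mean(values),
        "max": max(values),
    }

# One summary per gender-by-alcohol combination, placed side by side.
cell_summary = {cell: summarize(values) for cell, values in scores.items()}

# Sanity check: any score outside the 0-100 scale points to a data error.
out_of_range = [
    (cell, x) for cell, values in scores.items() for x in values
    if not 0 <= x <= 100
]

for cell, stats in cell_summary.items():
    print(cell, stats)
print("out-of-range scores:", out_of_range)
```

An empty `out_of_range` list is what you hope to see; a negative or over-100 score would show up there with its cell.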
okay
So now we have two factors here, gender and alcohol, and we can do a
one-way analysis of variance for each. Here we have the one-way
analysis of variance; the one on the left is gender. Here the p-value
is 0.35. So, in the chat: was there a significant difference between
males and females?
Okay, you can say there is no significant difference between males and
females. A few people still say yes, but the p-value is greater than
alpha, so there is no significant difference between male and female.
Now let's look at alcohol. From the p-value, is there a significant
difference between the numbers of bottles taken? Can I see in the
chat? Yes.
Okay, here the p-value is less than 0.05, so there is a significant
difference. Now, this does not tell you which levels are different
from the others; that's where you need to go further and look at the
post hoc tests.
From each of the models that we have fitted we can plot what we call
the effects. You can see here female versus male: there is a slight
difference between them, but their confidence intervals overlap, which
is why in this case there is no difference between the two genders.
When you look at alcohol, you can see that there is no difference
between none and two; actually there is a slight improvement with two
bottles of beer, which seems to give better choices than none. But you
can see the drop at four bottles. These are the things you need to
look into in your data to be able to tell which direction you are
going.
We first started by running single analyses, a one-way analysis for
this factor and one for that factor. But remember, the reason you are
looking at two factors is that you are trying to see whether there is
an interaction between them. So the best analysis is a two-way
analysis of variance, and for that you need to have gender and alcohol
both in the model.
You can see here that gender was not significant; the p-value for
gender is 0.16. Alcohol is significant, and the interaction between
alcohol and gender is also significant. Once the interaction is
significant, I cannot interpret the results for the two factors
independently; I need to interpret them together.
So here is what you do: you look at the females separately and compare
the beer levels for the females, then compare them for the males. This
is where the two-way interaction comes in. With the females there is
actually no difference, but with the males there is a big difference
between none and two versus four. This is what you have as your
two-way effects plot compared to the one-way plots: this one had only
gender, this one only alcohol level, but this one combines both of
them. For females this is the situation, and for males this is the
situation. So we are saying that a female can go to a bar and will
still come out with a good-looking man irrespective of how many
bottles she has taken. But a man has to be careful: after two bottles
you'd better come back another day to make your choice.
okay
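To make the two-way decomposition concrete, here is a minimal sketch of the sums of squares for a balanced two-factor design. It is written in plain Python rather than the R used in the session, and the scores are made up (two per cell), so the numbers do not reproduce the slide output. P-values are not computed because the F distribution is not in the Python standard library; the F ratios would be compared against F tables or passed to statistical software.

```python
from statistics import mean

# Made-up scores, two per cell. NOT the session's data.
data = {
    ("female", "none"): [60, 70],
    ("female", "two"): [60, 70],
    ("female", "four"): [55, 65],
    ("male", "none"): [60, 70],
    ("male", "two"): [65, 75],
    ("male", "four"): [30, 40],
}
genders = ["female", "male"]
levels = ["none", "two", "four"]
n = 2  # observations per cell (balanced design assumed)

all_scores = [x for values in data.values() for x in values]
grand = mean(all_scores)
gender_mean = {g: mean([x for lev in levels for x in data[(g, lev)]]) for g in genders}
level_mean = {lev: mean([x for g in genders for x in data[(g, lev)]]) for lev in levels}
cell_mean = {cell: mean(values) for cell, values in data.items()}

# Main-effect sums of squares: marginal means around the grand mean.
ss_gender = n * len(levels) * sum((gender_mean[g] - grand) ** 2 for g in genders)
ss_alcohol = n * len(genders) * sum((level_mean[lev] - grand) ** 2 for lev in levels)

# Interaction: what the cell means explain beyond the two main effects.
ss_cells = n * sum((cell_mean[cell] - grand) ** 2 for cell in data)
ss_inter = ss_cells - ss_gender - ss_alcohol

# Error: variation of the observations around their own cell mean.
ss_error = sum((x - cell_mean[cell]) ** 2 for cell, values in data.items() for x in values)
ss_total = sum((x - grand) ** 2 for x in all_scores)

df_gender = len(genders) - 1
df_alcohol = len(levels) - 1
df_inter = df_gender * df_alcohol
df_error = len(all_scores) - len(data)
ms_error = ss_error / df_error

f_gender = (ss_gender / df_gender) / ms_error
f_alcohol = (ss_alcohol / df_alcohol) / ms_error
f_inter = (ss_inter / df_inter) / ms_error

print("F gender:", round(f_gender, 3))
print("F alcohol:", round(f_alcohol, 3))
print("F interaction:", round(f_inter, 3))
```

In this made-up data the male four-bottle cell is the only one that drops, which is the kind of pattern that produces an interaction term.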
So that is how a two-way interaction is interpreted. The interaction
means that if I'm looking at alcohol, I need to look at alcohol with
respect to gender: if you're female, it makes no difference; if you're
male, after two bottles come back the next day and then you can make
your choice. This is the same thing you do for varieties with
fertilizer, or whatever you're trying to study. You are always asking
yourself: is there an interaction between the two factors, and if the
interaction is there, what does it mean?
If you're going to advise farmers, what do you need to tell them? Are
you going to tell farmers, "If you have fertilizer, this is your
variety; if you don't have fertilizer, plant the local variety that
you've been growing before, and don't waste your time with a new
variety"?
Briefly, let me introduce the concept of contrasts. A contrast is a
very interesting and very useful comparison that you can make. It is
mainly useful when you're comparing treatments with some kind of
structure. So instead of what we saw yesterday in the analysis with
the LSD, where you had letters such as a, ab, b and c, here you are
not comparing one treatment versus another; you may be comparing
groups of treatments.
For example, one contrast you may be interested in is no alcohol
versus some alcohol. For contrast two you may want to compare two
bottles versus four bottles, and for contrast three, male versus
female. The male versus female one is the same as what you had before,
but the first two contrasts are slightly different. In no alcohol
versus some alcohol, remember that "some alcohol" is an average of the
two bottles and the four bottles.
The challenge here is that you need to be able to tell R the contrast
that you want to test.
Now let's look at our levels: we have none, two bottles and four
bottles. See here, none has negative 2, and two and four have 1 and 1.
Remember we are comparing this one level versus the other two, and
that is how you generate the contrast: I weight none by two and
compare it with the other two levels. This gives us the contrast for
none versus some alcohol, the first contrast. So no alcohol gets
negative 2, and the other two get positive 1 and positive 1. The main
thing is that the sum of the coefficients should be equal to zero.
After I'm done with none, I go to the second contrast, two bottles
versus four bottles. In that case I'm no longer interested in none, so
I give it a zero, and then I compare two versus four by giving them
negative 1 and positive 1.
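The coefficient rules just described can be checked mechanically. This is a small Python sketch (the session used R); the level means are hypothetical and only serve to show how a contrast turns into a number.

```python
# Alcohol levels in a fixed order, with the two contrasts from the text.
levels = ["none", "two", "four"]
none_vs_some = [-2, 1, 1]   # no alcohol versus the two alcohol levels
two_vs_four = [0, -1, 1]    # two bottles versus four bottles, none ignored

# Hypothetical level means, for illustration only.
level_means = {"none": 65.0, "two": 67.5, "four": 47.5}

def contrast_value(coefs, means):
    """Weighted sum of the level means using the contrast coefficients."""
    return sum(c * means[lev] for c, lev in zip(coefs, levels))

# Rule 1: the coefficients of each contrast sum to zero.
coef_sums = [sum(none_vs_some), sum(two_vs_four)]

# The two contrasts are also orthogonal: their dot product is zero.
dot_product = sum(a * b for a, b in zip(none_vs_some, two_vs_four))

est_none_vs_some = contrast_value(none_vs_some, level_means)
est_two_vs_four = contrast_value(two_vs_four, level_means)

print(coef_sums, dot_product, est_none_vs_some, est_two_vs_four)
```

A contrast value near zero means the two sides of the comparison have similar means; whether it is significant still depends on the error variance from the ANOVA.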
Then I can do the analysis. The one on this side is the analysis I did
before, with my gender, my alcohol and my interaction. The one down
here is the one with my contrasts in it. First it shows gender1, which
means I'm comparing male versus female. You can see the p-value is
0.16261, which is greater than our alpha, so we conclude that there is
no significant difference between male and female.
Now let's go to alcohol1, which compares none versus some. You can see
none versus some is significant. Then alcohol2 compares two bottles
versus four bottles, and this is also significant. Now the most
interesting one is gender1 interacting with alcohol1. Gender is male
versus female and alcohol1 is none versus some, and here we can see
there is a significant effect. What does that mean?
It means that if I'm comparing no alcohol versus some alcohol, I need
to take gender into consideration: for females there will be no
significant difference between no alcohol and some alcohol, but for
males there will be. And the same goes for alcohol2: two bottles
versus four bottles is not significant for females, but for males it
is significant.
This brings us to what we call simple effects. An interaction between
two factors implies that the effects of the two factors cannot be
interpreted independently of each other, so you cannot interpret the
number of bottles independently of gender. In this case you can look
at what we call a simple effect: you compare the levels of one factor
while holding the level of the other constant. That is more or less
what we have been talking about: you compare the levels of alcohol
while holding gender constant. If we look only at females, this is the
comparison; if we look only at males, this is what we get. Likewise,
you compare male and female at zero alcohol, at two bottles, and so
on. You can also construct these simple effects as contrasts. It needs
a bit of getting used to, but the concept is very simple, like what we
had before. Unfortunately there is no way around this; you need to be
able to construct them yourself.
First of all we have our categories: female no alcohol, female two
bottles, female four bottles, male none, male two and male four.
For alcohol effect 1 we want to compare none versus some. Now, there
are two nones: none for female and none for male. These two together
contribute four, negative 2 each, and the others contribute one each,
so the coefficients still sum to zero.
For alcohol effect 2 we have two bottles versus four bottles. There
are two bottles for female and two bottles for male, and they carry
the same sign; the four bottles for male and female carry the opposite
sign, so you combine these.
Then you look at gender at none: you compare female with none against
male with none, so this contrast deals only with those two. Then
gender at two bottles compares male and female at two bottles, and
gender at four bottles does the same at four bottles. The main thing
with contrasts is how you construct them; once you know how to
construct them, the rest will be a walk in the park.
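The simple-effect contrasts over the six gender-by-alcohol cells can be laid out as follows. This is a Python sketch of the construction only (the session built them in R); the contrast names and the cell means are hypothetical, and the coefficients follow the description above.

```python
from itertools import combinations

# Cell order: female none, female two, female four, male none, male two, male four.
cells = ["F_none", "F_two", "F_four", "M_none", "M_two", "M_four"]

contrasts = {
    "alcohol_effect_1": [-2, 1, 1, -2, 1, 1],   # none versus some, genders pooled
    "alcohol_effect_2": [0, -1, 1, 0, -1, 1],   # two versus four, genders pooled
    "gender_at_none":   [-1, 0, 0, 1, 0, 0],    # male versus female, no alcohol
    "gender_at_two":    [0, -1, 0, 0, 1, 0],    # male versus female, two bottles
    "gender_at_four":   [0, 0, -1, 0, 0, 1],    # male versus female, four bottles
}

# Hypothetical cell means in the same order as `cells`.
cell_means = [65, 65, 60, 65, 70, 35]

# Every contrast must have coefficients that sum to zero.
all_sum_to_zero = all(sum(coefs) == 0 for coefs in contrasts.values())

# These five contrasts are mutually orthogonal, so together they use
# the 5 degrees of freedom available among 6 cell means.
all_orthogonal = all(
    sum(a * b for a, b in zip(contrasts[p], contrasts[q])) == 0
    for p, q in combinations(contrasts, 2)
)

estimates = {
    name: sum(c * m for c, m in zip(coefs, cell_means))
    for name, coefs in contrasts.items()
}
print(all_sum_to_zero, all_orthogonal, estimates)
```

With these made-up means the gender contrast is zero at no alcohol, small at two bottles and large at four bottles, which mirrors the pattern the simple effects are meant to expose.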
So this is the result; I think we're almost coming to the end. You can
see this is now a completely different kind of analysis from what some
of you are used to. Alcohol effect 1 compares no alcohol versus some
alcohol when you lump males and females together. Effect 2 is two
bottles versus four bottles. Then gender at none: gender at none is
not significant. What that means is that if you compare males and
females when they have taken zero bottles, their choices are not
significantly different. With two bottles the choices are also not
significantly different, but with four bottles the difference becomes
significant. This is what you can now do with your data set: you can
break up your analysis, and somebody can look at it and say, "This
person knows what he or she is doing." You can bias the examiner in
one way or another; but if you write a thesis like everybody else,
then you'll have a problem.
You can still go ahead and run the post hoc tests like we did before,
but I want to stop here so that I give my colleagues the opportunity
to start. Maybe we can have a ten-minute break and then come back. I
have 3:36, so come back at 3:47.
Okay, thank you.
Good afternoon, ladies and gentlemen. I believe you are back from the
break. Today we are going to talk about correlation and regression
with R. Are we together? Can I see in the chat, are people back?
Should we start?
Okay, that is great, thank you very much. As I've been saying, we are
going to talk about correlation and regression. We'll start with the
theory, and once we finish the theory we'll do practicals.
Correlation is a measure of the degree of linear association between
two variables, and it ranges from -1 to +1. Sometimes you have what we
call positive linear correlation, where the general trend is that
points run from bottom left to top right, or negative correlation,
where the points run from top left to bottom right. If you have just a
random scatter of points, we call that no correlation.
The correlation coefficient gives us the type and strength of linear
correlation. It is denoted by a small letter r. As I've said, it
ranges from -1 to +1: if it is negative we call it negative
correlation, and if it is positive we call it positive correlation. If
r is equal to 1, that is called perfect positive linear correlation,
and if it is equal to -1 it is called perfect negative correlation. If
r is equal to 0, we call that no linear correlation.
With a correlation coefficient we are able to quantify the strength,
that is the magnitude of the correlation, and the direction, which is
positive or negative. If we are dealing with continuous data, that
type of correlation is called Pearson linear correlation. We have a
formula there which you can refer to later: Sxx means the sum of
squares for the X values and Syy means the sum of squares for the Y
values.
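For reference, the Pearson coefficient can be written as r = Sxy / sqrt(Sxx * Syy), where Sxy is the sum of cross products of the deviations of X and Y from their means. The following is a minimal Python sketch of that formula (the session computes it in R); the function name and the data are made up for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation r = Sxy / sqrt(Sxx * Syy)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)

# Points exactly on a rising or falling line give r = 1 or r = -1.
r_perfect_positive = pearson_r([1, 2, 3], [2, 4, 6])
r_perfect_negative = pearson_r([1, 2, 3], [6, 4, 2])

print(round(r, 4), r_perfect_positive, r_perfect_negative)
```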
Graphically, if there is a strong correlation the points tend to lie
along a straight line. If you look at the topmost graph, most of the
points are concentrated along a line, and if we draw a line through
these points most of them will fall on it. That shows there is a
strong relationship between weight and chest girth of the birds.
At the bottom left corner there is just a random scatter of points,
and the correlation value of 0.07 indicates the correlation is quite
weak.
In the bottom right corner the points are scattered, but not as much
as in the previous graph; it's not a very strong correlation, and its
magnitude is 0.67. All of these are positive correlations, which is
why the points run from the bottom left to the top right.
For negative correlation the points run from the top left corner to
the bottom right corner. Again, if the correlation is strong the
points are close together compared to the graph with a random scatter
of points. In this other graph, where r is -0.62, the points are
spread apart, but not as much as in the graph where the correlation is
-0.13. So the scatter diagram can always tell you the magnitude and
type of the relationship.
Correlation is not causation. What do we mean? Just because two
variables are correlated does not mean one causes the other to change.
So, does religion cause crime? Suppose we get data on people who have
been found with bombs on their bodies, and they are said to be Muslim
(I'm not very sure; we need to confirm that). Does that necessarily
mean that being a Muslim causes you to tie a bomb to yourself? No, it
does not. Or look at the marabou storks: we have a hospital here
called Mulago hospital, and there is a tree full of marabou storks
that are giving birth every day, and the women are also giving birth
every day. Does it mean that the marabou storks cause the women to
give birth, or the other way around? No. That's why we are saying that
correlation is not causation. When two variables are highly
correlated, it does not necessarily mean one causes the other to
change; for us to see causation we need to perform an experiment.
So the Pearson correlation coefficient measures the strength and
direction of a linear relationship, but not a causal relationship.
In R we can compute correlation using the command cor; in the brackets
you put the two variables, and with that command we get the
correlation coefficient. Then cor.test tests the association between
the paired samples, so you get the correlation value and the p-value,
and you are able to know whether the correlation coefficient you've
obtained is significant or not.
When you write these commands in R you also specify the method of
correlation that you want to use. If X and Y are continuous, we
specify the method as Pearson, and if we have ranks, we specify
Spearman. X and Y must be numeric vectors of the same length; if you
have a data set where X and Y have different lengths, you're going to
get an error in your console.
ggscatter is used to obtain a scatter plot, but on top of that, with
ggscatter we can obtain the correlation, add a regression line,
specify the method of correlation, and also specify the titles for the
x-axis and the y-axis. So in one go we can have the correlation
coefficient computed, a regression line added with its confidence
interval, the method specified, and the X and Y axes labelled. On the
right we see the graph that comes out after you've told R to use
ggscatter.
On top we see the R value, the correlation coefficient, and also the
p-value. You can see it is 2.2 times 10 to the power -16, which is
smaller than the significance level normally used for agriculture,
environmental studies, etc., and we conclude that the two variables,
price and weight, are strongly positively correlated.
If you look at this result (it should be visible in your notes), the
value of R is 0.92 and the p-value is 2.2 times 10 to the power -16.
The null hypothesis here is that there is no correlation between the
price and weight of diamonds, and the alternative is that there is a
correlation between the two, that is, price and weight are associated.
In R you can also run the correlation test by specifying an object
name and assigning to it cor.test, the command for testing the
correlation, here between diamond weight and price, with the method
being Pearson. The results you see after printing the object are down
here. It gives you a t value with 98 degrees of freedom and the
p-value, which you've already seen. It also states the alternative
hypothesis, that the true correlation is not equal to zero; when a
correlation is equal to zero, it means the two variables are not
linearly associated with each other.
On top of that it gives you the confidence interval for the
correlation and the correlation coefficient itself.
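The t value reported for a Pearson correlation test comes from t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. As a rough check, this Python sketch plugs in the rounded r = 0.92 quoted in the session and assumes n = 100 observations, which is inferred from the 98 degrees of freedom. Because r is rounded, the result will not match the slide's t value exactly, and the p-value itself needs the t distribution, which is not computed here.

```python
from math import sqrt

def correlation_t(r, n):
    """t statistic and degrees of freedom for testing H0: correlation = 0."""
    df = n - 2
    t = r * sqrt(df) / sqrt(1 - r ** 2)
    return t, df

# r = 0.92 is the rounded value quoted in the session;
# n = 100 is an assumption inferred from df = 98.
t_value, df = correlation_t(0.92, 100)
print(round(t_value, 2), df)
```

A t value this large on 98 degrees of freedom is far in the tail, which is consistent with the very small p-value described above.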
okay
The next slide is about interpreting the correlation value in detail.
Our p-value of 2.2 times 10 to the power -16 is less than 0.05, the
significance level, so we conclude that weight and price are
significantly correlated, with a correlation coefficient of 0.92, and
the correlation is positive.
Okay.
We can also produce a correlation matrix, which is used to investigate
the dependence between multiple variables at the same time. If you
have more than two variables, we can construct a correlation matrix
and obtain the correlation between each pair of variables. The package
we need for this type of analysis is called Hmisc (H-m-i-s-c), and it
produces the p-values and the correlation coefficients. We'll see
later in the practicals that in case you don't have that package we
can install it.
Let me see in the chat. Should we continue? Do you have any problem?
Sorry, my internet has a problem; sorry about that.
Okay, thank you very much. "Just mind the speed": okay, I'm going to
switch on my speed governor so that I can reduce the speed.
And in the Q&A, "please kindly send the material": my colleagues will
send you the link to the material.
Okay, so let's proceed.
If a pair of variables has a significant linear correlation, then the
relationship between the data values can roughly be approximated by a
linear equation. The process of fitting the linear equation to the
data is known as linear regression.
Okay, the material where you find the PowerPoint is under day three,
that is, week two, day three. That's what Helen is telling me; she has
placed the information there, so go to the same link but look for week
two, day three.
So the process of finding a linear equation that fits our data is
known as linear regression, and the line of best fit is called a
regression line.
Regression is concerned with the study of a relationship between
variables, with the objective of identifying, estimating and
validating that relationship. When we talk of identifying the
relationship, we want to know: is it linear, is it non-linear, is it
curvilinear? When we talk about estimating, we are looking at the
parameters, the slope and the intercept. And then validating: remember
that the other objective of regression is prediction, so we need to
validate the relationship to see whether the assumptions of the
regression analysis hold before we use the regression model that we
have established. That is the use of regression.
A linear relationship is a straight line model consisting of the
intercept b0 and the slope b1. The slope of the line indicates that
for each unit increment in X, the independent variable, y increases by
the value of the slope b1. b0 is the intercept: when X is equal to 0,
Y is equal to the value of the intercept. If the slope has a negative
value, then we say that when X increases by one unit, y decreases by
the magnitude of b1. That's how we interpret our straight line model.
So, as I've just been saying, the slope b1 describes the change in y
for every unit change in x, and b0 is the value of y when X is equal
to zero.
The line that we fit to our data is called the line of best fit. Once
it is fitted to the data, not all the observed values fall on that
line. The difference between the observed value and the predicted
value, which we are going to talk about in a minute, is called the
residual.
My colleague has been talking about residuals in terms of experimental
design. When we conduct an experiment we have a hundred percent of the
variation in the response. When you fit the treatments in the model,
they explain some of the variation, and whatever remains is what we
call residual. The same applies to regression: when we run a
regression of Y, the response, on X, the independent variable, it
explains some variation in the response and whatever remains is called
the residual.
The line of best fit minimizes the sum of squared error terms, and it
is obtained by the method of ordinary least squares, which we
abbreviate as OLS. The slope and intercept obtained by OLS are called
the least squares estimates.
We have a regression line here, the blue line, which is y hat, and the
red points are the observed values y, the response. When you fit the
line, the model becomes y hat equals beta 1 X plus the intercept beta
naught. You can see that although this line fits most of the red
points, not all of them fall on it. The difference between the red
points and the blue line is what we call the residual.
Let's see graphically what the residual is. The bigger graph here is
what we had before: the line goes through the red points, but not all
of them fall on the blue line. If we magnify this, the blue line is
the model line, whose equation is y hat equals beta 1 X plus beta
naught. The red points have the equation y equals beta 1 X plus beta
naught plus epsilon, where epsilon is the unexplained variation. So
the difference between a red point and the model line is the residual,
epsilon, as marked in the diagram. The red points are the observed
values, which you collect from the field, and the predicted value is
y hat, so the difference between y and y hat gives us the residual.
Ordinary least squares regression, OLS, minimizes the sum of the
squared differences between the observed and the predicted values. The
best fitting line, our blue line, has the smallest sum of squared
residuals.
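A small numeric illustration of residuals and the least squares criterion, sketched in Python with made-up points (the session's plots come from R). For these five points the least squares line is y hat = 2.2 + 0.6 x; the second line is an arbitrary alternative chosen only for comparison.

```python
# Made-up observations, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def residuals(intercept, slope):
    """Observed minus predicted value for every point."""
    return [obs - (intercept + slope * xi) for xi, obs in zip(x, y)]

def sum_squared_residuals(intercept, slope):
    return sum(e ** 2 for e in residuals(intercept, slope))

# 2.2 and 0.6 are the least squares estimates for these made-up points.
ols_residuals = residuals(2.2, 0.6)
sse_ols = sum_squared_residuals(2.2, 0.6)

# Any other line, such as this arbitrary one, gives a larger sum of squares.
sse_other = sum_squared_residuals(2.5, 0.5)

print([round(e, 2) for e in ols_residuals])
print(round(sse_ols, 2), round(sse_other, 2))
```

The residuals of the least squares line also add up to zero (apart from rounding), which is a quick check that a fitted line really is the least squares one.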
The OLS estimates can be computed using the formulas given, and those
formulas are obtained through differentiation, which we will not do
here. We start from the original model, observed value equals beta
naught plus beta 1 X plus epsilon. We take the difference between the
observed values and the fitted line, square those residuals and sum
them, then differentiate with respect to beta naught and beta 1.
That's how we come up with these formulas.
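For readers who want to see the step that is skipped, the differentiation can be sketched as follows. The notation (S for the sum of squared residuals, bars for means, hats for estimates) is an assumption chosen to match the Sxy and Sxx used in the session, not a copy of the slide.

```latex
\begin{align*}
S(\beta_0,\beta_1) &= \sum_{i=1}^{n}\bigl(y_i-\beta_0-\beta_1 x_i\bigr)^2 \\
\frac{\partial S}{\partial \beta_0}
  &= -2\sum_{i=1}^{n}\bigl(y_i-\beta_0-\beta_1 x_i\bigr)=0
  \;\Longrightarrow\; \hat\beta_0=\bar y-\hat\beta_1\bar x \\
\frac{\partial S}{\partial \beta_1}
  &= -2\sum_{i=1}^{n}x_i\bigl(y_i-\beta_0-\beta_1 x_i\bigr)=0
  \;\Longrightarrow\;
  \hat\beta_1=\frac{\sum_{i=1}^{n}(x_i-\bar x)(y_i-\bar y)}{\sum_{i=1}^{n}(x_i-\bar x)^2}
  =\frac{S_{xy}}{S_{xx}}
\end{align*}
```

Setting both partial derivatives to zero and solving the two equations together gives the intercept and slope formulas.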
For now we just need to know these formulas; we don't need to know how
to differentiate, because we will use R to give us the values of the
slope and intercept. But I know that those who are doing pure
statistics do the differentiation and come up with these formulas. The
intercept is equal to the mean of the Y values minus the slope times
the mean of the X values. The slope is equal to the sum of cross
products of X and Y over the sum of squares of X.
Down here we've expanded Sxy and Sxx, and we also have the formulas
for the mean of the X values and the mean of the response.
Alternatively, option (b), we can obtain the slope in this expanded
format, but you end up with the same values.
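Both versions of the slope formula can be checked against each other. This is a Python sketch with made-up data (in the session the estimates come from R); the function names are hypothetical, and the expanded form assumed here is the usual one, (sum of xy minus sum of x times sum of y over n) divided by (sum of x squared minus the square of the sum of x over n).

```python
def ols_fit(x, y):
    """Least squares intercept and slope via Sxy / Sxx."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def slope_expanded(x, y):
    """The same slope from raw sums (the 'expanded format')."""
    n = len(x)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    return (sum_xy - sum(x) * sum(y) / n) / (sum_x2 - sum(x) ** 2 / n)

# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

intercept, slope = ols_fit(x, y)
slope_b = slope_expanded(x, y)
print(round(intercept, 4), round(slope, 4), round(slope_b, 4))
```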
Here we are still talking about residuals. Residuals are very useful:
they allow us to calculate the error sum of squares, and we use the
error sum of squares to compute what we call the coefficient of
determination, r squared. R squared tells us the proportion of
variation in the dependent or response variable that has been
explained by the model, that is, by the regression of the response on
the independent variable.
Below we have formulas for computing r squared: r squared equals the
regression sum of squares over the total sum of squares, which is the
same as 1 minus the error sum of squares over the total sum of
squares. We also have a formula for the regression sum of squares,
which we are not going through now.
R squared is very important when you're doing model building. As a
rule of thumb, if a model explains above 70 percent of the variability
in the data, we say that the model fits our data: it explains a big
proportion of the variability in the data set. When r squared is near
one the model fits the data well, and when r squared is near zero the
model fits the data poorly.
we find that when we do our analysis
we'll have r squared and then adjusted r
squared now r squared
is what we've seen it is the ratio of
explained variation over total variation
or it is the ratio of the regression
sums of squares over the total sums of
squares that gives us our r squared
then adjusted r squared is also the value
of the variability explained but
for every parameter added into
the model the r squared is adjusted
automatically so if you have more than
one independent variable in the model
the adjusted r squared will account for the number of
parameters in the model so we see that
if we have only one response and one
independent variable we are going to
have only two parameters
but if we add another
independent variable we will end up with
three parameters so the adjusted r squared
will be adjusted meanwhile r
squared alone is inflated as we add
more variables in the model so when we
are doing model building we quote the adjusted r
squared because it's going to
adjust for the number of parameters
added in the model
it is not inflated just by adding more
variables in the model meanwhile r
squared alone keeps on increasing and
increasing as you add
variables in that model so when you're
doing model building take care of
that and make sure that you quote the
adjusted r squared
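Written out, the two quantities discussed here look as follows. The slide formulas are not reproduced in the transcript, so the notation is an assumption: SSR, SSE and SST are the regression, error and total sums of squares, n is the number of observations, and p is the number of parameters including the intercept. The adjusted form shown is the standard textbook one.

```latex
R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST},
\qquad
R^2_{adj} = 1 - \frac{SSE/(n-p)}{SST/(n-1)} = 1 - (1 - R^2)\,\frac{n-1}{n-p}
```

Because the factor (n-1)/(n-p) grows as p grows, an extra variable only raises the adjusted value if it lowers SSE by enough to compensate.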
so what are we interested in when we are
doing regression the other thing I
forgot to mention is that if we have one
response and one independent variable
that type of relationship is called
simple linear regression because we have
one response and one independent
variable but later we'll see that when
we have one response and many
independent variables that is called
multiple regression
and then if we have many responses and
many independent variables then that
type of analysis is called
multivariate regression
okay so for now we will look at simple
linear regression and multiple linear
regression and if time allows we are
also going to do model building
so when we are doing regression analysis
we are interested in the slope
when the slope is equal to zero then
there is no linear relationship
okay uh let me see whether I can get our
pen here and we see
okay so when a slope is equal to zero
this is what happens
the relationship is like that it's just
a straight line horizontal straight line
this is when beta
1 is equal to zero
now if the slope is not equal to
zero
we can have a positive relationship
or we can have
a negative relationship
it depends on the type of relationship
so we are interested in testing the
slope and finding out whether it is
different from zero so our null
hypothesis here under regression is that
there is no linear relationship between
the response and the independent
variable and alternative is that there
is a linear relationship between the
response and independent variable
we can test that hypothesis using a
t-test in regression where we are
testing the slope and our target value
is zero so the statistic is the estimated
slope minus zero over the standard error of the
slope
and the degrees of freedom
correspond to the residual degrees of
freedom which is n minus 2
or we can use an f-test
which is a ratio of the mean
square regression over the mean square
error we are going to see that later in
the ANOVA table
where we get the
mean square regression and the mean square
error okay
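A minimal sketch of the two tests just described, using only the Python standard library. The data values and the function name `simple_regression_tests` are made up for illustration and are not from the lecture. It computes the slope, its standard error, the t statistic against a target value of zero, and the F ratio of mean square regression over mean square error.

```python
import math

def simple_regression_tests(x, y):
    """Return slope, its standard error, the t statistic and the F statistic."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    ss_error = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_regression = sum((fi - y_bar) ** 2 for fi in fitted)
    ms_error = ss_error / (n - 2)         # residual degrees of freedom = n - 2
    ms_regression = ss_regression / 1     # regression degrees of freedom = 1
    se_slope = math.sqrt(ms_error / sxx)
    t_stat = (slope - 0.0) / se_slope     # target value under the null is zero
    f_stat = ms_regression / ms_error
    return slope, se_slope, t_stat, f_stat

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
slope, se_slope, t_stat, f_stat = simple_regression_tests(x, y)
print(f"slope={slope:.3f} se={se_slope:.3f} t={t_stat:.2f} F={f_stat:.2f}")
```

With a single independent variable the two tests agree: the F statistic equals the square of the t statistic, so either can be used to test the slope.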
so so far do we have any problem or do I
continue let me check in the chat
can I see do you have any problem
is there anything that needs clarification in
the chat okay
between the sun okay
degree of Freedom okay let me talk about
the degree of freedom
okay great
okay now
uh when we perform an analysis on our
data
let me pick the
so
when we perform an analysis
we will have what we call an ANOVA table
you've heard of an ANOVA table under
experimental design
there we have a categorical variable for
the independent variable here our
independent variable is
continuous so we have what we call
degrees of freedom let me write
degrees of freedom
degrees of freedom that is the amount of
information that is free from constraint
so for source of variation I'm going to
represent that as R for regression
then we have residual or error represented
as error
and then we have total
so let me put a line to separate these
two
so when we have our linear model
the response is equal to
the intercept beta naught
and then we have also the slope
beta 1
times x
in this model we have only two
parameters the intercept and the slope
so the regression degrees of freedom is
equal to the number of parameters which
are two
minus one
then when we go to error the degrees of
freedom is equal to the total number of
observations
minus the number of parameters
that is n minus p where p is the number of
parameters
when we come to total it is equal to the
total number of observations minus one
so that is degrees of freedom
here when we come to the total if you
look at the formula for total sums of
squares we require the mean before we
can actually compute the total sums of
squares so that's why we have a penalty
of one here for the error sums of squares
in order to compute the error sums of
squares we need to know our intercept
and slope so you are using two degrees
of freedom I mean you're using two pieces of
information to compute the error sums
of squares that's why our degrees of
freedom is n minus the constraint which
is the number of parameters two
and here it is the number of parameters
minus one or the regression degrees of
freedom also is equal to the number of
independent variables in the model here
we have only one independent variable
and therefore the degrees of freedom is
also going to be equal to one
so that is roughly what
degrees of freedom is
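The bookkeeping above can be sketched in a few lines of Python. The function name is hypothetical. The example call uses n = 31 observations and p = 2 parameters, which are the figures of the volume and diameter example in this session.

```python
def regression_degrees_of_freedom(n_obs, n_params):
    """Degrees of freedom for the regression ANOVA table.

    n_params counts every estimated parameter, including the intercept.
    """
    if n_obs <= n_params:
        raise ValueError("need more observations than parameters")
    return {
        "regression": n_params - 1,   # number of parameters minus one
        "error": n_obs - n_params,    # observations minus parameters
        "total": n_obs - 1,           # one penalty for estimating the mean
    }

df = regression_degrees_of_freedom(n_obs=31, n_params=2)
print(df)
```

The regression and error degrees of freedom always add up to the total, which is a quick check on any ANOVA table.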
okay
okay so let's go on to the next slide
okay
so when you analyze your data this is
the output that you get
so the source of variation here is
regression
and we have residual and total variation
we've already said that because we have
only one independent variable
our
degrees of freedom for regression is
equal to 1 or the number of parameters
which are two minus one
here
we know that our total observations will
be 30 plus one
because we know that the total degrees of
freedom is n minus 1 so that gives 31
observations so the residual is going to
be n minus 2 which is 31 minus 2 and that gives
us 29
sums of squares there's a formula
but we won't go through those details
I want us to talk about the mean square
now for regression the mean square here
is the ratio of sums of squares and
degrees of freedom so when you get the
sums of squares and divide by degrees of
freedom we get the mean square here the
same applies to the residual the residual mean
square is equal to the sums of squares
for the residual divided by the degrees of
freedom and we get a mean square of 18.1
the other thing we are interested in is
the F value
the F value which is the F test is the
ratio of mean square regression and
residual mean square
so for this value here we take
7581.8 divided by 18.1 and you get this
value and for the p-value we can use the
principles of mathematics to calculate
it
so that is the analysis of variance
table
under regression
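The arithmetic in this table can be checked with a short Python sketch. The mean squares and degrees of freedom are the rounded values quoted in the session. The sums of squares are not read out in the transcript, so they are rebuilt here as mean square times degrees of freedom, which is an assumption of this sketch.

```python
# Values quoted in the session (rounded as shown on the slide).
df_regression = 1
df_residual = 29
ms_regression = 7581.8
ms_residual = 18.1

# Sums of squares recovered as mean square times degrees of freedom.
ss_regression = ms_regression * df_regression
ss_residual = ms_residual * df_residual
ss_total = ss_regression + ss_residual

# F is the ratio of mean square regression to residual mean square.
f_value = ms_regression / ms_residual
print(f"SS regression={ss_regression:.1f} SS residual={ss_residual:.1f} "
      f"SS total={ss_total:.1f} F={f_value:.2f}")
```

With these rounded inputs the ratio comes out near 418.9, slightly below the 419.36 quoted later in the session, presumably because the software used unrounded mean squares.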
now below here we also have some output
we have s and s is the square root of 18.1
so you can find out whether that is true
you can get your calculator and get the
square root of 18.1 which gives you s which is
the residual standard deviation we
talked about r squared and I said r
squared is the proportion of variation
explained in the response in our
analysis our response is volume and our
independent variable is diameter
so 93.5 percent of variation has been
explained
in volume by the regression model
for r squared adjusted you can see that it
has already been adjusted for the
number of parameters in the model we
have two parameters the intercept and
the slope
so it has the same interpretation
that
93.3 percent of variation in volume has
been explained by the regression model
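As a check on the output just described, the sketch below recomputes s, r squared and adjusted r squared from the rounded mean squares quoted in the session (7581.8 and 18.1) with n = 31 and two parameters. Rebuilding the sums of squares from the mean squares is an assumption of this sketch. The adjusted formula is the standard one and is not shown in the transcript.

```python
import math

n_obs = 31
n_params = 2                       # intercept and slope
ms_regression = 7581.8
ms_residual = 18.1

s = math.sqrt(ms_residual)         # residual standard deviation

ss_regression = ms_regression * (n_params - 1)
ss_residual = ms_residual * (n_obs - n_params)
ss_total = ss_regression + ss_residual

r_squared = ss_regression / ss_total
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_params)

print(f"s={s:.2f} R2={100 * r_squared:.1f}% adjusted R2={100 * adj_r_squared:.1f}%")
```

The two percentages reproduce the 93.5 and 93.3 read from the output, and the gap between them is small here because only one independent variable is in the model.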
okay
so that is the interpretation of this
output
this one you should know the p-value
we are testing the slope the null is
that there is no linear relationship the
alternative is that there is a linear
relationship
so we can use the p-value or the F value
so the p-value here is smaller than 0.05
which is our significance level
so if you get any value smaller than
0.05 you have evidence to reject the null
and conclude that there is a significant
linear relationship between volume and
diameter
so that is the output here now let's go
to the model
so this is our regression model
and we have a negative intercept
this negative intercept sometimes
makes sense and sometimes doesn't make sense
for example here we are talking of trees
and their volume
so if we say that when there is no
diameter the tree has a negative volume
that doesn't make sense
so when we are interpreting the
intercept we need to be a little bit
careful we have to look at the
biological system before we interpret
the intercept however the slope here
indicates that for every
unit increment in diameter
volume increases by
5.07 cubic meters because the SI
unit for volume is cubic meters
any questions so far in the chat
there is one about r squared and
adjusted r squared
okay which one should we report between
r squared and adjusted if we are doing a
simple linear regression you can report
any of the two
okay just like you can either report
93.5 percent or 93.3 percent but if you
are doing multiple regression
we report the adjusted r squared
because it is not influenced by addition
of variables in the model or
any additional parameters in the model
okay
any other question
uh would you repeat
I'm really enjoying this okay thank
you ah I don't know what to repeat but
let me repeat what I've just said last I
have said that if you're conducting a
simple linear regression which is a
relationship between one response and
one independent variable you can use
either r squared or adjusted r squared but
if you're doing multiple regression
you're better off using adjusted r squared
because it adjusts for any parameters
added in the model
okay even when we get to do model
building you will see that we'll be
looking at adjusted r squared because
it's not influenced by adding more
parameters in the model
okay
okay so we continue
so this is what I've been talking about
that the ANOVA test is testing this
hypothesis and if we are using the
f-test we should have what we call F
tables
if you go online and put in F
distribution tables they consist of the
numerator and denominator
degrees of freedom and then the value
where those two intersect gives you
what we call a rejection region value or
critical value
so here I have indicated the critical value
as 4.18 if you compare it with
the calculated value of 419.36
the calculated F value is greater than
the tabulated one indicating that there is
evidence to reject the null
if we use the p-value we found that our
p-value of 0.00 was smaller than 0.05
indicating
strong
or sufficient statistical evidence that
there is a linear relationship between
volume and diameter
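The decision rule can be written down directly. The sketch below uses the two numbers quoted in the session (calculated F of 419.36 and tabulated F of 4.18 at 1 and 29 degrees of freedom) and the function name is hypothetical. It also uses the fact that, with one numerator degree of freedom, the critical F is the square of the two-sided critical t.

```python
import math

def reject_null(f_calculated, f_critical):
    """Reject the null of no linear relationship when F exceeds the table value."""
    return f_calculated > f_critical

f_calculated = 419.36   # quoted in the session
f_critical = 4.18       # F table value at 1 and 29 degrees of freedom, 5 percent level

decision = reject_null(f_calculated, f_critical)
t_critical = math.sqrt(f_critical)   # equivalent two-sided critical t on 29 df

print(f"reject null: {decision}, equivalent critical t: {t_critical:.3f}")
```

The square root comes out at about 2.045, which is the familiar two-sided 5 percent t value on 29 degrees of freedom, so the t-test and the F-test lead to the same conclusion.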
okay
so within the notes there are
examples
in your own time you can go and actually attempt
this exercise
it is about inorganic phosphorus
and then phosphorus content in
the corn at the end of the season
so we have assumptions anytime you run a
model any model there are assumptions
that we make and under simple linear
regression the
first assumption is that Y is a random
variable normally distributed with mean
mu and variance sigma squared
so why do we say random variable we
assume that you collected your data
randomly it's not subjective and the
data you've collected follows a normal
distribution
then the other assumption we make is
that the unknown random errors the
unexplained errors or residuals we've
been talking about are assumed to be
independent and normally distributed with
mean zero and constant variance sigma
squared I've misspelled squared here you
can correct it in your notes
and then we also assume that the random
errors have constant variance
we also assume that there is a linear
relationship between the response and
the independent variable
so those assumptions should not be
violated when you're running your
regression
so that when you come to prediction
you're able to do predictions with a
valid regression model
how do we test those assumptions we use
the residuals
remember the residuals we call them
unexplained variation we call them error
those are the ones we use to test the
assumptions
for normality of the response we
construct a histogram of the response
and if it comes out to be bell shaped
then the response has a normal
distribution
now for linear relationship between the
response and the independent variable we
construct a scatter diagram
if you find the points follow a straight line
then truly the relationship
between the response and the independent
variable is linear when it comes
to the residuals having a normal
distribution
we construct a histogram of residuals as
we are going to see
so statistical inference can be made
when the postulated model is adequate
and a correct model should have
residuals assumed to be normally
distributed with mean zero and
constant variance
and they should also not be correlated
so independence of residuals means that
the residuals are not correlated
when residuals are correlated it implies
that there is some dependence in your
data which needs to be explained either
by another variable such as time or you need to
use another model to explain the
correlations
so a histogram of residuals is used to
check for normality
and we all know about normal probability
plots they are also used to test for
normality so a plot of residuals versus
normal scores helps us to test for
normality as you shall see
and then for us to check for constant
error variance we use a plot of residuals
versus predicted values
so if there's a random scatter of points
as you see here
that indicates that the residuals have
constant variance
anything that deviates from the random
scatter of points then we have to pay
attention
to that pattern that we see in
the graph so what you see here is just a
random scatter of points and I've said it
shows that the residuals have constant
variance
now if points form a horizontal band
around zero
if points form a horizontal band and
they are just randomly scattered
around zero then the assumption of
constant variance is valid
if the residuals increase
with increasing fitted value or response
then let's draw that graph if we see a
pattern like this
I hope you're seeing this graph here
so the residuals have a mean of zero
here we have epsilon which are the residuals
and here we have the fitted value y hat
so if we get something like this
I am not very good at drawing but we
are seeing that these things are
increasing with increasing
or decreasing with increasing fitted
value it is a
funnel-shaped curve
if you see a funnel-shaped curve that shows that
the assumption of constant variance is
invalid so that is the first pattern if
you observe that you know that that
assumption is invalid and you may need to
transform your data to normalize the
values so normally we use the natural
logarithm of the response and once you
apply natural logarithm to the response
you're able to deal with this type of
pattern then let me show you another
pattern
so that is when we have a funnel-shaped
curve how about when we have something
like this
okay let me draw here it is okay
so if we have something like this
a bow shape
so I'm going to use dots hope they are
visible
you see this bow shape is it visible
so you can see this is bow shaped
this is our epsilon and this is the
fitted value or sometimes you can use
the response itself
that shows that your response
has a binary nature it has the positive
side
and also has the negative side and
because of this bow shape the linear
model doesn't fit your data so you need
to apply a logistic regression instead
of simple linear regression
okay so that is another way to know that
your data does not have constant
variance if you see a bow shape
let me erase this
the other graph
that you can see
is this one
so we have our y hat the fitted values
versus the epsilon
and this is the zero line
if you see a curvature so we are going
to see if you see
a curved pattern like this this shows you
that actually your linear model doesn't
fit the data the
assumption of constant variance is
invalid and you need a curvilinear
relationship to explain this
relationship between the response and
the independent variable so those are
the things we should look out for look
out whether your linear model
fits the data if you see this then your
linear model doesn't fit the data
you need to run a
curvilinear relationship where you have
powers of the independent variable if
you have a bow shape then you need to
run a logistic regression and if you
have a funnel shape then you need to log
transform your response and then rerun
the regression and check the assumptions
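A rough numerical stand-in for reading the residual plot, as a sketch only. It fits a straight line, sorts the residuals by fitted value, and compares the average absolute residual in the upper half with the lower half. The ratio, the function names and the data are all made up for illustration: the data follow an exponential trend with multiplicative scatter, so the raw-scale residuals fan out (and also curve), while the log-scale residuals keep a steady spread. This is a crude heuristic, not a formal test.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for a straight line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def spread_ratio(x, y):
    """Mean |residual| in the upper half of fitted values over the lower half."""
    intercept, slope = fit_line(x, y)
    pairs = sorted(
        (intercept + slope * xi, yi - (intercept + slope * xi))
        for xi, yi in zip(x, y)
    )
    half = len(pairs) // 2
    low = sum(abs(e) for _, e in pairs[:half]) / half
    high = sum(abs(e) for _, e in pairs[half:]) / (len(pairs) - half)
    return high / low

# Hypothetical data: exponential trend with scatter proportional to the mean.
x = [float(i) for i in range(1, 13)]
y = [math.exp(0.5 + 0.3 * xi) * (1.2 if i % 2 == 0 else 1 / 1.2)
     for i, xi in enumerate(x)]

ratio_raw = spread_ratio(x, y)
ratio_log = spread_ratio(x, [math.log(yi) for yi in y])
print(f"raw scale ratio={ratio_raw:.2f} log scale ratio={ratio_log:.2f}")
```

A ratio well above one points to a funnel on the raw scale, and a ratio near one after taking logs suggests the transformation has steadied the spread. The plot itself remains the better guide.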
so the last assumption is
independence
now if you've collected
data over time
that data normally is correlated over
time so you will find that this
assumption of independence is violated
so how do we see that you will see that
when you construct your
graph you're going to have
something like this you're going to have
points running like this
okay so this is going to be your pattern
of points this is our epsilon and this
is y and you know this shows you that
there is a time factor in your data which
you need to take into account
and the variance is increasing and
decreasing as well so the assumption of
constant variance is violated and also
the assumption of independence is
violated
so in that case then
we are going to perform time series
analysis or you are going to include
time as a factor in your model to
account for this type of relationship
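The session checks independence by eye from the plot. As a numerical companion, the sketch below computes two standard summaries that the speaker does not mention: the Durbin-Watson statistic and the lag-1 autocorrelation of the residuals taken in time order. The residual values are invented to mimic the wave-like pattern described.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

def lag1_autocorrelation(residuals):
    """Correlation between each residual and the one before it."""
    n = len(residuals)
    mean = sum(residuals) / n
    num = sum((residuals[t] - mean) * (residuals[t - 1] - mean)
              for t in range(1, n))
    den = sum((e - mean) ** 2 for e in residuals)
    return num / den

# Hypothetical residuals in time order, drifting up and then down.
wave = [1.0, 2.0, 3.0, 2.0, 1.0, -1.0, -2.0, -3.0, -2.0, -1.0]

dw = durbin_watson(wave)
r1 = lag1_autocorrelation(wave)
print(f"Durbin-Watson={dw:.2f} lag-1 autocorrelation={r1:.2f}")
```

A Durbin-Watson value near 2 is what uncorrelated residuals give. Values well below 2, together with a clearly positive lag-1 autocorrelation, indicate the kind of time dependence that calls for a time series analysis or a time term in the model.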
okay
any any question
any query
let me see in the chat should we
continue what is a curvilinear
relationship
okay when you have let me get
the hand if we have y
sorry
okay so here we are
and
apparently I can draw okay so now if you
have this pattern here
and this is our y and this is x
it means that our y
is equal to
beta naught plus
beta 1 x but again to
account for this curvature here we have
to add
the x squared term the quadratic term
or you could even add the cubic term
so we could have another term
depending on what is significant so that
is a curvilinear relationship
for a nonlinear relationship remember in
that case we'll have something like this
we can have it going like this
or coming down but in the same format so
when we have a nonlinear relationship
it means that the relationship between
the response and the independent
variable
is linked
in a nonlinear format
in terms of the parameters you see here
we have a straight line model in that
case our model is going to be something
like the exponential of beta naught plus beta 1 x so
we will no longer have a linear
relationship but we are going to have a
nonlinear relationship where the
parameters are linked onto the response
through a nonlinear format
okay
so we have curvilinear where we have powers
and in the regression jargon they
call it a polynomial model and we have a
nonlinear model those are the differences
between a curvilinear and a linear model
okay in the linear model we only end at
beta naught plus beta 1 x without powers
when we have a curvilinear relationship
the independent variable is accompanied
by powers either a quadratic or a cubic
or x to the power four depending on what
is suitable for your data
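To make the curvilinear idea concrete, here is a small standard-library sketch that fits a straight line and then a quadratic to the same data by least squares. The data are made up and follow an exact quadratic, and the helper names are hypothetical. The model stays linear in the parameters, with the squared term acting as one more column in the design matrix.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def fit_polynomial(x, y, degree):
    """Least-squares coefficients [b0, b1, ...] via the normal equations."""
    design = [[xi ** p for p in range(degree + 1)] for xi in x]
    size = degree + 1
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(size)]
           for i in range(size)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(size)]
    return solve_linear_system(xtx, xty)

def error_sum_of_squares(x, y, coefs):
    return sum((yi - sum(c * xi ** p for p, c in enumerate(coefs))) ** 2
               for xi, yi in zip(x, y))

# Hypothetical data lying exactly on y = 2 + 0.5 x - 0.1 x^2.
x = [float(i) for i in range(9)]
y = [2.0 + 0.5 * xi - 0.1 * xi ** 2 for xi in x]

line = fit_polynomial(x, y, degree=1)
quadratic = fit_polynomial(x, y, degree=2)
sse_line = error_sum_of_squares(x, y, line)
sse_quadratic = error_sum_of_squares(x, y, quadratic)
print(f"SSE line={sse_line:.3f} SSE quadratic={sse_quadratic:.3e}")
```

The straight line leaves a curved pattern in its residuals, while adding the quadratic term removes it. In practice a statistics package would do the fitting and also report whether the added term is significant.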
okay let me see in the chat in case
there's any question
and what is a multinomial regression
okay multi means many nomial means
nominal which is a categorical variable
and we can do that if your response and
independent variables have a categorical
nature
so if your response is just names for
example
agree disagree
etc and you want to regress agree
disagree on to age sex etc then you can
run a multinomial model why logit model
not probit
now remember in the regression here we
are looking at a relationship between
a continuous response variable and an independent
variable a logit or probit model
is used when you have percentages you
have proportions
not when you have continuous data if the
residuals show that there is no
relationship between the variables how
can
we get oh okay the relationship already
can be shown as I've indicated
in the residuals we can have a curvilinear
relationship showing in the residuals and
automatically you know that you need to
run a polynomial regression
and in case you see a bow shape then
automatically you know you have to run a
logistic regression if you see
a funnel shape then you know that you
just need to apply natural logarithm to
your data and then that is
corrected
okay any other question those are the
ones that I found on the way I don't
know whether
there are more okay let's go ahead
what's the time
okay we have 13 minutes to go let me
just see what we can do in the
next six minutes so that I leave a few
minutes for closing the session
so when we go to multiple regression you
will see that the assumptions under
simple linear regression also apply to
multiple regression and in addition to
that we'll see that there's an extra
assumption so we we normally do the same
uh tests
okay
so here I've already talked about this
natural logarithm or you use the square
root of the response or independent
variable if you find that the assumption
of constant variance has been violated
and I talked about polynomial regression
which is a special case of regression
where powers of the independent variable
play the role of individual variables so
among the data sets you
have there is a data set that has a
polynomial nature and we will look at
that data set and see when you
analyze
ignoring the curvilinear term how the
relationship looks and after you've
recognized that there is a curvilinear pattern
and included the quadratic term what
happens to your assumptions
okay
uh in the
few minutes left let's talk about
multiple regression
now we've said that multiple regression
is a relationship between one response
and many independent variables
sometimes you'll see that we've run a
simple linear regression but you still
have factors in your data set and the
unexplained variation
is so large meaning that your r squared
is so small maybe even 30 percent
so if you find r squared
is so small then that means that there
are other factors that you can add into
the model to explain the variation in
the response so if you have a model
where you have one response and
two or more predictors that is called
multiple regression
and in the multiple regression model you
can see we have an intercept
and a slope for each
independent variable
and the slopes and intercept are called unknown
parameters
epsilon is the random error or we call it
residual or unexplained variation due to
other factors not included in the model
then the assumptions mentioned under linear
regression still hold and the last
assumption is that the independent
variables should not be correlated if
they are correlated the correlation
should be weak so X1 and X2 should have very
weak correlation for us to have a valid
multiple regression in case you find X1
and X2 are correlated then we can run a
multivariate regression
or there is a method called centering
whereby you get the average
of X1 and subtract it from X1 that's what
they call centering and then we run the
regression so you get the mean of the
first
independent variable subtract it from the
original values of the independent
variable then run the model and in so
doing you're removing the correlation
but we don't want to lose that
information so in most cases it's better
to run your multivariate regression
which will tell you where the response
is
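A small sketch of the two mechanical steps mentioned here: measuring how strongly two predictors are correlated, and centering a predictor on its mean. The diameter and height values and the function names are invented for illustration and are not the course data.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

def center(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

# Hypothetical predictors X1 (diameter) and X2 (height).
x1 = [8.3, 8.6, 10.5, 11.0, 12.9, 14.2, 17.3, 20.6]
x2 = [70.0, 65.0, 72.0, 75.0, 74.0, 80.0, 81.0, 87.0]

x1_centered = center(x1)
r_before = pearson(x1, x2)
r_after = pearson(x1_centered, x2)
print(f"mean of centered X1={sum(x1_centered) / len(x1_centered):.2e}")
print(f"correlation before={r_before:.3f} after centering={r_after:.3f}")
```

Running it shows that centering moves the mean of X1 to zero while the Pearson coefficient between X1 and X2 stays the same, since a shift does not change correlation. Centering is therefore best checked case by case. It is usually most helpful when the model also contains squared or product terms built from the centered variable.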
okay uh
okay so when you analyze your data using
multiple regression
this is the equation
here our relationship is between volume
diameter and height
and it is similar to simple linear
regression the only thing is that now we
have more than one independent variable
when you're interpreting the regression
equation you keep one independent
variable constant
so we will keep height constant and then
interpret diameter
so in that case for every unit increment
in diameter keeping height constant
volume increases by 4.71 cubic meters
and when we keep diameter constant
for every unit increment in height
volume increases by 0.339
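The "keep one variable constant" reading can be checked numerically. The sketch below uses the two slopes quoted in the session (4.71 for diameter and 0.339 for height). The intercept is not read out in the transcript, so a placeholder of 0.0 is used, which is an assumption and does not affect the differences being computed. The function name and the input values are hypothetical.

```python
SLOPE_DIAMETER = 4.71   # quoted in the session
SLOPE_HEIGHT = 0.339    # quoted in the session

def predict_volume(diameter, height, intercept=0.0):
    """Predicted volume; the intercept is a placeholder, not the fitted value."""
    return intercept + SLOPE_DIAMETER * diameter + SLOPE_HEIGHT * height

# Raise diameter by one unit while height stays fixed.
change_diameter = predict_volume(11.0, 70.0) - predict_volume(10.0, 70.0)

# Raise height by one unit while diameter stays fixed.
change_height = predict_volume(10.0, 71.0) - predict_volume(10.0, 70.0)

print(f"one more unit of diameter: {change_diameter:.3f}")
print(f"one more unit of height: {change_height:.3f}")
```

Whatever starting diameter and height are chosen, the two differences equal the slopes, which is exactly what the partial interpretation of each coefficient says.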
okay so I think we'll stop here and
continue from here tomorrow
tomorrow we'll do practicals of
regression both simple and multiple and
hopefully even model building
so thank you so much I am going to hand
over to my colleague Thomas this time he is
my boss
he has been saying I'm the boss now he is
the boss so thank you participants
you've been so good the questions are so
well framed and informative
and we see you tomorrow thank you thank
you oh yes thank you very much my boss
for the good work
and Ellen for answering all the
questions
yeah I think it has been a wonderful
day we just have a bit of an issue of
time here and there because four hours
or three hours a day is not that
adequate but we will always be available
if any one of you thinks you need us to
come and train you wherever you are
physically we are more than willing to
come and maybe the center will also
organize another physical training in
Mozambique and we'll be there so I am handing
you back to the RUFORUM team
thank you
thank you very much
Professor
I would like to thank also the funders
for the training the center and the
World Bank and also would like to
encourage
the learners to always visit the
YouTube link in case there is something
you want to refresh or you missed so you
can go back to the YouTube link and
follow the training and also in case
you missed the material we will always
share the links for the drive where
all the materials are kept and the data
so we'd like to thank also the
participants because without you it
would not be a training so we thank you
very much and we thank also the
facilitators plus the funders so we see
you again tomorrow
cheers