Advanced Statistics and Experimental Design: Day 4 Training Overview
Introduction
The fourth day of the World Bank‑sponsored Advanced Statistics and Experimental Design training was held online for participants from the Centre of Excellence in Agri‑Food Systems and Nutrition, Mozambique. The session combined a theoretical recap with hands‑on R programming, covering multiple regression, dummy variables, polynomial regression, and model diagnostics.
Recap of Simple Linear Regression
- Purpose: Model the relationship between a single response variable and one predictor.
- Key assumptions:
  - Response normally distributed with constant variance (σ²).
  - Residuals normally distributed, with mean zero, and independent.
  - Linear relationship between response and predictor.
- Diagnostic tools:
  - Residual‑vs‑fitted plot (checks constant variance).
  - Q‑Q plot (checks normality).
  - Residual‑vs‑time/order plot (checks independence).
- Remedies: Log‑transform the response, add quadratic/cubic terms, or include time as a factor.
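The recap's fit-check-remedy workflow can be sketched in base R. The built-in `cars` dataset below is a stand-in for the course data, used only to make the sketch runnable:

```r
# Fit a simple linear regression (cars: stopping distance vs. speed,
# a stand-in for the course dataset).
fit <- lm(dist ~ speed, data = cars)

# Base-R diagnostics: residual-vs-fitted, Q-Q, scale-location, leverage.
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

# Formal normality check of the residuals.
shapiro.test(residuals(fit))

# Possible remedies if assumptions fail:
fit_log  <- lm(log(dist) ~ speed, data = cars)          # log-transform response
fit_quad <- lm(dist ~ speed + I(speed^2), data = cars)  # add a quadratic term
```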
Multiple Regression
- Extends simple regression to one response and two or more predictors.
- Model form: Y = β₀ + β₁X₁ + β₂X₂ + … + ε.
- Assumption added: Predictors must not be highly correlated (no multicollinearity).
- Example used: Volume as a function of tree diameter and height.
- Coefficients interpreted by holding other variables constant.
- R² (adjusted) = 94.4 % → strong explanatory power.
- p‑values for both diameter and height < 0.05 → significant relationships.
- Degrees of freedom calculated as n – p (observations minus number of parameters).
- Sequential sums of squares show each predictor’s contribution.
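The volume example appears to correspond to R's built-in `trees` dataset (`Girth` is diameter; its adjusted R² is likewise 94.4 %), so the figures above can be reproduced as a sketch:

```r
# Multiple regression: Volume as a function of diameter (Girth) and Height,
# using R's built-in trees dataset (31 trees; Girth in inches, Height in feet,
# Volume in cubic feet).
fit <- lm(Volume ~ Girth + Height, data = trees)

summary(fit)      # coefficients, t- and p-values, R², adjusted R², F-statistic

# Residual degrees of freedom = n - p = 31 observations - 3 parameters = 28.
df.residual(fit)

anova(fit)        # sequential (Type I) sums of squares per predictor
</test-placeholder>
```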
Indicator (Dummy) Variables
- Needed when a predictor is categorical (e.g., food type, plant species).
- Coding rule: Number of levels – 1 dummy variables.
- Example: Three food types → two dummy variables (0/1 coding).
- Dummy variables allow inclusion of groups in regression; they affect the intercept and can be interacted with other predictors to obtain parallel or separate slopes.
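The "levels minus one" coding rule can be sketched with hypothetical food-type data; R's `factor()` builds the same dummies automatically:

```r
# Hypothetical data: three food types -> two 0/1 dummy variables.
d <- data.frame(
  yield = c(5.1, 6.2, 5.8, 7.0, 6.5, 7.4),
  food  = c("A", "A", "B", "B", "C", "C")
)

# Manual coding with ifelse(); level A is the reference group.
d$foodB <- ifelse(d$food == "B", 1, 0)
d$foodC <- ifelse(d$food == "C", 1, 0)

fit_manual <- lm(yield ~ foodB + foodC, data = d)   # manual dummies
fit_factor <- lm(yield ~ factor(food), data = d)    # equivalent factor coding

# An interaction with a continuous predictor gives separate slopes per group:
# lm(y ~ x * factor(group), data = ...)
```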
Polynomial Regression
- Used when the relationship is curved rather than linear.
- Model includes higher‑order terms: Y = β₀ + β₁X + β₂X² + β₃X³ + ε.
- Demonstrated with hardwood concentration vs. paper tensile strength:
- Linear model R² ≈ 0.30 (poor fit).
- Quadratic model R² ≈ 0.90 (substantial improvement).
- Residual diagnostics confirmed better fit, though a few outliers remained.
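The linear-vs-quadratic comparison can be sketched with simulated data (illustrative only, not the course's hardwood dataset):

```r
# Simulated curved relationship: strength rises then falls with concentration.
set.seed(1)
conc     <- 1:15
strength <- 10 + 6 * conc - 0.4 * conc^2 + rnorm(15, sd = 2)

fit_lin  <- lm(strength ~ conc)                 # straight line: poor fit
fit_quad <- lm(strength ~ conc + I(conc^2))     # quadratic term captures curve

summary(fit_lin)$r.squared    # low R² for the curved relationship
summary(fit_quad)$r.squared   # substantially higher R²

AIC(fit_lin, fit_quad)        # lower AIC favours the quadratic model
```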
Practical R Implementation
- Setup: Load required libraries (`car`, `psych`, `ggplot2`, etc.) and set the working directory.
- Data import: `read.csv()` for datasets such as `eggP.csv` (water uptake, food uptake, egg production) and `uptake.csv` (CO₂ absorption experiment).
- Exploratory plots: Scatter plots using base R `plot()` and `ggplot2` to visualise relationships.
- Correlation analysis: `cor()` and `cor.test()` to obtain Pearson coefficients and significance.
- Model fitting:
  - Simple linear: `lm(Y ~ X, data = …)`.
  - Multiple: `lm(Y ~ X1 + X2, data = …)`.
  - Polynomial: create squared/cubic terms and include them in `lm()`.
  - Dummy variables: create binary columns with `ifelse()` and include them.
- Model summary: `summary(model)` provides coefficients, t‑values, p‑values, R², adjusted R², and the F‑statistic.
- Diagnostics:
  - `plot(model)` → residual‑vs‑fitted, Q‑Q, scale‑location, and residual‑vs‑leverage plots.
  - `shapiro.test(residuals)` for normality.
  - `dwtest()` (Durbin‑Watson) for independence.
  - `vif()` to detect multicollinearity (VIF > 5 signals concern).
  - Outlier detection with Cook's distance (`cooks.distance()`).
- Model comparison: AIC and BIC values guide selection; lower values indicate a better trade‑off between fit and complexity.
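The diagnostic and comparison steps above can be sketched on the built-in `trees` data; the `car` and `lmtest` calls are guarded in case those packages are not installed:

```r
# Diagnostics for a fitted multiple regression (built-in trees data).
fit <- lm(Volume ~ Girth + Height, data = trees)

plot(fit)                        # four base-R diagnostic panels
shapiro.test(residuals(fit))     # normality of residuals

cd <- cooks.distance(fit)        # influence of each observation
which(cd > 4 / nrow(trees))      # common rule-of-thumb cutoff

AIC(fit)                         # lower = better fit/complexity trade-off
BIC(fit)

# Package-based checks (assumes car and lmtest are installed):
if (requireNamespace("car", quietly = TRUE))
  print(car::vif(fit))           # VIF > 5 signals multicollinearity
if (requireNamespace("lmtest", quietly = TRUE))
  print(lmtest::dwtest(fit))     # Durbin-Watson test of independence
```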
Model Building Considerations
- Variable selection: Keep predictors that are statistically significant and biologically/economically meaningful.
- Multicollinearity: If VIF is high, consider removing or combining correlated predictors, or use ridge/weighted least squares.
- Influential observations: Examine Cook’s distance; remove only if they unduly bias parameter estimates.
- Confidence intervals: Provide a range for each coefficient; if the interval includes zero, the predictor may not be significant.
- Future topics: Time‑series analysis, mixed‑effects models, and survey data analysis were mentioned as upcoming sessions.
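The confidence-interval check described above can be illustrated with `confint()`, again using the built-in `trees` data:

```r
# 95% confidence intervals for each regression coefficient.
fit <- lm(Volume ~ Girth + Height, data = trees)
confint(fit, level = 0.95)
# Each row gives the 2.5% and 97.5% limits for one coefficient; an interval
# that excludes zero corresponds to significance at the 5% level.
```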
Conclusion
The day equipped participants with a solid understanding of multiple and polynomial regression, the creation and use of dummy variables, and a complete workflow in R—from data import and exploratory analysis to model fitting, diagnostics, and selection criteria. Attendees left with practical scripts they can adapt to their own agri‑food and nutrition research projects.
Effective regression analysis hinges on choosing the right predictors, checking assumptions with diagnostic plots, and using information criteria to balance model complexity with explanatory power.