Understanding Correlation and Simple Linear Regression with R: Theory, Practice, and Model Validation


YouTube video ID: Fd1mnabOdAE

Source: YouTube video by RUFORUMNetwork


Introduction

The session introduced correlation and simple linear regression using the R programming environment. After a brief theoretical overview, participants applied the concepts to a real dataset (egg production, water uptake, and food uptake) and learned how to interpret statistical outputs.

Correlation Basics

  • Definition: Correlation measures the strength and direction of a linear relationship between two continuous variables, ranging from –1 (perfect negative) to +1 (perfect positive).
  • Types of correlation:
      ◦ Positive (points rise from bottom‑left to top‑right)
      ◦ Negative (points fall from top‑left to bottom‑right)
      ◦ No correlation (random scatter)
  • Interpretation of r (the rough guidelines used in the session):
      ◦ |r| ≈ 1 → near‑perfect linear relationship
      ◦ |r| > 0.5 → strong
      ◦ |r| ≈ 0.5 → moderate
      ◦ |r| < 0.5 → weak
  • Computation in R: cor(x, y, method = "pearson") for continuous data; method = "spearman" for ranked data.
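As a minimal sketch of these calls (using simulated values, since the session's actual egg‑production dataset is not reproduced here, and the variable names are illustrative):

```r
# Simulated stand-in for the session's dataset (hypothetical values):
# water uptake (ml/day) and egg production per hen.
set.seed(42)
water_uptake   <- runif(30, min = 100, max = 300)
egg_production <- 0.2 * water_uptake + rnorm(30, sd = 5)

# Pearson correlation for continuous data
r <- cor(water_uptake, egg_production, method = "pearson")

# Spearman correlation for ranked data
rho <- cor(water_uptake, egg_production, method = "spearman")

# cor.test() additionally reports a p-value and a confidence interval
ct <- cor.test(water_uptake, egg_production, method = "pearson")

r
ct$p.value
```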

Visualising Correlation

Scatter plots reveal the pattern:
  • Tight cluster around a line → strong correlation (e.g., r = 0.96).
  • More spread → moderate correlation.
  • Random cloud → weak or no correlation.
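A base‑R sketch of such a scatter plot, with simulated data standing in for the real measurements and axis labels chosen for illustration:

```r
# Simulated data (hypothetical; stands in for the session's dataset)
set.seed(1)
x <- runif(40, 100, 300)            # e.g. water uptake
y <- 0.2 * x + rnorm(40, sd = 5)    # e.g. egg production

# Scatter plot with the correlation in the title and a fitted line
plot(x, y,
     xlab = "Water uptake (ml/day)", ylab = "Egg production",
     main = sprintf("r = %.2f", cor(x, y)))
abline(lm(y ~ x), col = "red")      # overlay the least-squares line
```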

From Correlation to Causation

The instructor emphasized that correlation does not imply causation. Examples (bird weight vs. length, hospital births vs. bird births) illustrated that a statistical association alone cannot establish a causal link.

Simple Linear Regression

  • Goal: Fit a straight line Y = β₀ + β₁X + ε to predict a response (Y) from an explanatory variable (X).
  • Parameters:
      ◦ Intercept (β₀) – predicted Y when X = 0.
      ◦ Slope (β₁) – change in Y for a one‑unit increase in X; the sign indicates the direction of the relationship.
  • Estimation: Ordinary Least Squares (OLS) minimizes the sum of squared residuals (differences between observed and fitted values).
  • R Implementation: model <- lm(Y ~ X, data=dataset) followed by summary(model).
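A minimal fit along those lines might look like this (the dataset and column names are illustrative, not the session's exact ones):

```r
# Illustrative dataset: food uptake as the explanatory variable
set.seed(7)
dataset <- data.frame(food_uptake = runif(25, 50, 150))
dataset$egg_production <- 2 + 0.3 * dataset$food_uptake + rnorm(25, sd = 3)

# Fit Y = b0 + b1 * X by ordinary least squares
model <- lm(egg_production ~ food_uptake, data = dataset)

summary(model)   # coefficients, standard errors, p-values, R-squared
coef(model)      # just the estimated intercept (b0) and slope (b1)
```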

Interpreting Regression Output

  • Coefficients: Provide intercept and slope with standard errors and p‑values.
  • p‑value: Tests whether a coefficient differs from zero; p < 0.05 indicates a statistically significant relationship.
  • R‑squared (R²): In simple regression, the square of the correlation coefficient; it represents the proportion of variance in Y explained by X (e.g., R² = 0.956 → 95.6% explained).
  • Adjusted R²: Adjusts R² for the number of predictors (relevant in multiple regression).
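These quantities can all be pulled out of the fitted object programmatically; a sketch, again with simulated data so it is self‑contained:

```r
# Simulated data (hypothetical values) and a simple linear fit
set.seed(3)
x <- runif(30, 100, 300)
y <- 0.2 * x + rnorm(30, sd = 5)
model <- lm(y ~ x)

s <- summary(model)
s$coefficients                              # estimates, std. errors, t- and p-values
p_slope <- s$coefficients["x", "Pr(>|t|)"]  # p-value for the slope
s$r.squared                                 # proportion of variance explained
s$adj.r.squared                             # adjusted R-squared

# In simple regression, R-squared equals the squared correlation
all.equal(s$r.squared, cor(x, y)^2)
```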

Model Diagnostics & Assumptions

  1. Linearity – confirmed by scatter plot and correlation.
  2. Normality of Residuals – assessed with QQ‑plot or Shapiro‑Wilk test.
  3. Homoscedasticity (constant variance) – examined via Scale‑Location plot; random scatter indicates the assumption holds.
  4. Independence of Residuals – checked with residuals vs. order plot; lack of pattern suggests independence.
  5. Influential Observations – identified with residuals vs. leverage plot and Cook’s distance; points outside the dotted lines may unduly affect the model.
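In R, these checks can be sketched as follows (simulated data; plot(model) produces the four standard diagnostic plots mentioned above):

```r
# Simulated data (hypothetical values) and a simple linear fit
set.seed(9)
x <- runif(40, 100, 300)
y <- 0.2 * x + rnorm(40, sd = 5)
model <- lm(y ~ x)

# The four standard diagnostic plots: Residuals vs Fitted, Normal Q-Q,
# Scale-Location, and Residuals vs Leverage
par(mfrow = c(2, 2))
plot(model)

# Normality of residuals: Shapiro-Wilk test
# (p > 0.05 -> no evidence against normality)
shapiro.test(residuals(model))

# Influential observations via Cook's distance
cd <- cooks.distance(model)
which(cd > 4 / length(cd))   # a common rule-of-thumb cutoff
```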

If any assumption is violated, remedies include:
  • Transforming variables (log, square, etc.)
  • Switching to logistic regression for binary outcomes
  • Using polynomial terms for curvature
  • Applying mixed‑effects or time‑series models for correlated observations.
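For example, a log transformation of the response is a one‑line change to the model formula (illustrative data constructed so that residual variance grows with X on the raw scale):

```r
# Illustrative: a response that grows multiplicatively with x, so residuals
# fan out on the raw scale (heteroscedasticity)
set.seed(5)
x <- runif(50, 1, 10)
y <- exp(0.5 + 0.3 * x + rnorm(50, sd = 0.2))

raw_model <- lm(y ~ x)        # residual variance increases with x
log_model <- lm(log(y) ~ x)   # roughly constant variance on the log scale

summary(log_model)$r.squared
```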

Multiple Linear Regression

  • Extends simple regression to several predictors: Y = β₀ + β₁X₁ + β₂X₂ + … + ε.
  • Same assumptions apply, plus no multicollinearity among predictors (they should not be highly correlated).
  • Nested models can be compared with anova() (analysis‑of‑variance tables), and the significance of each predictor is examined via its p‑value.
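A sketch of a two‑predictor model, with a nested‑model comparison via anova() and a hand‑computed variance inflation factor (the car package's vif() computes this directly; variable names are illustrative):

```r
# Simulated predictors (hypothetical water and food uptake values)
set.seed(11)
water <- runif(40, 100, 300)
food  <- runif(40, 50, 150)
eggs  <- 1 + 0.1 * water + 0.2 * food + rnorm(40, sd = 3)

m1 <- lm(eggs ~ water)          # simple model
m2 <- lm(eggs ~ water + food)   # multiple regression
summary(m2)                     # p-value for each predictor

# Does adding 'food' significantly improve the fit?
anova(m1, m2)

# Variance inflation factor for 'water': 1 / (1 - R^2 from regressing it
# on the other predictor); values near 1 mean little collinearity
vif_water <- 1 / (1 - summary(lm(water ~ food))$r.squared)
vif_water
```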

Practical Workflow in R (Step‑by‑Step)

  1. Install & load required packages (ggplot2, car, lmtest, etc.).
  2. Set working directory and import the CSV dataset.
  3. Visualise data with plot() or ggplot().
  4. Compute Pearson correlation.
  5. Fit the linear model with lm().
  6. Summarise and interpret coefficients, p‑values, and R².
  7. Diagnose assumptions using plot(model) which produces the four standard residual plots.
  8. Address any violations (transformations, removal of influential points, or alternative modelling).
  9. Report key statistics: slope, intercept, p‑value, R², degrees of freedom, and diagnostic conclusions.
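The steps above can be condensed into a short script (file names, column names, and the data itself are illustrative; reading a real CSV with read.csv() would replace the simulated block):

```r
# Steps 1-2 would normally be library() calls and read.csv("your_data.csv");
# simulated data keeps this sketch self-contained.
set.seed(2024)
dat <- data.frame(water_uptake = runif(30, 100, 300))
dat$egg_production <- 0.2 * dat$water_uptake + rnorm(30, sd = 5)

plot(dat$water_uptake, dat$egg_production)              # step 3: visualise
r <- cor(dat$water_uptake, dat$egg_production)          # step 4: correlation
model <- lm(egg_production ~ water_uptake, data = dat)  # step 5: fit
summary(model)                                          # step 6: interpret
par(mfrow = c(2, 2)); plot(model)                       # step 7: diagnostics
# Steps 8-9: act on any violations, then report slope, intercept,
# p-value, R-squared, and degrees of freedom
summary(model)$fstatistic["dendf"]                      # residual df
```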

Reporting Results

When writing a project report, include:
  • The scatter plot showing the linear trend.
  • The correlation coefficient and its significance.
  • The regression equation with estimated β₀ and β₁.
  • The p‑value for the slope (and intercept if relevant).
  • R² (and adjusted R² for multiple regression).
  • A brief statement on whether the model assumptions were met.

Closing Remarks

The training emphasized that statistical software is a tool; understanding the underlying mathematics and assumptions is essential for credible inference. Continuous practice with real datasets solidifies these concepts.

Correlation quantifies linear association, while simple linear regression models that relationship, estimates its parameters, and validates assumptions; mastering both in R equips researchers to draw reliable, interpretable conclusions from their data.
