Comprehensive Guide to Principal Component Analysis (PCA) in R
Introduction
The session walked participants through the theory and practical implementation of Principal Component Analysis (PCA) using R. PCA is a multivariate technique that transforms a set of correlated variables into a smaller set of uncorrelated components, preserving most of the original variability.
Why Use PCA?
- Correlated predictors: Traditional regression assumes (nearly) independent predictors; when variables are highly correlated, multicollinearity inflates the variance of coefficient estimates and makes the results unreliable.
- Dimensionality reduction: PCA condenses many measurements (e.g., flower morphometrics) into a few components that capture the bulk of information.
- Data screening: Outliers, clusters, and data quality issues become visible in PCA plots.
Theoretical Foundations
- Data matrix: An n × p matrix X where rows are experimental units and columns are variables X1 … Xp.
- Standardization: When variables have different units (cm, mm, inches), they are centered (mean = 0) and scaled (SD = 1) to make them comparable.
- Covariance vs. Correlation matrix:
- Covariance retains original scales; suitable only when variables share similar units.
- Correlation matrix is based on standardized data and is preferred for most PCA applications.
- Eigenvalues & eigenvectors:
- Eigenvalues ("latent roots") indicate the amount of variance each component explains.
- Eigenvectors ("latent vectors") provide the loadings – the weights that combine original variables into a component.
- Component ordering: PC1 explains the greatest variance, PC2 the next greatest, and so on. All PCs are orthogonal (uncorrelated).
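As a sketch, the eigen-decomposition described above can be done directly in base R, using the built-in iris measurements as an example:

```r
# PCA "by hand": eigen-decomposition of the correlation matrix
X <- scale(iris[, -5])            # standardize: mean 0, SD 1
eig <- eigen(cor(iris[, -5]))

eig$values                        # latent roots: variance explained per component
eig$vectors                       # latent vectors: loadings, one column per PC

scores <- X %*% eig$vectors       # component scores for each observation
round(cor(scores), 8)             # off-diagonals ~ 0: PCs are orthogonal
```

Dedicated functions such as prcomp() or FactoMineR::PCA() perform an equivalent decomposition (plus convenient bookkeeping) and are what you would normally use in practice.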
Preparing Data in R
- Load required packages: FactoMineR, factoextra, ggplot2, gridExtra, ggraph, GGally, etc.
- Set working directory and import the dataset (e.g., the classic iris CSV).
- Explore the data: histograms for normality, a scatter matrix for pairwise relationships, boxplots for group comparisons.
- Standardize using scale() before feeding the data to PCA.
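In code, these preparation steps might look like the following (the CSV path and file name are placeholders; the built-in iris data stands in for an imported file):

```r
# setwd("path/to/project")               # working directory containing the CSV
# dat <- read.csv("iris.csv")            # placeholder file name
dat <- iris                              # built-in copy used here instead

hist(dat$Sepal.Length)                       # normality check per variable
boxplot(Sepal.Length ~ Species, data = dat)  # group comparisons
pairs(dat[, -5])                             # pairwise relationships

dat_std <- scale(dat[, -5])              # center and scale the numeric columns
round(colMeans(dat_std), 10)             # means ~ 0 after centering
apply(dat_std, 2, sd)                    # SDs all 1 after scaling
```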
Running PCA in R
```r
library(FactoMineR)
library(factoextra)

# Remove the categorical column (Species); scale.unit = TRUE standardizes
pca_res <- PCA(iris[, -5], scale.unit = TRUE, graph = FALSE)
summary(pca_res)     # eigenvalues, % variance
pca_res$var$coord    # variable coordinates (loadings on PC1, PC2, ...)
pca_res$ind$coord    # individual scores (PC1, PC2, ...)
```
Key functions:
- PCA() – performs the analysis.
- summary() – shows eigenvalues and cumulative variance.
- get_eig() (from factoextra) – extracts eigenvalues for scree plots.
- fviz_eig() – visual scree ("elbow") plot.
- fviz_pca_ind() – individuals (observations) plot, colored by species.
- fviz_pca_var() – variable contributions (loadings) plot.
- fviz_pca_biplot() – combined biplot of individuals and variables.
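Chained together, a typical session with these functions might look like:

```r
library(FactoMineR)
library(factoextra)

pca_res <- PCA(iris[, -5], scale.unit = TRUE, graph = FALSE)

get_eig(pca_res)                                   # eigenvalue table for scree plots
fviz_eig(pca_res, addlabels = TRUE)                # scree ("elbow") plot
fviz_pca_ind(pca_res, habillage = iris$Species)    # observations colored by species
fviz_pca_var(pca_res)                              # variable loadings plot
fviz_pca_biplot(pca_res, habillage = iris$Species) # combined biplot
```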
Interpreting Results
- Eigenvalues: Components with eigenvalue > 1 are typically retained (the Kaiser criterion). In the example, PC1 = 2.9 (≈73 % of variance) and PC2 = 0.91 (≈23 %); together they explain ~96 % of the total variance, so PC2 was kept despite falling just below the eigenvalue-1 cutoff.
- Loadings: Large absolute loadings indicate strong contribution. PC1 is driven by Sepal.Length, Petal.Length, and Petal.Width; PC2 is dominated by Sepal.Width.
- Scores: Plotting PC1 vs. PC2 reveals three clusters corresponding to the iris species. Overlap between versicolor and virginica reflects their similar petal measurements.
- Biplot interpretation: Variable vectors separated by a small angle (< 90°) are positively correlated; angles greater than 90° indicate negative correlation; roughly orthogonal vectors (≈ 90°) indicate little or no correlation.
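One way to verify the statements about the loadings is to inspect them directly (a quick check, not part of the original demo):

```r
pca_res <- FactoMineR::PCA(iris[, -5], scale.unit = TRUE, graph = FALSE)

loadings <- round(pca_res$var$coord[, 1:2], 2)
loadings                              # large |values| = strong contribution
apply(abs(loadings), 2, which.max)    # dominant variable for PC1 and PC2
```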
Choosing the Number of Components
- Scree plot (elbow method) – visualizes eigenvalues; the point where the curve flattens suggests the optimal cut‑off.
- Cumulative variance – aim for > 80 % explained variance for most applications.
- In the demo, the elbow appears after PC2, so PC1 and PC2 were selected for downstream analysis.
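Both retention rules can also be applied programmatically; a sketch:

```r
library(factoextra)
pca_res <- FactoMineR::PCA(iris[, -5], scale.unit = TRUE, graph = FALSE)

eig <- get_eig(pca_res)
eig                                               # eigenvalue, % variance, cumulative %

sum(eig$eigenvalue > 1)                           # Kaiser rule: eigenvalue > 1
which(eig$cumulative.variance.percent >= 80)[1]   # first k PCs reaching 80 %
```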
Visualizations & Diagnostics
- Scatter matrix (GGally::ggpairs) – simultaneous histograms, scatter plots, and correlation coefficients.
- Boxplots by species – assess group differences before PCA.
- Biplot – combines scores and loadings; useful for spotting outliers and interpreting component directions.
- Contribution plots – bar charts of variable contributions to each PC.
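The diagnostics listed above could be produced along these lines (GGally attaches ggplot2, which supplies aes()):

```r
library(GGally)
library(factoextra)

# Scatter matrix with histograms, scatter plots, and correlations, by group
ggpairs(iris, aes(colour = Species), columns = 1:4)

# Bar charts of variable contributions to PC1 and PC2
pca_res <- FactoMineR::PCA(iris[, -5], scale.unit = TRUE, graph = FALSE)
fviz_contrib(pca_res, choice = "var", axes = 1)
fviz_contrib(pca_res, choice = "var", axes = 2)
```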
Practical Tips & Common Errors
- Package installation: Install all listed packages before sourcing the script; missing packages cause "function not found" errors.
- Version mismatch: Warnings about packages built under a different R version can be ignored temporarily, but updating R is advisable.
- File handling: Ensure the CSV is unzipped and placed in the working directory; use read.csv() or read_excel() accordingly.
- Standardization: Forgetting scale.unit = TRUE in PCA() leads to misleading PCs when variables have different units.
- Interpretation: Remember PCA is exploratory; it does not provide p-values. For confirmatory analysis, follow up with clustering or discriminant analysis.
Next Steps After PCA
- Cluster analysis: Use the PC scores as input for k‑means or hierarchical clustering to formalize the observed groups.
- Discriminant analysis: Test whether the identified groups are statistically separable and obtain classification probabilities.
- Regression on PCs: If a predictive model is needed, regress the response on the retained PCs (they are orthogonal, satisfying the independence assumption).
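As a sketch of the clustering follow-up (k = 3 and the seed are illustrative choices, not from the session):

```r
set.seed(42)                                   # arbitrary seed for reproducibility
pca_res <- FactoMineR::PCA(iris[, -5], scale.unit = TRUE, graph = FALSE)
scores <- pca_res$ind$coord[, 1:2]             # retained PC scores as features

km <- kmeans(scores, centers = 3, nstart = 25)
table(km$cluster, iris$Species)                # cross-tab clusters vs. species
```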
Closing Remarks
The session emphasized the importance of aligning statistical techniques with research objectives, avoiding blind application of methods, and continuously expanding one’s toolbox through practice and community resources.
Principal Component Analysis converts many correlated measurements into a few orthogonal components that retain most of the original information, making it indispensable for dimensionality reduction, data exploration, and preparing data for further multivariate modeling in R.