Comprehensive Guide to Cluster Analysis: Theory, Methods, and R Implementation

Name: Day 5 Advanced Training on Scientific Data Management for Post-Graduate
Uploaded: 2026-01-16T09:34:32.625528+00:00
Channel: RUFORUMNetwork
Description: Summary and key takeaways on Comprehensive Guide to Cluster Analysis: Theory, Methods, and R Implementation.
YouTube video ID: YqH22zRKJ6Y
Source: YouTube video by RUFORUMNetwork — Watch original video
Introduction

The session introduced cluster analysis as an unsupervised technique for grouping similar observations into homogeneous clusters that are distinct from other groups. Participants were reminded to download the Day 5 materials (PowerPoint, PDF, and two R scripts – cluster_analysis.R and cluster_selfread.R) from the shared Google Drive.
Core Concepts

Similarity vs. Dissimilarity: Within‑cluster distance should be small (high similarity); between‑cluster distance should be large (high dissimilarity).
Homogeneity: Objects inside a cluster share common attributes (e.g., shape, color, material).
Inter‑ and Intra‑cluster Distance: Intra‑cluster distance measures how close points are inside a cluster; inter‑cluster distance measures separation between clusters.
Linkage Methods:
Single linkage – uses the shortest distance between a point in one cluster and a point in the other.
Complete linkage – uses the longest such distance, that is, the two farthest points.
Average linkage – averages all pairwise distances between the two clusters.
Visualization Tools:
Dendrogram – tree‑like diagram from hierarchical clustering.
Elbow (Scree) Plot – plots the within‑cluster sum of squares against the number of clusters; the bend (elbow) suggests the optimal number of clusters.
Scatter‑plot matrix – visual inspection of variable relationships.
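The three linkage rules can be checked by hand on a tiny example. The sketch below is only an illustration: the two clusters and their coordinates are made up, and the variable names are hypothetical rather than taken from the session's scripts. It computes every Euclidean distance between a point in one cluster and a point in the other, then applies each rule.

```r
# Two small, made-up clusters of 2-D points (hypothetical data).
cluster_a <- rbind(c(0, 0), c(1, 0))
cluster_b <- rbind(c(4, 0), c(6, 0))

# Euclidean distance between every point in A (rows) and every point in B (columns).
pairwise <- outer(
  seq_len(nrow(cluster_a)), seq_len(nrow(cluster_b)),
  Vectorize(function(i, j) sqrt(sum((cluster_a[i, ] - cluster_b[j, ])^2)))
)

single_link   <- min(pairwise)   # closest pair of points
complete_link <- max(pairwise)   # farthest pair of points
average_link  <- mean(pairwise)  # mean over all pairs

print(c(single = single_link, complete = complete_link, average = average_link))
```

In practice hclust() applies the chosen rule at every merge; the hand calculation is only meant to make the definitions concrete.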
Practical Example with Toy Data

A toy dataset (variables: shape, color, material) illustrated how choosing different attributes changes the number of clusters:
- Grouping by shape yielded three clusters (triangle, circle, rectangle).
- Grouping by shape and color together increased the cluster count, because shapes split into red and blue groups.
- Including material further increased the cluster count.
The example emphasized that the analyst decides which attributes to use; more attributes generally produce more clusters.
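The effect of adding attributes can be reproduced with a few lines of R. In this sketch the toy names and shapes follow the session's example, but the color and material assigned to each toy are invented for illustration, and count_groups() is a hypothetical helper, not a function from the session's scripts.

```r
toys <- data.frame(
  toy   = c("P", "A", "B", "Q", "X", "Y", "C", "R", "Z"),
  shape = c("triangle", "triangle", "triangle",
            "circle", "circle", "circle", "circle",
            "rectangle", "rectangle"),
  # Color and material values are invented for illustration only.
  color    = c("red", "blue", "red", "red", "blue", "blue", "red", "red", "blue"),
  material = c("metal", "metal", "plastic", "plastic", "metal",
               "metal", "metal", "plastic", "plastic")
)

# Hypothetical helper: number of distinct attribute combinations, i.e. natural groups.
count_groups <- function(data, attrs) nrow(unique(data[, attrs, drop = FALSE]))

n_shape       <- count_groups(toys, "shape")
n_shape_color <- count_groups(toys, c("shape", "color"))
n_all         <- count_groups(toys, c("shape", "color", "material"))

print(c(shape = n_shape, shape_color = n_shape_color, all = n_all))
```

The group count can only stay the same or grow as attributes are added, which is the point the example makes.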
Hierarchical vs. Non‑hierarchical Clustering

Hierarchical (Agglomerative): Starts with each observation as its own cluster and merges them step‑by‑step using a linkage function. The resulting dendrogram helps decide where to cut the tree.
K‑means (Partitioning): Requires pre‑specifying k clusters. The algorithm iteratively updates centroids and reassigns points until convergence.
Both methods were demonstrated in R.
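To make the "reassign points, update centroids" loop concrete, here is a minimal sketch of that iteration written by hand. It is an assumption-laden illustration: the data are simulated, the starting centroids are simply one point from each half of the data, and empty clusters are not handled. The built-in kmeans() is the function to use in real work.

```r
set.seed(1)
# Hypothetical 2-D data: two well-separated blobs of 20 points each.
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)

k <- 2
# Starting centroids: one point from each half (a simplification for the sketch).
centroids <- x[c(1, 21), , drop = FALSE]

for (iter in 1:20) {
  # Assignment step: squared distance from every point to every centroid.
  d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
  assignment <- max.col(-d2, ties.method = "first")  # index of the nearest centroid

  # Update step: each centroid becomes the mean of the points assigned to it.
  new_centroids <- t(sapply(seq_len(k), function(j) {
    colMeans(x[assignment == j, , drop = FALSE])
  }))

  if (all(abs(new_centroids - centroids) < 1e-8)) break  # centroids stopped moving
  centroids <- new_centroids
}

# Cross-check against the built-in implementation, started from the same centroids.
fit <- kmeans(x, centers = x[c(1, 21), , drop = FALSE])
print(table(manual = assignment, kmeans = fit$cluster))
```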
Step‑by‑Step R Workflow

Load Packages – install the required packages once with install.packages(), then load them with library() (e.g., tidyverse, cluster).
Import Data – read the utilities.csv file from GitHub or a local copy.
Data Preparation –
Convert categorical columns (e.g., company) to factors.
Remove non‑numeric columns before scaling.
Normalization – apply scale() to obtain zero‑mean, unit‑variance variables, so that variables measured in different units contribute comparably to the distances.
Distance Matrix – compute Euclidean distances with dist() on the normalized data.
Hierarchical Clustering – use hclust(dist_matrix, method = "complete") and plot the dendrogram.
Determine Optimal Clusters – inspect the elbow plot (fviz_nbclust from the factoextra package) or cut the dendrogram at a chosen height.
K‑means Clustering – run kmeans(normalized_data, centers = k) for a predetermined k.
Interpret Results – examine cluster assignments, visualize with scatter plots, and discuss business implications (e.g., segmenting utility companies by fuel cost vs. sales).
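The workflow above can be sketched end to end in base R. Because utilities.csv is not reproduced here, the sketch assumes the built-in USArrests data set as a stand-in, picks k = 4 arbitrarily, and computes the elbow values by hand instead of calling fviz_nbclust; none of these choices come from the session's scripts.

```r
# Stand-in data: the built-in USArrests data set (an assumption; the session used utilities.csv).
raw <- USArrests

# Keep numeric columns only, then standardise to mean 0 and standard deviation 1.
numeric_data    <- raw[, sapply(raw, is.numeric), drop = FALSE]
normalized_data <- scale(numeric_data)

# Euclidean distance matrix and complete-linkage hierarchical clustering.
dist_matrix <- dist(normalized_data, method = "euclidean")
hc <- hclust(dist_matrix, method = "complete")
plot(hc, cex = 0.6, main = "Complete-linkage dendrogram")

# Cut the tree into k groups (k = 4 is an arbitrary choice for illustration).
k <- 4
hc_groups <- cutree(hc, k = k)

# Elbow plot: total within-cluster sum of squares for 1 to 8 clusters.
set.seed(123)
wss <- sapply(1:8, function(i) {
  kmeans(normalized_data, centers = i, nstart = 25)$tot.withinss
})
plot(1:8, wss, type = "b",
     xlab = "Number of clusters", ylab = "Within-cluster sum of squares")

# K-means with the chosen k; nstart repeats the random initialisation.
km <- kmeans(normalized_data, centers = k, nstart = 25)
print(table(hierarchical = hc_groups, kmeans = km$cluster))
```

The final table cross-tabulates the two partitions; the cluster labels themselves are arbitrary, so agreement shows up as rows dominated by a single column rather than as matching numbers.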
Common Pitfalls & Troubleshooting

File Access Errors – ensure internet connectivity or download the CSV locally and set the correct working directory.
Package Installation Issues – run install.packages("packageName") before library().
Incorrect Data Types – convert character columns to factors; exclude them from scaling.
Choosing k – use domain knowledge, dendrogram inspection, or silhouette analysis to avoid arbitrary decisions.
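Silhouette analysis, mentioned above as one way to choose k, can be sketched as follows. The sketch assumes the built-in USArrests data as a stand-in for the session's data and uses silhouette() from the cluster package; the range of k values tried (2 to 8) is an arbitrary choice.

```r
library(cluster)  # provides silhouette(); a recommended package normally bundled with R

normalized_data <- scale(USArrests)  # stand-in data, not the session's utilities.csv
dist_matrix <- dist(normalized_data)

set.seed(123)
# Average silhouette width for each candidate k; higher means better-separated clusters.
avg_sil <- sapply(2:8, function(k) {
  km <- kmeans(normalized_data, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist_matrix)[, "sil_width"])
})
names(avg_sil) <- 2:8

best_k <- as.integer(names(which.max(avg_sil)))
print(round(avg_sil, 3))
print(best_k)
```

The k with the highest average width is a candidate, not a verdict; it is worth comparing it with the elbow plot and with domain knowledge.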
Applications

Cluster analysis can be applied to market segmentation, genetic accession grouping, document clustering, image segmentation, and any scenario where natural groupings are sought without a predefined outcome variable.
Final Remarks

The session emphasized practice: participants should replicate the scripts, experiment with different variables, and explore alternative distance measures (Manhattan, Gower) to deepen understanding.
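As a starting point for experimenting with alternative distance measures, the sketch below computes Manhattan and Gower dissimilarities and feeds both to hclust(). The small data frame is invented for illustration (its column names only loosely echo the utilities example), and average linkage is an arbitrary choice.

```r
library(cluster)  # daisy() computes Gower dissimilarities for mixed-type data

# Hypothetical mixed-type data: two numeric columns and one factor.
mixed <- data.frame(
  fuel_cost = c(0.6, 1.1, 0.9, 1.8),
  sales     = c(9000, 5000, 7500, 6000),
  region    = factor(c("north", "south", "north", "south"))
)

# Manhattan distance on the standardised numeric columns only.
manhattan_d <- dist(scale(mixed[, c("fuel_cost", "sales")]), method = "manhattan")

# Gower dissimilarity handles numeric and categorical columns together.
gower_d <- daisy(mixed, metric = "gower")

# Either dissimilarity object can be passed to hclust().
hc_manhattan <- hclust(manhattan_d, method = "average")
hc_gower     <- hclust(gower_d, method = "average")

print(round(as.matrix(gower_d), 3))
```

Gower dissimilarity is useful when categorical columns should stay in the analysis instead of being dropped before scaling.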
Cluster analysis empowers analysts to uncover natural groupings in data by balancing intra‑cluster similarity and inter‑cluster separation; mastering both hierarchical and K‑means methods in R, along with proper data preparation and validation, is essential for reliable segmentation across diverse fields.
Full Transcript

Then after that we will start. Let's just give it two minutes and we start at five past two; maybe some other colleagues are still finding out how they can log in, internet issues and so on.
Okay. Friend, I've seen your comment, and I'm not going to be fast. For those who want to get today's materials, kindly check the same Google Drive, the same Google link that we've been sharing. I updated it with the materials for day five. There is a PowerPoint, which is in PDF, and then we've got two R scripts: one is named cluster analysis, the other one is cluster self-read. I saved it as self-read because I want you to practice: at the end of this session you'll do personal practice at your end. But we'll go through the cluster analysis R script together before this session ends.
Okay, I think we can start now, and other people can join us as we move. All right, I'm going to share my PowerPoint. So with day five, we're going to look at cluster analysis: an introduction to cluster analysis. I strongly believe that at the end of this particular session you're going to be well versed with what cluster analysis is all about. I'm going to try to be as slow as possible, such that everyone is able to move at my pace and I'm also able to move at your pace at your end.
Thomas and Susan have already looked at some of the statistical techniques that we've been talking about throughout this week: factor analysis and then principal component analysis. Today we want to look at cluster analysis. As we give you an introduction, we assume that at the end of the training, or at the end of every session, you find time and see how you can relate these particular techniques to your own data. Remember, we said one of the advantages of using the R programming language is that you can reproduce the code: code that has worked on project X can be reproduced for project Y, provided you know what each particular command does, and that is what we expect. So let's dive in. We're going to look at an introduction to cluster analysis and see what it is. We'll look at a little bit of theory, and after understanding it very well we're going to run our cluster analysis script and see how it is done in practice.
done all right so when it comes to
Cluster analysis I'm going to um
sometimes it's called the subjective
segmentation so in case you find a
different name it still means clust
analysis but like in the First on on day
one whereby we talking about an overview
of mar varat analysis and we looked at
different techniques and asking
ourselves what they do when it comes to
clust analysis the main task is that I
want to find out which objects are
similar the object could be in terms of
observation could be in terms of item
depending on the on on the data that
you've collected or on your interest on
the different things that you want to
Cluster together but one thing that we
want to see are this object similar do
we see any pattern okay when we are
looking at
um different objects so we need to as we
proceed ahead at the back of your mind
put it that we want to see if we can see
um objects that are similar or patterns
that are similar because we saying that
clust analysis is a um is subject um
segment um it's a subject
segmentation and also apart from seeing
that the objects are similar we also
want to as we move on we want to see
that the moment the they are grouped
they are homogeneous within the group
they have to be homogeneous implying
that they have got similar
characteristics and they are differ and
they differ from other groups okay so
the takeway message here is that when we
talk about clust analysis it's all about
finding observations that are similar
and then also so that the groups that
are put together they should be
homogeneous within the
group okay and when you talk about
homogeneity, it implies that the objects within the group are similar to each other, as we are going to see in the coming examples. Apart from everything inside a particular group being similar, the groups also have to differ from one another. Each group, or each cluster, should be distinct from the others, in the sense that an observation in one group should be dissimilar to any object in the other groups. As we move on, we're going to see some of the concepts and terms that are used in cluster analysis, like similar and dissimilar: dissimilar meaning that different groups or clusters have to be different from each other, while within a group the members have to be similar. Okay, and we also said
that cluster analysis belongs to the interdependence techniques. This is something that I picked from day one, and I thought we could remind ourselves. I'll also echo that cluster analysis is used to group similar items within the data set into clusters; you can call them clusters or you can call them groups. When grouping the data into clusters, we aim at putting similar items or observations in one cluster, and they have to be different from the other clusters. Also, today we are going
to understand different terms: the inter-cluster distance and the intra-cluster distance. When we talk about the intra-cluster distance, we look at the distance between the data points within one cluster. Assuming we have a cluster with similar observations or items, the distance between these data points is what we call the intra-cluster distance, and we are going to see how we calculate it. We are saying that it should be small: the shorter the distance, the closer these points or items are to each other, or the more similar their pattern. When we talk about the inter-cluster distance, this one looks at the distance between data points in different clusters, and ideally it should be large, because we want to see that within the cluster the distance is small but between the clusters the distance is large. As we go on, we shall see the different distributions as well as finding the patterns.
Okay, before I move on, let me check in the chat: are we on the same page? Is the speed okay, or am I too slow and should I increase it? Are we on the same page? Okay, all right.
Okay, now let's take an example. I want us to consider this particular example, where we've got the toys P, Q, R, X, Y, Z, A, B, C. It is just an example that I want us to use to understand what cluster analysis is all about; when we go to R, we shall use a complete data set. In this particular data set, the toys are the observations. I've got a variable called shape, and under shape, for example, toy P is a triangle, Q is a circle, R is a rectangle, etc. Then when I come to color, I've got two colors: a toy is either red or blue. And the make of the toy is either metal or plastic.
Okay, so this is a data set that I've picked, and I want to use it to elaborate on what exactly we are looking at when it comes to cluster analysis. We're going to base on this small data set to get to know how we build clusters, or how we group the different variables.
variables okay let's go back to this
particular to our table let's assume
that we consider shape we' seen that
under shape it's either triangle a
circle or a
rectangle and with a triangle we have P
toy P we have toy a we have toy B so we
are going if you are to consider
grouping with a triangle because
sometimes it's natural grouping then it
implies that under triangle we shall
have P A and B when we come to Circle we
shall have QX
Y and then C in case of rectangle as our
shape we shall have R and R and
S all right so this is what we having on
this particular slide because remember
we saying that when you're going to
group you can you group according to a
given variable of interest it's up to
you to choose whatever variable that you
want so in case we consider shape like I
mentioned earlier this is we are coming
up with three clusters or we are coming
up with three groups so because we Bas
we based it on the shape we have the
ones in the triangle and we go these
particular observations then within our
Circle we've got QX Y and c and then the
rectangle we have R and what R and Z so
what we are seeing here is that within
the group you have
homogeneity okay because within the
group we saying that P A and B they are
triangles then r z they all rectangles
and then with QX Y and C they all
circles So within the group they are the
same theyve got similar characteristics
but uh um between the group there is
homogeneity or they are distinct they're
different because in between we have we
have the triangle the circle and then
the
rectangle okay how about in case we
consider color? Remember, when it comes to color, we said the toys belong to two categories: a toy is either red or blue. So here we're going to have only two groupings, or two clusters: one where we have red and one where we have blue. What about in case we consider both shape and color? As you add different attributes, the number of clusters is going to increase. Initially, when we considered only shape, we had only one circle group, but as we added color we have circles that are red and circles that are blue. The same applies to rectangle: at first we considered only the shape, without the color; now that we've brought in the colors, the number of groups is increasing. So as you increase the variables, the number of clusters, or groupings, is also going to increase. What if we add, for example, the material, where a toy is either plastic or metal? Then in that case too the number of clusters, or groupings, is going to increase. Okay, so in summary, what does
that mean? When we are looking at cluster analysis, remember I mentioned it is an unsupervised technique: the variables are not depending on each other. It's not like, for example, logistic regression, where you're looking at a binary outcome and you want to understand how likely the independent variables are to influence your outcome variable. Here there is nothing like dependency, and so you have a choice to take any variable you want; you have a choice to select whichever attribute you want. As I explained on the previous slide, you can choose any variable or attribute, because in this case there is nothing like this variable depending on the other.
Okay. Also, different attributes will lead to different groupings. I've explained that; say, different variables will lead to different groups. We saw it for shape, we saw it for color, we saw it when we combined them, and in case we also consider the last attribute, the groupings are still going to change. The group is homogeneous based on the selected attribute only. So it is homogeneous based on the particular variable that you've chosen, like shape, where as I mentioned we have the triangle, the circle, and the rectangle. When you take different attributes you'll get different groupings, like the example where I showed shape and then color: automatically the groupings change. So the point is, the more attributes or variables you take, the more clusters will be required. But it's up to you to consider which attribute or variable you want to use. Do you want to consider color only? Shape only? Both? Or all the variables? All right, so remember, as you increase the number of attributes, the number of clusters will automatically increase, but it's up to you to choose; there is no dependency among these attributes. Okay, and also the
segmentation develops on its own, based on the values of the inputs. As I mentioned, the grouping comes naturally depending on how the variables look. We saw, for example, with color that it is natural that these observations are either blue or red. We've mentioned that this is unsupervised learning, and in this case the groupings are based on the independent variables; we don't have issues to do with dependency.
Okay, let me pause a little bit and see in the chat. "Can you take the last slide again?" "One file is open." Are you seeing my slides? Somebody's complaining. Okay, okay, great. Now, are we on the same page? Type 1, and then I continue.
Okay, all right, let's continue. At least from what I've just talked about, we now know what cluster analysis is all about. That was just a basic introduction so that we understand how the grouping looks. We also need the concept of intra-cluster distance, because the distance within a given cluster should be small, and it should be large between different clusters. So sometimes you want to ask questions like: now that I've got these clusters, how do I know that the objects are similar to each other? Of course there are many ways to answer that question, but one possible way is by looking at the distance between two points, and there is a very common method, the Euclidean distance method. To get the distance between two points, assume that i is my first object and j is my second object, and that I've got many parameters, say p of them. I take the first parameter of my first object and subtract the same parameter of the second object, take the absolute value and square it, and continue until I get to the last parameter; then I add these up and take the square root. Let's assume you've got only two observations and you want to know the shortest distance, for example when moving from this point to the other. To come up with the length of a straight line, we take u1 minus u2, square it, and take the square root of everything. It looks like a straight line here, but there are different possibilities for how, for example, this particular B will move from this point to the other. When you're using R, it will show us how to compute the distance between clusters in just a command, and we're going to see that as we move on. So mathematically, this is how we calculate the distance between points. Okay, and there are also other
ways we can do that. As we said on the previous slide, B can use various ways to move from one point to another; there are many possibilities. In our case, in R, we can also use what we call the linkage function. The linkage function helps us look at the inter-cluster distance. First I showed how you can do it mathematically, and we can also use the linkage function, as we're going to see when we move on to the program. With the linkage function, what we are saying is that there are various distances that can be derived from one group to the other, and there are different possibilities you may want to consider. For example, if you consider the shortest distance of them all, we call that single linkage. There are other circumstances where you want to consider the furthest points between the clusters, and we call that complete linkage. Please mark the words, because when we get to the R script I'm going to be using the same wording. Okay, maybe to illustrate that
more about the linkage function on the next slide: assume that I've got two clusters, this small cluster here and another cluster here. What I've just explained is that from one particular point you can move to different points in the other cluster and get various routes; as I mentioned, there are many routes you could consider. If you consider the closest distance between these two clusters, we call that single linkage. However, if you consider the furthest points between the two clusters, we call that complete linkage. For example, in green, assuming that my far point is here, and the furthest point within this big cluster is this particular point, we refer to that as complete linkage. We're also going to see in R how we compute the distance between clusters using complete linkage, and there are other functions as well, like average linkage, etc. For single linkage I'm taking the shortest distance; for the furthest I'm taking complete linkage. We've seen how we can compute distance mathematically using the Euclidean distance, and now also using the linkage function, just to give an idea of how that is done. Just remember, as I said, you can have different distances; for example, one point can be connected to different points, so it has many paths. But we are saying that for the shortest one we use single linkage, and for the furthest we have complete linkage. Okay.
I think I've explained complete linkage: we are taking the most extreme end points. There are also other approaches, for example average linkage, etc. But it's all about how you can calculate the distance between objects and then between clusters. Okay, so for us to do that, there
are different approaches you can use in case you're interested in calculating the distance between objects and between clusters. We're not going to look at all of them; I'm going to look at just a few, and one of the common ones is hierarchical clustering. This approach is usually applied in case you have a small data set, and it shows you each stage of the observation linkage. The point is that we are trying to see the number of clusters that we are going to consider, and how we get the distance within the cluster and between the clusters. There are different ways you can visualize that: using the dendrogram, and also the scree plot. I think the scree plot has been talked about throughout this training since Monday. With the scree plot, the point is that we look at the number of clusters and the within-cluster variance, and with it we can determine the optimal number of clusters that we need. If we continue further, I've
already mentioned the scree plot. What it does, as my colleagues have been elaborating in the past days, is give us all the possible numbers of clusters and the within-group sum of squares. On my x-axis I have the number of clusters, for example eight clusters, seven, down to four and then to one, and on the y-axis I have what is called the within-group sum of squares. Remember, the point is that we want to reduce the within-cluster variability. You find, for example, that from eight clusters down to four clusters the change in within-group variability is very small, compared to when you start from one cluster: as you slide down from one to two there is a big drop in the within-group sum of squares, and the same applies from two to three. But as you move from three to eight, the variability keeps reducing only gradually. So when you look at cluster number three, we see what we call an elbow, and this elbow is what gives us the optimal number of clusters that we need. Depending on the data set, where we see the elbow, as was illustrated yesterday by Professor Susan, implies that within my given data set this is the optimal number of clusters that I need.
Let me pause a little bit and check in the chat whether we are on the same page. Are we on the same page? Type 2. Okay, all right. Okay.
Great. In case you have a question that is technical and needs more attention, you can type it in the Q&A and you'll get help. All right. I've just explained the scree plot, and I'm still building on what was shared. When the number of clusters is one, we have the total variance within the data set. I've just mentioned that where we see the elbow, as was illustrated yesterday, gives us the optimal number of clusters that we need to group our variables of interest. Apart from the scree plot, there is another way you can visualize your data in cluster analysis, which is called the dendrogram plot. Okay, and looking at this one, the
dendrogram is a tree-like diagram that is used when you want to illustrate the arrangement of the clusters produced by hierarchical clustering. It's just a visual presentation; it shows the arrangement of clusters formed at each step of the hierarchical clustering process. I'm going to show you an example. Within its structure, the different things you have to keep in mind are these. Since it's hierarchical in nature, each data point starts as its own cluster, and then we keep building and see how the clusters get nested in one another. The leaves, or what we call the terminal nodes, represent individual data points. For example, if we go back to the table that I showed you earlier, we can have the individual data points representing each toy. We are also going to see how the branches are formed, in the sense that you can merge two clusters: if the length of the branches is small, then we group them into one cluster, and so on. Okay, so the closer the branches are,
then it implies that they are more similar to each other, so we can put them under one cluster. This is how it's going to look in case we run it in R. These points that you're seeing at the end are the ones I'm calling the nodes; they cater for the different data points. For example, as I mentioned with the toys, we can have these different nodes with each toy acting as a cluster. These ones are next to each other: this can be cluster one, cluster two, cluster three, this can be cluster four, we can have five, this can be six, this is seven, then eight, nine, this can be 10, then this is 11, and this is 12. Okay, since it's hierarchical in nature,
you find that you look at the height, which we usually have here, and see how it changes. In this case you find that clusters six and seven are very close to each other, so we can collapse them into one cluster. Instead of having six and seven, we collapse them and they become one cluster, and then we shall have 11 clusters. Moving up, you find that this cluster here has been joined again by another cluster at this particular point. If I call this one A, and it also joins with B, I can collapse the whole of this to become one cluster, and I can also collapse this to be one cluster, because they are all close to each other. As I build going upwards, this cluster here also joins with this cluster at this particular node, and I can dissolve the whole of this into one cluster. As we build upwards and I continue, I see that at this particular point I've got this cluster, and I can also have the red into one cluster, which later joins here, and the two of them I can again have as one cluster. So at the end of the day, what
that implies is that um you can you can
build different these clusters are
nested in the other so as we check the
different distances within the within
these clusters we can try to bring them
close and see the ones that are close to
each other we put them under one cluster
and then those the ones that are far
from the other we group them also
differently and then also you can be
able for example with this particular um
visualization assuming that I put a line
here and then I cut off the the I cut
off the upper part then it implies that
that now I'm going to remain with only
two clusters I'll have this one here
okay which has got and then I also have
this other one on the the left and then
on the right all right so in case for
example I cut off from here what is
going to happen if I if I cut from that
particular point I'm going to remain
with this cluster the number of clusters
are going to increase I'll have one I'll
have two I'll have three then I'll have
four so that is how the that is how the
um the
dendogram plot looks like or can be
Illustrated and it's just also supports
us to find out um how the Clusters can
be grouped and then of course there are
other ways or methods that you can use
to enable you to get the optimal
clusters all
right okay
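The idea of cutting the tree with a line can be sketched in R with base functions. This is only a minimal sketch, not the session's script: the six toy measurements are made up for illustration (the toys in the talk have categorical attributes), and average linkage is just one possible linkage choice.

```r
# Hypothetical numeric measurements for six toys (made up for illustration).
toys <- data.frame(
  x = c(1.0, 1.2, 5.0, 5.3, 12.0, 12.4),
  y = c(1.0, 0.9, 5.1, 4.8, 12.2, 12.0),
  row.names = paste0("toy", 1:6)
)

d  <- dist(toys)                      # pairwise Euclidean distances
hc <- hclust(d, method = "average")   # agglomerative clustering, average linkage

# Draw the dendrogram; the vertical axis is the height at which clusters merge.
if (interactive()) {
  plot(hc, main = "Toy dendrogram", xlab = "", sub = "", ylab = "Height")
  abline(h = 3, lty = 2)              # a horizontal cut line
}

two_clusters   <- cutree(hc, k = 2)   # cut high up: few clusters
three_clusters <- cutree(hc, k = 3)   # cut lower down: more clusters
cut_at_height  <- cutree(hc, h = 3)   # or cut at a chosen height instead of a count

print(two_clusters)
print(three_clusters)
```

Moving the cut line down (a smaller `h`, or a larger `k`) increases the number of clusters, which is the behaviour described for the dendrogram.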
So, in summary, the dendrogram provides a visual summary of the
clustering process: how we move from many clusters until, at the end
of the day, everything is merged into one cluster. For example, in
case we consider our toys, at the end of the day all the variables
of interest belong under the one toy cluster. It also helps us to
see how clusters are formed and at what level of similarity or
dissimilarity. Okay, apart from hierarchical clustering, we also
have what we call the non-hierarchical approach, and in this
particular case we use k-means
clustering. All right, this one is very direct and very simple,
because it is up to you to decide, looking at your data set, how
many clusters you need. Especially when you have a huge number of
observations, you can easily build up the algorithm and decide on
the number of clusters that you need. For example, you can say, I
need eight clusters, I need five, or 10 clusters. The K that you are
seeing here stands for the number of clusters that you need.
Remember, as you increase the number of grouping attributes, the
number of clusters will also increase. There are also ways to guide
the choice, for example the scree plot, which tells us that the
optimal number of clusters is going to be three, or five. But when
it comes to the non-hierarchical approach we use k-means clustering,
and here it is up to you to decide the number of clusters that you
need. This algorithm is implemented in four steps that we are going
to look at. Step number one: you partition your objects into K
non-empty subsets, and you do this randomly. Looking at the toy
example from earlier, you ask, how many clusters do I need? Let me
assume that I need three clusters, so I'll randomly partition my
data into three subsets. Remember, they have to be non-empty,
implying that each subset has observations with the attributes of
interest that we are looking at. That is step number one: we
partition the objects into K non-empty subsets. All right, what
about step number
two? After partitioning into the number of clusters that I need, we
move to step number two: we compute the seed points as the centroids
of the clusters. So in step number one you have done the
partitioning, and now you want to create the seed points as the
centroids of the different partitions. Let's take an example: assume
each subset has 15 observations, and that you are considering two
dimensions, X and Y. The seed point, which I am referring to as the
centroid, is the same as the centre: it is the mean point of the
cluster. I was going to give the example of the toys, but within the
toys the variables are not quantitative, and for this kind of
clustering they have to be quantitative in nature, so that we can
add the observations and divide by the total number of observations.
So within your X variable, which has to be quantitative, you add the
values of the points that are within that particular cluster and
divide by the total number of observations in it. You do the same
for the other variable, Y: you add the values and divide by the
total number of observations. That way you come up with what we call
the mean point, or the centroid, which is the centre. So, assuming
that within this particular cluster I've got that centroid, I then
get another cluster with its own. Remember, these clusters come from
step number one, where I partitioned the objects into K non-empty
subsets, so it depends on the K you have chosen. If I chose three,
then in each cluster I'll come up with a mean point, so I expect to
come up with three centroids. If there are eight clusters, then I
would have eight centroids, the mean points of the non-empty
clusters. The next step, step number three, is to assign each object
to the cluster with the nearest seed point. What does that mean? If
this is my central point, the objects that are close to it we say
are within that cluster, while the ones that are far from it, and
closer to a different centroid, are clustered in that different
cluster. Step number four: we repeat the same procedure. Of course,
with R we just need to run the command. You go back to step number
two and repeat until you do not see any change in the assignments;
in other words, until you have reached stable subsets. I'll put more
details on that when we come to R and see what happens when we run
the different commands. Okay.
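The four steps can be sketched as a small R function. This is a teaching sketch under stated assumptions, not the code from the session's scripts: the function name `simple_kmeans`, its arguments, and the simulated points are all made up for illustration, and for real analyses the built-in `kmeans()` is the better choice.

```r
# A teaching sketch of the four k-means steps described above.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  n <- nrow(x)
  stopifnot(k >= 1, k <= n)

  # Step 1: randomly partition the n objects into k non-empty subsets.
  cluster <- sample(rep_len(seq_len(k), n))
  centers <- matrix(NA_real_, nrow = k, ncol = ncol(x))

  for (iter in seq_len(max_iter)) {
    # Step 2: the seed point of each cluster is its centroid (column means).
    # If a cluster ever becomes empty, its previous centroid is kept.
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }

    # Step 3: assign every object to the cluster with the nearest centroid
    # (squared Euclidean distance).
    sq_dist <- matrix(
      sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2)),
      nrow = n
    )
    new_cluster <- max.col(-sq_dist, ties.method = "first")

    # Step 4: go back to step 2 until the assignment no longer changes.
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

# Usage on simulated data: two groups of 15 points in the X-Y plane.
set.seed(42)
pts <- rbind(
  cbind(x = rnorm(15, mean = 0, sd = 0.3), y = rnorm(15, mean = 0, sd = 0.3)),
  cbind(x = rnorm(15, mean = 5, sd = 0.3), y = rnorm(15, mean = 5, sd = 0.3))
)
fit <- simple_kmeans(pts, k = 2)
print(table(fit$cluster))
print(fit$centers)
```

Each row of `fit$centers` is a centroid: the sum of the X (or Y) values in that cluster divided by the number of observations in it.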
So, just a few things. This one I've already illustrated, so I'm
going to talk only briefly about the details, and then we can go to
R and see how we can do the things practically. This is just a
summary of what I've been talking about. Cluster analysis, or
clustering, groups sets of objects that are similar to each other:
within a group the objects have to be homogeneous, and they have to
be distinct from other groups. With the example that we had, there
is also a natural grouping within the data set. Clustering implies a
measure of similarity or dissimilarity between the data points. With
similarity we are looking within the cluster, because there the
distance between the points is very small; with dissimilarity we are
looking between clusters, because the distance between the clusters
is large. Okay, and then we've already talked
about the different ways you can measure the distance, for example
the Euclidean distance, which I've talked about, but you can also
use the Manhattan distance and the cosine similarity. All those
common measures can be used when you want to measure the distance.
Then, in addition to the types of clustering methods I've talked
about, there are some others. We are not going to look at them in
this particular period, but you can do some further reading. I've
talked about the hierarchical method: like I mentioned, you create a
tree-like structure, the dendrogram that I've just illustrated,
which keeps nesting different objects to come up with the different
clusters. We also talked about partitioning, whereby you divide your
data set into a predetermined number of clusters. This is the
non-hierarchical method, for example k-means clustering, where you
define K, the number of clusters that you want, beforehand; maybe
you want four, or you want eight. Remember, in case you've got a
small data set, you are likely to limit the number of clusters
compared to when you've got a very big data set. The other ones, for
example the density-based methods and the model-based methods, I
would kindly ask you to read on your own; I'm only concentrating on
the first two. Okay, on the different
applications of clustering: you can use it in market segmentation,
and it can also be used in imaging and document clustering. The
point is that it is about grouping. For example, within market
segmentation you might want to look at the different sales that are
being made, and apart from the sales you can also add the age
groups, and then use those two variables or attributes to support
you in the segmentation. Okay, briefly, let's look at the steps in
cluster analysis: what we are going to do in R, what we expect to
see, and just to wrap up what I've just illustrated. So number one,
you need to
prepare your data. Number one is data preparation. It is the same as
the things we looked at under data manipulation: you have to clean
the data and then select the relevant features for clustering. Like
we mentioned in the toy example, you can cluster by shape, by color,
or by material, so which feature or attribute are you going to use
for clustering? All that is involved when you're preparing your
data. Remember, you are also supposed to normalize or standardize
the data if necessary, because when you collect data there is a lot
of noise, and some attributes or variables are on a much bigger
scale than others. We need to normalize or standardize so that all
variables are on the same level, and we can explain our results
without having a situation whereby one variable contributes more
than the others. So after step number one you've cleaned your data,
you've decided which specific attributes to use for clustering, and
where necessary you've normalized. Number two, you need
to decide what algorithm you are going to use. Depending on the
nature of the data that you have and the problem at hand, you need
to choose the appropriate clustering algorithm. Are you going to use
the hierarchical one or the non-hierarchical one? You also need to
determine the number of clusters, if required by your algorithm: are
you going to use two clusters, three, eight, etc.? That is when
you're choosing your algorithm. Okay, then the other part is that
you need to know how you're going to apply it and how the data
points are assigned to the different clusters; you're going to see
that in R. Then, evaluating the clusters: to assess the quality of
the clusters and to know that this is the optimum number to use, I
mentioned that we can look at the within-cluster sum of squares,
because we want the distance within the cluster to be as small as
possible. We can also do that with a visualization, for example with
scatter plots, depending on the appropriate method that you want.
Okay, so among the common
clustering approaches, this one I've already talked about: k-means
clustering, which requires you to state in advance the number of
clusters that you want to have. I've gone ahead to put some commands
that you can use in case you want to run it in R. When we go to R,
especially within the cluster_selfread.R script that I sent, this is
where you're going to find these particular commands. In this
particular case you put in the data set, here we are considering the
iris data set, and the kmeans function gives you the result for the
clusters. Centers equal to three implies that you've chosen to use
three clusters. I've already talked about hierarchical clustering as
well; you will see this one in the self-read script that I sent.
These other ones are just other common algorithms that you can think
about. For the visualization, you can use a scatter plot, you can
use the dendrogram, you can use heat maps; at least now we are
familiar with what these visualization approaches are all about and
what they do. Okay, so I'm going to stop here, and then we'll move
on to see how we do everything that I've just discussed in R.
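As a rough sketch of the commands mentioned for the iris data, the lines below standardize the variables, run `kmeans()` with `centers = 3`, and compute the within-cluster sum of squares for several values of K. The exact contents of the self-read script are not shown in the talk, so the column selection, the `nstart` value, and the range of K are assumptions made here for illustration.

```r
data(iris)

# Assumption: cluster on the four numeric columns, standardized so that no
# variable dominates the distance because of its scale.
features <- scale(iris[, 1:4])

set.seed(123)                       # k-means starts from random centers
km <- kmeans(features, centers = 3, nstart = 25)
print(km$size)                      # number of observations per cluster
print(km$centers)                   # centroids on the standardized scale

# Total within-cluster sum of squares for K = 1..8 (the elbow or scree plot).
wss <- sapply(1:8, function(k) {
  kmeans(features, centers = k, nstart = 25)$tot.withinss
})
print(round(wss, 1))

if (interactive()) {
  plot(1:8, wss, type = "b",
       xlab = "Number of clusters K",
       ylab = "Total within-cluster sum of squares")
}
```

The bend in the plotted curve is a common visual guide for K; it is a heuristic, so the final choice still rests with the analyst.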
Okay, I'm just going through the chat briefly to see if there are
some things you've asked that I can answer. "Please can you reshare
today's document?" When you go to the Google Drive, check in the
Day 5 folder; the Day 5 folder has all these materials that I'm
talking about. Someone was giving an example that in taxonomy it is
used in phylogenetics. Wow, thank you, thanks for that contribution.
"The dendrograms are already looking like our phylogenetic trees."
Yes, they can be applied in different fields, and within your area
of specialization you can apply them and they'll support you to make
the right decisions. "Can you show us an example of data
normalization?" Okay, we are going to see in R how we normalize data
before we go ahead to do the clustering. All right, kindly
explain the presence or absence in clustering the categorical
variables. Okay, with clustering the categorical variables, like I
explained earlier, it depends on the attribute. For the case of the
shape, we had to check which objects belong to the same shape; for
example we had a triangle, a circle and a rectangle. So the
clustering was just a natural process, because the different
observations, the objects or items that we were looking at, were
naturally grouping themselves: toy P or Q belonged to a triangle, or
to a circle, or to a square. That is how the categorical one is
being done. But as we are going to see within R, as we do the
clustering we can have the different plots and then cluster them by
the different groups that you want to consider. Okay, then, "The
file of data did not open." "How possible is networking with you as
a junior in projects?" I'm not getting your question; maybe you can
rephrase it. All right, we are going to stop here and then we move
on to
R. Okay, so now I assume that you've got the folder on your desktop
or wherever you've been working from. I'm going to share my R
script. Okay, I'm going to give you a break of 10 minutes so that
you make sure you download the Day 5 folder, and then we start on R
and see how what we've been talking about is applied in R.
that
I make
some
okay some question answer
Okay, it's now 10 past. Welcome back, the 10 minutes have passed, so
we are going to proceed. Dr Thomas has more clarifications that he
wants to make. Okay, Dr Thomas.
Okay, yes, good afternoon everyone. There are a number of questions
that I have been trying to answer from the Q&A, and I thought it
would be good to also address them to the general group. The first
thing is, how do we determine the number of clusters? I think Ellen
has made some clarifications, but one thing we need to know about
cluster analysis is that it is sometimes highly subjective, and the
assumption is that you know the underlying population that you're
trying to cluster using the information that is available. Sometimes
you want to find out what you are able to see with the information
that you have. In genetics, for example, you get a lot of plant
accessions from different places, and you want to find out: if I use
phenotypic characteristics, do they cluster based on their origin,
based on where they're coming from? So there is your own knowledge,
the theory that you come along with, that can help you to determine
how many clusters you need. But the best thing with cluster analysis
would probably also be to try, especially if you're doing
hierarchical clustering. You say, for example, if I divide this into
two groups, what will happen? If I divide this into three groups,
what will happen? You need to be able to explore the content of the
clusters that you made and look at it in terms of what you're trying
to study. When I
did some work with cluster analysis, I had about 1,000 accessions of
coconut coming from different places. Some of them were associated
with the Pacific Ocean, some with the Atlantic Ocean, and with
different countries. The idea was, based on the molecular markers
that we got, how do these accessions cluster? So you look at a
situation where the first categorization would be accessions related
to the Pacific Ocean versus those related to the Atlantic Ocean;
then within the Pacific Ocean you will have some categorization, and
within the Atlantic Ocean you will have some categorization. That
also requires that you know what is going on; you are not shooting
entirely in the dark. Once you form those groups, you can construct,
for example, what we call a confusion matrix to see whether the
clusters that you formed make sense in relation to what you know on
the ground. And there is
also the issue of, can I use categorical or binary data for cluster
analysis? Yes. The whole idea here is that you should be able to
construct a distance matrix. If the data is binary, there are lots
of distance measures that are based on presence or absence, like
Jaccard and others, so you need to look at those. Especially in the
days when we were first getting markers, and the markers were just
presence or absence, a lot of the genetic distances were based on
binary data alone. You can also combine categorical, or qualitative,
and quantitative information; there are distance measures that can
do that. At least I know there is the Gower distance, which can
combine qualitative and quantitative information. So what we are
trying to do here is to open up your eyes, so that when you go out
there to read or to do your things you are more enlightened and you
can venture a lot more than somebody who has not had that training.
We are not going to give you everything at a go; we are just trying
to open up your eyes. All this information is available online. It
is your own decision whether you want to remain ignorant or you want
to be enlightened; that's entirely up to you.
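A small sketch of these two points, a distance matrix from presence/absence data and a cross-tabulation of clusters against known groups, is given below. The marker scores, accession names and origins are invented for illustration and are not the coconut data mentioned above. In base R, `dist()` with `method = "binary"` gives the Jaccard distance; for mixed qualitative and quantitative variables, the `daisy()` function in the cluster package offers a Gower metric, which is left out here to keep the sketch in base R.

```r
# Hypothetical presence/absence scores (1 = marker present) for four accessions.
markers <- matrix(
  c(1, 1, 0, 1, 0,
    1, 1, 0, 0, 0,
    0, 0, 1, 1, 1,
    0, 1, 1, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("acc", 1:4), paste0("m", 1:5))
)
origin <- c("Pacific", "Pacific", "Atlantic", "Atlantic")  # assumed known groups

# Jaccard distance: among markers present in at least one of two accessions,
# the proportion present in only one of them.
d_bin <- dist(markers, method = "binary")
print(round(d_bin, 2))

hc     <- hclust(d_bin, method = "average")
groups <- cutree(hc, k = 2)

# Cross-tabulate clusters against the known origin (a confusion matrix).
confusion <- table(cluster = groups, origin = origin)
print(confusion)
```

If the clusters make sense in relation to what is known on the ground, most counts fall in one cell per row of the table.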
thank you and back to
Ellen. Thank you so much, Thomas. I think that was a good recap,
with more clarifications on how we're supposed to do cluster
analysis. I'm also seeing different comments that you are practicing
what we are teaching you on your respective data sets and you're
getting the right answers. That is good, so keep practicing. Now
let's continue. Okay, so when you check in the Day 5 folder, if you
closely check the materials that I sent, you have an R script that
is called cluster_analysis.R; it is what we are going to use. The
other R script is for you. Like Thomas has said, we give you some of
the materials, but later you have to do the other part on your own.
You cannot learn everything within one week and feel like everything
is exhausted. Especially when it comes to data analysis, it's a
learning process: you keep adding one or two things here and there,
but it's about practice. So I'm hoping that you've been able to get
the materials from the Google link and put everything in your
working directory; maybe you have a folder on your desktop where
you've put the Day 5 materials. We are going to go through the
cluster_analysis.R script together, and at the end of the day we see
how we managed to apply the things that we've been talking about
within cluster analysis. Okay, let me check in the chat if you've
been able to get the materials. Are you ready to start on R? Type
"yes" if you're ready, and then we get started. Okay, great.
Somebody is saying no; I don't know why they are saying no. "Yes, we
are ready." Somebody is saying no file. Okay, someone is asking for
the CSV file. Okay, let's get
started. Number one, we need to load our packages. You can highlight
everything and run, or you can run each package one at a time,
depending on what you want. When you look at line number nine, the
data that we are going to use is the utilities.csv data set, and
this is the URL where the data is stored on GitHub; that is where
the CSV file that we want to use is. You can also follow that URL
and you'll be able to see the utilities.csv file. Make sure that
when you run line nine you check in your environment and see whether
something has changed. Can you type "one" in case you've managed to
see mydata?
Okay, great. So with line nine we are able to see that we've got
mydata. This is just an object name; you can change it to any name.
In case you want to view it, you can click on the Excel-like icon,
and you find that this is how our data looks. We have the company
name, that is the first column, and it has many company names. We've
got the fixed charge, we've got the RoR, we have the cost, we have
the sales, and then we have the nuclear and the fuel cost. These are
just different variables within the utilities CSV data set, and we
have 22 entries. I'm hoping that at this particular moment we can
all view our data, which I've called mydata. Sometimes we can also
export data from R to Excel. To do that you can use the write.csv
command, and in brackets you put the object and the file name that
you want. In case you want mydata as a CSV file, it will be stored
in the same folder where you're working
from. Okay, so I'm going to continue. First things first: when it
comes to R programming, remember we said that most of the time, as
you open your R script, you have to run the necessary packages, or
in case you don't have the packages you have to install them, and
after installing you load them. We've seen line 17, in case you want
to view the data set. We can also check the structure. Looking down
here within the console, and kindly do the same at your end, it is
telling us that it's a data frame with 22 observations of nine
variables. We also have the different variables: we have company,
the fixed charge, etc. We know that company holds the different
company names, so it is categorical. What we need to do is convert
company to a factor so that R doesn't read it as a character. We can
see that under line 19: in case I run line 19 and then run the
structure of my data set again, I should now be able to see that
company is a factor with 22 levels. Okay.
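The steps just described (read the data, check its structure, convert the company column to a factor, and export with `write.csv`) can be sketched as follows. The URL used in the session is not reproduced here, so this sketch reads a few made-up rows from text instead; the column names and values are placeholders, and the real file has 22 rows and nine variables.

```r
# Made-up rows standing in for the utilities data (placeholders only).
csv_text <- "Company,Fixed_charge,Sales,Fuel_Cost
Company A,1.06,9077,0.63
Company B,0.89,5088,1.56
Company C,1.43,9212,1.06"

mydata <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(mydata)                                  # Company is read as character

mydata$Company <- as.factor(mydata$Company)  # convert to a categorical factor
str(mydata)                                  # Company is now a factor with 3 levels

# Export the data frame to a CSV file. A temporary folder is used here;
# a bare file name would write to the current working directory instead.
out_file <- file.path(tempdir(), "mydata.csv")
write.csv(mydata, file = out_file, row.names = FALSE)
```

With the real data the second `str()` call would report a factor with 22 levels, one per company.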
Okay, let me first check; somebody wants to share. Let me first stop
sharing. Whom have you promoted? Okay.
Good afternoon. Good afternoon. Okay, so I was trying to load the
packages but it's showing me some errors. When I run from here it
says, "Error in library ... there is no package called ...".
Can you first click on the install option up there? You see the line
before that, the line of the warning; follow along that line and
there is "Install". Not there, up. You're almost there. Along that
line there is the word "Install".
I'm not seeing it.
Above your line one there's something like a triangle indicating
some kind of warning. Can you follow along that? There is "Install"
along that line. Okay, you see, the packages are installed. Yes.
Okay, so can you clean up your console so that we see where the
problem is? There's a brush above the console on the extreme end.
Yes, click on that. Okay, now run what you want to run and we see.
It's not moving again, I don't know.
Okay, try to run lines one up to five and we see. No, it's good to
run one line at a time; don't run all of them, because you will not
know where the problem is. Just put the cursor on the line.
It's not responding.
Can you put the cursor on line one?
It's not responding, I've tried.
Can you put the cursor on line eight?
It's not responding completely, so let me reload.
Okay, then you close it off and start again. Are you going to share
again?
Okay, let me share again. It's like I'm having an issue. Let me just
stop sharing so that I don't waste time, and let me see if I can
sort it out myself.
Okay, let somebody else show. Alha?
Sorry, what? Unmute. Sorry, can you please unmute?
Yes, unmuted now, sorry. I have loaded the different packages, and I
tried to run line nine to get the data object we created, but it's
giving me an error in my console, as you can see.
Can you show your screen, please? You're showing us something
different.
Really? Oh, okay. Share screen. I'm sorry.
Okay, so as I was saying, I have loaded the various packages that
we'll be using. I tried to run line nine for the mydata object we
created, but in my console, as you can see, it's saying "cannot open
the connection", so I don't know what the problem is. I also tried
the alternative at line 14, the write.csv, and I also got an error:
object mydata not found.
You can't run line 14 unless line nine is okay.
Back to line nine, how can I solve this?
For line nine all you need is internet, that is all.
I have internet, as you can see. I run it again, see: "cannot open
the connection". So I don't know the problem.
He can't read it; it's showing the red comments.
Yeah, 4.33.
Okay, can you first clean your console? Okay.
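One way to guard against the "cannot open the connection" error is to try the primary source first and fall back to a local copy of the same comma-separated data. The helper below is a hypothetical sketch, not part of the session's script: the function name `read_with_fallback`, the file names, and the demo rows are all assumptions, and a missing local file stands in for the unreachable URL.

```r
# Try the primary source (for example a URL); if it cannot be opened,
# read a local comma-separated text copy instead.
read_with_fallback <- function(primary, fallback) {
  tryCatch(
    suppressWarnings(read.csv(primary)),
    error = function(e) {
      message("Could not read '", primary, "': ", conditionMessage(e))
      read.table(fallback, sep = ",", header = TRUE)
    }
  )
}

# Demo: a small local text file plays the role of the saved copy.
local_copy <- file.path(tempdir(), "utilities_copy.txt")
writeLines(c("Company,Sales", "Company A,9077", "Company B,5088"), local_copy)

missing_source <- file.path(tempdir(), "no_such_file.csv")  # cannot be opened
mydata <- read_with_fallback(missing_source, local_copy)
print(names(mydata))
```

Once the primary source fails, any later line that uses `mydata` would otherwise fail with an "object not found" error, which is why the read step has to succeed first.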
Okay, option number two: you can copy the URL into Google and see if
you can see the data.
Is that the option for line 14?
No, we are on line nine. You copy everything from "https" and then
you paste it in Google.
Okay, copy and paste it. This is what I see: company, fixed charge.
I see some data, but it's not in any format I can use.
Yes. Thomas, are you seeing the data that he's having? It's in text
form.
Yes.
Okay, Alha, what you're going to do: you're going to save this data
as text, and after that you're going to call it into R in the text
format, and that's when you'll be able to see it within the
environment. The data that you're showing is the data that we are
going to use, with the company and the other variables that we want
to consider.
That is comma-separated data. It's okay, download it as text. Open a
text editor, paste it there, then you save.
Okay, let me copy it first. Just to say, I am to save it in text
format, right?
Yes. You copy, open just a text editor, and paste it there. Then
"Save as".
So let me put it under the same mydata folder where I'm working
right now.
Yes.
Okay, then what name could I give it?
You can remove the "text"; it will save automatically as text. Just
remove the "txt".
Okay, save
now okay so now you need to go to your
to your your your studio
hopefully it will
work hope
so you can use the red
lines yes I'm trying to clean my console
but I'm not seeing the broom no no
no yeah just okay you already in the you
need to set a directory where you put
that particular one okay
Dory session choose
directory choose a
directory
yeah is that where it's supposed to
be yes I think I saved it
there okay then try to now put my data
Yes sir.
Just type what I'm telling you: mydata.
I'm not seeing it. You're not typing.
It's here.
No, no, type it up there, in the script.
Yes, on line six?
Oh, on line six, just put it down on a new line: mydata. That's okay. Then you put the assignment sign.
Okay.
Then read.table.
mydata... table?
read.table. Not ftable, read.table.
Yes, read.table.
And then put, in brackets and in quotes, the file name.
I only saved it as that, without the ".txt".
Yes, but the ".txt" will be there.
Okay.
After that, put a comma after the quote.
Okay, after the quote, a comma.
And then sep, for separator, equals, and then a comma inside quotes. The comma should be in quotes.
Okay, in quotes, a comma.
It's okay. But I'm not seeing it; let me push this down so that you can see.
Okay, try to run that and we see.
Yes, it has run.
We are still waiting.
And mydata... it's giving me output in the console; you can see it in the console here. Let me try to clear this up. Okay, it has run.
Now, can you type names(mydata) and we see? names, with mydata in brackets.
Do I run line 10 or nine?
Nine.
It's giving me...
First put mydata in that bracket.
Yes, it's already there.
Oh, that's what it gives you? Okay, now what you do: please go back to line eight. After the sep argument, you put another comma, then say head equals TRUE.
Before or after the equals sign?
After sep, but the order doesn't really matter. We want the first line to be read as the header, where the names are.
Okay, sep equals comma in quotes, and after that, you mean head equals T?
Head, h-e-a-d.
Okay, equals T, capital T?
Yeah. Run again.
Okay.
Yes, okay, good. I think you're ready to go.
Thank you very much.
Thank you. Okay, let me stop sharing. Thank you so much.
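The import that was just dictated can be sketched as follows. This is a minimal illustration, not the course script: the file name and the three rows of values are invented stand-ins so that the example runs on its own, and the argument is written with its full name, header.

```r
# Hypothetical stand-in for the text file saved from the browser.
# The values are invented; only the comma-separated layout matters.
txt_path <- file.path(tempdir(), "utilities.txt")
writeLines(c(
  "Company,Sales,Fuel_Cost",
  "A,9000,0.6",
  "B,5000,1.6",
  "C,9200,1.1"
), txt_path)

# sep = ","     : fields are separated by commas
# header = TRUE : the first line holds the column names
mydata <- read.table(txt_path, sep = ",", header = TRUE)

names(mydata)  # column names taken from the first line
head(mydata)   # first rows of the imported data
```

Without header = TRUE, read.table would treat the first line as data and give the columns default names, which is what the names(mydata) check in the session was meant to catch.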
Sorry about that. Can you hear me now? Am I low? Okay, great. I was also saying that we have another option: you just copy the URL into Google, and when you get to the point where you can see your data, you right-click and save it as a CSV. With that you'll have a file that opens in Excel, which you can import the way you've been doing it: going to Set Directory, Choose Directory, and guiding R to where you stored your data. That is another easy approach that you can use. Like we mentioned, if you get stuck there's always a solution, and from that you're able to learn more. Okay, I'm going to share my script again and then we proceed.
Okay, so I believe now we are together. When you come, for example, to line 21, with head, you can see the first six rows of your data set. We've been talking about that from the beginning. We also looked at how you can draw pairs: in case you want a scatter plot matrix, we mentioned that you can use the pairs command (or ggpairs from the GGally package). If you run that and check in your plot window, you'll see many plots. I'm not going to go into that, because we spent a lot of time explaining what happens within the scatter plot matrix.
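As a reminder of those two commands, here is a small sketch. The data frame is a made-up stand-in for the course data (the column names and values are assumptions), and base R's pairs is used; GGally's ggpairs would be an alternative if that package is installed.

```r
# Hypothetical toy data standing in for the course data set.
toy <- data.frame(
  Company   = c("A", "B", "C", "D", "E", "F", "G"),
  Sales     = c(17000, 16000, 9000, 8600, 4000, 3500, 7000),
  Fuel_Cost = c(0.4, 0.5, 1.0, 1.1, 1.9, 2.1, 1.3),
  Cost      = c(150, 160, 170, 180, 190, 200, 175)
)

head(toy)         # first six rows only, although toy has seven
pairs(toy[, -1])  # scatter plot matrix of the numeric columns
```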
Let's continue. From what we have, we can first have, for example, a scatter plot of fuel cost against sales. You're going to use the fuel cost (this is the column I'm highlighting in the console, counting one, two, three) together with the sales. So if I run line 27 and you check in the plot window, we can see a scatter plot of the fuel cost and the sales. As you can see, we can pick out some of the groups; we can tell that these data points fall into three sections. But what about doing the coloring, such that it comes out more clearly? Remember, these points stand for the companies that we are considering within our data set. So I include text for the companies: I want to put the labels, and I want my position to be four. If I run that, you see from the plot window that there is some overlap. I believe that you can zoom out from your end. Let me stop sharing that and briefly share this. What we have here is that we can see the different companies, and we can also see some overlap. We see some of the companies at the extreme end of the sales, some of the companies in the middle, and some of the companies up here, considering the fuel cost and the sales. So I'm going to stop sharing this and go back to R, then I'll explain more.
Okay, in case I want to reduce the overlap, I can set the size of the text that I want, which is under cex, here equal to 0.4. You can change it to any value that you want; for example, maybe you can give it 0.5, etc. In this particular case it is just to help me reduce the overlap that I'm seeing within the plot. So I run line 31... let me first remove that. I'm going to run my line 27 again, then I run line 31. Okay, so I'm going to zoom and then share this.
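The labelled scatter plot described above can be sketched like this. The toy data frame is invented for illustration (the course data is not reproduced here); plot, text, pos and cex are standard base R graphics names.

```r
# Hypothetical toy data: three loose groups of companies.
toy <- data.frame(
  Company   = c("A", "B", "C", "D", "E", "F"),
  Sales     = c(17000, 16000, 9000, 8600, 4000, 3500),
  Fuel_Cost = c(0.4, 0.5, 1.0, 1.1, 1.9, 2.1)
)

# Fuel cost on the vertical axis, sales on the horizontal axis.
plot(Fuel_Cost ~ Sales, data = toy)

# Add the company names next to the points.
# pos = 4   : draw each label to the right of its point
# cex = 0.4 : shrink the text to 40% of the default size
with(toy, text(Sales, Fuel_Cost, labels = Company, pos = 4, cex = 0.4))
```

Because text draws on top of the existing plot, re-running plot first (as done in the session) clears the earlier, larger labels.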
All right, so what we are seeing in this particular case is that at least we've reduced the overlap, just by the size, so we can clearly see the different companies when we compare them in terms of the fuel cost as well as the sales. Can I hear from you? Based on the plot that we have, is there any conclusion or explanation that you can give? Just type in the chat: what would you say, based on what we are looking at right now? Then I'll supplement. Any comment on the results?

Okay, so Alha is saying three clusters. "Nevada sells much at low cost." Somebody is saying they're seeing eight groups. "No correlation." "The names of the companies." "No correlation." "Three clusters." "No differences." "High variation." "We are seeing three groups." I'm just reading the comments that you're putting in the chat. "No clarity." "Three clusters." Okay, can I get more inputs? Somebody is saying four clusters. "No linear relationship." Four, then three. "Dispersed, some similarity."

Okay, so let's look at this. Based on how our companies are split, and since we are looking at cluster analysis, we have a group of companies whose sales are high and whose fuel cost is low, especially these companies, one, two, three, four. I'm looking at the extreme right, where we've got Nevada, Puget, Texas and Idaho. So we have that particular group with high sales and low fuel cost. We have another group here in the middle with medium sales as well as medium fuel cost. And then we have another group up here with low sales but high fuel cost. So, in other words, based on what we are seeing, we have three clusters. Okay, I'm going to stop this and share my R script.
Okay, all right. I was just elaborating where we wanted to see how the companies are distributed and how they fall into the different clusters based on the fuel cost as well as the sales. Those are the two attributes that we considered in this particular case. That is just to get a feel of the data that we have.

The next thing that we are going to do is normalization. Somebody was asking: in case I've got my particular data, how do I do normalization? Just to take you back a little bit, remember where we had head(mydata). I just want to bring this out again. So this is what we have. In the data set that we are considering here, under the fuel cost, for example, there are some very small values, like 1.5 and 0.7, yet within the same data set we've got the sales, and we see that the sales are in the thousands, like 9212, then 11,127, etc. So the reason why we are normalizing is that we want to ensure that we are working on the same plane. We standardize all the different variables that we are using, to reduce noise. Usually, when you're normalizing, you subtract the mean and then divide by the standard deviation. I'm not saying that this is the only method; there are other methods that you can still use in case you want to normalize your data set.
With the first one, looking at line 42, I want to take the whole data set, mydata, and remove Company, so that I remain with only the quantitative data. This is what line 42 does: it removes the first column, with all its rows. If I run that particular line and then run z, what I see is all my data points, but now without the company, because these are the attributes that I want to normalize.

So I want to get the mean, and I also have to get the standard deviation. I'm going to use the apply command. I'm dealing with the z data set, the one without the company, and I'm considering columns: two means that we are working with the different columns. Then I want to compute the mean. I'm storing everything under my object called means. If I run line 44, I will see the different means within my environment. I can also do the standard deviation. What I'm changing here is just the sd, but I'm using the same command, and I store everything in my sds. That is for the standard deviation.

If you remember, we are doing a sort of transformation: we want to have our data on the scale of a standard normal distribution, with mean zero and variance of one. That's what we're trying to do. And I've already mentioned that we do the normalization because we want all our variables to have the same level playing field.

So now nor is going to be my object holding the normalized data set. I use the scale function on my z (remember, z is the data set without the company), then center, which is the means, and scale, which is the standard deviations. I'm going to run this to come up with a normalized data set. After that I can run nor, and when I check in my console I can see what is now a normalized data set.
set okay so having normalized now we are
going to move to go ahead and see how
can we compute the distance we are
talking about the the distance between
the different clusters or within the
different data points and uh in this
particular case in case you're going to
use the ukan distance we can use the
command DST and then inside I put my
normalized data set i'm storing
everything under the distance
object okay I'm going to first run run
the three lines then I pause in case
there's a question so in case I run that
okay I can check the
distance um you find that uh what I'm
having is that
uh um that I'm I'm seeing many many
points and so I want to reduce such that
I want to make it more compact by
reducing the number of decimal points
then I'll explain so in in case I put
print in bracket I want to print the
distance but this time around to three
with three decimal points so in case I
run that okay
um this is what I'm
getting so um if I run up to three
decimal points this is what I'm getting
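A minimal sketch of the distance step, again on invented toy data. dist uses the Euclidean distance by default. Note that the digits argument of print sets the number of significant digits rather than a fixed number of decimal places, so the output is only roughly "three decimal points".

```r
# Hypothetical toy data, normalized as in the previous step.
toy <- data.frame(
  Company   = c("A", "B", "C", "D", "E", "F"),
  Sales     = c(17000, 16000, 9000, 8600, 4000, 3500),
  Fuel_Cost = c(0.4, 0.5, 1.0, 1.1, 1.9, 2.1)
)
nor <- scale(toy[, -1])

# Pairwise Euclidean distances between the rows (companies).
distance <- dist(nor)

distance                     # full precision
print(distance, digits = 3)  # more compact
```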
Remember, we are printing the distance between the different clusters. Let's take an example: the distance between company seven and company 11. This is where they intersect; I'm considering 11 here. You find that this six is a very big distance, so it implies that we cannot cluster company seven and company 11 together. But, surprisingly, if you move down, we have 1.66, so it implies that we can group company seven with company 12, because the distance in between is very small. It implies that these companies are similar on the attributes of interest that we are looking at. You can look through again, and we can also take another example, like 1.49, considering company four and company 10: the distance is very small, so we can group them under one cluster. Where we see that the distance is very big, it implies that we cannot group those two companies within the same cluster. You can also check through, and we can still see, for example, that with company four and company 20 the distance is very small, so we can also have four and 20 within the same cluster.

Okay, let me first check in the chat and see if we've gotten how to read the distance matrix. Is this part okay?
Yeah... I haven't understood either.
Do you mind sending me the Google Drive link?
Okay, I can repeat it again. Let's go back: somebody is saying that he has not got the normalization part. I'm going to go back to line 39.

I think it was yesterday that we were talking about normalization, or having our data points within the same unit, where we used the scale command. If you remember, in yesterday's presentation we used scale so that we have all our data points within the same scale. So here, when I talk about normalization, it is the same thing. Sometimes we call it data transformation. When you're doing this data transformation, you're putting your data on the scale of a standard normal distribution, which has a mean of zero and a standard deviation of one. That is what we want to do with our data set, because from what we have, our data has got different means and different standard deviations.

I showed you, for example, looking at the first six rows here: let's go to the fuel cost. I was saying that the cost of fuel here is, for example, 1.55; the values are really very small. But when we come to the sales, there is another variable with big values, like 7077, and when you move on we have another company with sales of 3,300, and then 11,127. So the point is: why are we going for normalization? Here the noise is very big. When we normalize, we are trying to reduce the noise within our data set. We need to work from the same playing field, because for the sales the values are very big compared to the other values, while for the fuel cost the values are too small. That is the reason why we need to go ahead and normalize our data set, and to do that we run line 49.
I first removed the variable Company; that is line 42. I viewed my data set, z, now without the Company variable. Then I wanted to get the means for the different columns, which I stored in means, so I ran line 44, as well as the standard deviations. After that, to come up with a normalized data set, I ran line 49, which takes as inputs what I have above. So if I run line 50, I get a normalized data set, and this is what it is showing me, with the fixed charge, the cost, and so on. At least in this case we see that there is no longer a big difference in scale between the attributes, or variables, that we have. Once we have done the normalization, the variance is now one and the mean is zero. Okay, let me pause and check in the chat. Are we on the same page there?
We removed Company because Company is categorical: it has the different companies, one up to 22. We wanted to get the quantitative data, the variables with measurements, onto the same scale. Since Company is categorical, that is the reason why we first removed it; then we normalized the rest of the variables, which are quantitatively measured. Okay, let's continue.

"What does sds mean?" sds is the standard deviations, like we defined earlier here. We want to get the standard deviations for our attributes, and the means for our attributes; that is how we are able to do the normalization.
okay somebody saying that what if the
main and the max limit I mentioned that
uh there are different options that you
can use and as you move on as you become
um as you get familiarized with our you
might find that well using this
particular approach maybe it's lengthy
and it is taking a lot of time and then
there's also another approach that you
can use we looked at the global Min
marks which you're suggesting it can
also still will um give us similar
results so it doesn't matter whatever
approach that you take provided you the
the the main aim is for you to normalize
your data set so I agree with
[Music]
you okay so
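For comparison, here is a small sketch of the min-max alternative raised in the chat, next to the mean and standard deviation approach. The helper name minmax and the toy values are assumptions made for illustration. Both methods put the variables on comparable scales, but the rescaled numbers are not identical, so distances and clusters computed from them can differ in detail.

```r
# Hypothetical toy data (numeric columns only).
z <- data.frame(
  Sales     = c(17000, 16000, 9000, 8600, 4000, 3500),
  Fuel_Cost = c(0.4, 0.5, 1.0, 1.1, 1.9, 2.1)
)

# Min-max scaling: map each column onto the interval [0, 1].
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
z_minmax <- as.data.frame(lapply(z, minmax))

# Mean / standard deviation scaling, as used in the session.
z_score <- as.data.frame(scale(z))

sapply(z_minmax, range)  # every column runs from 0 to 1
sapply(z_score, range)   # columns are centred on 0 instead
```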
Okay, somebody is asking: why negatives and positives? Remember we said that when you normalize, the mean is zero and the variance is one. So when you add up all the positives and the negatives under each respective column, you're supposed to come up with zero. That's why we are having it like that. "So what exactly does..." Okay.
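The point about negatives and positives can be checked on a tiny made-up vector: values below the mean become negative, values above it become positive, and together they cancel out.

```r
# Invented example values.
x <- c(2, 4, 6, 8)

# Standardize by hand: subtract the mean, divide by the standard deviation.
zs <- (x - mean(x)) / sd(x)

zs        # values below the mean are negative, above it positive
sum(zs)   # the positives and negatives cancel
sd(zs)    # and the spread is one
```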
Now, having normalized the data, we are moving on to how we calculate the distance between two points. In this particular case, we are going to consider the distance between our objects, which are the companies. Here I'm going to use the dist command: I want the distance of nor, which is my normalized data set, and I'm storing everything under an object called distance. After that, I want to print the distance to three decimal places, to make it more compact so that I can see it clearly. So I run line 54 and move on to line 55, and you can see that from line 55 we have one, two, three, four, five, six: up to six decimal points. So I move to line 56, where I reduce the number of digits, the number of decimal points; I'm telling R to give it to me with up to three decimal points. If I run line 56, this is what I get. On this side, if you check here where I'm highlighting, we have the observation numbers; these are the companies, from one up to 21. And when you look on this other side, we also have this line that runs from one, two, three, up to 22.
So I was just giving an example. Remember what we said: when you want to know the distance between the clusters, those that have small distances are put within one group, and that's how we are able to do the grouping. This particular company and this one we can put under one group because they've got small distances between them, implying that they share similar attributes of interest. And I gave an example: consider this particular distance, which is the intersection of company seven and company 11. Six is very big, so in this case you cannot group company seven and 11 together. We want to group companies where the distances are very small. If we take the example here of 1.6, we can see that seven and 12 can be grouped under one cluster, because the distance between these companies is very small. You can also check this particular distance, the one of 1.4, which is the distance between 10 and 13, so we can group company 10 and 13 into one cluster. Where we've got big distances, it implies that those particular companies cannot be grouped within one cluster. That is the message we are trying to get from line 54 up to line 56.

What we are noticing here is that the print gives us the distances among observations. These observations are our records, the different companies that we have, so we are asking ourselves how close or far one observation is from the other. I picked the one of about six, between company seven and company 11. Because the distance is very big, we cannot put them into the same cluster, and we can say that they are dissimilar. Where the distance is much smaller, it implies that there are a lot of similarities between those two companies as per our attributes of interest. Whereas where we've got a bigger distance, they are dissimilar, implying that these companies are different: we cannot put them under one cluster. Okay, let me check in the chat. Is that okay?
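Instead of locating an intersection by eye in the printed triangle, a single pair can be looked up directly. The sketch below assumes invented toy data; as.matrix turns the dist object into a full square matrix that can be indexed by row and column.

```r
# Hypothetical toy data, normalized and turned into distances.
toy <- data.frame(
  Company   = c("A", "B", "C", "D", "E", "F"),
  Sales     = c(17000, 16000, 9000, 8600, 4000, 3500),
  Fuel_Cost = c(0.4, 0.5, 1.0, 1.1, 1.9, 2.1)
)
nor <- scale(toy[, -1])
distance <- dist(nor)

# Full square matrix: entry [i, j] is the distance between rows i and j.
d_mat <- as.matrix(distance)
round(d_mat, 3)

d_mat[1, 2]  # companies 1 and 2: small distance, similar
d_mat[1, 5]  # companies 1 and 5: large distance, dissimilar
```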
Sorry? "Show the point"... like saying that it's above one. Okay, let me illustrate again. When it is about something above one, remember we said that with the intra-cluster...
Sorry, is she sharing? Have you promoted her? Alma, can you share?
...for the intra-cluster distance. Yes, someone is saying that small is best, because we are grouping those with similar characteristics. Yes, because remember we said that, within clustering, if objects have similar characteristics, it implies that they are very close to each other, so the distance between them is going to be small. So in this case, I'm considering the ones within the range of one; above that, it is a big distance. It's like when we looked at the toy example, where within a cluster we had all the triangles.
Esa, can you share your screen?
Yes, please. So I ran line nine and this is what I got.
Okay. When you look in the environment, what do you see?
I'm seeing 22 observations of nine variables, so I have that one.
Okay, so what is the problem?
I want to view the data.
You want to view the data? Okay, run line 17.
Okay.
So what are you seeing?
Yes, I've seen the data, but will it not show in the console?
Come again?
I don't know if you get me. I'm asking whether I would be able to see this data in the console.
Yes, you can. Okay, let's go back. Click on the cluster_analysis.R tab. Click on it. Okay, press Enter after that line, just to create space.
I should create space?
Press Enter there and create space. Okay, just type mydata. No, don't put a space; there's no space. You're calling your data set, mydata. Don't put a space. Okay, then run that line.
Am I on the right track, please?
I think so, unless you have a question.
Okay, I think I'm okay. At least I can follow from here. Thank you very much.
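A short sketch of ways to inspect a data frame in the console, as asked in this exchange. The toy data frame is an invented stand-in for mydata, and the transcript does not say what line 17 of the course script contains.

```r
# Hypothetical stand-in for mydata.
toy <- data.frame(
  Company   = c("A", "B", "C", "D", "E", "F"),
  Sales     = c(17000, 16000, 9000, 8600, 4000, 3500),
  Fuel_Cost = c(0.4, 0.5, 1.0, 1.1, 1.9, 2.1)
)

dim(toy)      # number of observations (rows) and variables (columns)
str(toy)      # compact summary of each column
toy           # typing the object's name prints all of it in the console
head(toy, 3)  # or print only the first few rows
```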
Okay, thank you too for sharing. Anybody else who wants to share? Samuel wants to share after this one. Okay, Samuel, can you share? Are we on the same page? Okay, please share.
Okay... I'm looking for my own...
There's a green button which says Share Screen. You do that.
I'm looking for my app. It's already open. Can you see my screen now?
Not yet. I only see your face; I don't see the R script.
Oh, okay. Are you seeing the R script now?
No, not really. I'm seeing something from the internet, because what it has is "Top 10 machine learning methods with R". It's not your R script. Share the script where you've been working from.
That's what I'm trying to share.
No, first stop sharing. You selected the browser, GitHub. Stop sharing and then select the other screen. I know that you have multiple windows open, so that's why.
Okay. Must I stop sharing?
Or I can stop sharing for you. Maybe you can stop from there. Okay, let me stop it. Now I've stopped, so please click Share, then select the window which has R.
You again... First go ahead and close that window which has the browser. Let me stop again for you. But when you're selecting, make sure you select R, because now you are sharing the browser. Are you able to? Okay, close that window where you are searching for the top machine learning methods, and leave only the R window on your screen. You might have multiple things open. Have you closed it?
I'm seeing RStudio.
Is it this?
It's okay now, yeah, go ahead.
When I try to run line eight, line nine, I don't get anything. I think the data is not... I tried to set the working directory, but nothing is coming up. I think it's from the CSV file or what? So can you guide me?
I think we looked at that already, but okay. Samuel, in the console, on the top right, there's a button that is red. Are you seeing that red button?
Yes.
Don't stop it. It is showing you that R is still running, so you have to wait until it is complete, and then you'll see the data. But I think you can borrow from the previous person who shared: just copy the URL into Google, and when the data set opens, save it as a CSV. Then you can import it directly in the normal way. That can also help you solve the problem. Copy that line.
Okay. Is that okay?
Mm. Paste it in your browser. Are you seeing anything?
Yeah, something. It's in text form, so should I save it?
Yes. Right-click and save it as a CSV. That way you have it in a form that opens in Excel, and you can import it normally: you choose the directory and then import it into R.
Okay, are you seeing the different options there? Save As?
Save As... I think it's under File name. Sorry, the "Save as type" is only Microsoft Excel Comma...
Yes, yes. Okay, what we can do: stop sharing, then I demonstrate it for you very fast, and then we continue. Is that okay?
Yes, yes.
All right, so just do what I do. I'm going to share my browser. Did you get something like this when you pasted the URL?
Yes.
Okay, so you right-click anywhere within the page and then click Save As. Are you there?
Yeah.
When you click Save As, it gives you the option of saving as a CSV. For me, by default, it is giving "utilities" as the file name. Then you click Save, and you come up with a CSV file. Are you getting that?
Yes, yes.
Okay. Your CSV file should look like this, and then you can import it the normal way. Have you gotten the CSV?
Okay, let me confirm it. Yeah, I'm seeing it.
Try to open it.
Yes, exactly: Company, Fixed charge, up to Fuel cost. Yes, it is there. So how do I import it?
You import it using the methods that we learned earlier for importing data, because the challenge was getting the data.
Okay, that is under setting the working directory?
Yes, you go to the working directory, then choose the directory from the folder, and then you'll have it in R.
Yeah, I think I'm okay now. Thank you so much.
All right, let's continue.
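Once the CSV has been saved locally, the import can be sketched as follows. The file written here is an invented two-row stand-in so that the example runs on its own, and the folder in the commented setwd line is hypothetical.

```r
# Hypothetical stand-in for the CSV saved from the browser.
csv_path <- file.path(tempdir(), "utilities.csv")
write.csv(
  data.frame(Company = c("A", "B"),
             Sales = c(9000, 5000),
             Fuel_Cost = c(0.6, 1.6)),
  csv_path, row.names = FALSE
)

# In practice: point R at the folder holding the file, then read it by name.
# setwd("path/to/your/folder")   # hypothetical path
# mydata <- read.csv("utilities.csv")

# Here the full path is used instead, so no working directory is needed.
mydata <- read.csv(csv_path)
str(mydata)
```

read.csv assumes a comma separator and a header line by default, so no sep or header arguments are needed here.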
Going back to my R script. I'm sorry, I'm going to be a little fast now so that we complete; other things you will learn on your own as you practice. Okay, so we are moving on to see how we can do hierarchical clustering using the dendrogram, which is also called agglomerative clustering. To do that, we are going to do everything by default. Remember we talked about complete linkage and single linkage; by default, R does complete linkage. Now we take our distances (remember the distances between the companies) and use the function hclust, where h is for hierarchical and clust is for cluster. I'm passing in my distances, and I'm storing everything under mydata.hclust. So I run line 70. If I want to see what it looks like, I can run my object. Now I want to plot the dendrogram, so I call plot and put in the object name that I got earlier. I can run that, and if you check in your plot window, it has changed. Let me zoom out.
So if I zoom out, this is what I have. You can even see that the method that has been used to cluster is the complete one. Remember the complete linkage and the single linkage; in this particular case, by default, R does complete linkage. What we are seeing here is that, at the beginning, every company is its own cluster. Then, by checking the distances in between, the algorithm goes on grouping those particular companies. We can go back and check: for example, here company 10 and company 13 were grouped together. Likewise, company 12 and company 21 were grouped into one cluster. These are our nodes, and where you see a join, that was a grouping. If you look at 12 and 21: initially each one was a cluster, but given that the distance between them is small, we can collapse them into one cluster. So I want us to go back to the distances and check 12 and 21, then 10 and 13, and see whether the distances between them are really small. We can also check 14 and 19 here, and then 1 and 18. Just those few, and then I'll come back here.
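The clustering and plotting steps can be sketched as below, on invented toy data rather than the course data. hclust uses complete linkage unless a different method is given, which is what the plot's subtitle reports.

```r
# Hypothetical toy data, normalized and turned into distances.
toy <- data.frame(
  Company   = c("A", "B", "C", "D", "E", "F"),
  Sales     = c(17000, 16000, 9000, 8600, 4000, 3500),
  Fuel_Cost = c(0.4, 0.5, 1.0, 1.1, 1.9, 2.1)
)
distance <- dist(scale(toy[, -1]))

# Agglomerative hierarchical clustering on the distances.
mydata.hclust <- hclust(distance)

mydata.hclust          # prints a short summary, including the method
mydata.hclust$method   # "complete" unless another linkage is requested
plot(mydata.hclust)    # dendrogram, leaves labelled by row number
```

Another linkage could be requested with, for example, hclust(distance, method = "single").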
If I go back and share my R script,
let's check again in our distances.
All
right, this is what we had. Let's check:
they have grouped 10 and 13
together. So if I check 10 and
come to 13, you can see that
the distance in between was very small.
That's why, when we check our plot,
they are grouped under one
cluster. So
10 and 13 were put together.
We also saw 12 and 21. In case we
check 12 and then go to 21, we
still see that the distance between
those two companies was
1.38. That's the reason why those
two companies also formed a cluster. I'm
trying only the last one, which was 1
and 18. If you check 1 and move down to
18, we still see that those two
companies were close to each other, hence
they were grouped under one cluster.
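The check described above (reading a pair's distance from the distance table and comparing it with what the dendrogram merged) can be sketched as follows. This is a minimal sketch on made-up data; `toy`, `dist_mat` and the other names are hypothetical and are not taken from the session's script.

```r
# Hypothetical stand-in data: five observations on two variables.
toy <- data.frame(
  x = c(1.0, 1.2, 5.0, 5.5, 9.0),
  y = c(1.0, 1.1, 4.0, 4.6, 0.5)
)

toy_dist <- dist(toy)            # pairwise Euclidean distances
dist_mat <- as.matrix(toy_dist)  # full square matrix, easier to index

# Distance between observation 1 and observation 2.
dist_mat[1, 2]

toy_hclust <- hclust(toy_dist)   # complete linkage by default

# In the merge matrix, negative entries are original observations,
# so the first row shows which two observations were grouped first.
first_pair <- sort(-toy_hclust$merge[1, ])
first_pair

# The height of the first merge is the smallest distance in the table.
first_height <- toy_hclust$height[1]
all.equal(first_height, min(toy_dist))
```

In the session's data the same idea applies to pairs such as 10 and 13 or 12 and 21: the pairs joined lowest on the dendrogram are those with the smallest entries in the distance table.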
Okay, I think that makes a little bit of
sense for those who were asking
how we are saying that,
with this particular
distance, we are grouping them
together. Also, you can put the
company labels in case you're interested;
that is line 72. In case you
run line 72, you find that, instead of
having numbers, the numbers are
replaced by the company names. Let me
stop sharing this.
Okay, just to show you the effect, and
then we continue. Yes, now instead of the
numbers, what we are
seeing are the companies.
For example, we've just
seen that Madison is close to
Northern. It
implies that, when you're looking at New England
and United, instead of two
clusters they collapse into one
cluster. Then, as we move ahead, you
find that what we've collapsed into
one cluster becomes closer to
Pacific, so we keep collapsing
until, for example, we get
to height seven, whereby everything has
been grouped under one complete
cluster. Like I was showing you,
this approach is
subjective. For example, in case you put a
line and it cuts across here,
you can count the clusters
that you have; in that particular case
you can count clusters one and two. Okay.
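Putting a line across the dendrogram and counting the clusters it cuts can be sketched with `abline()` and `cutree()`. This is a minimal sketch on made-up data; the cut height of 3 and all object names are assumptions chosen for illustration, not values from the session.

```r
# Hypothetical stand-in data: six observations forming three tight pairs.
toy <- data.frame(
  x = c(1.0, 1.2, 5.0, 5.5, 9.0, 9.4),
  y = c(1.0, 1.1, 4.0, 4.6, 0.5, 0.9)
)

toy_hclust <- hclust(dist(toy))  # complete linkage by default

# Draw the dendrogram and a dashed horizontal line at the chosen height.
plot(toy_hclust)
abline(h = 3, lty = 2)           # h = 3 is an arbitrary, illustrative height

# Cut the tree at that height: each branch crossed by the line is a cluster.
groups_h <- cutree(toy_hclust, h = 3)
table(groups_h)

# Alternatively, ask for a fixed number of clusters instead of a height.
groups_k <- cutree(toy_hclust, k = 2)
table(groups_k)
```

Moving the line up or down changes the number of clusters, which is why the choice of cut is a judgment call for the analyst.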
I'm going to stop sharing that and then
go back to my
script. Okay, I'm going to stop. I'm going
to do this spond on the next
St okay um because of time I'm going to
run only this last one, and the rest
we can keep discussing, because you
also need to fill out the evaluation
form. So, line seven 23: in case I want
all the labels to align
on the same line, I include an argument
called hang, set equal to minus one. When I
run that, this is what happens, which
I'm going to show
you so in case I stop
sharing and then
um
Yes, so we are getting the same
plot, only that this time around
all the labels are on the same
line, but we're using the same
method. We are
building everything from the
different distances that we calculated
to come up with the
dendrogram.
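The two plotting options mentioned, company names as labels and `hang = -1` to align the labels, can be sketched like this. It is a minimal sketch: the data frame and the company names are made up for illustration and are not the session's data.

```r
# Hypothetical stand-in data with a name column to use as labels.
toy <- data.frame(
  company = c("Alpha", "Beta", "Gamma", "Delta", "Epsilon"),
  x = c(1.0, 1.2, 5.0, 5.5, 9.0),
  y = c(1.0, 1.1, 4.0, 4.6, 0.5)
)

# Cluster on the numeric columns only.
toy_hclust <- hclust(dist(toy[, c("x", "y")]))

# Dendrogram with names in place of row numbers.
plot(toy_hclust, labels = as.character(toy$company))

# Same tree, but hang = -1 draws every label on the same baseline.
plot(toy_hclust, labels = as.character(toy$company), hang = -1)
```

Both calls draw the same tree from the same distances; only the placement of the labels differs.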
okay so
looking at our R scripts,
we've not completed everything, but since
we said that it's a learning
process, we can keep learning.
The other bit that we missed here
was going to give you cluster
information on how the clustering was
done and the grouping. So I'll just
leave this for you to practice,
but in case you get any challenge,
feel free to contact us and we'll
guide you. The other thing:
as you run through this R script,
I've put different notes that you can
follow. Where you see the
comments, there are more explanations on
the different plots that you're going to
come up with and, at the end of
the day, on the different
information that we
discussed during the theory
part. So over to you, Professor Susan.
We have a short
evaluation that we would like you to
fill in. We are going to post it in the
chat and in the Q&A. It is in Google
Docs, so you just fill it in and send it to the
highlighted email so that we can get
your feedback. The feedback is going to
help us improve the way we deliver to
you. So in one or two minutes, let
me have my colleagues send that
evaluation. Is the evaluation being
sent? Okay, great, so I don't need to go
and
open did I go out
no you're
in
okay
Okay, so please take some time and
fill in the evaluation; it is
short. We are really so
grateful for the time you've given us to
share our knowledge with you. It's just
the
beginning. We hope to have more of
these; when our organizers call us, we
will always be ready to come and share
with you.
So I'll give
you another two minutes to fill in that
questionnaire, then I'll hand over to Dr
raru to close.
In the meantime, Dr Thomas can say
something, since he has been
quiet. Hello
everyone. Thank you so much
for this
week. We are happy that you
have been willingly sitting
here for five good days listening to us;
we don't take that for
granted. We would also like to thank
RUFORUM for giving us this
opportunity. We're also saying that, in
case you want us to do
a training for you, we are always
available. So you can talk to your
bosses, or look within your projects.
If you want, we can even come
physically to your country
or
whatever. So I want to re-emphasize
that what we have done here is to open
your
eyes, so that you can go and read more,
look for more, and also utilize this
method. R is something that you need to
use from time to time for it to stick in
your head;
if you don't use it, it will
disappear. But the good thing is
that you have the internet always at
hand. Even me, I really check a lot;
if I forget
something, I check. I started using R, I
think, in
2005,
so it has been a long time, but I
am still learning; something new is
always coming.
So we are very grateful for this
opportunity and we hope we have been of
great help to you. Thank you so
much. Thank you, participants. We also
thank RUFORUM for
organizing this training. I now hand over
to Dr
r or any RUFORUM staff who is
online. Thank you, thank you, thank you
so
much for having this training
for five good days, for 3
hours every day. We are so grateful to
Professor
Susan dren and Dr Thomas for creating
time to come and share the knowledge
with Africa. We are so grateful to them. I
am seeing colleagues who are asking about
certificates. Unfortunately, in this
training we do not provide certificates;
it is just to enhance your knowledge base
in statistical analysis. This is to
your own benefit, not just to have a
paper, so unfortunately we shall not be
issuing papers. I see others who are
constantly asking for contacts for our
facilitators. Please type their names on
LinkedIn and link up with them there;
they'll be happy to help you
wherever you need their help.
I would also encourage all of us to be
in our WhatsApp groups; we also get help
from colleagues when we are facing
challenges. So let me take this
opportunity to thank all of you for
attending this training. Let's meet again
when we have other opportunities for you.
Thank you so much, everyone. Have a good
afternoon. Bye.