Exploratory Data Analysis: Concepts, Methods, and Practical R Implementation

Name: Day 3: Research Methods and Statistics Training
Uploaded: 2026-01-16T09:46:01.626816+00:00
Channel: RUFORUMNetwork

YouTube video ID: kOWQ4OBJigg

Source: YouTube video by RUFORUMNetwork

Introduction

The session began with a brief personal introduction and a reminder that the focus would be on Exploratory Data Analysis (EDA) – the first and essential step before any inferential statistics or modeling.

What is Exploratory Data Analysis?

Definition: An approach that summarizes the main characteristics of a dataset using numbers and graphs.
Purpose:
Build confidence in the data.
Detect relationships between variables.
Identify data entry errors, outliers, and violations of statistical assumptions (e.g., normality for ANOVA).
Guide the choice of analytical tools and hypotheses.

Types of Variables

Quantitative (numeric) – measurable on a scale (e.g., height, salary, years of education). These can be continuous or discrete.
Qualitative (categorical) – place observations into groups (e.g., gender, job category, minority status).
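
As a minimal sketch of how this distinction might be checked in R, the snippet below separates numeric columns from factor columns. The data frame `emp`, its column names, and its values are made up for illustration and are not the session's data.

```r
# Hypothetical toy data; column names and values are illustrative only.
emp <- data.frame(
  gender = factor(c("Male", "Female", "Female", "Male")),
  jobcat = factor(c("Manager", "Clerical", "Clerical", "Custodial")),
  educ   = c(15, 12, 12, 8),               # years at school (a count)
  salary = c(57000, 21450, 21900, 30750)   # current salary (a measured amount)
)

# TRUE for numeric (quantitative) columns, FALSE for factor (categorical) ones.
is_quant <- sapply(emp, is.numeric)

quantitative <- names(emp)[is_quant]
qualitative  <- names(emp)[!is_quant]

print(quantitative)
print(qualitative)
```

An identifier column stored as a number would also pass `is.numeric`, so the result still needs a manual check against the meaning of each variable.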

Univariate vs. Multivariate EDA

Scope	Numerical Tools	Graphical Tools
Univariate	Mean, median, mode, variance, standard deviation, inter‑quartile range, frequencies	Histogram, box‑plot, bar chart, stem‑and‑leaf
Multivariate	Covariance matrix, cross‑tabulation	Scatter plot, grouped box‑plot, colored histograms

Five‑Number Summary & Box‑Plot

The five‑number summary (minimum, Q1, median, Q3, maximum) provides a quick view of the distribution and is visualized by a box‑plot. Outliers are identified using the rule:

Lower bound = Q1 – 1.5·IQR
Upper bound = Q3 + 1.5·IQR

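Here IQR = Q3 – Q1. Values outside these bounds are flagged for further investigation. In the session's years‑of‑education example, Q1 = 12 and Q3 = 15, so IQR = 3 and the bounds are 7.5 and 19.5.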
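
The rule above can be sketched in a few lines of base R. The vector `educ` is hypothetical: its values were chosen only so that the quartiles match those quoted in the session (Q1 = 12, Q3 = 15); it is not the real dataset.

```r
# Hypothetical years-at-school values, chosen so that Q1 = 12 and Q3 = 15.
educ <- c(8, 12, 12, 12, 12, 12, 15, 15, 20, 21)

# Quartiles using R's default quantile definition.
q <- quantile(educ, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Outlier fences from the 1.5 * IQR rule.
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Five-number summary and the values that fall outside the fences.
five <- c(min = min(educ), q1 = q[1], median = median(educ),
          q3 = q[2], max = max(educ))
flagged <- educ[educ < lower | educ > upper]

print(five)
print(c(lower = lower, upper = upper))
print(flagged)
```

Quartile definitions differ slightly between functions and packages, so fences computed this way can differ a little from those drawn by a particular box-plot routine.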

Measures of Central Tendency

Mean – sensitive to extreme values.
Median – robust; largely unaffected by extreme values.
Trimmed mean – a compromise that drops a fixed percentage of the smallest and largest observations before averaging.
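
To illustrate how one extreme value affects these three measures, here is a small sketch in base R. The `salary` vector is invented for the example and is not taken from the employee dataset.

```r
# Hypothetical salaries with one extreme value at the top.
salary <- c(20000, 22000, 24000, 26000, 28000,
            30000, 32000, 34000, 36000, 135000)

mean_all  <- mean(salary)              # pulled upward by the 135000 value
med       <- median(salary)            # middle of the sorted values
mean_trim <- mean(salary, trim = 0.1)  # drops the lowest and highest 10% first

print(c(mean = mean_all, median = med, trimmed = mean_trim))
```

With ten observations, `trim = 0.1` removes one value from each end, so the trimmed mean lands close to the median while the ordinary mean does not.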

Measures of Spread

Range – simple max‑min difference; highly affected by outliers.
Inter‑quartile range (IQR) – robust, focuses on the middle 50%.
Variance & Standard Deviation – incorporate all observations; sensitive to extreme values.
Coefficient of Variation – standard deviation expressed as a percentage of the mean.
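
A minimal sketch of these spread measures in base R follows. The vector `x` is hypothetical and only serves to show the function calls.

```r
# Hypothetical years-at-school values; not the session's data.
x <- c(8, 12, 12, 12, 12, 12, 15, 15, 20, 21)

rng <- diff(range(x))      # max minus min
iqr <- IQR(x)              # Q3 minus Q1
v   <- var(x)              # sample variance (n - 1 in the denominator)
s   <- sd(x)               # sample standard deviation, sqrt of the variance
cv  <- 100 * s / mean(x)   # coefficient of variation, in percent

print(c(range = rng, IQR = iqr, variance = v, sd = s, cv_percent = cv))
```

The coefficient of variation is only meaningful for variables measured on a ratio scale with a mean well away from zero.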

Distribution Shape

Symmetric – left and right tails mirror each other (often normal).
Skewed right (positive) – long tail on the high‑value side.
Skewed left (negative) – long tail on the low‑value side.
Histograms and box‑plots reveal these patterns and help decide whether transformations or alternative models are needed.
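
As a numerical complement to the plots, the sketch below computes a simple moment-based skewness coefficient. Base R has no built-in skewness function, so the helper `skewness()` is defined here for illustration; the three vectors are made up.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

right     <- c(1, 2, 2, 3, 3, 3, 4, 20)  # long tail on the high side
left      <- -right                       # mirror image: long tail on the low side
symmetric <- c(1, 2, 3, 4, 5)

print(skewness(right))      # positive
print(skewness(left))       # negative
print(skewness(symmetric))  # zero

# In a right-skewed sample the mean is pulled above the median.
print(c(mean = mean(right), median = median(right)))
```

In an interactive session, `hist(right)` and `boxplot(right)` would show the same asymmetry graphically.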

Correlation

Correlation coefficients range from –1 (perfect negative linear relationship) to +1 (perfect positive linear relationship); 0 indicates no linear relationship.
Strong correlations (|r| > 0.7) suggest a linear link, but the researcher must still interpret the substantive meaning.
Correlations should be examined within sub‑groups (e.g., managers only) because patterns can differ dramatically across categories.
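
The point about sub-groups can be sketched with a deliberately artificial example in which two job categories show opposite trends. The data frame `emp` and all of its values are invented; only `cor()`, `split()`, and `sapply()` are standard base R functions.

```r
# Hypothetical data: experience and salary move together for one group
# and in opposite directions for the other.
emp <- data.frame(
  jobcat  = rep(c("Clerical", "Manager"), each = 4),
  prevexp = c(1, 2, 3, 4, 1, 2, 3, 4),
  salary  = c(20, 22, 24, 26, 60, 55, 50, 45)
)

# Correlation over everyone, ignoring job category.
r_all <- cor(emp$prevexp, emp$salary)

# Correlation computed separately within each job category.
r_by_group <- sapply(split(emp, emp$jobcat),
                     function(g) cor(g$prevexp, g$salary))

print(r_all)
print(r_by_group)
```

Here the pooled coefficient is weak even though each group has a perfect linear relationship, which is why the pooled value alone can be misleading.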

Case Study: Employee Dataset

The participants explored an employee dataset (taken from SPSS) containing:
- Quantitative variables: years of education, current salary, beginning salary, time on the job, previous experience.
- Qualitative variables: gender, job category (clerical, managerial, custodial), minority status.

Key analytical steps demonstrated:
1. Identify quantitative variables and compute five‑number summaries for education and salary.
2. Detect outliers and implausible values (e.g., a hypothetical negative value for years of education) and discuss possible data‑entry errors.
3. Compare groups using box‑plots and frequency tables to reveal gender imbalances in job categories and salary distributions.
4. Cross‑tabulate to examine whether gender is associated with job type (e.g., all custodial positions were male).
5. Use scatter plots to explore relationships such as:
   - Education years vs. current salary (weak/absent trend).
   - Beginning salary vs. current salary (strong positive correlation, r ≈ 0.88).
   - Previous experience vs. current salary (negative correlation for some groups, suggesting possible cohort effects).
6. Interpret the results: the analysis highlighted potential discrimination, the limited explanatory power of education for certain job categories, and the importance of subgroup analysis.
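
The cross-tabulation step might look like the following sketch in base R. The data frame `emp` contains a handful of invented rows, so the counts shown are not the session's results.

```r
# Hypothetical rows; the counts are illustrative, not the real dataset.
emp <- data.frame(
  gender = c("Male", "Male", "Male", "Female",
             "Female", "Female", "Male", "Female"),
  jobcat = c("Custodial", "Manager", "Clerical", "Clerical",
             "Clerical", "Manager", "Custodial", "Clerical")
)

# Counts of gender by job category.
tab <- table(emp$gender, emp$jobcat)

# Column proportions: the gender mix within each job category.
prop <- prop.table(tab, margin = 2)

print(tab)
print(round(prop, 2))
```

A column proportion of 1 for one gender in a category corresponds to the kind of pattern noted in the session, where every custodial position was held by a man.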

Practical R Workflow

Installation: install.packages("car") and other required libraries.
Setting the Working Directory: Session → Set Working Directory → Choose Directory to point RStudio to the folder containing the CSV files and scripts.
Loading Data: read.csv("employees.csv").
Running Scripts: Execute line‑by‑line, checking for errors such as missing packages or incorrect file paths.
Troubleshooting: Verify that the correct folder (not a hidden sub‑folder) is selected, reinstall missing packages, and consult console messages.
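
One way to make the start of a script more forgiving of the problems listed above is sketched below. The helper names `missing_packages()` and `read_csv_checked()` are hypothetical and are not part of the session's script; `requireNamespace()`, `file.exists()`, and `read.csv()` are standard R functions.

```r
# Return the packages from `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
}

# Read a CSV file, stopping with a clear message if the path is wrong.
read_csv_checked <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path,
         " (working directory: ", getwd(), ")")
  }
  read.csv(path, stringsAsFactors = TRUE)
}

# Usage sketch, commented out so that nothing is installed unintentionally:
# to_install <- missing_packages(c("car"))
# if (length(to_install) > 0) install.packages(to_install)
# emp <- read_csv_checked("employees.csv")
# str(emp)
```

Printing the working directory in the error message makes it easier to spot the case where RStudio is pointed at the wrong folder.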

Why Spend Time on EDA?

The presenter emphasized that thorough EDA saves time later: it uncovers data quality issues, informs model selection, and provides a narrative foundation for any statistical report or thesis.

Communication of Results

Use concise tables, bar charts, or pie charts to convey frequencies.
Highlight outliers and explain whether they are errors or meaningful observations.
Tailor the story to the audience—journalists may stress the gender disparity, while a technical report may focus on statistical significance.

Next Steps for Participants

Ensure all required R packages are installed.
Set the working directory correctly.
Run the provided script up to line 19 without errors.
Continue exploring the dataset tomorrow, focusing on multivariate visualizations and hypothesis testing.

The session concluded with reminders about the WhatsApp support group, the YouTube channel for additional tutorials, and a light‑hearted farewell.

Effective exploratory data analysis—combining numerical summaries, visualizations, and careful handling of outliers and missing values—lays the groundwork for reliable statistical modeling and clear communication of findings.


Full Transcript

Hello everyone. Hi, are you getting me?

Yes, we are getting you.

Okay. Yes, I can try this [inaudible]. Yeah, I think we can start. Maybe let me show my face briefly. I don't think it adds a lot of value, but it is good to know who is talking to you. Then from time to time I can come back, for those who would want to see me. So let me share my screen. We are going to start with just a brief look at exploratory data analysis. This is going to be a little bit of theory.

No volume.

Your volume is clear, I think. Okay, I think it is Susan's which is not clear, but that's okay. So what I want us to look at briefly, before we start looking at the practical aspect of it: over the last two days we got ourselves into R and RStudio. We were able to at least install R and RStudio, and we were able to put in some commands to install and call the libraries. We're going to continue from where Ellen left off, but before we do that I want us to just go through what I consider one of the most important things in data analysis, that is, the exploratory aspect of it.
We can always look at a data set as follows. Let's say, for example, we are interested in the height of human beings. Of course you want to know the shortest person, you want to know the tallest person, and you want to know the average height. You may also want to know the first quartile, the second quartile (the median), or the third quartile. These are the kinds of things that you want to know.

So what is exploratory data analysis? The first thing is that this is an approach aimed at summarizing the main characteristics of the data using numbers and graphs. We already saw some summaries that Ellen gave us yesterday. It is always supposed to be the first step in your analysis. It would be unfortunate if you went straight away to do analysis of variance or regression analysis before actually trying to do some exploratory data analysis. The main goal of exploratory data analysis is simply to obtain confidence in your data, to a point where you're ready to engage in some inferential statistics or you're ready to do some modeling.

I would say these are the main reasons why we do exploratory data analysis. The first is that you want to maximize insight into the data set that you have. After collecting your data, either from a field experiment or from a survey, you want to actually know what is there in your data set. You want to know whether there are relationships between variables. Let's say, do women run faster than men? Or maybe, are people from a certain area shorter than people from another area? You also want to see whether there are mistakes, or values that do not seem to belong together with the others. For example, you go to a primary school and you want to measure the intelligence or IQ of the kids, and you find one kid with an IQ of, say, 600 while the others are far below that. That particular value will be of interest. You want to know: is this a real value, or could it be that this is an error?

There are also assumptions that are related to data analysis. For example, if you do analysis of variance, you know that the data should be normally distributed, and so on. At this stage you'll be able to see whether some assumptions are actually violated or not. You can also suggest the hypotheses that can be tested, and you will be able to come up with the tools that you will use for your data analysis. So in summary, you want to get a big-picture look at the data, you want to check whether there are some obvious mistakes, you want to learn about the distributions of the different variables, and you want to know if relationships exist between the different variables that you have. We certainly need to pay a lot of attention to exploratory data analysis. Please don't rush to do analysis of variance or statistical modeling before you explore your data.
There are a number of methods that we can use for exploratory data analysis. Let me just show the table. Okay, so for example, you can have a summary based on just numbers versus graphs. You can use the mean or the median, or you can go and use things like the histogram, the bar graph, and so on and so forth. You could also classify the methods of exploratory data analysis by whether you are dealing with one variable at a time, which is univariate, versus multivariate, where you deal with more than one variable at a time. You could deal with two variables at a time, or three or four variables at a time, depending on what you want to do.

Based on the two categorizations we can have, for example, numerical univariate. Let's say, for example, yesterday Ellen showed us a data set, and in that data set there are several variables. If you look at one of these variables at a time and you use the mean or the variance, this is what we call univariate numerical. You can also use univariate graphical, then you have numerical multivariate, and then you have graphical multivariate. We are going to look at each of these methods in relation to a data set that we are going to describe shortly. This is a table that shows the different methods that are available. This is not an exhaustive list. The presentation was already given to you yesterday, so if you are keen enough you should probably have already looked at it.

Okay, so under univariate you can look at the measures of center or location, things like the mean, the median, the mode, and the quartiles. You can also look at the measures of spread. Under spread we have things like the range, the variance, the standard deviation, and the interquartile range. You can also look at frequencies and percentages. I think we saw yesterday how to generate some frequencies using the summary tools; this was one of the things Ellen did yesterday. Then graphically we're going to look at things like the box plot, the histogram, the stem-and-leaf plot, the bar chart, and the line graph. These are for when you are looking at one variable at a time. Let's say you want to see, for example, the age of the children, the height of the children, maybe the weight. You can look at each variable one at a time, either using numbers or using graphs.

But you can also look at more than one variable at a time. You may be interested in looking at height versus weight, or at sex versus distance from education. Under the multivariate heading we have different tools: things like cross-tabulation, and covariance and variance matrices. Graphically you can use the scatter plot, the bar graph, the histogram, and the box plot. We are also going to have a bit of time to actually look at these graphical methods for multivariate data
analysis.

We have a small data set. If you've attended our training before, you're probably already aware of it. This is a data set about the employees of a certain company. This data set was taken from SPSS. There are a number of variables that we are interested in studying. We want to know the sex or gender of the employee, whether male or female; the number of years somebody spent at school; the job description; the current salary; the beginning salary; the time spent on the job; the previous years of experience; and whether somebody is a minority or not. Here the main possible area of study would be whether there is discrimination based on gender or based on minority classification. In some cases people complain that maybe women are favored for a particular job over men, or the other way around.

This is how the data set looks. Yesterday Ellen talked about a data frame. A data frame simply has rows and columns. Let me try to get a pen. So these are the rows, and these are our columns. For example, if you look at the first person here: male; date of birth 3rd February 1952; spent 15 years at school; this person is a manager; this person currently earns $57,000 and started work earning 27,000. So you can see that the salary has more than doubled. The time spent on the job, in months, is 98; then previous years of experience; and then minority: this person is not a minority. Under this variable you can either be a minority or a non-minority.

When you're collecting your data, in most cases you need to have one study unit, and all the information relating to that study unit should be entered into one row. This is for ease of analysis, especially if you don't have a lot of knowledge of the computer. If you're very good at programming you can always play around. But this is the kind of data set, the data frame, that Ellen was talking about. It's similar to what we know in Excel and other packages.

What we're going to do is start by summarizing the quantitative variables. From the chat: what is a quantitative variable? Can you try to give us a definition of what a quantitative variable is? I want to see some answers in the chat. Somebody says "measurable". Somebody says "age". "Numerical variable." "Dealing with numbers." "Salary." "Numbers." Okay. "Numerical." Okay, good. That's fine, we can stop there now. Some people are giving examples and some people are defining it, which is okay. At least I see that we have some knowledge. We can stop there now. Okay, so
quantitative variables are variables that can be quantified, either by some measurement scale or by counting. Height, for example, is a quantitative variable, and it is continuous in that you can measure it over some continuous scale.

One of the easiest ways to summarize a quantitative variable is by what we call a five-number summary, and the five-number summary can be expressed as a box plot. The five-number summary is simply the following. You want to know the minimum value. You want to know the maximum value. You want to know the value that divides the population in half, that is, the median: 50% of the population will be below this value and 50% will be above it. Then there is the first quartile. The quartiles divide the data into quarters: 25% of the data will be below the first quartile and 75% will be above it. The third quartile is the opposite: 75% of the population will be below and 25% will be above. So if you want to know the top 25%, you go for the third quartile. This same information can be presented in a box plot. In a box plot you'll have the lower whisker and then the upper whisker. We're going to look at this shortly.

So let's first identify which variables from here are quantitative. Can you identify the variables that are quantitative? Let's try and see in the chat: one variable from here that you think is quantitative. "Years at school", yes. "Salary." "Current salary." Okay, great. "Time on job", good. ID is not quantitative. Okay, we can stop there.

Okay, so these are the variables, the ones in green, that are quantitative: level of education, that is, the number of years at school; current salary; beginning salary; time on job; and previous experience. Now it's very important to differentiate between quantitative and qualitative data, that is, between quantitative and categorical, and between continuous and discrete. All this information is available; I'll ask you to go and read about it online. So the ones in green are our quantitative variables. Here gender is categorical, job category is categorical, and minority classification is also categorical.

The first thing that we need to do is the five-number summary. From this data we can see that in this company the least educated person spent 8 years at school and the most educated spent 21 years, and that at least 75% spent no more than 15 years at school, based on the upper quartile. Now, assuming that the minimum value here was -10, what comment can you give about this? I want to see in the chat. Is it possible to have negative 10 years at school? It is impossible. So what could it be? Somebody says an outlier. What could be the reason? It could be an error. So you need to go and check. Say there's somebody who has spent negative 10 years at school: I don't know whether, after that person went to school, they took them to a doctor and tried to extract as much information from the head as possible, so that the person went backward 10 years.
Okay, great. So this information is enough to actually start a discussion with your supervisor or with your student. You're saying in this case that the least educated person in this company spent eight years at school and the most educated spent 21 years at school, and that at least 75% of the population spent no more than 15 years at school. What percentage spent 12 years at school? What proportion of the population spent 5 years at school? Okay, I see it: "12 years at school, half". Here you can see that the median is the one that tells us that at least 50% of the respondents spent 12 years or less at school. But you can see from here that the median and the lower quartile are the same. You can see this thickness here. So this is what we have.
Okay, so here we can see that this is the upper quartile, which is at 15. Then the median and the lower quartile are at 12. The minimum value is at eight. Then the maximum value here, this is our 21. But there's also another value here, which I think is 19. Now these two values, that is 19 and 21, have been marked as outliers. You can see from here that they are outliers. And how do we calculate the outliers? The outliers are simply obtained like this: you take the upper quartile and you add 1.5 times the interquartile range, going in this direction. Once you specify that point, anything above it will be considered an outlier. You come and do the same here: anything below here will be considered an outlier. In this case there are no outliers below, but there are outliers above. So these people here seem to have studied more than the rest of the members.
So somebody is asking, what is the interquartile range? The interquartile range is the difference between the upper quartile and the lower quartile. It is this difference here. In our case it is 15 minus 12, which is equal to three. What we are going to do is multiply this by 1.5, which gives us 4.5. Then we take the upper quartile, which is 15, and we add 4.5 to it, and this gives us 19.5. So any value above 19.5 will be considered an outlier. I think this value here is 20, not the 19 I gave you. This value is 20, which is above 19.5, and any value above that will be considered an outlier. When we come down here you will have, not eight, sorry, 12, which is the lower quartile, minus 4.5. If you take this away you get 7.5. So any value below 7.5 will be considered an outlier. But here the smallest value is eight, so there's no value that is below 7.5. Wait a bit, is that correct? Is that 12? Did I do the calculation right?

Where is the 1.5 coming from? The 1.5 is a standardized value: somebody gave us that value and told us that it is what we need for the calculations.
Okay, so we don't have any value below 7.5, because 8 is above that. The smallest value here is eight and there is no value below 7.5, so there's no outlier on the lower part of the graph here.

Now when we go to the salary we see the same thing. You can see from here that the minimum current salary is 15,750, the maximum salary is 135,000, and the median salary is 28,875. Then this is the lower quartile and this is the upper quartile. So you can see that the least paid person earns 15,750 and the highest paid earns 135,000. If, for example, we found a value here saying that the lowest-paid person is earning, let's say, $15, you would begin to question that. I mean, can somebody work for a full year for $15? You would probably say, well, this could be an error. So by simply looking at the maximum and the minimum values you will already be able to determine whether there is any error or not.

Let me go to the chat. "Please explain the calculation." I thought I did explain the calculation. So the formula for calculating the upper boundary is the upper quartile plus 1.5 times the interquartile range. "What is the difference between quantitative and qualitative variables?" Qualitative or categorical variables are variables that categorize, or put your experimental or study units into groups. For example, you can be either male or female; that is categorical or qualitative. But your height is quantitative because it can be measured over a scale.
Okay, so let's look at the upper quartile. Our upper quartile here is 37,050. That is our upper quartile, and then plus 1.5 times the interquartile range. Now our interquartile range is going to be the difference between the upper quartile, which is 37,050, and the lower quartile, which is 24,000. If you compute that, it will give you the upper boundary; it will give you the value where this whisker will stop. Then we are saying any value above this will be considered an outlier. You can see there are very many outliers here. There is this particular value here, which I think is the highest value, but we still have a lot more values within here. And then of course you have your upper quartile, the lower quartile, the minimum value, and the maximum value.

The point here is then: are all these values really outliers? What is an outlier? An outlier is a value that does not seem to conform with the rest. But look, in this particular case we have all the employees of the company: we have the managers, we have the secretaries, we have all the office messengers and everybody else. If you're going to group everybody together, some people may appear to be outliers because they are grouped with other people. But if you put the managers together, you're most likely going to reduce what you may consider to be an outlier. So this is the importance of trying to subset your data and make your summary based on some grouping variable. That's what Ellen was talking about yesterday: when you're doing those frequencies you may want to do the frequencies based on the different categories. And this is where bringing in another variable, so that you do bivariate or multivariate exploratory data analysis, would help in this particular case.
Okay, so outliers. If you look here, we seem to have these gentlemen and ladies who are a bit off from the rest. These are outliers. There may be a reason why they are out of the crowd: maybe they were put out there by mistake, or because they have something different and are unique in their own way. So outliers are separated from the rest; they are the unusual values. You may need to explore the reason why these values are different from the rest. Are they errors? If they're not errors, then they could be of interest to you. I mean, if these guys here are supposed to be in the crowd but they seem to stand out, then you need to know why they are here. Do they consider themselves to be different from the rest? In practice, what always happens is that first of all you need to check whether these are errors or not. If you confirm that they're not errors, the practice is always to do the analysis with and without the outliers, and then you can compare the results.

The other issue here is what we call missing values. Missing values may also pose a lot of challenges when you're doing the analysis. The main issue is: do the missing values lead to different results than if the data were not missing? So you need to look at what is causing the missingness. For example, we have just been doing our population census in Uganda. If you later try to match the homesteads with the data collection and you find that maybe all the families that are far away from the road are not included in the census, then the missing values in this case are very informative, because there is a pattern to the missing values. But if the missing values are random, in that maybe one household here and another one there is missing, I mean if there is no particular pattern, then you would treat those missing values without really having to bother. But if the missing values have a pattern, then you may actually need to get worried about
that uh so now how do
we describe a population the
distribution of a population so the
first thing is uh there are certain
characteristics that may be interest to
you for example the center of the
population
the spread of the
population uh the PS how many PS do you
have and what about the shape do we have
heavy tals do we have lighter tails and
what about outliers so we're going to
look at how we can use a a things like
histogram and box PL to actually uh look
at the distribution of the population
that you're trying to study now the
first measure of the population that we are
interested in is the measure of center, or
central tendency. We always want to know what
the average is. What does a typical plant look
like in a particular forest? What does a
typical Ugandan look like? A typical Kenyan? A
typical Nigerian? That is the measure of
center. If you say a typical Nigerian male is
about 1.8 m tall, that is supposed to tell us
what a typical member would be. It does not
mean that you may not get a male who is 1.2 m;
he may be there. You could also get a male who
is 2.2 m; that may also be there. But the most
important thing here is that we want to know
the center. Knowing the center is important
because it helps us to determine how far away
some values are from the center. If a value is
too far away from the center, say you get a
man who is 50 cm tall when the center for
Nigeria is 1.8 m, then you begin to question
that particular value. You say this is really
not typical: a Nigerian man is supposed to be
about 1.8 m, so somebody who is 0.5 m, if not
a child, could have come from somewhere else,
or this value could actually be an error. That
is what you want to find out. So there are
two ways of measuring center: we have the mean,
or the average, and then we have the median.

Somebody has just joined the class and is
asking which data set we are using. We are
using the data set in my head today. We will
show you the data set that we are using
shortly.

Okay, so we need to calculate the mean. I think
we all know how to calculate the mean. There is
also what we call the trimmed mean. What we
know is that the mean is affected by extreme
values: if there are values that are very far
away from the center, they tend to affect the
mean. So if you go to the current salary, for
example, and then the years at school, you can
see that there is a big difference between the
mean and the median. Remember, both of them are
supposed to measure the center of the
distribution. These outliers here tend to make
the mean much bigger compared to the median.
Remember, for the median you just line up the
observations from the smallest to the highest
and you look at the one in the middle. So the
median is not affected by extreme values, while
the mean is affected by extreme values. To get
the mean closer to the center, you can use what
we call a trimmed mean, but you can see here
that you need to cut off a bigger percentage of
your data to actually get there with the
trimmed mean. The same goes for years at
school: you can see the difference between the
mean and the
median. What I tried to show here is that if
you cut part of your data off, the median
does not change much. So the median, as a
measure of the center of the distribution, is
very robust. That is the word: it is very
robust. It does not change much with changes
in the data. If we go back to these pictures
here, you see that for the median you are just
interested in the middle, so this height here
is the median height. Even if I chop off these
people here and I chop off these people here,
the median is likely to remain the same. That
is why it is not affected by extreme
observations, because you can always take them
off. So this is what I am trying to show here:
the median, as a measure of center, is very,
very robust.
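
The contrast between the mean, the median, and the trimmed mean can be sketched in a few lines. This is only an illustration, written in Python rather than the R used in the session; the salary figures and the `trimmed_mean` helper are made up for the example and are not the session's data.

```python
from statistics import mean, median


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the observations from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each tail
    return mean(ordered[k:len(ordered) - k])


# Hypothetical salaries in thousands; the last value is an extreme one.
salaries = [22, 24, 25, 26, 27, 28, 30, 31, 33, 135]

print("mean        :", mean(salaries))
print("median      :", median(salaries))
print("10% trimmed :", trimmed_mean(salaries, 0.10))
```

With one extreme salary in the list, the mean sits well above the median, while the trimmed mean lands close to the median once that value is cut off.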
Now, what about the spread? The spread is a
measure of how far values are from the center,
from the central value. We said a typical male
Nigerian should be of height 1.8 m, let us say.
But not all Nigerians will be 1.8 m: somebody
may be 1.7, somebody may be 1.6, another person
may be 1.2, and another person may be 2.2. So
the spread is just measuring how far apart, or
how scattered, these values are in relation to
the center. These are the common methods that
we use for measuring the spread: we have the
range, the variance, the standard deviation,
the interquartile range, and then the
coefficient of variation. All of these tend to
measure the same thing, but they have different
characteristics, and therefore they may be
affected differently by, say, outliers and
other things. The easiest measure is the range,
which is just the difference between the
maximum value and the minimum value. Another is
the interquartile range, which is the
difference between the upper quartile and the
lower quartile. Both the range and the
interquartile range do not take into account
all the data points. Remember, the range only
takes care of this value and that value; it
does not care about what is in between. So you
can have a situation where there is one value
here, another value there, and all the other
values are here in one place. The fact that
there is no variation among them does not
affect the range, which only takes care of the
extreme values. And for the interquartile range
you compare the upper quartile and the lower
quartile. So, with our data, the next one is the
variance. With the variance you are also
measuring the variation from the center: how
far each observation is from the center. You
can see that this observation here, with a
value of six, is the farthest from the center,
followed by that one, then this one, and then
that one. What we are doing is simply looking
at the difference between each observation and
the mean; then you square the differences, you
sum them, and you average. The square root of
that gives us the standard deviation. So this
is what we have as a measure of the spread.

Like I indicated before, the spread can also be
highly affected by extreme values. So let us
look at the two sets of values here. If I look
at the entire data set, the range is 13, the
interquartile range is 3, the variance is 8.2,
and the standard deviation is 2.88. Now if I
chop off 5% of the data from this extreme end,
the range changes to 11, the interquartile
range remains 3, the variance changes to 5.3,
and the standard deviation is 2.3. So you can
see that the range is affected a lot by changes
in the data, so it is not robust, but the
interquartile range is quite robust: its value
remains the same. The variance and the standard
deviation are also affected by changes in the
data when you trim a part of it. So extreme
values also have an effect on the measures of
spread, especially if you are using the
variance or the range. The variance and the
standard deviation are the most common measures
used for this.
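
The robustness comparison described above can be reproduced on a small made-up data set. This is a sketch in Python (the session works in R); the numbers and the `spread_summary` helper are hypothetical, and the variance here is the sample variance, which divides by n - 1 rather than n.

```python
from statistics import mean, quantiles, stdev, variance


def spread_summary(values):
    """Common measures of spread for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default "exclusive" method
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": variance(values),  # sample variance (divides by n - 1)
        "sd": stdev(values),
        "cv_percent": 100 * stdev(values) / mean(values),
    }


# Hypothetical data: mostly tightly clustered, plus one extreme value.
full = [4, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 20]
trimmed = sorted(full)[:-1]  # drop the single largest observation

results = {"full": spread_summary(full), "trimmed": spread_summary(trimmed)}
for label, summary in results.items():
    print(label, {name: round(value, 2) for name, value in summary.items()})
```

Dropping the one extreme value changes the range and the variance a great deal, while the interquartile range barely moves, which is the pattern the speaker points out.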
Okay, so the other thing that we are also
interested in is the distribution pattern of
the values in the data. Here, for example, we
want to see whether there is symmetry. Symmetry
means that if you fold this from the middle
here, the two parts should fall neatly on each
other. This is what we call symmetric, and this
is what we expect from a normal distribution.
This one is also symmetric, but you can see
that here the values are very tight, very close
to the mean, while here the values are a little
bit wider. Now, when you look at this other
one, you can see that it is not very symmetric:
this tail is a lot longer. This is what we call
skewness. The skewness can be either to the
right or to the left; the data can be either
positively skewed or negatively skewed. You can
also have a situation where instead of having
one peak you have two peaks; here you have two
peaks. All of these will determine the choice
of tools of analysis. If you want to do
analysis of variance, for example, you expect
to have this kind of distribution or this kind
of distribution. If you have this kind of
distribution, then you may need to do something
else to help with that, and if you have this
kind of distribution, then something else needs
to be done: you either need to choose another
tool for the analysis, or you do things like
data transformation, and so on and so forth.

So let me go to the chat
questions. I know you are being answered there
in the chat box. Are we together? Okay, we are
following; these are just simple things.

So the shape of the distribution is mainly the
pattern that the values take, showing their
differences relative to one another. This is
mainly observed in, for example, the histogram.
The histogram tells us whether values tend to
be close to a particular value or to multiple
values. That is where a typical value would be,
the center of the distribution. It also tells
us how much the values vary about that typical
value. So let us look at this. Here, for
example, is the distribution of the current
salary from what we have. You can see that the
most common values are here; most of the values
are here. Now, there are values that are very
far away from the center: these are the unusual
values, the outliers. The entire spread of the
data, on either this side or that side, is what
we call the variation. You can see that most of
the variation is found above the common value,
while on this side there is less variation. You
can also see that we seem to have all these
values here that seem to be on their own. But
these values could also be because of those who
are highly paid: you have people like managers
who get more money, so they could be the ones
on this side, and then the other common workers
are the ones that are here. So here you may be
dealing with more than one distribution. When
you are exploring your data set, you do not
need to rush to make a decision about whether
the data is skewed or not. You also need to
know what could be responsible for the pattern
that you observe, because here we can see that
there could be a number of people who get paid
more compared to those who are paid less, and
so you could be dealing with two populations: a
population of the managers and a population of
everybody else. So you can have a situation
where there is no peak; you could also have a
situation where there is one peak. These
examples are just generated using the computer.
This, for example, is what we call a symmetric
distribution with one peak. This is the left
tail of the distribution, this is the right
tail, and this is the center. So here we can
say, well, we have a normal distribution, while
here we have a non-symmetric distribution with
one peak. This is our peak, and we have the
longer tail here, so this is right skewness,
and the shorter tail is on the left. You can
also look at it in this form: here the right is
the longer tail, and in this one the left is
the longer tail, so this will be the shorter
tail and this will be the shorter tail. So here
you have what we call a left or negative skew,
and here you have what we call a right or
positive skew. But all of this should take into
consideration that instead of dealing with one
population we may actually be dealing with two
or more populations. Later on you need to bring
those populations into your analysis, and once
you do, you should be able to determine whether
there is a need to model those differences or
there is no need for modeling them.

So the box plot and the
histogram literally show similar information.
This is the same data set. You can see that
where the bulk of the box in the box plot is,
that is also where the majority of the data
points are. You can see the outliers are
identified here. If you put them side by side
you can actually tell: these values here are
the values that we have here, the same with
what we have on this side, and these values are
the ones that are here. That is what we have
there.

Now, I am not going to talk about this for now.
Okay, so these are some of the things that we
can look at. You have the material with you; I
am just trying to scan through this quickly.
This is negatively skewed and this is
positively skewed. Negatively skewed means the
long tail is on the left side, on the negative
side of the number line; positively skewed
means the long tail is on the positive side of
the number line; and symmetric means the two
tails are the same, so you can fold this one
into two and the halves match. Here we see
that the mean equals the mode equals the
median. Here you see that the mode is the one
at the top, followed by the median, and then
the mean, so the outliers are here. While here
the mode is still on top, then the median, and
then the mean. You can see that these big
values here tend to pull the mean this way.
That is why I said that if Bill Gates joins our
class here and we calculate how much money we
have on average in our accounts, then we may
all walk away from the class very happy that we
are very rich, because the average is very
high. But the average has been pulled away by
Bill Gates, or by a rich person in this class.
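
The Bill Gates example can be put into numbers. The sketch below is in Python rather than the session's R, and everything in it is hypothetical: the account balances are invented, and `moment_skewness` is one common way to quantify skew (the average cubed z-score), not something the speaker names.

```python
from statistics import mean, median, pstdev


def moment_skewness(values):
    """Moment coefficient of skewness: the average of the cubed z-scores."""
    m, s = mean(values), pstdev(values)
    return sum(((x - m) / s) ** 3 for x in values) / len(values)


# Hypothetical account balances for a small class (symmetric around 350).
class_balances = [200, 250, 300, 350, 400, 450, 500]
# The same class after one extremely rich person joins.
with_rich_guest = class_balances + [1_000_000]

for label, data in (("class only", class_balances),
                    ("with rich guest", with_rich_guest)):
    print(label,
          "mean =", round(mean(data), 1),
          "median =", median(data),
          "skewness =", round(moment_skewness(data), 2))
```

The median moves only slightly when the guest joins, while the mean jumps and the skewness turns strongly positive: the long tail is now on the right.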
So the next one is this: when we are dealing
with the issues of central tendency, spread,
and skewness, these do not make a lot of sense
when we are dealing with qualitative nominal
variables. But for qualitative ordinal
variables, sometimes it makes sense to treat
the data as quantitative when you are doing
exploratory data analysis. Here, for example,
you have the Likert scale, or maybe you have a
scale of 0 to 9 like what is used by people
scoring diseases. At the end of the day these
are ordinal scales. They sit between
qualitative and quantitative, so you need to
handle them with a bit of caution as you move
on. So let us look briefly at how we summarize
qualitative variables.
So, the question: how do you know that you
have an error by looking at the minimum and
maximum values? We make the assumption that you
are an expert in the field in which you are
collecting the data, and therefore there are
values that will not make sense to you at all.
But you could also go and check. For example,
if you are measuring the height of school
children, you should have a range within which
you expect normal values to fall. If the normal
values are expected to fall between, say, 70 cm
and maybe 1.2 m, and you get a value of 1.8 m,
then you begin to ask: is this a child? Even an
adult in this area probably does not have that
height. That is how you can tell that there is
something wrong. So we expect that you, as an
expert, understand what is going on.
Okay, so from our list of variables there are
three variables that are qualitative. We have
gender, man or woman. We have the job
categorization: you are either clerical staff,
custodial, or a manager. And then the minority
classification: you are either a minority or
not a minority, depending on how a minority is
defined. The main tools that we can use for
summarizing qualitative variables include
frequency tables. This is very simple; we saw
how frequency tables are generated yesterday
using the summary command. So you can have the
frequencies. Here, for example, we can say 216
of the employees were female and 258 were male,
so we say about 55% of the employees are male.
Somebody would already start saying: why are
these people employing more males, and yet the
population of females and males should be
balanced? So you can start your case there
already. The same information can be shown in a
pie chart, so you can see male versus female
here, or you can show it in a bar graph.
Comparing a pie chart with a bar graph, the bar
graph probably gives us a better
representation, especially when the values are
very close together. The eyes are not very good
at differentiating slices of a circle, but they
can compare bars. So from here we can see that
the females are fewer and the males are more in
this company.
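
A frequency table with percentages takes only a few lines. The sketch below is in Python rather than the R `summary` command mentioned in the session. The male count of 258 is the one quoted; the female count of 216 is an assumption, chosen so that the males come out at roughly 55% as stated.

```python
from collections import Counter

# 258 males as quoted in the session; 216 females is an assumed count.
gender = ["Female"] * 216 + ["Male"] * 258

counts = Counter(gender)
total = sum(counts.values())
percent = {level: 100 * n / total for level, n in counts.items()}

print(f"{'Gender':<8}{'Count':>6}{'Percent':>9}")
for level, n in counts.most_common():
    print(f"{level:<8}{n:>6}{percent[level]:>8.1f}%")
```

For a variable with only two levels, the single percentage is often all that needs to be reported in the text.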
On handling outliers, I thought I had talked
about that. With outliers the key issues are
these. First, you need to find out whether they
are errors or not. If they are errors, you can
go back to the field and try to correct them,
or go back to the original data collection form
and try to sort it out. If it is not an error
and you confirm that the value genuinely
exists, there are two possibilities. The first
is that you can do the analysis with and
without the outlier and compare the results.
The other is that, if there is one outlier, you
could pick that outlier and handle it
separately, talk about it separately, while you
analyze the others, and give different
recommendations. If we look at the
different job categories, you can see that the
majority of the employees are clerical staff:
about 77% of the employees are clerical staff,
and just under 6% are custodial. Custodial
staff are mainly people concerned with security
and other things. From the graph it is a lot
clearer: the clerical staff are the majority,
followed by the managers, and then we have the
custodial staff. We then need to go and ask:
does the job that you do influence the pay that
you get? Of course the answer is yes, but to
see the extent to which it does we need to
bring the two variables together. We bring the
job category and the current salary together
and look at them in a bivariate analysis, so
that we can tell whether the different
categories of employees get different average
salaries. That is what we are going to look at
next.

If you look at the minority classification, you
find that about 22% of the people in this
company can be categorized as minority. You can
present that in a frequency table, in a pie
chart, or in a bar graph, or you can also just
present it in your text. If you have two groups
like this, you do not actually need to draw a
bar graph or a pie chart or even a frequency
table. What you simply need to say in this case
is that this company employs about 22%
minorities, and then somebody will know that
the rest are the majority. So you have your 22
versus 78%.

Okay, so that is looking at one
variable at a time. Now we are going to look at
how we can look at two variables together,
because the reason why you do a study is that
you are interested in studying relationships.
Studying relationships means you can look at
two variables, three variables, four or five
variables at a go.

Okay, so what we saw so far is that we have
three categories of employee: the clerical
staff, the custodial staff, and the managers.
The issue now is: can we relate this
distribution of the salary to this? The
question I would be interested in asking would
be: what kind of worker is this? This person
getting 135,000, is that person a manager, a
custodian, or a clerical officer? What about
these gentlemen and ladies here, where do they
fall? Do they fall in this group, in that one,
or in that one? So I can look at these two
together. The question would be: is there a
relationship between the current salary and the
employment category? You can see from here,
sorry, these are the managers, this is the
custodial staff, and these are the clerical
staff. You can see that, of course, in terms of
the median the managers get a lot more money.
The custodial staff are here and the clerical
are there. So this is
also a table that can give you that. You can
see that the minimum, the least paid clerical
worker, gets 15,750, and the least paid manager
gets about double that. We can see that 75% of
the clerical staff get below 31,000, and 75% of
the managers get below about 72,000; or you can
say 25% of the managers get above about 72,000.
The highest paid person is a manager, but there
are also some good clerical staff getting
80,000, the clerical maximum. But these guys,
the custodial staff, well, their salaries do
not vary that much: you can see the minimum,
the first quartile, the median, the mean, the
third quartile, and the maximum, and the values
are very close together. This is the kind of
information that you actually need. You can see
that we seem to have a lot more variation among
the managers. As a manager you need to be able
to negotiate your salary, so people come with
different bargaining powers. There is very low
variation among the custodial staff, and there
is also a bit of variation within the clerical
staff. This is the kind of information that
you, as an expert, then need to combine with
your own knowledge, and then you write the
paper, you write the thesis, based on that.
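
A table of minimum, quartiles, median, mean, and maximum per job category can be built as follows. This is a Python sketch rather than the R used in the session; the salaries (in thousands) are invented for illustration and are not the session's data, and `group_summary` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean, quantiles


def group_summary(values):
    """Minimum, quartiles, median, mean and maximum of one group."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": med,
            "mean": mean(values), "q3": q3, "max": max(values)}


# Hypothetical (job category, salary in thousands) records.
records = [
    ("Clerical", 16), ("Clerical", 22), ("Clerical", 27),
    ("Clerical", 31), ("Clerical", 80),
    ("Custodial", 29), ("Custodial", 30), ("Custodial", 31),
    ("Custodial", 31), ("Custodial", 32),
    ("Manager", 34), ("Manager", 50), ("Manager", 60),
    ("Manager", 70), ("Manager", 135),
]

by_group = defaultdict(list)
for category, salary in records:
    by_group[category].append(salary)

summaries = {category: group_summary(values)
             for category, values in by_group.items()}
for category, summary in summaries.items():
    print(category, summary)
```

Reading the rows against each other shows both the difference in typical pay (the medians) and the difference in spread (how far apart the quartiles and extremes are) between the groups.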
The other thing is that we also want to see
whether there are differences in education
among the different job categories. You can see
from here, of course, the least educated
manager took...

Yes, okay, I am going to talk about the trimmed
mean when we are doing it practically.

Okay, so you see for the managers the least
educated person has 12 years and the maximum is
21 years. For the custodial staff the maximum
is 15 years, and for the clerical staff the
maximum is 19 years. So we can see that there
is variation in the number of years spent at
school within each job category.

So here we looked at job category by current
salary, and we also looked at job category by
level of education. The question that one may
want to answer is: can education level explain
the disparity in the salary? How much variation
is here compared to the variation there? How
much variation is there compared to the
variation here?

Okay, so what we can do then is put the box
plots for the two side by side and see whether
there is actually some kind of pattern.
Okay, so if there is a relationship between
education and salary within a job category,
then the spread in the two should be similar.
Let us look at the current salary: this is the
variation in current salary for the managers,
this is the variation in current salary for the
custodial staff, and this is the variation in
current salary for the clerical staff. When we
look here, of course, we can see that the
picture here is slightly different from the
picture there. What we can do is compute a
simple coefficient of variation and see; the
values should be similar. You can see here 28
versus 9. So there is less variation in the
education of the managers but a lot more
variation in the salary they earn. This already
raises the question that, apart from education,
you need something else to be able to earn a
good salary. Now, when you look here, there is
less variation here compared to here. What does
it mean? It looks like differences in education
do not really matter much when it comes to
salary when you are custodial staff. If you are
supposed to sit at the gate, it does not really
matter how many degrees you have: your role
there is to open the gate, close the gate, talk
nicely to people, ensure that bad people do not
enter the premises, and so on. In that case
your education may not really matter that much,
because the job is standardized and you
probably do not require much education. But I
am not an authority in that field, so I cannot
say much. Now, when you look at the current
salary here and this, well, we seem to see some
similarity in variability, but still the
variation in education is not equivalent to the
variation in current salary. These are just
some of the things you can scan through in your
data, and then you should be able to start
thinking about how the final story is going to
come about.
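
Because salary and years of education are on different scales, the coefficient of variation (standard deviation as a percentage of the mean) is what makes their spreads comparable. The sketch below is in Python rather than R, and the manager figures in it are made up; only the idea of comparing the two percentages comes from the session.

```python
from statistics import mean, stdev


def cv_percent(values):
    """Coefficient of variation: standard deviation as a percentage of the mean."""
    return 100 * stdev(values) / mean(values)


# Hypothetical data for the same seven managers.
manager_salary = [45, 52, 60, 66, 75, 90, 110]     # thousands
manager_education = [15, 16, 16, 17, 17, 18, 19]   # years at school

cv_salary = cv_percent(manager_salary)
cv_education = cv_percent(manager_education)

print(f"CV of salary    : {cv_salary:.1f}%")
print(f"CV of education : {cv_education:.1f}%")
```

When the salary CV is several times the education CV, as in the 28 versus 9 comparison quoted above, differences in education alone are unlikely to account for the differences in pay.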
So now we want to look at the current salary
of managers by gender. Let us look at the
managers, male and female: are there
differences? Of course we can see that the
males have a higher mean compared to the
females. Is this because of discrimination? It
could be. Could it be that the male managers
are more educated? Let us look at that. You can
see the averages, and there seems to be no
variation in the education of the female
managers: all the female managers have gone to
school for about 16 years. The males are at
around 17, and there are actually males who
studied for up to 21 years. So you could say,
well, maybe the male managers who get more
salary are more educated than the female
managers. That is not the whole story, but at
least from this side-by-side view you are able
to see something coming up.

Okay, so I would really recommend that you put
as much time as possible into your exploratory
data analysis, because it is going to make a
big difference to the steps you take in the
final analysis. I always spend much of my time
doing exploratory data analysis, so that when I
go on to do some modeling I already have a lot
of valuable information at hand. And while I am
doing this I keep noting little things, like:
okay, there seem to be differences in education
between male and female, there are differences
in salary between male and female, there are
more male managers than female managers, and so
on and so forth.

Now if we look at the clerical
staff, education versus that one, you seem to
see that there is apparently very little
variation in education among the clerical
staff. But there are actually females who have
studied a lot, even a lot more than the men.
You see, these two females have studied more
than 75% of the males as far as years of study
are concerned, but they may not even be these
ones here; they could be earning a salary
somewhere down here. Those are the kinds of
things you really want to dig deeper into, so
that you are able to appreciate what is
actually going on in your data
set.

Okay, so what we can also do here is look at the
relationship between any two qualitative
variables. Here I look at the job
categorization, and here I look at gender. The
question would then be: when it comes to job
distribution, is there some bias? If there is
bias, it is against which gender? Can I explain
the source of the bias? Is the bias coming from
education, from society itself, or from where?
So we can look at a joint distribution, and the
question is always: is there an association
between job categorization and gender? Are
males and females equally distributed across
the different job categories? This is where we
have what we call cross-tabulations, and with
cross-tabulations you can go ahead and even
test the hypothesis. If we look here at the
clerical staff, you can see that we have more
females as clerical staff compared to males.
Now, when you look at custodial, there are
actually no females who are custodial staff. I
think this data set was collected a very long
time ago, when it would have been a taboo to
get a woman sitting as the gatekeeper; these
days things have changed a bit. So all the 27
custodians are male. And then, of course, the
gender activists would want to capitalize on
this part here; this is where you are going to
see a lot of stories coming. You can see that a
bigger percentage of the managers are actually
male: out of the 84 managers, 10 are female. So
this is enough to bring in issues.
Now, it also depends entirely on how you want to
communicate. Somebody may communicate it this
way, no? So this is 10. Somebody may look at it
from this angle, or somebody may look at it
from that angle. It depends entirely on what
level of alarm you want to raise. People who
write newspapers know how to pick information.
The same information can be presented in two
ways: one in a very mild way, and another in a
much stronger way.

Okay, so here you have the frequencies and you
have the relative frequencies. You can present
the same information here. You can see from
here this is the custodial staff and these are
the managers; there are no female custodial
staff. Here you have the clerical staff: there
are more female clerical staff compared to
male. So you can compare: this is male versus
female, this is male versus female, and of
course this one is male only.
So the key thing here is communication: what do
you want to communicate? That is why they say a
statistician can be a liar. It is not that a
statistician is a liar; it is not about lying,
it is just about presenting the alternatives
that are there. Your focus depends entirely on
which information you want to pass on and how
alarming you want that information to be.

Okay, so those who have done analysis using SPSS
always talk about marginals. Marginal means at
the margin of the table. Here, for example, with
male and female, you can talk about the marginal
distribution by gender, or you can talk about
the marginal distribution by job category, for
example for clerical staff. This is how we can
present a relationship between qualitative
variables: this is cross-tabulation. You can
have a two-way table or a three-way table; the
more ways the table has, the more complicated
it becomes.
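
A two-way table with its marginals and percentages can be sketched as below, in Python rather than the session's R. The custodial and manager counts follow the figures quoted in the session (27 custodians, all male; 10 of the 84 managers female). The clerical split of 206 females and 157 males is an assumption, chosen only so that the totals and the roughly 57% female share of clerical jobs agree with what is said.

```python
# Job category by gender. The clerical split is an assumed one.
counts = {
    "Clerical":  {"Female": 206, "Male": 157},
    "Custodial": {"Female": 0,   "Male": 27},
    "Manager":   {"Female": 10,  "Male": 74},
}
genders = ["Female", "Male"]

# Marginal totals: the sums at the margins of the table.
row_totals = {job: sum(cells.values()) for job, cells in counts.items()}
col_totals = {g: sum(counts[job][g] for job in counts) for g in genders}
grand_total = sum(row_totals.values())

# Row percentages: within each job category, the share of each gender.
row_pct = {job: {g: 100 * counts[job][g] / row_totals[job] for g in genders}
           for job in counts}
# Column percentages: within each gender, the share in each job category.
col_pct = {job: {g: 100 * counts[job][g] / col_totals[g] for g in genders}
           for job in counts}

for job in counts:
    cells = "  ".join(f"{g}: {counts[job][g]:>3} ({row_pct[job][g]:5.1f}%)"
                      for g in genders)
    print(f"{job:<10}{cells}  total: {row_totals[job]}")
print("Column totals:", col_totals, "grand total:", grand_total)
```

Row percentages and column percentages answer different questions from the same counts, which is why the choice between them shapes the message that a reader takes away.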
So when you look at this, for example, what you
can say is that all custodial jobs are taken by
men. That is one way of looking at it. Or you
can say that females are denied custodial jobs.
It is the same information but presented in a
completely different way. You can say men took
88% of the managerial positions, or you can say
females took only 12% of the managerial
positions. It depends on what kind of bell you
want to ring. Women took about 57% of the
clerical jobs.

I saw a story in the newspaper yesterday. They
said only six candidates from a certain region
were selected, and this six was out of 60. If
you look at it in terms of percentage, that is
10%, and if you consider that there are so many
regions, 10% is actually a lot. This person was
trying to raise alarm by saying only six
candidates were selected, but it is only when
you go deep into the story that you see the six
is actually out of 60, not out of 1,000. If it
were out of 1,000, that would be different from
out of 60. So this is a simple way of
presenting your results, and I want to
emphasize that presenting results does not mean
writing 20 lines. You can write two or three
lines that communicate what you want very, very
clearly. If you want to write one page on this
graph here, then you are writing something
else; you are not writing something
useful.

Okay, so finally, in the next 10 minutes before
we go on to the analysis, we want to look at the
relationship between two or more quantitative
variables. Our quantitative variables were
education level, current salary, beginning
salary, time on the job, and previous years of
experience. So the questions one would be
interested in are these. Is there a
relationship between the number of years spent
at school and current salary? If I spend more
years at school, will I get paid more? That
depends on what you do at school, no? Does the
current salary depend on the beginning salary?
If I bargain my way in properly, certainly it
will have an impact on what I get later; it
depends on your bargaining power. Does the
current salary depend on previous years of
experience? That depends on what you were doing
previously.

Okay, so if you want to present two
quantitative variables, the starting point is
always what we call the scatter plot. With the
scatter plot, one variable is going to be on
the x-axis and the other variable on the
y-axis. In this case it really does not matter
what is on the x-axis and what is on the
y-axis, but later, if you are doing regression,
like what my colleague is going to look at from
tomorrow, then it is important to know what
goes on the x-axis and what goes on the
y-axis.
okay so one or more additional
categorical variable can also be
accommodated in the scattered plot so we
are going to see actually how we can
bring in additional information on the
scatter plot when we start doing our our
analysis
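A scatter plot that carries one extra categorical variable through color could look like the sketch below. The data frame `emp` and its columns `educ`, `salary` and `jobcat` are hypothetical names with made-up values, used only to show the mechanics.

```r
# Toy data invented for illustration; NOT the data set used in the session.
emp <- data.frame(
  educ   = c(8, 12, 12, 15, 15, 16, 19, 12),
  salary = c(21000, 25000, 27000, 30000, 31000, 52000, 70000, 30500),
  jobcat = factor(c("Custodial", "Clerical", "Clerical", "Clerical",
                    "Clerical", "Manager", "Manager", "Custodial"))
)

# One integer color code per observation, taken from the factor levels
group_cols <- as.integer(emp$jobcat)

# x-axis: years of education, y-axis: current salary, color: job category
plot(emp$educ, emp$salary,
     col = group_cols, pch = 19,
     xlab = "Years of education", ylab = "Current salary")
legend("topleft", legend = levels(emp$jobcat),
       col = seq_len(nlevels(emp$jobcat)), pch = 19)
```

Coloring by group keeps everything on one plot; the alternative mentioned in the session, drawing one plot per subset, is easier to read when the groups overlap heavily.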
So is there a relationship between
number of years at
school and
current salary? If these are the
number of years at school, there are
people who spent 8 years at
school, 12 years, 14 years,
16 years, 17 years,
20 years and 21 years
at school. Among those who
spent 8 years at school, this is the
range of salary that they get. Okay,
and for those who spent 12 years at
school, this is the range of salary they
get. 14 years at
school: you can see this one spent 14
years at school, but there's somebody who
spent 12 years at school getting more money
than them. Of course, these are 15 years
at school,
16 years at
school, 17 years at
school, 18 years at
school, 19 years at
school, then there is
20 and 21 years at school. What you
can see clearly from here is that the
people who spent the most time at school are
not the best paid. See these
three people here: they spent a lot of
their time at school and their salary is
in that range. Now look at this gentleman
here, there's a gentleman up
there who spent 19 years at
school. When these guys were
studying, that person had already started making
money; now he is far up there. But
that is not the point. The point is that
there is a trend. So we can see there is
actually a trend; we seem to see
there's actually some
relationship. Okay, I pointed that out already.
In terms of the counts, you can see there are
53 people who spent eight years at
school,
190, 6, 116 and so on and so forth. There's
one person who spent 21 years at school;
this person studied a lot but is getting
little compared to those who
studied for 19 or so. So you can see
the values that we have there. Of course,
we can also look at the beginning salary
versus the current salary. We also
seem to see a kind of trend going on:
if you start high you end high, and if you start low you
are also low. But there seems to be this
gentleman here:
he started from here and is now at
that point. So
there's also a positive trend there. Of
course there are these people here who
started low but were able to climb
very
fast. Okay, what we are also going to see is
that you can actually draw multiple
scatter plots at once. These ones here,
you can draw several of them at once, and
then you can see from here that there is
a relationship here;
this is the same as this one here.
You can see here there seems
to be no relationship, no relationship, no
relationship. Of course,
this is what we
call eyeballing; we may need to get
some figures in, so we can go ahead
and bring in some
values. That's where we look at things
like correlation and so on and so forth.
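Drawing several scatter plots at once and then putting figures behind the eyeballing can be sketched with base R's `pairs()` and `cor()`. The data frame `num` and its column names are hypothetical, and the values are made up for the example.

```r
# Toy data invented for illustration; NOT the data set used in the session.
num <- data.frame(
  salary   = c(21000, 25000, 27000, 30000, 31000, 52000, 70000, 30500),
  salbegin = c(10000, 12000, 12500, 14000, 15000, 25000, 32000, 15500),
  educ     = c(8, 12, 12, 15, 15, 16, 19, 12),
  prevexp  = c(200, 120, 36, 60, 10, 80, 24, 150)
)

# Scatter plot matrix: every pair of variables in one display
pairs(num)

# Pearson correlation matrix: one number per panel of the plot above
r <- cor(num)
print(round(r, 2))
```

Each off-diagonal cell of `r` corresponds to one panel of the scatter plot matrix, so the two displays are meant to be read side by side.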
Now one thing I also want to note here
is that when you want to
draw a scatter diagram,
it's always important to take into
consideration the fact that you're
dealing with different
categories. So it may be important
either to use color to show the
differences between the different groups,
or to subset your data and draw
a histogram or a scatter diagram for
the different groups
separately. If we look at female
managers only, you can see that
there seems to be some relationship
here between the beginning salary and
the current salary, some small
relationship, but for everything else
there seems to be no relationship at
all. So apart from where you started from,
you don't gain
any advantage, whether because of
education or anything else, when you're
female. That is what you can read from
this, but it depends entirely
on the expert in the field to explain
what is going on. Now when you look at
the male managers, you seem to see this
relationship is there;
maybe this bit here, and
there's something a bit there, but for
the rest you don't seem to see
a relationship. Now when you go to
clerical, we seem to see this
relationship
existing, and nothing really going
on elsewhere. Custodial: nothing seems to go on. So
based on the eye alone it may be a
little bit difficult to actually tell
what is going on, so the best we can do
is probably to go and look at what is
called correlation. Correlation
tends to give us the strength of a
linear
relationship between two
variables. Susan is going to
talk about this more, but I just want to
show you, as part of the exploratory data
analysis, what really
happens. Here the correlation falls
between -1 and +1; either extreme is a perfect
relationship. Zero is when there's no
linear relationship between them. So if a
value is close to +1 or close
to -1, then you say there is a strong
linear relationship; if it is far away,
closer to zero, then there's no linear
relationship. There is the possibility
of having another kind of relationship, but the
linear one is not there. So when we look
at current salary versus beginning
salary, you can see
0.88. This is close to +1,
so that means there's a strong positive
relationship between the beginning
salary and your current salary. That
means if you start low, you'll also
be earning low; the higher you start, the
higher your next salary will be. That
is a positive trend going in that
direction. If we put it in for all the
observations, we can see there is a
relationship
here, there is a relationship here,
0.66, there is a relationship here, and
then nothing here, nothing there,
something negative here, but
nothing; there's something negative here.
So these are the different
relationships that you'll actually
have. Now of course you need to be able
to understand; first, you
need to be able to interpret these
values, what they mean, and then from
there you should be able to say,
well, this value means this, and
based on my knowledge or based
on the theory that I have, this
makes sense or this doesn't make sense.
Now remember this one is based on all
the
observations. I may want to see the
correlation for only managers, or
I may want to see the relationship for
only
female managers or male managers. The
subset you select to observe will
depend entirely on your own knowledge or
how you want your story to be. Writing a
thesis or writing a paper is about
storytelling, so you need to have a rough
idea what kind of story you want to tell,
and then the statistics will come and
help you to tell that story. If you don't
have a story to tell, the statistics
will not give you anything, I promise you
that. Okay, so here
let's look simply at the correlation
between the current salary and the other
variables
for the combined data
set. You can see that there's a
strong relationship between the current
salary and the beginning salary, and
education; so these are two positive
relationships.
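Correlating one variable of interest with each of the others can be sketched as below, together with a reminder of the earlier caveat that a value near zero only rules out a linear relationship. The data frame `num`, its column names and all values are invented for illustration.

```r
# Toy data invented for illustration; NOT the data set used in the session.
num <- data.frame(
  salary   = c(21000, 25000, 27000, 30000, 31000, 52000, 70000, 30500),
  salbegin = c(10000, 12000, 12500, 14000, 15000, 25000, 32000, 15500),
  educ     = c(8, 12, 12, 15, 15, 16, 19, 12),
  prevexp  = c(200, 120, 36, 60, 10, 80, 24, 150)
)

# Correlation of current salary with each of the other variables
others   <- c("salbegin", "educ", "prevexp")
r_salary <- sapply(num[others], function(v) cor(num$salary, v))
print(round(r_salary, 2))

# Caveat: a perfect but non-linear relationship can still have r = 0
x <- -3:3
y <- x^2            # y is fully determined by x, but not linearly
r_curve <- cor(x, y)
print(r_curve)
```

The second part is why the scatter plot comes first: the curve is obvious to the eye but invisible to the correlation coefficient.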
Something strange is also coming up here,
telling you
that your previous experience
may actually work against you: it has a
negative
correlation. I don't know what that means
in terms of employment, but it could be
related to technology changes and so on
and so forth. Now let's look at the
managers. If you look at managers
alone, we now
see beginning salary and current
salary is still strong, but education now
doesn't
seem to
have a strong relationship when you're
dealing with managers alone. For clerical
it is also
reduced; for custodial it is even negative
now, so if you seem to have gone to
school too much, you seem
to lose out in the custodial
jobs. So you can see from here that the
relationship between the current salary
and the beginning
salary, let's
look at one at a time,
changes
depending on the group that you're
dealing with. Now look here: for custodial
there's actually no relationship between
the beginning salary and the current
salary. So what
is clear from this table is
that you should not rush to run your
correlation for the entire data set and
stop there. You may want to run the
correlation on the different subsets and
then see whether something is going
on. Like here, for
example, if you're clerical
staff, the more
experience you had in your previous job,
the lower your current salary
is likely to be. The question would be
why. The same for the managers: we seem to
also have here that if a
female manager
had more previous
experience, she is likely to be paid
less, to have a lower current salary,
and
then previous years of experience
seems to have no effect on the male
managers. It seems to have a negative
effect on the clerical staff, both male and
female. Now these are the things that
should be tickling your mind as a
researcher, to say, oh, what is going on
here? Okay, so this is where we
are going to stop for now. I want us to
rest; by resting, I mean I'm going to be
asking you a few questions in relation
to the data set.
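The advice above, not to stop at the correlation for the entire data set but to repeat it within subsets, can be sketched with `split()` and `sapply()`. The data frame `emp`, its columns and its values are all hypothetical toy data.

```r
# Toy data invented for illustration; NOT the data set used in the session.
emp <- data.frame(
  jobcat   = c(rep("Clerical", 4), rep("Custodial", 3), rep("Manager", 4)),
  salbegin = c(12000, 12500, 14000, 15500,
               15000, 15500, 16000,
               25000, 32000, 28000, 40000),
  salary   = c(25000, 27000, 30000, 30500,
               30500, 30000, 31000,
               52000, 70000, 60000, 90000)
)

# Correlation over all observations combined
overall <- cor(emp$salary, emp$salbegin)

# The same correlation computed separately within each job category
by_group <- sapply(split(emp, emp$jobcat),
                   function(g) cor(g$salary, g$salbegin))

print(round(overall, 2))
print(round(by_group, 2))
```

Comparing `overall` with the entries of `by_group` shows whether the pooled figure hides groups that behave differently, which is the question the subgroup tables in the session are meant to raise.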
Yesterday we gave you the
material for day two and day three, so
let's go
to the chat
first. How many people have downloaded
the material? How many people
have the folder with the name
"exploratory"? Thomas, tell them to download
the zip folder. Some of them are saying
that the data is corrupt, but within the zip
folder the data is there. Okay, so
there's a zip folder; you need
to pick that folder and unzip
it. Somebody said they cannot open it. Yes,
you need to unzip the
folder. It's not opening? You need to
unzip
it. You don't need to open the PDF or
anything else
now. Can the link be shared so
that those who have not
downloaded can download
now? Don't write anything now; we're going
to share the
link. Okay, the link has been shared.
Please download the material; if it is
giving you trouble, download again.
Okay,
download. I'm going to show you the
content, what you are supposed to have in
that
folder. The link has been shared again
and
again.
Okay, what I want is this:
these are the contents of the
folder. The name of the folder is
"exploratory and
inferential", if the name
has been maintained
by our colleague Helen, who was the one who
shared it. So I want to know how many
people have the contents that I have
here.
Okay, I can see "no script". The script is
there; the script name ends with .R,
capital
R. See, check what I
have: the script is the one named
exploratory data analysis, dot
R. There is no way you can have the folder
without the script; the script has to be
there. So I want you to get that folder
and put it
somewhere. If you're not very sure of what has been
going on and you want to play
safe, take that folder and put it on
the
desktop. The script should be inside.
How
many files do you see in that
folder? There is Algeria with .csv,
the exploratory data 20th
May .pdf, there is exploratory
underscore data underscore analysis .R, there is
factorial
.csv,
unless your system decided to
delete it, thinking it is a virus.
Don't tell me it is empty; it's not
empty. No,
these files are not
damaged, unless they were corrupted on
your own computer.
So those with only
eight,
please
try to download again and see if you
can get all
nine. At least the majority have them.
Yes, so I generated mine on a
Mac, so there are two folders there,
another folder within the folder
itself. Okay, anybody
who says they cannot get it, you can
be promoted to share your screen and
we see.
Who wants to share the screen? Who is
having a challenge? Share the screen
and we see what is
there. You say you cannot get it;
maybe they can share and we see what the
problem is. Can you
allow them to
share? I also want to see the face of my
former student, if he's the
one. I can stop
sharing that person is not yet put up
the hand it's called who
H can you put up your hand and you're
promoted anybody with a problem please
put up your hand and we promote it
okay we have
promoted can you share your
screen he's joining as a panelist so he
should be able to share his okay
okay what happened to him is it
sharing no not yet okay let Holden a
bit say could not be able to
display I'm promoting him but he's not
to be promoted let me just pick another
P he's there now he can share okay
okay H you can you can share your
screen we have also promoted Prince Maru
marua
okay okay starting starting to is
starting to
share who declined no it's trying
somebody's sharing we can we can wait
okay okay so where where's the
folder
okay uh click on that
one ah okay H go to the next
one yes okay
please yeah what you want to do is you
copy all these files copy all of
them and go and create a new folder and
put it on the
desktop create a new
folder ah just say day
three okay you put put the content
inside
okay yes paste inside
there
Okay, now what I want you to do...
okay, that's fine. Can you open the other
folder that we started with? I think now
you have all the files, that's okay.
But I want you to go back to
where we started from; I want other
people to also
see. Close that one.
Yes, open the folder where you had it
before. Okay, so now look here:
there is a "day two"; when you
open it, inside there are two
folders. Not that
one, go
back, go back.
Yes, okay. You see, I use the
Mac
system, so you may have two more files
in the second
one.
Okay, so what I want to show is that
a folder could be empty,
but there are many folders in there,
so you need to look for the right one.
The ones with a dot, these ones here, are
corrupted files; they're not the ones
you're looking for. So you need to look
for the right
folder. There are two folders there;
why don't you open that one? Yes, open that
one.
No, no, first start the
process
again. What I want to show is that
if you download that folder, because
I use a Mac, it
tends to create another image, a
folder image that may not have the
content. So you can use that day three
material. You can now stop sharing.
So, if possible, I want everybody
to create that folder on the
desktop, and then you'll be ready to use
it. You should be able to have nine
different files within it. You can stop
sharing now.
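Checking what is really in the unzipped folder can also be done from R. The sketch below builds a small throwaway folder so that it runs anywhere; the folder name `day3_demo` and the file names are hypothetical stand-ins, not the course files. It assumes the extra entries come from a zip created on a Mac, which typically adds a `__MACOSX` folder holding small metadata files whose names start with `._`.

```r
# Build a throwaway folder with made-up file names (NOT the course files).
demo_dir <- file.path(tempdir(), "day3_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir,
                      c("a.csv", "b.txt", "script.R", "._a.csv")))

# all.files = TRUE also lists hidden entries; no.. = TRUE drops "." and ".."
all_files <- list.files(demo_dir, all.files = TRUE, no.. = TRUE)

# Keep only the real files: drop the "._" metadata entries
real_files <- all_files[!startsWith(all_files, "._")]

print(real_files)
print(table(tools::file_ext(real_files)))   # how many files of each type
```

On a real download, replacing `demo_dir` with the path of the unzipped folder gives a quick count to compare against the nine files expected.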
Any other person who needs to
share? Okay, Prince, can you share so that
we can start? "No, it's sorted, thank you."
Okay. "It was because of the
other folder, which was hidden somewhere."
Okay, good. So any other person who has
not
yet done it, so that we can help?
Ambert, you say you're only seeing
eight; can you share your screen? Ambert, can you?
Promote him. We have promoted Steven; he had
raised his hand. Okay, Steven,
you got nine but one is empty? One is not
opening? Which one is not opening? Okay,
Steven is
sorted.
Amoot,
okay, can you share your screen?
Okay, after that you
go down; there's a
green thing with an
arrow
below, the arrow is pointing upward:
"share screen". Yes, so just click on
that. Have you seen it?
We have seen
you, but you select the... okay.
Ah, okay, can you go back to where
you got the file
from? No, no, you go back to where you
got it from. There's another folder; you
picked the wrong one. Let him
go to the Google Drive link.
I think he has it already. Can you
go to your download
folder? Are we
together,
Amo? If you have these ones with
dots, that
means there is more than one
folder. You go up and open the one without
the
dots. Amo, can you go up again and go
to
another?
Okay, I think now we have lost
him. Okay, so what I'm going to do: let me
share my folder again.
Okay, so I want you to
open your folder. I
want you to see these files
here.
Okay, so we have the files that end
with the extension .csv:
.csv, .csv, .csv,
.csv, .csv. Then there's pollination two;
pollination two doesn't have .csv,
that's the text file, so it's supposed to
have
.txt. Then we have the PDF, that
is the presentation, and
then we have the .R, that
is
the script. So I
want you to right-click on the
one with the .R and
open it with RStudio. Just click
and see what I'm doing.
Are we open?
Okay, I want you to see the
screen; look at my
screen. Have we all opened the
script? If it can't open, that means
you're not opening the right
file. Okay,
can I get only those who have
failed to write? If you failed, type
"one". Okay, among those who failed, can
one of you raise your hand
and show us what you're
trying to open?
If you don't have the file, you have to
run to M and pick it from here
now. I promoted
B. Okay, Bon, can you show us? Let me
stop
sharing. Bon, can you show us your screen?
Hello, I've not failed to
open it, but I just wanted to let
participants know that maybe at
their end they're double-clicking to
open it, so they should right-click
and then choose the application to open it with,
which is RStudio; that would be the
challenge. Yes, thank you. Okay, good, thank
you. Yes, if you open it and it says "install
Agri...", say yes; you can click on Install,
that's okay.
We have promoted
T.
Yes,
T? Yes, just one person now, so I can see what
is happening. I know somebody says I'm
using the old one, which is
true, because I'm old.
Okay, one person to show, then we start;
otherwise we are not going to do
anything. We are going to form a coalition
of the
willing. The data sets are in the folder,
so if you have the folder you have
everything. Yes, somebody puts it right:
the end justifies the
means. Okay, so I think
we are
going to stop the process and I'm going
to start running. Let me share my
screen
again. Okay, so the first thing is,
we have all installed R and
RStudio, and then, as Helen was saying
yesterday, you need
packages to run the analysis. Now with
packages there are two things.
The first thing is you need to install
the packages.
If you have already installed the
package once, you don't need to install
it again. So the first step is to install
the package. If you already installed
the package, the next step is to
call that package in so that you can use
it. So first I want you to
run line number
three. Line number three says library(car).
I want you to run it and tell me what you
see. Mine says "Loading required package:
carData".
There are people saying there's no
package.
Okay, for those who say there is
no package called
"car", okay, let's stop.
Now I want you to
go to line number
two. First listen to the instruction; stop
writing.
I want you to go to line number two and
remove the hash
symbol. Remove that and run that
line: install.packages("car").
So if it tells
you library car is not found, that means it
has not been installed, so you need to install
it. Please don't just say "error, error";
there are all sorts of
errors. Okay, so you need to
install the library:
install.packages("car").
And then once you do the
installation, you can try to run the
library
again. Mohammad, you first install the
library car. Look at my line number
two and run line number two: remove
the first
symbol, the hash symbol, remove it and
run the
line. Only install the package when you
run line three and it shows you an
error. So if it is successful, that is
okay.
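The install-once, load-every-time routine described above can be wrapped in a small check, sketched below. The helper name `missing_packages` is hypothetical; `car` is the package named in the session. The install line is left commented out because it needs an internet connection and only has to run once.

```r
# Hypothetical helper: returns the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed     <- c("car")
to_install <- missing_packages(needed)

if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)   # run once; requires internet access
} else {
  library(car)                      # load the package for this session
}
```

This mirrors the rule given in the session: try to load first, and only install when loading fails.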
Okay, so let's go and run
line number
five. If line number five shows you an error,
you can run line number
four, which installs the package.
Okay, so you can repeat that for all of
them. Okay, so can you run line number
six? If you're
not sure whether you installed it or not,
first run line seven; if you see an error,
then you go and run line
six. You run nine; if you see an error you go
and run
eight. Run
11; if you see an error go and run line 10.
Run 13; if you see an error go and run line
12, and so on and so
forth. You may not need all of this
now.
Okay, so I'm going to give
you
five minutes
to run all the lines until line number
19. How many people have run successfully up
to line 19?
Yes, if the car
package loads successfully, you continue
running.
There's somebody still asking that we go
over it again from
loading. Yes, you can wait for it to
run, so we're going to take a few
minutes to run.
If your script cannot open, that
means you are in the wrong folder. First
look for the
right folder. If your
script is not running, then you're in the
wrong
folder. Go and try to look for the
folder that has files without
the dots. Have I moved my screen? Do you
see my screen well
now? Okay, there are those who have
finished
everything. Okay, three more minutes.
You try to help them first; I
need to scan something.
Hello. In case you get a message that
a package is required, that means that you
haven't installed the package car, so
make sure that you first install the
package before you load
it. Someone is saying that they don't
have a link to
RStudio. Thomas, over to
you. Okay, Helen, can you post the
Google link in the chat? Someone wants
the Google
link. Now if you don't have the link to
RStudio, go and type "RStudio" in
Google and then click; you will see
download information coming up with the
link, and then click on
download RStudio, either for Windows
or for Mac or for whatever type of
laptop you
have. Thomas is coming back; I'm taking
over briefly while we see that people can
load their packages. Also, after
loading the
packages, in case you get an error
at "Loading required package",
that means you've not actually installed
the package. So make sure you install the
package: go through the list and install
each package first. Also, if you
cannot
access your data, that means you've
not set your working directory. But
let's make sure that we are comfortable
with installing the packages and
loading them. Okay, so in the chat, is
everybody comfortable with installing
and loading packages? Let me see yes or
no.
Okay, Helen, please send the link to
the Google documents;
send it three or four times. So
whoever hasn't got the data set, please,
the Google link has been shared by Dr
Helen. Just copy it; you can even
paste it into Google and run it and it will give
you the documents, or click on it and
you go straight to the folder.
"How can I update my RStudio to the
latest version?" We dealt with that on the
first day; Dr Helen did that. But if
you want to update, you have to download
the new version of RStudio, then install
it, and then open RStudio and it
will be
updated. If you click
on the link, you have to unzip the
folder in order to access the data in
the
folder. "Masked" is not a problem: masked
means something is hidden behind something else with the same name.
Just try loading that
package and see whether it
runs. Yes, if it has "successfully
unpacked", it means it has been installed
on your computer;
it's ready to use.
installr: installr is also a package
that is used for updating
R. I'm just looking at the
chat to see
the queries of the
participants, but if you need a longer
answer, please use the Q and A, and my
colleague Dr Helen will be answering
some of those questions in the Q and
A. How do you unzip? Just double-click on
the folder and it will ask whether you
want to unzip.
"Error: there is no package called car": please
install it.
Aresh, you have the old version; 4.3.3
is an old version of R. You will need to
update later, after this session. The
current version is 4.4.
For you to know whether RStudio
has been updated, go to Help and check for
updates (Tools has the check for package
updates); then it will tell you
whether RStudio is up to
date.
I've seen this message about a link for the
WhatsApp group. Whoever created the
WhatsApp group, can you please share the
link?
Adaa, you are talking of installing
packages. Which package? The car package?
If you want to install the package, go to
your RStudio window and type install,
dot, packages, then in brackets and inverted
commas, car, close bracket, and run that.
After running, if it is successful, then
type library, in brackets, car, and run
again; then the package will be
loaded. Okay, back to Dr Thomas; he is going
to take over from there. Thank
you. Thank you very much. Okay, I think
now we can continue
running
the commands.
Thank you very much, Susan, for stepping
in. Okay, so what we are going to do now:
yesterday you were taught how to set
a working directory, so we are going to
continue with that.
You need to know
where your
folder is. So let's go; we want to set a
working directory. I'm going to show you
one way and later I'll show you another
one. So can we go to
Session? Click on Session,
up at the top of the bar: there is
RStudio, there's File, there's Edit, there's
Code, there is View, there's Plots, then
Session. Come down to Set Working
Directory,
then Choose Directory. Just see my
screen:
Session, Set Working Directory, Choose
Directory. Are we
there? You click on
that. So look
for where your folder is.
Mine is on my
desktop, under RUFORUM
training, and then exploratory.
So you put the cursor on the folder that
you want to make your working
directory, and then you click
Open. If you click Open, you should be
able to see a command. You need to
look
down in the
console; you should see something like "set
working directory", and it gives you some
command. Let me see what is in the
chat. Can you type what you have seen in
the
chat? What do you have in your working
directory?
That one is
good.
Okay, I want you to look at the
console; look at my R console, look at
mine. Mine is here; mine is telling me: setwd,
double quote, tilde, Desktop, RUFORUM
training, stroke, exploratory and
inferential. If you have any query, put it
in the Q and A and it will be answered.
I just want to see what you see
on your
console. Please stop insisting on putting
your number and talking about WhatsApp;
we can do that
later. We are going to
ask RUFORUM to
give you five minutes to sort out the
WhatsApp issue after the lecture is
done.
Okay, anybody with a challenge doing that,
can you raise your hand and they will give
you
options? Put all the nine files in the
working
directory. Anybody with a
challenge, can you please raise your hand
so that we can
help? Francis Mira.
Okay, Francis, yes, Francis, can you share?
Let me stop sharing my screen. Francis, can
you share your
screen? Oh,
nice, okay, a nice baby. But
you need to share the screen with the
folder
on it. You have many things; when you are asked
about sharing the screen, you select the
one... select another screen to share. You
have many screens
there. Now let's first leave
the packages for now. What we want is
this:
can you show us the
folder, or show us your
RStudio? Can you talk and tell us what is
there? Yes, just
talk now; unmute yourself and tell us
what is
there. "Okay, I am not able to load the
data. If I go to Session
and New
Session..." Not New Session; go to Set
Working Directory. "Sorry, sorry."
You go to
Session. Tell us what you're doing so
that we can direct you. "I choose the
working directory, my directory. I already
saved the data on the desktop."
Did you create a
folder where you put those data?
"Yes sir, it is this one, written 'day
three'." Okay, you click on day three;
don't open
it. So what you do: you go to
Session, go to Set Working Directory,
Choose Directory, yes, go to Desktop and
click
on the folder. Don't open it,
don't double-click to open it; just click
on it, put the cursor on it. Yes, then you go
to where it is written "Open" and you
click on
it. "Okay, thank you." Okay, good, you can stop
sharing
now. He was not sharing,
was he? Okay, any other person who wants to
share the screen with a
challenge? We are not using Excel,
please. There's somebody appearing as
anonymous; check if you're appearing
as an
anonymous
attendee. Can you share if you are
promoted? There is an anonymous attendee.
I've promoted
Emu. Okay, Emu, can you
share? Yes, we are setting the entire
folder as a working directory. A working
directory is where the work is going to
be done, so you're telling R: if you're
looking for a data set, look within the
working directory, don't look
elsewhere. "Yes, you promoted me, but I
already fixed the problem." Good, then. Okay,
thank you. Okay, so let me now go
back and share my screen. I'm now
going to go and start the
analysis. You cannot see the Session menu?
Who is that? Paul, can we
promote
Paul?
It's very important that
you're in the right folder; then things
will move very fast. If you are not in
the right folder, things are going to be
very
messy. He's joining as a panelist.
Okay. "Yes sir." Okay, so can you show us
your screen or tell us what is going
on? Try and share
it. "I fail to find that."
Do you have the four
windows? "Yes, I do." Can you go up to
the
uppermost bar on top?
Do you see anything on
top? Try to move your cursor as if
you're taking it out of the screen at the
top. Yes, do you see RStudio,
File, Edit,
Code? "I'm also using a Mac." Yes, using a
Mac. Do you see where the
Apple sign
is? You see the small Apple at
the left-hand corner? "Yes." Next to
that Apple there must be the word RStudio.
"Yes." The next one is File. "Yes."
Edit, you see
Edit? "Yes."
View? "Yes."
Plots? "Plots, yes." And Session? "Yes." Yes, you
click on
Session. Okay, do you see Set Working
Directory? "Yes." Choose
Directory? "Yes." Then you look for
where your folder is, put the cursor on it, and then
click Open. "Oh, okay." You
click on the folder itself and
then you go and click Open.
Highlight the folder before you click
Open; don't double-click on the folder,
because then you're inside it.
It has to be on the folder, not
inside the
folder. "Yes, it's fine now, thank you."
okay great okay now
what you can do is copy
the command that is shown in
the R console and put
it in your script so that next
time you can run that line
so you see my line number 21
do you see my line number 21 I hope I'm
sharing my screen
I'm not
okay
sorry
do you see my line number
21 if you see my line number 21 please
copy your command from the console and use it
to replace my line number 21 replace
this line here with the command
you copied from the console
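As a rough sketch of what that copied line looks like in a script: the menu choice Session > Set Working Directory > Choose Directory prints a `setwd("...")` call in the console. The folder below is hypothetical (a temporary directory named `eda_training` stands in for your own course folder so the sketch runs anywhere); in your script you would use the path that RStudio printed for you.

```r
# Hypothetical folder: a temporary directory stands in for your course folder.
# In your own script, paste the setwd("...") line that RStudio printed instead.
course_dir <- file.path(tempdir(), "eda_training")  # assumed name, not from the session
dir.create(course_dir, showWarnings = FALSE)

old_dir <- setwd(course_dir)  # setwd() returns the previous directory invisibly
getwd()                       # confirm where R is now looking for files
list.files()                  # files R can read here by bare file name

setwd(old_dir)                # go back to where we started
```

Keeping the `setwd()` line near the top of the script means the next session starts in the right folder without going through the menu again.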
okay okay so now I I want us to use the
next 20
minutes no one is attending to
BOS can you help
BOS he said he can't open a
f can you let this one be the the last
person we promote then we can run for the
last 15 minutes run the file and then
tomorrow pick it up and finish it up
before we hand it over but it's
dragging too
much can we see the the person
showing this step is especially
important why is this important
because in most cases you don't need to
be perfect at writing code what you
need is to be able to get
somebody's code modify it
put your data in the right
folder make that folder your working
directory and everything else will work
perfectly that's why I'm stressing
this okay good
I I
demonstrated okay so we I'm going to
demonstrate
again okay so what I'm going to do just
look at
uh sorry
just hold on a
bit okay so I want all of us to go up on
session uh from session you go down to
set working directory and choose a
directory so that's how we set a working
directory you go to
session uh set working
directory uh choose a directory
please go to Q and A and people will
help you there to download the
material okay so we go to
session uh set working directory and
choose directory then from
there you go to where your folder is
so you put the cursor on the folder not
inside the folder
okay so for now we are going to run
so I want us
to look at my line number
24 can you run my line number 24 I have
a question
mark and then I type the word iris I
want us to run that
line what do you
see so you can see in the
lower right
window it tells us that we are
talking about the iris data set it tells
you Edgar Anderson's iris data
it describes the data set for you so this
is one of the data sets that we have in
R and this data set is available for
you to use to practice so when we are
teaching R in most cases we use the data
sets that are built into R so we're
going to start with that and then later
we're going to call our own data set
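A minimal sketch of that first look at the built-in data set. The help call is the long form of typing `?iris`; it is wrapped in `interactive()` here only so the sketch also runs outside RStudio, which is my addition and not part of the session script.

```r
# Open the help page for the built-in data set (same as typing ?iris).
if (interactive()) help("iris")

# iris ships with R, so no file has to be read from the working directory.
str(iris)              # 150 rows: four numeric measurements plus Species
head(iris)             # first six rows
levels(iris$Species)   # the three categories of the qualitative variable
```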
okay so can we run the
summary okay so I'll run the summary let
me go to the
chat okay so what is the
minimum
observation can you type the minimum
observation I see what you've
seen 4.3 okay the
maximum okay good yeah so
remember we talked about the five number
summary so here you have the five number
summary okay thank you you can stop
typing now I've seen all of you got it
correctly so this is the five number
summary that we talked about
plus the mean so the mean is also
included there so from here you can see
that the mean and the median are very
close together now I want us to run line
number
26 so in line number 26 we are trying to
plot the sepal length versus the sepal
width of iris
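The summary step run just before this plot can be sketched as follows. I am assuming the minimum of 4.3 that participants typed refers to `Sepal.Length`, since that is the column of `iris` with that minimum; the exact line in the course script is not shown in the session.

```r
sepal_length <- iris$Sepal.Length

# summary() prints the five-number summary plus the mean.
summary(sepal_length)

# fivenum() returns only the five numbers: min, lower hinge, median, upper hinge, max.
five_num <- fivenum(sepal_length)
five_num

# How far apart are the mean and the median?
gap <- mean(sepal_length) - median(sepal_length)
gap
```

A small gap between mean and median, as here, is a quick hint that the distribution is not strongly skewed.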
how many people are seeing a
graph so let me okay I can see
good okay
great let me stop this and share another
screen of mine
yes okay let's first wait we can
continue the installation later I want us to
first look at this so you can see from
here we talked about the scatter diagram
a scatter plot and here you're looking
at two variables at once so you can
see here we have the iris sepal length which is
on the x-axis and then the sepal width
on the y-axis do you seem
to see any relationship between the
two variables let me see on the chat
is there a relationship between the two
variables no no no no
okay okay so we seem to see no
relationship between the two variables
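A sketch of the plain scatter plot being discussed, with sepal length on the x-axis and sepal width on the y-axis. The exact wording of line 26 in the course script is not visible in the session, so this is one plausible form. The correlation line is my addition, to put a number on the "no relationship" impression.

```r
# Scatter plot of two quantitative variables; iris$Sepal.Length uses the
# dollar sign to pick one column out of the data set.
plot(iris$Sepal.Length, iris$Sepal.Width)

# Pearson correlation over all 150 flowers, ignoring species.
sepal_cor <- cor(iris$Sepal.Length, iris$Sepal.Width)
sepal_cor
```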
now we're going to improve
on that
okay let's run the second line line
27 okay let's run line 27 and
see what happens in line
27 okay so we can see from line 27 that
we are able to put in color
okay now the colors show different
species because there are three species
here so there are three different
categories so when you look here for
example you can see that the one in
black seems to have a very good linear
relationship between the sepal length and
the sepal width if you look only at the one in
black
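A sketch of the colored version, assuming line 27 passes the species factor to `col` (the script line itself is not shown). When a factor is given as `col`, R uses its integer codes to pick colors from the default palette, so the first level is drawn in black. The last lines, which look at that first species on its own, are my addition.

```r
# Same scatter plot, with the categorical variable Species mapped to color.
plot(iris$Sepal.Length, iris$Sepal.Width, col = iris$Species)

# Which species gets which palette position (1 = black in the default palette).
species_codes <- setNames(seq_along(levels(iris$Species)), levels(iris$Species))
species_codes

# Correlation within the first species only.
setosa <- subset(iris, Species == "setosa")
setosa_cor <- cor(setosa$Sepal.Length, setosa$Sepal.Width)
setosa_cor
```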
you can uh you can unmute yourself and
tell us what is the the
issue first tell us what you want to
share then we can uh see we can allow
you to
share okay so what we can see from here is
that apart from simply plotting the
two variables the sepal length and
the sepal width we can also color so
that we have different colors for the
different species so this is one of the
powers of R this is where R
excels I still see some no's I don't
know for
what how do you know which one is sepal
length and sepal width no don't worry
about the key for now the keys will come
later on you can see down here it
says iris then there is a dollar sign
then sepal length and on the upper one
we have the sepal width that's how you can
tell the difference but what I want you
to know in this case is that we can plot
two variables and then we can bring
in a third variable which is the species
which is a categorical variable to
differentiate them now R is
very good as far as graphics are
concerned so later we are going to see
how to build on the simple
graphics here until you can write
your title and you can
change the different symbols that you
want to use so for now let's first try
to enjoy what we are seeing
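As a preview of the kind of refinement mentioned here (a title, different symbols, and a key), one possible sketch using base graphics is below. The color names, the title text, and the axis labels are illustrative choices of mine, not taken from the course script.

```r
# One explicit color per species (illustrative choice of colors).
species_cols <- c(setosa = "black", versicolor = "red", virginica = "green3")
point_cols <- unname(species_cols[as.character(iris$Species)])

plot(iris$Sepal.Length, iris$Sepal.Width,
     col  = point_cols,
     pch  = 19,                                   # filled circles instead of open ones
     main = "Sepal length vs. sepal width by species",
     xlab = "Sepal length (cm)",
     ylab = "Sepal width (cm)")

# The key: which color belongs to which species.
legend("topright", legend = names(species_cols),
       col = unname(species_cols), pch = 19)
```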
here okay so can we run line number okay
so let's first run the line then we
can begin to look at it so let's look
at line number
28 so in 28 we are looking at the sepal length
and the petal length
okay let me first stop sharing
okay I want to zoom then
I can share the zoom
but okay so you can see for sepal length
and petal length that we have
two kinds of relationship the red
and the green seem to have a good
linear relationship between the sepal
length and the petal length while
the third species
first of all the third
species the petal lengths
are very short you see almost all the
values are below two while for the other
ones the values run from three up to
around seven you see this is the
information that the coloring brings
in now if you don't color you seem to
see a kind of relationship that may not
really fit that well and if you remember
the
correlation how many people are not
seeing my
screen how many people are not seeing my
screen yeah
okay I'm going to redraw the
graph okay so what you can see
here is that we have our first window where we
have the
commands
then below the commands we have the R
console that is our second window and
then our third window is where we
have the environment and the fourth
window is where the graph is being
shown so you can see from here
that in the fourth
window that is
the lower window on the
right we can see sepal length versus petal
length and we can see that there seem
to be two relationships here there is the
black one and then the other two so
the
three species that we are dealing with
seem to behave
differently okay let me go to the chat
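The sepal length versus petal length plot discussed above can be sketched like this, again assuming the species factor is passed to `col`. The range and correlation lines are my addition; they check numerically what the colors show, namely that one species has much shorter petals and that pooling all species hides group differences.

```r
# Sepal length against petal length, colored by species.
plot(iris$Sepal.Length, iris$Petal.Length, col = iris$Species)

# Minimum and maximum petal length per species (rows: min, max).
petal_range <- sapply(split(iris$Petal.Length, iris$Species), range)
petal_range

# Correlation ignoring species, then within each species.
cor_all <- cor(iris$Sepal.Length, iris$Petal.Length)
cor_by_species <- sapply(split(iris, iris$Species),
                         function(d) cor(d$Sepal.Length, d$Petal.Length))
cor_all
cor_by_species
```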
uh please help we what is wrong what
what could be wrong
uh so J Janella can you can you promote
Janella in CB to share the screen uh the
last one so that we can conclude by
looking at what is on the
screen Janella can you share can you
promote Janella to share screen
Janella is it Janella zanella don't know
z
z
oh sorry I'm promoting
it
to yes zanel is joined as a panelist now
so she should be able to okay
oh let me stop sharing
first okay can you share go
ahead yeah and then amito after we need
to get the this people five minutes to
sort out a few things on their Zoom you
always close very fast so the first
thing is you see here on top
it says that packages agricolae car doBy
and others are required but not
installed you see you can actually
click install and
then you can install those packages but for
now first show us what you have okay
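The same check can be done from the console instead of the yellow bar. This is a sketch under two assumptions: that the packages named in the warning are `agricolae`, `car` and `doBy` (my reading of the caption text; the warning also lists others that are not named), and that you have an internet connection for the install step. The install is guarded by `interactive()` so nothing is downloaded when the script is run non-interactively.

```r
# Assumed package names, read from the garbled warning text.
needed <- c("agricolae", "car", "doBy")

# TRUE for each package that is already installed and loadable.
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]
missing_pkgs

# Install only what is missing, and only in an interactive session.
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}
```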
can you run that line 26 and we see can
you see my R yes we can see okay yeah
can you run your line
26 you don't need to highlight you just
need to have the cursor on the line
I don't know something seems to be wrong
with your
graphics I don't see anything okay can
you first go to
the side of
the no go to the lower the fourth
window can you click on packages yeah
can you click on packages up
there click on that click on help
click on help
I want you just to try to click a few
things up and then maybe we try to run
and see what's happening okay you go
back to the
plot uh I don't try try to run again and
we
see it
looks it's empty I don't get any plot uh
okay I think this is what you do now
first install click
install up
there no no no look at the
warning line the yellow line the
yellow
line you see the yellow line with the
warning on
top of your
script immediately on top of your script
I see it I see it can you click the
install part of it click on
that mm I've clicked on it yeah just
click on it and let it install and then
we sort out the rest
tomorrow allow it to
install those ones and then that's
fine so what we are going to do I
request you uh ensure please this is
your
assignment ensure
that you have installed all the
packages and then make sure that you
are able to set a working directory and
then try to run the lines that we have
here run them don't worry whether you
get errors or not try to run them and
then once you get an error try to look
at the error and understand what the
error could be telling
you so we will pick this up tomorrow
we need to run this and if you can run
the whole thing I would be very
happy with that so I request that you
be given five minutes so that you get
the link to the WhatsApp group I'm also
not on the WhatsApp group
uh I'll give my my my number to to to
one of my colleagues to give to the
person who open it so that I get also
included onto that or I'll try and see
so please let's have five minutes to try
to sort that one out uh then we can call
it a day but tomorrow we going to pick
up and and and run this uh slowly until
uh we reach the end please remember
there's YouTube channel you can go and
follow uh there's a lot of material on
on the on on on on the web you can go
and
follow Okay so
yes please I I it's done installing but
still the plots are not showing so I'm
not sure what now no first the think
first close it up and then we can uh
redo it again and then we see but uh I
need to check probably it could be
related to some graphic issue with your
computer which computer which computer
are you
using I'm using a Huawei computer
on Windows yes Windows yeah what you
can do is go on
Google and type I cannot see my R
plot you will see a
lot of suggestions of
what you can
do thank you very much
okay so no giving up takes take your
time and and and slowly but surely we'll
be there
but following the instruction is very
critical at this
point so you have five minutes to to
sort out the the issue of
uh the issue of of
WhatsApp or one person could collect the
the the the numbers and give it to
whoever doeses but the link has been
shared you have to just click the link
and then you can have it
we are going to see how to
save the script but if you know how
to save your file in
Microsoft Word you
do the same thing you
know can this person who calls him or
Anonymous I mean you did not sign in
just say some I mean we can have a look
at the you can say I mean you seems to
be having some challenges can you talk
just unmute
yourself unmute yourself and tell us
exactly what are the what challenges
you're getting
is there any way one can retrieve the
console without running aror okay the
the the whole thing here is that once
you have your script and your script
runs you don't need any other thing you
can keep your script and run it again
and
again okay so you can save
the file the way you
save your normal Microsoft Word file
or Excel file whatever it is and
tomorrow I'll
demonstrate how that is
done please make sure you
have the folder by tomorrow go
and copy all the content and
create your own folder on the desktop or
somewhere where you can access it
very
easily you can also post your problem on
the WhatsApp and then your colleagues
should be able to help
you okay
please somebody's dying to be on
WhatsApp please
help okay so bye-bye
everyone bye thank you for being good
participants we see you
tomorrow thank
you thank you Ellen how are
you very well thank you thank you for
the work well done enen stop
hiding I don't know are you in the
kitchen or in
the today I'm hiding
we want to see
you I'm
here always we used to being together
yeah it's
okay seeing you
guys better bring me some
cheese and
chocolate yes
good evening good good evening s oh it
is good morning
yeah all right
byebye
bye regards to all okay I will thanks
see you soon sure sure yes