Day Two R Training: From Data Manipulation to Exploratory Data Analysis

Name: Day 2: Statistical Data Analysis using R Programming for Staff and Students of Makerere University
Uploaded: 2026-01-14T15:26:58.538024+00:00
Channel: RUFORUMNetwork
YouTube video ID: GhHKfGNj7HA
Source: YouTube video by RUFORUMNetwork
Introduction

The second day of the training resumed where Day 1 left off, focusing on practical data manipulation in R and RStudio and introducing exploratory data analysis (EDA) concepts.
Homework Review

Participants were invited to present their homework.
Mona demonstrated joining two data frames using left, right, inner, and full joins, explaining the resulting row counts and column structures.
John attempted to share his screen but faced technical issues; the instructor proceeded with a live demonstration.
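The four joins presented in the homework can be sketched in base R with merge(). The data frames weight_a and weight_b, the key column id, and all values below are invented stand-ins, since the homework files themselves are not reproduced on this page.

```r
# Hypothetical stand-ins for the two homework tables; the real files differ.
weight_a <- data.frame(id = 1:4, weight_kg = c(60, 72, 55, 81))
weight_b <- data.frame(id = c(3, 4, 5), height_cm = c(160, 175, 168))

# Left join: keep every row of weight_a; unmatched weight_b columns become NA.
left_j <- merge(weight_a, weight_b, by = "id", all.x = TRUE)

# Right join: keep every row of weight_b.
right_j <- merge(weight_a, weight_b, by = "id", all.y = TRUE)

# Inner join: keep only the ids present in both tables.
inner_j <- merge(weight_a, weight_b, by = "id")

# Full join: keep all ids found in either table.
full_j <- merge(weight_a, weight_b, by = "id", all = TRUE)

# Row counts show how each join treats unmatched ids.
sapply(list(left = left_j, right = right_j, inner = inner_j, full = full_j), nrow)
```

With dplyr loaded, left_join(), right_join(), inner_join() and full_join() express the same four operations.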
Recap of Day 1 Topics

Installation of R and RStudio.
Understanding levels of measurement.
Importing data from CSV, Excel, SPSS, etc.
Setting the working directory and loading required packages.
Core Data‑Manipulation Techniques

Creating New Variables
Example: salaries$half_salary <- salaries$salary / 2 adds a seventh column.
Recoding Existing Variables
Creating a categorical salary band (low/high) based on the mean salary using ifelse.
Renaming Columns
rename(salaries, Sex = sex, Experience = years_of_service) returns a copy of the data frame with the new column names; assign the result back (salaries <- rename(...)) to keep the change.
Subsetting Rows & Columns
subset(salaries, rank == "Professor") extracts only professors.
select(salaries, -c(half_salary, salary_cut)) removes unwanted columns.
Merging Data Sets
Demonstrated left, right, inner, and full joins with merge() and dplyr verbs.
Exporting Results
write.csv(salaries, "salaries_clean.csv") saves the manipulated data for future use.
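The manipulation steps listed above can be strung together as one short script. This is a minimal sketch in base R on an invented four-row data frame: the column names follow this summary, the values are made up, and the temporary output path is an assumption so the example does not write into the working directory.

```r
# Toy stand-in for the salaries data (values are invented).
salaries <- data.frame(
  rank = c("Professor", "Assistant", "Professor", "Associate"),
  sex = c("Male", "Female", "Female", "Male"),
  years_of_service = c(20, 3, 15, 8),
  salary = c(140000, 80000, 120000, 95000)
)

# Create a new variable.
salaries$half_salary <- salaries$salary / 2

# Recode salary into two bands, using the mean as the threshold.
salaries$salary_cut <- ifelse(salaries$salary < mean(salaries$salary), "low", "high")
salaries$salary_cut <- as.factor(salaries$salary_cut)  # factor rather than character

# Rename columns (base R equivalent of dplyr::rename()).
names(salaries)[names(salaries) == "sex"] <- "Sex"
names(salaries)[names(salaries) == "years_of_service"] <- "Experience"

# Keep only professors in a separate data frame.
salaries_prof <- subset(salaries, rank == "Professor")

# Drop the two helper columns again.
salaries_clean <- subset(salaries, select = -c(half_salary, salary_cut))

# Export to a temporary file.
out_file <- file.path(tempdir(), "salaries_clean.csv")
write.csv(salaries_clean, out_file, row.names = FALSE)
```

Storing each subset under a new name (salaries_prof, salaries_clean) leaves the full data frame available for later steps.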
Transition to Exploratory Data Analysis (EDA)

Purpose of EDA: Summarize main characteristics of a data set using numerical and graphical methods before formal modeling.
Historical Note: Concept introduced by John Tukey in the 1970s.
Variable Types and Their Implications

Type	Examples	Typical Summaries
Quantitative – Discrete	Number of children, cigarettes per day	Frequency tables, bar charts
Quantitative – Continuous	Salary, height, weight	Mean, median, variance, histograms
Qualitative – Nominal	Gender, job category	Frequency tables, pie charts
Qualitative – Ordinal	Education level (low, medium, high)	Ordered bar charts, box plots
Numerical Summaries

Measures of Central Tendency: mean, median, mode (median preferred when outliers are present).
Measures of Spread: range, inter‑quartile range (IQR), variance, standard deviation.
Outlier Detection: Values beyond Q3 + 1.5*IQR or below Q1 - 1.5*IQR are flagged; the instructor illustrated this with the years_of_school and salary variables.
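The 1.5 × IQR rule can be checked by hand. The sketch below uses an invented salary vector (in thousands) with one deliberately large value; it is not the training data.

```r
# Invented salary values with one obviously large observation.
salary <- c(30, 32, 35, 36, 38, 40, 41, 43, 45, 120)

q1 <- quantile(salary, 0.25, names = FALSE)
q3 <- quantile(salary, 0.75, names = FALSE)
iqr <- q3 - q1  # same value as IQR(salary)

# Fences beyond which a value is flagged.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- salary[salary < lower_fence | salary > upper_fence]
outliers
```

boxplot() derives its whiskers from hinges rather than from quantile(), so the points it draws as outliers can differ slightly from this calculation on small samples.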
Graphical Summaries

Box Plots – Five‑number summary, visual outlier detection, skewness assessment.
Histograms – Show distribution shape; the salary variable displayed a right‑skew.
Bar Charts & Pie Charts – Used for categorical variables (gender, job category). Bar charts were preferred for readability.
Scatter Plots – Explore relationships between two quantitative variables; added regression line and confidence interval with geom_smooth().
Customization – Changing orientation (horizontal = TRUE), colors, titles, and axis labels.
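The plot types above can be sketched with base R graphics. The employee data frame here is simulated, the column names are assumptions, and the straight line added with abline() is a base R stand-in for the ggplot2 geom_smooth() call used in the session (it draws no confidence band).

```r
# Simulated data; 'employee' is a toy frame, not the training file.
set.seed(1)
employee <- data.frame(
  gender = factor(sample(c("Female", "Male"), 60, replace = TRUE)),
  education = sample(8:20, 60, replace = TRUE)
)
employee$salary <- 20000 + 2500 * employee$education + rexp(60, rate = 1 / 8000)

pdf(NULL)  # null device: the sketch runs without opening a plot window

# Horizontal box plot of salary by gender, with title, label and colour.
boxplot(salary ~ gender, data = employee, horizontal = TRUE,
        main = "Salary by gender", xlab = "Salary", col = "lightblue")

# Histogram to inspect the shape of the salary distribution.
h <- hist(employee$salary, main = "Salary distribution", xlab = "Salary")

# Bar chart of a categorical variable.
barplot(table(employee$gender), main = "Gender", ylab = "Count")

# Scatter plot with a fitted least-squares line.
plot(salary ~ education, data = employee, main = "Salary vs education")
fit <- lm(salary ~ education, data = employee)
abline(fit, col = "red")

dev.off()
```

In an interactive RStudio session, drop the pdf(NULL) and dev.off() lines so the plots appear in the Plots pane.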
Correlation and Relationship Assessment

Correlation coefficients (r) quantify linear relationships:
salary vs. beginning_salary: r = 0.88 (strong positive).
salary vs. years_of_school: r ≈ 0.66 (moderate).
salary vs. time_on_job: r ≈ 0.08 (weak).
Significance tested via p‑values; a p‑value below 2.2e‑16 (R reports such small values as "p-value < 2.2e-16") indicated a statistically significant relationship for salary vs. beginning salary.
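A correlation coefficient and its significance test can be obtained with cor() and cor.test(). The two vectors below are simulated so that they are strongly related; they do not reproduce the coefficients reported from the session.

```r
# Simulated data: current salary built to depend on beginning salary.
set.seed(42)
beginning_salary <- runif(100, min = 10000, max = 40000)
salary <- 1.8 * beginning_salary + rnorm(100, sd = 5000)

r <- cor(salary, beginning_salary)          # Pearson correlation coefficient
test <- cor.test(salary, beginning_salary)  # tests H0: true correlation is 0

r
test$p.value
```

Printing the test object itself also shows the t statistic, degrees of freedom and a 95% confidence interval for r.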
Practical Session Workflow

Install & Load Packages – tidyverse, readxl, ggplot2, dplyr, etc.
Set Working Directory – via RStudio menu or setwd().
Import Data – read.csv("employee.csv") → data frame employee (474 rows, 10 variables).
Inspect Data – head(), str(), summary().
Convert Character Columns to Factors – employee$gender <- as.factor(employee$gender).
Perform Summaries – numeric (summary(employee$salary)) and categorical (table(employee$gender)).
Create Visualisations – bar plot for gender, pie chart for job category, box plot of salary by job, scatter plot of salary vs. education.
Subset for Correlation Matrix – remove categorical columns, then cor(employee_continuous, use = "complete.obs").
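The inspection, factor-conversion and correlation-matrix steps of this workflow can be sketched as follows. A five-row toy data frame stands in for employee.csv, and its column names and values are assumptions made for illustration.

```r
# Toy stand-in for employee.csv (the real file has 474 rows and 10 variables).
employee <- data.frame(
  gender = c("Male", "Female", "Female", "Male", "Female"),
  job = c("Clerical", "Manager", "Clerical", "Custodial", "Clerical"),
  salary = c(27000, 57000, 21450, 30750, NA),
  education = c(12, 16, 12, 8, 15),
  stringsAsFactors = FALSE
)

str(employee)  # inspect column types

# Convert every character column to a factor in one step.
is_chr <- sapply(employee, is.character)
employee[is_chr] <- lapply(employee[is_chr], as.factor)

summary(employee$salary)  # numeric summary (reports the NA count)
table(employee$gender)    # frequency table for a categorical variable

# Keep only the numeric columns, then correlate using complete rows only.
employee_continuous <- employee[sapply(employee, is.numeric)]
cor_matrix <- cor(employee_continuous, use = "complete.obs")
cor_matrix
```

Selecting columns with is.numeric avoids listing the categorical columns by name, which helps when the data set has many variables.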
Assignment for the Next Session

Exercise 1: Using the employee data set, display the relationship between two quantitative variables (e.g., salary vs. education) with an appropriate plot.
Exercise 2: Show the relationship between two qualitative variables (e.g., gender vs. job category) using a bar chart or mosaic plot.
Exercise 3: Visualize a qualitative‑quantitative relationship (e.g., salary distribution across gender) with a box plot.
Closing Remarks

Participants were encouraged to practice the commands, review the YouTube recordings, and explore the Google‑Drive resources.
The next session will be led by Dr. Thomas Odong, covering advanced EDA techniques and statistical modeling.
Attendance was automatically recorded via Zoom; no additional registration required.
Key Takeaways

Mastery of basic data‑manipulation (creating, recoding, renaming, subsetting, merging) is essential before any statistical analysis.
Understanding variable types guides the choice of descriptive statistics and visualisations.
EDA provides critical insight into data quality (outliers, skewness) and informs the selection of appropriate analytical models.
Effective data manipulation in R sets the foundation for robust exploratory analysis; by correctly handling variables, detecting outliers, and visualizing patterns, researchers can make informed decisions about the statistical techniques that best suit their data.
Full Transcript YouTube

good morning everyone i welcome you to
day two
um
from where i resumed yesterday we are
going to we are going to continue from
there hope you had
a great night
i'm just wondering if there are some
people who managed to do the homework
such that i will give them the first
five minutes even if we get one or two
people we give you
each two two to three minutes such that
you present the work that i left you can
raise your hand and then uh
amit will help us to give you the right
to join the panelists
uh groups that you're able to present
amit are you there good morning
i mean yeah it's okay so can you select
anyone who you want to present yeah yeah
we can get uh the first one heil because
i saw he has put his hand already and
then we can have
the second one
okay
their hands are already up
i think they can go ahead
okay
finally please
please share with us what you managed to
do with the homework and then now we get
started
hello good morning
good morning
uh this is my first class just to attend
the training so i don't take any exam at
home
this is my first time that's why i just
um to leave the message
the first session
okay fine
okay
it's okay
um
then uh
oluwapole you managed to do the homework
hello doctor good morning yes good
morning
can i share my screen
it's all right please go ahead
i don't have the share screen option here
what options do you have on your screen
that you have uh a green screen which ha
sorry a green icon that shows share
screen
below
as you check on on your computer down
there
tell me the different icons you're
seeing this we have a chat q and a
then share screen
okay
hello
yes we can hear you go ahead please
can you see my screen
yes
[Music]
i
i installed
my packages
and i
loaded all the libraries
then
i
set my directory
and
hello can you hear me
yes we can please continue
okay
uh
after that i i i
i set my directory
uh
manually
by session
uh
to directory and and
i can find the
please
after that
i go to
to view my two files weight a and
weight b
and set it
as
df1
by command df1
<- read.csv
for weight a and weight b
after that
i
do the structure for df1
and find it uh
11 rows and five column
with five with five variables
and
weight b
uh with uh
14 rows with three variables
after that
do the left join
and the left join uh keeps
everything in table a and
adds the other columns
from table
uh weight b
and ignores
rows 13 and 14
after that
i did the right join
the right join
keeps all the elements in weight b
and what is missing
is written as n a
and the inner
join
uh the inner join uh
merges all common rows in a and b
this is a and this
b when i find it
after that
i did the full join
between the two
uh
everything in the uh between the two
data sets
uh a full join
uh merges everything between
the two tables
but i don't know this is correct doctor
or not
i tried
thank you so much mona for the great
work he has what you
is it right
okay thank you
thank you so much i like the spirit of
trying and just to show you whatever i
tried is very correct i'm glad you
understood the commands and the way
you're articulating how we are managing
the different data sets
um
i think that is great
so
we are going to
since we've had mona to present to us
what she has presented i highly believe
that we've been we've been all following
how to merge using the left join using
the
the right join the inner and then the
full join the way she was um
elaborating can you get only one more
person maybe there's something you want
to add and then we'll proceed
alwa pain
i don't know if i'm pronouncing the name
well please unmute and share and then we
start
if i pro i've not pronounced your name
well you can help me to pronounce it
right
okay
[Music]
my name
is john
work well
i'm actually using guys i'm using my
phone and i'm practicing with my laptop
i'm trying to share the screen on my
laptop
but um i did exactly what i did
[Music]
are you sharing
i'm finding a difficulty
with this kind of person
i can't hear very well
are you sharing your screen
no
okay let's proceed now what we are going
to do today yesterday when uh just uh a
recap of what we did um
we looked at how to download and then
install r and rstudio on our computers we
went ahead to learn about the different
levels of measurements then uh importing
data into r depending on how
the data has been saved and we looked
at the different ways of how that data
can be imported into r and we also had
a comment that says that we
can work with other software in case
you've got data that is saved maybe in
stata or spss or resource you can
still import that data into r
so yesterday we closed when now we had
looked at the theory bit and we wanted
to go through data manipulation um
practically using using r so it's a bit
that i'm going to recap on it's a bit
i'm going to go through and then
professor suzanne will come into to give
us what we are supposed to have for day
two so i'm going to share my screen you
can also do the same where you are
open your rstudio and then
ensure that uh we are we're still using
the same data set the data set we had
yesterday for work salaries it's the
same data set we're going to use and
try to do what happens in data
manipulation the things that you want to
do if you if you check in your
powerpoint the one that we were using
yesterday um and i clearly remember we
talked about that within data
manipulation you can do a lot of things
and uh
but then but they everything depends on
where exactly you want to shift at the
end of the day for example we talked
about creating new or adding a variable
to a data set
recoding of an existing variable
renaming a column maybe you want to
delete a column then the levels of
measurements and merging data i'm glad
that mona has been able to explain
merging data very well so i'm not going
i will not go through that
so what was always
the moment you've got your your package
for example like yesterday we installed
our packages so what we are going to do
today we do not need to install them
again we are going to just load the
packages that we need so that we
continue with the manipulation
i've highlighted all my packages and
then i ran them because remember these
are the packages that have got in-built
functions that are supporting us to
to to to write our commands very well
and do the manipulation that we want so
in case i load my libraries please when
you look in the console there is
a stop button that red button comes the
moment you're running a package and the
moment uh it is successful the red
button will disappear so we looked at
that yesterday i'm loading my package
with the work
again yesterday we looked at how to set
the working directory using various
options and it also depends on where
your folder is saved on your computer so
i don't change my folder i'm only going
to run the set directory line such
that now i don't need to go through the
process i had since i saved that command
all i need is to run it and the moment
it agrees in the
in the console then it implies that my
path is correct
okay
remember i'm using the salaries.csv
so i
use the read.csv command i labeled my data
set salaries please also do the same
such that we move on the same page so
when i come there and i check in the
when you when you check in the
environment the global environment i can
see that my data set has been
successfully imported salaries with 397
observations and six
variables all these are commands that
are in the salary r script we looked
at them yesterday so i'm going to scroll
down
and uh
i want to the part of the structure the
categories that we talked about they're
really very important
so i'm going to run my lines telling r
that
rank is a category discipline and then sex
and i can also check the structure of my
data set all the things we looked at
them yesterday and now i can confirm
that yes the way r stored my data it
is the way i want it to look like
sometimes you might be interested to see
the summary of your data set in uh
everything together where you get in
case of uh continuous variables you'll
have a summary statistics for minimum to
the maximum and then the categorical
ones they will give you uh the
frequencies so now i'm going to move on
to data manipulation and go through in
case uh what happens in case you've got
a data set and you want to manipulate it
according to the objectives of your
study so let me check in the chat
please import the data and then have
you managed to import that data
successfully in r
import the data in r and then
type for me yes if you're done
we need to work together as i work you
also do the same
okay
um
[Music]
okay i'm saying people are telling me in
the chat that have
been able to import their data if you've
imported excel there's no problem it's
the same thing
but remember with the the one of excel
um
[Music]
it will give you a name called work
salaries by default it's what it will
save for you so where there is salaries
you can change it to work salaries or
you can rename like the way we we are
going to see so the first one uh we say
that uh we can create new or add
variables sometimes you can have a data
set but within that data set you can
create another variable a new variable
remember our data set is salaries
assuming i want to create a new variable
called half salary
so at the back of my mind i need to
ask how do i get half
it implies that i need to go to my data set
and then consider the column which has
got salary okay and i divide everything
by two so it is what i did on this
particular line that you're seeing half
salary is my new variable and i'm
starting with salaries because i want
this column the new column i'm creating
or the new variable i'm creating to be
part of the the main data set
so it is salaries dollar sign half
salary and i'm assigning it remember we
looked at how to assign a variable
yesterday or in this particular case an
object named half salary so i go to my
data set salaries and i pick only the
salary variable and i divide
everything by two so it implies that now
i'm going to come up with a new variable
so in case i run this particular line
if you look in the console it has been
it has successfully run and when you
look in the environment i'm seeing i'm
seeing something new that instead of now
having only six variables i've gotten
seven variables in case you want to view
you can come and click on there on the
excel icon when you click on it and
you move you're going to see that now
you've got an
another column they are no longer six
variables but now you're having seven
variables with the seventh being half
salary so it went ahead and took this
particular column for salaries and
divided by by two
of course it could be other maybe other
complex uh variable that you want to
create it doesn't matter but the point
is getting the
basics
first
number one creating new or adding a
variable in the existing
existing data sets
now number two about recoding an
existing variable
okay what we are in this particular case
i still wanted uh remember what i said
that depending on
depending on the objective that you want
to achieve by the end of the day so how
how do i recode an existing variable
for example in case at the end of the
day i want to look at people who belong
to who can who belong to a low category
and then high category of salaries so in
this particular case remember salary is
a numerical variable and i got its
summary statistics you can as well run
only a particular variable from the main
data set in this case it's salaries
dollar sign salary when you run that the
results that you're going to get down
here in the console you only see now the
salary variable you don't see other
variables but in case you had put
summary in bracket salaries
it will bring you a summary of the
full data set but in the
in this one here what we did is uh we
got um
only one variable okay and i am given
since it's a a numerical value i'm
given a
summary statistics with the minimum the
first quartile median mean the third and
then the maximum salary
so from here i said that um can i use my
mean as a threshold such that i'm able
to categorize those who gets in
and i'm assuming that in case somebody
gets below the mean that is a low salary
if somebody gets above that mean that is
a high salary so it is then
so how do i do that i further go ahead
and create another variable
which i termed as salary cut
remember my main data set is salaries
but in case i want to add another
variable within the same data set i need
to add it there so i'm saying that
within my main dataset salaries add for
me another variable
which is called
this is what i use okay because i want
to create two options um
those that belong to the low salary and
those that have got high salaries like
we mentioned yesterday you can use the
help command to support you to
understand what a particular command
means to give you more guidance on how
you can go about it
so i come in here what i start with is
the test that i want
[Music]
salaries and pick for me all
all whoever has a whoever has a salary
that is less than that mean which is
one one three seven zero six and
categorize those as low this is the
first bit okay label them as low so and
since it's if else so it is a logic it
is the
other and
then label the other people high so this
is the high bit we have got two groups
the low and then the high those are
below the mean they are going to be
labeled low and those who have who have
above the mean they will be labeled high
remember
sometimes maybe for example as we
[Music]
you have a binary output a yes or no and
it's the same case that can work here
where you have low or low or high there
are two outputs so in this particular
case a by uh a binary logistic
regression can work so in case i run
this particular line i expect my number
variables to increase so if i run that
line
uh when you're checking the environment
the variables are increasing from seven
to eight and when i check in my console
it is telling me that it has
successfully run
i can be able to check i can be able to
view i can click on the excel icon and i
check whether the salary cut variable is
now part of my main data set
when you look here we are having another
salary cut and it has got two out two
levels it has got the high and then the
low so in this particular case in case
you want to get for example a summary
you're going to come up with the number
of those who belong to the high category
and the number of those who belong to
the low category
okay
that is great so that was recoding
an existing variable so it implies that
uh from one data set initially whereby
it was just a continuous variable you would
use uh uh multiple linear regression but
the moment you recode it into a
categorical then it implies that the
method of analysis also has to change
okay let me check in the chat are we on
the same page
are we on the same page
okay
in case you have a question um kindly
share it in the q and a
and then you'll be you'll be helped so
now let's proceed what about in case you
want to rename a variable okay renaming
columns remember initially when you look
at when you check the structure
okay i can check the structure of
salaries usually what we know about the
it gives us the different variables in
your data set so this time around when
you come to the structure what we are
seeing
um
it is giving us uh we have three nine seven
observations and eight variables and
right now i guess we understand why we
are having
eight variables so our data set has increased
however when you check again on the
salary cut for us we are saying that it
is it is categorical in nature but r is
seeing it as a character so in case
maybe um your objective is to use salary
cut as the dependent variable you still
have to go and use as factor such that
r changes the way it is seeing the
salary cut into a categorical value all
right so now these are the variables
that we are having and we want to rename
maybe you want to rename
the variables we can change maybe you
want to rename you want to rename sex or
you want to rename years of service and
you put experience or you want to to
rename discipline and you put maybe area
of specialization all the things can
still apply the concept remains the same
so we use what we call the rename
command
okay we use the rename command but
remember everything emanates from the
salaries data set so what we do using
the rename command and then inside
rename you start with your data set which
is salaries
[Music]
for example in case i wanted to change
uh sex
from uh which is in small
uh small case to uppercase then i could
write what i want that change this
where there is s e x change it to
capital s e x or somebody else can
say that no it's still within my data
set salaries
i can go ahead and say rename rename
is the command
[Music]
side if you see this
with the data okay you can still um
check and see how do i use rename you
start with the data set then you say
salaries okay salaries is uh is my data
set and
maybe you can say um you want to rename
let's say
you want to instead of years of service
you want to put experience
okay and you say where you rename where
there is years of service
you put experience so that's what you
can do
so in case you run the first line
you run and check you see that it has
successfully run so i expect in case i
check the structure again whereby my sex
with with a small case has changed to
sex with an upper case and then in case
i rerun my second line i expect where
there is years of service to have
changed into experience okay i can check
the column names by using the colnames
of the data set because i want to check
has it changed you can as well move to
the main data set by viewing if you can
see very
well here we had years of service but
now it has changed to experience and
then where we had sex with small case
now it is an upper case in case you
don't want to do that you can go ahead
and use colnames colnames implies
that you want to see the names of your
data set salaries and that is the
command that
we can use so in case you run this
particular line then what you're going
to see that okay now i have uh
experience what i've renamed and then
i've also renamed the sex with that the
upper case so that is another way on how
you can manipulate your data maybe you
want to replace the variables to names
that you feel that
you can quickly use throughout
the analysis whereby you are not you can
you don't forget them
okay great
now let's move on to another another way
on how you can manipulate data you can
do sub you can subset rows or columns
all right but uh in this particular case
i'm going to use uh i'm going to use the
rank variable remember initially when we
took the summary of the the whole data
set we were able to when we came to
categorical values we are able to see
how many number the number of uh um the
number of respondents that belong in
every level within our categorical
variables so with the subset i want to
use the rank and see maybe um within
this particular rank like we saw
yesterday we had
we had an assistant professor an
associate professor and professors like
the way we are saying that as we are
discussing try to shift what we are
learning to to make to to your data set
so what is happening here assuming that
within my particular data set i want to
work with with professors only maybe
i'll look at other other levels later so
what we do we can use the subset the
subset command the way you hear the word
subset you're trying to pick something
out of the main the main data set what
do you want to subset so we start
with that we start with the salaries
our data set within the arguments what
we start with and then in this in the
argument you can include what you want
to subset now so you tell r that go
under the rank variable and pick for me
only professors implying that now in
case i view my data set again i don't
want to see other levels or attributes
that we talked about i only want to see
only one category of people i still say
this one can work maybe if you're having
uh
if you're having a certain crop that
you're working on and you've gotten
various crops that you want to study but
in case you want to study the one crop
in particular you can subset it from the
rest of the other crops and then do the
analysis and then later maybe you can
study how the rest are related to to
what you've studied
so remember um for you to see the
numbers or the frequency for a
categorical value we use what we call
that table command okay apart from the
summary remember there are various ways
that you can use so you can use table
and then you run this line
if you check in the console table
salaries salary is my main data set but
i specifically want to pick rank
table only works with categorical
variables this one can also work
with um
it can also work with the
discipline where we've got humanities
and then sciences and then it can also
work with the sex where we have female
and then males but under here i say that
when you look in my console i've got 64
associate professors 67 assistant and
then 266
professors and then i say that now i
only want to work with this uh with only
this group of uh only this group of uh
of of rank that belong to professor so
what i do i said we use the subset
command but now this time around since
you're getting a portion or a fraction
of the data set from the main data
set you need to rename your data set so
what i say now i'm going to rename
instead of salaries now i name it
salaries prof
okay implying that the
the data set now i'm going to get
only contains professors and what do i
expect for me to confirm that i'm right
i expect to come up with
266 observation in case it's more than
that then it implies that there is a
problem with my data set so in case i
run this particular line here
okay if i run it i check and see that
now i've got a new data set called
salaries prof
okay and under here it is true that i'm
having 266 observation of my eight
variables you can click on the excel
icon and this time around and check when
you look in this particular column it is
only the ranks that i'm seeing i'm not
seeing any other any other rank so it
implies that
now people have this as my other data
set which i can want which i can work
with
okay let me check in the chat
um are we on the same page
in case you missed yesterday please
refer to the
refer to the youtube link that was
sent check the youtube that we used on
how to install and download all those
bits are there we are not going to go
back today
okay great
all right so now that is uh one way on
how you can you can subset or get a
fraction
of your data set and you'll be able to
see that
okay what about in case you're having a
data set after manipulation you want to
delete some bits you want to delete
certain certain columns for example in
this particular case maybe uh
the new columns that you have we have
half salary we have salary cut and you
want to delete them such that you go
back to the original you go back to the
original data set still we use the
subset we still use the subset command
which i'm having here
remember what do we start with first
it's our data set you put in the name of
the data set and then you use select
select implying that from the main data
set remove for me this particular
variable okay select is equal to minus
the minus you're seeing in front of c it
is remove minus from the main data set
minus for me this particular variable so
inside here inside the c you're minusing
the variable that you want to be deleted
it could be one variable or you can have
multiple variables more than two or uh
or three and so on so in case you're
having more variables other than one
like in this particular case what we did
we said that you remove half salary
so in case i run this line
and i check back again
you see that now my variables have
reduced in the
environment i can still view and see
whether
that one has been deleted still you can
check that now
salary um
[Music]
you can now check that uh half salary
variable
it is
meaning it has been deleted you can as
well see in case in this second line i'm
still doing the same thing but what i'm
doing i'm i'm i'm trying to show you
that you can delete more than one
variable so now i'm telling r delete for
me salary and also salary category
i'm no longer interested in them so i
come and i run this line and i check
again i see variables have now reduced
to five i can as well view i see rank
discipline years since phd and then that
okay that is how you can work with a
subsetting your data set but everything
goes back to
everything goes back the objective
that you said from the start
the other one is level of measurements
which we discussed yesterday we talked
about how to deal with missing data and
then mona has discussed merging a data
set i'm not going to go through that
because uh it has been already discussed
so now the last bit on descriptive
statistics which is also uh which is
also important and then professor susan
is going to start
i want to read
my data to do some
descriptive statistics and then we'll go on
because why i'm reading back remember um
up i've deleted out some of some of the
some of the columns yet uh i might i
need them to see how do they work in a
descriptive statistics so in case i read
my table again um i can check that now
getting my original data sets that the
one that i had
For descriptive statistics, we have already seen that R has got inbuilt commands. With the summary command we are able to get everything at once, but you can also run particular bits if you are interested in them, for example if you want only the mean. Here I am telling R to give me the minimum of salary. Remember, salaries is my data set, and salary is the column that I have called for. I am also asking for the maximum and the mean, putting everything on one line and separating the commands with semicolons. Then I can also get the standard deviation, and the same thing can apply to other variables. So if I run that line it will give me four outputs: the first one is the minimum, the second the maximum, then the mean, and then the standard deviation.
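As a rough sketch of the one-line summary just described, assuming a tiny made-up `salaries` data frame (the values are invented for illustration):

```r
# Hypothetical salaries; the session used the full data set
salaries <- data.frame(salary = c(60000, 75000, 90000, 120000))

# Several commands can share one line when separated by semicolons
min(salaries$salary); max(salaries$salary); mean(salaries$salary); sd(salaries$salary)

# summary() returns minimum, quartiles, median, mean and maximum at once
summary(salaries$salary)
```

Each command on the semicolon-separated line prints its own output, in the order written.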
Right, so I have already mentioned that for a categorical variable we use table. You call the table command, and within it you call the variable that you want, and it gives you the frequencies. For example, here I have put table on salaries: I want it for the rank, for the discipline, and then for the gender. If I run that, R gives me the output. For discipline I come up with A and B, the humanities and the sciences, and for sex I see 39 females and 358 males. Remember, the table command works only for categorical variables.
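A minimal sketch of the frequency tables described above, assuming a small invented data frame with `rank` and `sex` columns (the real data has many more rows):

```r
# Hypothetical rows standing in for the salaries data
salaries <- data.frame(
  rank = c("Prof", "Prof", "AsstProf", "AssocProf", "Prof"),
  sex  = c("Male", "Female", "Male", "Male", "Male")
)

# table() counts how many times each category occurs
rank_counts <- table(salaries$rank)
sex_counts  <- table(salaries$sex)

rank_counts
sex_counts
```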
Moving on: after getting the table, sometimes it is important to quote the frequencies, especially if you are going to report your data and you want to publish. I have already run table on the rank column of salaries. However, I can store the whole of this information in an object, which I am calling ranks. Remember what we said about assigning a variable. The object name can be anything; it doesn't matter. So I write ranks, then the assignment arrow (less-than, dash), then the table command. If I run that line, what I see in my console is only the command itself, ranks <- table and so on. I am not seeing the result. To see the results I need to run the object itself, and the moment I run it I am able to see them.
I can further go ahead and get the proportions by using the prop.table command, and inside it I put my object. When I run that line I am given the fractions: for associate professor it is 0.16, for assistant professor 0.168, and for professor about 0.67.
Sorry about the sound; can you hear me now? Okay, let's continue.
You can round off by using the round command. When you look at where my red pointer is, inside the brackets you start with what you want to round off, then a comma, then digits = 0. The number you put after digits is the number of decimal places that you want. So if I run this line, I am now getting the percentages, rounded off. Okay, so we can have that.
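The store, proportion and round steps can be sketched like this. The rank counts (64, 67, 266) are the ones quoted in the session; rebuilding the column with `rep()` and the object name `fractions` are assumptions made only so the example is self-contained.

```r
# Rebuild a rank column from the frequencies quoted in the session
rank  <- rep(c("AssocProf", "AsstProf", "Prof"), times = c(64, 67, 266))

ranks <- table(rank)   # assignment alone prints nothing
ranks                  # run the object itself to see the counts

fractions <- prop.table(ranks)       # each count divided by the total
round(fractions, digits = 2)         # two decimal places
round(100 * fractions, digits = 0)   # whole-number percentages
```

Multiplying by 100 before rounding gives percentages; rounding the raw fractions to zero digits would collapse them to 0 or 1.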
The other part now is data visualization. What we can do still depends on the nature of the variable. If it is categorical, like we mentioned earlier, you can use a bar plot or a pie chart. If it is numerical or continuous in nature, you can use a histogram or a scatter plot. Prof is going to come in with more details on exploratory data analysis.
So here, remember I had this before when I was calculating the frequencies with the table command. Inside the table command I ask for salaries, and from the salaries data set I pick rank only. I saw earlier that the frequencies were 64 for associate professor, 67 for assistant professor, and 266 for professor. So I want to draw a bar plot and see how it looks. Remember what we said yesterday: you check everything in the plot window, and you can zoom out.
Are you seeing my plot now? Okay.
When you zoom out, this is what we are able to see. These are frequencies; remember I have not plotted percentages, but percentages are also possible. We are now able to view the rank categories graphically within R. You can as well go and label each bar with its frequency, or with the percentage if that is what you have used. Okay, so I am going to share my R again and then continue.
Sometimes, in case you are having many plots in the plot window, you can clear it by typing dev.off() with nothing inside the brackets. Remember that in the script that you have I put some comments, so you can always read them; there I say that this command removes the current figure from the plot window. If I run that, the figure will be cleaned up and you will not see it anymore.
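A small sketch of the bar plot, bar labels and `dev.off()` steps described above. The counts are the ones quoted in the session; the `pdf(NULL)` line is an assumption added so the example runs without a plot window, and can be left out in RStudio.

```r
rank  <- rep(c("AssocProf", "AsstProf", "Prof"), times = c(64, 67, 266))
ranks <- table(rank)

pdf(NULL)  # assumption: null device so the sketch runs anywhere; omit in RStudio

# barplot() returns the x position of each bar, useful for labelling
mids <- barplot(ranks, ylab = "Frequency")

# Write each frequency just below the top of its bar
text(x = mids, y = as.vector(ranks), labels = as.vector(ranks), pos = 1)

# The same plot in percentages instead of counts
barplot(100 * prop.table(ranks), ylab = "Percent")

dev.off()  # closes the current graphics device, clearing the figure
```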
Of course, when you are doing data visualization there are a lot of things that you can do about how you want your plot to look. Do you want it horizontal, or vertical like the one that we had previously? In this particular command I use the same plot that I had up here, but I put in another argument, horiz, and I say it is equal to TRUE, implying that this time around I want a figure that is horizontal. Initially we had the one that was vertical. So if I run that line and zoom out, let me share what we are getting.
You will be able to see that it is the same figure, only the direction has changed. If you change the structure to horizontal, this is how your graphic is going to look. It is still the same information; what you have changed is just the axes, so you see the rank on your y-axis and the frequency on your x-axis. You can still go ahead and change the colors for each bar, and you can put in a lot of other arguments, labeling the axes, putting in the title and so on. All those things can be done in R.
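The horizontal variant with colors, a title and axis labels might look like the sketch below. The colors and label text are arbitrary choices for illustration, and `pdf(NULL)` is again an assumption so the code runs without a plot window.

```r
rank  <- rep(c("AssocProf", "AsstProf", "Prof"), times = c(64, 67, 266))
ranks <- table(rank)

pdf(NULL)  # assumption: null device; omit in RStudio to see the plot

# horiz = TRUE swaps the axes: categories on the y-axis, counts on the x-axis
bar_pos <- barplot(ranks,
                   horiz = TRUE,
                   col   = c("grey70", "steelblue", "tomato"),  # one color per bar
                   main  = "Staff by rank",                     # plot title
                   xlab  = "Frequency",
                   ylab  = "Rank")

dev.off()
```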
The point is getting the basics; the rest we just need to build on. You can as well do pie charts and you will come up with the same thing. Lastly, for the case of a continuous variable like salary, we use the plot command which we have there. I am plotting salary because it is a continuous variable, and we are going to come up with a scatter plot. You can run that, and always remember to zoom and check how your figure looks. So if I share that, what you are seeing now is a scatter plot for salary. We are seeing how the salary values are spread all over the plot. Of course you can come up with different conclusions, or you may want to fit a regression on it and see whether it is giving you the desired results.
So let me go to the last bit, back in RStudio. If I scroll down, we said that within your command you can put in other arguments. For instance main means title: you say main equal to the title that you want, and you can even color the title in case you are interested. You can as well run a histogram in the case of a continuous variable. In this particular case the command is hist, h-i-s-t, and then in brackets you put what you want to draw the histogram for. Here you are saying go into my salaries data set and plot for me salary. The moment you run it you will be able to see the plot for the histogram.
So if I stop sharing that and share what we have come up with, this is how you are going to see the histogram for salary if you don't want a scatter plot, and you see where the data is mostly concentrated. So I can stop sharing that and share my RStudio again.
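A hedged sketch of the `plot()` and `hist()` calls for a continuous variable. The salary values are simulated purely for illustration, the titles are made up, and `pdf(NULL)` is an assumption so the code runs without a plot window. Note that `plot()` on a single numeric vector draws each value against its row position.

```r
set.seed(1)  # simulated salaries, for illustration only
salary <- round(rnorm(100, mean = 100000, sd = 20000))

pdf(NULL)  # assumption: null device; omit in RStudio to see the plots

# Scatter of values against row order; main sets the title, col.main its color
plot(salary, main = "Salary by row order", col.main = "blue", ylab = "Salary")

# Histogram shows where the values are concentrated
h <- hist(salary, main = "Distribution of salary", xlab = "Salary")

dev.off()

sum(h$counts)  # every observation falls in exactly one bin
```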
So basically I am coming to the end now, given that we have looked at the starting bits of descriptive statistics and we have done a lot of data manipulation. Sometimes, as you are manipulating your data, you want to retain the new variables that you have created and save them in a new file that Excel can read, so that you can use it again. In that case we use what we call the write.csv command. What does it do? The write.csv command exports data from R to a CSV file that Excel can open. Initially, when we were importing, we were importing from Excel to R. Now, after data manipulation, you have created some variables which you have to maintain, and you want to return this data to Excel, so we are exporting from R using write.csv.
In write.csv you start with the name of the data set. After the name of the data set you go on to the file name, what you want to call it. Here you can say I want to call it salaries.csv. After running this particular line, you go back to the folder that you had set as your working directory and check whether the file salaries.csv is there. That is the new data set that you have created. In other words, you have been able to export data from R to Excel.
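The export step can be sketched as below. The data frame is invented, and writing to a temporary folder is an assumption made so the example does not touch the working directory; in the session the file was written to the working directory as `salaries.csv`. The `row.names = FALSE` argument is an optional extra, not something shown in the session.

```r
# Hypothetical data to export
salaries <- data.frame(rank = c("Prof", "AsstProf"), salary = c(120000, 80000))

# Assumption: a temporary folder is used here instead of the working directory
out_file <- file.path(tempdir(), "salaries.csv")

# Data set first, then the file name; row.names = FALSE skips the row-number column
write.csv(salaries, out_file, row.names = FALSE)

file.exists(out_file)         # confirm the file was created
check <- read.csv(out_file)   # read it back to confirm the contents
check
```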
Okay, I think that marks the end of the bit that I wanted to cover. The other bits, Ctrl+L and the rest, you can follow in the script.
Any question?
Could you show us how to add the equation, the regression line?
We are going to see that, so you just need to be a little patient. I am going to give you a break of about ten minutes and then Professor Susan will take over.
Somebody is asking that we please share the R script. Okay. And somebody is asking for a recap on the plots. More details on that particular bit are going to be covered by Professor Susan under exploratory data analysis, so please just hold on. After ten minutes she is going to join in and you proceed from there. Thank you.
Good morning participants. Can you hear me? Somebody is asking what happened to the sound. Can you hear me? Okay, let's start. I hope you have had a good break.
I would like to thank my colleague Helen for having taken you through the R programming language. We are going to go into exploratory data analysis, and after this my colleague Thomas Odong will come in, today if time allows, if not tomorrow. Some of the things that we are supposed to cover here she has already done practically, so when it comes to the practical part we will be going through them briefly, just recapping.
Okay, so we are going to look at exploratory data analysis. Exploratory data analysis is aimed at summarizing the main characteristics of a data set using either numerical or graphical methods. It is a critical step in analyzing data, because you are then able to see what is happening in your data before you do formal modeling and hypothesis testing. Exploratory data analysis was promoted by John Tukey in the 1970s, maybe before most of you were born. It was meant to encourage statisticians to explore the data and possibly formulate hypotheses that can lead to new data collection and experiments. So whenever you have data, whether you have collected it from the field or you have collected secondary data from any NSOs, you have to do your exploratory data analysis to see whether there are trends, to see the magnitude, and to see differences between groups. That is exploratory data analysis.
The importance of exploratory data analysis is to maximize insight into our data set by uncovering its underlying structure. In this underlying structure we are able to see whether there is a relationship between variables, and the type of relationship: is it linear, is it curvilinear, is it non-linear, et cetera. It is only through exploratory data analysis that you are able to view these types of relationship.
We are able to extract important variables, create new ones, and also do data transformations. When you plot a scatter diagram you are able to know whether there is a linear relationship between two variables. When you construct a histogram you are able to know the distribution of the variable, and if it is non-normal you can decide whether or not you are going to do data transformations. Through exploratory data analysis we are also able to know whether our variables are highly correlated, and then we may decide to create new variables using multivariate data analysis, for example principal component analysis. By doing that we are reducing the correlation in the original data set and creating new variables.
It is also through exploratory data analysis that we are able to detect outliers and mistakes. Outliers are values which are different from the rest of the data set. Those values can either be mistakes or genuine outliers. Outliers affect the estimates of our parameters; for example, the estimate of the mean is highly affected by outliers. Suppose we are looking at the income of people in our country. If a few people earn far less than the rest, the average income is pulled toward the lower end, so it will appear that most people in the country are earning very little, yet it is only the few outliers that have affected the average value of income. You are also able to find your mistakes: maybe you have entered your data wrongly, and you can identify those mistakes and correct them before you do the formal analysis.
We are able to test the underlying assumptions on which all our statistical inferences are based. For example, we have already talked about the histogram: when we construct a histogram we are able to know whether the data is normally distributed or not, and in most statistical inference we assume that the data is normally distributed. We are also able to know whether we have increasing variance or not, using plots that we will look at when we are talking about regression and experimental design.
We are able to support the selection of appropriate statistical tools and techniques. Suppose you had originally said you were going to do analysis of variance, and then through exploratory data analysis you find that the relationship between the response and the categorical variable you have is actually a linear relationship. It does not require you to run an ANOVA under experimental design, so you end up running a regression. That is what exploratory data analysis does.
Okay. So we find that the type of exploratory data analysis depends on two things. We have the method of summarizing data: either you are going to use non-graphical or graphical methods. Then we have the number of variables: are you looking at only one variable at a time, which we call univariate, or are you looking at many variables, which we call multivariate? For example, if height is what you want to look at, the height of various plants, to find out whether there are differences, then you are looking at one variable at a time. How about if you have very many variables: height, collar diameter, number of leaves and several others, and you want to look at all of them at a go? Then we use the multivariate type of summary statistics.
Under these two, the non-graphical and graphical methods, there are various types of summary statistics or exploratory data analysis that we can use. Under univariate non-graphical we have what we call measures of location, which are used to locate the average, the median and the mode. Then we have measures of spread, which tell us how far the observed values are from the mean; these include the range, the variance, the standard deviation and the interquartile range. We can also summarize our data using frequency tables, which tell us how many groups you have in your data and their frequencies. Under univariate you can also use graphical methods, which we are going to see in detail: a box plot, a histogram, a stem-and-leaf plot, a bar graph or a pie chart. All those depend on the objective of analyzing the data.
Then under multivariate we can do cross tabulation between a qualitative variable and a quantitative variable, or between two qualitative variables, which we are going to see in detail. We can compute a covariance or correlation matrix. Under the graphical section we can construct a scatter plot, which we have already heard of, and with it we are able to establish relationships. We can construct a bar graph, where we are able to know which particular category has the highest frequency. We can construct a histogram, which helps us to know the distribution of the different variables. And a box plot tells us the distribution as well as which group or variable has the highest median. Okay.
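To make the multivariate non-graphical summaries concrete, here is a minimal sketch on an invented plant data set. The column names and values are hypothetical, and `tapply()` is used as one possible way to summarize a quantitative variable within the groups of a qualitative one.

```r
# Hypothetical plant measurements, invented for illustration
plants <- data.frame(
  variety = c("A", "A", "B", "B", "B", "A"),
  site    = c("North", "South", "North", "South", "North", "North"),
  height  = c(30, 35, 42, 45, 40, 33),
  leaves  = c(10, 12, 15, 17, 14, 11)
)

# Cross tabulation of two qualitative variables
xt <- table(plants$variety, plants$site)
xt

# A quantitative variable summarized within each group of a qualitative one
mean_height <- tapply(plants$height, plants$variety, mean)
mean_height

# Covariance and correlation matrices for the quantitative variables
cov(plants[, c("height", "leaves")])
cor_mat <- cor(plants[, c("height", "leaves")])
cor_mat
```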
So we are going to look at this in detail. The type of variable also determines the choice of summary statistics. There are two major types of variables: qualitative and quantitative. The quantitative variables are also divided into two: discrete and continuous. If measurements are integers obtained by counting, then those measurements are called discrete. For example: number of people in a household, number of cigarettes smoked per day, number of participants in this course, number of trees, number of plants. All those are discrete in nature. Continuous variables consist of measurements that can fall anywhere within a range of values and are obtained by measuring, for example weight, height and length. So when you are writing your proposal you are able to know the type of variables you are going to collect, and you are able to associate them with the type of analysis that is likely to be performed when you collect your data.
Okay, then we have the qualitative variables. Qualitative variables describe characteristics or attributes of an individual, and they can be categorized as either ordinal or nominal. Nominal variables have to do with some quality, characteristic or attribute, but the categories do not have a natural ordering. For example, color. If we look at hair color, some people have brown hair, others have red hair, others have black hair, maybe some have green hair depending on what people like, but the order does not matter. Vehicle type is another: there are various types of vehicle, but if you have a Ford and a Toyota, and you put Ford first and Toyota second, the order does not matter. So when you allocate names to characteristics or attributes of a variable, we end up with a nominal variable.
When the names are related to some natural ordering, we call those ordinal variables. For example height: short, average, tall. We can never call a short person tall, because the height of that person falls in the short category. If we look at taste, and you are given two types of biscuits, one sour and another sweet, there is no way you are going to call the sour biscuit sweet. Because of your taste buds you automatically call one a sweet biscuit and the other a sour one. If we look at temperature, we have high temperature, low temperature, moderate, and then zero temperature, which is cold or very cold, or very hot. How the temperatures are categorized depends on the magnitude of the temperature; that is the natural ordering attached to the categories under that particular variable, which is ordinal, and the order matters. If it is high temperature there is no way you can shift and call it low temperature. Okay.
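In R, the nominal versus ordinal distinction described above maps onto unordered and ordered factors. The sketch below uses invented values; the variable names are hypothetical.

```r
# Nominal: categories are just names, no natural order
hair <- factor(c("brown", "black", "red", "black"))

# Ordinal: the levels argument fixes the natural order, ordered = TRUE records it
height_band <- factor(c("short", "tall", "average", "short"),
                      levels  = c("short", "average", "tall"),
                      ordered = TRUE)

is.ordered(hair)         # FALSE
is.ordered(height_band)  # TRUE

# Comparisons are only meaningful for ordered factors
below_tall <- height_band < "tall"
below_tall

# Frequency tables follow the level order, not the alphabet
table(height_band)
```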
okay in the
chat
any question
any question in the chat can i see in
the chat
any question
okay none
okay so let's proceed
okay
okay thank you let's proceed
So we have a data set; I have already shared it on Google Drive. It is called employee, and we would like to describe that data set using what we have already learned. The data set consists of these variables: gender of employee (male, female); educational level, that is number of years at school; job category, with three categories (clerical, custodian and manager); current salary in US dollars; beginning salary, also in US dollars; time on job in months; previous experience in years; and minority categorization, which is a yes or no.
So from these variables, can we identify the quantitative variables? In the chat, can you name the quantitative variables among the variables in the data set we are going to use? Let me see in the chat. Okay: current salary, experience, beginning salary, number of years, time on job. Thank you very much.
And among the quantitative variables, which one is continuous? Okay: salary, current salary, experience. Thank you very much.
And among those variables, which ones are nominal? Okay, we have gender, yes. Minority, yes. Gender and minority, thank you very much. Thank you participants, that is really excellent. I am so excited that you know what is happening and that you are so attentive.
So we find that gender of employee is nominal, because those are just names, male and female. Educational level, number of years at school, qualifies as quantitative. Job category is also nominal, because the categories are names and the naming does not have a natural ordering. Current salary and beginning salary are continuous in nature. Time on job and previous experience in years are also continuous in nature. And minority categorization is nominal. So knowing the type of variable also helps you to know what type of summary statistics or exploratory data analysis you are going to apply.
Okay, so the data set looks like that. Is it visible at your end? Okay, great, thank you. The data set consists of gender, birth date, years in school, job category, current salary, beginning salary, time on job, previous experience, and then minority. We are going to describe this data set through our exploratory data analysis.
So first of all we look at summarizing a quantitative variable, one at a time. In summarizing a quantitative variable we can use a box plot. A box plot displays a five-number summary. You can see this line here, the topmost line before you see the dots, which is labeled upper whisker end: this value is the maximum value. When I move my cursor down to the box, the first edge of the box, which is labeled Q3 or upper hinge, is the third quartile. The dark black line is called the median. When I move down to the bottom edge of the box, that is the first quartile. And when I go down to the last line, this is the minimum value. The difference between the third quartile and the first quartile is called the interquartile range. These dotted lines are called whiskers; there is an upper whisker and a lower whisker. So the box plot shows what is called a five-point summary. The point you see here is called an outlier, but we can always confirm that it is an outlier using a simple formula.
So that is a box plot. What does the box plot tell us? It tells us the distribution within that variable: whether our data is skewed to the left, skewed to the right, or normally distributed. Looking from the maximum to the minimum, if the upper part is not equal to the lower part then we know our data is skewed. If you look at our box plot here, the upper part is longer than the lower part. The upper part represents the right-hand side of the distribution and the lower part represents the left-hand side. So the right-hand side is longer than the left, meaning that if this were real data it would be skewed to the right. If the two parts are equal, the data is symmetric, as we expect for normally distributed data. If the left side were longer than the right side, we would call that a left skew.
So that is the five-point summary. We can also use the box plot for more than one variable, to compare two variables: look at their average values, are they different or the same? Is the distribution of variable one the same as that of variable two? And we can also identify outliers using a box plot.
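The five-number summary behind a box plot, and the upper-versus-lower comparison used to judge skew, can be sketched as follows. The vector `x` is invented, and comparing the two distances from the median is only the rough visual rule described above, not a formal test.

```r
# Hypothetical measurements with one large value
x <- c(2, 4, 5, 6, 7, 9, 20)

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
five

upper_part <- five[5] - five[3]  # maximum minus median
lower_part <- five[3] - five[1]  # median minus minimum

# A longer upper part suggests right skew, a longer lower part left skew
skew <- if (upper_part > lower_part) {
  "right"
} else if (upper_part < lower_part) {
  "left"
} else {
  "symmetric"
}
skew
```

Calling `boxplot(x)` on the same vector draws the corresponding plot.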
We have already seen the quantitative variables, so we won't repeat that. The ones highlighted in blue are the quantitative variables. We want to summarize educational level. If you look at the table here, the minimum number of years spent in school is 8, the lower quartile is 12, the median is 12, the upper quartile is 15, and the maximum is 21. So the least educated person in that company has spent eight years at school.
so in our current setting if somebody
has spent eight years in school
that particular person should be should
have finished a level
okay
Now, the maximum number of years spent in school is 21. So you can imagine that a person who has spent 21 years in school should have done a master's degree and probably even a PhD. The median number of years spent in school is 12, meaning that such a person has already done O level but has not yet done an undergraduate degree. Okay.
So you can see that the information in the table can be summarized in a box plot. You can see the maximum value, then the upper quartile, the lower quartile (the median and lower quartile are the same value, 12), and then the minimum value. We have already seen that the interquartile range is the difference between the upper quartile and the lower quartile.
The formula you see here is upper quartile plus 1.5 times the interquartile range, UQ + 1.5 × IQR. This formula is used to find out whether the values above it are outliers. From the table our UQ is 15 years, and our interquartile range is 15 minus 12, which gives us 3. If we substitute into the formula we get 15 plus 1.5 times 3, which is 15 plus 4.5, giving us 19.5. Any number above the value we have just calculated, 19.5, is an outlier. The maximum value is 21, so the maximum value is an outlier, and you can see that in the box plot it has been highlighted as an outlier.
On the lower end, when we want to find out whether we have any outliers, the formula changes: instead of UQ we use LQ, the lower quartile, minus 1.5 times the interquartile range, LQ - 1.5 × IQR. If we substitute, the lower quartile is 12 and the interquartile range is 3, so we get 12 minus 4.5, which is 7.5. Any number below 7.5 is an outlier. Our minimum value is eight, so it is not an outlier. Okay.
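The outlier check worked through above can be written out in R. The summary values (8, 12, 15, 21) are the ones quoted from the table; the variable names are chosen here for illustration.

```r
# Summary values for years in school, as quoted in the session
minimum <- 8
lq      <- 12   # lower quartile
uq      <- 15   # upper quartile
maximum <- 21

iqr <- uq - lq                    # 15 - 12 = 3

upper_fence <- uq + 1.5 * iqr     # 15 + 4.5 = 19.5
lower_fence <- lq - 1.5 * iqr     # 12 - 4.5 = 7.5

maximum > upper_fence   # TRUE: 21 is an outlier on the upper end
minimum < lower_fence   # FALSE: 8 is not an outlier on the lower end
```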
So is there any question about finding outliers? Let me see in the chat. Today's working material has already been sent; my colleagues are going to post the Google Drive link, and the information is there.
Somebody asks that I repeat: can you explain the lower part? Okay. For the lower end the formula is LQ minus 1.5 times the interquartile range. Any number below the value you get with that formula is an outlier. In our example the LQ is 12 and the interquartile range is 3, so the value we get is 12 minus 4.5, and that gives us 7.5. Any number below 7.5 is an outlier on the lower end, so in our box plot here we do not have outliers on the lower end.
On the upper end we already know the formula: any number above that value is an outlier. That is why you see the maximum value marked as an outlier, because the value we got was 19.5 and our maximum value is 21. Any number above 19.5 becomes an outlier.
When you see outliers you don't just delete them. What we do is investigate: are they genuine outliers? If they are genuine outliers we leave them in the data set. We analyze our data with the outlier in and look at the magnitude of our parameters, and then analyze the data without the outliers and look at the magnitude again, to see whether the values with and without outliers change substantially. For example, suppose we calculate the average number of years spent in school and find that with the outlier the value is, just as an example, 16, and without the outlier the value is 12. That means the outlier is strongly affecting our estimate of the mean. We can even test that difference, using a two-sample t-test, to decide whether it is significant and whether we should go ahead and delete the outlier: you test the difference between the parameter you get with the outlier and the parameter you get without it. If the two values are not significantly different, then you go ahead and analyze your data with the outlier.
So I am going to repeat. If you find that you have outliers in the data set, analyze your data with the outliers and record your parameters. Then delete the outliers, analyze the data without them, and record the parameters. If the difference between the values with and without the outliers is significant, we use the data set without the outliers. If there is no difference, we leave the outliers in and go ahead and analyze our data.
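A small sketch of the with-and-without comparison described above. The `educ` values are invented so that their summary matches the one quoted earlier (8, 12, 12, 15, 21); `boxplot.stats()` is used here as one convenient way to pick out the points beyond the 1.5 × IQR fences, which is an assumption about tooling rather than something shown in the session.

```r
# Hypothetical years-in-school values
educ <- c(8, 12, 12, 12, 12, 12, 15, 15, 16, 21)

# Points lying beyond the 1.5 x IQR fences
out <- boxplot.stats(educ)$out
out

educ_trimmed <- educ[!educ %in% out]

# Record the estimates with and without the outlier
mean_with      <- mean(educ)
mean_without   <- mean(educ_trimmed)
median_with    <- median(educ)
median_without <- median(educ_trimmed)

c(mean_with = mean_with, mean_without = mean_without)
c(median_with = median_with, median_without = median_without)
```

In this made-up example the mean shifts when the outlier is dropped while the median stays put, which illustrates why the mean is described as highly affected by outliers.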
okay let's go ahead so that was
about the box plot of
number of years spent in school
how about current salary
we see that the minimum current salary
is 15,750 u.s. dollars
the maximum is
135,000
is that an outlier we can find out
actually whether it's an outlier
let me find out from the chat
is the maximum value the maximum current
salary an outlier can you compute that
and we see in the chat
is it an outlier
is the maximum value which is 135,000 an
outlier
we have the formula upper quartile plus 1.5 times
the interquartile range
can we see in the chat let me see in
the chat
is it an outlier
okay i have a question i will answer it
later
is
135,000 an outlier
let me see in the chat work it out what
value do you get what value do you get
what are you comparing with
we know that
in
so somebody says this is an outlier
okay
our upper quartile is 37,050
plus 1.5
times the interquartile range
which is 37,050
minus
the lower quartile which is 24,000
so you fill in that value
i'm also working out as you work out i
want to confirm what you're saying
okay i've gotten a value of 56,625
that is the value which i have gotten
which we are going to compare with 135,000
and we see that 135,000 thank you
monday we've gotten the same value yes
it is higher than the value we use to
determine whether there is an outlier
okay
so that maximum value is an outlier
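The calculation worked through in the chat can be written in a few lines of R. The quartile and maximum values are the ones quoted in the session; the variable names are illustrative.

```r
q1 <- 24000          # lower quartile of current salary
q3 <- 37050          # upper quartile of current salary
max_salary <- 135000 # maximum current salary

iqr <- q3 - q1                  # interquartile range
upper_fence <- q3 + 1.5 * iqr   # upper quartile + 1.5 * IQR

# TRUE when the maximum lies above the fence
is_outlier <- max_salary > upper_fence
print(upper_fence)
print(is_outlier)
```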
is it a genuine outlier yes
at one time this company was paying
their manager
135,000 that manager must have brought
in a lot of business
so that's how we investigate this
outlier is it genuine yes why
should we analyze our data with this
outlier
we will find out okay
somebody says no
okay
let's go and see the box plot thank you
very much
so if we construct a box plot of
current salary
you can see that because of this outlier
here
check where the median is
so what can we say about the box plot
what distribution do you observe in the
chat
what distribution do you observe
somebody says no outlier what
distribution do we see
skewed to the right okay another one says
skewed to the left okay
skewed to the right
okay continue skewed okay
i can see most people are saying
one person is saying normal
distribution we shall prove that
uh okay skewed to the right okay thank
you very much
okay so let's look at the box plot
i say that
if we start from the maximum
value
to the median that is one side
if that upper side the right side
is the same
as the lower side from the median to the
minimum value then the data is normally
distributed
if the right side that is the top part
from the median to the maximum value is
longer than
the left side which is from the median
to the minimum value
the data has a right skew
or positive skew
if the left side from the median to the
minimum value is longer than the right
side then the data is skewed to the left
so this data set we see here is skewed to the
right
okay
and the reason why it is skewed to the right is
because of those outliers they are
trying to pull
the measure of central
tendency towards themselves
okay
so we see the maximum value
the upper quartile
the lower quartile the median
and the minimum value
so if we look at our data we find that
the least paid person earns fifteen
thousand seven hundred and fifty i'm
assuming now you've collected your data
you're writing around the data
so you go on saying the least paid
person earns that
the highest paid person earns 135,000
then the lower quartile
25 percent of the employees
earn less than 24,000
then 75 percent earn
below
37,000 that is the third
quartile
okay
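The five values read off a box plot (minimum, lower quartile, median, upper quartile, maximum) can be obtained in R as sketched below. The `salary` vector is hypothetical; only its minimum, quartile-like values and maximum echo the figures mentioned in the session.

```r
# hypothetical salaries in US dollars
salary <- c(15750, 21000, 24000, 26500, 28800, 31000, 37000, 45000, 135000)

# minimum, lower quartile, median, upper quartile, maximum
five <- quantile(salary, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five)

# boxplot() draws the plot; with plot = FALSE it only returns the numbers,
# including the points it would draw as outliers
b <- boxplot(salary, plot = FALSE)
print(b$out)
```

In an interactive session, calling `boxplot(salary)` without `plot = FALSE` displays the figure itself.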
so let's go ahead
outliers
observations that
are separated from the rest of the
values
are called outliers
you need to explore the reason
for the unusual observations
outliers could be of interest
if not an error
so good practice is often to do data analysis
with
and without outliers and compare the
results this is what i was saying
that we can test and find out
if we calculate our average value for
example
of data set a with outlier
we get a certain value
and then compare it with data set b
when the outlier has been removed
what happens to the magnitude of the
mean
sometimes it reduces other times it
increases
if it increases because of the presence
of outliers
is that increment significant
do we need to worry about the increment
so i said you can actually test the
differences
using because now we have two data sets
which we assume to be independent
and we can do a two sample t test on the
mean values capture the standard
deviation test the two the difference
between the two we can also say that
okay since this data set was collected
from
the same employee it is paired
it is it is correlated so we use a
paired t test and find out whether the
average difference is significant
and if it is significant that means we
have to pay attention to the outlier
it's going to affect
the
results of our parameters
if we are going to report to another
company that company x on average earns
this much we know that with the outlier
in
we are going to report a higher salary
compared to when we don't have an
outlier in the data set
okay
if it is a mistake
you go back to your data sheets
and correct
the outlier
for example if you are entering data
and you've coded your data you have male
is one female is zero
now instead of entering male 1 you
enter 10
then 10 becomes an outlier it is
different from the rest
so for for categorical data how do we
identify the outliers
we do crosstabulation
of two categorical variables for example
we cross tabulate gender
and job category
and under gender you will see that there
is a third category called 10 that
you've created because of the mistake
that was done
so once you identify that there is a
group which you do not you're not
interested in
and it is a mistake you go back to the
data sheets and correct
the gender of that particular
individual
okay
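A small R sketch of spotting a miscoded category with a cross tabulation. The data are hypothetical: gender is coded 1 for male and 0 for female as in the example above, and one entry was typed as 10 by mistake.

```r
# hypothetical coded data; the fifth gender value (10) is a data entry error
gender <- c(1, 0, 0, 1, 10, 0, 1, 1)
jobcat <- c("clerical", "clerical", "manager", "custodial",
            "clerical", "clerical", "manager", "clerical")

# cross tabulation: the unexpected third gender category shows up as its own row
tab <- table(gender, jobcat)
print(tab)

# positions of the records whose code is not one of the valid values
bad_rows <- which(!gender %in% c(0, 1))
print(bad_rows)
```

The row positions point back to the records to check against the original data sheets.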
okay let's look at uh
descriptive
let's look at numerical summaries
for
a population distribution
we know that
in statistics we have two branches
descriptive statistics
and then inferential statistics the
descriptive involves numerical summaries
and graphical displays
inferential confirms what you've
observed in the descriptive
uh statistics
now under numerical summaries we have
measures of central tendency
measures of spread
shape
we also are interested in outliers
okay
then
measures of center
we have
examples like the mean
which is the commonest measure of central
tendency
or the commonest measure of center of
the distribution
the mean is computed as the sum of all
observations
divided by the total number of
observations
it is highly affected by
outliers or extreme observations
excuse me i'm sorry about that
the mean is highly affected by
extreme observations
we use the trimmed mean to exclude
the extreme observations from the
calculated mean so if we calculate our
mean
with the outliers we are going to have
an inflated value which is going to
paint a different picture
for example
looking at income
we find that income is skewed either
to the extreme high or the extreme low
depending on the number of
extremes you have if you have very many
extreme lows
then your average income is going to be
skewed to the left where we have very
low very low values if you're going to
have very many people earning a lot of
money
or even if you have five people earning
in billions
your income is going to be skewed to
that value
and you'll end up reporting that the
people in your country are doing well
yet actually
they are not doing well
so for us to avoid reporting
values like that one which misrepresent
the income we can use the trimmed mean we
trim off the observations which are
extreme and then compute
a new mean
which is representative
of what is happening in the population
so the difference between the mean and
median we can see that
we have calculated the mean of current
salary and years at school
and also we've computed the median
okay
remember mean and median both of them
measure
central tendency
you can see that
the current salary the average is
34,420
the median is 28,800
us dollars
median is not affected by extreme values
because what you do you arrange your
data set in ascending order and then get
that particular value of
occupying the middle position of the
data
that is the median
now we say that trimmed mean if we trim
off 80 percent of the outliers
then we are going to see the mean the
trimmed mean becoming 28,000 almost
similar to the median
the same applies to years at school
the median
is 12
the mean is 13.47
but if we trim off
most of the extreme values we end up
with 12.4
as our trimmed mean
okay so that is that shows you how
the mean is highly affected by the
presence of outliers
so here we have
more
values of current
salary
and we are looking at we are comparing
the value of mean
and median
so
when we compute
the mean
and we do not remove any
extreme values
the mean is far greater than the median
if we remove five percent
of the extreme values
the mean is far greater than
the median
if we go on increasing how many
extreme values we are going to remove
and remove 25 percent
you can see that the mean is changing
changing towards the median
and when we look at the median it has
not changed the value has not changed it
is still
28,800
okay
so we see that the mean
is highly affected by outliers
and in case you have
a variable that is extremely affected by
outliers
it would be better for you to use
the the
median
median as a measure of central tendency
is very robust
to outliers and does not change much
with changes in dataset
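In base R the trimmed mean is available through the `trim` argument of `mean()`, which gives the fraction of observations dropped from each end of the sorted data before averaging. The sketch below uses a hypothetical `salary` vector to show the mean moving toward the median as more is trimmed, while the median stays put.

```r
# hypothetical salaries in US dollars, with one very large value
salary <- c(15750, 21000, 24000, 26500, 28800, 31000, 37000, 45000, 135000)

# fraction trimmed from EACH end of the sorted data
trim_levels <- c(0, 0.05, 0.25)
trimmed <- sapply(trim_levels, function(p) mean(salary, trim = p))
names(trimmed) <- c("trim_0", "trim_5pct", "trim_25pct")

med <- median(salary)  # not affected by the extreme value

print(trimmed)
print(med)
```

With only nine values, a 5 percent trim removes nothing (fewer than one observation per end), so the first two results coincide. A `trim` of 0.5 or more returns the median.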
let me see in the chat
any questions
what is the formula for calculating trimmed
mean
won't our analysis be affected by trimmed
mean
my colleagues are going to help me
answer those questions
okay
but i advise you to put such long
questions in the q and a
so that in case
they are being answered then they don't
disappear
okay
so as they answer those questions they
will post the comments in the chat and
everybody will be able to see
whether
the
the data set is affected
by trimming off
outliers
the other numerical summary is the
measure of spread
spread is a measure of how far a value
is from the center
how far a value
is from the
central value
that is spread
so we find that
if we look at that box plot here
examples of measures of spread include
the range which is the difference
between the
maximum and the minimum interquartile
range the difference between q3 and q1
we look at the standard deviation
the variance
all those are examples
of measures of spread
variance and standard deviation
the variance and standard deviation are
based on square deviations from the
center
where the center is the mean
so we can see the values for
if we look at this number line
we can see our observations x1
x2 x3
and then the mean is 3
so we want to calculate how far
is 1.5
from 3
or how far
is 2 from 3 or 2.5 from 3 and we can
also
find out whether
those values are
different
so the standard deviation is the square
root of variance
so here we have a data set where we've
used streamed
data set we have the range interquartile
range
variance and standard deviation
okay
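The measures of spread listed above map onto base R functions as sketched here. The vector `x` is hypothetical, picked so that its mean is 3 as on the number line; note that `var()` and `sd()` divide by n minus 1 (the sample versions).

```r
# hypothetical observations with mean 3
x <- c(1.5, 2, 2.5, 3, 3.5, 4, 4.5)

rng <- max(x) - min(x)  # range: maximum minus minimum
iqr <- IQR(x)           # interquartile range: q3 minus q1
v   <- var(x)           # variance: average squared deviation from the mean (n - 1)
s   <- sd(x)            # standard deviation: square root of the variance

print(c(range = rng, iqr = iqr, variance = v, sd = s))
```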
so when you look at the variance
it's quite big it's 8.3
but compared to our mean
the variance is really small so the data
set does not vary a lot from
the observed
average
okay
so let's look at uh distribution
from here we are going to do if time
allows
we will do
the practicals
if time doesn't allow then first thing
in the morning
we will do that practice so shape of the
distribution we've already been talking
about
skewness
so here we see that we have these two
distributions
one is uniform
one is normal
normal because it is bell-shaped
and if you draw a line in between
one side is equal to the other
that is normal distribution
okay
so here we have the uniform distribution
no peak you can see that
the there's a bit of a peak here and
here and here and then here
okay
in the uniform distribution
it is almost similar to a discrete
distribution because in each interval
the probability
of
selecting anything in each interval is
similar
to the discrete random variable
but the data set used to construct
the uniform distribution
is continuous
okay let's look at left skew and right
skew so if we look at the first graph
you can see that
the left tail is equal to the right tail
so if they are removed we will remain
with only a
the middle bit
okay
so for skewed distribution
i mean for symmetric distribution
the right tail is
similar to
the left tail
now we've already said if the right tail
is longer than the left tail
then that data set is skewed to the
right
as you can see
the data set is skewed
to the right
let's see what happens in the next one
now if the left tail is longer than the
right tail
then that type of skewness
is called
left
skew or negative skew
okay
in a symmetric distribution the median
is approximately equal to the mean
and also approximately equal to the mode
okay
so we've already seen the left skew
in the right skew the mean
is greater than the median while in the
left skew the mean is less than
the median
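The mean-versus-median rule of thumb can be sketched as a small helper. `skew_direction` is a hypothetical function written for this illustration, not part of base R, and it is only a rough heuristic: it compares the two values exactly and does not test whether the difference is meaningful.

```r
# rough heuristic: compare the mean with the median
skew_direction <- function(x) {
  m <- mean(x)
  med <- median(x)
  if (m > med) {
    "right (positive) skew"
  } else if (m < med) {
    "left (negative) skew"
  } else {
    "roughly symmetric"
  }
}

# hypothetical examples
right_example <- skew_direction(c(1, 2, 3, 4, 20))     # mean 6, median 3
left_example  <- skew_direction(c(1, 17, 18, 19, 20))  # mean 15, median 18
sym_example   <- skew_direction(c(1, 2, 3, 4, 5))      # mean 3, median 3
```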
so we can now change the goalposts
if you have qualitative variables
you don't have
numerical values what do we do
now the concepts of central tendency
spread and skewness have no meaning for
qualitative nominal variables because
they are names we are talking about male
female if we get the average males
doesn't really make sense
or the skewness of male
doesn't really make sense
for qualitative ordinal variables it
sometimes makes sense to treat the data
as
quantitative for exploratory data
analysis purposes
so you can use the
median
spread and skewness
under
the qualitative ordinal
variable
so how about if the variable is
quantitative
it has numerical values
here we see that we can summarize our
variable using the minimum the maximum
observation the median the mean the
quartiles
the range
the interquartile range variance standard deviation
then we can also use the box plots
histogram then we can also look at
skewness and also identify
outliers
okay
so in our employee data set
we can also we've already highlighted
the qualitative variables and those
include gender
job category and
minority
so for our qualitative variable we can
summarize our data using a frequency
distribution
a pie chart or a bar chart
okay so we are going to look at these
ones
so the frequency table consists of
categories you can see here in this
little table here these are the
categories gender female male
then the frequency corresponding to each
category and the relative frequency is
this frequency divided by the total
number times a hundred if you want percentages
and then the relative frequency for all
the categories should add up to one
okay
so we find that pie charts they are not
a very good way of summarizing dataset
when we compare the pie chart and the
bar graph
somebody will easily read off
the value for the bar chart as compared
to pie chart
so in this case
you only use pie chart if there is no
any other
graph that you can construct
so we can have our frequency
distribution again for job category
and
job category
so if we get the frequency for each
category we have the cleric the
custodian the manager
and we can get the relative frequency
and for it to get the relative frequency
we get the frequency for each category
divided by the total number of
observations that's how we are getting
these values here then the relative
frequency should add up to one
we can plot our data either using a bar
graph
or a pie chart
you can see that the title for the graph
should fall
below the graph
and the same applies to the pie chart
from both the graph bar graph and the
pie chart
we observe that there are more clerics
as compared to
the manager
and custodian
okay so that particular
company employs more of clerical work
than
other
categories
if we look at variable minority we can
summarize it by the two groups yes and
no
and we find that most of the people in
the company were not minority
and compute the relative frequency so 78.1
percent of the employees are
not minority
meanwhile the minority are 22 percent
how about if you want to explore
relationships or before i go there any
question
any question
can we go ahead
relative frequency
okay so i'm told to repeat relative
frequency
okay
we see that we have categories yes and
no
each category has the corresponding
number of observations that fall in yes
and also those for no so to get these
values here
0.219 we are going to get 104 divided by
474.
so that gives us that value
the same applies to no 370
divided by 474
gives us
78
i mean 0.781
as the proportion of people who are not
minority
so if we can multiply by a hundred to
get percentages there are more people
who are not minority
that is 78 percent as compared to 22
percent
who are minority
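The relative frequencies just described can be reproduced in R from the two counts quoted in the session (104 yes and 370 no, out of 474). The object names are illustrative.

```r
# counts for the minority variable as quoted above
counts <- c(yes = 104, no = 370)

# relative frequency: each count divided by the total
rel_freq <- counts / sum(counts)

# percentages, rounded to one decimal place
pct <- round(100 * rel_freq, 1)
print(rel_freq)
print(pct)

# with the raw column instead of counts, table() and prop.table() do the same job
minority <- rep(c("yes", "no"), times = c(104, 370))
rel_freq_raw <- prop.table(table(minority))

# bar chart of the percentages, drawn only in an interactive session such as RStudio
if (interactive()) barplot(pct, ylab = "percent of employees")
```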
so even the minority when they go to
different employers they end up being
minority i thought that now
there would be more people employed
but maybe
the order does not matter
okay exploring relationship between two
variables what are we talking about what
if you have a quantitative variable and
a qualitative variable
or you have two quantitative variables
or you have two qualitative variables
how do we summarize
those variables
we can bring back our box plot we can
use the box plot as i mentioned earlier
on
to summarize our quantitative variables
and we can use a bar graph
for each category to summarize
the
qualitative variables so you find that
the bars red green blue
represent the different categories
is there a relationship between current
salary and employment category that
question goes to you participants
is there a relationship between current
salary and employment category
and if you say yes
why
if you say no
why
so
let's see in the chat is there a
relationship between
current salary and employment category
okay so people are saying yes
yes because the higher the rank the more
the pay okay
somebody's saying no
okay
thank you very much so we see that
there is a relationship between
salary and employment category
you can see that
for cleric
they earn
very little but they're the highest now
here we are just looking at frequency
it is the highest number of people in
the in the company
okay
and the least number in the company is
custodial
so we are going to go ahead
and look at
salary current salary
we were asking a question is there a
relationship between current salary and
job category
so this gives us the different
categories and their
salaries
okay
so you can see that the manager
earns best
but also when we go to look at number of
years spent in school
by job category you see that the manager
has spent more years in school
maybe he has gained more experience
maybe has gained more knowledge
hence being called a manager
and warranting the high salary
okay
so we can actually transform our
information in the table into a box plot
so you see that this is the box plot for
manager
this is custodian this is cleric
the cleric earns less compared to the
custodian
compared to the
the manager
the manager earns highly
because the blue the dark line is our
median so we are looking at the median
which has the highest median it's the
the manager which one has the least
according to the graphs we see the one
with the least
median
is
cleric
but the reason why it is like that
is the way maybe the values were
computed computed or in reality actually
the median for custodian is lower than
cleric
so the mean for custodian
is higher
than the cleric
if you look in the table
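A sketch of summarizing a quantitative variable within each category of a qualitative one, using `tapply()` for the table and `boxplot()` with a formula for the figure. The `emp` data frame and its column names (`jobcat`, `salary`, `educ`) are hypothetical stand-ins for the employee data.

```r
# hypothetical employee records
emp <- data.frame(
  jobcat = c("clerical", "clerical", "clerical",
             "custodial", "custodial",
             "manager", "manager", "manager"),
  salary = c(24000, 26500, 28000, 30000, 31000, 55000, 60000, 135000),
  educ   = c(12, 12, 15, 8, 12, 16, 17, 19)
)

# median salary and mean years at school per job category
med_salary <- tapply(emp$salary, emp$jobcat, median)
mean_educ  <- tapply(emp$educ, emp$jobcat, mean)
print(med_salary)
print(mean_educ)

# one box per job category, drawn only in an interactive session such as RStudio
if (interactive()) boxplot(salary ~ jobcat, data = emp, ylab = "current salary")
```

Comparing medians rather than means keeps the comparison from being driven by a single very large salary.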
so educational
number of years in school and job
category
we will see that the manager
has spent more years in school
compared to a cleric
so compared to the cleric
the manager has spent more number of
years in school and also compared to the
custodian
so he has amassed more experience
and knowledge
that's why he's earning a better salary
compared to the custodian and also when
you go for job interviews
they'll ask you what is your highest
level of education
so if you go with a person holding a phd
for a job with a manager and you have
only one degree
and we assume that you have the same
experience then the person with phd will
get a job compared to a person with one
degree
and also the
the number of years spent in school
determine how much
that person earns
so if we put these two tables together
current salary and educational level
number of levels in years
can educational level explain the
differences in salary
can
is there a relationship
okay
so in the chat can we explain
can we explain the differences in
in salary
using education
so let's see in the chat
okay
so
in the
if we look at these tables are we able
to tell the differences are we able to
explain why the manager is earning
more than
the
the cleric using the educational
number of years spent in school
somebody's saying yes
yes we can
your educational history will determine
your salary
uh the higher the years of education the
higher the salary
okay
yeah
apparently the higher the years
you spent
in school
the higher the salary that is true even
in universities you will find that a
professor
earns higher than a lecturer because of
the number of years spent in school as
well as the experience
okay
i won't share what my colleagues have
just said
because you will all
i'll share that when we end the
training
okay
so we can actually present our data in
form of a box plot now to give a hint
when you're doing your research when
you're analyzing your data don't present
both a table
and a figure of the same data
you can either use a figure to explain
whatever you want to explain or a table
you can mix and match for one question you
can use the table for another question
you use
the figure so don't present like the way
i'm doing i'm just this is just because
i'm demonstrating but when you're
writing up your manuscript
present either a box plot only or a table
only
and remember that when we are presenting
a box plot the the titles go beneath
my
slides became a bit small these titles
are supposed to go beneath
so don't do as i do just do as i say
that the titles go below the figures
okay
so these box plots show the number of
years spent in school per job category
and also the number of the current
salary per job category we've already
seen that
current salary if you look at
manager is skewed to the right
and also cleric is skewed to the right
the custodian
not very sure
we need to expand the scale in order to
be able to see our custodian when we look
at
education
years at school
years at school for manager is skewed to
the left
meanwhile custodian
is skewed to the right
and cleric
clerical seems to be
normally distributed but this is the
median
so it is skewed to the right so the median is
just at the edge
so it's skewed to the right
okay
so for us to be able whether to be able
to see a relationship between education
and salary we can use other
summary statistics like
uh
the scatter plot
so we can also look at current salary by
gender what are we doing we are
summarizing a quantitative variable
versus a qualitative variable how do we
summarize a quantitative variable
and a qualitative variable you can see
if we choose on gender and salary the
male
are earning more than
the female
the reason i don't know
maybe the male are also highly educated
yes if we look at the box plot
for male
number of years spent in school
we can see that there are more number of
years spent in school for male compared
to the female
is there some level of discrimination
against male managers
i'm not very sure
i'm not very sure
we need to
look at
i don't think there is any
discrimination
per se maybe maybe not
okay so let's look at uh cleric
by gender so we've
cut out data set for cleric and
summarize it by gender so salary
for cleric by gender we see that the
male cleric are still earning more than
the female
and the years spent in school
by the male cleric is more than the years
spent in school by the female cleric
maybe that explains why
the the female are earning lower than
the male
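Cutting out one job category and summarizing it by gender can be sketched with `subset()` and `aggregate()`. The `emp` data frame, its column names and its values are hypothetical.

```r
# hypothetical employee records
emp <- data.frame(
  gender = c("female", "female", "female", "male", "male", "male", "male"),
  jobcat = c("clerical", "clerical", "clerical",
             "clerical", "clerical", "clerical", "manager"),
  salary = c(22000, 24000, 26000, 27000, 30000, 33000, 60000),
  educ   = c(12, 12, 12, 12, 15, 15, 16)
)

# keep only the clerical employees
clerical <- subset(emp, jobcat == "clerical")

# median salary and median years at school for each gender
by_gender <- aggregate(cbind(salary, educ) ~ gender, data = clerical, FUN = median)
print(by_gender)
```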
okay so let's look at uh
the distribution pattern for a
qualitative variable
given by their relative frequencies
and and
frequencies so if we look at job
category
and also look at gender
we see that
there are
more clerics in the company compared to
all other categories and as well as more
male in the company compared to all
compared to the female
is there an association between job
category and gender
so for us to see whether there is an
association between job category and
gender we need to do a cross tabulation
are males and females equally
distributed in the different job
categories
that one we also need to construct
a cross tabulation
here we are just looking at the
proportions of each category in the
company but with these separate
frequency distributions we may not be
able to tell whether there is an
association between job category and
gender but with a cross tabulation we are
able to tell whether there is an
association between job category and
and gender
okay
so let's construct our cross tabulations
between job category and gender
so when we look at the the first box
shows us the frequency
and we can see that cleric
there are more female clerics than male
clerics
if we look at custodian there are no
female custodian
all of them are male
manager there are fewer
female managers as compared to the male
and we can compute our relative
frequency
so a relative frequency for example for
the first cell
is 206 which is under female divided by
the total number the grand total which
is 474.
so we are able to calculate the relative
frequency for all these cells
and from here we can see that
the highest number
of employees are female clerics
in this
uh
cross tabulation of relative frequency
so we see that the female cleric
represent a higher proportion compared
to the rest of categories
okay
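A sketch of a cross tabulation of job category by gender and its relative frequencies. Only the female clerical count (206) and the grand total (474) are quoted in the session; the remaining cell counts are assumptions filled in so that the table sums to 474.

```r
# counts per cell; apart from 206 and the total of 474 these are assumed values
tab <- as.table(matrix(
  c(206, 0, 10,    # female: clerical, custodial, manager
    157, 27, 74),  # male:   clerical, custodial, manager
  nrow = 3,
  dimnames = list(jobcat = c("clerical", "custodial", "manager"),
                  gender = c("female", "male"))
))

# relative frequency of each cell: cell count divided by the grand total
rel <- prop.table(tab)
print(round(rel, 3))

# with raw data the same table would come from table(emp$jobcat, emp$gender)
```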
we can also use a bar
plot
a bar graph
to plot
job category
versus
gender female male okay
so we can see that for female
still there are no female managers
that's why you see there is no
bar representing that group
and meanwhile in the male they are
evenly distributed across i mean they
are
well distributed across the company all
categories have
males representing them
so we can look at the
cross tabulation of female gender versus
job category
so we know that the totals the totals
at the extreme end right are called
marginal totals they are
the totals for each
row
they are the marginal total for job
category and the totals for the columns
are the marginal totals for gender
okay
so we can actually also compute our
relative
frequencies for these
marginal
distribution and see which particular
category is highly represented in the
company
compared to the other
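Marginal totals and their relative frequencies can be sketched with `margin.table()` and `addmargins()`. As before, only the female clerical count (206) and the grand total (474) come from the session; the other cell counts are assumed for illustration.

```r
# job category by gender; cells other than 206 are assumed values summing to 474
tab <- as.table(matrix(
  c(206, 0, 10, 157, 27, 74),
  nrow = 3,
  dimnames = list(jobcat = c("clerical", "custodial", "manager"),
                  gender = c("female", "male"))
))

row_tot <- margin.table(tab, 1)  # marginal totals for job category (rows)
col_tot <- margin.table(tab, 2)  # marginal totals for gender (columns)

# marginal relative frequencies for job category
row_rel <- prop.table(row_tot)

# the table with a "Sum" row and column appended
print(addmargins(tab))
print(round(row_rel, 3))
```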
so within the table we are able to know
within this cleric how many male are
there and how many females are there in
the company
and then across the categories we are
also able to tell
if you are female and going to that
employer
you're likely to become
a cleric depending on your educational
level
and less likely to become
a custodian because there are even no
female custodial
employees in that company
okay so this is what i've
already been
talking about
that we can compute the relative
frequencies inside here
and talk about them
so we write around the table
are jobs distributed equally between
male and female no
it appears like
uh gender determines what job category
you're going to have
for example we have no female custodial
and there are more female clerics so in
this company they are likely to employ
female clerics as compared to managers
female managers i don't know for what
reason
we need to investigate more
okay
so you can either represent the same
information here using a graph
or
a table
so in your manuscript you use one of the
two
whichever appeals to your eyes if you
want the table use the table if you
want a graph you use the graph but it is
the same information
that you are looking at
okay
so is there an association between job
category and gender
that one we've seen that yes there is
likely to be
a
dependency
of
gender and job category we will prove
these dependencies when we look at uh
chi-square tests or we look at
log-linear regression
later in the week
but uh using the descriptive statistics
we see that we can use our bar graph to
see whether there are associations you
can see that
the females
we can't get any female
custodial
and also female managers
we get fewer female managers compared to
male managers
and there are more female clerics
compared to
male clerics
so
gender
plays a big role in determining
what job you're going to have at that
company
and that goes hand in hand with your
experience
and educational level
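As a preview of the chi-square test mentioned above, base R offers `chisq.test()` for a table of counts. This is only a sketch under assumptions: apart from the female clerical count (206) and the total (474), the cell counts are illustrative, and the session itself defers the test to later in the week.

```r
# job category by gender; cells other than 206 are assumed values summing to 474
tab <- as.table(matrix(
  c(206, 0, 10, 157, 27, 74),
  nrow = 3,
  dimnames = list(jobcat = c("clerical", "custodial", "manager"),
                  gender = c("female", "male"))
))

# chi-square test of independence between job category and gender
res <- chisq.test(tab)
print(res)

# a small p-value suggests job category and gender are not independent
small_p <- res$p.value < 0.05
```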
so how about if you want to look at only
quantitative variables
only quantitative variables
is there a relationship between number
of years at school and current salary
those are the questions you're setting
for yourself
remember this is a data set that we
downloaded from an archive and then you
just formulate
questions so you can also formulate such
questions for your research
does current salary depend on beginning
salary
does current salary depend on previous
experience
so we can answer those questions
so we can use a scatter plot
a scatter plot consists of
a variable on the x-axis
normally called
the explanatory or independent variable
and then a variable on the y-axis called
the response or dependent variable
so in most cases when our experiment has
been performed
when you put in when you change the
value of the independent variable
then the outcome you're measuring is
likely to change
so normally our explanatory variables
are those variables which do not change
a lot
that you can determine that you can
measure with minimal
error
and the response are those ones which
vary
with any slight change of another
variable
is there a relationship between
educational level and current salary
let me see in the chat do you observe
any relationship in this graph
and if yes what type of relationship do
you observe
in the chat
or somebody saying it is lunch
okay let's see
do you need a break
as you answer that question
they say yes positive relationship
okay somebody says he doesn't know how
to interpret
the higher the educational level the
higher the salary
somebody says a break is needed let's
continue
okay
okay thank you very much
so
for you to observe a positive
relationship the values
increase
right from where my
cursor is you see that cursor this
particular one they increase from this
bottom left going up to the right hand side
then you know that is a positive
relationship
so we see that there is a positive
relationship between
current salary and years in
education
okay
so we can also find out the magnitude so
if these lines were starting from this
point here
where the y and x axes meet in that
corner there
and go out spreading
then you would have a perfect
relationship but you see they are
deviating from that line
meaning that the correlation between
current salary and years in school
is moderate it is not very strong
so we look at the magnitude and confirm
is there a relationship between
education and current salary
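A sketch of a scatter plot with the explanatory variable on the x-axis and the response on the y-axis, plus the correlation coefficient as a measure of magnitude. The `educ` and `salary` vectors are hypothetical.

```r
# hypothetical data: years at school (explanatory) and current salary (response)
educ   <- c(8, 8, 12, 12, 15, 16, 19, 21)
salary <- c(16000, 21000, 24000, 30000, 28000, 45000, 60000, 135000)

# scatter plot, drawn only in an interactive session such as RStudio
if (interactive()) {
  plot(educ, salary,
       xlab = "years at school", ylab = "current salary (US dollars)")
}

# Pearson correlation: the sign gives the direction, the size gives the strength
r <- cor(educ, salary)
print(r)
```

A value of `r` close to 1 or -1 points to a strong linear relationship, and a value near 0 to a weak one.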
we see that there are some
outliers which are affecting our
graph
and we circled them in red
those which are very low
and those which are very high remember
that
a value is an outlier
if it's an outlier confirm that it is an
outlier before you delete it from the
dataset and also confirm
that it is likely to affect
your results
before
you decide to delete it
so we've identified some of these values
you can see that
number of employees
with eight years at school are 53
and those with 12
are 190
this is just a summary of years at
school a number of employees
and then
those ones with 21 years 20 years we
have one person
with 21 years
and then
two people with 20 years so those are
the ones representing this
circle here as outliers
and the 53 are the ones here in this
circle here
okay
then those with the 19 years are 27
of them
also affecting our results but we cannot
remove them until
we confirm that actually
they can be detrimental to our results
is there a relationship between
beginning salary and current salary
let me see in the chat
what type of relationship do you observe
in the scatter diagram
what type of relationship do you observe
in the scatter diagram
somebody says positive yes positives
positive what linear and linear
what positive relation do you observe
okay
linear relationship
okay
somebody said skew to the left seriously
can you see skewness in the scatter
diagram
we can only see the relationship and
also identify that there are
values which are likely to be outliers
for example if we see in our scatter
diagram there's this particular value
here which is likely to be an outlier
and we can
find out whether it is but there is a
positive linear positive strong linear
relationship between the current salary
and beginning salary
okay
so
we can actually draw very many scatter
diagrams
at a go between the quantitative
variables
so you can see this this
column here is for current salary the
first one
the second one is for beginning
the third one is years in school
the fourth is time on job
the fifth is experience
so we see current salary and beginning
salary is that topmost
graph the first graph you see here
on the second column and also
below the box for current salary those
ones are the same
current salary and beginning salary
has a strong positive linear
relationship
between those two variables
and then if we look at uh
current salary and years in school you
can see years in school is right here
there is a relationship but it's not as
strong as above
okay
then when we look at time on job
oh there is a very weak relationship
between current salary and time on job
the same applies to beginning salary
and time on job you can see it's a
random scatter of points indicating a weak
relationship between those two variables
okay
then we have previous experience and
current salary we can see that there is
a relationship a linear relationship but
this time instead of being positive it's
negative you can see the values are
flowing
downwards okay
most of the values are flowing downwards
so there is a relationship a linear
relationship
but
it is negative
okay
so you can actually draw all your plots
at a go and present them in
in your work instead of drawing like 10
graphs or eight graphs you can present
them
in one go
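A minimal sketch of drawing all the scatter diagrams at a go with base R `pairs()`. The data frame `toy` and its column names are hypothetical stand-ins for the employee data, and the session's own script may use a different function.

```r
# hypothetical toy data standing in for the employee data set;
# the column names are assumptions, not taken from the course script
set.seed(1)
n <- 50
salbegin <- round(runif(n, 9000, 40000))
toy <- data.frame(
  salary   = round(salbegin * 2 + rnorm(n, 0, 3000)),
  salbegin = salbegin,
  educ     = sample(8:21, n, replace = TRUE),
  jobtime  = sample(63:98, n, replace = TRUE)
)

pdf(NULL)              # draw to a null device so the sketch runs without a screen
pairs(toy)             # one scatter diagram for every pair of columns
invisible(dev.off())

# number of off-diagonal panels in the scatter plot matrix
n_panels <- ncol(toy) * (ncol(toy) - 1)
```

In RStudio you would drop the `pdf(NULL)` and `dev.off()` lines so the matrix appears in the plots window.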
so here you can also look at the
different
graphs and look at their
whether they have outliers
okay
so this is a plot for male
if we separate our data into female male
and then draw the scatter diagrams
so like my colleague helen taught you
how to
split the data set you can split your
data set by gender and then analyze
female alone and also analyze male alone
so you can see a strong relationship
between current salary and beginning
salary but a weak relationship between
previous years
of experience and
current and beginning salary
okay so this is more of those variables
i mean those scatter diagrams again
and
so let's look at correlations
we were talking about correlations
uh in words
correlations measure
the strength and the direction of a
linear relationship
correlations show us whether we have
negative
relationship or positive relationship
so the values you see in red are the
correlation values
for
two variables for example
current salary and beginning salary
has a correlation of 0.88
we saw that in the diagrams we could
observe that they had a strong
positive linear relationship and the
correlation value is confirming that it
is true
current salary and beginning salary have a
strong positive
linear relationship
then we have a relationship between
years in school the second the third
graph and current salary
so this figure here
shows that the correlation value is
0.661
indicating
a strong relationship not very strong
and not weak
between
years in school and current salary
then when we look at time on job
and current salary it is a weak
relationship the correlation value is
0.084
then when we look at
previous experience and current salary
it's negative you see this minus
negative 0.13 it is weak
and negative relationship
so in your
summary statistics in your write-up you
can have your
values superimposed onto your graph
and shown
in your
your analysis
okay
okay
so here we present our correlation
matrix
as another way of summarizing the
data
so
these are the
variables we've been looking at as well
so we have our combined data
where we have beginning salary
and current salary
educational level and current salary
job
time on job and current salary and previous
experience which you've just seen in the
graphs
then we have a data for current salary
but we've split it into manager
cleric custodian
so we have
manager cleric custodian
then we have
a
female manager male manager
so the things you can do if you're
really interested in going in depth
uh
understanding of your data set
so if you look at manager we find that
there is a
strong positive correlation between
current salary and beginning salary for
manager
the same applies to cleric
and also custodian is positive but weak
because it is 0.07
so there is this there is a weak
positive
relationship between current salary
and beginning salary for custodian
okay
so you can go ahead and
look at all those
in your free time
and study the
the data set so that we see
other relationships that we can have
so in the chat
do you need a break before we go into
practicals
uh dr thomas you have a message here i
would have to go offline
my battery is low no electricity sorry
about that
we are really
touched
i don't have anything i can't help you
i'm over here in uganda
how do we know the relationship is
strong and weak if we have a correlation
of over 0.5
irrespective of minus or positive then
that is strong
so if if a correlation is below 0.5
then that is weak
if it is equal to 0.5 it is moderate
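The rule of thumb just stated can be sketched as a small helper. The function name `describe_r` is hypothetical, and the 0.5 cut-off is the one given in the session; other textbooks use different cut-offs.

```r
# rule of thumb as stated in the session: |r| above 0.5 is strong,
# below 0.5 is weak, exactly 0.5 is moderate, whatever the sign
describe_r <- function(r) {
  size <- abs(r)
  strength <- ifelse(size > 0.5, "strong",
                     ifelse(size < 0.5, "weak", "moderate"))
  direction <- ifelse(r > 0, "positive",
                      ifelse(r < 0, "negative", "none"))
  paste(strength, direction)
}

# the four correlation values read out earlier in the session
describe_r(c(0.88, 0.661, 0.084, -0.13))
```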
it says let's resume at
11 a.m
on nigerian time oh my goodness
okay we shall resume at one
ugandan time which is 11 nigerian time
lunch time for ethiopia okay
no problem let's have a break and then
we are going to do practical so when you
come back
i make sure you download the script
which is called
eda it's what we are going to use for
now let's have a break
how long is the break
up to 1
45 minutes
sorry 15 minutes
don't go for 45 15 minutes
15 minutes break
and come back and we do the practicals
hopefully we can finish them in one hour
and tomorrow
my colleague
dr thomas is coming in with
very interesting information don't miss
don't miss
because you're going to get more
information so thank you
let's see each other
after
a break
okay
thank you very much
uh
are we back
from the break
can i see in the chat whether we are
back from the break
are we back from the break
okay thank you thank you very much
uh we are going to do the practicals of
what we've been talking about
in the remaining one hour
and uh
you're now familiar with r
open the script called eda
that is exploratory data analysis
and uh
the first thing we do is to install
packages but i know you've already
installed these packages so go through
these packages and see which one you
don't have
and click put the cursor for example if
you don't have tidyverse
just put the cursor on the line where
there is install.packages tidyverse
and run
don't
rerun what you already have
because i already have all these ones
i'm not going to install those packages
but i'll give
i'll give five minutes for you to be at
the same pace with me and make sure
you've installed all these packages
okay
so install
packages tidyverse
install all runner install
readxl that is read excel
then readr
dplyr plotrix epiDisplay
okay
i believe my colleague has already told
you the use of most of these
ggplot
so make sure you've installed after
installing
go down where you see load and load the
packages so this one i'm also going to
do
so i'm loading these packages so what i
do i highlight
okay
i highlight and click run
then in the background they are loading
okay
so in case you've not installed any
package
don't load it first of all
first install it if you see there's
anything
that you've not
uh
installed
just get the name and write
install.packages in bracket inverted
commas
you write the name of that package
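A minimal sketch of checking which packages are missing before installing, so that nothing already installed is rerun. The package names are as best heard in the session and may not match the script exactly; the install and load lines are left commented out so the sketch itself downloads nothing.

```r
# package names as heard in the session; adjust to match your own script
wanted <- c("tidyverse", "readxl", "readr", "dplyr",
            "plotrix", "epiDisplay", "ggplot2", "ggpubr")

# installed.packages() lists what is already on this machine
have       <- rownames(installed.packages())
to_install <- setdiff(wanted, have)

# install only what is missing
# if (length(to_install) > 0) install.packages(to_install)

# then load each package that is installed
# for (p in intersect(wanted, have)) library(p, character.only = TRUE)
```

Printing `to_install` shows which names still need `install.packages()`.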
so in the second part here where i put
load you can see i've tried to explain
what each of this package does
tidyverse for data manipulation
lana r
exports data from r to excel
dplyr helps to compute summary
statistics
you know so the meanings are there
readxl
plotting 3d
graphs
data visualization that is ggplot2
then
ggpubr
used to create a graph
ok
so you can see that i have added
ggpubr
in case you find it's not part of those
which i have highlighted as install
just copy the name
and go up
and install it
so i'm going to write install
dot packages
once you start writing it highlights and
then i'm clicking
that i need inverted commas
there we go
so in case you find that you've run
a package and it says it doesn't exist
go up where the install
commands are
then you put in install.packages
in the bracket
open
i mean inverted commas
ggpubr
so this line wasn't there you can add it
i have highlighted it in blue
is it visible can i see in the chat
are we together
please share the script
the script is in the google drive
if you go to the google drive
my colleagues are going to send out the
google drive having issues what issues
please highlight the issues in the q and
a
child says i wait
yes we are together okay
so make sure that you first of all
install the packages
then load the packages and in the
process of loading if anything says that
it's not
available that means you've not
installed it
and you go
and install it
mary put the issues in q and a
write in q and a the issues will be
sorted out by my colleagues
okay
or you you've posted it here
okay you've not installed all these
packages you said no package lana r
no package
uh tidyverse so make sure you run the
first part of the script which says
install error in library dplyr because it
requires some package before it loads
no package readxl
no package all these the error is
actually minimal
just have to run
the first part
mary you see this first part here of the
script where i've said
install packages
okay
install packages highlight you can
highlight all of them like that like
what i'm doing now
and then click the run button
okay
setting the working directory you can use
the quickest way
go to the menu bar
on session
set working directory
and
then choose directory so those who
would like to set the working directory
go to session
set working directory
choose directory that's where
i am
and then click choose
directory
so when you choose directory
then go
browse where your
you've
you've kept your your your data set
employee
so like me mine is in day 2 so i click
day 2 and click open
so by doing that you'll see
in the console
the
command set working directory it is
already highlighted
okay
okay
so can we go ahead so after you've
installed the packages you've set the
working directory go to load
load the packages again
so for those who have
loaded their packages make sure you've
set your working directory and to set
your working directory you can use
the session menu
session
at the menu bar click set working
directory and then you click choose
directory you choose by
browsing your documents where your
folder is
and
then you click on that folder
mine is called day 2 and click open so it
sets your working directory
so are we together in the chat can i
continue
repeat setting working directory my
run button is not working
okay yes
so i'm going to continue
if you have a query put it in the queue
and my colleagues are there
they are ready to help you
okay
so we've set our working directory
you can actually run the command so in
this case
don't run what you see
what you see in the script is my
directory don't run that but you can
copy what you see in the console and
paste it here
and run
so what i have left in the script
is my information
i was trying to do both so in this case
i have actually copied
the
command line in the console for set
working directory and if you run
you'll have the same thing
so after that let's go to read
the
csv file
so in case it is excel you use a
different
command line but now we are reading
employee which is a csv file so just
put your cursor on line 33
and click run
once you click run you will see that in
the environment window it is saying that
employee has 474
observations and 10 variables
to see your data set you can either say
head
which will print the first six rows so
if we click
if you click run at line 34 you will see
the
first six rows of your data set
so in the chat has everybody
uh read in their data please repeat i
just come back and i'm lost
okay
so what i'm saying is that in the script
that you already have called eda
put your cursor on line 33
i am assuming you've set your working
directory to where the employee file is
so put your cursor on line 33 and click
run
once you click run you use head to print
the first six observations
there you go
i am assuming that everybody is fine
you can use other
commands to view your data set
head is one of them
str describes the structure of your data
so if we run str
it's going to tell you what is in the
data so
if i
can
make this become a bit bigger
you can see that in my console
it says data frame
474
observations and 10 variables we have id
gender is character and consists of m
and f
birthday is a character
education
is an integer
job category is a character salary is an
integer
so we can change
all those characters to factors
because we need to use them in the
analysis
so let's let's see what is going to
happen
okay so we can also use head employee comma three
if we want only three observations
you can specify
if you put class employee it's going to
tell you the nature of the data set
normally used when you've converted
your
characters into factors and you want to
confirm whether they've changed to
factors you can use class employee
number of rows displays the number of
rows in the data set
number of columns displays the number of
columns in the data set
and then column names to tell you the
column names in that data set
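The reading and viewing commands described above can be sketched as follows. The file `employee_toy.csv` and its values are hypothetical, written to a temporary folder only so the sketch is self-contained; in the session the file is the employee csv in the working directory.

```r
# write a tiny hypothetical csv so the sketch runs on its own
toy_path <- file.path(tempdir(), "employee_toy.csv")
write.csv(data.frame(id     = 1:4,
                     gender = c("m", "f", "f", "m"),
                     salary = c(57000, 40200, 21450, 21900)),
          toy_path, row.names = FALSE)

employee <- read.csv(toy_path)   # read the csv file into a data frame

head(employee)        # first six rows (here there are only four)
head(employee, 3)     # first three rows only
str(employee)         # structure: the type of every column
class(employee)       # "data.frame"
nrow(employee)        # number of rows
ncol(employee)        # number of columns
colnames(employee)    # column names
```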
so this i think my colleague really did
a good job
in
telling you what r does the language
so let's go to descriptive statistics
today we saw that
our exploratory statistics consist of
numerical and non-numerical
ways of summarizing data
and under the numerical we have central
tendency which is mean median mode then
under spread or
variability we have range interquartile
range variance then we can also
summarize our data using frequency
tables
and under the graphical methods we shall
see
we have box plot we have the histogram
the pie chart
and the bar graph
okay so we can produce summary
statistics
using
the
code summary or the command summary
and employee is the name of the data set
so if we all go to line 68
if you've not changed the script line 68
says summary
employee
and that gives us the summary statistics
of employee so let's look at the console
in the console
we have
oh
oh okay i'm sorry
i can go back
that
okay if you go to line
68
the code has summary and that shows that
you want summary statistics for the
whole of your data set employee
so if you go to line 68
and run
that line
you will see values appearing in the
console
for each variable
so
if we click
maximize the console
you will see
that for all the variables you have in
your data set
you have a summary for example there is
a
they are telling us we have gender
birthday
let's go to job time
so sal begin is beginning salary it
has been summarized
if you go to the console you see the
minimum beginning salary is 9 000
the median is fifteen thousand the
mean is seventeen thousand
and the maximum beginning salary is
seventy nine thousand seven hundred
eighty
then you have summary for job time
where the minimum job time is
63
and the maximum is 98 our job time is
in months so 63 months
and the maximum is 98 months
then we look at previous experience in
years the minimum is zero
and the maximum is
476 that is more than a year
so that is general summary
so you can click on this button here
to normalize
the
the windows so we want to get
summary statistics
for at least one variable
and it is neat and well presented so it
is
understandable so for us to summarize
our continuous variable we can use
measures of central tendency
measures of dispersion or a frequency
distribution table
so current salary is a continuous
variable and we want to know the minimum
value the maximum the mean
the range the variance and standard
deviation so if you look at line 71
it consists of the code
for writing the summary statistics for
salary
and
for you to get the minimum you write
min
m-i-n in bracket employee
and the dollar sign is calling the
variable salary which you want to
summarize
so let's run that that line
and also run the next one
so we've run two lines one with minimum
maximum mean another range variance
median so let's look at
the
values we've got in
so you can see in the console
the minimum salary current salary
is 15750
and it's giving us
the standard deviation sd
which is 17775
then on this next row we have the range
we have the variance
and the median value
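A minimal sketch of these one-variable summaries, using the dollar sign to call the variable. The six salary values are hypothetical, not the course data.

```r
# hypothetical salaries; the real script uses the employee data set
employee <- data.frame(
  salary = c(15750, 21450, 27900, 30750, 45000, 135000)
)

# first line of summaries: minimum, maximum, mean, standard deviation
min(employee$salary); max(employee$salary)
mean(employee$salary); sd(employee$salary)

# second line of summaries: range, variance, median
range(employee$salary)   # smallest and largest value together
var(employee$salary)     # variance is the standard deviation squared
median(employee$salary)
```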
okay
so let's go ahead and see
what else we can do with our data set
so how about if you have a categorical
variable like gender
if you have a categorical variable we
cannot get the minimum the maximum of
gender
so in that case we want to
to construct a table
for gender
so you put your cursor on line 75
line 75 and run
so what we are going to get is just a
frequency
and when i look at my console i've
made it a bit bigger
so in the console i have f and m
where f is female m is male and they've
given me the frequency
216 and 258
that is the number of employee
number of female number of male
in the company
so in the chat before we go
should i continue
any problem so far
currently set working directory and
then copy paste the csv file
ok that comment maybe it's not meant for
me
please continue so far so good
should i continue
says yes yes
okay
okay
somebody got lost you need to tell us
where you got lost from
i hope you're not on the streets of
kampala
yeah
okay
okay so let's continue and if you're
lost
if you're lost the youtube is available
you can actually play
everything right from the beginning
where helen started
so we've we've ended that summarizing
gender
and what we got is just the frequency
of gender female male
so how about we have
job category which is also categorical
and we want to run the frequency
so
we name
the results of the table job
and then we say job is assigned table
in bracket employee dollar jobcat
so that is line 78
click run
and you can print job
and it also just gives us the
frequencies
we want the proportions
we want
the proportions
we want the proportions so what we are
going to use
is to use prop dot table
in bracket job
gives us the proportions for each
category
so if you run
line 80
you will see
the proportions for each category but we
can't present proportions so what we are
going to do is convert them to
percentages
and we are going to round off to the
fewer decimal places because there are so many
so the next line line 81
it says round in bracket prop dot table
in bracket job
close bracket times 100
at two decimal places
so run line 81
if you run it you see that you have
a percentage for cleric
percentage for custodian
and percentage for manager so in that
particular company we have 76.58 percent
cleric
and there are fewer custodians that is
5.7 percent
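The three steps for a categorical variable (frequencies, proportions, rounded percentages) can be sketched like this. The ten job categories below are hypothetical, so the percentages differ from the ones read out in the session.

```r
# hypothetical job categories for ten employees
jobcat <- c(rep("cleric", 6), "custodian", rep("manager", 3))

job <- table(jobcat)     # frequencies for each category
job

prop.table(job)          # proportions that add up to 1

# convert to percentages and round to two decimal places
pct <- round(prop.table(job) * 100, 2)
pct
```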
okay so that was summary statistics
numerical summaries so let's see what
the graphs say
okay
so remember we can construct a bar chart
we can construct a histogram
we can construct
a pie chart
so we will start with the categorical
variables and we start with gender we
want to construct a bar graph for
gender
and the
command is barplot in bracket table in bracket employee
dollar gender then you close the brackets
so we are going to run line 89
and if you look in the
extreme bottom corner
you will see a graph
and the graph
is a small one
so maybe
is it visible
is the graph in the plots window visible
let me see in the chat is it visible
it's saying yes okay
thank you
so the graph is visible
and you can see that
they are
more male than female but remember
you've plotted numbers not percentages
okay
you have plotted
numbers not percentages
okay
then dev
dot off
makes sure that
our
graph has been deleted
so let's continue
so remember that when we do our box plot
our plots were looking up
that is when we are using job category
so probably if you want to make them go
horizontal
use the same command but add on comma
horizontal equals true
then
it will flip them
so if you run line 96
so that's how your table i mean your
graph is going to look like
now what if you don't want them all to
be gray you want to give them color
you can actually use the color command
and make
them
color
okay
so here we want also to construct a pie
chart
and the command is pie
in bracket table
in bracket
employee dollar jobcat and then
you
use different colors so i'm going to run
line line 91
and
line 91 gives me
this small thing here which i'm going to
zoom in
so let me stop sharing
and share
so this is what we've gotten it has
cleric
showing the largest part
manager and custodian showing the
smallest part
okay
okay so let's share
the our screen
if somebody has a question
hmm
oh okay okay why do we use white gray
night these are the
colors in
r
in the palette which have already
been named
so you can go in the palette and choose
any other color you can actually use
blue
red green
okay
so if we if we change we can say
white let's put here
red
and then put here
blue
and run
so hopefully you're following
so we run that
and you can see that
our pie chart has changed blue
red and white
okay
so all these
colors are in the palette within r
you can specify whatever you want but
you know that for when you're doing
manuscripts you need to use
shades which are not
colored unless you're going to pay
for colored
graphs okay
so we can go on
and
construct
construct a table with percentages we've
already seen that
so the next getting percentages for job
category and gender
so we've said call is the object
we want to tabulate gender i mean
jobcat
so if you run that
and then run the object it gives you the
frequencies
prop dot table is giving you proportions
and then when you multiply by a hundred
you're getting
percentages
when you round off you're reducing the decimal places
and then you can now construct a bar
graph
with percentages so you can see that now
our bar graph instead of having 250 as a
maximum value it has
it has 70.
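A minimal sketch of the bar graph and pie chart commands from this part, plotting percentages instead of counts. The data and the colors chosen are hypothetical; `horiz = TRUE` is the base R `barplot()` argument that flips the bars.

```r
# hypothetical job categories for ten employees
jobcat <- c(rep("cleric", 6), "custodian", rep("manager", 3))
pct <- round(prop.table(table(jobcat)) * 100, 2)

pdf(NULL)   # null device so the sketch runs without a screen

# bar graph of percentages, one named color per bar
mids <- barplot(pct, col = c("white", "gray", "black"), ylab = "percent")

# same bars flipped to go horizontally
barplot(pct, horiz = TRUE, col = "blue")

# pie chart of the counts with a different set of colors
pie(table(jobcat), col = c("blue", "red", "white"))

invisible(dev.off())   # close the device, which clears the graph
```

Running `colors()` in the console lists the color names R already knows.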
okay
so we can
now plot a scatter plot
for salary
and then histogram because now salary is
continuous
we can now construct a histogram we
cannot draw a histogram
for gender
because the aim of a histogram is
showing distribution but if you already
have a categorical variable you're going
to have two bars and you can't even tell
the distribution of the data set so
that's why you see we've not constructed
a histogram for categorical variable
so as i said the type of variable also
helps you to determine what type of
summary statistics
or exploratory data analysis you're going
to use
okay
so if we do plot salary
we are just plotting one variable you
can see what comes out is not really
very good it's just
a plot of random scatter
of points
no pattern because we have only salary
in it
when we use hist
we are telling r to draw a histogram of
salary
and if you click on line 120
you get a histogram so let me zoom into
this histogram
so when you look at this histogram what
do you see what distribution do you see
in that chat
what distribution do we see in the chat
can you tell us
what distribution do you see in the chat
skewed to the right
what distribution do we see in that chat
somebody says normal
skewed to the left skewed to the left
okay
thank you very much
okay so when we look at this graph here
if the longer tail is on the right hand
side
then the graph is skewed to the right
this side you see is the right
and this side you see is the left
so you can see our longer tail is going
to the right side
so this is a right skew or positive skew
and in case we get another data showing
left skew
there the the
very few values are going to be on the
left side and we call it a left skew but
this one
is skewed to the right
okay
so i'm going to stop sharing
okay so back to the script
we've done a histogram and we've seen
that salary is skewed to the right
so that tells us that we cannot use the
mean
as a measure of central tendency because
those outliers are going to affect the
value of the mean
the recommended measure of central
tendency is the median
which is not affected by
outliers
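A minimal sketch of why the median is preferred for right-skewed data: a couple of very high values pull the mean up but leave the median where most of the data sits. The salaries are hypothetical.

```r
# hypothetical salaries with two very high values on the right
salary <- c(16, 17, 18, 19, 20, 21, 22, 24, 60, 95) * 1000

pdf(NULL)   # null device so the sketch runs without a screen
h <- hist(salary, main = "salary", xlab = "salary")
invisible(dev.off())

mean(salary)     # pulled upwards by the two high values
median(salary)   # stays in the middle of the bulk of the data

# for a right skew the mean sits above the median
mean(salary) > median(salary)
```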
so we can plot salary and this time we
are saying type is equal to 0 and the
col
is equal to let's see what
line 123 what it shows us
so you can see instead of having black
we specify that all those points should
become
blue
you can add
font size you can increase the font size
you can add that
title
so if in the next
the next line shows you how to add the
title
you can see salary has popped up once
you run
the title i say title in bracket main
equals salary
col dot main equals red
and i'm just telling r that my main
should be red
and the fonts
the font of main is four
okay
any question in the chat let me see in
the chat
any questions so far
any questions so far
okay so let's go to the next part let me
see how many minutes we are left with
we still have at least 20 minutes
let's do the next
okay
thank you
attendance for registration
i've been informed that because you're
very few of these 216 you've already
been registered on zoom
so you don't need attendance
okay
so let's look at relationships between
two variables
either qualitative
or quantitative
so we can use a box plot
to show relationship between
variables or we can draw a box plot for
one variable which is continuous so in
this line here line 133 we are saying
draw a box plot of salary
and salary is a function of job category
the data set is called the employee
the main title is salary
by
job category
name x-axis as job category and y-axis
as salary
okay so let's run that and see
so those are the box plots you've
obtained let's zoom in them
okay
are the box plots now visible
can you see them
okay
so you can see that what we observed in
the notes is what we see here
we see that cleric
are skewed to the right
the values from the black line here in
the middle of every box is the median so
from the median climbing up is the right
side and below the median is the left
side so you can see the right side is
longer than the left side meaning the
data of cleric is skewed to the right
custodian is not obvious to see
then manager is also skewed to the right
because the observations we observe that
the tail on the right is longer than the
tail on the left
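The box plot of salary as a function of job category can be sketched with the formula interface of base R `boxplot()`. The data frame and its values are hypothetical, and the column names `salary` and `jobcat` are assumptions about the script.

```r
# hypothetical data: five clerics, three custodians, four managers
employee <- data.frame(
  salary = c(20, 22, 25, 30, 48, 28, 30, 31, 55, 60, 70, 110) * 1000,
  jobcat = rep(c("cleric", "custodian", "manager"), times = c(5, 3, 4))
)

pdf(NULL)   # null device so the sketch runs without a screen
b <- boxplot(salary ~ jobcat, data = employee,
             main = "salary by job category",
             xlab = "job category", ylab = "salary")
invisible(dev.off())

# the third row of b$stats holds the median drawn as the line in each box
b$stats[3, ]
```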
okay so we stop sharing that
and share
the script
so that is using the command boxplot
if you want to use ggplot
to construct a scatter
and also box plot
for you to use ggplot
you write ggplot in brackets the name of the
data set
plus the geom underscore point indicating
that
that you want to make it a beautiful
graph
and x is called
sal begin y
is just salary so let's run 140
and we see a beautiful graph coming up
which is a relationship between salary
and
beginning salary you can see it is a
strong positive linear relationship
okay
so you can also use ggplot but instead
of adding the geom right here you add it
at the extreme and this is what the
second line is saying
so if you run the second line
it gives the same thing
okay so you can use the gg plot instead
of the plot command
and
the last bit of this i hopefully it is
the last
we want to convert
job category and gender into factors so
that we can use it in our descriptive
statistics using numerical summaries
so remember i said str is to check the
structure of the data set
so if we put it in
it will show us
that
gender is
a character
and birthday is a character job card is
a character
your educational number of levels is not
a character it is an integer
so we want to change them to a factor
okay so job category for us i remember
my colleague telling you that you use as
dot factor which is okay
here i've used factor jobcat
the code factor which is also fine
so we can run line 146
and you can see that
if you also run the class it's going to
tell us that job cut has been converted
to a factor as well as
gender
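A minimal sketch of converting character columns to factors and confirming the change with `class()`. The data values are hypothetical and the column names are assumptions; both `factor()` and `as.factor()` do the job here.

```r
# hypothetical data with two character columns
employee <- data.frame(
  gender = c("m", "f", "f", "m"),
  jobcat = c("manager", "cleric", "cleric", "custodian"),
  stringsAsFactors = FALSE
)

class(employee$jobcat)                       # "character" before converting

employee$jobcat <- factor(employee$jobcat)     # using factor()
employee$gender <- as.factor(employee$gender)  # using as.factor()

class(employee$jobcat)    # "factor" after converting
levels(employee$gender)   # the categories, in alphabetical order
```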
so let's run descriptive statistics by
groups
now we couldn't do descriptive statistics by
groups if we hadn't converted job
category and
and gender to
a factor
so if we want to run
summary statistics by group
then we write employee
we put this pipe
where it is like
a conduit in which you're passing your
commands
you don't want to write one by the one
you want to write all of them at go
so employee pipe
the pipe command group by
job category again the pipe
dplyr
colon colon summarize
and then you want the count
you want mean for salary
n a dot r m equals true
standard deviation
for salary
then
n a dot r m equals true
okay so let's run line 152
and see what it gives us
so you can see that
in this case
we are going to have
our
job category with the mean
salary
so you have the cleric and their
corresponding salary and standard
deviation and count
custodian corresponding salary and
standard deviation
manager their corresponding salary and
standard deviation
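The session's script does this with the dplyr pipe. As a hedged alternative that needs no extra package, the same count, mean and standard deviation by job category can be sketched in base R with `tapply()`. The data and the object name `by_job` are hypothetical.

```r
# hypothetical data: five clerics, three custodians, four managers
employee <- data.frame(
  salary = c(20, 22, 25, 30, 48, 28, 30, 31, 55, 60, 70, 110) * 1000,
  jobcat = factor(rep(c("cleric", "custodian", "manager"), times = c(5, 3, 4)))
)

# one summary per job category; na.rm = TRUE skips missing salaries
count   <- tapply(employee$salary, employee$jobcat, length)
mean_sal <- tapply(employee$salary, employee$jobcat, mean, na.rm = TRUE)
sd_sal   <- tapply(employee$salary, employee$jobcat, sd, na.rm = TRUE)

# collect the three summaries into one table, one row per category
by_job <- data.frame(jobcat = names(count),
                     count  = as.vector(count),
                     mean   = as.vector(mean_sal),
                     sd     = as.vector(sd_sal))
by_job
```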
okay so let's go back
to the script
and i'm assuming there are no questions
so let me see in the chat
but can you also ask the question in q
and a for the panelists yes
please put their questions in the q and
a
they stay longer and they can be
answered
okay
okay so should i continue let me see how
many minutes i have
only a few 14 minutes left
okay
thank you i'm going to continue
and as soon as it reaches 5 minutes to 2
i'll stop
okay
so we want to construct a box plot of
salary as a function of job cut
we've already done that
but let's see what this box plot gives
us
we run line 159
and it's the same box plot we've seen so
far
but we can also use a gg plot to
construct a box plot
if you run line 162
the box plot now
has grids has all those beautiful
things in it
so it's clearer
i hope you can see the box plot in your
plot
window
and now if you want to construct a box
plot
but then you also want the box plots to
have different colors
you can go into the palette
and open up whatever color you want so
it is the next command
box plot using the function
the palette function
so
we run that line 165
and this time we get box plot with
different colors
okay
now if you don't know what palette is
you can put question mark palette
then it's going to give you information
about
the the palette
but you can see it has given us
different colors
so in your research if you're allowed you
can have the different colors for the
palette for the different groups
so if i run question mark i'm asking
myself what is a palette
and it is going to tell me
in the plot section again
it's going to give you information about
the palette
the colors that are in a palette
okay then graphics dot off is going to clear
the graphics window
so if we run off
we will see that under our plots we
don't see anything
okay
then
we want to construct a scatter plot
for beginning salary and current salary
and add the regression line you can see
add equals reg dot line
and the confidence interval is equal to true
you also add the correlation
coefficient
so all this is specified in this line 175
and when we run
we get a scatter plot let me zoom in
and share
so you can see that it has
printed r you requested the
correlation coefficient
you requested a p value
it has printed a line
on your data set if you really need it
if you don't you don't need to put the
line there
so that's what we've done there
the p-value is small you can see
it is
2.2 times 10 to the power minus 16
which is really small
showing us
a strong linear relationship between
current salary and beginning salary
okay so i'm going to stop sharing this
and i'll share the screen
okay
so here there's a bit of interpreting of
the results
we say the p-value of the test is two
point by the way the test was testing
the
null hypothesis no linear relationship
between current salary and beginning
salary
alternative there is a linear
relationship between
current salary and beginning salary
so we got a p-value of less than 2.2 times 10 to the power
minus 16
which is smaller than 0.05 the
significance level
and we conclude that
there is sufficient statistical evidence
to reject the null hypothesis
and
conclude
there is a linear relationship between current salary
and beginning salary
the r value is the correlation
coefficient
okay
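The same test can be run directly with base R's `cor.test()`, which reports the correlation coefficient and the p-value for the null hypothesis of no linear relationship. This is a sketch on simulated data with hypothetical variable names, not the session's own output.

```r
# Simulated stand-ins for beginning and current salary; not real data
set.seed(1)
salbegin <- round(runif(50, 9000, 40000))
salary   <- round(1.9 * salbegin + rnorm(50, 0, 5000))

# H0: no linear relationship (correlation = 0)
# H1: there is a linear relationship (correlation != 0)
test <- cor.test(salary, salbegin, method = "pearson")

r_value <- unname(test$estimate)  # Pearson correlation coefficient r
p_value <- test$p.value

alpha <- 0.05                     # significance level used in the session
reject_null <- p_value < alpha

print(test)
```

R prints very small p-values as `p-value < 2.2e-16`, which is the figure quoted in the session.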
so i had my colleague teaching you that
you can drop columns from the data set
now i want to only construct i want to
have only continuous
variables so that i can
run correlations for many variables and
also draw multiple
graphs
so what i'm going to do i'm going to
divide the categorical variables put
them in one section and also have
the continuous variables in another
section so this is the last part
so
the object where we are going to
store our
continuous data is called employee1 and it is
equal to the subset
and the data is employee the original
data select gender
job category
what else
birth date
id
minority all things that i don't need
then you close them up
and let's see what happens next
and then you head so you've subset
you've removed gender job category
and
birth date from the data set
employee and the new data set is called
employee1 so let's run line 198
and head
so if you
view your data set we are going to see
that we only have educational level
current salary beginning salary job
time and previous experience only
so that is the data i want to deal with
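A sketch of dropping the categorical columns with `subset()`, assuming the unwanted columns are excluded with a minus sign in `select`. The column names below are hypothetical short names for the variables mentioned, and the four rows are made-up values for illustration.

```r
# Hypothetical miniature version of the employee data; values are made up
employee <- data.frame(
  id       = 1:4,
  gender   = c("m", "f", "f", "m"),
  bdate    = as.Date(c("1952-02-03", "1958-05-23", "1929-07-26", "1947-04-15")),
  educ     = c(15, 16, 12, 8),
  jobcat   = c("Manager", "Clerical", "Clerical", "Custodial"),
  salary   = c(57000, 40200, 21450, 21900),
  salbegin = c(27000, 18750, 12000, 13200),
  jobtime  = c(98, 98, 98, 98),
  prevexp  = c(144, 36, 381, 190),
  minority = c(0, 0, 1, 0)
)

# Keep only the continuous variables by dropping the ones that are not needed
employee1 <- subset(employee, select = -c(id, gender, bdate, jobcat, minority))

head(employee1)
```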
if you just
run the command it's not going to run
so here
we call our object res and we want
correlations for employee1
so we run
line 201
and
it gives us
the
correlation
for employee you can see there are many
because they have so many decimals we
are going to reduce them
to two
so we run
and it gives us
it gives us the
correlation values with the two decimal
places
for each variable
so
current salary and beginning salary
you can see the correlation is 0.88
as we had seen before current salary and
education is 0.66
current salary and job time
0.08
current salary and
previous experience is negative
0.1
okay
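The two steps above, computing the correlation matrix and then rounding it to two decimal places, can be sketched with `cor()` and `round()`. The object name `res` and the column names are assumptions, and the data are simulated, so the coefficients will not match the ones quoted in the session.

```r
# Simulated stand-in for the continuous-only data set; not real data
set.seed(1)
n <- 50
salbegin <- round(runif(n, 9000, 40000))
employee1 <- data.frame(
  educ     = sample(8:21, n, replace = TRUE),
  salary   = round(1.9 * salbegin + rnorm(n, 0, 5000)),
  salbegin = salbegin,
  jobtime  = sample(63:98, n, replace = TRUE),
  prevexp  = sample(0:400, n, replace = TRUE)
)

# Pearson correlations between every pair of columns (all must be numeric)
res <- cor(employee1)

# Same matrix with two decimal places, which is easier to read
res_rounded <- round(res, 2)
print(res_rounded)
```

`cor()` stops with an error if any column is not numeric, which is why the categorical variables have to be removed first.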
so those are the correlation values if
you want to get them from your data set
remember if you have some categorical
variables you need to remove them first
so for you to have a correlation matrix
with the p values
you run the function
rcorr and it is found in the
Hmisc
package
okay
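A short sketch of `rcorr()` from the Hmisc package, which returns the correlation matrix together with a matrix of p-values. It expects a numeric matrix, hence the `as.matrix()` call. The data and column names are simulated stand-ins, as in the earlier sketches.

```r
library(Hmisc)

# Simulated stand-in for the continuous-only data set; not real data
set.seed(1)
n <- 50
salbegin <- round(runif(n, 9000, 40000))
employee1 <- data.frame(
  educ     = sample(8:21, n, replace = TRUE),
  salary   = round(1.9 * salbegin + rnorm(n, 0, 5000)),
  salbegin = salbegin,
  jobtime  = sample(63:98, n, replace = TRUE),
  prevexp  = sample(0:400, n, replace = TRUE)
)

# rcorr() returns a list: r (correlations), n (pair counts), P (p-values)
res_p <- rcorr(as.matrix(employee1), type = "pearson")

print(round(res_p$r, 2))  # correlation coefficients
print(res_p$P)            # p-values for each pair of variables
```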
so i can see we only have five minutes
and uh
i have my homework so i'm going to stop
sharing and share the homework
so let me share the homework
in the five minutes you can
okay so tomorrow when we come
this is the homework you've done
exercise one exercise two
is using the employee data set
and a graphical procedure to display the
relationship between
two quantitative variables
two qualitative variables
a qualitative
variable and a quantitative variable
so
it is only one question
you can choose the variables of your
your preference
so that is
the exercise we are going to do
for tomorrow morning
so please copy it
let me try to put it also in the chat
let me go in that chat
and
put it in the chat
before they switch off
exercise two
open
so it's there in the chat
it says disk error oh it is open
the let me
oh i can't do i can't open it and also
put it in the chat as well so please
just copy what you see
try to copy
and i thank you for being patient
we've cruised through the work hopefully
you've picked something
and tomorrow let's see the presentation
of those two
three questions and then we will
summarize
thank you for being excellent
participants
my colleagues have said that you didn't
ask so many questions that means that
you are understanding everything
and we believe that
this training is
just a recap of what we've been doing
and those who have been
on the previous sessions are just
revising
and if not then you're very very
excellent
you're good thank you very much
okay so
i see you tomorrow
don't miss
doctor
thomas odong is coming in with new ideas
and they're important for all of you
thank you
the only
they refused me to share what they were
talking i wanted to tell you about the
refusal but tomorrow morning i'll make
sure i'm the first on the panel
and share with you kwaheri
thank you
kwaheri
the youtube is available by the way even
now
you can go there
and watch
slowly
and understand
if you didn't pick so much from the
training
use that youtube
okay
the links have been sent for both day
one and day two
so please practice
and
practice
practice and practice thank you so much
you've been excellent
the materials are in the google drive
and even for tomorrow by the time we
start the sessions
dr thomas would have updated the google
drive
thank you so much
and uh helene thomas has sent the google
drive thank you thomas
and where is the attendance has been
captured
on the zoom because we are very few you
don't need to register every day once
you attend
you will be able to be registered
automatically
okay
thank you thank you
uh
hey
i can't read this but maybe tomorrow
justina you can tell me what it means
okay
okay can you allow me to stop here
i'm gonna mute myself
and i wish you a good evening
see you tomorrow