Day 5 of Advanced Statistics and Experimental Design Training: Categorical Data Analysis, Chi‑Square Testing, and Survey Sampling Overview

Name: Advanced Statistics and Experimental design Day 5
Uploaded: 2026-01-14T15:49:07.370531+00:00
Channel: RUFORUMNetwork
Description: Summary and key takeaways on Day 5 of Advanced Statistics and Experimental Design Training: Categorical Data Analysis, Chi‑Square Testing, and Survey Sampling
YouTube video ID: ILx2Dq4NeQI
Source: YouTube video by RUFORUMNetwork
Introduction

The fifth and final day of the World Bank‑funded Advanced Statistics and Experimental Design training took place online, hosted by the Center of Excellence in Agri‑Food Systems and Nutrition, Mozambique. Participants and facilitators gathered for a wrap‑up session that combined administrative updates with substantive statistical content.
Administrative Announcements

Contact collection: Participants were asked to provide two phone numbers (WhatsApp and an alternative) for post‑training communication and future training invitations.
Certificates: The team acknowledged delays in issuing certificates for the first two modules and assured that certificates for the current module would be released soon.
Evaluation form: A short online evaluation form (described in the session as a "monkey form", presumably SurveyMonkey) would be sent at the end of the session; participants were encouraged to complete it.
WhatsApp groups: Two active WhatsApp groups already existed for peer support; facilitators would add any missing contacts.
Future training: Facilitators expressed willingness to travel for face‑to‑face sessions in Mozambique if requested.
Categorical Data: Concepts and Visualization

Definition: Categorical (qualitative) variables assign labels to observations (e.g., gender, marital status, preference categories). They can have two or more levels and are non‑numeric.
Frequency tables: Counts of each category are tabulated; relative frequencies (percentages) are obtained by dividing by the total sample size.
Visualization options:
Bar charts (vertical or horizontal)
Pie charts
Segmented bar charts (stacked bars)
Side‑by‑side bar charts for comparing groups
Example: A sample of 40 students chose a preferred attribute (Famous, Happy, Healthy, Rich). The frequencies (7, 21, 4, 8) were converted to percentages for reporting by dividing each by 40, giving 17.5 %, 52.5 %, 10 %, and 20 %.
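The count-to-percentage step in this example can be sketched in a few lines. The training itself worked in R; the snippet below is an illustrative Python equivalent using only the standard library. The level order (Famous, Happy, Healthy, Rich) follows the session transcript, and the helper name `relative_frequencies` is hypothetical.

```python
def relative_frequencies(counts):
    """Return each level's share of the total, as a percentage."""
    total = sum(counts.values())
    return {level: 100.0 * n / total for level, n in counts.items()}


# Frequencies for the 40-student preference example (order as read out
# in the session: Famous, Happy, Healthy, Rich).
preference_counts = {"Famous": 7, "Happy": 21, "Healthy": 4, "Rich": 8}
percentages = relative_frequencies(preference_counts)

# Print a small frequency table: level, count, percentage.
for level, pct in percentages.items():
    print(f"{level:<8} {preference_counts[level]:>3} {pct:5.1f}%")
```

Reporting the count and the percentage side by side, as the loop does, matches the advice given in the session for presenting categorical data in a table.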
Cross‑Tabulation and Marginal/Conditional Distributions

Cross‑tabulation (contingency table): Summarizes the joint distribution of two categorical variables (e.g., gender vs. chance of becoming rich). The table can be 2×3, 3×3, etc., depending on the number of levels.
Marginal distributions: Row‑wise or column‑wise totals that describe the distribution of each variable independently.
Conditional distributions: Percentages within a row (or column) that show the distribution of one variable given a specific level of the other.
Interpretation: By examining marginal and conditional percentages, researchers can assess patterns such as whether males report a higher perceived chance of wealth than females.
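As a rough illustration of marginal and conditional distributions, the sketch below uses the 2×3 termite table read out in the session transcript (attacked: 193, 148, 210; not attacked: 307, 352, 290). It is written in Python with the standard library only, although the session used R; the variable names are illustrative assumptions.

```python
# Observed counts from the termite example in the transcript.
# Rows are termite status; columns are groups 1, 2, 3.
observed = {
    "Attacked": [193, 148, 210],
    "Not attacked": [307, 352, 290],
}
groups = ["Group 1", "Group 2", "Group 3"]

# Marginal totals: sum across each row and down each column.
row_totals = {status: sum(counts) for status, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
n = sum(row_totals.values())

# Marginal distribution of termite status (ignores group).
marginal_status = {status: total / n for status, total in row_totals.items()}

# Conditional distribution of status given group: each column's counts
# divided by that column's total.
conditional_given_group = {
    group: {status: observed[status][j] / col_totals[j] for status in observed}
    for j, group in enumerate(groups)
}

for group, dist in conditional_given_group.items():
    print(group, {status: round(p, 3) for status, p in dist.items()})
```

Comparing the conditional distributions across groups is what reveals a pattern; the marginal distribution alone only shows how common each status is overall.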
Chi‑Square Test of Independence in R

Purpose: Tests the null hypothesis that two categorical variables are independent (no association). The alternative hypothesis states that an association exists.
Test statistic: \(\chi^2 = \sum \frac{(O - E)^2}{E}\), where O are the observed frequencies and E are the expected frequencies under independence, calculated from the marginal totals as \(E = (\text{row total} \times \text{column total}) / n\).
Degrees of freedom: \((r-1)\times(c-1)\) for an \(r\times c\) table.
Decision rule: Compare the calculated \(\chi^2\) value to the critical value from the chi‑square distribution (or use the p‑value). If \(\chi^2_{calc} > \chi^2_{crit}\) or p < α, reject the null hypothesis.
Example: A 3×2 table of academic rank (Assistant, Associate, Professor) vs. salary category (High, Low) yielded \(\chi^2 = 23.13\) with 2 df, p < 0.001, indicating a statistically significant association.
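To make the formula concrete, here is a minimal Python sketch (standard library only) that computes the expected counts, the statistic, and the degrees of freedom by hand. It uses the termite counts from the session transcript purely as illustrative input; the resulting statistic is not a figure reported in the session. The function name `chi_square_independence` is hypothetical, and the closed-form p-value applies only to the 2 df case.

```python
import math


def chi_square_independence(table):
    """Return (statistic, df) for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / n
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Rows: attacked / not attacked by termites; columns: groups 1 to 3.
termite_table = [[193, 148, 210], [307, 352, 290]]
chi2, df = chi_square_independence(termite_table)

# With exactly 2 degrees of freedom the upper-tail probability has the
# closed form exp(-x / 2); other df values need a statistics library.
p_value = math.exp(-chi2 / 2) if df == 2 else None

print(f"chi-square = {chi2:.3f}, df = {df}, p = {p_value:.2e}")
```

In R the same table passed to chisq.test() should report the same statistic and degrees of freedom; the hand calculation is only meant to show where those numbers come from.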
Practical R Workflow for Categorical Data

Create factor variables: factor() converts character vectors to categorical factors.
Build a data frame: data.frame(var1, var2, ...) stores all variables together.
Cross‑tabulate: table(var1, var2) produces the contingency matrix.
Add margins: addmargins(table_obj, margin = c(1,2)) shows row and column totals.
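Proportions (multiply by 100 to report percentages):
prop.table(table_obj) → cell proportions (each cell divided by the grand total).
prop.table(table_obj, 1) → row proportions (each row sums to 1).
prop.table(table_obj, 2) → column proportions (each column sums to 1).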
Chi‑square test: chisq.test(table_obj) returns the test statistic, degrees of freedom, and p‑value.
Interpretation: Use the output to state whether the variables are independent and discuss practical implications.
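The cross-tabulation and row-proportion steps of this workflow can be mirrored outside R as a cross-check. The Python sketch below (standard library only) counts paired observations much as table(var1, var2) does and then computes row proportions in the spirit of prop.table(table_obj, 1). The eight observations and the variable names are invented for illustration and do not come from the session.

```python
from collections import Counter

# Hypothetical raw data: one entry per respondent (not from the session).
gender = ["F", "M", "F", "F", "M", "M", "F", "M"]
owns_phone = ["Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "No"]

# Count each (row level, column level) pair, analogous to table(var1, var2).
cell_counts = Counter(zip(gender, owns_phone))

row_levels = sorted(set(gender))
col_levels = sorted(set(owns_phone))
table = [[cell_counts[(r, c)] for c in col_levels] for r in row_levels]

# Row proportions: each cell divided by its row total.
row_props = [[count / sum(row) for count in row] for row in table]

print("columns:", col_levels)
for level, counts, props in zip(row_levels, table, row_props):
    print(level, counts, [round(p, 2) for p in props])
```

Sorting the levels gives a stable row and column order, similar to how R orders factor levels alphabetically by default.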
Survey Sampling Fundamentals

Population vs. Sample: The population is the full set of interest; a sample is a manageable subset used for inference.
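Sampling error and margin of error: Sampling error is the difference between a sample estimate and the true population value. The margin of error is the bound placed on that difference at a chosen confidence level, commonly reported as, for example, ±5 %.
Confidence intervals: For a proportion \(p\), a 95 % CI is \(p \pm 1.96\sqrt{p(1-p)/n}\). The Z‑value changes with the desired confidence level (for example, 1.645 for 90 % and 2.576 for 99 %).
Sample‑size formula for proportions: \(n = \frac{Z^2 p(1-p)}{E^2}\), where E is the desired margin of error expressed as a proportion (e.g., 0.05 for ±5 %).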
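A short worked example of the two formulas, sketched in Python with the standard library. The inputs (p = 0.5 as the most conservative planning value, a ±5 % margin, 95 % confidence) are common illustrative choices and are not figures quoted in the session; the function names are hypothetical.

```python
import math

Z_95 = 1.96  # standard normal quantile for 95 % confidence


def proportion_ci(p, n, z=Z_95):
    """Normal-approximation confidence interval for a proportion."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


def sample_size_for_proportion(p, margin, z=Z_95):
    """Smallest n whose margin of error does not exceed `margin`."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


# Planning value p = 0.5 maximises p(1 - p), so it gives the largest n.
n_required = sample_size_for_proportion(p=0.5, margin=0.05)
low, high = proportion_ci(p=0.5, n=n_required)

print(f"required n = {n_required}")
print(f"95% CI at that n: ({low:.3f}, {high:.3f})")
```

The formula gives 384.16 for these inputs, and rounding up to the next whole respondent is what keeps the achieved margin at or below the target.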
Types of Surveys

Cross‑sectional: Data collected at a single point in time (e.g., a one‑off questionnaire on student performance).
Longitudinal: Data collected at several points in time, including:
Trend surveys: Different samples drawn from the same population at multiple time points.
Cohort surveys: A specific cohort (a group sharing a defining event or period, such as an intake year) is followed across periods.
Panel surveys: The same households or respondents are surveyed repeatedly.
Sampling Techniques

Technique	Key Idea	Typical Use
Simple Random Sampling	Every element has equal probability of selection	Baseline surveys when a complete sampling frame exists
Systematic Sampling	Select every k‑th element after a random start	Easy to implement with ordered lists
Stratified Sampling	Divide population into homogeneous strata, sample within each	Improves precision when strata differ markedly (e.g., urban vs. rural)
Cluster Sampling	Sample whole groups (clusters) and survey all members within selected clusters	Cost‑effective for geographically dispersed populations
Multistage Sampling	Combine two or more methods (e.g., stratify → cluster → systematic)	Large‑scale national surveys
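The selection rules in the table can be sketched with Python's standard `random` module. This is a minimal illustration under stated assumptions: the sampling frame is a made-up list of unit IDs 1 to 100, the urban/rural strata and village clusters are invented, stratified allocation is proportional, and all function names are hypothetical.

```python
import random


def simple_random_sample(frame, n, rng):
    """Every unit has the same chance of selection."""
    return rng.sample(frame, n)


def systematic_sample(frame, n, rng):
    """Take every k-th unit after a random start within the first k."""
    k = len(frame) // n
    start = rng.randrange(k)
    return [frame[start + i * k] for i in range(n)]


def stratified_sample(strata, fraction, rng):
    """Draw the same fraction from each stratum (proportional allocation)."""
    return {
        name: rng.sample(units, max(1, round(len(units) * fraction)))
        for name, units in strata.items()
    }


def cluster_sample(clusters, n_clusters, rng):
    """Select whole clusters at random and keep all of their members."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}


rng = random.Random(2026)  # fixed seed so the illustration is repeatable
frame = list(range(1, 101))  # hypothetical frame of 100 unit IDs

srs = simple_random_sample(frame, 10, rng)
systematic = systematic_sample(frame, 10, rng)
strata = {"urban": frame[:60], "rural": frame[60:]}
stratified = stratified_sample(strata, 0.1, rng)
clusters = {f"village_{v}": frame[v * 20:(v + 1) * 20] for v in range(5)}
clustered = cluster_sample(clusters, 2, rng)

print(sorted(srs), systematic, stratified, list(clustered), sep="\n")
```

A multistage design would chain these steps, for example choosing clusters first and then drawing a systematic sample inside each chosen cluster.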
Closing Remarks

The session concluded with a reminder to complete the post‑training survey, an invitation to register for upcoming modules, and gratitude expressed to all facilitators (Prof. Rogério, Prof. Susan, Dr. Helen, Dr. Odong, Dr. Namaweji) and participants. The organizers emphasized that the skills covered—categorical data handling, chi‑square testing in R, and robust survey‑sampling design—equip attendees to conduct rigorous statistical analyses in agri‑food and nutrition research.
Participants left the training equipped to transform raw categorical data into meaningful tables, visualizations, and statistical tests (chi‑square) using R, and to design reliable surveys with appropriate sampling strategies, ensuring that future research in agri‑food systems and nutrition will be both methodologically sound and practically impactful.
Full Transcript YouTube

[Music]
Good afternoon, everyone. You're most welcome to day five, which is our last day for this training at the moment. I say "at the moment" because we might have some additional training if you request it. So you're most welcome to day five of our training on advanced statistics and experimental design. It's brought to you and funded by the World Bank through the Center of Excellence in Agri-Food Systems and Nutrition at Eduardo Mondlane University, Mozambique.
Thank you so much for joining on time once again. We want to thank our facilitators for taking us through all these days and for being very patient with our slow learning and with catching up with every person in the team. So thank you very much to our facilitators, and to the participants, thank you to all of you who make it on time. As usual, the registration link is being shared in the chat. Please ensure you register so that we can capture your phone numbers and follow up on the training. We are lucky today that Professor Rogério is with us, so I'm going to give him a chance to have a word with us too. Prof, thank you very much for joining. Good afternoon.
Thank you, thank you, facilitator. Good afternoon, participants. Welcome to this live session, the fifth day of working hard and trying to learn about advanced statistics and experimental design.
One thing I would like to request is that you provide us with two contacts. The World Bank, in its assessment, wants a WhatsApp number and an additional contact that can be used to reach you. So please provide us with two contact numbers. We are going to create a WhatsApp group so that we can communicate with you regarding this training and future training, so please provide us with those contacts.
The second request I would like to leave with you is this: after the training, the World Bank, or any agents hired by the World Bank, will contact you to check more about the training. So when you see an email with an indication that it relates to the World Bank, please respond to it, because that will be really important to confirm that the training took place and that you were involved in it.
With this, I would like to wish you a wonderful session today, which is the last one, and again to thank the facilitators and RUFORUM for this collaboration in delivering this course on advanced statistics and experimental design. Thank you very much.
Thank you, thank you very much, Prof. Just to let you know, there are already two WhatsApp groups that the participants are in, so I'll ask my colleague to add you to those groups, because the two WhatsApp groups are already running. The participants really engage a lot in those groups, and some of the facilitators are even part of them.
So, on the registration link that is circulating, in case you are not able to put in a WhatsApp number, please ensure that you put the number where you can be reached directly, since the WhatsApp groups are already running. If there is provision for two numbers, please fill in both; but most importantly, if you cannot fill in two contact details, fill in the one on which you can be reached directly for calls.
I think that's the only announcement. I know the question of certificates will pop up. We do know that some of you did not get your certificates. Please bear with us: we are compiling a list to see who has missed certificates for the first and second training, and we are also going to issue the certificates for this training. So please bear with us and put your learning first, before the certificates, so that you concentrate on the training and not on the certificate at the moment. But they will be issued. Thank you very much. Prof Susan, let me hand over to you and your team.
Okay, thank you very much. Professor Rogério, we are so grateful again for another day of sharing our knowledge with our participants. We welcome you; good afternoon. I would also like to inform you that in case anybody would like to invite us for a face-to-face training, since we have no restriction on travel, we can always travel if we are informed and can prepare in time. We also encourage Professor Rogério to invite us to Mozambique so that we do a face-to-face session with the postgraduate students there, but it would also be online so that other participants can benefit. Thank you very much, and I invite Dr Helen.
Yes, Prof. Professor, before that, I really want your contacts, because what you are saying is something that is in our plan. So please, I want the contacts.
Yes, I'm going to send you an email so that you get all our contacts, and then we keep in touch. Thank you very much, Professor.
Okay, just a small one: please fill in the details in the registration link, not in the chat. Also, the admins of the groups: if anyone can share the link to the WhatsApp groups, especially the one which is not yet full, you can also post it here.
Thank you. Okay, thank you very much, and I hand over to Dr Helen. Today we are going to shift from experimental design; we are going to look a bit at surveys so that people can get a feel for surveys and for how we can handle categorical data. Thank you very much; over to you, Helen.
Hold on, another thing: we are going to send out a short evaluation form. It is going to be a [Survey]Monkey form, so we'll give you a few minutes before we close today. I encourage everybody online, all the participants, to please fill in those evaluation forms. They are good; they help us make our training better in future, for you and me. Thank you very much. Okay, bye-bye.
Good afternoon, everyone. Okay, so what are we going to look at today? We're going to look at two things. One, we are going to look at the analysis of categorical data, and number two, we shall have an overview of sampling. But for now we are going to start with the analysis of categorical data. Just to keep you posted, all the materials are within the week two, day five folder. You can kindly check there and confirm; we posted the materials yesterday. So in case you haven't done so, please pick up the folder. It is named week two, day five, and it has both the work on sampling and on categorical data.
Let me confirm in the chat: are there some people who have already got the materials from the Google Drive? Just confirm to me in the chat. Okay, great, thank you. In case there are other people who haven't, please do the same: go to the Google Drive; the folder has the week two, day five sampling and categorical notes that you're going to use today. All right.
So, as I mentioned, today we're going to go through the analysis of categorical data. I still wanted to make it a participatory session, so that we hear from each other. Sometimes when you're teaching online, you need to be very vigilant to know that the people at the other end are with you. Can somebody type for me in the chat: when we talk about categorical data, what comes to the back of your mind? Just type a short sentence, or two or three words.
Okay, what is categorical data? "Data with labels." Okay. "Discrete data." "Data that can be grouped." "More than two points." "Non-numeric." Okay, "qualitative groupings." Thank you.
I'm glad that a percentage of us have a feel for what categorical data is all about. Now let's proceed. We know that when it comes to statistics, we need to gather, organize, analyze, and then draw conclusions from the data. But what is most important is that before we go ahead with the organizing, the analyzing, and the drawing of conclusions, we first need to get to know the nature of the variables that we are handling. Looking at what we have here, I've got a snapshot of what an Excel sheet of collected data looks like. In my columns I've got the variable names, and the rows give me the characteristics, or attributes, that are picked from each particular individual. This individual could be a person, could be an animal, could be anything that a data set describes. Just to take us back a little, among the previous data sets we had one on diamonds, where different variables were collected on different diamonds.
So a variable is any attribute that takes different values for different individuals. For example, looking here, we have "type": type is the variable, and it can take the values wood, steel, wood, etc. When you come to "design", it can take the values sit down, inverted, etc. When you look at "speed", it also has different values that have been collected from the various individuals. So after understanding what our data looks like and the different characteristics of our variables, we can go ahead, do the analysis, and then draw conclusions.
As far as categorical data is concerned, as some of you have echoed already in the chat, we are looking at categorical variables. We know what a variable is: a variable picks up different attributes. Now, a categorical variable assigns labels, as some of you mentioned, that place individuals into particular groups. A quantitative variable takes numerical values for which it makes sense to find an average. Sometimes categorical variables are termed qualitative variables, because at the end of the day some of them are just words. A very quick example: we can take marital status as our variable, and marital status can have different levels, such as married, unmarried, etc. In this particular example on the screen, the categorical variables are type, which has two levels, wood and steel, and design, which has sit down and inverted. The other variables that we are seeing here are numerical in nature; they are quantitative variables.
You can also think through other examples at the back of your mind. I remember at one point somebody was asking about examples where we apply a Likert scale, where you want to say, for example, disagree, strongly agree, neutral, and so on. Yes, in case you're looking at attitude, for example, you can also treat that as a categorical variable.
Okay, so here we are looking at an example of a random sample that was taken from 40 students. There are different ways in which we can represent our data, but specifically today we are looking at categorical data. The question was, "Which would you prefer to be?", and there were four levels of preference: rich, happy, famous, and healthy. Data was collected from individuals, and looking at the different levels of the preference, we are able to tell that yes, it is a categorical variable.
We also have another example here. The question is: identify the individuals and variables in the data set that I'm showing down here, and classify each variable as categorical or quantitative. Can you type for me in the chat? Let me get, say, three people, or whoever is interested. When you look at this particular slide and everything that I'm showing, which ones are categorical? Can you write down the categorical variables in the chat? We are just reminding ourselves. Okay, George is writing "gender". Then, yes, "state", "grade". As categorical variables we have state, grade, status, etc. Wonderful. I believe we've got that. Now let's type the ones that are quantitative. Can we type the ones that are quantitative in nature? We have age, height. Great, wonderful, thank you. We've got that, so we can push on.
There are different ways in which you can summarize this kind of distribution. If you remember, in high school, whichever schools we went to, most of the time they would tell us to tally: as you're counting, the moment you make a bundle of five you put a stroke across, and after that you can easily count them in a short time. But we can also make a frequency table. Most of the time, if we want to represent categorical data, we use a frequency table. In this particular case, from the example we have up here, the data is categorical in nature: the students were sampled with the question on the previous slide, and each would just choose a preferred status. Coming here, you find that we've got four levels, famous, happy, healthy, and rich, and the frequencies are 7, 21, 4, and 8, and the total is 40. So within the frequency table, after getting our frequency values, we can go ahead and calculate the relative frequencies, and those are the percentages that we come up with. We know that our total is 40, so for each level, starting with famous, we divide by 40 and come up with these percentages.
What does that imply? At the end of the day, if we are presenting categorical data in a table, we always have to give the frequency and, at the same time, the percentage. After that you'll be able to explain and judge, from the different levels, which one takes up a bigger percentage compared to the others, depending on the narrative that you want to put across. So for categorical variables we make a frequency table with the frequencies as well as the percentages, or we can do a visualization. I believe we've already looked at that, and you're going to look at it more as we push on.
All right. I was talking about how we display categorical data. We also have another example here. If you remember, within R we have the Titanic data set. I don't know if some of you know this particular story, but among the different variables collected in the Titanic data was class, and class is categorical: some people were in first class, second class, third class, and then the crew. These were their counts. This is a frequency table with the counts, and we can convert the counts to percentages to come up with a relative frequency table. Having got the total, we divide each level's count by it to get a percentage, and then we get to know which of the classes had, for example, more people. We are saying that the crew had 40.21 percent, followed by third class, then first, and then second class.
We've already seen that we can do bar graphs when it comes to categorical data. We usually say that a picture speaks louder than figures: most people can easily see it and follow. So you can also represent the data pictorially. In this particular case I'm presenting what we had on the previous slide: the different classes, with the percentages on the y-axis and the class on the x-axis. It's also good, in case you come up with a visualization or a picture of something, that you always label it. If you don't want bar graphs, you can have a pie chart. I'm going through this very briefly because, at the back of my mind, we've been looking at these things throughout last week and even within this week.
You can also do what we call segmented bar graphs; they're also interesting. It's all about how you can present your data to be catchy, and it also goes back to the audience: who is your audience, and which people do you want to show your data to? Are they within academia? Are you presenting to politicians? Are you presenting at a conference? All these things matter a lot, depending on the audience to which you're going to present your data. So, still with categorical data, we can do segmented bar graphs, and this is how one looks, as we've explained already. Or you can do side-by-side bar charts. All of these help you understand how your data behaves. For example, in this particular case the data is still categorical: we plotted the distribution of the type of phone somebody owns with respect to the different age categories. For the iPhone, the brown bar is 18 to 34 as per the legend, the yellow is those between 35 and 54, and the green is for 55 and above.
Okay, now let's move on. We now know what categorical data is, how you can represent it, and how you can present it in visual form: if you want to create a picture, you can do bar graphs, pie charts, or segmented graphs, depending, remember, on the levels that a particular variable has. When you talk about categorical data, we cannot fail to talk about the concept of cross-tabulation, sometimes referred to as a contingency table, where you are considering two categorical variables. If you remember, when we were looking at data manipulation, we mentioned that in case you have two categorical variables, to check the relationship you can go ahead and do a cross-tabulation, and in the process you can check whether the two variables are independent. I don't want to pre-empt that; you're going to look at it as we push on. Now, looking at this, we've got a cross-tabulation with two variables. We have the one for the groups, where we have group one, group two, and group three, and we have the one for termites: either attacked by termites or not attacked by termites. So one variable has two levels and the other variable has three levels.
What we are seeing inside here are the counts. Those attacked by termites in group one are 193, in group two 148, and in group three 210. For the subtotal here, we sum 193, 148, and 210 and come up with 551. Of course we need to know this background; when we get to R, it's just a command that we're going to write, and then we'll be able to see what these contingency tables look like, but it is important to have the background of what the computer is really going to give you. Now, for those not attacked by termites, in group one we have 307 as the count, in group two 352, and in group three 290, and when you sum them up you come up with 949.
So when we sum row-wise, the first row gives 551 and the second row gives 949; those are the subtotals for the two termite levels. The other variable has three groups, group one, two, and three. In group one, those attacked by termites are 193 and those not attacked are 307, so when you sum column-wise, the subtotals are 500 in each of the three groups, and in total our n is 1,500. So what you're seeing is a distribution of two categorical variables. We've got two rows and three columns, so in this particular case we have a two-by-three contingency table. Okay, let me first check in the chat whether we are on the same page.
Are we on the same page? "We can't see." Okay, I'm going to reduce the speed; sorry about that. "Speed too much, please." Okay, I'm going to reduce the speed so that we all follow very well. But at least I've seen that some of you have typed "yes", implying that we are still together. Okay, let's move on.
So, let's go slowly on this particular slide so that we are together. Remember, on the previous slide we saw what a cross-tabulation is, or what some other people refer to as a contingency table. These tables depend, as I mentioned, on the levels. When one variable has two levels and the other has three levels, a cross-tabulation gives you a two-by-three table. In case one has two levels and the other also has two levels, you'll come up with a two-by-two contingency table. So, in general, we can have an n-by-m contingency table, but it still goes back to the levels that define it.
Now I want to talk about two-way tables and marginal distributions. We've already looked at this concept, I remember, when Professor Susan was explaining exploratory data analysis, but now let's get into more detail on what happens with categorical data. When a data set involves two variables, what do we begin with? Most of the time we begin by examining the counts: how are they distributed within the particular groups? After the counts, you can also go ahead and get the percentages in each of the categories within that particular variable.
So, a two-way table: what comes to your mind when we talk about a two-way table? It describes two categorical variables, where you've got the counts arranged according to the rows and the columns. Let's take this particular example. We have two categorical variables. One is gender, where we have female and male, and the other asks about the chances of getting rich among the youth. Chance of getting rich is categorical, and its levels are: "almost no chance", that's one level; "some chance, but probably not", that's another level; "a 50-50 chance"; "a good chance"; and "almost certain". In other words, the chance-of-getting-rich variable has five levels, if we count: one, two, three, four, five. Then female and male, that is gender.
So we are making a two-way table to check how the counts are distributed within the levels. Looking at this table, we see that among the females there are 96 who said they see almost no chance of getting rich, while the males are 98. These are just counts. When you total 96 plus 98, we come up with 194. This is the total of those who said "almost no chance", irrespective of gender. Of the 194, we have 96 females and 98 males.
Okay, when we come to the case of almost certain, we see that the females are 486 and the males are 597. When you total them together we have 1,083. So the totals here sum over the females and the males. The total counts for each level of chance of getting rich are what we have here: 194, 712, 1,416, etc.
That is if you're looking at it row-wise, for the categories of chance of getting rich. These particular totals are referred to as a marginal distribution. The 194, 712, 1,416 and so on up to 1,083 are the marginal distribution of chance of getting rich: the totals taken irrespective of gender.
Okay, and now what about if we consider gender itself? Within gender we have the females and the males, and they are also distributed according to the chance of getting rich across the different levels. When we sum up here, the totals that we get are these: the 2,367 is the total number of females distributed across the variable chance of getting rich, and the 2,459 is the total count for the males within the same variable. So, looking at these totals, on your right is the marginal distribution for chance of getting rich, and the marginal distribution for gender is given by the total counts column-wise.
Okay, so you can ask yourself a question: how many young adults were surveyed? The young adults surveyed in total are four thousand eight hundred and twenty-six. And what are the variables described by this two-way table? The variables are two: gender and chance of getting rich.
Okay, I think we are okay.
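A minimal sketch in R of the two-way table discussed above. The counts 96, 98, 486 and 597 and the totals 194, 712, 1,416, 1,083, 2,367, 2,459 and 4,826 are the ones read out in the session. The split of the three middle opinion levels between females and males is not read out, so the interior values used for them here are assumptions chosen to reproduce the quoted totals. Object names are illustrative.

```r
# Two-way table of opinion (rows) by gender (columns).
# ASSUMPTION: the interior counts for the three middle opinion levels are
# illustrative values chosen to match the totals quoted in the session.
counts <- matrix(
  c( 96,  98,
    426, 286,
    696, 720,
    663, 758,
    486, 597),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    opinion = c("Almost no chance", "Some chance but probably not",
                "A 50-50 chance", "A good chance", "Almost certain"),
    gender  = c("Female", "Male")
  )
)
rich <- as.table(counts)

opinion_totals <- margin.table(rich, 1)  # marginal distribution of opinion
gender_totals  <- margin.table(rich, 2)  # marginal distribution of gender
grand_total    <- sum(rich)              # number of young adults surveyed

print(addmargins(rich))  # table with row and column totals appended
```

The row totals printed by `addmargins()` correspond to the marginal distribution of chance of getting rich, and the column totals to the marginal distribution of gender.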
This is just an illustration of how we can get the marginal distribution, which I've already explained earlier. To take a close look: in this particular case we have two categorical variables. One is gender, which has got female and male, and the other is job category. First, to come up with the marginal distribution for job category, we take the marginal totals that we have here. Then for the marginal distribution for gender we take the totals that are summed column-wise. So the marginal distribution for gender is made up of the totals summed column-wise, and the marginal distribution for job category is made up of the totals summed row-wise.
Okay, I hope that is very clear. The other thing is the conditional distribution. When you're looking at categorical data, you may want to know, for example, what percentage of those with clerical jobs are female, given that there are 363 clerical jobs in total. So from here we first get the totals that we have, and then on the second step we move on to calculate the percentage for each particular cell. From the percentages within the cells you can come up with a conclusion and make comments, because most of the time when we are explaining data you look at the key result, what has been interesting, and it also goes back to the objective that you want to achieve. So, to take it a little further: for the females with clerical jobs, in this particular case it is 206, and we divide by the marginal total of those within the clerical job category. For custodial it is 0 over 27; the 27, if you remember very well, is the marginal total for those that have custodial jobs. All right.
Okay.
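The two conditional percentages just described can be checked with a short R sketch. Only the four numbers read out in the session are used; the variable names are illustrative.

```r
# Counts quoted in the session for the job-category example
female_clerical  <- 206
clerical_total   <- 363   # marginal total for clerical jobs
female_custodial <- 0
custodial_total  <- 27    # marginal total for custodial jobs

# Conditional percentage = cell count / marginal total of the conditioning group
pct_female_given_clerical  <- 100 * female_clerical / clerical_total
pct_female_given_custodial <- 100 * female_custodial / custodial_total

print(round(c(clerical  = pct_female_given_clerical,
              custodial = pct_female_given_custodial), 1))
```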
Now, we are still under two-way tables and marginal distributions, which I have explained. We are looking at the counts within the respective cells. Sometimes people prefer to have percentages because most of the time they are more informative than the counts, and we've seen how you can quickly calculate them. As we move on to R, you'll find that it's very easy to calculate percentages within a cross-tabulation, because you just need to know which command to use. But at the back of your mind you need to know what the marginal distributions are: they are just the totals for each category of one variable. At the same time, you can also make a graph to display the marginal distribution, depending on what kind of graph you want.
Okay.
All right, going back to the example that we had earlier, looking at gender and the chances of getting rich: how do we examine the marginal distribution of chance of getting rich? It's the same as what we've looked at for job category and gender. On this one we've already seen our different totals, so from that table above we can move on and convert them to percentages. If you're considering the marginal distribution, we take all our totals without separating according to gender. For example, for almost no chance, when you run the usual frequency that we know, we come up with 194, and the total is 4,826. So these are the percentages that we come up with for the marginal distribution. You can still quickly do a bar chart and plot it to give you a clearer view of what you want to talk about. So you can represent this particular variable in the form of a table or in the form of a chart, and it goes back to who your audience is and how you want the people reading your work to understand it.
Okay.
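A minimal sketch of turning the marginal totals into percentages in R. The totals 194, 712, 1,416 and 1,083 and the grand total 4,826 are the ones quoted in the session; the total for "a good chance" is not read out, so it is derived here as the remainder. Object names are illustrative.

```r
# Marginal totals for chance of getting rich, as quoted in the session.
# The "A good chance" total is not read out; it is derived below as the
# remainder of the grand total of 4,826.
grand_total <- 4826
opinion_totals <- c(
  "Almost no chance"             = 194,
  "Some chance but probably not" = 712,
  "A 50-50 chance"               = 1416,
  "A good chance"                = NA,
  "Almost certain"               = 1083
)
opinion_totals["A good chance"] <- grand_total - sum(opinion_totals, na.rm = TRUE)

# Marginal distribution as percentages of all respondents
marginal_pct <- 100 * opinion_totals / grand_total
print(round(marginal_pct, 1))

# Draw the bar chart only when running interactively
if (interactive()) {
  barplot(marginal_pct, ylab = "Percent of respondents", las = 2)
}
```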
Now, in case you want to look at the relationship between two variables, I've already mentioned the conditional distributions. For example, in the previous table, where we are looking at the chances of somebody becoming rich across the different levels, you can condition on one variable. We can ask: what percentage of females give a particular answer, given the total number of females? We know the total, but we want to come up with the percentage for a given cell. So what do we do? Number one, we select the columns or the rows of interest; that is very important. Then you use the data in the table to calculate the conditional distribution, the percentages. After that you can display them on a graph: you can do a segmented bar graph, you can do a pie chart, you can do bar graphs, or you can go ahead and use side-by-side bar graphs, like we explained earlier, for comparisons.
Okay.
So, going back briefly to what we saw earlier: here we were looking at the marginal distribution, considering all our totals irrespective of any division, for example by gender, where we have males and females. If you're drawing just a usual frequency table, you have the frequencies and the percentages; those are the marginal distributions. If you want to go ahead and look at it with respect to the different levels of the other variable, then in that case we are looking at a conditional distribution. In this particular case, for example, we are considering males. Remember from our table that the total for males is two thousand four hundred and fifty-nine. So to come up with the conditional distribution, for example at almost no chance, it's going to be 98 divided by the total males, which is 2,459. The same applies to all the rest; it's what we have here. So we are trying to explain the relationship between gender, in this case considering males, and the responses on chances of getting rich. After doing this you can also go ahead and do the same for the females. We know that the total for the females from our screen is 2,367, and you can work out, for example, what percentage of females say they have some chance of getting rich, given that the total is 2,367. So we go ahead and compute our percentages. Whatever we have now in this table are just conditional distributions.
Okay.
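The conditional percentages described above can be sketched in R as follows. Only counts that are read out in the session are used (two opinion levels per gender, plus the gender totals), so this is a partial illustration and not the full conditional distribution. Object names are illustrative.

```r
# Counts and gender totals quoted in the session; only two opinion levels
# ("almost no chance" and "almost certain") are read out for both genders.
male_total   <- 2459
female_total <- 2367

male_counts   <- c("Almost no chance" = 98, "Almost certain" = 597)
female_counts <- c("Almost no chance" = 96, "Almost certain" = 486)

# Conditional percentage of each answer, given gender:
# cell count divided by the total for that gender
conditional_pct <- rbind(
  Male   = 100 * male_counts / male_total,
  Female = 100 * female_counts / female_total
)
print(round(conditional_pct, 1))
```

With the complete table stored as an R table that has gender in the columns, `prop.table(tab, 2)` would return all of the conditional proportions given gender in one call.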
Okay, so after having your distributions you can also represent them in pictorial form. All these are just different ways to present categorical data and check the relationship between two categorical variables. From here we can see the males and the females, and the opinions are colored. For example, in this particular case the yellow is the 50-50 chance, which you can easily see, and then the red is a good chance, among others. Within R you can write just one command and come up with all these results, but it's good to understand what they are talking about.
Sometimes just a diagram, or the marginal and conditional distributions, might not really give you a feel for what conclusion we can draw about these two variables. We are going to see how you come up with a conclusion on whether there is a relationship between these two categorical variables. We're not going to look at it at this particular stage, but we will as we move on. I would also caution you: sometimes you can say that there is a strong association between two categorical variables, but there could also be some other factors that are influencing these variables to behave like that.
Okay.
So, in summary, what have we looked at so far? In case we want to make sense of categorical variables, or categorical data:
Number one, we can display categorical data with a bar graph, a pie chart or a segmented bar graph.
You can go ahead and calculate and display the marginal distributions of categorical variables from a two-way table, like we've seen.
After that you can calculate and display the conditional distributions, as per the previous slides.
And then after that you move on to describe the association between the two categorical variables.
Okay, let me check in the chat. Are we on the same page?
Okay, sure. All right. Thank you, thanks for the replies; there are many.
Now, on the previous slide, if I take you back briefly, we mentioned that we are seeing a relationship between chances of becoming wealthy, or becoming rich, and gender. These are two categorical variables. Yes, we are seeing how they are categorized and we are seeing our picture, but the question is: can we come up with a conclusion? Can we test whether there is a strong association between these two variables? How do we do that in R? If you remember, within our data management session we saw the method of analysis and the method of visualization. When you have two categorical variables, to check if there is an association between them we have to do what we call a test of independence, and it is now the concept that we are going to look at.
Okay.
Maybe before we move on to the test of independence, can you type for me in the chat what comes to the back of your mind when they tell you that, to test for an association between two categorical variables, you need to test for independence? What does it mean? Type for me in the chat.
Okay, somebody is saying it means that you use a chi-square. Great, chi-square. And then a test of comparison, chi-square test. Can I see more? No relation. You have to run a chi-square test. We are testing for a relation. Okay, to check whether we have dependency or independence. Some people mention another test: you can use Fisher's exact test. And hypotheses. All right, thank you.
Yes, Annette, you're going to be answered anytime. Okay, so in case we want to make a conclusion, to check if there is a relationship or an association between two categorical variables, we test for independence. And how do we do that? Like most of you have commented in the chat, we use a chi-square test. So a chi-square test is a test that is used to test for independence between two categorical variables, and we are going to go ahead and test it. The null hypothesis would be that there is no association between the two categorical variables, and the alternative is that there is an association. Put differently, under the null the two variables are independent, and under the alternative they are not independent. Remember, if we go back a little bit to probability, when we talk about independence it implies that, when you're looking at two particular events that are independent, the occurrence of one does not influence the occurrence of the other. Okay.
Now, when we are looking at two categorical variables, if you want to test whether there is a relationship you do what we call a chi-square test. Of course there are some scenarios when you're calculating a chi-square, maybe running it within R, and R tells you that the result may not be reliable; that's when the concept of using Fisher's exact test comes in, as somebody commented in the chat.
Okay. Within a chi-square test there is also what we call expected values. What we see within, for example, the Excel sheet or the collected data, that is the observed. But then what about the expected? Remember that all of this week we have been talking about experimental error and then sampling error. We believe that when this data is collected there is still an error; we cannot be a hundred percent sure that we've got all things right. So we're also going to calculate the expected values, and all the things that I'm talking about lead us to how a chi-square test statistic looks. When it comes to expected values, we take the row totals, those are the marginal totals, times the column totals, and divide by the grand total. To take you back, I talked about the marginal rows and the marginal columns; they are the ones being referred to here, and then we divide by our total.
But it's important to know at the back of your mind what you are testing. How do you set the null and how do you set the alternative? It is very important because, as we push on, it's the one that directs you either to say that you reject H naught or that you fail to reject H naught.
Okay.
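Written as a formula, the expected count for the cell in row i and column j is as follows. The symbols R_i, C_j and N are notation introduced here for the row total, the column total and the grand total; they are not taken from the slides.

```latex
E_{ij} = \frac{R_i \times C_j}{N}
```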
Now, remember that the chi-square is governed by two things. We've got the observed frequencies, what we picked from our data, and then we have the expected frequencies. For these I've already talked about the marginal totals, whereby you multiply the row marginal total times the column marginal total and divide by the total number of observations; within R that can be done very well. After that we get the difference between the observed and the expected frequencies. Within the software all these things are going to be calculated automatically, but sometimes it is good for us to get a feel for how these things come about. After step three, getting the difference between the observed frequency and the expected, the next step is to calculate the chi-square test value.
Okay, let me first pause a little bit and check in the chat. I hope I've not left anyone behind.
Okay. I also want to encourage you: if you have questions, you can type them in the Q and A. My colleagues will be happy to answer those questions, and the ones that I see are quick to address I'll also be able to answer.
Okay, so when you check in many statistics textbooks, or even if you Google on the internet, the chi-square test statistic: we are saying that it is a test for independence, or somebody can say it is a test of association. With the null, we mentioned, there is no association, and with the alternative there is an association. When you check in most of the books, this is how the chi-square test statistic looks. We have O i, the observed frequency, minus the expected frequency. The difference, or you can call it the deviation, is squared, then we divide by the expected frequency, and we sum over all the cells. You can also use chi-square tables; within the different statistical tables we've got chi-square tables. To read them we need the degrees of freedom, which come from the number of rows and the number of columns. So, when you're looking at your contingency table, you need to know how many rows you have and how many columns you have. When you multiply the number of rows minus one times the number of columns minus one, r minus one times c minus one, it gives you the degrees of freedom. So you can read the critical value from the table, calculate your chi-square, and then compare. You can also use a p-value to decide if there is an association, but we're going to look at that.
All right.
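For reference, the statistic and degrees of freedom described above can be written as follows, where O_i and E_i are the observed and expected frequencies in cell i, the sum runs over all cells of the contingency table, and r and c are the numbers of rows and columns.

```latex
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i},
\qquad
\text{df} = (r - 1)(c - 1)
```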
Just to echo again: within the chi-square statistic, what do we do? We compare the observed values to the expected values. Remember, the observed is what you collect within your data; we're going to see it in R and I'll still emphasize it more. The expected values are what you would expect from the computation. So we are comparing both of them, and what we are after is to check whether the difference between the observed and the expected is statistically significant. Sometimes, when you're checking the difference between the observed and the expected, you might find that you come up with a very low value for the chi-square, which implies that the difference between the observed and the expected is really small. And sometimes, as some of you may have come across, you calculate and you see that the chi-square gives you a zero. What should come to the back of your mind is that in that particular case the observed values are equal to the expected values. As you go ahead and compute the association using the chi-square test, all this should be at the back of our minds for the different figures that we are going to come up with.
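To illustrate the point about a chi-square of zero, here is a small R sketch. The 2 x 2 table is hypothetical and was chosen so that its rows are exactly proportional, which makes every observed count equal to its expected count.

```r
# Hypothetical 2 x 2 table whose rows are exactly proportional,
# so every observed count equals its expected count.
observed <- matrix(c(10, 20,
                     20, 40),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(group  = c("G1", "G2"),
                                   answer = c("No", "Yes")))

# Expected counts: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic computed by hand
chi_sq <- sum((observed - expected)^2 / expected)

# The same statistic from R's built-in test (no continuity correction)
builtin <- chisq.test(observed, correct = FALSE)

print(expected)
print(chi_sq)
print(unname(builtin$statistic))
```

Any departure from proportional rows would make the statistic positive, and larger departures give larger values.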
So, in general, what is the summary before we move on to the calculation? Rule number one, you need to know the two variables that you want to check, and the two variables have to both be categorical in nature. You set your hypotheses: the null is no association, that is, the variables are independent, and the alternative is that there is an association, that is, they are not independent. When you go to your statistical software, in this case R, you can set your significance level; you can consider one percent, five percent, etc. But the key rule concerns the chi-square table value. I've talked about the tables themselves: we've got statistical chi-square tables, and the values in them are sometimes referred to as critical values. You can calculate the chi-square manually, like I explained earlier, having the observed and getting the would-be expected frequency; you square the deviation and then you divide, and that gives you the calculated chi-square. If what you read from the table is less than what you've calculated, by the rule of thumb we reject H naught. The moment we reject H naught, it implies that we are going with the alternative, and we conclude that the categorical variables are not independent; there is an association.
Okay, let me first check in the chat.
Okay: no independence, association. All right.
Now we are going to take this particular example, which I picked from a textbook. Here we want to check: is there an association between the area of a city and the offense? We have got areas A, B and C, and these are the respective offenses that we have: burglary, robbery and car theft. What we have in the row totals are our marginal distributions: for A it is 55, for B it is 49, and for C it is 46. And if we look at the distribution for the offense: for burglary we have 50, for robbery 60, and for car theft 40. What we are seeing here are just observed frequencies, the counts from the data. We can say it is a three-by-three contingency table; it is just a cross-tabulation between area and type of offense. So what are we going to do next? We need to go ahead and come up with the expected frequencies. Then we get the difference between the observed and the expected, we square it, we divide by the expected frequency, and we sum over the cells to come up with the chi-square.
Okay.
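The expected frequencies for this example can be sketched in R from the marginal totals alone. The row and column totals are the ones quoted in the session; the observed cell counts are not read out, so this sketch stops at the expected table and does not attempt to recompute the test statistic. Object names are illustrative.

```r
# Marginal totals quoted in the session for the area-by-offense example
area_totals    <- c(A = 55, B = 49, C = 46)
offense_totals <- c(Burglary = 50, Robbery = 60, "Car theft" = 40)
grand_total    <- sum(area_totals)   # 150 observations in total

# Expected count for each cell under independence:
# row total * column total / grand total
expected <- outer(area_totals, offense_totals) / grand_total
print(round(expected, 2))
```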
So, number one, like I mentioned, the two variables that I have here are both categorical. So I go ahead and set my hypotheses, H naught and H1. With H naught I'm saying there is no association between the offense and the area of the city, and the alternative is that there is an association.
Okay.
I'm not going to go into the different calculations of how these things came about, but if you check the notes you can easily follow through on your own and see how the different results came about. At the end of the day we have the sums of the different rows and columns, with 150 as the grand total, and from these we get the expected frequencies. So for each cell we have O i minus E i, observed minus expected, squared, divided by E i; these are the values that we have down here. What we are going to do is go ahead and sum them up. The values were substituted into the test statistic and we came up with 23.13.
All right, this is the value that has been calculated, so the calculated value is 23.13. Remember, if we go back briefly, each variable has got three levels. So to come up with the degrees of freedom, since it is a three by three, we take the number of rows minus one times the number of columns minus one. That's how we come up with four degrees of freedom.
In case we take our significance level as 0.05: if you check in any statistical table, like here, when you read it you come up with 9.488. What conclusion would you come up with? We have that the calculated value is 23.13, and when you read from the table it is 9.488. Can you type the answer in the chat? What conclusion can you come up with?
Okay. Freddie thinks that we reject H naught. Reject H naught; Kenneth agrees. Ismail also says reject.
Okay. So the table value is less than the calculated, hence we reject no association. So reject H naught: there is no independence.
Okay, so we reject H naught.
Okay.
So remember, if I take you back to our rule: we said that if the critical value, that is the table value, is less than the calculated value, we reject H naught. Now let's go back and mark ourselves to see if we are right. From the table it is 9.488, and what is calculated is 23.13. So in this particular case we reject H naught, since the critical value read from the table is less than the calculated value.
Okay, so the first one was when your alpha was 0.05. What about in case we use a one percent significance level? If we use a one percent significance level, we are told that the critical value is 13.28. Then what do we do in this particular case? Looking at the one percent significance level, the critical value from the table is 13.28. What do we do in this case? Then we're going to come up with a conclusion together. What will you do?
Okay, they're still saying that we reject H naught. We still reject, thank you. It is still less than the calculated, and so on. So in this particular case I concur with you. We will still reject H naught and say that the test is highly significant. So in this case we can conclude that there is evidence of association between the two factors; in other words, the two categorical variables are not independent.
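The table look-ups and the decision at both significance levels can be reproduced with a short R sketch. The calculated value 23.13 and the 3 x 3 layout are taken from the session; `qchisq()` and `pchisq()` are base R functions used here in place of a printed chi-square table. Object names are illustrative.

```r
# Values quoted in the session
chi_sq_calculated <- 23.13
n_rows <- 3
n_cols <- 3
df <- (n_rows - 1) * (n_cols - 1)   # 4 degrees of freedom

# Critical values from the chi-square distribution (the "table values")
alphas   <- c(0.05, 0.01)
critical <- qchisq(1 - alphas, df = df)

# Reject H0 when the calculated value exceeds the critical value
decision <- ifelse(chi_sq_calculated > critical,
                   "reject H0", "fail to reject H0")
print(data.frame(alpha = alphas, critical = round(critical, 3), decision))

# Equivalent p-value approach: reject H0 when p_value < alpha
p_value <- pchisq(chi_sq_calculated, df = df, lower.tail = FALSE)
print(p_value)
```

Both routes lead to the same decision: comparing the calculated value with the critical value, or comparing the p-value with the chosen significance level.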
Okay, let me check in the chat. Are we on the same page? No independence. No association. Yeah: when there is an association there is no independence.
Okay.
okay
So, in case we move to R, we are going to see how we can present categorical variables. We've already seen what factor variables look like; factor variables are what I refer to as the categorical ones, and we have to convert them to factors. We create frequency tables using the table function. From the table we will be able to calculate the proportions, and we can also get the marginal frequencies using margin.table, with brackets, and inside we put what we want to look at.
In what I'm showing here, let's assume that our n is equal to 500, and I want to create a data frame. I've got my A, which I define as a factor; we are going to see this within R. I create a sample which takes the values A1 and A2. This is just an imaginary data set that I'm creating. I put c, because I need to have the values together, then I give my n, which I already know, and replace equal to TRUE, implying that the values A1 and A2 are going to be drawn 500 times with replacement.
The same applies to B: I want to create a factor variable from a sample where I'm going to have B1 and B2; my n is still 500 and again replace is TRUE. The same applies to C. Then I can create a data frame, a data set, that has got A, B and C. I will attach my data set using the object name that I've given it, and then using the table command I can put in whatever variables I want. I put in the first variable, which in this particular case is A, and then B, and this is going to be our output. I think this will make more sense if we do it directly within R.
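A minimal sketch of the steps just described. The script used in the session is not reproduced in the transcript, so the seed, the level names for C and the object names `mydata` and `ab_table` are assumptions made for illustration; `with()` is used here instead of `attach()`.

```r
set.seed(1)  # assumption: seed added so the simulated data are reproducible

n <- 500
A <- factor(sample(c("A1", "A2"), n, replace = TRUE))
B <- factor(sample(c("B1", "B2"), n, replace = TRUE))
C <- factor(sample(c("C1", "C2"), n, replace = TRUE))  # level names assumed

mydata <- data.frame(A, B, C)  # 'mydata' is a hypothetical object name

# Two-way frequency table of A (rows) by B (columns).
# with() is used in place of attach() to avoid masking other objects.
ab_table <- with(mydata, table(A, B))
print(ab_table)
```

Changing `n` changes the number of simulated observations, and the four cell counts in the table always add up to `n`.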
Okay, and then we can also do margin.table. For example, if you want to get the marginal distribution, with one you get it for the first variable that you're considering, A, and the frequencies are summed over B. We can also sum over the other variable. We are going to demonstrate that. When you take prop.table with just the table, we'll be getting the cell percentages.
Okay, in case you have prop.table and you put in the name of your table and then comma one, it implies that you're getting row percentages. If you don't put anything after the table name inside the command, you're getting the cell percentages, for each particular cell. If you put two, and we are going to see this in R, it implies that you're going to come up with column percentages, because remember we have the rows and then we have the columns.
Okay.
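The three forms of `prop.table()` and the two forms of `margin.table()` can be compared side by side in a short sketch. The 2 x 2 counts are hypothetical. Note that `prop.table()` returns proportions, so they are multiplied by 100 here to get percentages.

```r
# Hypothetical 2 x 2 table of counts, for illustration only
tab <- as.table(matrix(c(30, 20,
                         10, 40),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(A = c("A1", "A2"),
                                       B = c("B1", "B2"))))

a_totals <- margin.table(tab, 1)   # A frequencies, summed over B
b_totals <- margin.table(tab, 2)   # B frequencies, summed over A

cell_pct <- 100 * prop.table(tab)      # each cell / grand total
row_pct  <- 100 * prop.table(tab, 1)   # each cell / its row total
col_pct  <- 100 * prop.table(tab, 2)   # each cell / its column total

print(a_totals)
print(round(row_pct, 1))
print(round(col_pct, 1))
```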
All right. Sometimes you might not be able to do it manually and you're going to use a statistical package. So what conclusion can you come up with? It all goes back to the p-value: how do you interpret a p-value and a chi-square statistic so that you are able to come up with a conclusion and say that there is a relationship between the two categorical variables that you're considering? We are going to see this in R and then come up with the different conclusions.
All right.
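A hedged sketch of reading a chi-square result in R through its p-value. The 3 x 3 counts below are invented purely for illustration and are not the data from the session; the significance level of 0.05 is likewise just an example choice.

```r
# Hypothetical 3 x 3 table of counts, invented purely for illustration
tab <- as.table(matrix(c(20, 15, 10,
                         10, 25, 15,
                         10, 15, 30),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(row_var = c("R1", "R2", "R3"),
                                       col_var = c("C1", "C2", "C3"))))

result <- chisq.test(tab)
alpha  <- 0.05

print(result$statistic)  # calculated chi-square
print(result$parameter)  # degrees of freedom
print(result$p.value)    # p-value
print(result$expected)   # expected counts under independence

# Decision rule: reject H0 (independence) when the p-value is below alpha
if (result$p.value < alpha) {
  message("Reject H0: evidence of an association")
} else {
  message("Fail to reject H0: no evidence of an association")
}
```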
So this was the theory bit. Now we are going to move on to the analysis of categorical data in R, so that we see how to do what we've been sharing in R in a very simple way.
Okay, I encourage you to go to the folder. We are going to use categorical data.R; it is what we are going to use within that particular folder.
Okay, so I'm going to stop sharing this and then I share my R. Open your R on your computers.
Okay. We are going to use the script categorical data.R. It is also one of the scripts that we sent in the materials of week two, day five.
Okay.
How about Fisher's test? You will use Fisher's exact test in case you're using a chi-square and you run into a problem. Most of the time the rule says to check the counts within the cells: if the expected counts are less than five, there are high chances that, as you compute the chi-square, R will warn you that the approximation may be incorrect. So the moment you're using a chi-square and you get that warning, it should be an indicator to you that the chi-square will not work well in this case, so let me go in and use Fisher's exact test.
Which folder are we using? Week two, day five, the sampling and categorical folder within the Google link.
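Returning to the question about Fisher's exact test, here is a small sketch on a hypothetical table with small counts. In R, `chisq.test()` typically raises a warning, not an error, when expected counts are small; the sketch captures that warning and then runs `fisher.test()` on the same table.

```r
# Hypothetical 2 x 2 table with small counts, for illustration only
small_tab <- matrix(c(3, 1,
                      1, 3),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(group   = c("G1", "G2"),
                                    outcome = c("Yes", "No")))

# Run the chi-square test and capture any warning it raises
chisq_warning <- NULL
chisq_result <- withCallingHandlers(
  chisq.test(small_tab),
  warning = function(w) {
    chisq_warning <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)
print(chisq_result$expected)  # every expected count here is below 5
print(chisq_warning)

# Fisher's exact test does not rely on the large-sample approximation
fisher_result <- fisher.test(small_tab)
print(fisher_result$p.value)
```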
okay somebody's asking how about SPSS
users we are not looking at SPSS now we
want to see how do we do it in r
okay can you confirm to me if you've
opened your scripts how do you load the
data
that was one of our first lessons Okay
Okay
can we open the categorical data.r
Scripts
okay it's what we are going to use and
then inside still that same folder there
is a
Somebody's asking, can I use Stata? Of course, why not.
Depending on the statistical software that you're using, you can still
go ahead and do cross tabulation.
All right, so can we continue? If you've opened your
categorical data.R script, can we continue?
type yes if it is okay
okay
all right
so
What we are going to start with now: we want to see how we do the analysis.
There are two examples within this particular script. There is this first
one, which I told you we've just created, and then the other example that
we are going to use is the one that has a data set, which is the work
salaries CSV. We are going to have these two examples and then we check
how we do that.
okay we've already said that the moment
you open your
um the moment let me ask in the chat are
you seeing my R script
have I shared
can you type one
okay
all right
so
um
What we do first is run our libraries. All the time, if you open an R
script, you first run your libraries, and if they run successfully then it
implies that you're good to go. Sometimes we come up with errors that
maybe a certain function is not found, as we've been seeing throughout
this particular week. It implies that maybe there is a package that you
needed to load or to run but you didn't.
Okay, now let's take line 13. I'm saying that I want to create 500
observations; that is my n. You can change it to anything that you want;
I'm just giving an example. You can change it to 10, you can change it to
five, you can change it to anything. So in case I run line 13 and I check
in the R console, it has been noted.
Okay, I want to create my first variable, which is A. Remember, when it is
a factor you're telling R that you want to create a categorical variable,
and I'm creating a sample. I'm putting everything inside sample, which has
A1 and A2; these are my levels under variable A.
Okay, and I'm giving it my n, the number of observations that I want for
A, and then I want them to be replicated 500 times: r-e-p, that is
replicate, which is equal to TRUE.
So in case I run that particular line, line 14, you find that when you
check in the environment it is showing you that it is a factor with two
levels, A1 and A2.
Okay, I can do the same for B. For variable B it's the same explanation,
only that this time I've got B1 and B2. I can also run that line, and then
I also run line 16, which is still the same. When you check in the
environment it is telling me that C is a factor with two levels, C1 and
C2; B is a categorical, a factor variable with two levels, B1 and B2; and
then the same applies to A.
So what I want now is to create a data set, which I've given the name
my data. Okay, that is the object name that I've given, my data; you can
change it to anything that you want in case you don't want to use my data.
And then I want to create a data frame. Remember when we talked about a
data frame, whereby you just want to create a data set. Inside data.frame
I put my first variable A, my second variable B and then my third variable
C, and I put them together inside that data.frame. So if I run line 18 and
I look in my environment, you find that now it is showing me that I've got
a data set which has 500 observations and three variables.
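The steps described so far can be sketched as follows. This is a reconstruction under assumptions, not the course script itself: the object name `mydata`, the seed, and the spelled-out `replace = TRUE` are all assumed (a call written as `rep = TRUE` inside `sample()` is matched by R to the `replace` argument, meaning sampling with replacement).

```r
set.seed(123)   # assumption: a seed, added only so the sketch is reproducible
n <- 500        # number of observations; any positive integer works

# Each variable is a factor with two levels, drawn with replacement
A <- factor(sample(c("A1", "A2"), n, replace = TRUE))
B <- factor(sample(c("B1", "B2"), n, replace = TRUE))
C <- factor(sample(c("C1", "C2"), n, replace = TRUE))

# Combine the three factors into one data set
mydata <- data.frame(A, B, C)
str(mydata)     # 500 observations of 3 factor variables
```

Because the draws are random, the cell counts from this sketch will not match the counts read out in the session.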
You can click on the Excel icon and also have a look at how these
variables look. We are seeing that A has A1 and A2 levels, B has B1 and
B2, and then the same applies to C. But remember, this is just an example
that we've created; after this we can look at a real example. Let me check
in the chat whether we are on the same page.
are we together
okay
great
okay
We've also been talking about attaching your data set, so we are having
variables A, B and then C.
Just to remind you, in case you get stuck on something that is not so
clear to you when you're doing it in R, you use the help command: you type
a question mark, and after the question mark you write the thing that you
want to get more details about.
Okay, now I know that at this particular stage my data set is called
my data.
Okay, so I've got three variables, A, B and C. If I want to create a cross
tabulation, or a contingency table, between two variables, remember what
we said: you write table and then inside you put the first variable and
then the second variable.
So in case I run line 26, I expect to see my results in the R console.
Okay, so we are having A and then we are having B. Those that are in A1
and also in B1 are 121, and we are seeing the different counts. Since each
variable has got two levels, we come up with a two by two contingency
table, or a cross tabulation.
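A minimal sketch of the cross tabulation, assuming the object name `mytable`. To reproduce the counts read out in the session (121, 125, 127, 127) the two factors are rebuilt with `rep()` here rather than drawn with `sample()`; that substitution is mine, for reproducibility only.

```r
# Rebuild 500 paired observations with the counts quoted in the session
counts <- c(121, 125, 127, 127)
A <- factor(rep(c("A1", "A1", "A2", "A2"), times = counts))
B <- factor(rep(c("B1", "B2", "B1", "B2"), times = counts))

# Rows are the levels of A, columns are the levels of B
mytable <- table(A, B)
print(mytable)
```

Storing the result in an object means nothing is printed until the object itself is run.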
All right, you can also put line 26 into an object, which I've called
my table. If I write my table and I say let it be equal to line 26, I
still come up with similar results. I run that particular line, and then
for me to see the output I run the object, and I still come up with
similar results. We are going to see it when we are running a data set
itself, such that it makes more sense and is even easier for us to pick up.
Okay, we talked about the margins; these are just the sums. If you want to
create the margins, you use margin.table and inside you put the name of my
previous output, which is the contingency table for A and B. In case
inside I put a one, it implies that I'm going to come up with the total
frequencies of A.
Okay, if I run that particular line, you can check in the console.
Remember it's a matrix form, row by column, so my row in this case is one
and it represents A. What R has done is sum 121 plus 125, so we come up
with 246, and then 127 plus 127, which is how we come up with 254.
Okay, you can also change and instead of one you put two. In this
particular case the frequencies will be summed for the B variable, so when
you run that, this is the output that you will get. If you go back to your
output, it is now summing 121 plus 127 and then 125 plus 127, so we come
up with the marginal totals on both the column end and the row end.
Okay, in case we just want to come up with the proportions of the
different counts, we use what we call prop.table. That is the command that
we use, and then inside you put your object name; I used my table. So in
case I run this, it's going to give me the cell percentages that I
explained earlier within the PowerPoint. These are the proportions that we
get in the respective cells. You can go ahead and multiply them by a
hundred, remember we talked about manipulation, and at the same time you
can also round them off.
What we've done up there is come up with the cell percentages. Now, in
case you want to come up with row percentages, you go inside the
prop.table command and you put a one; if you want to come up with column
percentages, instead of a one you put two.
Okay, so in case we run line 45, we are going to come up with the row
percentages. And then if you want to come up with the column percentages,
instead of one you change and you put two, and that is the output that we
are going to come up with.
I'm not going to run line 57; it will not make sense now. So what I want
us to do now is to do it on our data. Can you go and import the data set
which has work salaries? Let's repeat whatever we've done up here using
work salaries, and then we do a clearer interpretation based on the data
set that we have.
okay
okay
Let's import the work salaries CSV data set. Within this data set, maybe
as you set the working directory, we've got different variables. We have
the rank, we have the discipline, years since PhD (these are the variables
I'm reading out), we have the years.service, we have sex, we have the
salary.
Maybe I can first stop sharing and then I share it and we see it together.
This is the data set I'm talking about that we want to import in R, the
work salaries data set.
It has the rank. Rank is a categorical variable with different levels: we
have professors, assistant professor and then associate. All right, we
have another variable, which is discipline, and it has got two levels, B
and A: B, that is Humanities, and then A, Sciences. We have years since
PhD; this is just qualitative, I'm sorry, quantitative, they're just
numbers of years. Then we have years of service, the time that has been
spent since somebody was employed. We've got sex, this is gender, either
somebody is male or female, and then we also have salary. So this is the
data set I want you to import in R, and it's the one that we are going to
use.
okay
Now let's go back briefly to line 28, and then we will move to the other
section. Using the my data object, I wanted to create a table, a cross
tabulation for variable A and variable B. Remember, here our A are the
rows and the B are the columns.
All right, so if I run line 28 I will not be able to see my output. If I
now run the object itself, my table, I come up with the counts. When
you're looking at the counts in A1 that are also in B1, they are 121;
those in A1 that also belong to the B2 category are 125. If you come again
to A2, the ones that belong to level A2 but are also in B1 are 127, and
for A2 and B2 there are 127.
Now, if you've got 121 and 125, then 127 and 127, this is a two by two.
Remember when we talked about the margin table, we are trying to get the
sum. Okay, we are going to get the sum of the rows in case you put one
here. Remember our rows are the A's, so it is going to sum the A's for us.
If you run this particular line here, line 32, and you look in the console
where I have highlighted, it gives us the marginal, or the sum totals, for
level A1, which in this particular case is 246.
Okay, and then for A2 it's going to go ahead and sum 127 and 127, which
will be 254. These are the marginal totals for A, for the levels A1 and
A2.
Now, in case instead of one you put two, it is going to sum column-wise,
so we are going to come up with the total margins for B. If you look in
the console at my table output, it's going to get 121 and sum it with 127,
that is the first column, B1, and then in B2 it is going to get 125 plus
127. So for us to see that marginal output we use margin.table, that is
line 36. Okay, line 36, and then we run that one. So our B1 is going to be
248 and our B2 is going to be 252.
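The marginal totals above can be checked with a short sketch. The table is typed in directly from the counts read out in the session; the object names are assumptions.

```r
# 2x2 table with the counts quoted in the session
mytable <- as.table(matrix(c(121, 125,
                             127, 127), nrow = 2, byrow = TRUE,
                           dimnames = list(A = c("A1", "A2"),
                                           B = c("B1", "B2"))))

row_totals <- margin.table(mytable, 1)  # sums over each row: totals for A
col_totals <- margin.table(mytable, 2)  # sums over each column: totals for B

print(row_totals)  # A1 = 246, A2 = 254
print(col_totals)  # B1 = 248, B2 = 252
```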
Let me first check in the chat: are we on the same page as we proceed?
Are we together? Okay, let's continue slowly, and then we get to know the
difference in what we are doing.
Now, remember, with the marginal tables we are getting the sums of the
different levels within a cross tabulation. If you're considering
row-wise, we put a one inside the command; if you're considering
column-wise, instead of one we put a two.
How about when we go to the percentages? Remember, when we are dealing
with percentages we are calculating proportions within a particular table,
within a particular cell. If I take you back briefly to my table, what we
are calling a cell is a particular count; in case you're considering, for
example, a two by two, each count represents a cell.
Okay, so now when you come to line 40, remember how our table looks: our
table had the counts 121, 125, 127 and 127. In case you want to get the
percentages for each cell out of the grand total, we do prop.table and
inside we put my table, our object's name, because in that particular case
we'll be coming up with cell percentages.
Okay, so when you run line 40 we'll be having 0.242, this is for the first
cell, and then for A1 and B2 it is going to be 0.250. These are just the
different percentages that we are getting.
But now, if we are to consider row percentages, it implies that my count
is going to be divided by the total within that particular row. For
example, in A1, for the case of row percentages, each particular value
within a cell will be divided by its total marginal value, so the first
one is 121 and we'll divide it by 246, because that was the row total, to
come up with the row percentages. Okay, depending on what you want to
report.
So for the case of the row percentages we get each particular cell and we
divide it by the row marginal total, and if it is the column, now we
divide each cell by the total column value. That is the difference between
the row percentages and the column percentages. For the case of the cell
percentages you've not differentiated anything; you're just taking the
full total of what you're having. But with the row percentages you are
considering, for example in this particular case, the levels A1 and A2,
and for the case of the column we'll be getting our count and dividing it
by the total value of B1 or B2. I don't know if that is clear.
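The three kinds of percentage can be compared side by side in a short sketch, again using the counts quoted in the session. The object names and the rounding to one decimal place are assumptions.

```r
mytable <- as.table(matrix(c(121, 125,
                             127, 127), nrow = 2, byrow = TRUE,
                           dimnames = list(A = c("A1", "A2"),
                                           B = c("B1", "B2"))))

# Cell percentages: every count divided by the grand total (500)
cell_pct <- round(100 * prop.table(mytable), 1)

# Row percentages: every count divided by its row total (246 or 254)
row_pct <- round(100 * prop.table(mytable, 1), 1)

# Column percentages: every count divided by its column total (248 or 252)
col_pct <- round(100 * prop.table(mytable, 2), 1)

print(cell_pct)  # A1/B1 is 121/500 = 24.2
print(row_pct)   # A1/B1 is 121/246, about 49.2
print(col_pct)   # A1/B1 is 121/248, about 48.8
```

Which version to report depends on the question: row percentages describe B within each level of A, column percentages describe A within each level of B.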
I hope that when we use the real example it will make more sense.
Does it sound okay?
okay
community
all right
Now what we are going to do: let's use a real example, for example from
data that has been collected, and see; maybe in that particular case it
will come out very visibly. Now what I want you to do is, let's import our
work salaries data set.
Here I'm setting my working directory, which I already have, and then I
want to read work salaries within R. If I run line 65, I'm using the
object name salaries. I run line 65 and I check within my environment that
I'm having 397 observations of six variables, which we saw within the
Excel sheet. You can also click on the Excel icon and then view.
We saw that line 66 can help us to see the structure of the data set. Let
me first check in the chat: have you been able to import the work salaries
data set?
have we imported
okay
all right
Great, I think now at least we know what that means. Okay, so when we look
at str of salaries, the structure of salaries (remember, salaries is my
object), first we are given our data frame: that data set is a data frame
with 397 observations and six variables. When you come and check on the
rank, it is being seen as a character, so we need to convert it to a
factor. The same applies to discipline, because we know the discipline is
also a categorical variable, and then sex, which here R is reading as a
character, so we need to tell it to convert character to a factor.
So my line 69, I'm just going to run that; I hope we all know what it
means. Then I run the discipline, and I also run the sex. I can now go
back and run my structure again, and from here I can see that now my rank
is categorical with three levels, discipline has two levels, and sex also
has two levels. The rest are just quantitative, they're just numerical.
So now let's move on.
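A minimal sketch of the character-to-factor conversion. The four rows below are a made-up stand-in for the course CSV (the values and the level spellings such as `AssocProf` are hypothetical), read from inline text so the sketch runs without the file.

```r
# Hypothetical four-row stand-in for the work salaries CSV
csv_text <- "rank,discipline,sex,salary
Prof,B,Male,139750
AsstProf,A,Female,77500
AssocProf,B,Male,97000
Prof,A,Female,120000"

salaries <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(salaries)   # rank, discipline and sex come in as character

# Tell R that these three columns are categorical
salaries$rank       <- as.factor(salaries$rank)
salaries$discipline <- as.factor(salaries$discipline)
salaries$sex        <- as.factor(salaries$sex)
str(salaries)   # now factors with 3, 2 and 2 levels
```

With the real file, the `read.csv()` call would point at the CSV in the working directory instead of inline text.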
okay
um
Maybe the other thing is that I wanted to see how all these factors
influence somebody's salary, and in this case my salary is continuous, or
in this case is being read as an integer. We can convert the salaries into
a categorical variable, because we need to cross tabulate between two
categorical variables. So if you remember what we did in data
manipulation, I can categorize based on the median, or based on the mean,
depending on whatever you want. If I run line 78 it gives me summary
statistics of the variable salary, okay, with the minimum, the median, the
mean and the maximum.
When you come to line 80, I say that I want to create a new variable,
salary cut, within my salaries data set, and then I use an ifelse command.
Inside the ifelse command I need to put my test; when the test is passed
it will be the yes value, and when it is not passed it will be the no
value. So I considered 113706 as my mean. The test that I've put is that
if my salary is less than the mean, my yes will be low; otherwise label it
high. So I expect that in case I run this particular line I'll be able to
see a new variable being created.
So if I run line 80, okay, for me to check if it has been successful I can
type View salaries, and then I'm able to see: where I have got low, it
implies that that salary is less than the mean; otherwise R has recorded
it as high.
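The recoding step can be sketched on a few made-up salaries. The variable name `salary_cat`, the labels "Low" and "High", and the values themselves are assumptions for illustration; the session uses the mean of the real data as the cut point.

```r
# Hypothetical salaries (made-up values, not the course data)
salaries <- data.frame(salary = c(60000, 85000, 110000, 150000, 195000))

summary(salaries$salary)          # minimum, quartiles, median, mean, maximum
cutoff <- mean(salaries$salary)   # 120000 for these made-up values

# ifelse(test, yes, no): "Low" when below the mean, otherwise "High"
salaries$salary_cat <- factor(ifelse(salaries$salary < cutoff, "Low", "High"))
print(salaries)
```

Computing the cut point with `mean()` rather than typing the number avoids transcription slips if the data change.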
Okay, now let's go back. Remember, with categorical data we use table as
the command. So I can say salaries, dollar sign, rank; I can run that and
it will tell me that within the rank I've got three levels: associate
professor, which has 64 as its frequency, then assistant, 67, and then
Prof, 266.
Okay, now I also want to run the one for salary cut, which I've just
created. If I use table, it is showing me that the frequency for those
that have high salary is 168 and for the low salary it is 229. So can I
find out if there is a relationship between salary and rank? In this
particular case it's going to be rank and then salary category, because
salary cut is the one that is categorical.
All right, so for me to do a cross tabulation: if you saw the first
example, we had table and then we had put A and B. Remember these are my
variables, so instead of putting A and B, I'm going to consider my first
variable as salaries, dollar sign, rank (go to my salaries data set and
get rank for me), and then the other one is salaries, dollar sign,
salary cut. Both these variables are categorical.
okay let me check in the chat are we on
the same page
are you following what I'm doing
okay
great
Okay, so you can also do the same at your end. If I run line 89, I hope
you're seeing what we are having. One of my variables is rank, which has
three levels, and the other variable is salary category, which has two
levels, so I have a three by two matrix, or a three by two contingency
table. This is my cross tabulation, and it's telling us, if you read it
out, that we've got three associate professors that get high salary. Then
for assistant professor it is zero, no one gets high salary, and for the
case of full professor there are 165 that have high salary. When you come
to the low, you can say 61 associate professors get low salary, then 67
assistant professors have got low, and 101 professors get low salaries.
That is how we are reading it.
That is our table. Now, assuming that I want to put everything into an
object, I'm taking you back to the other example, the first example that
we had. So what about in case I get line 89 and I put it inside an object,
my table? It's going to give me the same result. I run that, and when I
run my table I'm getting the same.
Now I want to explain the concept of adding margins, or the margin tables.
This is the concept here: how do we add the margins? For example, in case
I want to get the marginal distributions for rank and the marginal
distributions for salary cut.
So this is what I'm having. I use the command addmargins; inside
addmargins I put my object, which is my table, and then margin is equal to
c, inside one and two. One, these are the rows, and two, the columns. I'm
telling R that I need the marginal sums for the rank, that is number one,
and for the columns, for the salary cut, that is the two. So I'm going to
get the marginal sums for both the rows and the columns. This is what I
was explaining earlier in the other example, but hopefully here it will
come out very well. So that is line 94. If I run that, this is what I'm
getting.
um
Let's complete this part before we get a break.
So if you're seeing the Sum now, the marginal distributions: we have 3
plus 61, so we are having a total of 64 associate professors, and then 266
professors. In case we now sum column-wise, that is the two, across our
ranks we've got 168 with high salary and 229 with low salary.
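The three by two table and its margins can be reproduced from the counts read out in the session. The object names and the level spellings (`AssocProf`, `AsstProf`, `Prof`, `High`, `Low`) are assumptions; with the real data the table would come from `table()` on the two columns.

```r
# Rank by salary category, typed in from the counts quoted in the session
rank_by_salary <- as.table(matrix(c(  3,  61,
                                      0,  67,
                                    165, 101), nrow = 3, byrow = TRUE,
                   dimnames = list(rank = c("AssocProf", "AsstProf", "Prof"),
                                   salary_cat = c("High", "Low"))))

# Add a "Sum" row and a "Sum" column (margins 1 and 2)
with_totals <- addmargins(rank_by_salary, margin = c(1, 2))
print(with_totals)
# Row totals: 64, 67, 266; column totals: 168, 229; grand total: 397
```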
How about the proportions, the prop.table, the one that I mentioned up
there, where I said cell percentages? We need to get the cell percentages
for each, so we use what we call prop.table. When I run this particular
line, this is going to be my output.
What we are having here, I'm explaining how to get a cell percentage: we
are taking up everything, so here it will be 3 over the grand total, which
is 397; these are the total number of observations. It's going to be the
same even here: it will be 61 divided by our full total, the grand total.
Those are the cell percentages. It's the same story here, 0 over the grand
total, and when you come to 165, divide by 397. Everything is divided by
our grand total; that's going to be our prop.table.
Okay, what about in case of row percentages? For the case of row
percentages it is going to be with respect to a particular level. If we
are now considering row percentages, it implies that we will be dividing
by the marginal sum that we've got. For example, in this particular case,
if we are considering associate professors: 64, 67, 266, these are the
marginal distributions, the total frequency of each particular level.
So in this particular case, if we are to pick the row percentage, it is
going to be three divided by the row total, 64, so it is different from
the cell percentage, and then here it will be 61 divided by 64. Those are
the row percentages. In case of column percentages, you divide by the
total column marginal value for each respective cell.
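To see how the denominators differ on the rank by salary table, here is a short sketch using the counts quoted in the session. The object names, level spellings and one-decimal rounding are assumptions.

```r
rank_by_salary <- as.table(matrix(c(  3,  61,
                                      0,  67,
                                    165, 101), nrow = 3, byrow = TRUE,
                   dimnames = list(rank = c("AssocProf", "AsstProf", "Prof"),
                                   salary_cat = c("High", "Low"))))

cell_pct <- round(100 * prop.table(rank_by_salary), 1)     # divide by 397
row_pct  <- round(100 * prop.table(rank_by_salary, 1), 1)  # divide by 64, 67, 266
col_pct  <- round(100 * prop.table(rank_by_salary, 2), 1)  # divide by 168, 229

print(cell_pct)  # AssocProf/High: 3/397, about 0.8
print(row_pct)   # AssocProf/High: 3/64, about 4.7; AssocProf/Low: 61/64, about 95.3
print(col_pct)   # AssocProf/High: 3/168, about 1.8
```

Here the row percentages answer "within each rank, what share earns above the mean?", which is usually the comparison of interest for this table.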
Let me check in the chat whether we are on the same page.
is it clear
okay
All right, now let's move on. Let's complete the last part very fast, and
then we'll see how to continue. Now remember my prop.table, where inside I
put my object. For me to see whether the two variables are dependent or
not, I use a chi-square in R.
wow
let's let's do for only five minutes and
then we go for a break please
We need to complete this part, and then we come back and do the last bit.
So we're going to use the chi-square. Let's first do a chi-square test: my
first variable is rank and the other variable is salary cut. If I run
line 98, okay, can somebody tell me what we are seeing?
Within our output we have Pearson's chi-squared test. We are given our
data: we have salaries, we have rank and salary cut. Our chi-square is
128.62, we've got two degrees of freedom (I believe now we know why we
have two), and we have our p-value as less than 2.2 times 10 to the power
negative 16. Okay, can somebody tell me what conclusion you can come up
with, depending on the results that we have?
Okay, so there is an association between rank and salary cut. Great.
So we can conclude that there is an association. From line 98, I can put
everything inside an object called test.
So here, if I run that same line again, I've not changed anything, and
then I type out my object test, it will give me a similar output. From
there you can type test, dollar sign, statistic, and it will give you the
chi-square value, in case you want to split things out. Everything you
need to come up with that conclusion is being given: you can check the
p-value, the expected values, the observed values and the method that is
being used. These other bits you can also go ahead and check.
Is there any relationship between sex and salary category? Still it's a
chi-square that will tell you. Another question is: is there any
relationship between discipline and salary category? Let's run line 113
and see. Okay, what answer would you give? Is there any relationship, is
there any association?
Okay, yes, there is. That is great. What about line 110? What would you
conclude with this particular result that we have? Can you type in the
chat what conclusion you can come up with?
Can I get more answers? We accept the null hypothesis, great, that there
is no association; in other words, we fail to reject H naught. And so on
for the rest of the other bits.
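The test on rank and salary category can be reproduced from the counts quoted earlier in the session. The object names and level spellings are assumptions; with the real data the two factor columns would be passed to `chisq.test()` directly.

```r
rank_by_salary <- as.table(matrix(c(  3,  61,
                                      0,  67,
                                    165, 101), nrow = 3, byrow = TRUE,
                   dimnames = list(rank = c("AssocProf", "AsstProf", "Prof"),
                                   salary_cat = c("High", "Low"))))

test <- chisq.test(rank_by_salary)
print(test)

# The pieces of the result can be pulled out one at a time
test$statistic   # chi-square value, about 128.62
test$parameter   # degrees of freedom: (3 - 1) * (2 - 1) = 2
test$p.value     # far below 2.2e-16, so independence is rejected
test$expected    # expected counts under independence, all above 5 here
test$observed    # the observed counts that went in
test$method      # "Pearson's Chi-squared test"
```

Although one observed cell is zero, every expected count is well above five, which is consistent with no warning appearing in the session.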
Maybe the last part: it's on line 122. In case you want to consider three
categorical variables, what do you do? We still use the same table
command. You want to check rank, discipline and then sex. When you run
that particular line, and run line 123, you are also able to tell; we've
grouped according to the disciplines. For the associate professors who are
doing Sciences, we have four females and 22 males. Then for those that are
doing Humanities we've got six females, and then the males are 32. When
you come to the professors that have got Humanities as their discipline,
there are eight females and then we've got 123 males. For the case of
those that are doing Sciences, we've got 10 females and 125 males. The
rest you can practice on your own. We can stop there on the part of the
categorical data.
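A minimal sketch of a three-way table on simulated data. Everything here is hypothetical (the data frame `d`, its size, the seed and the level names), so the counts will not match those read out above; the point is only the shape of the call and of the output.

```r
set.seed(1)   # assumption: seed added for reproducibility
n <- 200
d <- data.frame(
  rank       = factor(sample(c("AssocProf", "AsstProf", "Prof"), n, replace = TRUE)),
  discipline = factor(sample(c("A", "B"), n, replace = TRUE)),
  sex        = factor(sample(c("Female", "Male"), n, replace = TRUE))
)

# Three factors give a 3 x 2 x 2 array: one rank-by-discipline layer per sex
three_way <- table(d$rank, d$discipline, d$sex,
                   dnn = c("rank", "discipline", "sex"))
print(three_way)

# ftable() flattens the layers into a single, easier-to-read display
print(ftable(three_way))
```

The order of the arguments decides the layout: the third variable defines the layers, so swapping the order regroups the same counts.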
okay
any question
before I go for the break
find me Lucia today's material well in
this particular case we are not going to
use the the fishes test because we've
not got any error when we're using the
chi-square but in case we come you come
across something like that then you can
use the the fish egg fish exact test
break time please okay
okay let's take a break for 10 minutes
and then we join again at 10 at 4
4 11. thank you
okay thank you
All right, welcome back. Now let's get a feel for an overview of sampling
surveys. We're going to be discussing, because I believe that you've also
encountered whatever I'm going to discuss with you. So in case we reach
somewhere and I seek your views, feel free to participate, for those of
you that have done sampling surveys: you've gone to the field, you've
tried to collect data. We all go through different experiences when we are
collecting data.
Now I want us to recap a little bit on what a population is and what a
sample is. Sometimes you find that the population is very big, and there
are some factors, in terms of time and in terms of cost, that will not
allow you to reach the whole population. You cannot reach everyone unless
you're having a population that is very small, or, for example, in a
census, where you go in and consider every particular individual.
So in that case what we usually consider is what we call a sample. A
sample is just a subset of a population, and when we collect data using a
sample that we picked from a population, we have to ensure that it gives
us a similar representation of what the population looks like, to avoid
bias. From the collected data of a sample we go ahead and perform
analysis, but at the end of the day we have to keep in mind the issue of
probability. We make inferences from the data we've collected, and they
should reflect the population. So the circle moves on and on; that is what
is happening with sample surveys: what is your population, what is your
sample, et cetera.
So, as already mentioned about population surveys, they are used to draw
an inference about a population. Okay, maybe there are different things
that you really want to draw conclusions on, for example, if you're
dealing in business, the sales, the income, the behavior, the satisfaction
of the people and so on. So instead of going
positive
yeah posting so
you give them an example button
right now
okay let's continue
So when we move ahead: as far as a good survey is concerned, most of the
time it has to really give us a clear representation of what is happening
with the underlying population, and we do that to see that we avoid bias
in the selection. The characteristics that we pick within a sample should
be the same characteristics that are within the population of interest, or
under study, to avoid bias in the selection. This is being done through a
random process, to ensure that there is no preference in the treatment of
any specific subject that we are considering.
Okay, but nevertheless, like we've been hearing last week with the
experimental errors, in surveys we also face what we call sampling errors.
These are unavoidable; I mean, you cannot do away with them, because the
point is that not all the elements of the population are selected. So
there is a possibility that the estimate that we've got from the sample
doesn't exactly match the population value. That discrepancy between the
estimated value and the true population value is what we call a sampling
error.
Okay, as we move on we are going
to say that, most of the time, depending
on the literature, we have to ensure that
we minimize the sampling error and
make it as small as possible. Most of the
literature, as we'll see,
considers it to be plus or minus five
percent. But what we have to note from
this particular slide
is that it is an unavoidable error,
which some people call an acceptable error,
and it looks at the difference between
what is estimated, or what we've computed,
and the true population values.
Okay, there are many factors that
have an influence when you're going to do a
sample survey. Like I mentioned,
sometimes it could be due to the funding
or due to the timing, but what is most
important is that the sampling design
that we select should enable us to
minimize the sampling errors.
Most of the time, like I
mentioned, we ensure that we've got a low
sampling error, and this could be plus
or minus five percent. Sometimes it
is referred to as the margin of
error when you check in the
literature, and there are already
standard estimates that are
put in place when we are determining
the sample size
and the margin of error that we can
use. Of course there's a lot of
literature out there.
Okay, so take this particular
example, an opinion poll.
It was a survey that reported that 29 percent of
adults considered AIDS
an urgent problem, and the margin
of error in this case is plus or
minus three percent. How was the data
collected? It was based on telephone
interviews. Remember, we
collect data differently: some use
questionnaires, some use telephone
interviews, depending on what kind of
data you're collecting.
Is it qualitative, is it
quantitative? If it is
qualitative, are you going to use
key informant interviews, are you going
to use
focus group discussions, etc.? But
in this particular case, and we're back
to the example, we are considering a
telephone interview as the
method of data collection that was used,
and our n is 872 adults. Okay, so what key
things should we note from here? We've
got
the sampling error, which is the same
as the margin of error; in this case they
considered plus or minus three percent.
We've got the point
estimate, which is
29 percent. Our study population is
shown. We've got the sample size,
which is 872. Of
course, like I mentioned,
we cannot go in for the full population,
so we pick a sample size that
will give us
an underlying representation of the
population of
interest, the one that we want to
consider. Going back to this
particular example, we can see
that the outcome
can be considered binary in
nature: we have a yes and a no,
because the question was asking
whether HIV is still an urgent problem, so
somebody could either say yes or no. And
from our output in the previous slide,
those that responded yes were 29 percent, so
that is the probability of those that
responded that it was an urgent
problem, so it is 0.29. Then
when you take one minus that, you get
those that responded no to be
71 percent, which is 0.71. We can
go further, since
we are given n as 872:
we can multiply it by the different
probabilities and come up with
the number of respondents in each group.
The output that we've got
here was already
condensed from
the data that was collected,
so this point estimate was just
calculated the way we did
from a frequency table, where you
calculate the relative frequencies, and
by the end of the day they came up with
this particular
conclusion.
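As a rough illustration of that multiplication step, here is a minimal sketch in Python (the training itself uses R; Python and the function name `approx_counts` are assumptions made only for this example). Because the published 29 percent is itself a rounded figure, the counts it gives back are approximate.

```python
def approx_counts(n, p_yes):
    """Approximate number of yes/no respondents from a reported proportion."""
    yes = round(n * p_yes)  # 872 * 0.29 is about 252.9
    no = n - yes            # everyone else answered no
    return yes, no


yes, no = approx_counts(872, 0.29)
print(f"yes: about {yes}, no: about {no}")
```

Going the other way, dividing each count by 872 gives back the relative frequencies of roughly 0.29 and 0.71.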
Okay, I'm just
giving a summary of the
other slide that we had. So
one was yes, they said it was an urgent
problem, so it can be coded like
that within a questionnaire, and then
zero as no.
All right, but now, we saw that
29 percent of the individuals said that it
was still an urgent problem. How
confident can we be? Most of the
time, when we are trying to check on
the degree of confidence,
as in most basic statistics
books, we construct a 95 percent
confidence interval around our estimate.
Here our estimate is
the 29 percent that we have. And when
you check in most basic statistics books,
I guess you've come across this: our
estimate, plus or minus z at
the 0.05 level,
times the standard error
of our estimate.
We get this z
value from our standard
normal tables. For
the case of the 95 percent confidence interval,
the z value, when you read it,
will always be 1.96; you can check, these
are just standard values.
When we want to get the standard
error for our estimate,
the point estimate
of 29 percent,
it is the square root of p times q over n, where p is
the probability of success and q the probability of failure.
So we've got p, we've got q, and n we
also have, along with our estimate, so it's a matter
of fitting everything into
this particular
expression.
In case you feed everything in,
we see that this particular section
here gives 0.03 when you
compute it; it's the one that was
reported when they were
coming up with the conclusion. And then with our
estimate
of 0.29, we can be 95 percent
confident that the true proportion, which we estimated at
0.29, lies between 0.26, when you subtract,
and 0.32, when you add.
But what is interesting,
something that you can pick from
this particular slide, is that we've got
this n, which is our sample
size. With this approach,
the p and the q are the
probabilities or proportions that we can
pick, and we can also
rearrange the expression to
calculate our n, the sample size
that was used.
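The interval just described can be checked with a short sketch. This is an illustration in Python rather than the course's R script, and the function name `proportion_ci` is hypothetical; it uses the usual normal approximation, estimate plus or minus z times the square root of p times q over n, with z = 1.96 for 95 percent.

```python
import math


def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    q_hat = 1.0 - p_hat                # probability of "failure"
    se = math.sqrt(p_hat * q_hat / n)  # standard error of the estimate
    margin = z * se                    # margin of error (sampling error)
    return se, margin, (p_hat - margin, p_hat + margin)


se, margin, (low, high) = proportion_ci(0.29, 872)
print(f"standard error  = {se:.4f}")
print(f"margin of error = {margin:.3f}")
print(f"95% interval    = ({low:.2f}, {high:.2f})")
```

Passing a different z (for example 2.576 for 99 percent) widens the interval for the same data.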
Okay, so from what we have we can
conclude that, from this particular
expression, mathematically we have the
estimate plus or
minus the margin of error. Remember,
the margin of error is
what you can also term the
sampling error.
all right
okay
On this particular bit, from the
example that we had, if I only pick the
part of the margin of error, which
was given to us as 0.03, I'm
saying that you can manipulate this
particular expression and come up with n.
We just feed
in these particular values and
make n the subject, the 872 which is
your n here. In case you
feed in these values, we know
this is our z value considering the 95 percent
confidence interval. However, you can also
go ahead and consider a different interval:
if you don't want 95 percent you can
consider 90 or 99, it is whatever
you want, and when you go to R you
can adjust the default value to
any
confidence level that you want.
Most of the time in the
literature they take 50/50 with plus or
minus 5 percent, and these are
proportions: the probability of success and
the one of failing. If you've
been very keen, most of the time
in the literature we can use this particular
formula to calculate our sample size,
provided we know p
and we know
q.
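The rearrangement being described can be written out step by step. The symbol e for the margin of error is a label chosen here for illustration; p, q, n and z are as used in the session.

```latex
\begin{align*}
e &= z\,\sqrt{\frac{p\,q}{n}} \\
e^{2} &= z^{2}\,\frac{p\,q}{n} \\
n &= \frac{z^{2}\,p\,q}{e^{2}}
\end{align*}
```

With z = 1.96, p = 0.29, q = 0.71 and e = 0.03 this gives n of about 879, close to the 872 in the example; the small gap is presumably because the reported margin of error was rounded to 0.03.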
Okay, so
in this particular example,
in case we guess, or we
know, our estimated value for p,
all we need now is to
estimate the
required sample size, given that we've
specified the margin of error. Like I
mentioned, in the literature
most studies
use plus or minus five percent,
but you can also go ahead and use
one that is lower; we always want to
use
a small
sampling error. You can also
play around and check
whether you need a
large sample or a small
sample. What I've just
illustrated here is a formula for the
sample size in the case of proportions, and
the point is, if you have an
estimated value of p, that is
the probability of success, then you can
get the failure and compute
your sample size.
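A minimal sketch of that sample size calculation, assuming Python instead of the course's R script (whose contents are not shown here) and a hypothetical function name. It rounds up, since a fraction of a respondent cannot be interviewed.

```python
import math


def sample_size_for_proportion(p, margin, z=1.96):
    """Sample size needed to estimate a proportion p within +/- margin."""
    q = 1.0 - p  # probability of "failure"
    return math.ceil(z ** 2 * p * q / margin ** 2)


# The common 50/50 assumption with a 5 percent margin of error
print(sample_size_for_proportion(0.5, 0.05))

# The opinion poll values: p = 0.29 with a 3 percent margin of error
print(sample_size_for_proportion(0.29, 0.03))
```

Playing around with the inputs shows the trade-off: halving the margin of error roughly quadruples the required sample, and p = 0.5 gives the largest n for a given margin.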
okay
All right, some
questions that you can explore in
case you want to find out
your sample size. You can ask yourself:
what is your target population?
If
you're going to carry out a particular
research and you want to do a survey,
what is going to be your target
population? For
example, are you going to use
students at a university? Are you going
to consider all the students within the
university? Then
what is going to be your sample size,
or the sample that you're
going to pick, given that you cannot
consider everyone? And how are you going
to reach the particular people that you
want
to do research on? You also have to ask how
reliable are the estimates that you're
going to use, and the precision of
the estimates that you
want; how much confidence
can you put in them, for example
95 percent confidence. And ensure that
the sample that you're
going to use clearly depicts the
population that you have in mind.
All these are the questions that you
need to explore as you
go ahead to determine
your sample size.
All right, there's also what we call
types of
survey designs. I don't know if you've
come across them. Of course there are
very many, but the common ones are
cross-sectional surveys and
longitudinal surveys. So can somebody
type for me in the chat: when they
talk about cross-sectional surveys, what
comes to your mind?
Cross-sectional surveys:
anybody who has done a cross-sectional
survey, tell us what it is.
Okay, "one-time sample collection",
"time series data", and I see what
you're telling us, "cross-sectional is a
one-time-off", okay, "data
collection in one area that is observed
in a fixed period of time".
All right,
"data collected once", great.
All right, okay, so
just supplementing
what you've shared: when we
come to cross-sectional surveys, this is
just data that is collected
at one point in time. Let's
take an example: maybe you want to
collect data on the
performance of university students.
You set up your
research objective and then you move on
to collect your data; after collecting
your data you do the analysis, then come up
with your conclusion. That is an example
of cross-sectional data: you
conduct it at just one point and
that is the end. You come up with your
conclusion, either answering your
objective or informing policy,
whatever it is. So
those are cross-sectional surveys. What
about longitudinal surveys?
Somebody to tell us what longitudinal
surveys are, before we
go into the different groups that
we can consider. Longitudinal?
Yes, I'm reading the comments in the chat,
and they are still about the cross-sectional;
thanks for contributing. And then
longitudinal... somebody's asking
for the registration link. I'm asking about
longitudinal surveys.
Okay, "data collection on a sample at
different points in time", great. "Using
repeated observation of the same
variable", okay. "Collecting data at
different times", great.
"Data from a period of time", "used for
the same individuals or units
repeatedly over an extended period, long
term", great.
Okay, I'm still going to supplement
what you've shared.
We see that with longitudinal surveys
this is just data that is collected
over time, a repeated number of times, and
it could be a trend, it could be a cohort,
it could be a panel. Under the trend we are
considering the same population at
different time points, the sample
population at different time points. For
example,
if we
talk about collecting different
samples at given times, that is
a trend: you want to see
how sales
are performing
over time. That would be
a trend, and most of
the time if you hear about time series
analysis, that's where we apply the
trend kind of survey. Then when it
comes to cohort, we are studying the
same population each time over a given
period. An example you can take here:
let's assume we've got students that
have reported to the university
in the first year, and you want to study
their performance until the
third year. In that case that is going
to be a cohort: you're using the
same students, for example
the ones in first year, and then
you go ahead and see them in second year
and then in third year. So that is the
same population over time, and so it is a
cohort. Then for the case of
a panel,
here we are studying the same sample of
respondents.
You're looking at a particular person
or a particular household. For example,
let's take
the example of child growth: you're
looking at the same child and seeing how
they are growing over time.
okay
all right
So we have different sampling procedures.
We have what we call
non-probability sampling and
probability sampling,
and within
these two types there are further
subdivisions. Under probability
sampling we have simple random, systematic,
stratified and cluster sampling, and all of these can
form multistage sampling. Under
non-probability sampling
you can have convenience, purposive,
snowball,
etc. So we're going to go ahead now and
discuss each one together and see what it
implies, and I'll be supplementing.
Okay, I've talked about this particular
slide, so we are going to skip it. There
are different advantages
of using a probability
sample, and also disadvantages
of using a non-probability sample,
so I'm also going to go ahead
past that.
So let's move on now to methods in
sample surveys. Let me ask in the chat:
what is simple random sampling? This
is one of the methods
of probability sampling.
What is simple random sampling?
Can somebody type in the chat?
We are discussing the first one, simple
random sampling, in case somebody has
conducted a survey. "Equal chances of
selection into the sample",
"selecting samples randomly", "every
event has an equal chance of being
selected", wow, wonderful.
"Ensuring that each sample has an equal
probability of being selected into a
sample". I'm just reading your comments
and they are all
interesting.
"Selecting a sample based on equal
probability",
"each element"... Anyway, the point is that
when we talk about simple random
sampling,
the whole thing is that
there is a
randomization process, and
each
element has an equal probability of
being selected from a list of
all the population units, like some of
you have contributed. So the point is
that each element has an equal
chance, or an equal probability, of being
selected
from the population units to form our
sample. Most of the time we denote
the sample size as small n and the
population size as capital N. In other words,
there is no preferential treatment
during the selection.
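To make the idea concrete, here is a minimal sketch of drawing a simple random sample in Python (not the course's R code). The list of 100 numbered students and the function name are made up for illustration; `random.Random.sample` draws without replacement, so every unit on the list has the same chance of ending up in the sample.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n units without replacement, each with equal probability."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, n)


students = list(range(1, 101))  # hypothetical sampling frame, N = 100
sample = simple_random_sample(students, n=10, seed=42)
print(sorted(sample))
```

Leaving `seed` as `None` gives a fresh random draw each time the script runs.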
okay
All right, now let's discuss the
second one, which is systematic sampling.
Can you type in the chat what
systematic sampling is? Thank you for the
responses,
I'm reading them in the chat. "In
simple random sampling there is an equal
chance of being selected", yes, so you
randomly select and everyone
has the same chance of being selected.
So now we are defining systematic
sampling.
What you're sharing is all right:
"sequential selection",
"elements are selected at regular
intervals using a predetermined pattern",
I agree with you. Then
"at every interval you select an item
randomly", "following a sequence", "selecting
participants at regular intervals",
and then, I'm going to read only the
last one, "starts with a number selected
randomly and then a sample selected at a
regular interval". Wonderful, okay.
Now let's move on with systematic
sampling. We are saying that,
like some of you have shared, it is
another method of random
sampling, but what is special with
the systematic method, like
you've mentioned, is that there is a
certain criterion that is
followed, whereby the first unit is
selected at random and then after
that a predetermined
pattern is followed. How do we determine
this pattern? Once
the first unit is
determined, we follow what we call a
sampling interval.
If I denote the sampling interval as k,
this is always equal to capital N, my
population size, over small n, that is
the sample size.
For
example, let's say you've got a
list of your students, let's
say a hundred of them, and you want to
know how many you can consider in a
particular study, and you want to use
systematic sampling. It is up to
you: from the
list, one to a hundred, you
can pick randomly the
first person that you want to consider,
but for you to come up with the
interval that you're going to consider,
you take the total, capital N, and
divide it by small n,
which is your sample size.
thank you
Okay, let's take this particular example
here. If my capital N is a
hundred and my small n is 10, for me to
come up with the sampling interval it's
going to be a hundred divided by 10, and
this is going to
give me my k as 10. It implies that
between 1 and 10 I can choose any
person to start with, and after that
I'll go on picking every
tenth person to include in my
sample. In this particular case,
for example, if I pick the seventh one, then I
count up to the tenth person after that, and this person
is included in my sample: so the 17th,
the 27th, up to the 97th.
Those people that
you'll be circling are the ones
that you'll be including in your
sample. Maybe you can ask yourself:
what about in case I'm not coming up
with an integer? Most of the time what
happens is that you round off to the
nearest integer, either below or above; it doesn't
matter, since you still stay
within the range that
you want. Let's consider, for example,
that you want your small n to be 175 and
your capital N, the population, is one
thousand. When
you compute the sampling interval, when you
divide, it's going to be 5.71.
So when you
round, you can
set five or you can round off to six.
If you consider five you're going
to use 200, and if you consider six you're
going to use
167 as the sample size that you're
going to consider in the study.
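Both worked examples can be reproduced with a short sketch, written in Python as an illustration (the function name is hypothetical and this is not the course's R script). It computes k as N over n rounded to the nearest whole number, takes a start between 1 and k, and then picks every k-th position.

```python
import random


def systematic_sample(N, n, start=None, seed=None):
    """Return the interval k and the selected positions (1-based) out of N."""
    k = round(N / n)  # sampling interval, rounded to a whole number
    if start is None:
        start = random.Random(seed).randint(1, k)  # random start in 1..k
    return k, list(range(start, N + 1, k))


# N = 100, n = 10, starting from the 7th person
k, chosen = systematic_sample(100, 10, start=7)
print(k, chosen)

# N = 1000, n = 175: the interval 5.71 is not a whole number,
# so compare the sample sizes obtained with k = 5 and k = 6
for k in (5, 6):
    print(k, len(range(1, 1001, k)))
```

With k = 6 the exact count can be 166 or 167 depending on the random start, which is why the figure is quoted as roughly 167.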
okay
You can also ask yourself:
when do we use systematic
sampling?
One of the reasons could be in case
you have a small population, and
when the list that you want to
use is in roughly random order, and
when a full list of the population
does not exist. It is most of the time
preferred over simple random sampling
because it is easy to carry out: you
can easily form your frame and then
be able to include people within your
sample.
Okay, this is just an example that
we can consider, but we also
have this other example that we can
keep in mind. Here we have
a frame where capital N is a hundred, and
we want our sample size to be 20. So
it implies that my sampling interval is
going to be 100 divided by 20, and I end up
with five; that is my sampling
interval. It implies that I can choose
my first unit at random between one
and five, one and k,
because my k, remember, is
capital N over small n.
After choosing the first unit, say the fourth,
I'm going to pick every
fifth person: from 4 I'll count, and
whoever falls in the fifth position will
be included within my sample,
and that way I'll get the
people in
those positions, the ones that I will
take and include to make up my sample
of 20 in this particular
case. The blue ones are the ones that are
picked. For example, it starts from four;
you count
one, two, three, four, five, so nine falls
in the fifth position, and you pick
that particular person. You can also
use this kind of sampling
in a hospital setting, where
you're counting the patients
that are coming in, and you say, well, I'll
consider a patient
after every third person. So
you randomly choose the first one and
then you count, and the one that takes
the third position, you take that person.
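The same every-k-th rule also works when people arrive one at a time, as in the hospital case, where there is no full list in advance. Below is a small sketch assuming Python; the function name `every_kth` and the patient labels are made up for illustration.

```python
from itertools import islice


def every_kth(arrivals, k, start):
    """Take the start-th arrival (1-based) and every k-th arrival after it."""
    return islice(arrivals, start - 1, None, k)


# Frame of N = 100 with interval k = 5, starting from the 4th unit
sample = list(every_kth(range(1, 101), k=5, start=4))
print(len(sample), sample[:3])

# Hospital setting: every third patient, starting from the 2nd arrival
patients = (f"patient-{i}" for i in range(1, 11))  # hypothetical arrivals
print(list(every_kth(patients, k=3, start=2)))
```

Because `islice` consumes the arrivals lazily, the selection can be made as patients come in, without knowing the total number beforehand.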
Okay, now let's move on to this other one,
which is stratified sampling.
How do we define
stratified sampling?
Jen is saying "grouping elements in small
groups called strata, based on some
characteristics". Thank you, Cloud.
Any other person?
Stratified...
you're partitioning data, just giving a
hint.
"You stratify internally", great.
Then "grouping elements into
non-overlapping groups",
yeah, Martin, you're right. Then
"they are in terms of strata or layers",
that's great. "Selection following
categorization", all right.
To supplement what you've shared: with
stratified sampling, in this
case you're partitioning the population into groups,
which are called strata, and then
sampling is performed separately within
each stratum. They're not
supposed to be overlapping, as a
colleague has shared.
So there is no
overlapping; the strata have to be
mutually exclusive. For example, you
can consider urban and rural areas;
you can also consider, maybe,
economic categories,
putting them into different groups;
geographical regions, etc. Within a
stratum the units have to be homogeneous,
but across the strata they have to be
heterogeneous. Those are some of the
characteristics that we can pick.
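A small sketch of sampling separately within each stratum, assuming Python and made-up urban and rural household lists. Taking the same fraction from every stratum (proportional allocation) is an assumption added here for illustration; the session does not say how the sample should be split across strata.

```python
import random


def stratified_sample(strata, fraction, seed=None):
    """Draw a simple random sample of the given fraction within each stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, units in strata.items():
        size = max(1, round(len(units) * fraction))  # at least one unit per stratum
        sample[name] = rng.sample(units, size)
    return sample


strata = {
    "urban": [f"U{i}" for i in range(1, 61)],   # 60 hypothetical urban households
    "rural": [f"R{i}" for i in range(1, 141)],  # 140 hypothetical rural households
}
picked = stratified_sample(strata, fraction=0.10, seed=1)
print({name: len(units) for name, units in picked.items()})
```

Because the strata are mutually exclusive, no household can be selected twice, and every stratum is guaranteed to appear in the final sample.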
okay
Then with clusters:
remember, with the strata
we are
partitioning into groups, but when it
comes to clustering, in this
case we are taking the whole group,
a group of the population elements,
so the whole group acts as our
sampling unit, not a single element
like what we saw in
stratified sampling. So the clusters in this
case are our first
sampling units, the ones we choose
to consider.
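To show what it means for the whole group to be the sampling unit, here is a minimal sketch in Python with 10 made-up schools of which 3 are selected at random. The school and pupil names, and the function name, are hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters; every element of a chosen cluster is kept."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # sample cluster names
    return {name: clusters[name] for name in chosen}


# 10 hypothetical schools, each with 5 pupils
schools = {
    f"School {i}": [f"S{i}-pupil{j}" for j in range(1, 6)]
    for i in range(1, 11)
}
selected = cluster_sample(schools, n_clusters=3, seed=0)
for school, pupils in selected.items():
    print(school, len(pupils))
```

Note the contrast with stratified sampling: there, every group contributes some units; here, only the selected groups contribute, but they contribute all of their units.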
Okay, so as an example,
say you've got maybe 10
schools as your clusters, and you only choose to
consider three schools to include in your
study. But also, the schools that you
choose have to have
similar characteristics
across the
schools of interest that you're going to study.
Then we have multistage
sampling. Multistage sampling
combines
the other sampling methods
that we've talked about at different
stages. So you can
stratify; after
stratifying you do
cluster sampling; then from cluster
sampling you go ahead and do systematic
sampling. All these together form what we
call multistage sampling. Most of
the time
national-level
surveys are performed using
multistage sampling, and within this
design I'm saying that you
can include the different sampling
procedures of interest that you want
to consider. This is
another example: you can
move from the districts to
the villages, until you get to the
household level, and at these
different levels you can
combine the different sampling
procedures. The systematic, the stratified
and the cluster all come together
and form what we call multistage
sampling.
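The district, village and household example can be sketched as three stages. This is only an illustration in Python under stated assumptions: the districts, villages and households are made up, districts and villages are chosen by simple random sampling, and households are then taken systematically with interval k. A real national survey design would be considerably more involved.

```python
import random


def multistage_sample(districts, n_districts, n_villages, k, seed=None):
    """Stage 1: districts; stage 2: villages; stage 3: every k-th household."""
    rng = random.Random(seed)
    result = {}
    for d in rng.sample(sorted(districts), n_districts):  # stage 1
        villages = districts[d]
        for v in rng.sample(sorted(villages), n_villages):  # stage 2
            households = villages[v]
            start = rng.randint(1, k)                       # random start in 1..k
            result[(d, v)] = households[start - 1::k]       # stage 3, systematic
    return result


# 5 hypothetical districts x 4 villages x 20 households
districts = {
    f"D{d}": {
        f"D{d}-V{v}": [f"D{d}-V{v}-H{h}" for h in range(1, 21)]
        for v in range(1, 5)
    }
    for d in range(1, 6)
}
sample = multistage_sample(districts, n_districts=2, n_villages=2, k=5, seed=3)
for (district, village), households in sample.items():
    print(district, village, households)
```

Each stage could be swapped for another procedure, for example stratifying the districts by region before selecting them.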
Okay, so what I'm going to do, I'm going
to stop here, given that we have little
time left. We've
got some material that is left, but it is for you
to go ahead and read.
At the same time I would encourage you
to run the R script which shows how to
come up with a sample size.
They are not hard commands; given that now
you have a background in R, the moment
you go and run the sample size .R
script it should be very easy for you to
understand. And make use of the
notes; we are not going to take down the
materials now from the Google link, so
try to get back and also practice
throughout Professor Susan I'm going to
practice it
Okay, ladies and gentlemen,
thank you so much
for being patient for the five days.
We have a survey
that has been posted in the chat.
Please fill in the survey in the
next five minutes, and then after five
minutes we are going
to formally close.
So the survey has been
posted
in the chat.
Thank you very much.
yes
good evening all
right
thank you
I hope people are filling in the
survey which was shared.
And while we continue to fill in the
survey, I think Professor Rogelio is
in our midst. I would like to hand
over to him to use this opportunity
to give the closing
remarks.
professor agario
Okay, ladies and gentlemen, once again
thank you so much for making it to the
last day of the training. We
have another training coming up; I'm sure
you have registered.
I just hope that you look it up;
I think it will be after next
week, if I'm not mistaken.
So thank you so much for
being very engaging and for joining
in to participate in this. I would like
to thank our facilitators so
much for taking us through. Thank you so
much, Professor Susan; thank you so much,
Dr Odong and Dr Namaweji; thank you very
much for being
wonderful, great teachers. We
definitely will see you again in our
next training
when it comes up. Thank you so much.