Day 5 of Advanced Statistics and Experimental Design Training: Categorical Data Analysis, Chi‑Square Testing, and Survey Sampling Overview
Introduction
The fifth and final day of the World Bank‑funded Advanced Statistics and Experimental Design training took place online, hosted by the Center of Excellence in Agri‑Food Systems and Nutrition, Mozambique. Participants and facilitators gathered for a wrap‑up session that combined administrative updates with substantive statistical content.
Administrative Announcements
- Contact collection: Participants were asked to provide two phone numbers (WhatsApp and an alternative) for post‑training communication and future training invitations.
- Certificates: The team acknowledged delays in issuing certificates for the first two modules and assured that certificates for the current module would be released soon.
- Evaluation form: A short online evaluation form (Google Forms or SurveyMonkey) would be sent at the end of the session; participants were encouraged to complete it.
- WhatsApp groups: Two active WhatsApp groups already existed for peer support; facilitators would add any missing contacts.
- Future training: Facilitators expressed willingness to travel for face‑to‑face sessions in Mozambique if requested.
Categorical Data: Concepts and Visualization
- Definition: Categorical (qualitative) variables assign labels to observations (e.g., gender, marital status, preference categories). They can have two or more levels and are non‑numeric.
- Frequency tables: Counts of each category are tabulated; relative frequencies (percentages) are obtained by dividing by the total sample size.
- Visualization options:
  - Bar charts (vertical or horizontal)
  - Pie charts
  - Segmented bar charts (stacked bars)
  - Side‑by‑side bar charts for comparing groups
- Example: A sample of 40 students chose a preferred attribute (Rich, Happy, Famous, Healthy). Frequencies (7, 21, 4, 8) were converted to percentages for reporting.
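The frequency-to-percentage step in the student example can be sketched in R as follows. The counts are the ones quoted above; the object names (`counts`, `freq_table`) and the plot labels are illustrative choices, not taken from the session.

```r
# Counts from the example above: 40 students, one preferred attribute each
counts <- c(Rich = 7, Happy = 21, Famous = 4, Healthy = 8)

n <- sum(counts)                  # total sample size
percentages <- 100 * counts / n   # relative frequencies as percentages

# Frequency table with counts and percentages side by side
freq_table <- data.frame(
  attribute = names(counts),
  frequency = as.vector(counts),
  percent   = as.vector(percentages)
)
print(freq_table)

# Vertical bar chart of the percentages
barplot(percentages, ylab = "Percent of students", main = "Preferred attribute")
```

Passing `horiz = TRUE` to `barplot()` would give the horizontal variant, and `pie(counts)` the pie chart.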
Cross‑Tabulation and Marginal/Conditional Distributions
- Cross‑tabulation (contingency table): Summarizes the joint distribution of two categorical variables (e.g., gender vs. chance of becoming rich). The table can be 2×3, 3×3, etc., depending on the number of levels.
- Marginal distributions: Row or column totals (often expressed as percentages of the grand total) that describe the distribution of each variable on its own, ignoring the other.
- Conditional distributions: Percentages within a row (or column) that show the distribution of one variable given a specific level of the other.
- Interpretation: By examining marginal and conditional percentages, researchers can assess patterns such as whether males report a higher perceived chance of wealth than females.
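As a rough illustration of marginal versus conditional distributions, the sketch below builds a 2×3 contingency table in R. The counts are hypothetical (the session's actual gender data are not given in this summary), and the level names are assumptions made for the example.

```r
# Hypothetical counts (not from the training data):
# gender by perceived chance of becoming rich
tab <- as.table(matrix(
  c(30, 40, 30,
    20, 30, 50),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("Female", "Male"),
                  chance = c("Low", "Medium", "High"))
))

# Marginal distributions: each variable on its own
gender_marginal <- margin.table(tab, 1) / sum(tab)
chance_marginal <- margin.table(tab, 2) / sum(tab)

# Conditional distribution of chance given gender: each row sums to 1
chance_given_gender <- prop.table(tab, 1)
print(round(100 * chance_given_gender, 1))
```

In this made-up table, 50 % of males fall in the "High" category against 30 % of females, which is the kind of row-wise comparison the interpretation step relies on.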
Chi‑Square Test of Independence in R
- Purpose: Tests the null hypothesis that two categorical variables are independent (no association). The alternative hypothesis states that an association exists.
- Test statistic: \(\chi^2 = \sum \frac{(O - E)^2}{E}\), where \(O\) are the observed frequencies and \(E\) the expected frequencies calculated from the marginal totals (row total × column total ÷ grand total for each cell).
- Degrees of freedom: \((r-1)\times(c-1)\) for an \(r\times c\) table.
- Decision rule: Compare the calculated \(\chi^2\) value to the critical value from the chi‑square distribution, or use the p‑value. If \(\chi^2_{\text{calc}} > \chi^2_{\text{crit}}\), or equivalently p < α, reject the null hypothesis.
- Example: A 3×2 table of academic rank (Assistant, Associate, Professor) vs. salary category (High, Low) yielded \(\chi^2 = 23.13\) with 2 df, p < 0.001, indicating a statistically significant association between rank and salary category.
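The calculation behind the test can be traced by hand in R. This is a minimal sketch with hypothetical counts for a 3×2 rank-by-salary table; the data behind the 23.13 result are not given in this summary, so the numbers below do not reproduce it.

```r
# Hypothetical counts (NOT the data behind the 23.13 result above)
observed <- matrix(
  c(10, 30,
    20, 20,
    30, 10),
  nrow = 3, byrow = TRUE,
  dimnames = list(rank   = c("Assistant", "Associate", "Professor"),
                  salary = c("High", "Low"))
)

# Expected count per cell = (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

chi_sq  <- sum((observed - expected)^2 / expected)       # test statistic
df      <- (nrow(observed) - 1) * (ncol(observed) - 1)   # (r - 1) * (c - 1)
crit    <- qchisq(0.95, df)                              # critical value, alpha = 0.05
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

reject_null <- chi_sq > crit

# Cross-check against R's built-in test
builtin <- chisq.test(observed)
```

With these counts every expected frequency is 20, so the statistic works out to 20 on 2 degrees of freedom, well above the 5 % critical value of about 5.99.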
Practical R Workflow for Categorical Data
- Create factor variables: `factor()` converts character vectors to categorical factors.
- Build a data frame: `data.frame(var1, var2, ...)` stores all variables together.
- Cross‑tabulate: `table(var1, var2)` produces the contingency table.
- Add margins: `addmargins(table_obj, margin = c(1,2))` adds row and column totals.
- Proportions: `prop.table(table_obj)` gives cell proportions, `prop.table(table_obj, 1)` row proportions, and `prop.table(table_obj, 2)` column proportions; multiply by 100 for percentages.
- Chi‑square test: `chisq.test(table_obj)` returns the test statistic, degrees of freedom, and p‑value.
- Interpretation: Use the output to state whether the data give evidence against independence and discuss practical implications.
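Put together, the workflow might look like the sketch below. The raw responses are hypothetical and the variable names (`gender`, `prefer`, `survey`) are placeholders; only the functions named in the list above are used.

```r
# Hypothetical raw responses, one element per respondent (100 in total)
gender <- rep(c("F", "M"), times = c(50, 50))
prefer <- c(rep(c("Yes", "No"), times = c(35, 15)),   # answers of the 50 women
            rep(c("Yes", "No"), times = c(20, 30)))   # answers of the 50 men

# Factors stored together in one data frame
survey <- data.frame(
  gender = factor(gender, levels = c("F", "M")),
  prefer = factor(prefer, levels = c("No", "Yes"))
)

# Contingency table, with totals and row percentages
tab <- table(gender = survey$gender, prefer = survey$prefer)
print(addmargins(tab, margin = c(1, 2)))
print(round(100 * prop.table(tab, 1), 1))

# Chi-square test of independence
result <- chisq.test(tab)
print(result)
```

For a 2×2 table such as this one, `chisq.test()` applies Yates' continuity correction by default; `correct = FALSE` gives the uncorrected statistic.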
Survey Sampling Fundamentals
- Population vs. Sample: The population is the full set of interest; a sample is a manageable subset used for inference.
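- Sampling error and margin of error: Sampling error is the difference between a sample estimate and the true population value. The margin of error is the bound placed on that difference at a chosen confidence level, commonly ±5 percentage points.
- Confidence intervals: For a sample proportion \(\hat{p}\), a 95 % CI is \(\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}\). The Z‑value changes with the desired confidence level.
- Sample‑size formula for proportions: \(n = \frac{Z^2 p(1-p)}{E^2}\), where \(p\) is the anticipated proportion and \(E\) is the desired margin of error.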
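Both formulas are easy to evaluate in R. In the sketch below the survey result (120 "yes" answers out of 400) is hypothetical, and the planning value p = 0.5 is an assumption chosen because it maximizes p(1 − p) and so gives the most conservative sample size.

```r
z <- qnorm(0.975)   # about 1.96 for 95 % confidence

# Confidence interval for a hypothetical survey: 120 "yes" out of 400
n_obs <- 400
p_hat <- 120 / n_obs
moe   <- z * sqrt(p_hat * (1 - p_hat) / n_obs)   # margin of error
ci    <- c(lower = p_hat - moe, upper = p_hat + moe)
print(round(ci, 3))

# Required sample size for a margin of error of 5 percentage points
E      <- 0.05
p_plan <- 0.5                                    # conservative planning value
n_required <- ceiling(z^2 * p_plan * (1 - p_plan) / E^2)
print(n_required)
```

The sample-size formula gives roughly 384.1, which is rounded up to 385 respondents because a fractional respondent cannot be sampled.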
Types of Surveys
- Cross‑sectional: Data collected at a single point in time (e.g., a one‑off questionnaire on student performance).
- Longitudinal: Data collected at several points in time, including:
  - Trend surveys: A different sample is drawn from the same population at each time point.
  - Cohort surveys: A specific cohort (e.g., people born in the same year) is followed across periods; the individuals sampled may differ from wave to wave.
  - Panel surveys: The same households or respondents are surveyed repeatedly.
Sampling Techniques
| Technique | Key Idea | Typical Use |
|---|---|---|
| Simple Random Sampling | Every element has equal probability of selection | Baseline surveys when a complete sampling frame exists |
| Systematic Sampling | Select every k‑th element after a random start | Easy to implement with ordered lists |
| Stratified Sampling | Divide population into homogeneous strata, sample within each | Improves precision when strata differ markedly (e.g., urban vs. rural) |
| Cluster Sampling | Sample whole groups (clusters) and survey all members within selected clusters | Cost‑effective for geographically dispersed populations |
| Multistage Sampling | Combine two or more methods (e.g., stratify → cluster → systematic) | Large‑scale national surveys |
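Four of the techniques in the table can be sketched with base R's `sample()`. The sampling frame below (1,000 households in 20 villages, split into urban and rural areas) is entirely hypothetical, as are the sample sizes.

```r
set.seed(1)  # for a reproducible illustration

# Hypothetical sampling frame: 1,000 households in 20 villages
frame <- data.frame(
  id      = 1:1000,
  village = rep(1:20, each = 50),
  area    = rep(c("urban", "rural"), times = c(400, 600))
)

# Simple random sampling: 100 households, each equally likely
srs <- frame[sample(nrow(frame), 100), ]

# Systematic sampling: random start, then every k-th household
k <- nrow(frame) / 100
start <- sample(k, 1)
systematic <- frame[seq(start, nrow(frame), by = k), ]

# Stratified sampling: 10 % drawn separately within each area
stratified <- do.call(rbind, lapply(split(frame, frame$area), function(s) {
  s[sample(nrow(s), round(0.10 * nrow(s))), ]
}))

# Cluster sampling: pick 4 villages at random, keep every household in them
chosen  <- sample(unique(frame$village), 4)
cluster <- frame[frame$village %in% chosen, ]
```

A multistage design would chain these steps, for example stratifying by area, sampling villages within each stratum, and then sampling households systematically within the chosen villages.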
Closing Remarks
The session concluded with a reminder to complete the post‑training survey, an invitation to register for upcoming modules, and thanks to all facilitators (Prof. Rogério, Prof. Susan, Dr. Helen, Dr. Odong, Dr. Namaweji) and participants. The organizers emphasized that the skills covered (categorical data handling, chi‑square testing in R, and survey‑sampling design) equip attendees to conduct rigorous statistical analyses in agri‑food and nutrition research.
Participants left the training able to turn raw categorical data into tables, visualizations, and chi‑square tests in R, and to design surveys with appropriate sampling strategies for research in agri‑food systems and nutrition.