Advanced Statistics and Experimental Design: Day 4 Training Overview
Introduction
The fourth day of the World Bank‑sponsored Advanced Statistics and Experimental Design training was held online for participants from the Centre of Excellence in Agri‑Food Systems and Nutrition, Mozambique. The session combined a theoretical recap with hands‑on R programming, covering multiple regression, dummy variables, polynomial regression, and model diagnostics.
Recap of Simple Linear Regression
- Purpose: Model the relationship between a single response variable and one predictor.
- Key assumptions:
  - Response normally distributed with constant variance (σ²).
  - Residuals normally distributed, with mean zero, and independent.
  - Linear relationship between response and predictor.
- Diagnostic tools:
  - Residual‑vs‑fitted plot (checks constant variance).
  - Q‑Q plot (checks normality).
  - Residual‑vs‑time/order plot (checks independence).
- Remedies: Log‑transform the response, add quadratic/cubic terms, or include time as a factor.
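The recap's fit-check-remedy workflow can be sketched in base R. The built-in `cars` dataset below is a stand-in for the course data, used only to make the sketch runnable:

```r
# Fit a simple linear regression (cars: stopping distance vs. speed,
# a stand-in for the course dataset).
fit <- lm(dist ~ speed, data = cars)

# Base-R diagnostics: residual-vs-fitted, Q-Q, scale-location, leverage.
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

# Formal normality check of the residuals.
shapiro.test(residuals(fit))

# Possible remedies if assumptions fail:
fit_log  <- lm(log(dist) ~ speed, data = cars)          # log-transform response
fit_quad <- lm(dist ~ speed + I(speed^2), data = cars)  # add a quadratic term
```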
Multiple Regression
- Extends simple regression to one response and two or more predictors.
- Model form: Y = β₀ + β₁X₁ + β₂X₂ + … + ε.
- Assumption added: Predictors must not be highly correlated (no multicollinearity).
- Example used: Volume as a function of tree diameter and height.
- Coefficients interpreted by holding other variables constant.
- R² (adjusted) = 94.4 % → strong explanatory power.
- p‑values for both diameter and height < 0.05 → significant relationships.
- Degrees of freedom calculated as n – p (observations minus number of parameters).
- Sequential sums of squares show each predictor’s contribution.
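The volume example appears to correspond to R's built-in `trees` dataset (`Girth` is diameter; its adjusted R² is likewise 94.4 %), so the figures above can be reproduced as a sketch:

```r
# Multiple regression: Volume as a function of diameter (Girth) and Height,
# using R's built-in trees dataset (31 trees; Girth in inches, Height in feet,
# Volume in cubic feet).
fit <- lm(Volume ~ Girth + Height, data = trees)

summary(fit)      # coefficients, t- and p-values, R², adjusted R², F-statistic

# Residual degrees of freedom = n - p = 31 observations - 3 parameters = 28.
df.residual(fit)

anova(fit)        # sequential (Type I) sums of squares per predictor
</test-placeholder>
```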
Indicator (Dummy) Variables
- Needed when a predictor is categorical (e.g., food type, plant species).
- Coding rule: Number of levels – 1 dummy variables.
- Example: Three food types → two dummy variables (0/1 coding).
- Dummy variables allow inclusion of groups in regression; they affect the intercept and can be interacted with other predictors to obtain parallel or separate slopes.
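The "levels minus one" coding rule can be sketched with hypothetical food-type data; R's `factor()` builds the same dummies automatically:

```r
# Hypothetical data: three food types -> two 0/1 dummy variables.
d <- data.frame(
  yield = c(5.1, 6.2, 5.8, 7.0, 6.5, 7.4),
  food  = c("A", "A", "B", "B", "C", "C")
)

# Manual coding with ifelse(); level A is the reference group.
d$foodB <- ifelse(d$food == "B", 1, 0)
d$foodC <- ifelse(d$food == "C", 1, 0)

fit_manual <- lm(yield ~ foodB + foodC, data = d)   # manual dummies
fit_factor <- lm(yield ~ factor(food), data = d)    # equivalent factor coding

# An interaction with a continuous predictor gives separate slopes per group:
# lm(y ~ x * factor(group), data = ...)
```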
Polynomial Regression
- Used when the relationship is curved rather than linear.
- Model includes higher‑order terms: Y = β₀ + β₁X + β₂X² + β₃X³ + ε.
- Demonstrated with hardwood concentration vs. paper tensile strength:
- Linear model R² ≈ 0.30 (poor fit).
- Quadratic model R² ≈ 0.90 (substantial improvement).
- Residual diagnostics confirmed better fit, though a few outliers remained.
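The linear-vs-quadratic comparison can be sketched with simulated data (illustrative only, not the course's hardwood dataset):

```r
# Simulated curved relationship: strength rises then falls with concentration.
set.seed(1)
conc     <- 1:15
strength <- 10 + 6 * conc - 0.4 * conc^2 + rnorm(15, sd = 2)

fit_lin  <- lm(strength ~ conc)                 # straight line: poor fit
fit_quad <- lm(strength ~ conc + I(conc^2))     # quadratic term captures curve

summary(fit_lin)$r.squared    # low R² for the curved relationship
summary(fit_quad)$r.squared   # substantially higher R²

AIC(fit_lin, fit_quad)        # lower AIC favours the quadratic model
```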
Practical R Implementation
- Setup: Load required libraries (`car`, `psych`, `ggplot2`, etc.) and set the working directory.
- Data import: `read.csv()` for datasets such as `eggP.csv` (water uptake, food uptake, egg production) and `uptake.csv` (CO₂ absorption experiment).
- Exploratory plots: Scatter plots using base R `plot()` and `ggplot2` to visualise relationships.
- Correlation analysis: `cor()` and `cor.test()` to obtain Pearson coefficients and significance.
- Model fitting:
  - Simple linear: `lm(Y ~ X, data = …)`.
  - Multiple: `lm(Y ~ X1 + X2, data = …)`.
  - Polynomial: create squared/cubic terms and include them in `lm()`.
  - Dummy variables: create binary columns with `ifelse()` and include them.
- Model summary: `summary(model)` provides coefficients, t‑values, p‑values, R², adjusted R², and the F‑statistic.
- Diagnostics:
  - `plot(model)` → residual‑vs‑fitted, Q‑Q, scale‑location, and residual‑vs‑leverage plots.
  - `shapiro.test(residuals)` for normality.
  - `dwtest()` (Durbin‑Watson) for independence.
  - `vif()` to detect multicollinearity (VIF > 5 signals concern).
  - Outlier detection with Cook's distance (`cooks.distance()`).
- Model comparison: AIC and BIC values guide selection; lower values indicate a better trade‑off between fit and complexity.
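The diagnostic and comparison steps above can be sketched on the built-in `trees` data; the `car` and `lmtest` calls are guarded in case those packages are not installed:

```r
# Diagnostics for a fitted multiple regression (built-in trees data).
fit <- lm(Volume ~ Girth + Height, data = trees)

plot(fit)                        # four base-R diagnostic panels
shapiro.test(residuals(fit))     # normality of residuals

cd <- cooks.distance(fit)        # influence of each observation
which(cd > 4 / nrow(trees))      # common rule-of-thumb cutoff

AIC(fit)                         # lower = better fit/complexity trade-off
BIC(fit)

# Package-based checks (assumes car and lmtest are installed):
if (requireNamespace("car", quietly = TRUE))
  print(car::vif(fit))           # VIF > 5 signals multicollinearity
if (requireNamespace("lmtest", quietly = TRUE))
  print(lmtest::dwtest(fit))     # Durbin-Watson test of independence
```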
Model Building Considerations
- Variable selection: Keep predictors that are statistically significant and biologically/economically meaningful.
- Multicollinearity: If VIF is high, consider removing or combining correlated predictors, or use ridge/weighted least squares.
- Influential observations: Examine Cook’s distance; remove only if they unduly bias parameter estimates.
- Confidence intervals: Provide a range for each coefficient; if the interval includes zero, the predictor may not be significant.
- Future topics: Time‑series analysis, mixed‑effects models, and survey data analysis were mentioned as upcoming sessions.
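The confidence-interval check described above can be illustrated with `confint()`, again using the built-in `trees` data:

```r
# 95% confidence intervals for each regression coefficient.
fit <- lm(Volume ~ Girth + Height, data = trees)
confint(fit, level = 0.95)
# Each row gives the 2.5% and 97.5% limits for one coefficient; an interval
# that excludes zero corresponds to significance at the 5% level.
```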
Conclusion
The day equipped participants with a solid understanding of multiple and polynomial regression, the creation and use of dummy variables, and a complete workflow in R—from data import and exploratory analysis to model fitting, diagnostics, and selection criteria. Attendees left with practical scripts they can adapt to their own agri‑food and nutrition research projects.
Effective regression analysis hinges on choosing the right predictors, checking assumptions with diagnostic plots, and using information criteria to balance model complexity with explanatory power.