Understanding Correlation and Simple Linear Regression with R: Theory, Practice, and Model Validation


YouTube video ID: Fd1mnabOdAE

Source: YouTube video by RUFORUMNetwork


Introduction

The session introduced correlation and simple linear regression using the R programming environment. After a brief theoretical overview, participants applied the concepts to a real dataset (egg production, water uptake, and food uptake) and learned how to interpret statistical outputs.

Correlation Basics

  • Definition: Correlation measures the strength and direction of a linear relationship between two continuous variables, ranging from –1 (perfect negative) to +1 (perfect positive).
  • Types of correlation:
      ◦ Positive (points rise from bottom‑left to top‑right)
      ◦ Negative (points fall from top‑left to bottom‑right)
      ◦ No correlation (random scatter)
  • Interpretation of r (the rough guidelines used in the session):
      ◦ |r| ≈ 1 → near‑perfect linear relationship
      ◦ |r| > 0.5 → strong
      ◦ |r| ≈ 0.5 → moderate
      ◦ |r| < 0.5 → weak
  • Computation in R: cor(x, y, method = "pearson") for continuous data; method = "spearman" for ranked data.
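As a minimal sketch of these calls (using simulated values, since the session's actual egg‑production dataset is not reproduced here, and the variable names are illustrative):

```r
# Simulated stand-in for the session's dataset (hypothetical values):
# water uptake (ml/day) and egg production per hen.
set.seed(42)
water_uptake   <- runif(30, min = 100, max = 300)
egg_production <- 0.2 * water_uptake + rnorm(30, sd = 5)

# Pearson correlation for continuous data
r <- cor(water_uptake, egg_production, method = "pearson")

# Spearman correlation for ranked data
rho <- cor(water_uptake, egg_production, method = "spearman")

# cor.test() additionally reports a p-value and a confidence interval
ct <- cor.test(water_uptake, egg_production, method = "pearson")

r
ct$p.value
```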

Visualising Correlation

Scatter plots reveal the pattern:
  • Tight cluster around a line → strong correlation (e.g., r = 0.96).
  • More spread → moderate correlation.
  • Random cloud → weak or no correlation.
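A base‑R sketch of such a scatter plot, with simulated data standing in for the real measurements and axis labels chosen for illustration:

```r
# Simulated data (hypothetical; stands in for the session's dataset)
set.seed(1)
x <- runif(40, 100, 300)            # e.g. water uptake
y <- 0.2 * x + rnorm(40, sd = 5)    # e.g. egg production

# Scatter plot with the correlation in the title and a fitted line
plot(x, y,
     xlab = "Water uptake (ml/day)", ylab = "Egg production",
     main = sprintf("r = %.2f", cor(x, y)))
abline(lm(y ~ x), col = "red")      # overlay the least-squares line
```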

From Correlation to Causation

The instructor emphasized that correlation does not imply causation. Examples (bird weight vs. length, hospital births vs. bird births) illustrated that a statistical association alone cannot establish a causal link.

Simple Linear Regression

  • Goal: Fit a straight line Y = β₀ + β₁X + ε to predict a response (Y) from an explanatory variable (X).
  • Parameters:
      ◦ Intercept (β₀) – predicted Y when X = 0.
      ◦ Slope (β₁) – change in Y for a one‑unit increase in X; the sign indicates the direction of the relationship.
  • Estimation: Ordinary Least Squares (OLS) minimizes the sum of squared residuals (differences between observed and fitted values).
  • R Implementation: model <- lm(Y ~ X, data=dataset) followed by summary(model).
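A minimal fit along those lines might look like this (the dataset and column names are illustrative, not the session's exact ones):

```r
# Illustrative dataset: food uptake as the explanatory variable
set.seed(7)
dataset <- data.frame(food_uptake = runif(25, 50, 150))
dataset$egg_production <- 2 + 0.3 * dataset$food_uptake + rnorm(25, sd = 3)

# Fit Y = b0 + b1 * X by ordinary least squares
model <- lm(egg_production ~ food_uptake, data = dataset)

summary(model)   # coefficients, standard errors, p-values, R-squared
coef(model)      # just the estimated intercept (b0) and slope (b1)
```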

Interpreting Regression Output

  • Coefficients: Provide intercept and slope with standard errors and p‑values.
  • p‑value: Tests whether a coefficient differs from zero; p < 0.05 indicates a statistically significant relationship.
  • R‑squared (R²): In simple regression, the square of the correlation coefficient; it represents the proportion of variance in Y explained by X (e.g., R² = 0.956 → 95.6% explained).
  • Adjusted R²: Adjusts R² for the number of predictors (relevant in multiple regression).
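These quantities can all be pulled out of the fitted object programmatically; a sketch, again with simulated data so it is self‑contained:

```r
# Simulated data (hypothetical values) and a simple linear fit
set.seed(3)
x <- runif(30, 100, 300)
y <- 0.2 * x + rnorm(30, sd = 5)
model <- lm(y ~ x)

s <- summary(model)
s$coefficients                              # estimates, std. errors, t- and p-values
p_slope <- s$coefficients["x", "Pr(>|t|)"]  # p-value for the slope
s$r.squared                                 # proportion of variance explained
s$adj.r.squared                             # adjusted R-squared

# In simple regression, R-squared equals the squared correlation
all.equal(s$r.squared, cor(x, y)^2)
```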

Model Diagnostics & Assumptions

  1. Linearity – confirmed by scatter plot and correlation.
  2. Normality of Residuals – assessed with QQ‑plot or Shapiro‑Wilk test.
  3. Homoscedasticity (constant variance) – examined via Scale‑Location plot; random scatter indicates the assumption holds.
  4. Independence of Residuals – checked with residuals vs. order plot; lack of pattern suggests independence.
  5. Influential Observations – identified with residuals vs. leverage plot and Cook’s distance; points outside the dotted lines may unduly affect the model.
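In R, these checks can be sketched as follows (simulated data; plot(model) produces the four standard diagnostic plots mentioned above):

```r
# Simulated data (hypothetical values) and a simple linear fit
set.seed(9)
x <- runif(40, 100, 300)
y <- 0.2 * x + rnorm(40, sd = 5)
model <- lm(y ~ x)

# The four standard diagnostic plots: Residuals vs Fitted, Normal Q-Q,
# Scale-Location, and Residuals vs Leverage
par(mfrow = c(2, 2))
plot(model)

# Normality of residuals: Shapiro-Wilk test
# (p > 0.05 -> no evidence against normality)
shapiro.test(residuals(model))

# Influential observations via Cook's distance
cd <- cooks.distance(model)
which(cd > 4 / length(cd))   # a common rule-of-thumb cutoff
```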

If any assumption is violated, remedies include:
  • Transforming variables (log, square, etc.)
  • Switching to logistic regression for binary outcomes
  • Using polynomial terms for curvature
  • Applying mixed‑effects or time‑series models for correlated observations.
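For example, a log transformation of the response is a one‑line change to the model formula (illustrative data constructed so that residual variance grows with X on the raw scale):

```r
# Illustrative: a response that grows multiplicatively with x, so residuals
# fan out on the raw scale (heteroscedasticity)
set.seed(5)
x <- runif(50, 1, 10)
y <- exp(0.5 + 0.3 * x + rnorm(50, sd = 0.2))

raw_model <- lm(y ~ x)        # residual variance increases with x
log_model <- lm(log(y) ~ x)   # roughly constant variance on the log scale

summary(log_model)$r.squared
```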

Multiple Linear Regression

  • Extends simple regression to several predictors: Y = β₀ + β₁X₁ + β₂X₂ + … + ε.
  • Same assumptions apply, plus no multicollinearity among predictors (they should not be highly correlated).
  • Nested models can be compared with anova() (analysis‑of‑variance tables), and the significance of each predictor is examined via its p‑value.
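A sketch of a two‑predictor model, with a nested‑model comparison via anova() and a hand‑computed variance inflation factor (the car package's vif() computes this directly; variable names are illustrative):

```r
# Simulated predictors (hypothetical water and food uptake values)
set.seed(11)
water <- runif(40, 100, 300)
food  <- runif(40, 50, 150)
eggs  <- 1 + 0.1 * water + 0.2 * food + rnorm(40, sd = 3)

m1 <- lm(eggs ~ water)          # simple model
m2 <- lm(eggs ~ water + food)   # multiple regression
summary(m2)                     # p-value for each predictor

# Does adding 'food' significantly improve the fit?
anova(m1, m2)

# Variance inflation factor for 'water': 1 / (1 - R^2 from regressing it
# on the other predictor); values near 1 mean little collinearity
vif_water <- 1 / (1 - summary(lm(water ~ food))$r.squared)
vif_water
```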

Practical Workflow in R (Step‑by‑Step)

  1. Install & load required packages (ggplot2, car, lmtest, etc.).
  2. Set working directory and import the CSV dataset.
  3. Visualise data with plot() or ggplot().
  4. Compute Pearson correlation.
  5. Fit the linear model with lm().
  6. Summarise and interpret coefficients, p‑values, and R².
  7. Diagnose assumptions using plot(model) which produces the four standard residual plots.
  8. Address any violations (transformations, removal of influential points, or alternative modelling).
  9. Report key statistics: slope, intercept, p‑value, R², degrees of freedom, and diagnostic conclusions.
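The steps above can be condensed into a short script (file names, column names, and the data itself are illustrative; reading a real CSV with read.csv() would replace the simulated block):

```r
# Steps 1-2 would normally be library() calls and read.csv("your_data.csv");
# simulated data keeps this sketch self-contained.
set.seed(2024)
dat <- data.frame(water_uptake = runif(30, 100, 300))
dat$egg_production <- 0.2 * dat$water_uptake + rnorm(30, sd = 5)

plot(dat$water_uptake, dat$egg_production)              # step 3: visualise
r <- cor(dat$water_uptake, dat$egg_production)          # step 4: correlation
model <- lm(egg_production ~ water_uptake, data = dat)  # step 5: fit
summary(model)                                          # step 6: interpret
par(mfrow = c(2, 2)); plot(model)                       # step 7: diagnostics
# Steps 8-9: act on any violations, then report slope, intercept,
# p-value, R-squared, and degrees of freedom
summary(model)$fstatistic["dendf"]                      # residual df
```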

Reporting Results

When writing a project report, include:
  • The scatter plot showing the linear trend.
  • The correlation coefficient and its significance.
  • The regression equation with estimated β₀ and β₁.
  • The p‑value for the slope (and intercept if relevant).
  • R² (and adjusted R² for multiple regression).
  • A brief statement on whether the model assumptions were met.

Closing Remarks

The training emphasized that statistical software is a tool; understanding the underlying mathematics and assumptions is essential for credible inference. Continuous practice with real datasets solidifies these concepts.

Correlation quantifies linear association, while simple linear regression models that relationship, estimates its parameters, and validates assumptions; mastering both in R equips researchers to draw reliable, interpretable conclusions from their data.
