Day Two R Training: From Data Manipulation to Exploratory Data Analysis

YouTube video ID: GhHKfGNj7HA

Source: YouTube video by RUFORUMNetwork

Introduction

The second day of the training resumed where Day 1 left off, focusing on practical data manipulation in R and RStudio and introducing exploratory data analysis (EDA) concepts.

Homework Review

  • Participants were invited to present their homework.
  • Mona demonstrated joining two data frames using left, right, inner, and full joins, explaining the resulting row counts and column structures.
  • John attempted to share his screen but faced technical issues; the instructor proceeded with a live demonstration.
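
The four join types from the homework can be sketched with base R's merge(). The two data frames below (staff and pay) and their values are hypothetical stand-ins, not the data used in the session; they only illustrate how the row counts change with the join type.

```r
# Two small hypothetical data frames sharing the key column "id".
staff <- data.frame(id = c(1, 2, 3, 4),
                    name = c("Amina", "Brian", "Chen", "Dora"))
pay <- data.frame(id = c(3, 4, 5),
                  salary = c(52000, 61000, 48000))

inner_join_df <- merge(staff, pay, by = "id")                # ids present in both: 3, 4
left_join_df  <- merge(staff, pay, by = "id", all.x = TRUE)  # every staff row is kept
right_join_df <- merge(staff, pay, by = "id", all.y = TRUE)  # every pay row is kept
full_join_df  <- merge(staff, pay, by = "id", all = TRUE)    # ids 1 to 5

# Row counts differ by join type; each result has the columns id, name, salary.
sapply(list(inner = inner_join_df, left = left_join_df,
            right = right_join_df, full = full_join_df), nrow)
```

Unmatched rows are filled with NA, for example the salary of staff member 1 in the left join. The dplyr verbs inner_join(), left_join(), right_join(), and full_join() follow the same logic.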

Recap of Day 1 Topics

  • Installation of R and RStudio.
  • Understanding levels of measurement.
  • Importing data from CSV, Excel, SPSS, etc.
  • Setting the working directory and loading required packages.

Core Data‑Manipulation Techniques

  1. Creating New Variables
     • Example: salaries$half_salary <- salaries$salary / 2 adds a seventh column.
  2. Recoding Existing Variables
     • Creating a categorical salary band (low/high) based on the mean salary using ifelse().
  3. Renaming Columns
     • rename(salaries, Sex = sex, Experience = years_of_service) returns the data frame with the two columns renamed; assign the result to keep the change.
  4. Subsetting Rows & Columns
     • subset(salaries, rank == "Professor") extracts only professors.
     • select(salaries, -c(half_salary, salary_cut)) removes unwanted columns.
  5. Merging Data Sets
     • Demonstrated left, right, inner, and full joins with merge() and dplyr verbs.
  6. Exporting Results
     • write.csv(salaries, "salaries_clean.csv") saves the manipulated data for future use.
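
The steps above can be sketched end to end in base R. The salaries data frame below is a small hypothetical stand-in (four rows, invented values) that reuses the column names from the text. Treating salaries strictly above the mean as "high" is an assumption, as is writing to a temporary directory rather than the working directory.

```r
# Hypothetical stand-in for the salaries data; column names follow the text.
salaries <- data.frame(
  rank = c("Professor", "AssocProf", "Professor", "AsstProf"),
  sex = c("Male", "Female", "Female", "Male"),
  years_of_service = c(20, 8, 15, 3),
  salary = c(120000, 90000, 110000, 70000)
)

# 1. Create a new variable.
salaries$half_salary <- salaries$salary / 2

# 2. Recode: "high" if salary is above the mean (assumed cut-off), else "low".
salaries$salary_cut <- ifelse(salaries$salary > mean(salaries$salary), "high", "low")

# 3. Rename columns (base R counterpart of dplyr::rename()).
names(salaries)[names(salaries) == "sex"] <- "Sex"
names(salaries)[names(salaries) == "years_of_service"] <- "Experience"

# 4. Subset rows, then drop the helper columns.
professors <- subset(salaries, rank == "Professor")
salaries_clean <- salaries[, !(names(salaries) %in% c("half_salary", "salary_cut"))]

# 6. Export; row.names = FALSE avoids an extra unnamed index column in the file.
out_file <- file.path(tempdir(), "salaries_clean.csv")
write.csv(salaries_clean, out_file, row.names = FALSE)
```

Step 5 (merging) is omitted here because it needs a second data frame.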

Transition to Exploratory Data Analysis (EDA)

  • Purpose of EDA: Summarize main characteristics of a data set using numerical and graphical methods before formal modeling.
  • Historical Note: Concept introduced by John Tukey in the 1970s.

Variable Types and Their Implications

| Type | Examples | Typical Summaries |
|---|---|---|
| Quantitative – Discrete | Number of children, cigarettes per day | Frequency tables, bar charts |
| Quantitative – Continuous | Salary, height, weight | Mean, median, variance, histograms |
| Qualitative – Nominal | Gender, job category | Frequency tables, pie charts |
| Qualitative – Ordinal | Education level (low, medium, high) | Ordered bar charts, box plots |
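
In R, the distinction between nominal and ordinal variables is usually encoded with factors. The sketch below uses invented values and hypothetical variable names; it is one common way to represent the two qualitative types, not necessarily the approach shown in the session.

```r
# Hypothetical values for illustration only.
job_category <- factor(c("Clerical", "Manager", "Clerical", "Custodial"))  # nominal
education <- factor(c("low", "high", "medium", "low"),
                    levels = c("low", "medium", "high"),
                    ordered = TRUE)                                         # ordinal

table(job_category)   # frequency table; level order carries no meaning
table(education)      # counts are reported in the declared level order
education < "high"    # order comparisons are defined only for ordered factors
```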

Numerical Summaries

  • Measures of Central Tendency: mean, median, mode (median preferred when outliers are present).
  • Measures of Spread: range, inter‑quartile range (IQR), variance, standard deviation.
  • Outlier Detection: Values above Q3 + 1.5 × IQR or below Q1 - 1.5 × IQR are flagged as potential outliers; the instructor illustrated this with the years_of_school and salary variables.
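
The 1.5 × IQR rule can be worked through on a small vector. The salary values below are invented (in thousands) and include one deliberately extreme value. The variable names are illustrative.

```r
# Hypothetical salaries in thousands; 120 is a deliberately extreme value.
salary <- c(30, 32, 35, 36, 38, 40, 41, 45, 48, 120)

mean(salary)     # 46.5, pulled upward by the extreme value
median(salary)   # 39, barely affected by it

q1 <- quantile(salary, 0.25, names = FALSE)   # 35.25
q3 <- quantile(salary, 0.75, names = FALSE)   # 44
iqr <- q3 - q1                                # 8.75

lower_fence <- q1 - 1.5 * iqr                 # 22.125
upper_fence <- q3 + 1.5 * iqr                 # 57.125

# Values outside the fences are flagged; here only 120 qualifies.
outliers <- salary[salary < lower_fence | salary > upper_fence]
outliers
```

Note that boxplot() computes its box from hinges rather than from quantile()'s default method, so the two can disagree slightly on small samples.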

Graphical Summaries

  • Box Plots – Five‑number summary, visual outlier detection, skewness assessment.
  • Histograms – Show distribution shape; the salary variable displayed a right‑skew.
  • Bar Charts & Pie Charts – Used for categorical variables (gender, job category). Bar charts were preferred for readability.
  • Scatter Plots – Explore relationships between two quantitative variables; a regression line and confidence band were added with geom_smooth().
  • Customization – Changing orientation (horizontal = TRUE), colors, titles, and axis labels.
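
The plot types listed above can be sketched with base R graphics, with an optional ggplot2 version of the scatter plot. The employee data here is simulated and the column names (gender, education, salary) are assumptions; the real data set was not reproduced. The ggplot2 part runs only if that package is installed.

```r
# Simulated stand-in data; values and column names are assumptions.
set.seed(1)
employee <- data.frame(
  gender = rep(c("Female", "Male"), each = 30),
  education = sample(8:20, 60, replace = TRUE)
)
# A log-normal noise term gives the salaries a long right tail.
employee$salary <- 10000 + 2500 * employee$education +
  rlnorm(60, meanlog = 9, sdlog = 0.6)

# Box plot by group, drawn horizontally.
bp <- boxplot(salary ~ gender, data = employee, horizontal = TRUE,
              col = "lightblue", main = "Salary by gender")

# Histogram of a continuous variable.
h <- hist(employee$salary, main = "Distribution of salary", xlab = "Salary")

# Bar chart of a categorical variable.
barplot(table(employee$gender), main = "Gender", ylab = "Count")

# Scatter plot with a fitted least-squares line.
plot(salary ~ education, data = employee, main = "Salary vs. education")
abline(lm(salary ~ education, data = employee), col = "red")

# ggplot2 version: geom_smooth() adds the line and its confidence band.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(employee, ggplot2::aes(x = education, y = salary)) +
    ggplot2::geom_point() +
    ggplot2::geom_smooth(method = "lm", se = TRUE)
  print(p)
}
```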

Correlation and Relationship Assessment

  • Correlation coefficients (r) quantify the strength and direction of linear relationships:
     • salary vs. beginning_salary: r ≈ 0.88 (strong positive).
     • salary vs. years_of_school: r ≈ 0.66 (moderate positive).
     • salary vs. time_on_job: r ≈ 0.08 (weak).
  • Significance was tested via p‑values; for salary vs. beginning_salary, the p‑value was below 2.2e‑16 (R prints such values as "< 2.2e-16"), indicating a statistically significant relationship.
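
A correlation coefficient and its significance test can be obtained with cor() and cor.test(). The sketch below uses simulated data, so it will not reproduce the values reported above; whether the instructor used cor.test() specifically is an assumption.

```r
# Simulated data with a strong positive linear relationship (assumption).
set.seed(42)
beginning_salary <- runif(50, min = 15000, max = 40000)
salary <- 1.8 * beginning_salary + rnorm(50, sd = 5000)

# Pearson correlation coefficient.
r <- cor(salary, beginning_salary)

# Test of the null hypothesis that the true correlation is zero.
test <- cor.test(salary, beginning_salary)
test$estimate   # same value as r
test$p.value    # a small p-value is evidence against zero correlation
```

A significant p-value only indicates that the correlation is unlikely to be zero; the size of r is what describes the strength of the relationship.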

Practical Session Workflow

  1. Install & Load Packages – tidyverse, readxl, ggplot2, dplyr, etc.
  2. Set Working Directory – via RStudio menu or setwd().
  3. Import Data – read.csv("employee.csv") → data frame employee (474 rows, 10 variables).
  4. Inspect Data – head(), str(), summary().
  5. Convert Character Columns to Factors – employee$gender <- as.factor(employee$gender).
  6. Perform Summaries – numeric (summary(employee$salary)) and categorical (table(employee$gender)).
  7. Create Visualisations – bar plot for gender, pie chart for job category, box plot of salary by job, scatter plot of salary vs. education.
  8. Subset for Correlation Matrix – remove categorical columns, then cor(employee_continuous, use = "complete.obs").
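
Steps 3 to 8 (without the plots) can be sketched as a single script. Because employee.csv is not available here, the sketch first writes a tiny hypothetical file with six invented rows to a temporary directory. The column names are assumptions based on the variables mentioned in this summary.

```r
# Write a small hypothetical stand-in for employee.csv (invented values).
csv_path <- file.path(tempdir(), "employee.csv")
writeLines(c(
  "gender,job_category,salary,beginning_salary,years_of_school",
  "Male,Manager,57000,27000,15",
  "Female,Clerical,40200,18750,16",
  "Female,Clerical,21450,12000,12",
  "Male,Custodial,30000,15000,8",
  "Male,Clerical,45000,21000,15",
  "Female,Manager,62000,30000,19"
), csv_path)

# Import and inspect.
employee <- read.csv(csv_path, stringsAsFactors = FALSE)
head(employee)
str(employee)
summary(employee)

# Convert character columns to factors.
employee$gender <- as.factor(employee$gender)
employee$job_category <- as.factor(employee$job_category)

# Numeric and categorical summaries.
summary(employee$salary)
table(employee$gender)

# Keep only numeric columns, then compute the correlation matrix.
employee_continuous <- employee[, sapply(employee, is.numeric)]
cor_matrix <- cor(employee_continuous, use = "complete.obs")
round(cor_matrix, 2)
```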

Assignment for the Next Session

  • Exercise 1: Using the employee data set, display the relationship between two quantitative variables (e.g., salary vs. education) with an appropriate plot.
  • Exercise 2: Show the relationship between two qualitative variables (e.g., gender vs. job category) using a bar chart or mosaic plot.
  • Exercise 3: Visualize a qualitative‑quantitative relationship (e.g., salary distribution across gender) with a box plot.
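
As a starting point for Exercise 2, two qualitative variables are usually cross-tabulated before plotting. The sketch below uses invented vectors rather than the employee data set, and is only one possible approach.

```r
# Invented values for illustration; not taken from the employee data set.
gender <- c("Female", "Male", "Male", "Female", "Male", "Female", "Male", "Male")
job_category <- c("Clerical", "Manager", "Clerical", "Clerical",
                  "Custodial", "Manager", "Clerical", "Manager")

# Two-way frequency table: rows are gender, columns are job category.
counts <- table(gender, job_category)
counts

# Grouped bar chart: one cluster per job category, one bar per gender.
barplot(counts, beside = TRUE, legend.text = rownames(counts),
        main = "Job category by gender", ylab = "Count")

# Mosaic plot: tile areas are proportional to the cell counts.
mosaicplot(counts, main = "Gender vs. job category", color = TRUE)
```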

Closing Remarks

  • Participants were encouraged to practice the commands, review the YouTube recordings, and explore the Google‑Drive resources.
  • The next session will be led by Dr. Thomas Odong, covering advanced EDA techniques and statistical modeling.
  • Attendance was recorded automatically via Zoom; no additional registration was required.

Key Takeaways

  • Mastery of basic data‑manipulation (creating, recoding, renaming, subsetting, merging) is essential before any statistical analysis.
  • Understanding variable types guides the choice of descriptive statistics and visualisations.
  • EDA provides critical insight into data quality (outliers, skewness) and informs the selection of appropriate analytical models.

Effective data manipulation in R sets the foundation for robust exploratory analysis; by correctly handling variables, detecting outliers, and visualizing patterns, researchers can make informed decisions about the statistical techniques that best suit their data.
