Day Two R Training: From Data Manipulation to Exploratory Data Analysis
Introduction
The second day of the training resumed where Day 1 left off, focusing on practical data manipulation in R and RStudio and introducing exploratory data analysis (EDA) concepts.
Homework Review
- Participants were invited to present their homework.
- Mona demonstrated joining two data frames using left, right, inner, and full joins, explaining the resulting row counts and column structures.
- John attempted to share his screen but faced technical issues; the instructor proceeded with a live demonstration.
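The row-count behaviour of the four join types from the homework can be reproduced on toy data. The sketch below uses base R's `merge()`; the two data frames and their values are invented for illustration and are not the homework data.

```r
# Two small, made-up data frames that share the key column `id`.
staff <- data.frame(id = c(1, 2, 3, 4),
                    name = c("Ann", "Ben", "Cara", "Dan"))
pay   <- data.frame(id = c(3, 4, 5),
                    salary = c(52000, 61000, 48000))

inner_j <- merge(staff, pay, by = "id")                 # ids present in both
left_j  <- merge(staff, pay, by = "id", all.x = TRUE)   # every row of staff
right_j <- merge(staff, pay, by = "id", all.y = TRUE)   # every row of pay
full_j  <- merge(staff, pay, by = "id", all = TRUE)     # every id from either side

# Row counts differ by join type; the columns (id, name, salary) do not.
sapply(list(inner = inner_j, left = left_j, right = right_j, full = full_j), nrow)
```

Unmatched rows are kept with `NA` in the columns from the other table. In `dplyr`, the corresponding verbs are `inner_join()`, `left_join()`, `right_join()`, and `full_join()`.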
Recap of Day 1 Topics
- Installation of R and RStudio.
- Understanding levels of measurement.
- Importing data from CSV, Excel, SPSS, etc.
- Setting the working directory and loading required packages.
Core Data‑Manipulation Techniques
- Creating New Variables
  - Example: `salaries$half_salary <- salaries$salary / 2` adds a seventh column.
- Recoding Existing Variables
  - Creating a categorical salary band (`low`/`high`) based on the mean salary using `ifelse`.
- Renaming Columns
  - `rename(salaries, Sex = sex, Experience = years_of_service)` changes column names while preserving the data frame.
- Subsetting Rows & Columns
  - `subset(salaries, rank == "Professor")` extracts only professors.
  - `select(salaries, -c(half_salary, salary_cut))` removes unwanted columns.
- Merging Data Sets
  - Demonstrated left, right, inner, and full joins with `merge()` and `dplyr` verbs.
- Exporting Results
  - `write.csv(salaries, "salaries_clean.csv")` saves the manipulated data for future use.
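The manipulation steps can be strung together as in the following minimal sketch. It assumes a tiny, invented `salaries` data frame (the course data has more columns, so the new variable is not the seventh here) and swaps in base R equivalents for `dplyr`'s `rename()` and `select()` so that it runs without extra packages.

```r
# Hypothetical stand-in for the course's `salaries` data (values are made up).
salaries <- data.frame(
  rank             = c("Professor", "Lecturer", "Professor", "Lecturer"),
  sex              = c("Female", "Male", "Male", "Female"),
  years_of_service = c(20, 3, 15, 6),
  salary           = c(120000, 60000, 100000, 80000)
)

# Create a new variable.
salaries$half_salary <- salaries$salary / 2

# Recode: band each salary as "high" or "low" relative to the mean salary.
salaries$salary_cut <- ifelse(salaries$salary > mean(salaries$salary), "high", "low")

# Rename columns (base R counterpart of dplyr::rename()).
names(salaries)[names(salaries) == "sex"] <- "Sex"
names(salaries)[names(salaries) == "years_of_service"] <- "Experience"

# Subset rows, then drop the helper columns (base R counterpart of select(-...)).
professors <- subset(salaries, rank == "Professor")
trimmed    <- salaries[, !(names(salaries) %in% c("half_salary", "salary_cut"))]

# Export; a temporary file is used so nothing is written to the working directory.
out_file <- tempfile(fileext = ".csv")
write.csv(trimmed, out_file, row.names = FALSE)
```

`row.names = FALSE` is a choice made here, not something stated in the session; it keeps R's row numbers out of the exported file.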
Transition to Exploratory Data Analysis (EDA)
- Purpose of EDA: Summarize main characteristics of a data set using numerical and graphical methods before formal modeling.
- Historical Note: Concept introduced by John Tukey in the 1970s.
Variable Types and Their Implications
| Type | Examples | Typical Summaries |
|---|---|---|
| Quantitative – Discrete | Number of children, cigarettes per day | Frequency tables, bar charts |
| Quantitative – Continuous | Salary, height, weight | Mean, median, variance, histograms |
| Qualitative – Nominal | Gender, job category | Frequency tables, pie charts |
| Qualitative – Ordinal | Education level (low, medium, high) | Ordered bar charts, box plots |
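One way these four types might be represented in R is sketched below, with a summary that suits each. All values are invented; the point is only that integers, doubles, factors, and ordered factors map onto the rows of the table.

```r
# Made-up values, one vector per variable type from the table.
children  <- c(0L, 2L, 1L, 2L, 3L)                                          # discrete
height_cm <- c(162.5, 171.0, 180.2, 158.9, 175.4)                           # continuous
job       <- factor(c("clerk", "manager", "clerk", "custodian", "clerk"))   # nominal
education <- factor(c("low", "high", "medium", "low", "high"),
                    levels = c("low", "medium", "high"), ordered = TRUE)    # ordinal

table(children)       # frequency table for a discrete count
mean(height_cm)       # mean and median suit a continuous measure
median(height_cm)
table(job)            # counts per category; the level order carries no meaning
table(education)      # counts follow the declared low < medium < high order
education < "high"    # comparisons are defined only for ordered factors
```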
Numerical Summaries
- Measures of Central Tendency: mean, median, mode (median preferred when outliers are present).
- Measures of Spread: range, inter‑quartile range (IQR), variance, standard deviation.
- Outlier Detection: Values beyond `Q3 + 1.5*IQR` or below `Q1 - 1.5*IQR` are flagged; the instructor illustrated this with the `years_of_school` and `salary` variables.
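The summaries and the 1.5 × IQR rule can be computed directly in base R. The sketch below uses an invented `salary` vector with one extreme value, not the session's data, and relies on R's default quantile method.

```r
# Made-up salaries (in thousands) with one unusually large value.
salary <- c(30, 32, 35, 36, 38, 40, 41, 45, 120)

mean(salary)      # pulled upward by the extreme value
median(salary)    # barely affected by it
range(salary)
var(salary)
sd(salary)

q   <- quantile(salary, c(0.25, 0.75))   # Q1 and Q3 (R's default method, type 7)
iqr <- IQR(salary)                       # equals Q3 - Q1
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

# Flag values that fall outside the fences.
outliers <- salary[salary < lower_fence | salary > upper_fence]
outliers
```

Other quantile definitions (the `type` argument of `quantile()`) can shift the fences slightly on small samples, so borderline points may be flagged differently by other software.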
Graphical Summaries
- Box Plots – Five‑number summary, visual outlier detection, skewness assessment.
- Histograms – Show distribution shape; the salary variable displayed a right‑skew.
- Bar Charts & Pie Charts – Used for categorical variables (gender, job category). Bar charts were preferred for readability.
- Scatter Plots – Explore relationships between two quantitative variables; added regression line and confidence interval with `geom_smooth()`.
- Customization – Changing orientation (`horizontal = TRUE`), colors, titles, and axis labels.
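A rough sketch of these plot types follows. The data frame `emp` and its columns are invented for illustration; base graphics are used for the box plot, histogram, and bar chart, and the `ggplot2` scatter plot is drawn only if that package is installed.

```r
# Made-up data: education in years, salary with right-skewed noise added.
set.seed(1)
emp <- data.frame(job = rep(c("clerk", "manager"), each = 20),
                  education = c(rnorm(20, 12, 2), rnorm(20, 17, 2)))
emp$salary <- 10 + 2 * emp$education + rlnorm(40, meanlog = 1, sdlog = 0.6)

pdf(NULL)  # null device, so the sketch also runs without a display; omit interactively

# Horizontal box plot of salary per job, with custom colour, title, and axis label.
bp <- boxplot(salary ~ job, data = emp, horizontal = TRUE,
              col = "lightblue", main = "Salary by job", xlab = "Salary")
bp$stats   # five-number summary used to draw each box

h <- hist(emp$salary, main = "Salary distribution", xlab = "Salary")
barplot(table(emp$job), main = "Employees per job", ylab = "Count")

# Scatter plot with a fitted line and confidence band, if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(emp, aes(x = education, y = salary)) +
    geom_point() +
    geom_smooth(method = "lm", se = TRUE) +
    labs(title = "Salary vs. education", x = "Years of education", y = "Salary")
  print(p)
}
invisible(dev.off())
```

`method = "lm"` requests a straight-line fit; the session summary does not say which smoothing method was used, so that argument is an assumption.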
Correlation and Relationship Assessment
- Correlation coefficients (`r`) quantify linear relationships:
  - `salary` vs. `beginning_salary`: r = 0.88 (strong positive).
  - `salary` vs. `years_of_school`: r ≈ 0.66 (moderate).
  - `salary` vs. `time_on_job`: r ≈ 0.08 (weak).
- Significance tested via p‑values; a p‑value of `2.2e-16` confirmed a statistically significant relationship for salary vs. beginning salary.
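A coefficient and its p-value can be obtained with `cor()` and `cor.test()` from base R. The sketch below simulates data, borrowing the session's variable names only for readability; the numbers it produces are not the session's results.

```r
# Simulated data; variable names mirror the session, values do not.
set.seed(42)
beginning_salary <- runif(50, 20, 60)
salary           <- 1.8 * beginning_salary + rnorm(50, sd = 8)
time_on_job      <- runif(50, 60, 100)   # generated independently of salary

cor(salary, beginning_salary)   # Pearson's r by default
cor(salary, time_on_job)

# cor.test() adds a significance test of the null hypothesis r = 0.
ct <- cor.test(salary, beginning_salary)
ct$estimate   # the same r that cor() returns
ct$p.value    # small values are evidence against r = 0
```

When printing a test result, R shows p-values smaller than about 2.2e-16 as `< 2.2e-16`, so that figure in output is best read as an upper bound rather than an exact value.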
Practical Session Workflow
- Install & Load Packages – `tidyverse`, `readxl`, `ggplot2`, `dplyr`, etc.
- Set Working Directory – via RStudio menu or `setwd()`.
- Import Data – `read.csv("employee.csv")` → data frame `employee` (474 rows, 10 variables).
- Inspect Data – `head()`, `str()`, `summary()`.
- Convert Character Columns to Factors – `employee$gender <- as.factor(employee$gender)`.
- Perform Summaries – numeric (`summary(employee$salary)`) and categorical (`table(employee$gender)`).
- Create Visualisations – bar plot for gender, pie chart for job category, box plot of salary by job, scatter plot of salary vs. education.
- Subset for Correlation Matrix – remove categorical columns, then `cor(employee_continuous, use = "complete.obs")`.
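The inspection, conversion, and correlation steps of this workflow might look as follows. Because `employee.csv` is not available here, the sketch builds a six-row stand-in; `job_category` is a hypothetical column name and all values are invented.

```r
# Hypothetical miniature of the `employee` data (the real file has 474 rows).
employee <- data.frame(
  gender          = c("Male", "Female", "Female", "Male", "Female", "Male"),
  job_category    = c("Clerical", "Clerical", "Manager", "Manager", "Clerical", "Custodial"),
  salary          = c(27000, 21450, 57000, 61875, NA, 30750),
  years_of_school = c(12, 12, 16, 19, 8, 12),
  stringsAsFactors = FALSE
)

# Inspect the data.
head(employee, 3)
str(employee)
summary(employee$salary)   # also reports the number of missing values

# Convert a character column to a factor, then tabulate it.
employee$gender <- as.factor(employee$gender)
table(employee$gender)

# Keep only numeric columns, then correlate using rows with no missing values.
employee_continuous <- employee[, sapply(employee, is.numeric)]
cor_matrix <- cor(employee_continuous, use = "complete.obs")
cor_matrix
```

Selecting columns with `is.numeric` is one way to "remove categorical columns"; the session may have dropped them by name instead.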
Assignment for the Next Session
- Exercise 1: Using the `employee` data set, display the relationship between two quantitative variables (e.g., salary vs. education) with an appropriate plot.
- Exercise 2: Show the relationship between two qualitative variables (e.g., gender vs. job category) using a bar chart or mosaic plot.
- Exercise 3: Visualize a qualitative‑quantitative relationship (e.g., salary distribution across gender) with a box plot.
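As a starting point for the qualitative-versus-qualitative case, a two-way table can feed either a grouped bar chart or a mosaic plot. This is only a sketch on invented vectors, not a model answer for the `employee` data.

```r
# Made-up observations; the variable names are hypothetical.
gender <- c("Female", "Male", "Male", "Female", "Female", "Male", "Female", "Male")
job    <- c("Clerical", "Manager", "Clerical", "Clerical",
            "Manager", "Manager", "Clerical", "Custodial")

counts <- table(gender, job)       # two-way contingency table
prop.table(counts, margin = 1)     # row proportions: job mix within each gender

pdf(NULL)  # null device, so the sketch runs without a display; omit interactively
barplot(counts, beside = TRUE, legend.text = TRUE,
        main = "Job category by gender", ylab = "Count")
mosaicplot(counts, main = "Gender vs. job category")
invisible(dev.off())
```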
Closing Remarks
- Participants were encouraged to practice the commands, review the YouTube recordings, and explore the Google‑Drive resources.
- The next session will be led by Dr. Thomas Odong, covering advanced EDA techniques and statistical modeling.
- Attendance was automatically recorded via Zoom; no additional registration required.
Key Takeaways
- Mastery of basic data‑manipulation (creating, recoding, renaming, subsetting, merging) is essential before any statistical analysis.
- Understanding variable types guides the choice of descriptive statistics and visualisations.
- EDA provides critical insight into data quality (outliers, skewness) and informs the selection of appropriate analytical models.
Effective data manipulation in R sets the foundation for robust exploratory analysis; by correctly handling variables, detecting outliers, and visualizing patterns, researchers can make informed decisions about the statistical techniques that best suit their data.
Frequently Asked Questions
Who is RUFORUMNetwork on YouTube?
RUFORUMNetwork is a YouTube channel that publishes videos on a range of topics.