Mastering String Data, Regular Expressions, and Smart Survey Design: A Comprehensive Guide

YouTube video ID: WiRY_fq2_t4

Source: YouTube video by Chisquares

Introduction

The session covered fundamental concepts for handling string data in statistical software, the difference between quantitative and qualitative analysis, and advanced techniques for building engaging, data‑rich surveys.

1. Strings vs. Numeric Variables

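  • String (character) data: a character is a single letter, digit, symbol, or space; a string is a sequence of characters (e.g., words, sentences). Digits stored as a string are text, not values.
  • Numeric variables: contain only numbers and are treated as real values in analysis, so arithmetic and summary statistics apply to them directly.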
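
The distinction matters in practice because the same digits behave differently depending on how they are stored. A minimal sketch follows; Python is used only for illustration (the session itself worked in Stata) and the values are made up.

```python
# The same digits stored as a string and as a number behave differently.
age_text = "35"   # string: the characters '3' and '5'
age_num = 35      # numeric: a real value usable in arithmetic

joined = age_text + "5"   # concatenates characters
added = age_num + 5       # adds values

# A numeric-looking string must be converted before arithmetic.
converted = float(age_text)

print(joined, added, converted + 5)
```

Here `joined` is text while `added` is a number, which is why numeric-looking strings are converted before analysis.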

2. Qualitative Analysis Methods

  • Thematic analysis: groups raw text into high‑level “buckets” (themes) that may use words not present in the original data.
  • Word‑cloud analysis: visualises the literal words that appear most frequently in the text.
  • When to use each: use thematic analysis to explore underlying meanings; use word clouds for quick visual frequency checks.
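
The counting step behind a word cloud can be sketched as below. This is an illustrative Python example, not the tool used in the session; the responses and the function name `word_frequencies` are invented for the sketch.

```python
from collections import Counter
import re

# Hypothetical free-text responses, invented for illustration.
responses = [
    "Stress at work made me start smoking",
    "I smoke because of stress",
    "My friends smoke, so I started too",
]

def word_frequencies(texts):
    """Count literal words (lowercased) across all responses."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

freq = word_frequencies(responses)
print(freq.most_common(5))
```

Note that "smoke" and "smoking" are counted as different words. A word cloud reports literal tokens, whereas a thematic analysis might group both under a theme such as "stress‑related smoking" that never appears verbatim in the data.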

3. Open‑Ended vs. Closed‑Ended Questions

  • Closed‑ended (quantitative): answer "what" questions and are easy to summarise (e.g., prevalence of smoking = 35%).
  • Open‑ended (qualitative): capture the "why" by letting respondents elaborate in free text, providing context and nuance.
  • Trade‑offs: quantitative data is straightforward to tabulate; qualitative data must first be coded and then analysed thematically or with word clouds.

4. Working with String Data in Stata

Four essential commands and functions were demonstrated:

  1. replace – overwrites the values of an existing variable, for example with the result of a string function.
  2. encode – converts a string variable into a numeric variable with value labels.
  3. real – converts a numeric‑looking string into a true numeric value.
  4. sub (subinstr() in Stata) – finds and replaces a sub‑string within a string; often used alongside regular expressions.
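
To make the two conversions concrete, here is a rough Python analogue of what encode and real do conceptually. It is a sketch, not Stata code: the function names `encode` and `to_real` are chosen for illustration, and the integer codes are assigned in alphabetical order to loosely mirror Stata's behaviour.

```python
def encode(values):
    """Map each distinct string to an integer code (1, 2, ...) in sorted order.

    Returns the coded values and the label-to-code mapping.
    """
    labels = {label: code
              for code, label in enumerate(sorted(set(values)), start=1)}
    return [labels[v] for v in values], labels

def to_real(text):
    """Return the number a string represents, or None if it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None

codes, labels = encode(["Stata", "R", "Python", "R"])
print(codes, labels)
print(to_real("12.5"), to_real("twelve"))
```

The key difference is that encoding assigns arbitrary category codes (the numbers carry no magnitude), while a real conversion recovers the value the text already represents.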

5. Regular Expressions (Regex) – The Basics

  • Purpose: locate patterns rather than exact values; essential for extracting phone numbers, emails, IDs, etc.
  • Meta‑characters: . (any single character), * (zero or more of the preceding item), + (one or more), ? (zero or one), ^ (start of string), $ (end of string), [] (character class), () (grouping), \ (escape special symbols).
  • Quantifiers: specify how many times a pattern may appear ({n} exact, {n,} at least n, {n,m} between n and m).
  • Character classes: [0-9] for digits, [A-Za-z] for letters, [^0-9] for non‑digits, etc.
  • Escaping: to treat a meta‑character literally, place it inside square brackets or prefix with a backslash.
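
The rules above can be checked with a few one‑line patterns. The sketch uses Python's `re` module for illustration; regex dialects differ slightly between tools, so the same pattern may need adjusting in other software. The helper `matches` is a hypothetical name.

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None

print(matches(r"^[0-9]{3}$", "123"))    # anchors plus an exact quantifier
print(matches(r"^[0-9]{3}$", "1234"))   # four digits, so the anchors fail
print(matches(r"colou?r", "color"))     # ? makes the "u" optional
print(matches(r"a.c", "abc"))           # . matches any one character
print(matches(r"a\.c", "abc"))          # \. matches only a literal dot
print(matches(r"a[.]c", "a.c"))         # [.] is another way to escape
print(matches(r"[^0-9]", "2024"))       # [^0-9] needs at least one non-digit
```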

6. Practical Example – Cleaning Phone Numbers

  1. Initial data: phone numbers appear in many formats (spaces, dashes, brackets, slashes, dots).
  2. Step‑by‑step cleaning with sub (subinstr() in Stata), one separator at a time:
     • Remove spaces → subinstr(phone_text, " ", "", .)
     • Remove "(" and ")" → two separate calls.
     • Remove "/", "-", ".", and ":" similarly.
  3. Extract a 10‑digit number using a regex pattern: generate phone_number = regexs(1) + regexs(2) + regexs(3) if regexm(phone_text, "([0-9]{3})[^0-9]*([0-9]{3})[^0-9]*([0-9]{4})")
  4. The pattern captures three groups of digits, allowing any non‑digit separator between them.
  5. regexm() tests the pattern; regexs(n) returns the n‑th captured group, and regexs(0) returns the entire match including separators. The three groups are therefore concatenated to form the full number.
  6. Result: a clean column of digit‑only phone numbers ready for analysis.
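
The same two approaches, stripping separators one by one and extracting the digits with capture groups, can be sketched in Python. This is an illustrative translation rather than the code shown in the session; the function names and sample numbers are invented.

```python
import re

# Three digit groups separated by any run of non-digits.
PHONE_PATTERN = re.compile(r"([0-9]{3})[^0-9]*([0-9]{3})[^0-9]*([0-9]{4})")

def strip_separators(phone_text):
    """Remove the listed separators one at a time."""
    for sep in [" ", "(", ")", "/", "-", ".", ":"]:
        phone_text = phone_text.replace(sep, "")
    return phone_text

def extract_phone(phone_text):
    """Return the 10 digits as one string, or None if the pattern fails."""
    m = PHONE_PATTERN.search(phone_text)
    if m is None:
        return None
    # Group 1 alone holds only the first three digits, so join all three.
    return m.group(1) + m.group(2) + m.group(3)

for raw in ["(404) 555-0123", "404.555.0123", "555-0123"]:
    print(raw, "->", strip_separators(raw), "|", extract_phone(raw))
```

The regex route handles every separator in one pass and also rejects entries that do not contain ten digits, which the strip‑only route would silently keep.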

7. Designing Conversational, Open‑Ended Surveys (KQ Platform)

  • Question types:
      • Single‑value text (one word/number)
      • Paragraph (free‑form text)
      • Multimedia (upload files, audio, video)
      • Date & time (open‑ended but structured)
  • Conversational flow: use piping to insert a respondent’s previous answer into the next question, making the survey feel like a dialogue.
  • AB‑testing: create multiple variants of a question and route respondents based on earlier answers (e.g., Python users vs. non‑Python users, machine‑learning experience).
  • Logic & skip patterns:
      • Inclusion criteria – only participants who have used statistical software proceed.
      • Exclusion criteria – respondents who do not do research are routed to the end.
      • Conditional routing based on software choice and ML experience.
  • Implementation steps:
      • Format questions with Q (question) and A (answer) tags, end each block with ###.
      • Import the formatted file into the KQ platform.
      • Set up piping, AB‑test, and skip‑logic via the platform’s UI.
      • Preview the survey to verify personalized paths.
      • Publish and collect data; download a PDF of the final questionnaire for record‑keeping.
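
The platform configures piping and skip logic through its UI, but the underlying idea is simple enough to sketch in code. The Python below is a conceptual illustration only: it is not the KQ platform's API, and every name in it (`pipe`, `next_question`, the answer keys, the question IDs) is hypothetical.

```python
def pipe(template, answers):
    """Insert earlier answers into a question template (piping)."""
    return template.format(**answers)

def next_question(answers):
    """Route a respondent using inclusion, exclusion, and software rules."""
    if not answers.get("does_research"):
        return "END"          # exclusion: non-researchers go to the end
    if not answers.get("uses_stat_software"):
        return "END"          # inclusion criterion not met
    if answers.get("software") == "Python":
        return "Q_PYTHON"     # path for Python users
    return "Q_OTHER"          # path for everyone else

answers = {"does_research": True, "uses_stat_software": True,
           "software": "Stata"}
print(pipe("You said you use {software}. Why did you choose {software}?",
           answers))
print(next_question(answers))
```

Writing the routes out like this before building the survey makes it easier to confirm during preview that every respondent type reaches the intended path.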

8. Ethical & Practical Considerations

  • Audio responses: contain potentially identifiable information; the platform encrypts data and currently does not auto‑transcribe to avoid privacy risks.
  • Bias mitigation: piping repeats a respondent’s own answer, reducing recall bias rather than introducing leading bias.
  • Tool choice: code‑free platforms (like KQ) speed up routine surveys, but researchers must still master study design, bias assessment, and validity checks.

9. Take‑away Tips

  • Memorise regex symbols; practice on small datasets to build intuition.
  • Convert cleaned, numeric‑looking strings to numeric types (real) before running statistical analysis on them.
  • Use thematic analysis for deep insight, word clouds for quick overviews.
  • Leverage conversational survey features (piping, AB‑test) to boost response rates and data quality.
  • Keep ethical safeguards front‑and‑center when collecting audio or other personal data.

The session equipped participants with the conceptual foundation and hands‑on commands needed to transform messy string data into clean, analyzable variables, and to design smart, engaging surveys that capture rich qualitative insights while maintaining rigorous quantitative standards.

By mastering string manipulation, regular expressions, and conversational survey design, researchers can efficiently clean unstructured data, extract meaningful patterns, and collect high‑quality open‑ended responses that enrich both quantitative and qualitative analyses.
