Bayesian Auditory Perception: From Sound Waves to Speech

 78 min video

 2 min read

YouTube video ID: A9j5TFfysoU

Source: YouTube video by MIT OpenCourseWare

Perceptual inference treats the brain’s interpretation of a single received waveform as a Bayesian search for the hypothesis with the highest posterior probability. The brain combines a prior belief about likely world states with a likelihood function that evaluates how well each hypothesis predicts the observed sound. Defining the hypothesis space, calculating the likelihood, and navigating high‑dimensional, multimodal probability landscapes present major challenges. Auditory illusions expose constraints on the brain’s hypothesis space, such as assumptions of harmonicity, abrupt onsets, and repeating elements.
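As a toy illustration of this posterior search, consider inferring a sound's fundamental frequency from a noisy waveform. The hypothesis space, the prior values, and the Gaussian noise model below are assumptions of this sketch, not taken from the lecture:

```python
import numpy as np

# Toy sketch of perceptual inference as Bayesian hypothesis selection.
# Hypotheses (candidate fundamentals), prior, and noise level are
# illustrative assumptions.

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)

def waveform(f0, n_harmonics=3):
    """Harmonic complex: sinusoids at integer multiples of f0 (harmonicity prior)."""
    return sum(np.sin(2 * np.pi * k * f0 * t) for k in range(1, n_harmonics + 1))

hypotheses = [3.0, 5.0, 7.0]              # candidate world states
prior = np.array([0.2, 0.5, 0.3])         # prior belief over those states
sigma = 0.3                               # assumed sensor noise level

observed = waveform(5.0) + sigma * rng.standard_normal(t.size)

# Likelihood: how well each hypothesis predicts the received waveform
# (Gaussian noise -> log-likelihood is a scaled negative squared error).
log_like = np.array([-np.sum((observed - waveform(f0)) ** 2) / (2 * sigma**2)
                     for f0 in hypotheses])
log_post = np.log(prior) + log_like
posterior = np.exp(log_post - log_post.max())
posterior /= posterior.sum()              # normalize to a distribution

best = hypotheses[int(np.argmax(posterior))]  # maximum a posteriori hypothesis
```

Here the search is a trivial enumeration over three hypotheses; the challenge the lecture highlights is that real hypothesis spaces are high-dimensional and the posterior landscape multimodal.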

Auditory Scene Analysis

Auditory scene analysis (ASA) seeks to determine how many sound sources are present and what parameters define each source from a mixed acoustic signal. Because the number of sources is unknown, the problem is often ill‑posed and the search space becomes extremely complex. Feed‑forward algorithms supply initial guesses that bootstrap more intensive optimization procedures. Bistable auditory illusions suggest that the brain may switch between multiple “good” solutions in the posterior distribution, reflecting the flexibility of ASA.
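A loose analogy for such a multimodal posterior can be sketched by scoring competing hypotheses against a two-source mixture: two different hypotheses can explain the signal equally well, giving two equally "good" modes. The sinusoidal sources and squared-error score are assumptions of this sketch:

```python
import numpy as np

# Sketch: a mixture of two sinusoidal sources admits two hypotheses that
# fit equally well, so the (unnormalized) posterior has two equal modes --
# loosely analogous to a bistable percept.

t = np.linspace(0, 1, 400)
mix = np.sin(2 * np.pi * 4 * t) + np.sin(2 * np.pi * 9 * t)  # observed mixture

def log_score(pair):
    """Unnormalized log-posterior under a flat prior and Gaussian noise."""
    pred = np.sin(2 * np.pi * pair[0] * t) + np.sin(2 * np.pi * pair[1] * t)
    return -np.sum((mix - pred) ** 2)

candidates = [(4, 9), (9, 4), (4, 5), (6, 9)]
scores = {pair: log_score(pair) for pair in candidates}
# (4, 9) and (9, 4) score identically: two equal posterior modes.
```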

Neural Mechanisms of Auditory Processing

Spike‑triggered averaging (STA) characterizes the sensitivity of individual neurons by averaging stimulus segments that precede each action potential. This method reveals spectro‑temporal receptive fields (STRFs), which encode the specific spectro‑temporal features that drive neuronal firing. The auditory system processes sound through a cascade of filters: cochlear filters resolve audio frequency content, while subsequent modulation filters capture slower envelope fluctuations. Auditory cortex neurons exhibit more complex, higher‑order tuning than subcortical regions such as the inferior colliculus, indicating progressive abstraction along the pathway.
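A minimal simulation shows how the STA procedure recovers a neuron's temporal sensitivity; the model neuron here (exponential filter, threshold nonlinearity, white-noise stimulus) is invented for illustration:

```python
import numpy as np

# Minimal spike-triggered averaging on simulated data. The exponential
# temporal filter, threshold nonlinearity, and white-noise stimulus are
# assumptions of this sketch.

rng = np.random.default_rng(1)
T, win = 20000, 30
stim = rng.standard_normal(T)            # white-noise stimulus

filt = np.exp(-np.arange(win) / 5.0)     # weight on lag j (recent samples heavy)
drive = np.convolve(stim, filt, mode="full")[:T]  # filtered stimulus
spikes = np.flatnonzero(drive > 3.0)     # suprathreshold samples stand in for spikes
spikes = spikes[spikes >= win]           # keep spikes with a full preceding window

# STA: average the stimulus segment leading up to each spike.
sta = np.mean([stim[s - win + 1:s + 1] for s in spikes], axis=0)
# sta[-1] is lag 0 (most recent sample), so sta[::-1] should resemble filt.
```

The same averaging applied to spectrogram-like stimuli, rather than a 1-D waveform, yields the spectro-temporal receptive fields described above.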

Intracranial Recordings and Attention

Intracranial recordings from epilepsy patients provide high‑resolution insight into human auditory processing. In cocktail‑party scenarios, the superior temporal gyrus preferentially represents the attended talker, while unattended speech receives weaker neural encoding. Errors in attentional selection correlate with a failure to isolate the target talker’s representation, highlighting the critical role of attention in shaping auditory cortical activity.

Speech Perception

The source‑filter model explains speech production: the larynx generates a periodic harmonic source, and the vocal tract shapes this source through resonances (formants) created by tongue, lip, and palate configurations. Phonemes—defined as the smallest sound units that alter meaning—do not correspond one‑to‑one with letters and vary with surrounding sounds due to coarticulation. Categorical perception maps continuous acoustic variations, such as voice onset time, onto discrete phonemic categories, producing steep identification functions at category boundaries and enhanced discrimination for stimuli that cross those boundaries. This process allows the brain to impose discrete linguistic categories on a continuously varying acoustic signal.
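The steep identification function at a category boundary can be sketched with a logistic curve over voice onset time; the boundary location, slope, and the /b/–/p/ labels are illustrative assumptions:

```python
import numpy as np

# Sketch of categorical perception: a steep logistic identification function
# maps continuous voice onset time (VOT, ms) onto two phoneme categories
# (e.g. /b/ vs /p/). Boundary and slope values are hypothetical.

boundary, steepness = 25.0, 0.5          # assumed category boundary near 25 ms

def p_voiceless(vot_ms):
    """Probability of identifying the sound as the voiceless category (/p/)."""
    return 1.0 / (1.0 + np.exp(-steepness * (vot_ms - boundary)))

# Two equal 10 ms acoustic differences, within vs across the boundary:
within = abs(p_voiceless(40) - p_voiceless(50))   # both heard as /p/
across = abs(p_voiceless(20) - p_voiceless(30))   # pair straddles the boundary
```

The across-boundary pair changes the identification probability far more than the within-category pair, matching the enhanced discrimination at category boundaries described above.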

  Takeaways

  • Perceptual inference treats the brain’s interpretation of a single waveform as a Bayesian search for the hypothesis with the highest posterior probability, constrained by priors such as harmonicity and abrupt onsets.
  • Auditory scene analysis faces an ill‑posed problem because the number of sound sources is unknown; feed‑forward algorithms supply initial guesses for the search, and the brain may switch between multiple plausible solutions, as illustrated by bistable auditory illusions.
  • Neurons in the auditory pathway are characterized by spike‑triggered averaging, revealing spectro‑temporal receptive fields that reflect a cascade of cochlear and modulation filters, with cortical neurons showing higher‑order tuning than subcortical ones.
  • Intracranial recordings from the superior temporal gyrus show that attention enhances the neural representation of the attended talker, and failures to isolate that representation predict selection errors in cocktail‑party situations.
  • Speech perception relies on the source‑filter model, where the larynx generates a harmonic source and the vocal tract shapes formants; coarticulation and categorical perception create discrete phoneme categories despite continuous acoustic variation.

Frequently Asked Questions

Why does auditory scene analysis use feed‑forward algorithms for initial guesses?

Feed‑forward algorithms quickly generate plausible estimates of source number and parameters, reducing the computational load of the subsequent optimization that must search a high‑dimensional, multimodal posterior. By providing a starting point, they help the brain converge on a “good” solution despite the ill‑posed nature of the mixture problem.
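This bootstrap idea can be sketched as a cheap spectral-peak guess that seeds a local search; the FFT heuristic and grid refinement below are stand-ins for whatever algorithms the brain might use, with all values chosen for illustration:

```python
import numpy as np

# Sketch: a fast feed-forward pass proposes source parameters, which then
# seed a costlier local optimization. FFT peak-picking and grid refinement
# are illustrative stand-ins, not claims about neural algorithms.

t = np.linspace(0, 1, 1000, endpoint=False)
mix = np.sin(2 * np.pi * 7.3 * t) + np.sin(2 * np.pi * 13.6 * t)  # two sources

# Feed-forward guess: the two strongest spectral peaks (resolution-limited).
spec = np.abs(np.fft.rfft(mix))
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
guess = sorted(freqs[np.argsort(spec)[-2:]])

def err(f1, f2):
    """Squared error of a two-sinusoid hypothesis against the mixture."""
    pred = np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t)
    return np.sum((mix - pred) ** 2)

# Intensive step: local refinement around the feed-forward guesses.
f1 = min(np.arange(guess[0] - 1, guess[0] + 1, 0.05), key=lambda f: err(f, guess[1]))
f2 = min(np.arange(guess[1] - 1, guess[1] + 1, 0.05), key=lambda f: err(f1, f))
```

The coarse guesses (integer-Hz bins) land near the true source frequencies, and the refinement then recovers them to within the grid resolution, without ever searching the full two-dimensional space.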

How does categorical perception influence discrimination of voice onset time?

Categorical perception maps continuous variations in voice onset time onto discrete phonemic categories, producing a steep identification curve at the category boundary. Listeners show heightened sensitivity to differences that cross the boundary and reduced discrimination within a category, sharpening speech contrast perception.
