Chapter 18: Audio & Speech Systems

Keywords

speech, audio, Whisper, transcription, TTS, diarization, voice, ASR, speaker identification

Introduction

In late 2023, a legal tech startup faced a critical scalability problem. Their team was manually reviewing hundreds of hours of deposition recordings each week, searching for relevant testimony to support case arguments. Paralegals would listen to entire recordings, take notes, and create searchable transcripts—a process that took 3-4 hours per hour of audio.

They deployed an audio processing pipeline that automatically transcribed depositions, identified speakers, and created searchable embeddings of the content. Suddenly, attorneys could search: “Find all testimony about the contract negotiation in April 2022” and receive timestamped clips from across dozens of recordings. Review time dropped by 70%.

This transformation illustrates why audio processing matters. The world’s information isn’t just text and images—it’s podcasts, meetings, customer calls, and instructional recordings. AI systems that can only search text miss the rich information carried in speech.

Why Audio Processing Matters

Audio carries unique information: Tone of voice, emphasis, pauses—all convey meaning beyond words. Customer sentiment in a support call, speaker confidence in a meeting, or emotional state in a therapeutic session exist only in audio.

Speech enables text search over audio: Transcription with speaker diarization makes audio searchable, indexable, and analyzable with the same tools we use for text.

Real-time processing enables new applications: Live captions, meeting transcription, and voice assistants require streaming audio processing with low latency.

What You’ll Learn

This chapter focuses on audio and speech processing:

  • Audio processing fundamentals: from waveforms to spectrograms
  • Production-ready speech-to-text systems with Whisper
  • Speaker diarization: who said what and when
  • Real-time vs batch transcription tradeoffs
  • Domain-specific vocabulary and accuracy optimization
  • Production architectures for audio processing

For video understanding and multimodal retrieval (combining audio, video, and images), see Chapter 19.

Prerequisites

  • Understanding of transformer architecture and attention (Chapter 5)
  • Familiarity with embeddings and vector search (Chapter 7)
  • Experience with RAG systems (Chapter 8)

How Machines “Hear”: Audio Processing Fundamentals

Speech-to-text might seem simpler than vision—just transcribe what was said. But the transformation from sound waves to text involves sophisticated processing that bridges the continuous analog world and discrete text.

From Waveforms to Features

Raw audio is a sequence of amplitude values sampled many times per second (16kHz or 44.1kHz typically). This representation is both too raw and too high-dimensional for direct processing.

Audio Processing Pipeline

The Short-Time Fourier Transform (STFT) analyzes the frequency content of small windows of audio. Instead of treating audio as a single signal, it breaks it into overlapping windows (typically 25ms each) and computes the frequency spectrum for each window. This captures how frequencies change over time—essential for understanding speech where sounds evolve rapidly.
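The windowing described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the front end of any particular model: the 10 ms hop between windows and the Hann window are assumptions (the text only specifies the 25 ms window length), and `stft_magnitude` is a name chosen here for illustration.

```python
import numpy as np

def stft_magnitude(signal, sample_rate=16000, window_ms=25, hop_ms=10):
    """Magnitude spectrum of overlapping windows: one row per time step."""
    window_len = int(sample_rate * window_ms / 1000)  # samples per window
    hop_len = int(sample_rate * hop_ms / 1000)        # samples between window starts
    window = np.hanning(window_len)                   # taper to reduce edge artifacts

    n_frames = 1 + (len(signal) - window_len) // hop_len
    frames = np.stack([
        signal[i * hop_len : i * hop_len + window_len] * window
        for i in range(n_frames)
    ])
    # Real-input FFT of each frame; keep only the magnitudes
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a pure 440 Hz tone
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)

spectrum = stft_magnitude(tone, sample_rate)
window_len = int(sample_rate * 25 / 1000)
peak_bin = spectrum.mean(axis=0).argmax()
peak_hz = peak_bin * sample_rate / window_len

print(spectrum.shape)  # (time steps, frequency bins)
print(peak_hz)         # frequency of the strongest bin
```

Each row of `spectrum` is the frequency content of one 25 ms window, so reading down the rows shows how the spectrum evolves over time.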

Why Mel Spectrograms?

The mel scale reflects human hearing: we’re more sensitive to differences between low frequencies than high frequencies. The difference between 100Hz and 200Hz sounds much more significant than the difference between 10,000Hz and 10,100Hz, even though both are 100Hz apart. This is because human hearing is approximately logarithmic in frequency perception.
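To make the 100Hz comparison concrete, the sketch below converts both intervals to mels. It assumes the widely used `2595 * log10(1 + f / 700)` conversion; other mel formulas exist and give slightly different numbers.

```python
import math

def hz_to_mel(frequency_hz: float) -> float:
    """Convert Hz to mels with the common 2595 * log10(1 + f / 700) formula."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)

# The same 100 Hz step, measured at the low and high end of the spectrum
low_gap = hz_to_mel(200) - hz_to_mel(100)
high_gap = hz_to_mel(10_100) - hz_to_mel(10_000)

print(f"100 Hz -> 200 Hz:       {low_gap:.1f} mels")
print(f"10,000 Hz -> 10,100 Hz: {high_gap:.1f} mels")
```

The low-frequency step spans many more mels than the high-frequency one, which matches the perceptual difference described above.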

Mel filterbanks compress frequency information in this perceptually-motivated way. Instead of representing all frequencies equally, they allocate more resolution to lower frequencies where human speech carries more information. Vowels, which contain most speech energy, appear in lower frequencies (200-2000Hz), while consonants extend higher.

The resulting representation—time on one axis, frequency on the other—resembles an image. Some early speech systems even used image classification architectures (CNNs) on spectrograms. Modern systems use transformers that process the spectrogram as a sequence, with each time step being an 80-dimensional vector representing the frequency content at that moment.
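The step from a spectrogram to 80-dimensional vectors can be sketched with a triangular mel filterbank. This is a simplified illustration under stated assumptions: a 400-point FFT at 16kHz, the `2595 * log10(1 + f / 700)` mel formula, no filter normalization, and random numbers standing in for a real power spectrogram (3000 frames, which corresponds to 30 seconds at an assumed 10 ms hop). Production libraries differ in these details.

```python
import numpy as np

def hz_to_mel(frequency_hz):
    return 2595.0 * np.log10(1.0 + np.asarray(frequency_hz) / 700.0)

def mel_to_hz(mels):
    return 700.0 * (10.0 ** (np.asarray(mels) / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=400, sample_rate=16000):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)

    filters = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        # Triangle: rises from left to center, falls from center to right
        filters[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return filters

# Stand-in power spectrogram: 3000 time steps x 201 frequency bins
rng = np.random.default_rng(0)
power_spectrogram = rng.random((3000, 201))

filters = mel_filterbank()
mel_spectrogram = power_spectrogram @ filters.T
log_mel = np.log10(np.maximum(mel_spectrogram, 1e-10))

print(filters.shape)  # (mel bands, frequency bins)
print(log_mel.shape)  # (time steps, mel bands)
```

Because the filter centers are evenly spaced in mels, the low-frequency filters are narrow and the high-frequency filters are wide, which is how the filterbank allocates more resolution to the lower part of the spectrum.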

Intuition: Reading the Spectrogram

When you look at a mel spectrogram:

  • Horizontal axis: Time progression (left to right)
  • Vertical axis: Frequency (low to high)
  • Brightness: Energy at that frequency and time

Vowels appear as horizontal bands (sustained frequencies). Consonants appear as brief bursts or gaps. Silence is dark. A sentence like “Hello world” shows distinct patterns: the “H” as a noise burst, “e” as horizontal bands, “l” as a transition, and so on. The model learns to map these patterns to phonemes, then words, then text.

The Whisper Architecture

Whisper (Radford et al., 2022) demonstrated that simple architectures trained on massive data could achieve remarkable transcription quality. It’s an encoder-decoder transformer that processes mel spectrograms and generates text autoregressively.

Whisper Architecture

Why Whisper Works So Well

Whisper’s success comes from scale and diversity, not architectural innovation:

  • 680,000 hours of training data: Orders of magnitude more than previous systems (which typically had thousands or tens of thousands of hours)
  • Multilingual: 99 languages, enabling cross-lingual transfer. The model learns general speech patterns that apply across languages
  • Weakly supervised: Training pairs are audio with transcripts collected from the internet rather than carefully curated, human-verified labels, which is what made this scale possible
  • Multitask training: Trained jointly on transcription, translation, language detection, and timestamp prediction

The model learned robust representations that generalize across accents, noise conditions, and domains—because it saw all of these in training. A model trained on clean English recordings might fail on accented speech or background noise. Whisper, trained on internet-scale diverse data, handles these naturally.

Streaming vs. Batch Processing

Whisper processes 30-second chunks by design. This creates fundamental tradeoffs between latency and accuracy:

Batch Processing Characteristics:

  • Latency: Seconds to minutes (upload time + processing time)
  • Accuracy: Highest—full context available, can look ahead and behind
  • Context: Full 30-second window for disambiguation
  • Use case: Recordings, podcasts, archived content, where speed isn’t critical

Streaming Processing Characteristics:

  • Latency: Milliseconds to seconds (near real-time)
  • Accuracy: Lower—limited context, no look-ahead, errors at chunk boundaries
  • Context: Only past audio available, current window typically 2-5 seconds
  • Use case: Live captions, voice assistants, real-time translation

The tradeoff exists because language has long-range dependencies. The phrase “I scream” vs. “ice cream” requires context to disambiguate. With full context, the model can look at surrounding words. In streaming mode, it must make immediate decisions.

Modern variants address streaming needs:

  • Streaming Whisper: Process audio in smaller overlapping windows (5-10 seconds with 1-2 second overlap), merge results
  • faster-whisper: Optimized inference with CTranslate2 and quantization for lower latency
  • whisper.cpp: On-device implementation for edge deployment with optimized C/C++ code

Each trades some accuracy or context for lower latency. Production systems often use hybrid approaches: streaming for immediate display, batch correction for final accuracy.


Audio Processing Systems

Speech-to-Text: From Whisper to Production

Whisper changed speech transcription from a specialized capability requiring expensive infrastructure to a commodity API. But production systems need more than basic transcription.

Production-Ready Transcription Service

import openai
from typing import Optional, Dict, List
import logging

logger = logging.getLogger(__name__)

class TranscriptionService:
    """Production-ready transcription service with error handling and optimization."""

    def __init__(self):
        self.client = openai.OpenAI()
        self.max_file_size = 25 * 1024 * 1024  # 25MB API limit

    def transcribe(
        self,
        audio_path: str,
        language: Optional[str] = None,
        prompt: Optional[str] = None,
        temperature: float = 0.0
    ) -> Dict:
        """
        Transcribe audio with optional language hint and vocabulary prompting.

        Args:
            audio_path: Path to audio file
            language: ISO-639-1 language code (e.g., 'en', 'es'). If None, auto-detect.
            prompt: Vocabulary hints for domain-specific content
            temperature: Sampling temperature (0 = deterministic, higher = more creative)

        Returns:
            Dictionary containing transcription text, language, duration, and timestamps
        """
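        # Only send optional parameters that were actually provided
        options = {}
        if language:
            options['language'] = language
        if prompt:
            options['prompt'] = prompt

        try:
            with open(audio_path, 'rb') as f:
                response = self.client.audio.transcriptions.create(
                    model="whisper-1",
                    file=f,
                    temperature=temperature,
                    response_format="verbose_json",
                    timestamp_granularities=["word", "segment"],
                    **options
                )

            # Depending on the SDK version, words and segments may be typed
            # objects; convert them to plain dicts so callers can index by key
            def as_dict(item):
                return item if isinstance(item, dict) else item.model_dump()

            return {
                'text': response.text,
                'language': response.language,
                'duration': float(response.duration),
                'segments': [as_dict(s) for s in (getattr(response, 'segments', None) or [])],
                'words': [as_dict(w) for w in (getattr(response, 'words', None) or [])]
            }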

        except openai.BadRequestError as e:
            logger.error(f"Invalid request to Whisper API: {e}")
            raise
        except openai.RateLimitError as e:
            logger.warning(f"Rate limited, should implement retry: {e}")
            raise
        except Exception as e:
            logger.error(f"Transcription failed: {e}")
            raise

    def estimate_cost(self, audio_duration_seconds: float) -> float:
        """Estimate transcription cost. OpenAI charges per minute."""
        minutes = audio_duration_seconds / 60
        cost_per_minute = 0.006  # $0.006 per minute as of 2026
        return minutes * cost_per_minute

Common Mistake: Skipping Audio Preprocessing

What people do: Feed raw audio files directly to transcription APIs without any quality assessment or preprocessing, then wonder why accuracy is poor.

Why it fails: Background noise, low bitrates, clipping, and inconsistent volume levels significantly degrade transcription accuracy. Whisper is robust but not magic—garbage in, garbage out.

Fix: Before transcription, assess audio quality (SNR, clipping detection). For problematic files: normalize volume levels, apply noise reduction (spectral gating), and consider resampling to 16kHz mono if the source is lower quality. Even simple preprocessing can improve WER by 5-15% on noisy recordings.
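Two of the checks mentioned above, clipping detection and volume normalization, can be sketched with NumPy. This is a minimal illustration that assumes audio is already loaded as floating-point samples in the range [-1, 1]; the function names and the 0.99 and 0.9 thresholds are illustrative choices, not recommended values.

```python
import numpy as np

def clipping_ratio(samples: np.ndarray, threshold: float = 0.99) -> float:
    """Fraction of samples sitting at or near full scale."""
    return float(np.mean(np.abs(samples) >= threshold))

def peak_normalize(samples: np.ndarray, target_peak: float = 0.9) -> np.ndarray:
    """Scale the signal so its loudest sample reaches target_peak."""
    peak = float(np.max(np.abs(samples)))
    if peak == 0.0:
        return samples.copy()  # silence: nothing to scale
    return samples * (target_peak / peak)

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate

# A quiet recording and an overdriven one, both synthetic
quiet = 0.1 * np.sin(2 * np.pi * 220 * t)
clipped = np.clip(2.0 * np.sin(2 * np.pi * 220 * t), -1.0, 1.0)

print(f"quiet clipping ratio:   {clipping_ratio(quiet):.2f}")
print(f"clipped clipping ratio: {clipping_ratio(clipped):.2f}")
print(f"quiet peak after normalization: {np.max(np.abs(peak_normalize(quiet))):.2f}")
```

A high clipping ratio cannot be repaired by normalization, so files that fail this check are better flagged for re-recording or manual review than silently passed to the transcription step.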

Vocabulary Prompting

Whisper accepts a prompt that hints at expected vocabulary. This can noticeably improve accuracy for domain-specific content, proper nouns, and technical terminology. Keep prompts short: the OpenAI API only considers the final 224 tokens of the prompt.

# Domain-specific vocabulary prompts
DOMAIN_PROMPTS = {
    'medical': """Medical transcription. Expected terms: myocardial infarction,
        echocardiogram, hypertension, tachycardia, bradycardia, arrhythmia,
        CBC, BMP, comprehensive metabolic panel, CT scan, MRI,
        metformin, lisinopril, atenolol, warfarin, electrocardiogram.""",

    'legal': """Legal dictation. Expected terms: plaintiff, defendant,
        deposition, affidavit, subpoena, motion to dismiss, summary judgment,
        habeas corpus, voir dire, pro bono, litigation, arbitration,
        interrogatories, discovery, stipulation.""",

    'technical': """Technical discussion. Expected terms: Kubernetes,
        microservices, API gateway, load balancer, PostgreSQL, Redis,
        containerization, CI/CD pipeline, infrastructure as code,
        GraphQL, WebSocket, OAuth, JWT, horizontal scaling.""",

    'financial': """Financial analysis. Expected terms: EBITDA,
        price-to-earnings ratio, compound annual growth rate, CAGR,
        balance sheet, income statement, cash flow, liquidity,
        market capitalization, dividend yield, earnings per share."""
}

class DomainAwareTranscriber(TranscriptionService):
    """Transcription service with domain-specific vocabulary support."""

    def transcribe_domain(
        self,
        audio_path: str,
        domain: str,
        additional_terms: Optional[List[str]] = None
    ) -> Dict:
        """
        Transcribe with domain-specific vocabulary hints.

        Args:
            audio_path: Path to audio file
            domain: One of 'medical', 'legal', 'technical', 'financial'
            additional_terms: Extra vocabulary terms specific to this transcription
        """
        prompt = DOMAIN_PROMPTS.get(domain, "")

        # Add custom terms if provided
        if additional_terms:
            custom_terms = ", ".join(additional_terms)
            prompt += f" Additional terms: {custom_terms}."

        return self.transcribe(audio_path, prompt=prompt)

# Usage example
transcriber = DomainAwareTranscriber()

# Medical transcription with custom patient/procedure terms
result = transcriber.transcribe_domain(
    audio_path="consultation.mp3",
    domain="medical",
    additional_terms=["Dr. Ramirez", "Xenical", "gastric bypass"]
)

Common Mistake: Not Using Domain Vocabulary Hints

What people do: Use default Whisper settings for medical, legal, or technical transcription, then manually correct hundreds of misheard domain terms.

Why it fails: Without vocabulary hints, Whisper defaults to common words. “EBITDA” becomes “a bit of”, “myocardial infarction” becomes “my cardio infection”, and “kubectl” becomes “cube cuttle.”

Fix: Always provide a vocabulary prompt for specialized content. Include expected proper nouns, acronyms, and technical terms. The prompt doesn’t force these words—it just makes the model more confident when it hears ambiguous sounds that could match domain vocabulary.

Why Prompting Works

Whisper’s decoder is conditioned on the prompt, biasing it toward expected vocabulary. When it hears an ambiguous phrase, it’s more likely to select terms from the prompt. For example:

  • Without prompt: “We need to increase the cagr” → “We need to increase the cager” (incorrect)
  • With financial prompt: “We need to increase the cagr” → “We need to increase the CAGR” (correct)

The prompt doesn’t force these terms—if they don’t fit the audio, they won’t appear. It just makes the model more confident about domain-specific vocabulary.

Common Mistake: Chunking Audio Without Overlap

What people do: Split long audio into fixed-length chunks (e.g., 10-minute segments) at exact boundaries, then concatenate the transcripts.

Why it fails: Words get cut in half at chunk boundaries—“develop” might become “devel” at the end of one chunk and “op” at the start of the next. Sentences lose context. Speaker attribution breaks.

Fix: Use overlapping chunks (e.g., 10-minute chunks with 5-10 second overlap). During merging, detect duplicates in the overlap region by comparing timestamps, and discard the duplicate words. Keep words from the earlier chunk (they have more left context) unless they’re clearly cut off.

Handling Long Audio

Whisper’s API limits files to 25MB. For longer content (podcasts, lectures, meetings), you need chunking with careful merging:

from pydub import AudioSegment
import os
from typing import Dict, List, Optional

class LongAudioProcessor:
    """Process audio longer than API limits with intelligent chunking."""

    def __init__(
        self,
        chunk_duration_ms: int = 10 * 60 * 1000,  # 10 minutes
        overlap_ms: int = 5000  # 5 seconds overlap
    ):
        self.chunk_duration = chunk_duration_ms
        self.overlap = overlap_ms
        self.transcriber = TranscriptionService()

    def transcribe_long(
        self,
        audio_path: str,
        prompt: Optional[str] = None
    ) -> Dict:
        """
        Transcribe long audio by chunking and merging.

        The overlap ensures we don't cut words in half at chunk boundaries.
        We detect duplicates in the overlap region and remove them.
        """
        audio = AudioSegment.from_file(audio_path)
        chunks = self._create_chunks(audio)

        logger.info(f"Processing {len(chunks)} chunks from {audio_path}")

        transcriptions = []
        for i, chunk_audio in enumerate(chunks):
            # Export chunk to temporary file
            chunk_path = f"/tmp/chunk_{i}.mp3"
            chunk_audio.export(chunk_path, format="mp3")

            try:
                result = self.transcriber.transcribe(chunk_path, prompt=prompt)
                transcriptions.append({
                    'chunk_index': i,
                    'start_ms': i * (self.chunk_duration - self.overlap),
                    'result': result
                })
            finally:
                # Clean up temporary file
                if os.path.exists(chunk_path):
                    os.remove(chunk_path)

        return self._merge_transcriptions(transcriptions)

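    def _create_chunks(self, audio: AudioSegment) -> List[AudioSegment]:
        """Split audio into overlapping chunks."""
        step = self.chunk_duration - self.overlap
        if step <= 0:
            raise ValueError("overlap_ms must be smaller than chunk_duration_ms")

        chunks = []
        start = 0

        while start < len(audio):
            end = min(start + self.chunk_duration, len(audio))
            chunks.append(audio[start:end])

            # Stop once a chunk reaches the end of the audio; otherwise the
            # loop would emit a trailing chunk made up only of overlap
            if end == len(audio):
                break

            # Move start forward by chunk_duration minus overlap
            start += step

        return chunks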

    def _merge_transcriptions(self, transcriptions: List[Dict]) -> Dict:
        """
        Merge overlapping transcriptions, avoiding duplicates.

        Each overlap region is split at its midpoint: words that start before
        the midpoint come from the earlier chunk, and words that start at or
        after it come from the later chunk. A word cut off at the very end of
        the earlier chunk is therefore taken from the later chunk, where it
        is complete.
        """
        merged_words = []
        half_overlap = self.overlap / 2

        for i, trans in enumerate(transcriptions):
            chunk_start_ms = trans['start_ms']

            # Absolute time range (ms) this chunk is responsible for
            keep_from = chunk_start_ms + half_overlap if i > 0 else 0
            if i + 1 < len(transcriptions):
                keep_until = transcriptions[i + 1]['start_ms'] + half_overlap
            else:
                keep_until = float('inf')

            for word in trans['result'].get('words') or []:
                absolute_start = chunk_start_ms + word['start'] * 1000

                if keep_from <= absolute_start < keep_until:
                    merged_words.append(self._adjust_timestamp(word, chunk_start_ms))

        full_text = ' '.join(w['word'].strip() for w in merged_words)
        last = transcriptions[-1]

        return {
            'text': full_text,
            'words': merged_words,
            'chunk_count': len(transcriptions),
            'duration': last['start_ms'] / 1000 + last['result']['duration']
        }

    def _adjust_timestamp(self, word: Dict, offset_ms: int) -> Dict:
        """Adjust word timestamps to absolute time."""
        return {
            'word': word['word'],
            'start': (offset_ms + word['start'] * 1000) / 1000,
            'end': (offset_ms + word['end'] * 1000) / 1000
        }

Speaker Diarization: Who Said What

For multi-speaker content (meetings, interviews, podcasts), transcription alone isn’t enough. You need to know who spoke when.

The Diarization Challenge

Speaker diarization segments audio by speaker without knowing identities in advance. It answers “when did different people speak?” not “who is speaking?” The output is speaker labels (SPEAKER_00, SPEAKER_01, etc.) assigned to time segments.

Audio Timeline:
[=========================================================]
0:00                                                  10:00

Raw Transcription:
"Hello everyone welcome to the meeting today we're going to discuss..."

Diarized Output:
SPEAKER_00 (0:00-0:03): "Hello everyone, welcome to the meeting."
SPEAKER_01 (0:04-0:15): "Today we're going to discuss the quarterly results."
SPEAKER_00 (0:16-0:22): "Great, let's start with revenue."
SPEAKER_02 (0:23-0:45): "Revenue was up 15% compared to last quarter..."

Integration with Transcription

Modern diarization systems use voice embeddings to cluster speakers. The pyannote.audio library provides state-of-the-art models:

from pyannote.audio import Pipeline
import torch
from typing import List, Dict, Optional

class DiarizedTranscription:
    """Combine transcription with speaker diarization."""

    def __init__(self, hf_auth_token: str):
        """
        Initialize transcription and diarization pipelines.

        Args:
            hf_auth_token: HuggingFace token for accessing pyannote models
        """
        self.transcriber = TranscriptionService()

        # Load diarization pipeline
        self.diarizer = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.1",
            use_auth_token=hf_auth_token
        )

        # Use GPU if available
        if torch.cuda.is_available():
            self.diarizer.to(torch.device("cuda"))

    def process(self, audio_path: str, num_speakers: Optional[int] = None) -> Dict:
        """
        Get transcription with speaker labels.

        Args:
            audio_path: Path to audio file
            num_speakers: Expected number of speakers (if known, improves accuracy)

        Returns:
            Dictionary with speaker-labeled turns and full transcript
        """
        # Get word-level transcription from Whisper
        transcription = self.transcriber.transcribe(audio_path)
        words = transcription.get('words', [])

        # Get speaker segments from diarization
        if num_speakers:
            diarization = self.diarizer(
                audio_path,
                num_speakers=num_speakers
            )
        else:
            diarization = self.diarizer(audio_path)

        # Assign each word to the speaker active at the word's midpoint,
        # which is more robust near turn boundaries than the start time
        labeled_words = []
        for word in words:
            word_time = (word['start'] + word['end']) / 2
            speaker = self._get_speaker_at_time(diarization, word_time)
            labeled_words.append({
                **word,
                'speaker': speaker
            })

        # Group words into speaker turns
        turns = self._group_into_turns(labeled_words)

        return {
            'turns': turns,
            'speakers': list(set(turn['speaker'] for turn in turns)),
            'full_transcript': transcription['text'],
            'duration': transcription['duration']
        }

    def _get_speaker_at_time(self, diarization, time: float) -> str:
        """Find which speaker was talking at a given time."""
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            if turn.start <= time <= turn.end:
                return speaker
        return "UNKNOWN"

    def _group_into_turns(self, words: List[Dict]) -> List[Dict]:
        """Group consecutive words by the same speaker into turns."""
        turns = []
        current_speaker = None
        current_words = []

        for word in words:
            if word['speaker'] != current_speaker:
                # Speaker changed - save previous turn
                if current_words:
                    turns.append({
                        'speaker': current_speaker,
                        'text': ' '.join(w['word'].strip() for w in current_words),
                        'start': current_words[0]['start'],
                        'end': current_words[-1]['end'],
                        'word_count': len(current_words)
                    })
                current_speaker = word['speaker']
                current_words = []

            current_words.append(word)

        # Don't forget the last turn
        if current_words:
            turns.append({
                'speaker': current_speaker,
                'text': ' '.join(w['word'].strip() for w in current_words),
                'start': current_words[0]['start'],
                'end': current_words[-1]['end'],
                'word_count': len(current_words)
            })

        return turns

Diarization Quality Factors

Speaker diarization accuracy depends on several factors:

  1. Audio quality: Clear audio with distinct voices works best. Background noise, crosstalk, and poor recording quality hurt performance.

  2. Speaker overlap: When multiple people speak simultaneously, diarization struggles. The model must choose one speaker or mark the segment as overlapping.

  3. Number of speakers: Performance degrades with many speakers. 2-4 speakers is ideal; 10+ speakers is challenging.

  4. Speaker similarity: Similar voices (same gender, accent, pitch) are harder to distinguish than different voices.

  5. Short segments: Very brief utterances (< 1 second) may be misattributed.

Production systems should validate diarization quality and provide confidence scores. For critical applications, consider human review of speaker assignments.
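One lightweight form of validation is to flag turns that match the risk factors listed above so a human can check them. The sketch below is a heuristic, not a confidence score from the diarization model: it assumes turns shaped like the output of `_group_into_turns` (with `speaker`, `start`, and `end` keys), and the function name and 1-second threshold are illustrative.

```python
from typing import Dict, List

def flag_turns_for_review(turns: List[Dict], min_duration_s: float = 1.0) -> List[Dict]:
    """Return the turns that look risky, each annotated with the reasons why."""
    flagged = []
    for turn in turns:
        reasons = []
        if turn['speaker'] == 'UNKNOWN':
            reasons.append('no speaker matched')
        if turn['end'] - turn['start'] < min_duration_s:
            reasons.append('very short turn')
        if reasons:
            flagged.append({**turn, 'review_reasons': reasons})
    return flagged

example_turns = [
    {'speaker': 'SPEAKER_00', 'text': 'Hello everyone, welcome.', 'start': 0.0, 'end': 3.0},
    {'speaker': 'SPEAKER_01', 'text': 'Yes.', 'start': 3.2, 'end': 3.6},
    {'speaker': 'UNKNOWN', 'text': 'Revenue was up this quarter.', 'start': 4.0, 'end': 7.5},
]

for turn in flag_turns_for_review(example_turns):
    print(turn['speaker'], turn['review_reasons'])
```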

Complete code: See Speaker Diarization and Audio Transcription for production-ready implementations with error handling and confidence scoring.

Streaming vs. Batch Latency Tradeoffs

Real-time applications (live captions, voice assistants, real-time translation) can’t wait for complete audio files. The fundamental tradeoff is latency vs. accuracy.

Batch Processing:

  • Latency: 1-10 seconds for short clips, up to minutes for long recordings (upload + processing + return)
  • Accuracy: 1-5% WER on clear speech (95-99% word accuracy)
  • Context: Full utterance available for disambiguation
  • Cost: $0.006 per minute
  • Use case: Recorded content, post-meeting transcripts, podcast transcription

Streaming Processing:

  • Latency: 200-500ms (real-time or near-real-time)
  • Accuracy: 4-8% WER (higher error rate due to limited context)
  • Context: Only past audio available
  • Cost: Higher (maintained connection, more infrastructure)
  • Use case: Live captions, voice assistants, real-time translation
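The accuracy figures above are word error rates (WER): the number of word substitutions, deletions, and insertions needed to turn the system output into the reference transcript, divided by the number of reference words. A minimal sketch of that calculation follows; it lowercases and splits on whitespace only, whereas real evaluations usually apply more careful text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref_words)

reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"  # one substitution, one missing word

print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.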

Hybrid Approach: Streaming with Correction

Many production systems use a two-phase approach: fast streaming for immediate display, batch correction for accuracy.

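import asyncio
import os
import tempfile
import wave
from typing import AsyncIterator, Dict, List

class HybridStreamingTranscription:
    """Streaming transcription with delayed batch correction."""

    def __init__(self, sample_rate: int = 16000):
        # StreamingWhisperClient is a placeholder for your low-latency
        # streaming ASR client; it is not a real library class
        self.streaming_model = StreamingWhisperClient()  # Fast, lower accuracy
        self.batch_model = TranscriptionService()        # Slower, higher accuracy
        self.buffer: List[bytes] = []
        self.chunk_duration_ms = 5000   # Assumes each incoming chunk is 5 seconds
        self.sample_rate = sample_rate  # Assumes 16-bit mono PCM input

    async def process_stream(
        self,
        audio_stream: AsyncIterator[bytes]
    ) -> AsyncIterator[Dict]:
        """
        Process audio stream with both immediate and corrected outputs.

        Yields:
            - Immediate streaming results (type='streaming')
            - Corrected batch results every 30 seconds (type='corrected')
        """
        async for chunk in audio_stream:
            # Immediate streaming result for live display
            streaming_result = await self.streaming_model.transcribe_chunk(chunk)
            yield {
                'type': 'streaming',
                'text': streaming_result['text'],
                'is_final': False,
                'latency_ms': streaming_result['latency']
            }

            self.buffer.append(chunk)

            # Every 30 seconds, send corrected batch result
            buffer_duration = len(self.buffer) * self.chunk_duration_ms
            if buffer_duration >= 30000:
                yield await self._correct_buffer()

        # Correct whatever audio is left when the stream ends
        if self.buffer:
            yield await self._correct_buffer()

    async def _correct_buffer(self) -> Dict:
        """Re-transcribe the buffered audio with the batch model."""
        wav_path = self._combine_chunks(self.buffer)
        self.buffer = []
        try:
            # transcribe() is synchronous, so run it in a worker thread
            corrected = await asyncio.to_thread(
                self.batch_model.transcribe, wav_path
            )
        finally:
            os.remove(wav_path)

        return {
            'type': 'corrected',
            'text': corrected['text'],
            'is_final': True,
            'replaces_streaming': True,
            'duration': corrected['duration']
        }

    def _combine_chunks(self, chunks: List[bytes]) -> str:
        """Write raw PCM chunks to a temporary WAV file and return its path.

        Compressed input would need to be decoded before this step.
        """
        fd, wav_path = tempfile.mkstemp(suffix=".wav")
        os.close(fd)
        with wave.open(wav_path, "wb") as wav_file:
            wav_file.setnchannels(1)
            wav_file.setsampwidth(2)  # 16-bit samples
            wav_file.setframerate(self.sample_rate)
            wav_file.writeframes(b"".join(chunks))
        return wav_path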

This hybrid approach provides the best of both worlds:

  • Users see immediate feedback (streaming)
  • Final transcript has high accuracy (batch correction)
  • System can update the display when corrections arrive
  • Total latency is the same as pure streaming for initial display
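On the display side, handling both message types can be as simple as keeping provisional and finalized text separately. The sketch below is one possible approach, not part of the pipeline above: `LiveTranscript` is a hypothetical name, and it assumes each 'corrected' message covers exactly the streaming results received since the previous correction.

```python
from typing import Dict, List

class LiveTranscript:
    """Holds finalized text plus provisional streaming text for display."""

    def __init__(self) -> None:
        self.final_parts: List[str] = []
        self.provisional_parts: List[str] = []

    def apply(self, message: Dict) -> None:
        if message['type'] == 'streaming':
            # Show immediately, but treat as replaceable
            self.provisional_parts.append(message['text'])
        elif message['type'] == 'corrected':
            # The correction supersedes everything still provisional
            self.final_parts.append(message['text'])
            self.provisional_parts = []

    def render(self) -> str:
        return ' '.join(self.final_parts + self.provisional_parts)

display = LiveTranscript()
display.apply({'type': 'streaming', 'text': 'revenue was up fifteen'})
display.apply({'type': 'streaming', 'text': 'per cent last quarter'})
print(display.render())

display.apply({'type': 'corrected', 'text': 'Revenue was up 15% last quarter.'})
print(display.render())
```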

Case Study: Meeting Transcription Platform

A productivity company built a meeting transcription service for enterprise customers. Requirements:

  • Support 2-10 speakers in meetings
  • Real-time display during meeting
  • Post-meeting corrections and summary
  • Action item extraction
  • Integration with calendar and task systems

Architecture

Meeting Transcription Pipeline

Key Design Decisions

1. Streaming + Correction for UX

Users see real-time text during the meeting, which keeps them engaged and lets them verify what was captured, with periodic corrections for accuracy. This works better than waiting until the end of the meeting and showing only the final transcript.

2. Speaker Identification via Voice Embeddings

Rather than just labeling speakers as SPEAKER_00, SPEAKER_01, they built a system to recognize recurring speakers across meetings:

class SpeakerIdentification:
    """Identify recurring speakers across meetings using voice embeddings."""

    def __init__(self):
        self.embedding_model = SpeakerEmbeddingModel()
        self.speaker_database = {}  # {user_id: embedding}

    def identify_speakers(
        self,
        audio_path: str,
        participant_ids: List[str]
    ) -> Dict:
        """
        Match diarized speakers to known participants.

        Args:
            audio_path: Path to meeting audio
            participant_ids: List of participant user IDs

        Returns:
            Mapping of speaker labels to user IDs
        """
        # Get diarization with speaker embeddings
        diarization = self.diarizer(audio_path, return_embeddings=True)

        # Extract representative embedding for each speaker
        speaker_embeddings = {}
        for speaker_label in set(diarization.labels()):
            segments = [
                seg for seg, _, lbl in diarization.itertracks(yield_label=True)
                if lbl == speaker_label
            ]
            # Use embedding from longest segment
            longest = max(segments, key=lambda s: s.duration)
            speaker_embeddings[speaker_label] = longest.embedding

        # Match speakers to participants
        matches = {}
        for speaker_label, embedding in speaker_embeddings.items():
            best_match = None
            best_similarity = 0

            for user_id in participant_ids:
                if user_id in self.speaker_database:
                    similarity = cosine_similarity(
                        embedding,
                        self.speaker_database[user_id]
                    )
                    if similarity > best_similarity and similarity > 0.7:
                        best_similarity = similarity
                        best_match = user_id

            matches[speaker_label] = best_match or "Unknown"

        return matches

    def register_speaker(self, user_id: str, audio_sample: str):
        """Register a user's voice embedding from a sample."""
        embedding = self.embedding_model.extract(audio_sample)
        self.speaker_database[user_id] = embedding
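
The matching loop above calls a `cosine_similarity` helper that the listing does not define. The following is a minimal pure-Python sketch of that step, assuming embeddings are plain sequences of floats (a production system would more likely use NumPy arrays). The `best_match` function is a hypothetical name, and its 0.7 default simply mirrors the threshold in the listing.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat an all-zero embedding as matching nothing
    return dot / (norm_a * norm_b)


def best_match(
    embedding: Sequence[float],
    enrolled: Dict[str, Sequence[float]],
    threshold: float = 0.7,
) -> Optional[str]:
    """Return the enrolled user most similar to `embedding`, or None."""
    best_user, best_score = None, threshold
    for user_id, reference in enrolled.items():
        score = cosine_similarity(embedding, reference)
        if score > best_score:
            best_user, best_score = user_id, score
    return best_user


# Toy 3-dimensional embeddings; real speaker embeddings have hundreds of dimensions.
enrolled = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
print(best_match([0.9, 0.1, 0.0], enrolled))  # closest to alice
print(best_match([0.5, 0.5, 0.7], enrolled))  # below the threshold for both
```

Starting `best_score` at the threshold folds the two comparisons from the listing into one. The threshold itself is a tuning parameter that depends on the embedding model and should be calibrated on enrolled voices.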

3. LLM Post-Processing

Raw transcription is processed by an LLM to add punctuation, fix obvious errors, and extract structured information:

MEETING_SUMMARY_PROMPT = """
Analyze this meeting transcript and provide:

## SUMMARY
2-3 paragraphs covering:

- Main topics discussed
- Key outcomes and decisions
- Overall tone and dynamics

## DECISIONS MADE
List any decisions that were explicitly agreed upon.
Format: "- [Decision] - Agreed by [participants if mentioned]"

## ACTION ITEMS
List action items with owner and deadline if mentioned.
Format: "- [ ] [Owner] Task description (Deadline: [date] or 'Not specified')"

## OPEN QUESTIONS
Topics raised but not resolved.

## KEY QUOTES
Notable statements worth highlighting (max 3).

Transcript:
{transcript}
"""

async def process_completed_meeting(transcript: str, metadata: Dict) -> Dict:
    """Extract structured information from meeting transcript."""

    prompt = MEETING_SUMMARY_PROMPT.format(transcript=transcript)

    response = await llm.generate(
        prompt,
        temperature=0.3,  # Lower temperature for factual extraction
        max_tokens=1500
    )

    return {
        'summary': response,
        'metadata': metadata,
        'transcript': transcript
    }
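
The function above returns the LLM response as one string. To feed action items into a tracker, the response has to be split back into its sections. Below is a rough sketch of one way to do that, assuming the model follows the prompt's headings and keeps the literal square brackets around the owner. Both assumptions should be validated, since LLM output can drift from the requested format. The function names and the sample response are made up for illustration.

```python
import re
from typing import Dict, List

# Matches lines such as: - [ ] [Owner] Task description (Deadline: Friday)
ACTION_ITEM_RE = re.compile(
    r"^- \[ \] \[(?P<owner>[^\]]+)\] (?P<task>.+?) \(Deadline: (?P<deadline>[^)]+)\)\s*$"
)


def split_sections(response: str) -> Dict[str, str]:
    """Split a response into {heading: body} using lines that start with '## '."""
    sections: Dict[str, List[str]] = {}
    current = None
    for line in response.splitlines():
        if line.startswith("## "):
            current = line[3:].strip().upper()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}


def parse_action_items(section_body: str) -> List[Dict[str, str]]:
    """Extract owner, task, and deadline from lines in the prompt's format."""
    items = []
    for line in section_body.splitlines():
        match = ACTION_ITEM_RE.match(line.strip())
        if match:
            items.append(match.groupdict())
    return items


# Hand-written sample response, not real model output.
sample = """## SUMMARY
The team reviewed the launch plan.

## ACTION ITEMS
- [ ] [Priya] Draft the rollout checklist (Deadline: Friday)
- [ ] [Marco] Confirm vendor pricing (Deadline: Not specified)
"""
sections = split_sections(sample)
for item in parse_action_items(sections["ACTION ITEMS"]):
    print(item)
```

Lines that do not match the pattern are silently skipped here. A production pipeline would likely log them, or request structured (JSON) output from the model instead of parsing free text.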

Results

After 6 months of deployment:

  • 94% user satisfaction (vs. 67% with manual note-taking)
  • Average 15 minutes saved per meeting in follow-up clarifications
  • 3x increase in action item completion (clear assignment + automatic tracking)
  • 40% reduction in “what was decided?” questions post-meeting

The combination of accurate transcription, speaker identification, and LLM-powered structure extraction created value that exceeded simple transcription.


Summary

This chapter covered the fundamentals of audio and speech processing for AI systems:

Key Takeaways:

  1. Audio representation matters: Mel spectrograms balance computational efficiency with acoustic information, making them ideal input for neural networks processing speech

  2. Whisper dominates production STT: OpenAI’s Whisper models provide strong multilingual transcription out of the box, with turbo variants offering production-ready latency

  3. Diarization is hard but essential: Knowing “who said what” requires speaker embeddings, clustering, and often domain-specific tuning—pyannote.audio provides strong baselines

  4. Real-time requires different architecture: Streaming transcription uses VAD chunking, incremental decoding, and careful buffer management—fundamentally different from batch processing

  5. Domain adaptation improves accuracy: Custom vocabularies, fine-tuning on domain audio, and post-processing rules can dramatically improve accuracy for specialized domains

  6. Production systems need resilience: Audio processing pipelines must handle variable quality, background noise, overlapping speech, and hardware failures gracefully

Connections to Other Chapters

  • Chapter 5 (Transformer Architecture): Whisper is an encoder-decoder transformer; understanding attention helps debug transcription errors
  • Chapter 7 (RAG Systems): Audio embeddings enable semantic search over speech content, extending RAG to audio modalities
  • Chapter 17 (Vision & Document AI): Similar preprocessing patterns (mel spectrograms vs image preprocessing) and model architectures
  • Chapter 19 (Video & Multimodal RAG): Combines audio processing with video understanding for complete multimodal systems

Practical Exercises

  1. Whisper Evaluation: Transcribe the same 5-minute audio clip using Whisper tiny, base, small, medium, and large-v3. Compare Word Error Rate (WER), processing time, and subjective quality. At what quality level do you stop noticing improvements?

  2. Streaming Prototype: Build a streaming transcription system using Whisper with VAD chunking. Measure the latency from end-of-utterance to transcript display. Experiment with chunk sizes and overlap to find the best tradeoff.

  3. Domain Vocabulary: Take a technical podcast episode (with jargon). Transcribe it and measure accuracy on domain-specific terms. Then add a custom vocabulary/prompt with key terms. Measure improvement.

  4. Diarization Pipeline: Use pyannote.audio to diarize a multi-speaker meeting recording. Evaluate speaker assignment accuracy. Identify failure modes (overlapping speech, similar voices).

  5. Audio Quality Impact: Artificially degrade a clean audio file (add noise, reduce bitrate, add reverb). Measure how each degradation affects transcription accuracy. Create guidelines for minimum audio quality.
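
Several of these exercises depend on Word Error Rate. The following is a small sketch of the standard definition, the word-level edit distance divided by the number of reference words, assuming simple lowercase whitespace tokenization. Serious evaluations usually also normalize punctuation and numerals before scoring. The example sentences are invented.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("Reference transcript must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


reference = "schedule the deposition for next tuesday"
hypothesis = "schedule a deposition next tuesday"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

In this example there is one substitution ("the" becomes "a") and one deletion ("for") against six reference words. WER can exceed 1.0 when the hypothesis contains many insertions.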


Self-Assessment Checkpoint

Conceptual Questions

Q1. [IC2] Why do modern speech recognition systems use mel spectrograms rather than raw waveforms as input? What information is preserved and what is lost?

Answer

Mel spectrograms transform time-domain audio into a time-frequency representation that approximates human auditory perception. Benefits: (1) Shortens the sequence: 16kHz audio is 16,000 samples/second, while a spectrogram has ~100 frames/second, each with 80-128 mel bins. (2) Emphasizes speech-relevant frequencies using the mel scale, which is roughly linear below 1kHz and logarithmic above it, matching human pitch resolution. (3) Removes phase information that’s mostly irrelevant for speech recognition. (4) Creates image-like 2D representations that neural networks process efficiently. What’s lost: (1) Fine temporal detail within each analysis window (typically 25ms). (2) Phase information needed for audio reconstruction. (3) Very high or low frequencies outside the mel filter bank range. This tradeoff works well because speech content lies primarily in the preserved frequency ranges.
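
The numbers in this answer can be checked with a few lines of arithmetic. The sketch below assumes a 25ms window, a 10ms hop, and 80 mel bins, and uses the common HTK-style mel formula. Other mel variants exist and give slightly different values.

```python
import math


def hz_to_mel(frequency_hz: float) -> float:
    """HTK-style mel formula; other variants (e.g. Slaney) differ slightly."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


SAMPLE_RATE = 16_000   # samples per second
WINDOW_MS = 25         # analysis window length
HOP_MS = 10            # stride between successive windows
N_MELS = 80            # mel bins per frame

window_samples = SAMPLE_RATE * WINDOW_MS // 1000   # samples covered by one window
hop_samples = SAMPLE_RATE * HOP_MS // 1000         # samples between window starts
frames_per_second = SAMPLE_RATE // hop_samples
values_per_second = frames_per_second * N_MELS

print(f"{window_samples} samples per window, {hop_samples} per hop")
print(f"{frames_per_second} frames/s x {N_MELS} bins = {values_per_second} values/s")

# Equal steps in Hz cover fewer mels at higher frequencies.
for low, high in [(0, 1000), (1000, 2000), (7000, 8000)]:
    print(f"{low}-{high} Hz spans {hz_to_mel(high) - hz_to_mel(low):.0f} mel")
```

The raw value count only halves with 80 bins (16,000 samples versus 8,000 spectrogram values per second). The larger gain for a transformer is that the sequence length falls from 16,000 to 100 time steps per second.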

Q2. [IC2] Explain the tradeoffs between batch transcription and streaming transcription. When would you choose each?

Answer

Batch transcription: Process complete audio files in one pass. Benefits: (1) Higher accuracy—model sees full context for word disambiguation. (2) Can use larger, slower models. (3) Better speaker diarization with full conversation. (4) No latency requirements. Drawbacks: (1) Can’t provide real-time feedback. (2) Must wait for recording to complete. Use cases: Podcast transcription, legal depositions, archival audio.

Streaming transcription: Process audio in chunks as it arrives. Benefits: (1) Real-time output—users see transcription as they speak. (2) Enables live applications (captions, assistants). (3) Lower memory requirements. Drawbacks: (1) Lower accuracy due to limited context. (2) Complex buffer management. (3) Must handle partial utterances and corrections. Use cases: Live captions, voice assistants, meeting transcription, real-time translation.

Hybrid approaches use streaming with periodic correction passes, combining low latency with higher eventual accuracy.

Q3. [Senior] What is speaker diarization and why is it challenging? What approaches do modern systems use?

Answer

Speaker diarization answers “who spoke when?” by segmenting audio and assigning segments to speakers. Challenges: (1) Unknown number of speakers a priori. (2) Overlapping speech (multiple speakers at once). (3) Similar voice characteristics between speakers. (4) Speaker changes mid-sentence. (5) Non-speech sounds (laughter, coughs) assigned to speakers.

Modern approaches: (1) Embedding-based: Extract speaker embeddings (d-vectors, x-vectors) for segments, then cluster similar embeddings. (2) End-to-end neural diarization (EEND): A single model jointly predicts speech activity and speaker assignment, including overlapping speech. (3) Pipeline: VAD → segmentation → embedding → clustering → resegmentation; pyannote.audio combines a neural segmentation model with embedding clustering in this style. (4) Online diarization: Maintain a speaker registry and assign new segments in real time.

Evaluation metrics: Diarization Error Rate (DER) = (missed speech + false alarm + speaker confusion) / total reference speech duration. State-of-the-art: ~5-10% DER on standard benchmarks, higher in difficult conditions such as heavy overlap or far-field audio.
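
As a worked example of the DER definition, the sketch below takes the three error durations as already measured (in practice a scoring tool computes them after finding the best mapping between reference and hypothesis speakers) and normalizes by the total reference speech time. The numbers are illustrative, not benchmark results.

```python
def diarization_error_rate(
    missed_speech: float,
    false_alarm: float,
    speaker_confusion: float,
    total_reference_speech: float,
) -> float:
    """All arguments are durations in seconds."""
    if total_reference_speech <= 0:
        raise ValueError("Reference must contain some speech")
    errors = missed_speech + false_alarm + speaker_confusion
    return errors / total_reference_speech


# Illustrative numbers for a meeting containing 8 minutes of speech.
der = diarization_error_rate(
    missed_speech=12.0,
    false_alarm=9.0,
    speaker_confusion=27.0,
    total_reference_speech=480.0,
)
print(f"DER: {der:.1%}")
```

Because false alarms are counted against reference speech only, DER can exceed 100% on recordings where the system hallucinates a lot of speech.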

Q4. [Senior] How does Whisper handle multilingual transcription? What are its limitations for low-resource languages?

Answer

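Whisper approach: The original models were trained on 680,000 hours of weakly supervised audio, of which roughly 117,000 hours cover 96 non-English languages and another 125,000 hours are X→English translation data; the remainder is English. Uses language tokens to condition generation. Can auto-detect the language or be told which language to expect. The encoder and decoder are shared across languages; only the language token changes.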

Multilingual capabilities: (1) Strong on high-resource languages (English, Spanish, German, French, Mandarin). (2) Reasonable on medium-resource languages. (3) Transcribe AND translate (any language → English).

Low-resource limitations: (1) Accuracy degrades significantly for languages with <1000 hours training data. (2) Rare scripts may have tokenization issues. (3) Code-switching (mixing languages) often fails. (4) Accents and dialects underrepresented. (5) Domain-specific vocabulary for rare languages very limited.

Mitigations: Fine-tune on target language data, use language prompts explicitly, post-process with language-specific spell checkers.
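
To make the idea of language tokens concrete, the sketch below builds the special-token prefix that conditions Whisper's decoder, following the token format described in the Whisper paper. It is only a string-level illustration with a hypothetical helper name. The real tokenizer maps these tokens to integer IDs and is not called here.

```python
from typing import List


def decoder_prefix(language: str, task: str = "transcribe",
                   timestamps: bool = True) -> List[str]:
    """Build the special-token prefix that conditions the decoder."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prefix.append("<|notimestamps|>")
    return prefix


print(decoder_prefix("es"))                    # Spanish audio, Spanish text
print(decoder_prefix("es", task="translate"))  # Spanish audio, English text
print(decoder_prefix("en", timestamps=False))
```

Setting the language explicitly skips auto-detection, which is one reason explicit language prompts help on short or noisy clips in low-resource languages.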

Q5. [Staff] Design a quality assurance system for an audio transcription service. How would you detect and handle low-quality transcriptions before they reach users?

Answer

QA pipeline components:

Input quality assessment: (1) Audio SNR estimation. (2) Clipping detection. (3) Bitrate/format validation. (4) Duration checks. Flag poor inputs for manual review or enhanced processing.

Transcription confidence: (1) Token-level confidence scores from Whisper. (2) Low average confidence → flag for review. (3) High uncertainty segments marked in output. (4) Compare multiple model outputs for consistency.

Output validation: (1) Language detection on output matches expected. (2) Grammar/coherence checks (LLM-based). (3) Domain vocabulary validation. (4) Diarization sanity checks (reasonable speaker count, segment lengths).

Human-in-the-loop: (1) Sample-based QA on percentage of transcriptions. (2) User feedback integration. (3) Escalation for low-confidence outputs. (4) Continuous labeling for model improvement.

Metrics: (1) Automated quality score per transcription. (2) Human QA error rates. (3) User-reported issues. (4) Confidence calibration accuracy.
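
The transcription-confidence checks can be sketched against the per-segment fields that the open-source Whisper package returns (`avg_logprob`, `compression_ratio`, `no_speech_prob`). The thresholds below mirror that package's default decoding thresholds, but they are starting points rather than calibrated values. The `flag_segments` name and the sample segments are hypothetical.

```python
import math
from typing import Dict, List


def flag_segments(
    segments: List[Dict],
    min_avg_logprob: float = -1.0,
    max_compression_ratio: float = 2.4,
    max_no_speech_prob: float = 0.6,
) -> List[Dict]:
    """Attach a list of review reasons to each segment (empty means OK)."""
    flagged = []
    for seg in segments:
        reasons = []
        if seg["avg_logprob"] < min_avg_logprob:
            reasons.append("low token confidence")
        if seg["compression_ratio"] > max_compression_ratio:
            reasons.append("repetitive text")
        if seg["no_speech_prob"] > max_no_speech_prob:
            reasons.append("probably not speech")
        flagged.append({
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"],
            # exp(mean log-probability) is the geometric mean token probability
            "confidence": math.exp(seg["avg_logprob"]),
            "reasons": reasons,
        })
    return flagged


# Hand-written segments in the shape of Whisper's output, not real results.
segments = [
    {"start": 0.0, "end": 4.2, "text": " Thanks for joining.",
     "avg_logprob": -0.21, "compression_ratio": 1.3, "no_speech_prob": 0.02},
    {"start": 4.2, "end": 9.8, "text": " the the the the the",
     "avg_logprob": -1.40, "compression_ratio": 3.1, "no_speech_prob": 0.10},
]
for seg in flag_segments(segments):
    print(f"{seg['start']:.1f}-{seg['end']:.1f}s  {seg['reasons'] or 'ok'}")
```

Segments with a non-empty `reasons` list are candidates for the escalation path described under human-in-the-loop review.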

Spot the Problem

Problem 1. [IC2] A developer’s audio processing pipeline:

def transcribe_audio(file_path: str) -> str:
    audio = load_audio(file_path)  # Load at original sample rate
    result = whisper.transcribe(audio)
    return result["text"]

What’s missing?

Answer

Issues: (1) No sample rate conversion: Whisper expects 16kHz audio, so 44.1kHz or 48kHz input passed through unchanged will transcribe poorly or fail outright. (2) No audio validation: corrupt files, wrong formats, or silence will cause issues. (3) No error handling: network issues, OOM, or corrupt audio will crash the pipeline. (4) No preprocessing: background noise, low volume, and clipping are not addressed. (5) No confidence scores: returning only the text discards quality signals.

Better implementation:

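def transcribe_audio(file_path: str) -> dict:
    # Helper functions and MIN_SNR are application-defined placeholders
    try:
        audio = load_audio(file_path)
    except Exception as exc:  # Corrupt or unsupported file
        raise ValueError(f"Could not load audio: {file_path}") from exc

    audio = resample_to_16khz(audio)  # Whisper expects 16kHz input
    audio = normalize_audio(audio)    # Bring quiet recordings up to a usable level

    if calculate_snr(audio) < MIN_SNR:
        raise LowQualityAudioError("Audio quality too low")

    result = whisper.transcribe(audio)
    return {
        "text": result["text"],
        "segments": result["segments"],
        "language": result["language"],
        "avg_confidence": calculate_avg_confidence(result)
    }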
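
The `normalize_audio` and `calculate_snr` helpers in this listing are not defined in the chapter. One possible pure-Python sketch follows, assuming audio is a sequence of floats in [-1, 1] at 16kHz. The SNR function is a crude heuristic that treats the quietest frames as noise-only pauses. It is not a true signal-to-noise measurement, and `estimate_snr_db` and `clipping_ratio` are hypothetical names.

```python
import math
from typing import List, Sequence


def normalize_audio(samples: Sequence[float], target_peak: float = 0.95) -> List[float]:
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # Pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def clipping_ratio(samples: Sequence[float], limit: float = 0.999) -> float:
    """Fraction of samples at or beyond full scale."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if abs(s) >= limit) / len(samples)


def estimate_snr_db(samples: Sequence[float], frame_size: int = 400) -> float:
    """Crude SNR: energy of the loudest frames over the quietest frames, in dB."""
    energies = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energies.append(sum(s * s for s in frame) / frame_size)
    if len(energies) < 2:
        raise ValueError("Need at least two full frames")
    energies.sort()
    tenth = max(1, len(energies) // 10)
    noise = sum(energies[:tenth]) / tenth     # quietest 10% of frames
    signal = sum(energies[-tenth:]) / tenth   # loudest 10% of frames
    return 10.0 * math.log10(signal / max(noise, 1e-12))


# Synthetic test signal: half a second of quiet tone, then half a second of loud tone.
tone = [math.sin(2 * math.pi * 200 * n / 16_000) for n in range(16_000)]
audio = [0.01 * s for s in tone[:8_000]] + [0.5 * s for s in tone[8_000:]]
print(f"estimated SNR: {estimate_snr_db(audio):.1f} dB")
print(f"clipping ratio: {clipping_ratio(audio):.3f}")
print(f"peak after normalization: {max(abs(s) for s in normalize_audio(audio)):.2f}")
```

Peak normalization does not remove noise; it only raises the overall level. The clipping check should run on the raw input, before normalization rescales the samples.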

Problem 2. [Senior] A streaming transcription system has high latency:

def stream_transcribe(audio_stream):
    buffer = []
    for chunk in audio_stream:
        buffer.append(chunk)
        if len(buffer) >= 30:  # 30 seconds of audio
            text = transcribe(concat(buffer))
            yield text
            buffer = []

What’s wrong and how would you fix it?

Answer

Issues: (1) 30-second buffer means 30+ second latency—users wait half a minute for any output. (2) No overlap—context lost at buffer boundaries, causing errors. (3) No VAD—buffering silence wastes resources and time. (4) Full retranscription—doesn’t leverage incremental/streaming capabilities.

Fixes: (1) Use VAD to detect speech segments, transcribe on speech end. (2) Reduce chunk size to 2-5 seconds for faster feedback. (3) Use sliding window with overlap for context. (4) Implement streaming Whisper or use streaming-native models (like AssemblyAI streaming).

Better architecture:

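def stream_transcribe(audio_stream, overlap_size: int = 2):
    vad = VoiceActivityDetector()
    current_segment = []
    new_chunks = 0  # Chunks received since the last transcript

    for chunk in audio_stream:
        current_segment.append(chunk)
        new_chunks += 1

        if vad.is_end_of_speech(chunk):
            audio = concat(current_segment)
            text = transcribe(audio)
            yield text
            # Keep the last few chunks as context for the next utterance
            current_segment = current_segment[-overlap_size:] if overlap_size else []
            new_chunks = 0

    # Flush speech still buffered when the stream ends
    if new_chunks:
        yield transcribe(concat(current_segment))

Result: Latency drops from 30s to 1-3s for short utterances (utterance length + processing time).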
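
The `VoiceActivityDetector` in this architecture is left undefined. As a rough illustration of the interface it needs, here is an energy-threshold detector with a hangover period, assuming each chunk is a sequence of float samples. The threshold and hangover values are arbitrary. Production systems typically use a trained VAD model rather than raw energy, which misfires on background noise.

```python
import math
from typing import Sequence


class VoiceActivityDetector:
    """Energy-threshold VAD with a hangover period (illustrative only)."""

    def __init__(self, energy_threshold: float = 0.01, hangover_chunks: int = 3):
        self.energy_threshold = energy_threshold  # RMS level treated as speech
        self.hangover_chunks = hangover_chunks    # Silent chunks that end an utterance
        self._in_speech = False
        self._silent_run = 0

    @staticmethod
    def _rms(chunk: Sequence[float]) -> float:
        return math.sqrt(sum(s * s for s in chunk) / len(chunk)) if chunk else 0.0

    def is_end_of_speech(self, chunk: Sequence[float]) -> bool:
        """Return True once, on the chunk that completes the silence run after speech."""
        if self._rms(chunk) >= self.energy_threshold:
            self._in_speech = True
            self._silent_run = 0
            return False
        if not self._in_speech:
            return False  # Leading silence: no utterance to close
        self._silent_run += 1
        if self._silent_run >= self.hangover_chunks:
            self._in_speech = False
            self._silent_run = 0
            return True
        return False


loud, quiet = [0.2] * 160, [0.0] * 160
vad = VoiceActivityDetector(hangover_chunks=2)
stream = [quiet, loud, loud, quiet, loud, quiet, quiet, quiet]
print([vad.is_end_of_speech(chunk) for chunk in stream])
```

The hangover keeps a brief pause inside a sentence (the single quiet chunk in the middle of the example) from splitting the utterance. It also adds directly to end-of-utterance latency, so it is one of the main tuning knobs.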

Problem 3. [Staff] A meeting transcription product has accuracy complaints:

Support tickets:
- "Names are wrong half the time"
- "Technical terms come out garbled"
- "Can't tell who said what in group discussions"
- "Transcripts useless for our engineering meetings"

Diagnose the issues and propose solutions.

Answer

Root causes:

  1. “Names are wrong”: Whisper has no context about meeting participants. Proper nouns are guessed from phonetics. Solution: (a) Pre-meeting participant list for name hints. (b) Post-processing with known name patterns. (c) User corrections fed back to improve future transcriptions.

  2. “Technical terms garbled”: Generic model lacks domain vocabulary. Solution: (a) Custom vocabulary/prompt with technical terms. (b) Domain-specific fine-tuning if data available. (c) Post-processing with technical glossary. (d) Integrate with company knowledge base for term disambiguation.

  3. “Can’t tell who said what”: Diarization issues with group discussions. Solutions: (a) Add speaker identification (enroll users with voice samples). (b) Use video to help diarization (detect who’s speaking via lip movement). (c) Improve diarization model for overlapping speech. (d) Let users correct speaker assignments, learn preferences.

  4. “Useless for engineering meetings”: Combination of all above, plus engineering jargon, code references, and acronyms. Solution: Build engineering-specific transcription profile with technical vocabulary, common acronyms, and integration with codebase for identifier recognition.

Product changes: (a) Pre-meeting setup with attendees and agenda. (b) Domain selection (engineering, sales, legal). (c) Easy correction UI that improves future transcripts.
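
Two of the proposed fixes, name hints and glossary post-processing, can be sketched in a few lines. The prompt builder produces a string that could be passed as the `initial_prompt` argument of the open-source Whisper `transcribe` function (only a limited number of prompt tokens are used, so keep it short). The correction step uses fuzzy matching from the standard library. The function names, the 0.8 cutoff, and the sample data are illustrative assumptions.

```python
import difflib
from typing import List


def build_initial_prompt(participants: List[str], glossary: List[str]) -> str:
    """Compose a short context string listing names and terms to expect."""
    return (
        f"Meeting participants: {', '.join(participants)}. "
        f"Terms that may come up: {', '.join(glossary)}."
    )


def correct_terms(text: str, glossary: List[str], cutoff: float = 0.8) -> str:
    """Replace words that closely resemble a glossary term with that term."""
    lookup = {term.lower(): term for term in glossary}
    corrected = []
    for word in text.split():
        core = word.strip(".,!?;:")  # Compare without surrounding punctuation
        close = difflib.get_close_matches(core.lower(), list(lookup), n=1, cutoff=cutoff)
        if core and close:
            word = word.replace(core, lookup[close[0]])
        corrected.append(word)
    return " ".join(corrected)


glossary = ["Kubernetes", "PostgreSQL", "gRPC"]
print(build_initial_prompt(["Priya Raman", "Marco Silva"], glossary))
print(correct_terms("We should move the postgress cluster to kubernetties.", glossary))
```

Fuzzy replacement can also rewrite ordinary words that happen to resemble a glossary term, so the cutoff should be tuned on real transcripts and the corrections logged for review.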

Design Exercises

Exercise 1. [Senior] Design an audio search system for a podcast hosting platform. Users should be able to search across millions of episodes to find discussions of specific topics, including semantic search (“discussions about AI ethics” should match “debate about responsible machine learning”).

Guidance

Indexing pipeline: (1) Download/receive audio files. (2) Transcribe with Whisper large-v3 (accuracy over speed for indexing). (3) Chunk transcripts into 30-60 second segments with overlap. (4) Generate embeddings for each segment using a text embedding model. (5) Store in a vector database with metadata (show, episode, timestamp, speaker if diarization is done).

Query processing: (1) Embed user query. (2) Vector similarity search to retrieve top candidates. (3) Re-rank with cross-encoder for precision. (4) Group results by episode, deduplicate. (5) Return clips with timestamps.

Enhancements: (1) Named entity recognition to extract people, organizations, topics mentioned. (2) Episode-level embeddings for “similar episodes” feature. (3) Transcript display with search highlights. (4) Audio player that jumps to matched timestamps.

Scaling: Index millions of episodes offline, serve queries from the vector DB. Consider sharding by show category or date range.

Cost: Whisper transcription ~$0.006/minute (hosted API pricing), embedding ~$0.0001/1K tokens, vector storage ~$0.01/1M vectors/month. Treat these as rough planning figures and check current provider pricing.
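
The chunking step in the indexing pipeline can be sketched as follows, assuming transcript segments are dictionaries with `start`, `end`, and `text` keys (the shape Whisper returns) sorted by time. The window and overlap defaults are illustrative values within the 30-60 second range suggested above.

```python
from typing import Dict, List


def chunk_segments(
    segments: List[Dict],
    window_seconds: float = 45.0,
    overlap_seconds: float = 10.0,
) -> List[Dict]:
    """Group timestamped segments into overlapping windows for embedding."""
    if overlap_seconds >= window_seconds:
        raise ValueError("overlap must be shorter than the window")
    if not segments:
        return []

    chunks = []
    step = window_seconds - overlap_seconds
    window_start = segments[0]["start"]
    last_end = segments[-1]["end"]
    while window_start < last_end:
        window_end = window_start + window_seconds
        # Keep whole segments that overlap the window, so sentences stay intact
        members = [s for s in segments
                   if s["start"] < window_end and s["end"] > window_start]
        if members:
            chunks.append({
                "start": members[0]["start"],
                "end": members[-1]["end"],
                "text": " ".join(s["text"].strip() for s in members),
            })
        window_start += step
    return chunks


# Six made-up 10-second segments covering one minute of audio.
segments = [
    {"start": float(i * 10), "end": float(i * 10 + 10), "text": f" sentence {i}"}
    for i in range(6)
]
for chunk in chunk_segments(segments):
    print(f"{chunk['start']:.0f}-{chunk['end']:.0f}s: {chunk['text']}")
```

Because segments are kept whole, a chunk can run slightly longer than the nominal window. The stored `start` timestamp is what lets the player jump to the matched clip.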

Exercise 2. [Staff] Design a real-time meeting assistant that transcribes, identifies speakers, and generates summaries. Requirements: <3 second transcription latency, support 2-10 speakers, integrate with Zoom/Teams, generate action items in real-time. Consider: architecture, latency optimization, speaker identification, and graceful degradation.

Guidance

Architecture: (1) Audio capture via meeting platform API (Zoom SDK, Teams Graph API). (2) Edge processing for VAD and initial chunking (minimize round-trip latency). (3) Streaming transcription service (cloud-based, parallelized). (4) Speaker identification using enrolled voice profiles + real-time clustering. (5) LLM for summarization and action item extraction (runs on accumulated context).

Latency optimization: (1) Edge VAD reduces data sent to cloud. (2) Pre-warmed inference servers. (3) Streaming output (display words as they’re recognized). (4) Pipeline parallelization (transcribe chunk N while summarizing chunk N-1).

Speaker identification: (1) Pre-meeting: Enroll speakers with 30-second voice samples. (2) During meeting: Match segments to enrolled profiles. (3) Fallback: Cluster unknown voices, let users label post-meeting. (4) Video integration: Use active speaker indicators from meeting platform.

Graceful degradation: (1) High latency: Show “processing…” indicator, catch up. (2) Poor audio: Show confidence warnings, offer manual correction. (3) Unknown speakers: Generic labels (Speaker A, B), editable post-meeting. (4) LLM overload: Queue summarization, prioritize transcription.
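
The fallback of clustering unknown voices in real time can be sketched as a small online registry: each incoming segment embedding is assigned to the most similar known speaker, or starts a new generic speaker if none is close enough. This is a simplified illustration under assumed names and an assumed 0.7 threshold, using plain Python lists for embeddings.

```python
import math
from typing import List, Sequence


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class OnlineSpeakerRegistry:
    """Assign each new segment embedding to a known speaker or create one."""

    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold
        self.centroids: List[List[float]] = []  # Running mean embedding per speaker
        self.counts: List[int] = []

    def assign(self, embedding: Sequence[float]) -> str:
        scores = [_cosine(embedding, c) for c in self.centroids]
        if scores and max(scores) >= self.threshold:
            idx = scores.index(max(scores))
            n = self.counts[idx]
            # Update the running mean so the profile adapts over the meeting
            self.centroids[idx] = [
                (c * n + e) / (n + 1) for c, e in zip(self.centroids[idx], embedding)
            ]
            self.counts[idx] = n + 1
        else:
            idx = len(self.centroids)
            self.centroids.append(list(embedding))
            self.counts.append(1)
        return f"Speaker {chr(ord('A') + idx)}"  # Letters suffice for 2-10 speakers


registry = OnlineSpeakerRegistry()
stream = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.1, 0.9, 0.1]]
print([registry.assign(e) for e in stream])
```

Online assignment cannot revise earlier decisions, which is why the guidance pairs it with post-meeting relabeling. A batch diarization pass over the full recording can correct the labels afterwards.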

Further Reading

Foundational Papers

  • Radford et al. (2022). “Robust Speech Recognition via Large-Scale Weak Supervision” — The Whisper paper that transformed speech-to-text with weak supervision at scale
  • Bredin et al. (2020). “pyannote.audio: Neural Building Blocks for Speaker Diarization” — Modern neural speaker diarization approaches

Speech Recognition

  • Baevski et al. (2020). “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations” — Self-supervised speech representation learning foundation
  • Zhang et al. (2023). “Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages” — Large-scale multilingual ASR techniques
  • Pratap et al. (2020). “MLS: A Large-Scale Multilingual Dataset for Speech Research” — Multilingual speech datasets and challenges

Production Systems

  • Park et al. (2019). “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition” — Data augmentation techniques for ASR training
  • Chan et al. (2016). “Listen, Attend and Spell” — End-to-end speech recognition with attention mechanisms