Chapter 18: Audio & Speech Systems
speech, audio, Whisper, transcription, TTS, diarization, voice, ASR, speaker identification
Introduction
In late 2023, a legal tech startup faced a critical scalability problem. Their team was manually reviewing hundreds of hours of deposition recordings each week, searching for relevant testimony to support case arguments. Paralegals would listen to entire recordings, take notes, and create searchable transcripts—a process that took 3-4 hours per hour of audio.
They deployed an audio processing pipeline that automatically transcribed depositions, identified speakers, and created searchable embeddings of the content. Suddenly, attorneys could search: “Find all testimony about the contract negotiation in April 2022” and receive timestamped clips from across dozens of recordings. Review time dropped by 70%.
This transformation illustrates why audio processing matters. The world’s information isn’t just text and images—it’s podcasts, meetings, customer calls, and instructional recordings. AI systems that can only search text miss the rich information carried in speech.
Why Audio Processing Matters
Audio carries unique information: Tone of voice, emphasis, pauses—all convey meaning beyond words. Customer sentiment in a support call, speaker confidence in a meeting, or emotional state in a therapeutic session exist only in audio.
Speech enables text search over audio: Transcription with speaker diarization makes audio searchable, indexable, and analyzable with the same tools we use for text.
Real-time processing enables new applications: Live captions, meeting transcription, and voice assistants require streaming audio processing with low latency.
What You’ll Learn
This chapter focuses on audio and speech processing:
- Audio processing fundamentals: from waveforms to spectrograms
- Production-ready speech-to-text systems with Whisper
- Speaker diarization: who said what and when
- Real-time vs batch transcription tradeoffs
- Domain-specific vocabulary and accuracy optimization
- Production architectures for audio processing
For video understanding and multimodal retrieval (combining audio, video, and images), see Chapter 19.
Prerequisites
- Understanding of transformer architecture and attention (Chapter 5)
- Familiarity with embeddings and vector search (Chapter 7)
- Experience with RAG systems (Chapter 8)
How Machines “Hear”: Audio Processing Fundamentals
Speech-to-text might seem simpler than vision—just transcribe what was said. But the transformation from sound waves to text involves sophisticated processing that bridges the continuous analog world and discrete text.
From Waveforms to Features
Raw audio is a sequence of amplitude values sampled many times per second (16kHz or 44.1kHz typically). This representation is both too raw and too high-dimensional for direct processing.
The Short-Time Fourier Transform (STFT) analyzes the frequency content of small windows of audio. Instead of treating audio as a single signal, it breaks it into overlapping windows (typically 25ms each) and computes the frequency spectrum for each window. This captures how frequencies change over time—essential for understanding speech where sounds evolve rapidly.
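The windowing step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production feature extractor: it assumes 16kHz audio, a 25ms Hann window, and a 10ms hop between windows. The hop size and the `stft_magnitude` name are illustrative choices, not taken from the text.

```python
import numpy as np

def stft_magnitude(signal, sample_rate=16000, window_ms=25, hop_ms=10):
    """Return a (num_frames, num_bins) magnitude spectrogram."""
    win = int(sample_rate * window_ms / 1000)   # 400 samples at 16kHz
    hop = int(sample_rate * hop_ms / 1000)      # 160 samples at 16kHz
    if len(signal) < win:
        # Pad very short inputs so that at least one full window exists
        signal = np.pad(signal, (0, win - len(signal)))
    num_frames = 1 + (len(signal) - win) // hop
    window = np.hanning(win)
    # Slice overlapping windows and taper each one before the FFT
    frames = np.stack([
        signal[i * hop: i * hop + win] * window
        for i in range(num_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a 440Hz tone
t = np.arange(16000) / 16000
spec = stft_magnitude(np.sin(2 * np.pi * 440 * t))
print(spec.shape)                      # (98, 201)
peak_bin = int(spec.mean(axis=0).argmax())
print(peak_bin * 16000 / 400)          # 440.0 (each bin spans 40Hz)
```

Each row of `spec` is the frequency spectrum of one 25ms window, so reading down the rows shows how the frequency content changes over time.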
Why Mel Spectrograms?
The mel scale reflects human hearing: we’re more sensitive to differences between low frequencies than high frequencies. The difference between 100Hz and 200Hz sounds much more significant than the difference between 10,000Hz and 10,100Hz, even though both are 100Hz apart. This is because human hearing is approximately logarithmic in frequency perception.
Mel filterbanks compress frequency information in this perceptually-motivated way. Instead of representing all frequencies equally, they allocate more resolution to lower frequencies where human speech carries more information. Vowels, which contain most speech energy, appear in lower frequencies (200-2000Hz), while consonants extend higher.
The resulting representation—time on one axis, frequency on the other—resembles an image. Some early speech systems even used image classification architectures (CNNs) on spectrograms. Modern systems use transformers that process the spectrogram as a sequence, with each time step being an 80-dimensional vector representing the frequency content at that moment.
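To make the mel scale concrete, here is a small sketch that converts Hz to mel and builds a triangular filterbank. It assumes the common formula mel = 2595 * log10(1 + f / 700), which is one of several conventions in use, and the function names are illustrative. Real feature extractors also normalize the filters and handle edge cases that this sketch ignores.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=400, sample_rate=16000):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    bin_hz = np.linspace(0, sample_rate / 2, n_bins)
    # n_mels + 2 edge points, equally spaced on the mel scale
    edges = mel_to_hz(np.linspace(0, hz_to_mel(sample_rate / 2), n_mels + 2))
    bank = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        lo, center, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (bin_hz - lo) / (center - lo)
        falling = (hi - bin_hz) / (hi - center)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

# The same 100Hz gap is far larger on the mel scale at low frequencies
low_gap = float(hz_to_mel(200) - hz_to_mel(100))
high_gap = float(hz_to_mel(10100) - hz_to_mel(10000))
print(low_gap > high_gap)              # True

# Project one power spectrum (201 bins) onto 80 mel bands
bank = mel_filterbank()
power = np.ones(201)                   # placeholder spectrum for illustration
log_mel = np.log(bank @ power + 1e-10)
print(bank.shape, log_mel.shape)       # (80, 201) (80,)
```

The `log_mel` vector corresponds to the 80-dimensional frame that the model sees at each time step.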
Intuition: Reading the Spectrogram
When you look at a mel spectrogram:
- Horizontal axis: Time progression (left to right)
- Vertical axis: Frequency (low to high)
- Brightness: Energy at that frequency and time
Vowels appear as horizontal bands (sustained frequencies). Consonants appear as brief bursts or gaps. Silence is dark. A sentence like “Hello world” shows distinct patterns: the “H” as a noise burst, “e” as horizontal bands, “l” as a transition, and so on. The model learns to map these patterns to phonemes, then words, then text.
The Whisper Architecture
Whisper (Radford et al., 2022) demonstrated that simple architectures trained on massive data could achieve remarkable transcription quality. It’s an encoder-decoder transformer that processes mel spectrograms and generates text autoregressively.
Why Whisper Works So Well
Whisper’s success comes from scale and diversity, not architectural innovation:
- 680,000 hours of training data: Orders of magnitude more than previous systems (which typically had thousands or tens of thousands of hours)
- Multilingual: 99 languages, enabling cross-lingual transfer. The model learns general speech patterns that apply across languages
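- Weakly supervised: Transcripts were collected from the web alongside the audio rather than produced by professional annotators, which enabled massive scale. Machine-generated transcripts were filtered out so the model would not learn other ASR systems' errors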
- Multitask training: Trained jointly on transcription, translation, language detection, and timestamp prediction
The model learned robust representations that generalize across accents, noise conditions, and domains—because it saw all of these in training. A model trained on clean English recordings might fail on accented speech or background noise. Whisper, trained on internet-scale diverse data, handles these naturally.
Streaming vs. Batch Processing
Whisper processes 30-second chunks by design. This creates fundamental tradeoffs between latency and accuracy:
Batch Processing Characteristics:
- Latency: Seconds to minutes (upload time + processing time)
- Accuracy: Highest—full context available, can look ahead and behind
- Context: Full 30-second window for disambiguation
- Use case: Recordings, podcasts, archived content, where speed isn’t critical
Streaming Processing Characteristics:
- Latency: Milliseconds to seconds (near real-time)
- Accuracy: Lower—limited context, no look-ahead, errors at chunk boundaries
- Context: Only past audio available, current window typically 2-5 seconds
- Use case: Live captions, voice assistants, real-time translation
The tradeoff exists because language has long-range dependencies. The phrase “I scream” vs. “ice cream” requires context to disambiguate. With full context, the model can look at surrounding words. In streaming mode, it must make immediate decisions.
Modern variants address streaming needs:
- Streaming Whisper: Process audio in smaller overlapping windows (5-10 seconds with 1-2 second overlap), merge results
- faster-whisper: Optimized inference with CTranslate2 and quantization for lower latency
- whisper.cpp: On-device implementation for edge deployment with optimized C/C++ code
Each trades accuracy for latency. Production systems often use hybrid approaches: streaming for immediate display, batch correction for final accuracy.
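The overlapping-window idea behind streaming variants can be sketched as a simple scheduler. This is an illustration under assumed defaults (10-second windows with 2 seconds of overlap, taken from the ranges above); `sliding_windows` is a hypothetical helper, not part of any Whisper library.

```python
from typing import Iterator, Tuple

def sliding_windows(
    total_s: float,
    window_s: float = 10.0,
    overlap_s: float = 2.0,
) -> Iterator[Tuple[float, float]]:
    """Yield (start, end) times of overlapping windows covering the audio."""
    if not 0 <= overlap_s < window_s:
        raise ValueError("overlap_s must be non-negative and smaller than window_s")
    step = window_s - overlap_s
    start = 0.0
    while start < total_s:
        end = min(start + window_s, total_s)
        yield (start, end)
        if end >= total_s:
            break  # the last window already reaches the end of the audio
        start += step

windows = list(sliding_windows(25.0))
print(windows)  # [(0.0, 10.0), (8.0, 18.0), (16.0, 25.0)]
```

Each window is transcribed independently, and the 2-second regions that appear in two windows are where results must be reconciled when merging.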
Audio Processing Systems
Speech-to-Text: From Whisper to Production
Whisper changed speech transcription from a specialized capability requiring expensive infrastructure to a commodity API. But production systems need more than basic transcription.
Production-Ready Transcription Service
import logging
import os
from typing import Optional, Dict, List

import openai

logger = logging.getLogger(__name__)


class TranscriptionService:
    """Production-ready transcription service with error handling and optimization."""

    def __init__(self):
        self.client = openai.OpenAI()
        self.max_file_size = 25 * 1024 * 1024  # 25MB API limit

    def transcribe(
        self,
        audio_path: str,
        language: Optional[str] = None,
        prompt: Optional[str] = None,
        temperature: float = 0.0
    ) -> Dict:
        """
        Transcribe audio with optional language hint and vocabulary prompting.

        Args:
            audio_path: Path to audio file
            language: ISO-639-1 language code (e.g., 'en', 'es'). If None, auto-detect.
            prompt: Vocabulary hints for domain-specific content
            temperature: Sampling temperature (0 = deterministic, higher = more varied)

        Returns:
            Dictionary containing transcription text, language, duration, and timestamps
        """
        # Fail fast on files the API would reject anyway
        if os.path.getsize(audio_path) > self.max_file_size:
            raise ValueError(f"{audio_path} exceeds the 25MB API limit; chunk it first")

        # Only send optional parameters that were actually provided
        optional_params = {}
        if language is not None:
            optional_params['language'] = language
        if prompt is not None:
            optional_params['prompt'] = prompt

        try:
            with open(audio_path, 'rb') as f:
                response = self.client.audio.transcriptions.create(
                    model="whisper-1",
                    file=f,
                    temperature=temperature,
                    response_format="verbose_json",
                    timestamp_granularities=["word", "segment"],
                    **optional_params
                )
            # Convert SDK response objects to plain dicts so that callers can
            # use word['start'] / word['end'] uniformly
            return {
                'text': response.text,
                'language': response.language,
                'duration': float(response.duration),
                'segments': [s.model_dump() for s in (response.segments or [])],
                'words': [w.model_dump() for w in (response.words or [])]
            }
        except openai.BadRequestError as e:
            logger.error(f"Invalid request to Whisper API: {e}")
            raise
        except openai.RateLimitError as e:
            logger.warning(f"Rate limited, should implement retry: {e}")
            raise
        except Exception as e:
            logger.error(f"Transcription failed: {e}")
            raise

    def estimate_cost(self, audio_duration_seconds: float) -> float:
        """Estimate transcription cost. OpenAI charges per minute."""
        minutes = audio_duration_seconds / 60
        cost_per_minute = 0.006  # $0.006 per minute as of 2026
        return minutes * cost_per_minute

What people do: Feed raw audio files directly to transcription APIs without any quality assessment or preprocessing, then wonder why accuracy is poor.
Why it fails: Background noise, low bitrates, clipping, and inconsistent volume levels significantly degrade transcription accuracy. Whisper is robust but not magic—garbage in, garbage out.
Fix: Before transcription, assess audio quality (SNR, clipping detection). For problematic files, normalize volume levels, apply noise reduction (spectral gating), and convert to 16kHz mono, which is the format Whisper resamples to internally. Even simple preprocessing can improve WER by 5-15% on noisy recordings.
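A rough sketch of the quality checks mentioned above, using only NumPy. It assumes the audio is already decoded into a float array in the range [-1.0, 1.0]; the function names and the 0.99 clipping level are illustrative assumptions. SNR estimation and spectral gating need more machinery and are not shown.

```python
import numpy as np

def assess_quality(samples: np.ndarray, clip_level: float = 0.99) -> dict:
    """Basic level and clipping statistics for float audio in [-1.0, 1.0]."""
    peak = float(np.max(np.abs(samples)))
    rms = float(np.sqrt(np.mean(samples ** 2)))
    return {
        'peak': peak,
        # Level relative to full scale; very low values suggest a quiet recording
        'rms_dbfs': 20 * float(np.log10(rms)) if rms > 0 else float('-inf'),
        # Share of samples stuck at the ceiling; clipping distorts speech
        'clipped_fraction': float(np.mean(np.abs(samples) >= clip_level)),
    }

def peak_normalize(samples: np.ndarray, target_peak: float = 0.9) -> np.ndarray:
    """Scale audio so that its loudest sample reaches target_peak."""
    peak = np.max(np.abs(samples))
    if peak == 0:
        return samples  # silence: nothing to scale
    return samples * (target_peak / peak)

t = np.arange(16000) / 16000
quiet = 0.1 * np.sin(2 * np.pi * 220 * t)                 # under-recorded tone
loud = np.clip(3.0 * np.sin(2 * np.pi * 220 * t), -1, 1)  # heavily clipped tone

quiet_report = assess_quality(quiet)
loud_report = assess_quality(loud)
normalized = peak_normalize(quiet)
print(quiet_report['clipped_fraction'], loud_report['clipped_fraction'] > 0.5)
```

A file with a high `clipped_fraction` cannot be repaired by normalization, so it is usually worth flagging for re-recording or manual review rather than preprocessing.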
Vocabulary Prompting
Whisper accepts prompts that hint at expected vocabulary. This dramatically improves accuracy for domain-specific content, proper nouns, and technical terminology.
# Domain-specific vocabulary prompts
DOMAIN_PROMPTS = {
    'medical': """Medical transcription. Expected terms: myocardial infarction,
    echocardiogram, hypertension, tachycardia, bradycardia, arrhythmia,
    CBC, BMP, comprehensive metabolic panel, CT scan, MRI,
    metformin, lisinopril, atenolol, warfarin, electrocardiogram.""",
    'legal': """Legal dictation. Expected terms: plaintiff, defendant,
    deposition, affidavit, subpoena, motion to dismiss, summary judgment,
    habeas corpus, voir dire, pro bono, litigation, arbitration,
    interrogatories, discovery, stipulation.""",
    'technical': """Technical discussion. Expected terms: Kubernetes,
    microservices, API gateway, load balancer, PostgreSQL, Redis,
    containerization, CI/CD pipeline, infrastructure as code,
    GraphQL, WebSocket, OAuth, JWT, horizontal scaling.""",
    'financial': """Financial analysis. Expected terms: EBITDA,
    price-to-earnings ratio, compound annual growth rate, CAGR,
    balance sheet, income statement, cash flow, liquidity,
    market capitalization, dividend yield, earnings per share."""
}


class DomainAwareTranscriber(TranscriptionService):
    """Transcription service with domain-specific vocabulary support."""

    def transcribe_domain(
        self,
        audio_path: str,
        domain: str,
        additional_terms: Optional[List[str]] = None
    ) -> Dict:
        """
        Transcribe with domain-specific vocabulary hints.

        Args:
            audio_path: Path to audio file
            domain: One of 'medical', 'legal', 'technical', 'financial'
            additional_terms: Extra vocabulary terms specific to this transcription
        """
        prompt = DOMAIN_PROMPTS.get(domain, "")
        # Add custom terms if provided
        if additional_terms:
            custom_terms = ", ".join(additional_terms)
            prompt += f" Additional terms: {custom_terms}."
        return self.transcribe(audio_path, prompt=prompt)


# Usage example
transcriber = DomainAwareTranscriber()

# Medical transcription with custom patient/procedure terms
result = transcriber.transcribe_domain(
    audio_path="consultation.mp3",
    domain="medical",
    additional_terms=["Dr. Ramirez", "Xenical", "gastric bypass"]
)

What people do: Use default Whisper settings for medical, legal, or technical transcription, then manually correct hundreds of misheard domain terms.
Why it fails: Without vocabulary hints, Whisper defaults to common words. “EBITDA” becomes “a bit of”, “myocardial infarction” becomes “my cardio infection”, and “kubectl” becomes “cube cuttle.”
Fix: Always provide a vocabulary prompt for specialized content. Include expected proper nouns, acronyms, and technical terms. The prompt doesn’t force these words—it just makes the model more confident when it hears ambiguous sounds that could match domain vocabulary.
Why Prompting Works
Whisper’s decoder is conditioned on the prompt, biasing it toward expected vocabulary. When it hears an ambiguous phrase, it’s more likely to select terms from the prompt. For example:
- Without prompt: “We need to increase the cagr” → “We need to increase the cager” (incorrect)
- With financial prompt: “We need to increase the cagr” → “We need to increase the CAGR” (correct)
The prompt doesn’t force these terms—if they don’t fit the audio, they won’t appear. It just makes the model more confident about domain-specific vocabulary.
What people do: Split long audio into fixed-length chunks (e.g., 10-minute segments) at exact boundaries, then concatenate the transcripts.
Why it fails: Words get cut in half at chunk boundaries—“develop” might become “devel” at the end of one chunk and “op” at the start of the next. Sentences lose context. Speaker attribution breaks.
Fix: Use overlapping chunks (e.g., 10-minute chunks with 5-10 second overlap). During merging, detect duplicates in the overlap region by comparing timestamps, and discard the duplicate words. Keep words from the earlier chunk (they have more left context) unless they’re clearly cut off.
Handling Long Audio
Whisper’s API limits files to 25MB. For longer content (podcasts, lectures, meetings), you need chunking with careful merging:
import os
import tempfile
from typing import Dict, List, Optional

from pydub import AudioSegment


class LongAudioProcessor:
    """Process audio longer than API limits with intelligent chunking."""

    def __init__(
        self,
        chunk_duration_ms: int = 10 * 60 * 1000,  # 10 minutes
        overlap_ms: int = 5000  # 5 seconds overlap
    ):
        # A step of zero or less would make chunking loop forever
        if overlap_ms >= chunk_duration_ms:
            raise ValueError("overlap_ms must be smaller than chunk_duration_ms")
        self.chunk_duration = chunk_duration_ms
        self.overlap = overlap_ms
        self.transcriber = TranscriptionService()

    def transcribe_long(
        self,
        audio_path: str,
        prompt: Optional[str] = None
    ) -> Dict:
        """
        Transcribe long audio by chunking and merging.

        The overlap ensures we don't cut words in half at chunk boundaries.
        We detect duplicates in the overlap region and remove them.
        """
        audio = AudioSegment.from_file(audio_path)
        chunks = self._create_chunks(audio)
        logger.info(f"Processing {len(chunks)} chunks from {audio_path}")

        transcriptions = []
        for i, chunk_audio in enumerate(chunks):
            # Export chunk to a unique temporary file (safe under concurrent runs)
            fd, chunk_path = tempfile.mkstemp(suffix=".mp3")
            os.close(fd)
            try:
                chunk_audio.export(chunk_path, format="mp3")
                result = self.transcriber.transcribe(chunk_path, prompt=prompt)
                transcriptions.append({
                    'chunk_index': i,
                    'start_ms': i * (self.chunk_duration - self.overlap),
                    'result': result
                })
            finally:
                # Clean up temporary file
                if os.path.exists(chunk_path):
                    os.remove(chunk_path)
        return self._merge_transcriptions(transcriptions)

    def _create_chunks(self, audio: AudioSegment) -> List[AudioSegment]:
        """Split audio into overlapping chunks."""
        chunks = []
        start = 0
        while start < len(audio):
            end = min(start + self.chunk_duration, len(audio))
            chunks.append(audio[start:end])
            if end == len(audio):
                break  # avoid a trailing chunk that contains only overlap audio
            # Move start forward by chunk_duration minus overlap
            start += self.chunk_duration - self.overlap
        return chunks

    def _merge_transcriptions(self, transcriptions: List[Dict]) -> Dict:
        """
        Merge overlapping transcriptions, avoiding duplicates.

        In overlap regions, this simple version always keeps words from the
        earlier chunk (they have more left context). It does not try to
        detect words that were cut off at the chunk edge.
        """
        merged_words = []
        for i, trans in enumerate(transcriptions):
            chunk_start_ms = trans['start_ms']
            if i == 0:
                # First chunk: include everything
                for word in trans['result'].get('words', []):
                    merged_words.append(self._adjust_timestamp(word, chunk_start_ms))
            else:
                # Subsequent chunks: skip overlap region
                overlap_threshold = trans['start_ms'] + self.overlap
                for word in trans['result'].get('words', []):
                    absolute_start = chunk_start_ms + word['start'] * 1000
                    # Only include words after the overlap region
                    if absolute_start >= overlap_threshold:
                        merged_words.append(self._adjust_timestamp(word, chunk_start_ms))

        full_text = ' '.join(w['word'].strip() for w in merged_words)
        return {
            'text': full_text,
            'words': merged_words,
            'chunk_count': len(transcriptions),
            'duration': transcriptions[-1]['start_ms'] / 1000 +
                        transcriptions[-1]['result']['duration']
        }

    def _adjust_timestamp(self, word: Dict, offset_ms: int) -> Dict:
        """Adjust word timestamps to absolute time."""
        return {
            'word': word['word'],
            'start': (offset_ms + word['start'] * 1000) / 1000,
            'end': (offset_ms + word['end'] * 1000) / 1000
        }

Speaker Diarization: Who Said What
For multi-speaker content (meetings, interviews, podcasts), transcription alone isn’t enough. You need to know who spoke when.
The Diarization Challenge
Speaker diarization segments audio by speaker without knowing identities in advance. It answers “when did different people speak?” not “who is speaking?” The output is speaker labels (SPEAKER_00, SPEAKER_01, etc.) assigned to time segments.
Audio Timeline:
[=========================================================]
0:00 10:00
Raw Transcription:
"Hello everyone welcome to the meeting today we're going to discuss..."
Diarized Output:
SPEAKER_00 (0:00-0:03): "Hello everyone, welcome to the meeting."
SPEAKER_01 (0:04-0:15): "Today we're going to discuss the quarterly results."
SPEAKER_00 (0:16-0:22): "Great, let's start with revenue."
SPEAKER_02 (0:23-0:45): "Revenue was up 15% compared to last quarter..."
Integration with Transcription
Modern diarization systems use voice embeddings to cluster speakers. The pyannote.audio library provides state-of-the-art models:
from pyannote.audio import Pipeline
import torch
from typing import List, Dict, Optional


class DiarizedTranscription:
    """Combine transcription with speaker diarization."""

    def __init__(self, hf_auth_token: str):
        """
        Initialize transcription and diarization pipelines.

        Args:
            hf_auth_token: HuggingFace token for accessing pyannote models
        """
        self.transcriber = TranscriptionService()
        # Load diarization pipeline
        self.diarizer = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.1",
            use_auth_token=hf_auth_token
        )
        # Use GPU if available
        if torch.cuda.is_available():
            self.diarizer.to(torch.device("cuda"))

    def process(self, audio_path: str, num_speakers: Optional[int] = None) -> Dict:
        """
        Get transcription with speaker labels.

        Args:
            audio_path: Path to audio file
            num_speakers: Expected number of speakers (if known, improves accuracy)

        Returns:
            Dictionary with speaker-labeled turns and full transcript
        """
        # Get word-level transcription from Whisper
        transcription = self.transcriber.transcribe(audio_path)
        words = transcription.get('words', [])

        # Get speaker segments from diarization
        if num_speakers:
            diarization = self.diarizer(
                audio_path,
                num_speakers=num_speakers
            )
        else:
            diarization = self.diarizer(audio_path)

        # Assign speakers to words based on timestamp overlap
        labeled_words = []
        for word in words:
            word_time = word['start']
            speaker = self._get_speaker_at_time(diarization, word_time)
            labeled_words.append({
                **word,
                'speaker': speaker
            })

        # Group words into speaker turns
        turns = self._group_into_turns(labeled_words)
        return {
            'turns': turns,
            'speakers': list(set(turn['speaker'] for turn in turns)),
            'full_transcript': transcription['text'],
            'duration': transcription['duration']
        }

    def _get_speaker_at_time(self, diarization, time: float) -> str:
        """Find which speaker was talking at a given time."""
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            if turn.start <= time <= turn.end:
                return speaker
        return "UNKNOWN"

    def _group_into_turns(self, words: List[Dict]) -> List[Dict]:
        """Group consecutive words by the same speaker into turns."""
        turns = []
        current_speaker = None
        current_words = []
        for word in words:
            if word['speaker'] != current_speaker:
                # Speaker changed - save previous turn
                if current_words:
                    turns.append({
                        'speaker': current_speaker,
                        'text': ' '.join(w['word'].strip() for w in current_words),
                        'start': current_words[0]['start'],
                        'end': current_words[-1]['end'],
                        'word_count': len(current_words)
                    })
                current_speaker = word['speaker']
                current_words = []
            current_words.append(word)

        # Don't forget the last turn
        if current_words:
            turns.append({
                'speaker': current_speaker,
                'text': ' '.join(w['word'].strip() for w in current_words),
                'start': current_words[0]['start'],
                'end': current_words[-1]['end'],
                'word_count': len(current_words)
            })
        return turns

Diarization Quality Factors
Speaker diarization accuracy depends on several factors:
Audio quality: Clear audio with distinct voices works best. Background noise, crosstalk, and poor recording quality hurt performance.
Speaker overlap: When multiple people speak simultaneously, diarization struggles. The model must choose one speaker or mark the segment as overlapping.
Number of speakers: Performance degrades with many speakers. 2-4 speakers is ideal; 10+ speakers is challenging.
Speaker similarity: Similar voices (same gender, accent, pitch) are harder to distinguish than different voices.
Short segments: Very brief utterances (< 1 second) may be misattributed.
Production systems should validate diarization quality and provide confidence scores. For critical applications, consider human review of speaker assignments.
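One simple way to attach a confidence value to each speaker assignment is to measure how much of a word's duration falls inside the chosen speaker's turns. The sketch below is a heuristic under that assumption, not a score produced by any diarization library. `assign_speaker` and the `(start, end, speaker)` turn format are hypothetical.

```python
from typing import Dict, List, Tuple

def assign_speaker(
    word_start: float,
    word_end: float,
    turns: List[Tuple[float, float, str]],
) -> Tuple[str, float]:
    """Return (speaker, confidence) for one word.

    Confidence is the fraction of the word's duration covered by the
    chosen speaker's turns; low values flag words near speaker changes.
    """
    duration = word_end - word_start
    if duration <= 0:
        return ("UNKNOWN", 0.0)
    overlap_by_speaker: Dict[str, float] = {}
    for turn_start, turn_end, speaker in turns:
        overlap = min(word_end, turn_end) - max(word_start, turn_start)
        if overlap > 0:
            overlap_by_speaker[speaker] = overlap_by_speaker.get(speaker, 0.0) + overlap
    if not overlap_by_speaker:
        return ("UNKNOWN", 0.0)
    best = max(overlap_by_speaker, key=overlap_by_speaker.get)
    return (best, min(overlap_by_speaker[best] / duration, 1.0))

turns = [(0.0, 3.0, "SPEAKER_00"), (3.0, 8.0, "SPEAKER_01")]
inside = assign_speaker(1.0, 1.5, turns)        # fully inside one turn
boundary = assign_speaker(2.75, 3.75, turns)    # straddles a speaker change
print(inside, boundary)  # ('SPEAKER_00', 1.0) ('SPEAKER_01', 0.75)
```

Words whose confidence falls below a chosen threshold are natural candidates for the human review mentioned above.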
Complete code: See Speaker Diarization and Audio Transcription for production-ready implementations with error handling and confidence scoring.
Streaming vs. Batch Latency Tradeoffs
Real-time applications (live captions, voice assistants, real-time translation) can’t wait for complete audio files. The fundamental tradeoff is latency vs. accuracy.
Batch Processing:
- Latency: 1-10 seconds (upload + processing + return)
- Accuracy: 1-5% WER on clear speech (95-99% word accuracy)
- Context: Full utterance available for disambiguation
- Cost: $0.006 per minute
- Use case: Recorded content, post-meeting transcripts, podcast transcription
Streaming Processing:
- Latency: 200-500ms (real-time or near-real-time)
- Accuracy: 4-8% WER (higher error rate due to limited context)
- Context: Only past audio available
- Cost: Higher (maintained connection, more infrastructure)
- Use case: Live captions, voice assistants, real-time translation
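The WER figures above count word-level substitutions, deletions, and insertions relative to the number of words in a reference transcript. A minimal sketch of that calculation follows. It normalizes only by lowercasing and whitespace splitting, whereas real evaluations usually also normalize punctuation and numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level edit distance, keeping only the previous row in memory
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate(
    "we need to increase the cagr",
    "we need to increase the cager",
)
print(round(wer, 3))  # 0.167 (one substitution in six words)
```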
Hybrid Approach: Streaming with Correction
Many production systems use a two-phase approach: fast streaming for immediate display, batch correction for accuracy.
import asyncio
import os
import tempfile
import wave
from typing import AsyncIterator, Dict, List


class HybridStreamingTranscription:
    """Streaming transcription with delayed batch correction."""

    def __init__(self):
        # StreamingWhisperClient is a placeholder for a low-latency streaming
        # client; it is not defined in this chapter.
        self.streaming_model = StreamingWhisperClient()  # Fast, lower accuracy
        self.batch_model = TranscriptionService()  # Slower, higher accuracy
        self.buffer = []
        self.chunk_duration_ms = 5000  # 5 seconds

    async def process_stream(
        self,
        audio_stream: AsyncIterator[bytes]
    ) -> AsyncIterator[Dict]:
        """
        Process audio stream with both immediate and corrected outputs.

        Yields:
            - Immediate streaming results (type='streaming')
            - Corrected batch results every 30 seconds (type='corrected')
        """
        async for chunk in audio_stream:
            # Immediate streaming result for live display
            streaming_result = await self.streaming_model.transcribe_chunk(chunk)
            yield {
                'type': 'streaming',
                'text': streaming_result['text'],
                'is_final': False,
                'latency_ms': streaming_result['latency']
            }
            self.buffer.append(chunk)

            # Every 30 seconds, send corrected batch result
            buffer_duration = len(self.buffer) * self.chunk_duration_ms
            if buffer_duration >= 30000:
                combined_path = self._combine_chunks(self.buffer)
                try:
                    # TranscriptionService.transcribe is synchronous and takes a
                    # file path, so run it in a worker thread
                    corrected = await asyncio.to_thread(
                        self.batch_model.transcribe, combined_path
                    )
                finally:
                    os.remove(combined_path)
                yield {
                    'type': 'corrected',
                    'text': corrected['text'],
                    'is_final': True,
                    'replaces_streaming': True,
                    'duration': corrected['duration']
                }
                self.buffer = []

    def _combine_chunks(self, chunks: List[bytes]) -> str:
        """Write buffered chunks to a temporary WAV file and return its path.

        Assumes raw 16-bit mono PCM at 16kHz, which can be concatenated
        directly. Compressed formats must be decoded, combined, and re-encoded.
        """
        fd, path = tempfile.mkstemp(suffix=".wav")
        os.close(fd)
        with wave.open(path, "wb") as wav_file:
            wav_file.setnchannels(1)
            wav_file.setsampwidth(2)
            wav_file.setframerate(16000)
            wav_file.writeframes(b"".join(chunks))
        return path

This hybrid approach provides the best of both worlds:
- Users see immediate feedback (streaming)
- Final transcript has high accuracy (batch correction)
- System can update the display when corrections arrive
- Total latency is the same as pure streaming for initial display
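On the display side, the two event types need different handling: streaming text is provisional, and a corrected result replaces everything provisional that came before it. A minimal sketch of that bookkeeping is below, assuming events shaped like the dictionaries yielded by `process_stream`. `LiveTranscriptView` is a hypothetical class for illustration.

```python
from typing import Dict, List

class LiveTranscriptView:
    """Keep finalized text separate from provisional streaming text."""

    def __init__(self) -> None:
        self.final_segments: List[str] = []
        self.provisional: List[str] = []

    def apply(self, event: Dict) -> None:
        if event['type'] == 'corrected' and event.get('replaces_streaming'):
            # The corrected text supersedes all provisional text so far
            self.final_segments.append(event['text'])
            self.provisional.clear()
        elif event['type'] == 'streaming':
            self.provisional.append(event['text'])

    def render(self) -> str:
        return ' '.join(self.final_segments + self.provisional)

view = LiveTranscriptView()
view.apply({'type': 'streaming', 'text': 'we need to increase'})
view.apply({'type': 'streaming', 'text': 'the cager'})
before = view.render()
view.apply({
    'type': 'corrected',
    'text': 'We need to increase the CAGR.',
    'replaces_streaming': True,
})
after = view.render()
print(before)  # we need to increase the cager
print(after)   # We need to increase the CAGR.
```

A real interface would likely style provisional text differently (for example, in a lighter color) so users can tell which words may still change.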
Case Study: Meeting Transcription Platform
A productivity company built a meeting transcription service for enterprise customers. Requirements:
- Support 2-10 speakers in meetings
- Real-time display during meeting
- Post-meeting corrections and summary
- Action item extraction
- Integration with calendar and task systems
Architecture
Key Design Decisions
1. Streaming + Correction for UX
Users see real-time text during the meeting (engagement, verification) with periodic corrections for accuracy. Better than waiting until end or showing only final transcript.
2. Speaker Identification via Voice Embeddings
Rather than just labeling speakers as SPEAKER_00, SPEAKER_01, they built a system to recognize recurring speakers across meetings:
from typing import Dict, List

import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


class SpeakerIdentification:
    """Identify recurring speakers across meetings using voice embeddings."""

    def __init__(self, diarizer, embedding_model):
        # diarizer: a pyannote speaker-diarization pipeline, loaded as in
        #   DiarizedTranscription above
        # embedding_model: placeholder for any model exposing
        #   extract(audio_path) -> embedding, in the same embedding space
        #   as the diarizer's speaker embeddings
        self.diarizer = diarizer
        self.embedding_model = embedding_model
        self.speaker_database = {}  # {user_id: embedding}

    def identify_speakers(
        self,
        audio_path: str,
        participant_ids: List[str]
    ) -> Dict:
        """
        Match diarized speakers to known participants.

        Args:
            audio_path: Path to meeting audio
            participant_ids: List of participant user IDs

        Returns:
            Mapping of speaker labels to user IDs
        """
        # With return_embeddings=True, the pyannote 3.1 pipeline returns one
        # representative embedding per speaker, ordered like diarization.labels()
        diarization, embeddings = self.diarizer(audio_path, return_embeddings=True)
        speaker_embeddings = dict(zip(diarization.labels(), embeddings))

        # Match speakers to participants
        matches = {}
        for speaker_label, embedding in speaker_embeddings.items():
            best_match = None
            best_similarity = 0.7  # minimum similarity to accept a match
            for user_id in participant_ids:
                if user_id in self.speaker_database:
                    similarity = cosine_similarity(
                        embedding,
                        self.speaker_database[user_id]
                    )
                    if similarity > best_similarity:
                        best_similarity = similarity
                        best_match = user_id
            matches[speaker_label] = best_match or "Unknown"
        return matches

    def register_speaker(self, user_id: str, audio_sample: str):
        """Register a user's voice embedding from a sample."""
        embedding = self.embedding_model.extract(audio_sample)
        self.speaker_database[user_id] = embedding

3. LLM Post-Processing
Raw transcription is processed by an LLM to add punctuation, fix obvious errors, and extract structured information:
MEETING_SUMMARY_PROMPT = """
Analyze this meeting transcript and provide:
## SUMMARY
2-3 paragraphs covering:
- Main topics discussed
- Key outcomes and decisions
- Overall tone and dynamics
## DECISIONS MADE
List any decisions that were explicitly agreed upon.
Format: "- [Decision] - Agreed by [participants if mentioned]"
## ACTION ITEMS
List action items with owner and deadline if mentioned.
Format: "- [ ] [Owner] Task description (Deadline: [date] or 'Not specified')"
## OPEN QUESTIONS
Topics raised but not resolved.
## KEY QUOTES
Notable statements worth highlighting (max 3).
Transcript:
{transcript}
"""


async def process_completed_meeting(transcript: str, metadata: Dict) -> Dict:
    """Extract structured information from meeting transcript."""
    prompt = MEETING_SUMMARY_PROMPT.format(transcript=transcript)
    response = await llm.generate(
        prompt,
        temperature=0.3,  # Lower temperature for factual extraction
        max_tokens=1500
    )
    return {
        'summary': response,
        'metadata': metadata,
        'transcript': transcript
    }

Results
After 6 months of deployment:
- 94% user satisfaction (vs. 67% with manual note-taking)
- Average 15 minutes saved per meeting in follow-up clarifications
- 3x increase in action item completion (clear assignment + automatic tracking)
- 40% reduction in “what was decided?” questions post-meeting
The combination of accurate transcription, speaker identification, and LLM-powered structure extraction created value that exceeded simple transcription.
Summary
This chapter covered the fundamentals of audio and speech processing for AI systems:
Key Takeaways:
- Audio representation matters: Mel spectrograms balance computational efficiency with acoustic information, making them ideal input for neural networks processing speech
- Whisper dominates production STT: OpenAI’s Whisper models provide strong multilingual transcription out of the box, with turbo variants offering production-ready latency
- Diarization is hard but essential: Knowing “who said what” requires speaker embeddings, clustering, and often domain-specific tuning; pyannote.audio provides strong baselines
- Real-time requires different architecture: Streaming transcription uses VAD chunking, incremental decoding, and careful buffer management, which makes it fundamentally different from batch processing
- Domain adaptation improves accuracy: Custom vocabularies, fine-tuning on domain audio, and post-processing rules can dramatically improve accuracy for specialized domains
- Production systems need resilience: Audio processing pipelines must handle variable quality, background noise, overlapping speech, and hardware failures gracefully
Connections to Other Chapters
- Chapter 5 (Transformer Architecture): Whisper is an encoder-decoder transformer; understanding attention helps debug transcription errors
- Chapter 8 (RAG Systems): Audio embeddings enable semantic search over speech content, extending RAG to audio modalities
- Chapter 17 (Vision & Document AI): Similar preprocessing patterns (mel spectrograms vs image preprocessing) and model architectures
- Chapter 19 (Video & Multimodal RAG): Combines audio processing with video understanding for complete multimodal systems
Practical Exercises
1. Whisper Evaluation: Transcribe the same 5-minute audio clip using Whisper tiny, base, small, medium, and large-v3. Compare Word Error Rate (WER), processing time, and subjective quality. At what quality level do you stop noticing improvements?
2. Streaming Prototype: Build a streaming transcription system using Whisper with VAD chunking. Measure the latency from end-of-utterance to transcript display. Experiment with chunk sizes and overlap to find the best tradeoff.
3. Domain Vocabulary: Take a technical podcast episode (with jargon). Transcribe it and measure accuracy on domain-specific terms. Then add a custom vocabulary/prompt with key terms. Measure improvement.
4. Diarization Pipeline: Use pyannote.audio to diarize a multi-speaker meeting recording. Evaluate speaker assignment accuracy. Identify failure modes (overlapping speech, similar voices).
5. Audio Quality Impact: Artificially degrade a clean audio file (add noise, reduce bitrate, add reverb). Measure how each degradation affects transcription accuracy. Create guidelines for minimum audio quality.
Self-Assessment Checkpoint
Conceptual Questions
Q1. [IC2] Why do modern speech recognition systems use mel spectrograms rather than raw waveforms as input? What information is preserved and what is lost?
Answer
Mel spectrograms transform time-domain audio into a time-frequency representation that approximates human auditory perception. Benefits: (1) Reduces dimensionality: 16kHz audio is 16,000 samples/second, while a spectrogram is ~100 frames/second with 80-128 mel bins per frame. (2) Emphasizes speech-relevant frequencies using the mel scale, which is roughly linear at low frequencies and logarithmic at higher ones, matching human hearing. (3) Removes phase information that’s mostly irrelevant for speech recognition. (4) Creates image-like 2D representations that neural networks process efficiently. What’s lost: (1) Fine temporal detail within each analysis window (typically 25ms). (2) Phase information needed for audio reconstruction. (3) Very high or low frequencies outside the mel filter bank range. This tradeoff works well because speech content lies primarily in the preserved frequency ranges.
Q2. [IC2] Explain the tradeoffs between batch transcription and streaming transcription. When would you choose each?
Answer
Batch transcription: Process complete audio files in one pass. Benefits: (1) Higher accuracy—model sees full context for word disambiguation. (2) Can use larger, slower models. (3) Better speaker diarization with full conversation. (4) No latency requirements. Drawbacks: (1) Can’t provide real-time feedback. (2) Must wait for recording to complete. Use cases: Podcast transcription, legal depositions, archival audio.
Streaming transcription: Process audio in chunks as it arrives. Benefits: (1) Real-time output—users see transcription as they speak. (2) Enables live applications (captions, assistants). (3) Lower memory requirements. Drawbacks: (1) Lower accuracy due to limited context. (2) Complex buffer management. (3) Must handle partial utterances and corrections. Use cases: Live captions, voice assistants, meeting transcription, real-time translation.
Hybrid approaches use streaming with periodic correction passes, combining low latency with higher eventual accuracy.
Q3. [Senior] What is speaker diarization and why is it challenging? What approaches do modern systems use?
Answer
Speaker diarization answers “who spoke when?” by segmenting audio and assigning segments to speakers. Challenges: (1) Unknown number of speakers a priori. (2) Overlapping speech (multiple speakers at once). (3) Similar voice characteristics between speakers. (4) Speaker changes mid-sentence. (5) Non-speech sounds (laughter, coughs) assigned to speakers.
Modern approaches: (1) Embedding-based: Extract speaker embeddings (d-vectors, x-vectors) for segments, cluster similar embeddings. (2) End-to-end neural: Models like pyannote.audio jointly predict speech activity and speaker assignment. (3) Pipeline: VAD → segmentation → embedding → clustering → resegmentation. (4) Online diarization: Maintain speaker registry, assign new segments in real-time.
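The embedding-based and online approaches can be illustrated with a small sketch: each segment's speaker embedding is compared against a registry of running centroids, and either assigned to the closest known speaker or registered as a new one. Everything here is an assumption for illustration: the `SpeakerRegistry` class, the 0.7 cosine threshold, and the toy 2-dimensional vectors stand in for real x-vector or d-vector embeddings produced by a separate model. Production systems typically add resegmentation and overlap handling on top.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SpeakerRegistry:
    """Greedy online clustering of speaker embeddings (hypothetical helper)."""

    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold               # assumed value; tune on real data
        self.centroids: List[List[float]] = []   # one running mean per speaker
        self.counts: List[int] = []              # segments seen per speaker

    def assign(self, embedding: Sequence[float]) -> int:
        """Return the speaker id for this segment, registering a new speaker if needed."""
        best_id, best_sim = -1, -1.0
        for speaker_id, centroid in enumerate(self.centroids):
            sim = cosine_similarity(embedding, centroid)
            if sim > best_sim:
                best_id, best_sim = speaker_id, sim

        if best_id >= 0 and best_sim >= self.threshold:
            # Close enough to a known speaker: update that speaker's running mean.
            n = self.counts[best_id]
            self.centroids[best_id] = [
                (c * n + e) / (n + 1)
                for c, e in zip(self.centroids[best_id], embedding)
            ]
            self.counts[best_id] = n + 1
            return best_id

        # No sufficiently similar speaker: start a new cluster.
        self.centroids.append(list(embedding))
        self.counts.append(1)
        return len(self.centroids) - 1
```

The threshold controls the main failure modes: set too high, one speaker is split into several; set too low, similar voices are merged.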
Evaluation metrics: Diarization Error Rate (DER) = (missed speech + false alarm + speaker confusion) / total reference speech duration. State-of-the-art: ~5-10% DER on standard benchmarks, higher in difficult conditions.
Q4. [Senior] How does Whisper handle multilingual transcription? What are its limitations for low-resource languages?
Answer
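Whisper approach: Trained on 680,000 hours of weakly supervised audio. Per the Whisper paper, about 117,000 of those hours cover 96 non-English languages and about 125,000 hours are translation data into English; the rest is English. A language token conditions generation, and the model can auto-detect the language or be given one explicitly. The encoder and decoder are shared across languages; only the special language and task tokens differ.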
Multilingual capabilities: (1) Strong on high-resource languages (English, Spanish, German, French, Mandarin). (2) Reasonable on medium-resource languages. (3) Transcribe AND translate (any language → English).
Low-resource limitations: (1) Accuracy degrades significantly for languages with <1000 hours training data. (2) Rare scripts may have tokenization issues. (3) Code-switching (mixing languages) often fails. (4) Accents and dialects underrepresented. (5) Domain-specific vocabulary for rare languages very limited.
Mitigations: Fine-tune on target language data, use language prompts explicitly, post-process with language-specific spell checkers.
Q5. [Staff] Design a quality assurance system for an audio transcription service. How would you detect and handle low-quality transcriptions before they reach users?
Answer
QA pipeline components:
Input quality assessment: (1) Audio SNR estimation. (2) Clipping detection. (3) Bitrate/format validation. (4) Duration checks. Flag poor inputs for manual review or enhanced processing.
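Transcription confidence: (1) Confidence signals from Whisper, such as the per-segment average log-probability, no-speech probability, and compression ratio exposed by the open-source implementation. (2) Low average confidence → flag for review. (3) High-uncertainty segments marked in output. (4) Compare multiple model outputs for consistency.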
Output validation: (1) Language detection on output matches expected. (2) Grammar/coherence checks (LLM-based). (3) Domain vocabulary validation. (4) Diarization sanity checks (reasonable speaker count, segment lengths).
Human-in-the-loop: (1) Sample-based QA on percentage of transcriptions. (2) User feedback integration. (3) Escalation for low-confidence outputs. (4) Continuous labeling for model improvement.
Metrics: (1) Automated quality score per transcription. (2) Human QA error rates. (3) User-reported issues. (4) Confidence calibration accuracy.
Spot the Problem
Problem 1. [IC2] A developer’s audio processing pipeline:
def transcribe_audio(file_path: str) -> str:
    audio = load_audio(file_path)  # Load at original sample rate
    result = whisper.transcribe(audio)
    return result["text"]
What’s missing?
Answer
Issues: (1) No sample rate conversion: Whisper expects 16kHz audio. If a 44.1kHz or 48kHz waveform is passed in unchanged, accuracy can degrade badly or transcription can fail. (2) No audio validation: corrupt files, wrong formats, or silence will cause issues. (3) No error handling: network issues, OOM, or corrupt audio will crash the call. (4) No preprocessing: background noise, low volume, and clipping are not addressed. (5) No confidence scores: returning just the text discards quality signals.
Better implementation:
def transcribe_audio(file_path: str) -> dict:
    audio = load_audio(file_path)
    audio = resample_to_16khz(audio)  # Whisper expects 16kHz input
    audio = normalize_audio(audio)    # Even out low or inconsistent volume
    # Reject input that is too noisy to transcribe reliably
    if calculate_snr(audio) < MIN_SNR:
        raise LowQualityAudioError("Audio quality too low")
    result = whisper.transcribe(audio)
    # Return quality signals alongside the text
    return {
        "text": result["text"],
        "segments": result["segments"],
        "language": result["language"],
        "avg_confidence": calculate_avg_confidence(result)
    }
Problem 2. [Senior] A streaming transcription system has high latency:
def stream_transcribe(audio_stream):
    buffer = []
    for chunk in audio_stream:
        buffer.append(chunk)
        if len(buffer) >= 30:  # 30 seconds of audio
            text = transcribe(concat(buffer))
            yield text
            buffer = []
What’s wrong and how would you fix it?
Answer
Issues: (1) 30-second buffer means 30+ second latency—users wait half a minute for any output. (2) No overlap—context lost at buffer boundaries, causing errors. (3) No VAD—buffering silence wastes resources and time. (4) Full retranscription—doesn’t leverage incremental/streaming capabilities.
Fixes: (1) Use VAD to detect speech segments, transcribe on speech end. (2) Reduce chunk size to 2-5 seconds for faster feedback. (3) Use sliding window with overlap for context. (4) Implement streaming Whisper or use streaming-native models (like AssemblyAI streaming).
Better architecture:
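def stream_transcribe(audio_stream, overlap_size=2):
    # overlap_size: trailing chunks carried into the next segment for context
    vad = VoiceActivityDetector()
    current_segment = []
    new_chunks = 0  # chunks received since the last emitted transcript
    for chunk in audio_stream:
        current_segment.append(chunk)
        new_chunks += 1
        if vad.is_end_of_speech(chunk):
            # Overlapped audio may repeat words; deduplicate downstream
            yield transcribe(concat(current_segment))
            # Keep overlap (a plain [-0:] slice would keep the whole list)
            current_segment = current_segment[-overlap_size:] if overlap_size else []
            new_chunks = 0
    if new_chunks:  # Flush speech still buffered when the stream ends
        yield transcribe(concat(current_segment))
Problem 3. [Staff] A meeting transcription product has accuracy complaints: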
Support tickets:
- "Names are wrong half the time"
- "Technical terms come out garbled"
- "Can't tell who said what in group discussions"
- "Transcripts useless for our engineering meetings"
Diagnose the issues and propose solutions.
Answer
Root causes:
“Names are wrong”: Whisper has no context about meeting participants. Proper nouns are guessed from phonetics. Solution: (a) Pre-meeting participant list for name hints. (b) Post-processing with known name patterns. (c) User corrections fed back to improve future transcriptions.
“Technical terms garbled”: Generic model lacks domain vocabulary. Solution: (a) Custom vocabulary/prompt with technical terms. (b) Domain-specific fine-tuning if data available. (c) Post-processing with technical glossary. (d) Integrate with company knowledge base for term disambiguation.
“Can’t tell who said what”: Diarization issues with group discussions. Solutions: (a) Add speaker identification (enroll users with voice samples). (b) Use video to help diarization (detect who’s speaking via lip movement). (c) Improve diarization model for overlapping speech. (d) Let users correct speaker assignments, learn preferences.
“Useless for engineering meetings”: Combination of all above, plus engineering jargon, code references, and acronyms. Solution: Build engineering-specific transcription profile with technical vocabulary, common acronyms, and integration with codebase for identifier recognition.
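Two of the cheaper fixes above, name and vocabulary hints before transcription and glossary-based post-processing afterwards, can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: `build_initial_prompt` and `correct_terms` are hypothetical helpers, the 0.8 similarity cutoff is a guess that needs tuning, and the correction step handles only single-word glossary terms. Fuzzy matching can also over-correct ordinary words that happen to resemble a glossary entry.

```python
import difflib
from typing import Iterable, List


def build_initial_prompt(participants: Iterable[str], glossary: Iterable[str]) -> str:
    """Build a short text hint containing expected names and technical terms."""
    names = ", ".join(participants)
    terms = ", ".join(glossary)
    return f"Meeting participants: {names}. Terms that may come up: {terms}."


def correct_terms(transcript: str, glossary: List[str], cutoff: float = 0.8) -> str:
    """Replace near-miss spellings of single-word glossary terms with the canonical form."""
    lookup = {term.lower(): term for term in glossary}
    corrected = []
    for word in transcript.split():
        core = word.strip(".,!?;:")  # compare without surrounding punctuation
        match = difflib.get_close_matches(core.lower(), list(lookup), n=1, cutoff=cutoff)
        if core and match:
            # Swap in the canonical spelling, keeping any punctuation attached to the word.
            word = word.replace(core, lookup[match[0]])
        corrected.append(word)
    return " ".join(corrected)
```

With the open-source Whisper package, the prompt string would typically be passed through the `initial_prompt` argument of `model.transcribe`. The model only attends to a limited number of prompt tokens, so the participant list and glossary should stay short and prioritize the terms most often misrecognized.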
Design Exercises
Exercise 1. [Senior] Design an audio search system for a podcast hosting platform. Users should be able to search across millions of episodes to find discussions of specific topics, including semantic search (“discussions about AI ethics” should match “debate about responsible machine learning”).
Guidance
Indexing pipeline: (1) Download/receive audio files. (2) Transcribe with Whisper large-v3 (accuracy over speed for indexing). (3) Chunk transcripts into 30-60 second segments with overlap. (4) Generate embeddings for each segment using a text embedding model. (5) Store in a vector database with metadata (show, episode, timestamp, speaker if diarization is done).
Query processing: (1) Embed user query. (2) Vector similarity search to retrieve top candidates. (3) Re-rank with cross-encoder for precision. (4) Group results by episode, deduplicate. (5) Return clips with timestamps.
Enhancements: (1) Named entity recognition to extract people, organizations, and topics mentioned. (2) Episode-level embeddings for a “similar episodes” feature. (3) Transcript display with search highlights. (4) Audio player that jumps to matched timestamps.
Scaling: Index millions of episodes offline, serve queries from the vector DB. Consider sharding by show category or date range.
Cost: Rough, provider-dependent figures for sanity-checking the design: hosted Whisper transcription ~$0.006/minute, embedding ~$0.0001/1K tokens, vector storage ~$0.01/1M vectors/month. Verify against current pricing before budgeting.
Exercise 2. [Staff] Design a real-time meeting assistant that transcribes, identifies speakers, and generates summaries. Requirements: <3 second transcription latency, support 2-10 speakers, integrate with Zoom/Teams, generate action items in real-time. Consider: architecture, latency optimization, speaker identification, and graceful degradation.
Guidance
Architecture: (1) Audio capture via meeting platform API (Zoom SDK, Teams Graph API). (2) Edge processing for VAD and initial chunking (minimize round-trip latency). (3) Streaming transcription service (cloud-based, parallelized). (4) Speaker identification using enrolled voice profiles + real-time clustering. (5) LLM for summarization and action item extraction (runs on accumulated context).
Latency optimization: (1) Edge VAD reduces data sent to cloud. (2) Pre-warmed inference servers. (3) Streaming output (display words as they’re recognized). (4) Pipeline parallelization (transcribe chunk N while summarizing chunk N-1).
Speaker identification: (1) Pre-meeting: Enroll speakers with 30-second voice samples. (2) During meeting: Match segments to enrolled profiles. (3) Fallback: Cluster unknown voices, let users label post-meeting. (4) Video integration: Use active speaker indicators from meeting platform.
Graceful degradation: (1) High latency: Show “processing…” indicator, catch up. (2) Poor audio: Show confidence warnings, offer manual correction. (3) Unknown speakers: Generic labels (Speaker A, B), editable post-meeting. (4) LLM overload: Queue summarization, prioritize transcription.
Further Reading
Foundational Papers
- Radford et al. (2022). “Robust Speech Recognition via Large-Scale Weak Supervision” — The Whisper paper that transformed speech-to-text with weak supervision at scale
- Bredin et al. (2020). “pyannote.audio: Neural Building Blocks for Speaker Diarization” — Modern neural speaker diarization approaches
Speech Recognition
- Baevski et al. (2020). “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations” — Self-supervised speech representation learning foundation
- Zhang et al. (2023). “Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages” — Large-scale multilingual ASR techniques
- Pratap et al. (2020). “MLS: A Large-Scale Multilingual Dataset for Speech Research” — Multilingual speech datasets and challenges
Production Systems
- Park et al. (2019). “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition” — Data augmentation techniques for ASR training
- Chan et al. (2016). “Listen, Attend and Spell” — End-to-end speech recognition with attention mechanisms