Chapter 6: Prompt Engineering
prompting, few-shot, chain-of-thought, in-context learning, temperature, sampling, system prompts, structured outputs, JSON mode, prompt caching
Introduction
In early 2020, a team at OpenAI made a discovery that would reshape how we think about machine learning. They had trained GPT-3, a language model with 175 billion parameters, and were probing its capabilities. What they found was unexpected: without any fine-tuning or additional training, the model could perform new tasks simply by being shown a few examples in the prompt. Show it three examples of translating English to French, and it would translate a fourth. Show it examples of sentiment classification, and it would classify. The model had learned to learn from its context.
This phenomenon, called in-context learning, defies our traditional understanding of machine learning. We typically train models on thousands of examples, update their weights through backpropagation, and then deploy the frozen model. But here was a model whose behavior could be fundamentally altered at inference time, through nothing more than carefully chosen text in the prompt.
Prompt engineering is the discipline of crafting inputs that reliably elicit the behaviors you need. This might sound like a soft skill, a matter of trial and error. It isn’t. Understanding why prompts work requires grasping the computational mechanisms of attention, the statistical patterns encoded during pretraining, and the geometric structure of the model’s representation space. The difference between a prompt that works 60% of the time and one that works 95% of the time can be the difference between a prototype and a production system.
This chapter takes you from the empirical techniques that practitioners use daily to the theoretical foundations that explain why those techniques work. By the end, you’ll understand prompting deeply enough to design novel approaches, debug failures, and push the boundaries of what’s possible without model modification.
The Core Insight: Why Prompts Work
To understand prompt engineering, you need to understand what happened during the model’s training.
Large language models are trained on a simple objective: predict the next token given all previous tokens. But this simple objective, applied to trillions of tokens of human-generated text, produces remarkable emergent capabilities. The model doesn’t just learn grammar and facts; it learns patterns of reasoning, conventions of formatting, and implicit rules of various domains.
When you write a prompt, you’re not just providing instructions. You’re activating patterns the model learned during training. Consider this prompt:
Q: What is the capital of France?
A:
The model has seen millions of Q&A formatted texts during training. The “Q:” and “A:” tokens activate attention patterns associated with question-answering. The model “knows” that after “A:”, it should produce an answer related to the question. You’re not teaching the model to answer questions; you’re triggering a latent capability.
This explains why prompt format matters so much. Different formats activate different learned patterns:
- “Translate to French:” activates translation patterns
- “def function_name():” activates code generation patterns
- “Step 1:” activates sequential reasoning patterns
- “Let me think about this carefully…” activates deliberative reasoning patterns
The model is pattern-matching against its training distribution. Your job as a prompt engineer is to craft inputs that activate the right patterns for your task.
A Mental Model for Prompting
Treat your prompt the way a compiler treats source code: you don’t sit down and type the final artifact in one pass. You make optimization passes—clarity, examples, constraints, output format, edge-case handling—each pass lowering ambiguity the way a compiler lowers high-level code into tighter intermediate forms. A prompt that “works” is the output of this pipeline, not the first draft. Heuristic: when a prompt misbehaves, ask which pass you skipped, not which sentence to rewrite.
The corollary of “prompts are compiled, not written” is that prompt engineering looks more like iterative compilation than creative writing. Think of a language model as an extraordinarily well-read research assistant who has consumed most of the internet but has no memory of your previous conversations. Each prompt is a fresh start. The assistant is eager to help but takes your words literally and has no way to ask for clarification.
Your prompt must accomplish several things simultaneously:
- Context: What situation is the assistant in? What knowledge is relevant?
- Task specification: What exactly should the assistant do?
- Format constraints: What should the output look like?
- Quality signals: What distinguishes good outputs from bad ones?
When prompts fail, it’s usually because one of these elements is missing or ambiguous. The model defaults to the most probable completion given its training, which may not be what you intended.
What You’ll Learn
- How attention mechanisms explain few-shot learning
- Why chain-of-thought prompting activates reasoning capabilities
- The theoretical basis for different prompting techniques
- When to use few-shot vs. zero-shot approaches
- Temperature and sampling strategies for different tasks
- Structured output extraction for production systems
- Prompt caching strategies at scale
- Real-world case studies from production systems
Prerequisites
- Understanding of transformer attention mechanisms (Chapter 5)
- Familiarity with tokenization and embedding spaces (Chapter 5)
- Experience calling an LLM API (OpenAI, Anthropic, or similar)
- Basic familiarity with JSON and Python
The Science of In-Context Learning
How Attention Enables Learning from Examples
In-context learning seems almost magical: include a few examples in the prompt, and the model performs the task. But there’s nothing magical about it. The mechanism is attention.
Recall from Chapter 5 that transformer layers use self-attention to allow each token to gather information from all previous tokens. When processing a few-shot prompt:
Input: The food was delicious → Sentiment: positive
Input: Terrible service → Sentiment: negative
Input: The movie was okay → Sentiment:
Here’s what happens during the forward pass:
Pattern recognition in lower layers: Early layers identify structural patterns. The model recognizes the “Input: X → Sentiment: Y” format repeating.
Feature extraction in middle layers: The model extracts relevant features from each example. For the first example, it notes “delicious” correlates with “positive.” For the second, “terrible” correlates with “negative.”
Analogy formation in upper layers: When processing the query “The movie was okay,” upper layers attend back to the examples. The attention mechanism computes: “okay” is similar to “neutral” sentiment words in training; it’s neither strongly positive like “delicious” nor strongly negative like “terrible.”
Output prediction: The final layers synthesize this information to predict “neutral” (or “mixed”) as the next token.
The key insight: attention heads are performing something like gradient-free learning during the forward pass. Garg et al. (2022) showed that transformers trained on prompts of input-output pairs can learn simple function classes, such as linear functions, in context, with accuracy comparable to a least-squares estimator. Follow-up work (for example, von Oswald et al., 2023) argues that attention layers can implement updates functionally equivalent to gradient descent on the in-context examples.
This explains several empirical observations:
- More examples improve performance (more data for the implicit learner)
- Example order matters (recent examples have stronger attention weights)
- Diverse examples help (better coverage of the implicit training distribution)
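The sentiment prompt shown earlier in this section can be assembled programmatically. The following is a minimal sketch; the helper name `build_few_shot_prompt` and the `(text, label)` tuple layout are illustrative assumptions, not part of any library.

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Render labeled examples and an unlabeled query in one consistent format."""
    lines = [f"Input: {text} → Sentiment: {label}" for text, label in examples]
    # The query repeats the same pattern but stops where the label should go.
    lines.append(f"Input: {query} → Sentiment:")
    return "\n".join(lines)


examples = [
    ("The food was delicious", "positive"),
    ("Terrible service", "negative"),
]
prompt = build_few_shot_prompt(examples, "The movie was okay")
print(prompt)
```

Keeping every example in exactly the same template is the design choice that matters here: the model only has to continue a pattern it has already seen twice.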
The Induction Head Mechanism
Olsson et al. (2022) discovered a specific attention pattern that underlies in-context learning: the induction head. This is a two-layer circuit:
- First layer (previous token head): At each position, attends to the immediately preceding token and copies information about it into the current position
- Second layer (induction head): Uses that information to find earlier positions whose preceding token matches the current token, then copies forward the token found there
For example, when processing “[A][B]…[A]”, the induction head predicts “[B]” will follow the second “[A]” because it learned the pattern from the first occurrence.
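The copying behavior can be mimicked in a few lines of plain Python. This is only a behavioral analogy, written under the assumption that tokens are simple strings; it is not how attention computes the result, and the function name `induction_predict` is hypothetical.

```python
from typing import Optional


def induction_predict(tokens: list[str]) -> Optional[str]:
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the current (last) token."""
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the final token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # the token that followed the earlier match
    return None  # no earlier occurrence, so nothing to copy


print(induction_predict(["A", "B", "C", "D", "A"]))  # prints B
```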
In few-shot prompting, this mechanism generalizes:
Input: cat → Output: chat
Input: dog → Output: chien
Input: house → Output:
The induction heads recognize “Input: X → Output:” as a pattern and predict that whatever transformation was applied to “cat” and “dog” should apply to “house.” Through training on parallel texts, the model learned that this pattern often involves translation, and it applies that transformation.
This mechanistic understanding has practical implications:
- Consistent formatting matters: The pattern must be clear for induction heads to latch onto
- Similar examples are powerful: Examples structurally similar to the query activate stronger attention
- The first example sets expectations: It establishes the pattern that subsequent examples reinforce
Why Larger Models Learn Better In-Context
A curious finding from the GPT-3 paper: in-context learning performance scales with model size. Small models barely improve with examples; large models show dramatic gains. Why?
The answer lies in how capabilities emerge during training:
- Small models memorize: They learn surface-level patterns but can’t generalize
- Medium models recognize: They identify patterns but struggle to apply them flexibly
- Large models abstract: They form abstract representations that transfer across contexts
Larger models develop more sophisticated induction heads and other in-context learning circuits because they have the capacity to represent these complex algorithms while still performing well on the next-token prediction objective. In-context learning is an emergent capability that arises when models are large enough to spare capacity for it.
This has practical implications for prompt engineering:
- Smaller models need more explicit instructions: They can’t infer as much from context
- Larger models benefit from subtlety: They can pick up on implicit patterns
- Capability cliffs exist: A technique that fails on smaller models might work on GPT-5 or Claude Opus 4.8
Research Foundation: Key Papers
Understanding these papers will deepen your intuition for why prompting techniques work:
“Language Models are Few-Shot Learners” (Brown et al., 2020) The GPT-3 paper that demonstrated in-context learning at scale. Key finding: task performance improves log-linearly with model size for few-shot prompting. This paper introduced the terminology of zero-shot, one-shot, and few-shot learning.
“An Explanation of In-context Learning as Implicit Bayesian Inference” (Xie et al., 2021) Proposes that in-context learning works because the model infers the latent concept (task) from examples, then conditions on that concept. This explains why example diversity matters: more examples provide better evidence for the latent task.
“What Can Transformers Learn In-Context?” (Garg et al., 2022) Shows that transformers can be trained to learn simple function classes (linear functions, sparse linear functions, decision trees, and two-layer neural networks) entirely within their forward pass, matching or approaching standard estimators such as least squares. The attention mechanism has enough computational power to learn from examples without weight updates.
“In-context Learning and Induction Heads” (Olsson et al., 2022) Identifies the specific attention patterns (induction heads) responsible for in-context learning. Demonstrates that in-context learning emerges suddenly during training when induction heads form.
Prompt Structure and Formatting
We introduced “prompts are compiled, not written” earlier. Here’s the concrete pipeline. Each production prompt should be the output of these passes—not your first draft:
Pass 1 - Requirements Capture: What must the model accomplish? What are the edge cases? What’s the acceptable failure mode?
Pass 2 - Structure Design: Role, context, task, format, constraints. Each component serves a specific function in guiding attention.
Pass 3 - Example Selection: Which few-shot examples cover the input distribution? Which demonstrate edge case handling?
Pass 4 - Optimization: Remove redundancy. Tighten language. Order information for maximum cache hits.
Pass 5 - Hardening: Add guardrails. Test adversarial inputs. Verify structured output compliance.
Just as you wouldn’t ship unoptimized code, you shouldn’t ship unrefined prompts. Each pass reduces ambiguity and increases reliability. The prompt that “just works” in development often fails in production because it skipped the hardening pass. Production prompts should feel over-engineered—because they are.
# "It works in my notebook" - Famous last words
from anthropic import Anthropic

client = Anthropic()

def extract_info(text: str) -> dict:
    # No schema, no delimiters, no token limit, no validation
    response = client.messages.create(
        model="claude-sonnet-4-6",
        messages=[{"role": "user", "content": f"Extract the key info from: {text}"}]
    )
    return response.content[0].text  # Hope it's valid JSON?

from typing import Optional

import instructor
from anthropic import Anthropic
from pydantic import BaseModel, Field

class ExtractedInfo(BaseModel):
    """Structured output with validation."""
    entities: list[str] = Field(description="Named entities mentioned")
    sentiment: str = Field(pattern="^(positive|negative|neutral)$")
    confidence: float = Field(ge=0.0, le=1.0)
    # default=None makes the summary genuinely optional
    summary: Optional[str] = Field(default=None, max_length=200)

client = instructor.from_anthropic(Anthropic())

def extract_info(text: str) -> ExtractedInfo:
    """Extract with guaranteed structure and validation."""
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        response_model=ExtractedInfo,
        messages=[{
            "role": "user",
            "content": f"""Extract information from the following text.
<text>
{text}
</text>
Identify all named entities, overall sentiment, your confidence level,
and a brief summary if the text warrants one."""
        }]
    )  # Returns validated ExtractedInfo, not raw string

Key differences: Structured output schema, validation via Pydantic, explicit token limits, clear delimiters, and type-safe return value.
The Anatomy of Effective Prompts
Production prompts aren’t single sentences. They’re structured documents with distinct sections that guide the model’s attention and behavior. Understanding each component helps you diagnose when prompts fail.
Role/Persona: Establishes the model’s perspective and activates relevant knowledge patterns
You are a senior software engineer with 15 years of experience in distributed systems.
Context: Background information the model needs but doesn’t have in its training data
Our system uses event sourcing with Kafka as the message broker. We're seeing
occasional message ordering issues in high-throughput scenarios.
Task specification: Exactly what the model should do (be precise!)
Analyze the following error logs and identify the root cause of the ordering issues.
Focus on the timestamps and partition assignments.
Input data: The specific content to process, clearly delimited
<logs>
{log_content}
</logs>
Output format: What the response should look like
Respond with:
1. Root cause (one sentence)
2. Evidence from logs (quote specific lines)
3. Recommended fix (with code example if applicable)
Constraints: Boundaries on the response
Do not suggest changes to our Kafka configuration unless absolutely necessary.
Keep the response under 500 words.
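The six components above can be kept as separate fields and joined at call time, which makes it easier to see which one is missing when a prompt fails. A minimal sketch follows; the `PromptSpec` class and its field names are assumptions made for illustration, and the log content is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class PromptSpec:
    role: str
    context: str
    task: str
    input_tag: str
    input_data: str
    output_format: str
    constraints: str

    def render(self) -> str:
        """Join the components in a fixed order, wrapping the input in tags."""
        delimited_input = f"<{self.input_tag}>\n{self.input_data}\n</{self.input_tag}>"
        sections = [self.role, self.context, self.task, delimited_input,
                    self.output_format, self.constraints]
        # Drop empty sections so optional components leave no blank gaps.
        return "\n\n".join(s.strip() for s in sections if s.strip())


spec = PromptSpec(
    role="You are a senior software engineer with 15 years of experience in distributed systems.",
    context="Our system uses event sourcing with Kafka as the message broker.",
    task="Analyze the following error logs and identify the root cause of the ordering issues.",
    input_tag="logs",
    input_data="(log lines go here)",
    output_format="Respond with:\n1. Root cause (one sentence)\n2. Evidence from logs\n3. Recommended fix",
    constraints="Keep the response under 500 words.",
)
prompt = spec.render()
print(prompt)
```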
Why Delimiters Matter
Delimiters are more than organizational tools. They help the model’s attention mechanism distinguish between different semantic regions of the prompt.
Structural markers such as XML tags give the model an unambiguous signal about where one semantic region ends and the next begins, so content inside a tag is more reliably treated as a unit than content that blends into the surrounding prompt. This is why:
<document>
The company was founded in 1985 by John Smith.
</document>
<question>
When was the company founded?
</question>
…produces more reliable extraction than:
Here is a document: The company was founded in 1985 by John Smith.
Question: When was the company founded?
In the first version, the model knows exactly what is “document” and what is “question.” In the second, the boundary is ambiguous.
Delimiter best practices:
- XML-style tags (<tag>content</tag>) work well with Claude
- Triple quotes ("""content""") are common for user-provided text
- Markdown headers (## Section) provide clear structure
- Consistency matters more than choice: pick a style and use it everywhere
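When the delimited content comes from users, it may itself contain the closing tag, which would end the region early. The sketch below wraps content in XML-style tags and neutralizes an embedded closing tag. The helper name `wrap_in_tag` and the escaping scheme are assumptions of this example, and escaping alone is not a complete defense against prompt injection.

```python
def wrap_in_tag(tag: str, content: str) -> str:
    """Wrap content in <tag>...</tag>, neutralizing any embedded closing tag."""
    closing = f"</{tag}>"
    # Escape the angle bracket of an embedded closing tag so untrusted
    # content cannot terminate the delimited region on its own.
    safe = content.replace(closing, f"&lt;/{tag}>")
    return f"<{tag}>\n{safe}\n{closing}"


doc = "The company was founded in 1985 by John Smith."
prompt = wrap_in_tag("document", doc) + "\n" + wrap_in_tag(
    "question", "When was the company founded?"
)
print(prompt)
```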
The Instruction Hierarchy
Modern LLM APIs distinguish between different levels of instructions:
- System prompt: Persistent context that frames all interactions
- User messages: Individual requests within a conversation
- Assistant messages: The model’s previous responses (in multi-turn)
These aren’t just organizational conveniences. They have different priority levels:
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=500,  # required by the Messages API
    system="""You are a helpful customer service agent for TechCorp.
Always be polite and professional. Never discuss competitor products.
If you don't know something, say so rather than guessing.""",
    messages=[
        {"role": "user", "content": "What's better, your product or Apple's?"}
    ]
)

The system prompt establishes behavioral constraints that persist across turns. The model treats system prompt instructions as higher priority than potentially adversarial user content. This is crucial for safety: system prompts can establish boundaries that users can’t easily override.
Real-World Case Study: Customer Service Classification
A major e-commerce company needed to route 50,000 daily customer messages to appropriate teams. Their initial prompt:
Classify this message into a category.
Message: {message}
This achieved 67% accuracy. Messages were frequently misclassified because:
- Categories weren’t defined
- Edge cases weren’t handled
- Output format varied
Their production prompt:
You are a customer service routing system for an e-commerce platform.
TASK: Classify the customer message into exactly ONE category.
CATEGORIES:
- ORDER_STATUS: Tracking, delivery timing, shipment questions
- RETURNS: Return requests, refund status, exchange inquiries
- PRODUCT: Product questions, sizing, compatibility, availability
- ACCOUNT: Login issues, password reset, profile updates
- BILLING: Payment failures, invoice requests, pricing disputes
- ESCALATION: Angry customers, legal threats, media mentions
- OTHER: Anything that doesn't fit above categories
RULES:
1. If message mentions both order status AND return, classify as RETURNS
2. If customer expresses significant frustration (caps, exclamation marks, threats),
classify as ESCALATION regardless of topic
3. If uncertain between two categories, choose the one listed first above
<message>
{customer_message}
</message>
Respond with only the category name in uppercase. No explanation.
This achieved 94% accuracy. The key improvements:
- Explicit category definitions with examples
- Priority rules for ambiguous cases
- Clear handling of edge cases (angry customers)
- Strict output format
Full implementation: See reference/06b_prompt_engineering_code.md for complete classifier with validation.
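Even with a strict output instruction, the reply should be checked against the label set before routing. The following is a minimal sketch of that check, separate from the referenced implementation; the function name `parse_category` and the choice to fall back to OTHER are assumptions (a production system might instead retry or send the message to a human).

```python
VALID_CATEGORIES = {
    "ORDER_STATUS", "RETURNS", "PRODUCT", "ACCOUNT",
    "BILLING", "ESCALATION", "OTHER",
}


def parse_category(raw_output: str, fallback: str = "OTHER") -> str:
    """Normalize the model's reply and reject anything outside the label set."""
    candidate = raw_output.strip().strip(".\"'").upper().replace(" ", "_")
    return candidate if candidate in VALID_CATEGORIES else fallback


print(parse_category(" returns\n"))            # RETURNS
print(parse_category("Order status."))         # ORDER_STATUS
print(parse_category("I think it's BILLING"))  # OTHER (not an exact label)
```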
Few-Shot Learning in Practice
When to Use Few-Shot vs. Zero-Shot
The first decision in prompt design: should you include examples?
Zero-shot is sufficient when:
- The task is well-known (translation, summarization, sentiment)
- The output format is standard (yes/no, short answers)
- You’re using a very capable model (GPT-5, Claude Opus 4.8)
- Speed and cost are critical (examples add tokens)
Few-shot is necessary when:
- The task requires a specific output format
- The task has domain-specific conventions
- Categories or labels aren’t obvious from names alone
- The model consistently misunderstands zero-shot instructions
Empirical guidance on example count:
| Task Complexity | Recommended Examples | Rationale |
|---|---|---|
| Simple classification | 2-3 per class | Establish format, show boundaries |
| Nuanced classification | 5-10 per class | Cover edge cases |
| Format transformation | 3-5 total | Show the pattern |
| Domain-specific tasks | 10-20 total | Transfer domain knowledge |
| Complex reasoning | 2-3 with reasoning | Quality over quantity |
Research by Min et al. (2022) found that example labels matter less than you’d think. What matters most is:
1. The format of examples (input-output structure)
2. The distribution of inputs (covering the space)
3. The number of examples (more is generally better, with diminishing returns)
Example Selection Strategies
Random selection rarely optimizes performance. Strategic selection significantly improves results.
Similarity-based selection: Choose examples similar to the query
from sklearn.metrics.pairwise import cosine_similarity

def select_similar_examples(query: str, example_pool: list[dict],
                            embedding_model, k: int = 5) -> list[dict]:
    """Select k examples most similar to the query."""
    # embedding_model is assumed to expose a sentence-transformers-style encode()
    query_embedding = embedding_model.encode(query)
    example_embeddings = embedding_model.encode([ex['input'] for ex in example_pool])
    similarities = cosine_similarity([query_embedding], example_embeddings)[0]
    top_indices = similarities.argsort()[-k:][::-1]  # highest similarity first
    return [example_pool[i] for i in top_indices]

Diversity-based selection: Cover the space of possible inputs
import numpy as np
from sklearn.cluster import KMeans

def select_diverse_examples(example_pool: list[dict], embedding_model,
                            k: int = 5) -> list[dict]:
    """Select k diverse examples using k-means clustering."""
    embeddings = np.asarray(embedding_model.encode([ex['input'] for ex in example_pool]))
    kmeans = KMeans(n_clusters=k, n_init=10)
    kmeans.fit(embeddings)
    # Select example closest to each cluster center
    selected = []
    for center in kmeans.cluster_centers_:
        distances = np.linalg.norm(embeddings - center, axis=1)
        selected.append(example_pool[distances.argmin()])
    return selected

Stratified selection: Ensure all categories are represented
import random
from collections import defaultdict

def select_stratified_examples(example_pool: list[dict],
                               k_per_class: int = 2) -> list[dict]:
    """Select k examples per class for balanced representation."""
    by_class = defaultdict(list)
    for ex in example_pool:
        by_class[ex['label']].append(ex)
    selected = []
    for label, examples in by_class.items():
        # Take fewer than k_per_class when a class has too few examples
        selected.extend(random.sample(examples, min(k_per_class, len(examples))))
    return selected

Research insight: Liu et al. (2022) found that example ordering affects performance. Examples closer to the query (at the end of the prompt) have more influence. A good strategy:
1. Start with diverse examples that establish the task
2. End with examples similar to the query
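The ordering strategy above (diverse examples first, query-similar examples last) can be sketched as follows. The function name `order_examples` and the `(example, similarity)` pair layout are assumptions for illustration; the similarity scores are made-up values standing in for the output of an embedding comparison.

```python
def order_examples(diverse: list[dict], similar: list[tuple[dict, float]]) -> list[dict]:
    """Place diverse examples first, then query-similar ones, most similar last."""
    # Sort ascending by similarity so the closest match sits right before the query.
    ranked = [ex for ex, _ in sorted(similar, key=lambda pair: pair[1])]
    # Drop diverse examples that also appear in the similar set, to avoid repeats.
    seen = {ex["input"] for ex in ranked}
    head = [ex for ex in diverse if ex["input"] not in seen]
    return head + ranked


diverse = [{"input": "Refund please", "label": "RETURNS"},
           {"input": "Reset my password", "label": "ACCOUNT"}]
similar = [({"input": "Where is my package?", "label": "ORDER_STATUS"}, 0.91),
           ({"input": "Delivery is late", "label": "ORDER_STATUS"}, 0.78)]
ordered = order_examples(diverse, similar)
print([ex["input"] for ex in ordered])
```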
Dynamic Few-Shot at Scale
For production systems processing diverse inputs, static examples aren’t enough. Dynamic few-shot retrieves relevant examples for each query.
Architecture pattern:
Query → Embed → Search example pool → Select top-k → Build prompt → LLM
                       ↓
               Vector database of
               labeled examples
Production implementation:
from collections import defaultdict

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

class DynamicFewShotClassifier:
    def __init__(self, examples: list[dict], embedding_model, llm_client):
        self.examples = examples
        self.embedder = embedding_model
        self.llm = llm_client  # assumed to expose generate(prompt) -> str
        # Pre-compute example embeddings
        self.example_embeddings = self.embedder.encode(
            [ex['input'] for ex in examples]
        )
        self.example_labels = [ex['label'] for ex in examples]

    def classify(self, query: str, k: int = 5) -> str:
        # Find similar examples
        query_embedding = self.embedder.encode([query])
        similarities = cosine_similarity(query_embedding, self.example_embeddings)[0]
        # Get top-k, ensuring label diversity
        selected = self._select_with_diversity(similarities, k)
        # Build prompt
        prompt = self._build_prompt(selected, query)
        return self.llm.generate(prompt)

    def _select_with_diversity(self, similarities: np.ndarray, k: int) -> list[dict]:
        """Select top examples while maintaining label diversity."""
        sorted_indices = similarities.argsort()[::-1]
        selected = []
        label_counts = defaultdict(int)
        # Cap per-label picks so one class cannot fill every slot
        max_per_label = (k // len(set(self.example_labels))) + 1
        for idx in sorted_indices:
            label = self.example_labels[idx]
            if label_counts[label] < max_per_label:
                selected.append(self.examples[idx])
                label_counts[label] += 1
            if len(selected) >= k:
                break
        return selected

    def _build_prompt(self, selected: list[dict], query: str) -> str:
        """Minimal stand-in: render the examples, then the unanswered query."""
        shots = "\n".join(f"Input: {ex['input']}\nLabel: {ex['label']}" for ex in selected)
        return f"{shots}\nInput: {query}\nLabel:"

Full implementation: See reference/06b_prompt_engineering_code.md for production-ready classifier with caching and error handling.
Real-World Case Study: Code Review System
A DevOps platform needed to classify code review comments into actionable categories for metrics and routing. The challenge: code review comments use highly domain-specific language.
Approach: They built a dynamic few-shot system using embeddings from a code-focused model (CodeBERT) and maintained a pool of 500 labeled examples from their own repository.
Key insights from production:
1. Domain-specific embeddings matter: Code-focused embeddings outperformed general-purpose embeddings by 12% accuracy
2. Recency helps: They weighted recent examples higher, as coding conventions evolved
3. Human-in-the-loop: Low-confidence classifications were routed to humans, whose corrections fed back into the example pool
Results: 89% classification accuracy vs. 71% with static few-shot, with the example pool requiring updates only monthly.
What people do: Place few-shot examples in arbitrary order, often the order they were created.
Why it fails: Example order significantly affects model behavior. The model over-weights recent examples due to recency bias—placing all positive examples last biases toward positive classifications.
Fix: Interleave classes systematically. For classification, alternate between categories. For reasoning tasks, order examples from simple to complex. When in doubt, randomize the order and test performance across multiple orderings.
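Interleaving classes is straightforward to automate. The sketch below does a round-robin over labels; the function name `interleave_by_label` and the example data are illustrative assumptions.

```python
from collections import defaultdict
from itertools import zip_longest


def interleave_by_label(examples: list[dict]) -> list[dict]:
    """Round-robin across labels so no class clusters at the end of the prompt."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["label"]].append(ex)
    interleaved = []
    # zip_longest pads shorter classes with None; skip the padding.
    for group in zip_longest(*by_label.values()):
        interleaved.extend(ex for ex in group if ex is not None)
    return interleaved


examples = [
    {"input": "Loved it", "label": "positive"},
    {"input": "Great value", "label": "positive"},
    {"input": "Fantastic", "label": "positive"},
    {"input": "Broke in a day", "label": "negative"},
    {"input": "Never again", "label": "negative"},
]
labels = [ex["label"] for ex in interleave_by_label(examples)]
print(labels)
# ['positive', 'negative', 'positive', 'negative', 'positive']
```

With unbalanced classes the tail is still dominated by the larger class, so testing several orderings remains worthwhile.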
Chain-of-Thought Reasoning
Why Thinking Out Loud Helps
In 2022, researchers at Google made a simple discovery with profound implications: asking language models to show their reasoning dramatically improves accuracy on complex tasks. Kojima et al. found that adding “Let’s think step by step” to prompts improved accuracy on the GSM8K math benchmark from 10.4% to 40.7% (using InstructGPT 175B). However, note that improvements vary significantly across tasks—arithmetic tasks may see +15-20% gains while other tasks see smaller improvements. The technique is most effective on multi-step reasoning problems with models of 100B+ parameters; smaller models may produce illogical reasoning chains.
This isn’t just about producing longer outputs. Chain-of-thought (CoT) prompting changes what the model computes. Understanding why requires looking at how transformers process information.
The computational constraint: Transformers are feedforward networks. Each layer can perform only a fixed amount of computation. For simple tasks like sentiment analysis, this is sufficient. For multi-step reasoning, it’s not.
Consider: “If a store sells 15 apples in the morning and 23 in the afternoon, then gives away 7, how many apples did it sell on net?”
Direct prediction requires computing (15 + 23 - 7 = 31) in a single forward pass. Chain-of-thought allows:
- Step 1: “Morning sales: 15”
- Step 2: “Afternoon sales: 23”
- Step 3: “Total sold: 15 + 23 = 38”
- Step 4: “Given away: 7”
- Step 5: “Net sales: 38 - 7 = 31”
Each step is a separate token generation, giving the model more computation to work with. The intermediate results are written to the context, serving as working memory.
Theoretical interpretation: Feng et al. (2023) showed that CoT effectively increases the model’s computational depth. Without CoT, the model is limited to problems solvable in its fixed number of layers. With CoT, it can solve problems requiring arbitrarily many sequential steps, limited only by context length.
Prompting Strategies for Reasoning
Zero-shot CoT: Simply append a trigger phrase
Q: A bat and ball cost $1.10 total. The bat costs $1 more than the ball.
How much does the ball cost?
Let's work through this step by step:
This works because the model has seen reasoning patterns in its training data. The trigger phrase activates those patterns.
Few-shot CoT: Provide examples with explicit reasoning
Q: If there are 3 cars and 2 more arrive, then 1 leaves, how many are there?
A: Starting with 3 cars. After 2 arrive: 3 + 2 = 5. After 1 leaves: 5 - 1 = 4.
The answer is 4.
Q: If a store has 20 items and sells 8, then receives 15 more, how many are there?
A: Starting with 20 items. After selling 8: 20 - 8 = 12. After receiving 15: 12 + 15 = 27.
The answer is 27.
Q: {new_question}
A:
Structured CoT: Impose a reasoning framework
Solve this problem using these steps:
1. Identify what we're looking for
2. List the relevant information
3. Determine the operations needed
4. Perform calculations
5. Verify the answer makes sense
6. State the final answer
Problem: {problem}
Solution:
Self-Consistency: Voting Across Reasoning Paths
A single reasoning chain can contain errors. Self-consistency (Wang et al., 2022) generates multiple chains and takes the majority answer:
from collections import Counter

def self_consistent_answer(question: str, n_chains: int = 5,
                           temperature: float = 0.7) -> str:
    """Generate multiple reasoning chains and vote on the answer."""
    # generate() and extract_final_answer() are assumed helpers defined elsewhere:
    # one wraps the LLM call, the other parses the final answer out of a chain.
    answers = []
    for _ in range(n_chains):
        response = generate(
            prompt=f"{question}\n\nLet's solve this step by step:",
            temperature=temperature  # Non-zero for diversity
        )
        answer = extract_final_answer(response)
        answers.append(answer)
    # Return most common answer
    answer_counts = Counter(answers)
    return answer_counts.most_common(1)[0][0]

Why it works: Different reasoning paths make different errors, but correct paths tend to converge on the same answer. If 4 out of 5 chains produce “31” and one produces “38”, the majority is likely correct.
When to use: Self-consistency helps most when:
- The task has a verifiable correct answer (math, logic)
- Individual chain accuracy is moderate (50-80%)
- You can afford the additional API calls
Diminishing returns: Research shows significant gains from 1 to 5 chains, moderate gains from 5 to 10, and minimal gains beyond 20. Cost-optimize accordingly.
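The voting step depends on reliably pulling the final answer out of each chain. The following sketch shows one way to do that for numeric answers, assuming chains end with a phrase like "The answer is N" as in the few-shot examples earlier. The regular expression, the `majority_vote` helper, and the canned chains are illustrative assumptions; no model is called.

```python
import re
from collections import Counter
from typing import Optional


def extract_final_answer(chain: str) -> Optional[str]:
    """Return the number in the last 'answer is N' statement, if any."""
    matches = re.findall(r"answer is\s*\$?(-?\d+(?:\.\d+)?)", chain, flags=re.IGNORECASE)
    return matches[-1] if matches else None


def majority_vote(chains: list[str]) -> tuple[Optional[str], float]:
    """Vote over parsed answers; also report the share of chains that agree."""
    answers = [a for a in (extract_final_answer(c) for c in chains) if a is not None]
    if not answers:
        return None, 0.0
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(chains)


chains = [
    "15 + 23 = 38. 38 - 7 = 31. The answer is 31.",
    "Total sold is 38, minus 7 given away. The answer is 31.",
    "Morning plus afternoon: 15 + 23 = 38. The answer is 38.",
    "38 - 7 = 31. The answer is 31.",
    "Net sales: 31. The answer is 31.",
]
print(majority_vote(chains))  # ('31', 0.8)
```

Returning the agreement share alongside the winner gives a rough confidence signal: a low share suggests the question may need more chains or a different prompt.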
When Chain-of-Thought Hurts
The situation: A healthcare startup built an AI assistant to help patients understand their symptoms. To improve accuracy, they added chain-of-thought prompting: “Think step by step about what conditions might cause these symptoms, considering severity, duration, and risk factors.”
What happened: Accuracy improved dramatically—the system correctly identified concerning symptom patterns 40% more often. The team celebrated. Then they ran safety evaluations.
The same step-by-step reasoning that improved accuracy also improved the system’s ability to provide detailed information about self-harm methods. When a user mentioned feeling hopeless, the CoT prompt caused the model to reason: “Step 1: Consider what the user might be asking about. Step 2: If they’re asking about harming themselves, I should think about methods…” The reasoning process that helped with symptom analysis was now helping the model think through harmful completions.
Root cause: Chain-of-thought prompting amplifies all of the model’s reasoning capabilities, not just the ones you want. Better reasoning about benign topics meant better reasoning about harmful ones too. The team had tested for accuracy without testing for safety regressions.
How they fixed it:
1. Added safety classifiers before CoT reasoning (detect and redirect risky queries immediately)
2. Implemented separate prompt paths for different query types (no CoT for mental health topics)
3. Built a safety evaluation suite that ran alongside accuracy evaluations
4. Added output classifiers as a final safety layer
The takeaway: When you make a model more capable, you make it more capable at everything—including the things you don’t want. Every capability improvement requires a safety re-evaluation.
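The first two fixes amount to a routing decision made before any chain-of-thought text is attached. The sketch below illustrates that control flow only. The keyword check is a deliberately crude placeholder for a trained safety classifier, and every name in it (`keyword_risk_classifier`, `route_query`, `SAFE_REDIRECT`) is hypothetical rather than taken from the team's system.

```python
from typing import Callable

COT_SUFFIX = "Think step by step about what conditions might cause these symptoms."
SAFE_REDIRECT = "SAFETY_REDIRECT"


def keyword_risk_classifier(message: str) -> bool:
    """Placeholder risk check. A real system would use a trained classifier."""
    risky_terms = ("hopeless", "hurt myself", "self-harm")
    return any(term in message.lower() for term in risky_terms)


def route_query(message: str,
                is_risky: Callable[[str], bool] = keyword_risk_classifier) -> str:
    """Decide the prompt path before any chain-of-thought text is added."""
    if is_risky(message):
        # Risky queries never reach the CoT prompt; they go to a fixed safe path.
        return SAFE_REDIRECT
    return f"{message}\n\n{COT_SUFFIX}"


print(route_query("I have a sore throat and a mild fever"))
print(route_query("I feel hopeless lately"))
```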
CoT isn’t universally beneficial. It can hurt performance when:
Task is simple: For straightforward tasks like sentiment classification or translation, CoT adds latency and cost without improving accuracy. The model already “knows” the answer.
Reasoning leads astray: The model might generate plausible-sounding but incorrect reasoning, then be committed to the wrong answer. This is especially problematic for:
- Tasks outside the model’s competence
- Questions with misleading information
- Problems requiring knowledge the model lacks
Token budget is limited: CoT can increase output length 5-10x. If you’re paying per token or have strict latency requirements, this matters.
Decision framework:
| Task Type | CoT Benefit | Recommendation |
|---|---|---|
| Multi-step math | High | Always use |
| Logical reasoning | High | Always use |
| Code debugging | Medium-High | Use for complex bugs |
| Classification | Low | Skip unless accuracy-critical |
| Translation | None/Negative | Never use |
| Creative writing | Variable | Use for planning, not generation |
| Factual Q&A | Low | Skip for direct facts |
Research Foundation: Key Papers
“Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (Wei et al., 2022) The foundational CoT paper. Demonstrated that few-shot examples with reasoning dramatically improve performance on arithmetic, commonsense, and symbolic reasoning tasks. Key insight: benefits scale with model size.
“Self-Consistency Improves Chain of Thought Reasoning in Language Models” (Wang et al., 2022) Introduced the self-consistency technique. Showed that sampling multiple reasoning paths and taking the majority answer improves accuracy beyond single-chain CoT.
“Large Language Models are Zero-Shot Reasoners” (Kojima et al., 2022) Discovered that simply adding “Let’s think step by step” improves zero-shot performance dramatically. The phrase acts as a trigger for reasoning patterns learned during pretraining.
“Towards Revealing the Mystery behind Chain of Thought” (Feng et al., 2023) Theoretical analysis showing that CoT increases effective computational depth. The model uses generated text as external memory, circumventing the fixed-depth limitation of transformer layers.
Structured Outputs for Production
Why Structure Matters
Free-form text is easy for humans to read but hard for systems to process. Production applications need structured outputs: JSON objects, function calls, filled templates. The challenge: coaxing a language model to produce syntactically valid, semantically correct structured data.
The reliability hierarchy:
1. Syntactic validity: Is it valid JSON?
2. Schema compliance: Does it have the right fields and types?
3. Semantic correctness: Are the values meaningful and accurate?
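Modern APIs address level 1 with JSON mode. Some providers also offer schema-constrained output that covers level 2, but you should still validate, and level 3 always remains your responsibility.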
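The three levels can be checked one after another. The following is a minimal sketch using only the standard library; the field names, the allowed entity types, and the function name `check_levels` are assumptions chosen for illustration, not a fixed API.

```python
import json

# Illustrative schema: field name -> expected Python type (an assumption).
REQUIRED_FIELDS = {"name": str, "entity_type": str, "confidence": float}
ALLOWED_TYPES = {"PERSON", "ORG", "LOCATION", "DATE"}


def check_levels(raw: str) -> int:
    """Return the highest reliability level (0-3) that `raw` satisfies."""
    # Level 1: syntactic validity
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return 0
    # Level 2: schema compliance (required fields with the right types)
    if not isinstance(data, dict):
        return 1
    for field, expected in REQUIRED_FIELDS.items():
        value = data.get(field)
        if expected is float:
            # Accept ints as numbers, but not booleans
            ok = isinstance(value, (int, float)) and not isinstance(value, bool)
        else:
            ok = isinstance(value, expected)
        if not ok:
            return 1
    # Level 3: semantic correctness (values fall in the allowed ranges)
    if data["entity_type"] not in ALLOWED_TYPES or not 0 <= data["confidence"] <= 1:
        return 2
    return 3
```

A refusal such as `{"error": "I cannot help with that"}` passes level 1 and stops there, which is why JSON mode alone is not enough.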
JSON Mode and Schema Enforcement
# JSON mode guarantees valid JSON syntax (OpenAI's `response_format`).
# Note: this flag is OpenAI-specific. Anthropic has no `response_format`
# parameter; there you force a tool call instead (see "Function Calling
# and Tool Use" below, or use the `instructor` library shown earlier).
# Assumes `client` is an already-configured OpenAI client.
response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"}
)

# But you must still specify and validate the schema
import json
from typing import List, Optional

from pydantic import BaseModel, Field

class ExtractedEntity(BaseModel):
    name: str = Field(description="The entity name as it appears in text")
    entity_type: str = Field(description="One of: PERSON, ORG, LOCATION, DATE")
    confidence: float = Field(ge=0, le=1, description="Confidence score 0-1")
    # default=None makes the field truly optional (otherwise Pydantic v2 requires it)
    context: Optional[str] = Field(default=None, description="Surrounding text for verification")

class ExtractionResult(BaseModel):
    entities: List[ExtractedEntity]
    raw_text: str

def extract_entities(text: str) -> ExtractionResult:
    # Serialize the schema as real JSON rather than a Python dict repr
    schema_json = json.dumps(ExtractionResult.model_json_schema(), indent=2)
    prompt = f"""Extract named entities from this text.

<text>
{text}
</text>

Return a JSON object matching this schema:
{schema_json}"""
    response = client.chat.completions.create(
        model="gpt-5.5",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )
    # Raises pydantic.ValidationError if the output does not match the schema
    return ExtractionResult.model_validate_json(
        response.choices[0].message.content
    )

Function Calling and Tool Use
Modern APIs support structured function calling, where the model decides when to call functions and extracts arguments:
# Tool definition in Anthropic's format; `input_schema` is a JSON Schema
tools = [
    {
        "name": "search_database",
        "description": "Search the product database for items matching criteria",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search terms"},
                "category": {
                    "type": "string",
                    "enum": ["electronics", "clothing", "home", "sports"]
                },
                "max_price": {"type": "number", "description": "Maximum price in USD"},
                "in_stock": {"type": "boolean", "description": "Only show in-stock items"}
            },
            "required": ["query"]
        }
    }
]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    messages=[{
        "role": "user",
        "content": "Find me running shoes under $100 that are in stock"
    }],
    tools=tools
)

if response.stop_reason == "tool_use":
    # The tool call may follow a text block, so search for it instead of taking content[0]
    tool_call = next(block for block in response.content if block.type == "tool_use")
    # tool_call.name == "search_database"
    # tool_call.input == {"query": "running shoes", "category": "sports",
    #                     "max_price": 100, "in_stock": True}

Why function calling works better than JSON prompting:
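- The model is specifically trained on function-calling patterns
- Some providers constrain decoding so that arguments conform to the declared schema; where they don't, validate the arguments yourself
- Arguments arrive as parsed, typed values (numbers, booleans, enum strings) rather than text you must parse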
Retry Strategies for Structured Outputs
Even with JSON mode, extraction can fail. Production systems need retry logic:
import time

from pydantic import BaseModel, ValidationError

def extract_with_retry(text: str, schema: type[BaseModel],
                       max_retries: int = 3) -> BaseModel:
    """Extract structured data with exponential backoff retry."""
    # build_extraction_prompt and call_llm_json_mode are helpers defined elsewhere
    base_prompt = build_extraction_prompt(text, schema)
    prompt = base_prompt  # built once, so error feedback survives into the next attempt
    for attempt in range(max_retries):
        response = call_llm_json_mode(prompt)
        try:
            return schema.model_validate_json(response)
        except ValidationError as e:
            if attempt == max_retries - 1:
                raise
            # Include error in retry prompt for self-correction
            prompt = f"""{base_prompt}

Your previous response had validation errors:
{e}

Please fix these issues and try again."""
            time.sleep(2 ** attempt)  # Exponential backoff
    raise RuntimeError("Max retries exceeded")

Advanced pattern: Extraction verification
import json

def verified_extraction(text: str, schema: type[BaseModel]) -> tuple[BaseModel, float]:
    """Extract and verify with confidence score."""
    # First pass: extraction
    result = extract_with_retry(text, schema)
    # Second pass: verification
    verify_prompt = f"""Verify this extraction is correct.

Original text:
{text}

Extracted data:
{result.model_dump_json(indent=2)}

For each field, is the extraction correct? Rate overall confidence 0-1.
Return JSON: {{"correct": bool, "issues": [str], "confidence": float}}"""
    # call_llm_json_mode returns a JSON string, so parse it before indexing
    verification = json.loads(call_llm_json_mode(verify_prompt))
    return result, float(verification["confidence"])

Real-World Case Study: Invoice Processing
A fintech company processes 100,000 invoices monthly with varying formats: PDFs, scanned images, emails. They needed to extract:
- Vendor name and address
- Invoice number and date
- Line items with quantities and prices
- Total and tax amounts
Challenge: Invoices have no standard format. Field labels vary (“Total”, “Amount Due”, “Balance”, “Grand Total”). Layout varies (tables, lists, free text).
Solution architecture:
1. Document preprocessing: OCR for images, PDF text extraction
2. Initial extraction: Claude with structured output schema
3. Verification pass: Second LLM call to verify amounts add up
4. Human review queue: Low-confidence extractions flagged for review
Key prompt engineering insights:
- Including 3 diverse invoice examples in the prompt improved accuracy from 76% to 91%
- Explicitly listing field aliases (“Total” OR “Amount Due” OR “Balance”) helped significantly
- Asking the model to quote the source text for each extracted field enabled verification
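The quote-the-source insight lends itself to a cheap programmatic check before any second LLM call. The sketch below assumes each extracted field is returned as a dict with `value` and `quote` keys. That shape, and the name `verify_quotes`, are illustrative assumptions rather than the company's actual format.

```python
def verify_quotes(source_text: str, extraction: dict) -> dict:
    """Map each field name to True if its supporting quote appears in the source.

    Whitespace and case are normalized so OCR line breaks do not cause
    false mismatches.
    """
    normalized_source = " ".join(source_text.split()).lower()
    results = {}
    for field, item in extraction.items():
        quote = " ".join(item["quote"].split()).lower()
        # An empty quote never counts as supporting evidence
        results[field] = bool(quote) and quote in normalized_source
    return results


invoice_text = "ACME Corp\nInvoice #1042\nAmount Due:   $1,250.00"
extraction = {
    "invoice_number": {"value": "1042", "quote": "Invoice #1042"},
    "total": {"value": 1250.00, "quote": "Grand Total: $1,250.00"},
}
checks = verify_quotes(invoice_text, extraction)
```

Here the `total` field fails the check because the model quoted a label ("Grand Total") that never appears in the document, which is exactly the kind of extraction worth routing to human review.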
Production metrics:
- 91% full automation (no human review)
- 99.7% accuracy after human review
- Processing time: 3 seconds per invoice (vs. 5 minutes manual)
What people do: Enable JSON mode and assume the output will match their expected schema.
Why it fails: JSON mode only guarantees syntactically valid JSON. It doesn’t guarantee the right fields, correct types, or sensible values. A response like {"error": "I cannot help with that"} is valid JSON but breaks your parsing.
Fix: Always define your schema explicitly in the prompt, and always validate the response against a Pydantic model or JSON Schema before using it. Implement fallback handling for schema violations.
Temperature and Sampling Strategies
“I’ve seen teams spend weeks optimizing prompts when the real issue was temperature. One customer support team had temperature at 0.9 because ‘we wanted the responses to sound natural.’ Their bot was inventing return policies. We dropped it to 0.2 and their accuracy jumped 40%. Save your creativity budget for where you need it.”
— Staff AI Engineer at enterprise SaaS company
Understanding Temperature
Temperature controls the randomness of token selection. But what does that actually mean?
The model produces a score (logit) for every token in the vocabulary at each position. Temperature divides these logits before the softmax turns them into the probabilities used for sampling:
adjusted_probability(token) = softmax(logit(token) / temperature)
Low temperature (0.0-0.3):
- Distribution becomes more peaked
- Highest-probability token is almost always selected
- Output is near-deterministic and focused
- Good for: factual answers, structured outputs, classification
Medium temperature (0.5-0.8):
- Some randomness, but probable tokens favored
- Balance of reliability and variety
- Good for: general conversation, explanations
High temperature (0.9-1.5):
- Distribution flattens
- Low-probability tokens have a chance
- Output is creative and unpredictable
- Good for: brainstorming, creative writing, exploration
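To make the scaling concrete, here is a small sketch in plain Python that applies the temperature formula to three made-up logits. The function name and the example logits are assumptions for illustration; real models do this over the full vocabulary.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Temperature 0 is usually implemented as greedy (argmax) decoding instead
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the top token's share shrinking as temperature rises: the ranking of tokens never changes, only how much probability mass the lower-ranked ones receive.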
The Temperature-Task Matrix
| Task | Recommended Temperature | Rationale |
|---|---|---|
| Code generation | 0.0-0.2 | Precision matters; syntax errors are costly |
| JSON extraction | 0.0 | Determinism required; same input should give same output |
| Classification | 0.0 | Consistency for testing and reliability |
| Summarization | 0.3-0.5 | Some variety in phrasing, but faithful to source |
| Conversation | 0.7-0.9 | Natural variation; avoid repetitive responses |
| Creative writing | 0.8-1.2 | Surprise and novelty valued |
| Brainstorming | 1.0-1.5 | Want unusual ideas; high variance is the goal |
What people do: Set temperature=1.0 for “creative” tasks and temperature=0 for “factual” tasks, believing temperature controls the quality or creativity of outputs.
Why it fails: Temperature only controls sampling randomness, not the model’s underlying capabilities. A factually wrong answer at temperature=0 is still wrong. A creative prompt at temperature=0 can still produce creative output—it just picks the most likely tokens. High temperature doesn’t make the model more creative; it makes it more random, which can produce both unexpected good ideas AND nonsensical outputs.
Fix: Use temperature to control consistency, not creativity. For tasks where you want the same input to produce the same output (testing, caching, structured extraction), use temperature=0. For tasks where variety is valuable (brainstorming, conversation), use higher temperature. The creativity comes from your prompt design, not the temperature setting.
Top-p (Nucleus) Sampling
Top-p sampling is an alternative to temperature that’s often more intuitive:
Select from the smallest set of tokens whose cumulative probability >= p
With top-p = 0.9, the model considers only tokens that collectively account for 90% of the probability mass. Rare tokens below this threshold are excluded.
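The selection rule can be sketched in a few lines. This is an illustration over a toy four-token distribution, not any provider's implementation; `top_p_filter` and the token probabilities are made up for the example.

```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches p, then renormalize so the kept set sums to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


probs = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.05}
nucleus = top_p_filter(probs, p=0.9)
```

With `p=0.9` the three most likely tokens are kept and "zebra" is excluded. Sampling then proceeds over the renormalized nucleus.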
Temperature vs. Top-p:
- Temperature: Scales all probabilities, can make rare tokens common
- Top-p: Truncates the distribution, prevents truly rare tokens
Common combinations:
- temperature=0.7, top_p=0.9: Balanced creativity with guardrails
- temperature=0.0, top_p=1.0: Deterministic (temperature overrides)
- temperature=1.0, top_p=0.5: High creativity but only from likely tokens
Sampling for Self-Consistency
When using self-consistency (multiple reasoning chains), sampling strategy matters:
from collections import Counter

def self_consistent_with_temperature_sweep(question: str,
                                           temperatures: tuple[float, ...] = (0.5, 0.7, 0.9)
                                           ) -> str:
    """Generate chains at different temperatures for diversity."""
    # generate and extract_answer are helpers defined elsewhere
    answers = []
    for temp in temperatures:
        for _ in range(3):  # 3 chains per temperature
            response = generate(
                prompt=f"{question}\n\nLet's solve step by step:",
                temperature=temp
            )
            answers.append(extract_answer(response))
    # Majority vote over the final answers
    return Counter(answers).most_common(1)[0][0]

Research finding: Diversity in reasoning paths matters more than raw chain count. Varying temperature produces more diverse chains than many chains at the same temperature.
What people do: Set temperature to 0 expecting identical outputs every time, then panic when outputs occasionally vary.
Why it fails: Temperature 0 means “always sample the highest-probability token,” but ties exist. When two tokens have equal probability, the selection can vary. Additionally, infrastructure changes (batching, caching, floating-point rounding) can cause subtle differences.
Fix: If you need true determinism for testing, use seed parameters where available. For production, design your system to handle minor output variations—don’t rely on exact string matching.
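One way to avoid exact string matching for structured outputs is to compare parsed values instead. The sketch below is a minimal example using only the standard library. The function names and the tolerance default are assumptions, and free-form text would need a different strategy (for example, semantic similarity).

```python
import json


def same_structured_output(a: str, b: str, float_tol: float = 1e-6) -> bool:
    """Compare two JSON outputs by parsed value rather than by exact string."""
    return _equal(json.loads(a), json.loads(b), float_tol)


def _equal(x, y, tol: float) -> bool:
    if isinstance(x, dict) and isinstance(y, dict):
        # Key order is irrelevant; values are compared recursively
        return x.keys() == y.keys() and all(_equal(x[k], y[k], tol) for k in x)
    if isinstance(x, list) and isinstance(y, list):
        return len(x) == len(y) and all(_equal(i, j, tol) for i, j in zip(x, y))
    if isinstance(x, str) and isinstance(y, str):
        # Ignore differences in whitespace only
        return " ".join(x.split()) == " ".join(y.split())
    if isinstance(x, bool) or isinstance(y, bool):
        # Booleans must match exactly and never equal numbers
        return x is y
    if isinstance(x, (int, float)) and isinstance(y, (int, float)):
        return abs(x - y) <= tol
    return x == y
```

A regression test built on this helper tolerates reordered keys, `1` versus `1.0`, and stray whitespace, while still failing when a value actually changes.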
Context Window Management
2018 (BERT): 512 tokens. The original BERT used learned absolute position embeddings with a maximum length of 512, hard-limiting context. This was enough for most NLU tasks but inadequate for document-level reasoning.
2020 (GPT-3): 2,048 tokens (~1,500 words). Seemed generous at the time, but practitioners quickly learned to economize every token. Prompt engineering was as much about compression as communication.
2022 (ChatGPT/GPT-3.5): 4,096 tokens. Doubled the budget, enabling longer conversations and more context. Still tight for serious RAG applications.
2023 (GPT-4, Claude 2): 8K-100K tokens. GPT-4 launched with 8K and 32K variants, and Claude 2 followed mid-year with 100K. A step change that enabled fitting entire documents in context. Made “stuff it all in the prompt” a viable strategy for many applications.
Late 2023 (Claude 2.1, GPT-4 Turbo): 128K-200K tokens. Effectively unlimited for most single-document use cases. But research revealed the “lost-in-the-middle” problem: more context doesn’t mean better utilization.
2024-2025 (Gemini 1.5, Claude 3+): 1M+ tokens. Entire codebases and book-length documents fit in context. Shifts the constraint from “does it fit?” to “can the model use it effectively?”
The pattern: Context windows grew ~250x in about five years (512 → 128K), and roughly 2,000x once 1M-token models arrived, but quality of context use didn’t scale proportionally. The art shifted from fitting information in to organizing information for effective retrieval-by-attention.
The Context Budget
Every prompt consumes a budget:
Total Context = System Prompt + History + Retrieved Documents + User Query + Reserved Output
For a model with 128K context, you might allocate:
- System prompt: 2K tokens (persistent instructions)
- Retrieved documents: 80K tokens (RAG context)
- Conversation history: 30K tokens (multi-turn context)
- User query: 1K tokens (current request)
- Reserved output: 15K tokens (response space)
class ContextBudget:
    def __init__(self, max_tokens: int, tokenizer):
        self.max_tokens = max_tokens
        self.tokenizer = tokenizer

    def count(self, text: str) -> int:
        """Count tokens; assumes the tokenizer exposes encode()."""
        return len(self.tokenizer.encode(text))

    def allocate(self, system: str, documents: list[str],
                 history: list[dict], query: str,
                 reserved_output: int = 4000) -> dict:
        """Allocate context budget across components."""
        system_tokens = self.count(system)
        query_tokens = self.count(query)
        available = self.max_tokens - system_tokens - query_tokens - reserved_output
        if available <= 0:
            raise ValueError("System prompt, query, and reserved output exceed the context window")
        # Split what remains: 30% for recent history, 70% for retrieved documents
        history_budget = int(available * 0.3)
        doc_budget = int(available * 0.7)
        # trim_history and trim_documents are helper methods not shown here
        trimmed_history = self.trim_history(history, history_budget)
        trimmed_docs = self.trim_documents(documents, doc_budget)
        return {
            "system": system,
            "documents": trimmed_docs,
            "history": trimmed_history,
            "query": query
        }

The Lost-in-the-Middle Problem
Liu et al. (2024) discovered that LLMs don’t use long contexts uniformly. Information in the middle of the context is often ignored, while information at the beginning and end receives more attention.
Implications for prompt design:
1. Put the most important information at the start or end
2. Don’t rely on the model finding needles in haystacks
3. Quality of context matters more than quantity
Mitigation strategies:
def position_aware_context(documents: list[str], query: str,
                           relevance_scores: list[float]) -> str:
    """Position documents by relevance: most relevant first and last."""
    # `query` is unused here; it is kept so callers can swap in a re-scoring step
    # Sort on the score only, so tied scores never fall back to comparing documents
    ranked = sorted(zip(relevance_scores, documents),
                    key=lambda pair: pair[0], reverse=True)
    sorted_docs = [doc for _, doc in ranked]
    if len(sorted_docs) <= 2:
        return "\n\n".join(sorted_docs)
    # Most relevant first, second-most-relevant last, rest in middle
    positioned = [sorted_docs[0]]
    positioned.extend(sorted_docs[2:])
    positioned.append(sorted_docs[1])
    return "\n\n".join(positioned)

Conversation Summarization
For multi-turn conversations, summarize older exchanges:
def manage_conversation(history: list[dict], max_tokens: int,
                        summarization_threshold: float = 0.7) -> list[dict]:
    """Summarize old messages when history exceeds threshold."""
    # count_tokens, format_messages, and generate are helpers defined elsewhere
    total_tokens = sum(count_tokens(m["content"]) for m in history)
    if total_tokens < max_tokens * summarization_threshold:
        return history  # Still fits
    # Keep recent messages (40% of budget)
    recent_budget = int(max_tokens * 0.4)
    recent = []
    recent_tokens = 0
    for msg in reversed(history):
        msg_tokens = count_tokens(msg["content"])
        if recent_tokens + msg_tokens > recent_budget:
            break
        recent.insert(0, msg)
        recent_tokens += msg_tokens
    # Summarize everything older
    old_messages = history[:-len(recent)] if recent else history
    summary = generate(
        prompt=f"""Summarize this conversation, preserving key decisions,
facts mentioned, and user preferences:

{format_messages(old_messages)}

Summary:"""
    )
    return [{"role": "system", "content": f"Previous conversation summary:\n{summary}"}] + recent

Prompt Caching at Scale
When to Use Prompt Caching
Caching ROI formula: Savings = (cached_tokens × 0.9 × price_per_token) - cache_management_overhead
How Caching Works
When you send a prompt, the model computes internal states (key-value caches) for every token. Prompt caching stores these states so subsequent requests with the same prefix skip the computation.
Request 1: [System prompt | Document | Question 1]
├───────────┴──────────┤
Cached after first request
Request 2: [System prompt | Document | Question 2]
├───────────┴──────────┤
Cache hit! Reuse computation
Cost savings: Cached tokens are typically 90% cheaper than computed tokens. For a 10K token document, answering 100 questions costs:
- Without caching: 100 x 10K = 1M input tokens
- With caching: 10K + (100 x 0.1 x 10K) = 110K effective tokens
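The same arithmetic can be wrapped in a small helper for other document sizes and request counts. This sketch assumes the first request pays full price to fill the cache and every later request reads the prefix at 10% of the normal rate. It ignores the per-question tokens and any cache-write surcharge a provider may add. The function name and parameters are illustrative.

```python
def effective_input_tokens(prefix_tokens: int, n_requests: int,
                           cached_price_ratio: float = 0.1) -> float:
    """Full-price-equivalent input tokens for a prefix shared by n_requests."""
    if n_requests <= 0:
        return 0.0
    first = prefix_tokens  # the first request computes the prefix at full price
    rest = (n_requests - 1) * prefix_tokens * cached_price_ratio  # cache reads
    return first + rest


without_cache = 100 * 10_000
with_cache = effective_input_tokens(10_000, 100)
savings = 1 - with_cache / without_cache
```

Counting this way gives about 109K effective tokens for the example above. The 110K figure rounds up by charging the cached rate on all 100 requests on top of the initial fill. Either way the saving is close to 89%.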
Designing for Cacheability
Cache hits require exact prefix matches. Structure prompts accordingly:
# Good: Cacheable prefix (stable content first, variable content last)
#   system_instructions: same across requests
#   document_content: same for a batch of questions
#   user_question: varies per request
prompt = f"""<system>
{system_instructions}
</system>
<documents>
{document_content}
</documents>
<query>
{user_question}
</query>"""

# Bad: Variable content early, so requests share almost no prefix
prompt = f"""Answer this question: {user_question}
Using these documents:
{document_content}
Follow these rules:
{system_instructions}"""

Production Caching Patterns
Batch processing with shared context:
async def batch_qa_with_caching(document: str,
                                questions: list[str]) -> list[str]:
    """Answer multiple questions, maximizing cache hits."""
    # Assumes `client` is an anthropic.AsyncAnthropic instance and that
    # cluster_similar is a helper defined elsewhere.
    # Sort questions by similarity for consecutive cache hits
    # (Similar questions likely share more prefix)
    # Note: answers are returned in this sorted order, not the input order.
    sorted_questions = cluster_similar(questions)
    results = []
    for question in sorted_questions:
        response = await client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,  # required by the Messages API
            system=[{
                "type": "text",
                "text": f"Reference document:\n{document}",
                "cache_control": {"type": "ephemeral"}
            }],
            messages=[{"role": "user", "content": question}]
        )
        # response.content is a list of content blocks; take the first block's text
        results.append(response.content[0].text)
    return results

Monitoring cache effectiveness:
def analyze_cache_metrics(responses: list) -> dict:
    """Analyze caching effectiveness across requests."""
    total_input = 0
    cache_hits = 0
    for response in responses:
        usage = response.usage
        # usage is an SDK object, not a dict; cache fields may be absent or None
        cached = getattr(usage, "cache_read_input_tokens", 0) or 0
        written = getattr(usage, "cache_creation_input_tokens", 0) or 0
        # On Anthropic's API, input_tokens counts only uncached tokens,
        # so add cache reads and writes back to get the full input size
        total_input += usage.input_tokens + cached + written
        cache_hits += cached
    hit_rate = cache_hits / total_input if total_input > 0 else 0
    # Estimate cost savings (cached tokens ~90% cheaper)
    savings_rate = hit_rate * 0.9
    return {
        "total_input_tokens": total_input,
        "cache_hit_tokens": cache_hits,
        "hit_rate": hit_rate,
        "estimated_cost_savings": f"{savings_rate:.1%}"
    }

Real-World Case Study: Legal Document Analysis
A legal tech company analyzes contracts for compliance issues. Each contract (averaging 50K tokens) needs 20-30 different checks (data privacy clauses, indemnification terms, etc.).
Challenge: Without caching, analyzing one contract = 50K x 25 = 1.25M input tokens. At scale (1000 contracts/day), costs were prohibitive.
Caching strategy:
1. Embed contract once in system prompt with cache control
2. Run all 25 checks sequentially (each hits cache)
3. Batch contracts by type (similar contracts share some cached instructions)
Results:
- Cache hit rate: 94%
- Cost reduction: 87%
- Latency improvement: 3.2x faster (cache hits skip computation)
Key insight: They discovered that ordering checks by similarity increased hit rate from 89% to 94%. Checks that reference similar contract sections benefit from shared cache.
Advanced Prompting Patterns
ReAct: Reasoning and Acting
ReAct (Yao et al., 2022) interleaves reasoning with tool use, allowing models to think about what information they need and then retrieve it:
Question: What is the population of the capital of the country that hosted the 2024 Olympics?
Thought: I need to find what country hosted the 2024 Olympics, then find its capital,
then find that city's population.
Action: search("2024 Olympics host country")
Observation: France hosted the 2024 Summer Olympics.
Thought: France hosted the Olympics. The capital of France is Paris.
Now I need Paris's population.
Action: search("Paris France population 2024")
Observation: The population of Paris is approximately 2.1 million in the city proper,
or 12.5 million in the metropolitan area.
Thought: I have the information needed. The capital of France (2024 Olympics host)
is Paris, with a population of about 2.1 million (city) or 12.5 million (metro).
Final Answer: The population of Paris, capital of France (2024 Olympics host),
is approximately 2.1 million in the city proper.
Implementation pattern:
from typing import Callable

def react_agent(question: str, tools: dict[str, Callable],
                max_steps: int = 10) -> str:
    """ReAct agent with thought-action-observation loop."""
    # generate, format_tool_descriptions, and parse_action are helpers defined elsewhere
    prompt = f"""Answer this question using available tools.

Tools:
{format_tool_descriptions(tools)}

Question: {question}

Use this format:
Thought: [your reasoning about what to do next]
Action: tool_name(arguments)
Observation: [tool result will appear here]
... (repeat until you have the answer)
Thought: [final reasoning]
Final Answer: [your answer]

Begin:
Thought:"""
    for _ in range(max_steps):
        response = generate(prompt)
        # Drop any observation the model invented; real ones come from the tools
        response = response.split("Observation:")[0]
        prompt += response
        if "Final Answer:" in response:
            return response.split("Final Answer:")[-1].strip()
        if "Action:" in response:
            action = parse_action(response)
            result = tools[action.name](**action.args)
            prompt += f"\nObservation: {result}\n\nThought:"
    return "Max steps reached without answer"

Prompt Chaining for Complex Tasks
Break complex tasks into focused sub-tasks:
def analyze_document(document: str) -> dict:
    """Multi-stage document analysis with prompt chaining."""
    # Stage 1: Extract key facts (focused, reliable)
    facts = generate(f"""Extract key facts from this document as a bullet list.
Focus on: dates, names, numbers, and decisions.

Document: {document}

Key facts:""")
    # Stage 2: Identify themes (analytical)
    themes = generate(f"""Based on these facts, identify 3-5 main themes.

Facts:
{facts}

Themes:""")
    # Stage 3: Generate summary (synthesizing previous stages)
    summary = generate(f"""Write an executive summary based on these facts and themes.

Facts:
{facts}

Themes:
{themes}

Executive Summary (2-3 paragraphs):""")
    return {
        "facts": facts,
        "themes": themes,
        "summary": summary
    }

Why chaining works:
- Each stage can be optimized independently
- Intermediate results are explicit and debuggable
- Can use different models for different stages (fast for extraction, powerful for synthesis)
- Cache intermediate results for reuse
Constitutional AI Patterns
Inspired by Anthropic’s Constitutional AI (Bai et al., 2022), use self-critique to improve outputs:
def generate_with_critique(prompt: str, principles: list[str]) -> str:
    """Generate with constitutional self-critique."""
    # Initial generation
    draft = generate(prompt)
    # Self-critique against principles
    critique_prompt = f"""Review this response against these principles:
{chr(10).join(f'- {p}' for p in principles)}

Response to review:
{draft}

For each principle, note any violations. Then provide a revised response
that addresses the issues while maintaining accuracy.

Critique and revision:"""
    revised = generate(critique_prompt)
    # Extract final revised version (extract_revision is a helper defined elsewhere)
    return extract_revision(revised)

# Usage
principles = [
    "Be accurate and don't make claims not supported by the context",
    "Acknowledge uncertainty when present",
    "Be concise without losing important nuance",
    "Use professional but accessible language"
]
result = generate_with_critique(
    prompt="Explain the implications of this financial report...",
    principles=principles
)

Production Checklist
Before Deployment
Prompt robustness:
Cost optimization:
Latency management:
Monitoring in Production
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class PromptMetrics:
    prompt_version: str
    input_tokens: int
    output_tokens: int
    cache_hit_tokens: int
    latency_ms: float
    success: bool
    error_type: Optional[str]

def log_request(response, prompt_version: str, start_time: float):
    """Log metrics for monitoring and optimization."""
    usage = response.usage
    metrics = PromptMetrics(
        prompt_version=prompt_version,
        input_tokens=usage.input_tokens,
        output_tokens=usage.output_tokens,
        # usage is an SDK object, not a dict, so read the optional field with getattr
        cache_hit_tokens=getattr(usage, "cache_read_input_tokens", 0) or 0,
        latency_ms=(time.time() - start_time) * 1000,
        success=True,
        error_type=None
    )
    # Send to monitoring system (metrics_client is assumed to be configured elsewhere)
    metrics_client.log(metrics)

Alert on:
- Latency P95 exceeding SLA
- Error rate above threshold
- Cache hit rate dropping (indicates prompt changes)
- Cost per request increasing
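The first three alert conditions can be evaluated over a window of logged requests. The following is a minimal sketch with illustrative thresholds. The default SLA, error-rate limit, and cache-hit floor are assumptions to be replaced with your own targets, and a cost-per-request check would follow the same pattern against a baseline.

```python
def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not values:
        raise ValueError("p95 requires at least one value")
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding surprises
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def check_alerts(latencies_ms: list[float], successes: list[bool],
                 cache_hit_rate: float, *, sla_ms: float = 2000.0,
                 max_error_rate: float = 0.02,
                 min_cache_hit_rate: float = 0.8) -> list[str]:
    """Return the names of alert conditions that fired for this window."""
    alerts = []
    if p95(latencies_ms) > sla_ms:
        alerts.append("latency_p95")
    error_rate = 1 - sum(successes) / len(successes)
    if error_rate > max_error_rate:
        alerts.append("error_rate")
    if cache_hit_rate < min_cache_hit_rate:
        alerts.append("cache_hit_rate")
    return alerts
```

A single slow request in twenty does not move the P95, but two do, which is why percentile alerts are less noisy than alerts on the maximum.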
Key Takeaways
In-context learning works through attention - When you provide examples, attention heads identify patterns and apply them to new inputs. Example format, diversity, and positioning all matter for performance.
Chain-of-thought extends model computation - By generating intermediate reasoning steps, models use their output as working memory, enabling multi-step reasoning impossible in a single forward pass.
Structured outputs require explicit constraints - Use JSON mode, schemas, and explicit format instructions. Models generate probability distributions over tokens, not inherent understanding of structure.
Context is a limited resource to budget - Token limits, positional attention effects, and caching economics all influence how to structure prompts. Maximize shared prefixes for cost efficiency.
Temperature controls the explore-exploit tradeoff - Use temperature 0 for deterministic, structured outputs and higher values (0.7+) for creative tasks, accepting variability in exchange for diversity.
Summary
Prompt engineering is where understanding of language models meets practical system design. The techniques are grounded in how transformers process information:
In-context learning works because attention enables pattern matching. When you provide examples, the model’s attention heads identify patterns and apply them to new inputs. This explains why example format, diversity, and positioning matter.
Chain-of-thought prompting increases effective computation. By generating intermediate steps, the model uses its output as working memory, enabling multi-step reasoning that would be impossible in a single forward pass.
Structured outputs require explicit constraints. The model generates probability distributions over tokens; structured formats arise from training patterns and explicit prompting, not from an understanding of structure.
Context is a resource to be budgeted. Token limits, positional attention effects, and caching economics all influence how to structure prompts for production systems.
Temperature controls the explore-exploit tradeoff. Low temperatures give consistent, predictable outputs; high temperatures enable creativity at the cost of reliability.
Above all, remember that prompts are compiled, not written: a production prompt is the result of disciplined optimization passes over clarity, examples, constraints, and format—not a single act of authorship.
Key Decisions Framework
| Decision | Key Factors | Recommendation |
|---|---|---|
| Zero-shot vs. few-shot | Task novelty, format requirements | Few-shot for non-standard formats |
| Number of examples | Task complexity, token budget | 5-10 for most tasks |
| Chain-of-thought | Task requires reasoning | Use for math/logic, skip for classification |
| Temperature | Determinism vs. creativity | 0 for structured, 0.7+ for creative |
| Caching strategy | Request patterns, cost sensitivity | Maximize shared prefixes |
Connections to Other Chapters
- Chapter 5 (LLM Foundations): The attention mechanisms and tokenization covered there explain why prompt engineering techniques work at a mechanistic level.
- Chapter 7 (RAG Systems): RAG systems are essentially prompt engineering with retrieved context. The context management and prompt structure principles from this chapter apply directly.
- Chapter 8 (Agentic Systems): Agents use ReAct-style prompting for reasoning loops. The advanced patterns here are foundational.
- Chapter 15 (Evaluation): Systematic prompt evaluation requires the testing approaches outlined in that chapter.
Practical Exercises
Example selection experiment: For a classification task of your choice, compare random example selection vs. similarity-based selection vs. diversity-based selection. Measure accuracy on a held-out test set of 100 examples. Which strategy wins and why?
Chain-of-thought boundary: Find the task complexity boundary where CoT starts helping. Test on arithmetic problems of increasing complexity (1-step, 2-step, 5-step, 10-step). At what point does CoT significantly outperform direct prompting?
Temperature tuning: For a creative writing task, generate outputs at temperatures 0.3, 0.7, 1.0, and 1.3. Have 5 people rate outputs on creativity and coherence. Where’s the sweet spot?
Caching optimization: Take a document QA scenario with 50 questions about a single document. Measure token costs with and without caching. Then experiment with question ordering to maximize cache hits.
Prompt injection resistance: Design a customer service prompt that resists common prompt injection attacks. Test against: “Ignore previous instructions…”, “You are now…”, and attempts to extract the system prompt.
Self-Assessment Checkpoint
Conceptual Questions
Q1. [IC2] You’re using 5-shot prompting but accuracy is poor. What three adjustments would you try first?
Answer
(1) Example selection: are examples representative of the test cases? Try similarity-based selection (embed test query, find similar examples) or ensure diversity across categories. (2) Example format: is the output format consistent across examples? Inconsistent formatting confuses the model. (3) Example ordering: try putting the most relevant example last (recency bias). Also consider: Do you have edge cases covered? Are examples too similar to each other? Is 5 examples enough, or do you need more for this task complexity?
Q2. [IC2] Why does chain-of-thought prompting help with math problems but not with simple classification?
Answer
Math problems require multi-step reasoning: each step depends on previous results. CoT works by using generated tokens as working memory, enabling computation that spans multiple forward passes. Simple classification is a single-step pattern match: the model recognizes features and maps to a class. Adding CoT to classification adds unnecessary tokens that may introduce errors or inconsistencies. CoT helps when the task exceeds what a single forward pass can compute; it hurts when it adds complexity to simple tasks.
Q3. [Senior] A prompt with temperature=0 produces different outputs on different days. What’s happening?
Answer
Possible causes: (1) Provider model updates: they may have deployed a new version. (2) Floating-point non-determinism: even temp=0 may not be perfectly deterministic, because the order of GPU computations can vary between runs. (3) Caching issues: stale caches or different cache hits. (4) Input differences: timestamps or other time-varying context included in the prompt. (5) A/B testing: the provider may route requests to different model variants. (6) Load balancing across hardware with different numerical behavior. Mitigation: Use explicit seeds if supported, log the model version, normalize inputs, and log the exact prompt and output of each call for debugging.
Q4. [Senior] How do you decide between using structured output (JSON mode) vs. parsing free-form text?
Answer
Structured output when: (1) Output must be machine-parseable reliably. (2) Schema is well-defined and stable. (3) You need guaranteed valid JSON. (4) Downstream processing is automated. Free-form text when: (1) Output is for humans primarily. (2) Schema would be overly constraining. (3) You need flexibility in response format. (4) Task is inherently unstructured (creative writing). Trade-offs: Structured output may reduce quality slightly (the model is constrained) but dramatically improves reliability. For production pipelines, almost always prefer structured output. For user-facing content, often prefer free-form.
Q5. [Staff] You’re designing a prompt system for a product serving 1M users with diverse needs. How do you manage prompt versioning, A/B testing, and rollback?
Answer
Design: (1) Prompt registry: versioned prompts stored externally (not in code), with metadata (author, date, description, linked experiments). (2) Feature flags: control which users get which prompt version. (3) A/B framework: random assignment to variants, metric collection, statistical significance calculation. (4) Gradual rollout: 5%→25%→50%→100% with monitoring at each stage. (5) Instant rollback: a flag flip returns to the previous version, no deployment needed. (6) Evaluation pipeline: automated testing against a golden set before any rollout. (7) Audit logging: full history of which user saw which prompt, for debugging. Implementation: Treat prompts as configuration, not code. Decouple prompt changes from code deploys.
Spot the Problem
Problem 1. [IC2] This few-shot prompt for sentiment classification:
Classify the sentiment as positive or negative.
Example: "Great product, love it!" -> Positive
Example: "Terrible service" -> Negative
Example: "Best purchase ever!" -> Positive
Example: "Would not recommend" -> Negative
Example: "Amazing quality" -> Positive
Text: "The delivery was fast"
Sentiment:
Answer
Issues: (1) Slightly unbalanced examples: 3 positive, 2 negative, with the final example positive, so the model may lean positive. (2) All examples have clear sentiment: no neutral or ambiguous cases. (3) The test case (“delivery was fast”) might be neutral, but there is no neutral option or example. (4) Examples are very short; the test case is also short. If real data includes longer texts, these examples won’t help. Fix: Balance examples, include edge cases, match example complexity to real data.
Problem 2. [Senior] A developer adds CoT to every prompt for “consistency”:
def process(input):
    prompt = f"""Think step by step, then provide your answer.
Input: {input}
Response:"""
    return llm.generate(prompt)
Answer
Problems: (1) CoT overhead for simple tasks: adds latency and cost for tasks that don’t need reasoning. (2) Generic CoT instruction: no guidance on what steps to consider, so it may produce irrelevant reasoning. (3) No output structure: CoT will produce reasoning, but extracting the actual answer requires parsing. (4) May hurt simple tasks: CoT can introduce errors in straightforward pattern matching. Better: Use CoT selectively for reasoning tasks, provide specific reasoning guidance, and structure the output (e.g., “First analyze X, then Y, finally provide answer in format Z”).
Problem 3. [Staff] A production system’s caching approach:
cache_key = hash(system_prompt + user_message)
if cache_key in cache:
    return cache[cache_key]
Answer
Issues: (1) Cache invalidation: when do entries expire? Old answers may be stale if the underlying data changes. (2) Non-deterministic outputs: even with temp=0, LLM outputs may vary slightly. Caching assumes determinism. (3) Exact match only: “What’s the weather?” and “What is the weather?” are different keys but the same query. (4) No context: the same question from different users might need different answers (permissions, personalization). (5) Semantic caching opportunity missed: similar questions could share cached reasoning. Better: Add a TTL, consider semantic similarity for cache lookup, include relevant context in the key, and monitor cache hit rate and staleness.
Design Exercises
Exercise 1. [Senior] Design a prompt system for a legal document summarization tool. Requirements: Summaries must be accurate (no hallucination), preserve key dates/amounts/names, flag uncertainty, and work across contract types. Outline your prompt strategy, evaluation approach, and how you’d handle edge cases.
Guidance
Prompt strategy: (1) Strong grounding instructions: “Only include information explicitly stated in the document.” (2) Structured output: JSON with fields for key_dates, key_amounts, parties, obligations, uncertainties. (3) Extraction-first approach: extract facts, then summarize from the extractions. (4) Uncertainty flagging: an explicit field for “sections that were unclear or incomplete.” Evaluation: (1) Human lawyer review on a sample. (2) Automated fact checking: are the extracted dates and amounts present in the source? (3) Hallucination detection: is any summary claim not traceable to the source? Edge cases: (1) Very long contracts: chunk and summarize, then synthesize. (2) Handwritten addendums: OCR quality issues, flag low confidence. (3) Non-standard formats: degrade gracefully and request human review.
Exercise 2. [Staff] You’re building a prompt optimization system that automatically improves prompts based on feedback. Design the system architecture: how do you represent prompts, collect feedback, generate variations, evaluate improvements, and prevent degradation?
Guidance
Architecture: (1) Prompt representation: a structured format with editable components (instructions, examples, format specification). (2) Feedback collection: explicit (user ratings, corrections) and implicit (regeneration, session abandonment). Map feedback to specific prompt components. (3) Variation generation: LLM-based prompt mutation (“Given this prompt and these failure cases, suggest improvements.”), plus parametric variations (add an example, change wording). (4) Evaluation: automated golden set testing before any deployment. Compare new vs. current on held-out examples. (5) Staged rollout: never replace 100% immediately. A/B test variations, promote winners. (6) Regression prevention: maintain a “don’t break” test set of critical cases. Any prompt that fails these is rejected regardless of aggregate improvement. (7) Version control: full history, the ability to understand why a change was made, and rollback capability.
See Also
- Chapter 5: LLM/NLP Foundations - Understand transformer architecture and attention mechanisms that underpin all prompt engineering techniques
- Chapter 7: RAG Systems Deep Dive - Combines prompting with retrieval for knowledge-grounded applications
- Chapter 8: Agentic Systems - Advanced prompting patterns for tool use, planning, and autonomous behavior
- Appendix I: Quick Reference Cards - Prompt templates and cheat sheets for common use cases
Further Reading
Essential
Wei et al. (2022), “Chain-of-Thought Prompting Elicits Reasoning” - The foundational CoT paper. Changed how we approach reasoning in LLMs.
Liu et al. (2024), “Lost in the Middle: How Language Models Use Long Contexts” - How models use long contexts. Critical for RAG and context design.
Ouyang et al. (2022), “Training language models to follow instructions” (InstructGPT) - Explains why modern models follow instructions so well.
Deep Dives
Olsson et al. (2022), “In-context Learning and Induction Heads” - Mechanistic understanding of how few-shot learning actually works in transformers.
Wang et al. (2022), “Self-Consistency Improves Chain of Thought Reasoning” - Multiple reasoning paths for higher accuracy.
Yao et al. (2022), “ReAct: Synergizing Reasoning and Acting” - Foundation for agentic prompting patterns.