Appendix C: Paper Reading List (Annotated)

This appendix provides an annotated reading list of essential papers for AI engineers. Papers are organized by topic and ordered by importance within each section. Each entry includes a summary of key contributions, practical takeaways, and a reading-time estimate.


Foundational Papers

These papers establish the conceptual foundations that every AI engineer should understand.

Attention Is All You Need (2017)

Citation: Vaswani et al., “Attention Is All You Need,” NeurIPS 2017

Why Read It: This paper introduced the Transformer architecture that underlies virtually all modern LLMs. Understanding self-attention, multi-head attention, and positional encoding is fundamental.

Key Contributions:

  • Self-attention mechanism that processes sequences without recurrence
  • Multi-head attention for attending to different representation subspaces
  • Scaled dot-product attention with the √d_k scaling factor
  • Positional encoding to inject sequence order information

Practical Takeaways:

  • Attention cost grows as O(n²) in sequence length n, which is why long context windows are expensive
  • The encoder-decoder split is optional; decoder-only (GPT) and encoder-only (BERT) variants emerged later
  • The feed-forward sublayer after attention holds roughly two-thirds of each block's parameters in the original configuration (d_ff = 4·d_model), so much of the model's capacity sits there
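To make the scaled dot-product formula and the O(n²) cost concrete, here is a minimal single-head sketch in plain Python. It leaves out masking, multiple heads, and the learned projections, and the toy Q, K, V values are made up for illustration.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One score per key: n queries times n keys is the O(n^2) term.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Toy inputs (hypothetical): two tokens with two-dimensional vectors.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
```

Each query attends most strongly to the key it matches, so the first output row lands closer to the first value vector and the second row closer to the second.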

Reading Time: 2-3 hours for initial read, deeper understanding takes longer

Link: https://arxiv.org/abs/1706.03762


BERT: Pre-training of Deep Bidirectional Transformers (2018)

Citation: Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” NAACL 2019

Why Read It: BERT demonstrated that pre-training on large text corpora, followed by fine-tuning, dramatically improves performance on downstream tasks. This transfer learning paradigm now dominates NLP.

Key Contributions:

  • Masked language modeling (MLM) pre-training objective
  • Bidirectional context (unlike left-to-right GPT)
  • Next sentence prediction (NSP) task (later shown less important)
  • Fine-tuning paradigm for transfer learning

Practical Takeaways:

  • Pre-training + fine-tuning is more effective than training from scratch
  • Bidirectional models excel at understanding tasks (classification, extraction)
  • But they can’t generate text autoregressively

Reading Time: 2 hours

Link: https://arxiv.org/abs/1810.04805


Language Models are Unsupervised Multitask Learners (GPT-2, 2019)

Citation: Radford et al., “Language Models are Unsupervised Multitask Learners”

Why Read It: GPT-2 demonstrated that scaling language models enables zero-shot task performance. It introduced the paradigm of prompting models rather than fine-tuning them.

Key Contributions:

  • Zero-shot learning through prompting
  • Demonstration that scale improves capability
  • Decoder-only architecture for generation

Practical Takeaways:

  • Larger models can perform tasks they weren’t explicitly trained for
  • Prompting is a form of “soft” programming
  • Model capacity matters significantly

Reading Time: 1.5 hours

Link: https://openai.com/index/better-language-models/


Training language models to follow instructions (InstructGPT, 2022)

Citation: Ouyang et al., “Training language models to follow instructions with human feedback”

Why Read It: This paper describes how to align language models to follow human instructions using RLHF. It’s the foundation of ChatGPT and similar assistants.

Key Contributions:

  • Three-stage alignment process: supervised fine-tuning, reward modeling, PPO
  • Demonstration that a small aligned model can beat a much larger unaligned one: labelers preferred outputs from the 1.3B InstructGPT model over those from the 175B GPT-3
  • Human preference data collection methodology

Practical Takeaways:

  • Alignment is as important as capability
  • Human feedback is expensive but valuable
  • Reward hacking is a real risk

Reading Time: 2-3 hours

Link: https://arxiv.org/abs/2203.02155


Scaling and Training

Scaling Laws for Neural Language Models (2020)

Citation: Kaplan et al., “Scaling Laws for Neural Language Models”

Why Read It: Establishes power-law relationships between model size, data, compute, and loss. Essential for planning training runs and resource allocation.

Key Contributions:

  • Power-law scaling of loss with parameters, data, and compute
  • Optimal allocation of compute budget between model size and data
  • Early stopping recommendations

Practical Takeaways:

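  • Doubling parameters multiplies loss by a roughly constant factor (about 0.95 with the paper's fitted exponent α_N ≈ 0.076), so each doubling gives a similar relative, not absolute, improvement
  • For a fixed compute budget, the paper favors large models trained on relatively modest data and stopped before convergence (a ratio later revised by Chinchilla)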
  • These laws guide billion-dollar training decisions
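The power-law form for parameter scaling can be checked with a few lines of arithmetic. This sketch uses the constants reported by Kaplan et al. for the parameter-limited regime (α_N ≈ 0.076, N_c ≈ 8.8e13 non-embedding parameters); it assumes data and compute are not the bottleneck, and the function name is only illustrative.

```python
ALPHA_N = 0.076  # fitted exponent for parameter scaling (Kaplan et al.)
N_C = 8.8e13     # fitted constant, in non-embedding parameters

def loss_from_params(n_params):
    """Predicted loss L(N) = (N_c / N) ** alpha_N, in nats per token."""
    return (N_C / n_params) ** ALPHA_N

# Doubling the model size scales the loss by 2 ** -alpha_N, whatever the start size.
ratio_small = loss_from_params(2e9) / loss_from_params(1e9)
ratio_large = loss_from_params(2e11) / loss_from_params(1e11)
```

Both ratios come out near 0.95, which is why returns from scale look steady on a log axis but shrink in absolute terms.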

Reading Time: 2 hours (heavy on math)

Link: https://arxiv.org/abs/2001.08361


Training Compute-Optimal Large Language Models (Chinchilla, 2022)

Citation: Hoffmann et al., “Training Compute-Optimal Large Language Models”

Why Read It: Revised scaling laws showing that models should be trained on more data than previously thought. Influenced most subsequent LLM training runs.

Key Contributions:

  • Revised optimal data-to-parameters ratio (~20 tokens per parameter)
  • Showed GPT-3 was undertrained for its size
  • Chinchilla (70B) outperformed Gopher (280B) with same compute

Practical Takeaways:

  • More data often beats more parameters
  • Optimal ratios changed industry training practices
  • Data quality and quantity both matter enormously
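The ~20 tokens-per-parameter rule can be turned into a rough sizing calculation. The sketch below assumes the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6·N·D); it is a back-of-the-envelope tool, not the paper's fitting procedure, and the function name is hypothetical.

```python
import math

TOKENS_PER_PARAM = 20      # rule of thumb from the Chinchilla results
FLOPS_PER_PARAM_TOKEN = 6  # approximation: C ≈ 6 * N * D

def compute_optimal(flops_budget):
    """Return (parameters, tokens) that spend the budget at 20 tokens per parameter."""
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params

# Sanity check against Chinchilla itself: 70B parameters, 1.4T tokens.
budget = 6 * 70e9 * 1.4e12
n_params, n_tokens = compute_optimal(budget)
```

Feeding in Chinchilla's own budget recovers roughly 70B parameters and 1.4T tokens, which is a useful check that the two rules of thumb are consistent with each other.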

Reading Time: 2 hours

Link: https://arxiv.org/abs/2203.15556


FlashAttention: Fast and Memory-Efficient Exact Attention (2022)

Citation: Dao et al., “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”

Why Read It: FlashAttention and its successors are now used in most major training and inference stacks. Understanding its memory hierarchy optimization is valuable for performance engineering.

Key Contributions:

  • Fused attention kernels that minimize HBM reads/writes
  • Tiling approach that fits computation in SRAM
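  • Wall-clock speedups and memory that grows linearly rather than quadratically with sequence length, while computing exact (not approximate) attention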

Practical Takeaways:

  • Memory bandwidth, not compute, is often the bottleneck
  • Algorithm-level optimization can beat hardware upgrades
  • Prefer a FlashAttention-style fused kernel whenever your hardware and framework support it
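The tiling idea rests on an online softmax: attention for a query can be accumulated block by block while keeping only a running maximum, a running normalizer, and a running weighted sum. The sketch below shows that arithmetic in plain Python for a single query. It illustrates the numerical trick only; the real kernel runs on GPU tiles held in SRAM, and the function names and toy inputs here are hypothetical.

```python
import math

def attention_row_tiled(q, K, V, block=2):
    """Attention output for one query, visiting keys/values one block at a time."""
    m = float("-inf")        # running max of scores seen so far
    l = 0.0                  # running sum of exp(score - m)
    acc = [0.0] * len(V[0])  # running un-normalized output
    scale = 1.0 / math.sqrt(len(q))
    for start in range(0, len(K), block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in Kb]
        m_new = max(m, max(scores))
        correction = math.exp(m - m_new)  # rescale earlier partial results
        l *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, Vb):
            p = math.exp(s - m_new)
            l += p
            acc = [a + p * vj for a, vj in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]

def attention_row_reference(q, K, V):
    """Standard softmax attention for one query (materializes every score)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / total for j in range(len(V[0]))]

q = [0.5, -1.0]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.3, 0.3]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
tiled = attention_row_tiled(q, K, V, block=2)
reference = attention_row_reference(q, K, V)
```

The two results agree to floating-point precision, which is the sense in which the method is exact: the full score matrix is never stored, yet nothing is approximated.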

Reading Time: 2-3 hours (systems-heavy)

Link: https://arxiv.org/abs/2205.14135


Retrieval and RAG

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020)

Citation: Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”

Why Read It: Introduced RAG, which is now fundamental to production LLM applications. Combines retrieval with generation for better factuality.

Key Contributions:

  • Architecture combining retriever and generator
  • Marginalization over retrieved documents
  • Significant improvements on knowledge tasks

Practical Takeaways:

  • Retrieval grounds generation in facts
  • Quality of retrieved context directly impacts output quality
  • RAG is more efficient than putting everything in context

Reading Time: 2 hours

Link: https://arxiv.org/abs/2005.11401


Dense Passage Retrieval for Open-Domain Question Answering (2020)

Citation: Karpukhin et al., “Dense Passage Retrieval for Open-Domain Question Answering”

Why Read It: DPR showed that learned dense retrieval can outperform BM25. Foundation for modern semantic search.

Key Contributions:

  • Bi-encoder architecture for dense retrieval
  • In-batch negatives for efficient training
  • Strong QA results with simple architecture

Practical Takeaways:

  • Dense retrieval captures semantic similarity that BM25 misses
  • Hard negatives improve retrieval quality
  • Hybrid (dense + sparse) often works best
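In-batch negatives can be sketched as a softmax cross-entropy over a query-by-passage score matrix, where passage i is the positive for query i and every other passage in the batch serves as a negative. The vectors below are made-up stand-ins for encoder outputs, and the function name is hypothetical.

```python
import math

def in_batch_negative_loss(query_vecs, passage_vecs):
    """Mean negative log-likelihood of the matching passage for each query."""
    losses = []
    for i, q in enumerate(query_vecs):
        # Dot-product similarity against every passage in the batch.
        scores = [sum(a * b for a, b in zip(q, p)) for p in passage_vecs]
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_z - scores[i])  # -log softmax(scores)[i]
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_negative_loss(queries, [[5.0, 0.0], [0.0, 5.0]])
shuffled = in_batch_negative_loss(queries, [[0.0, 5.0], [5.0, 0.0]])
```

The loss is near zero when each query scores its own passage highest and large when the pairing is wrong, so a batch of B pairs yields B - 1 negatives per query with no extra encoding cost.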

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2004.04906


Self-RAG: Learning to Retrieve, Generate, and Critique (2023)

Citation: Asai et al., “Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection”

Why Read It: Advanced RAG that teaches models to decide when to retrieve and self-critique. Shows direction of RAG evolution.

Key Contributions:

  • Model learns when retrieval is needed
  • Self-reflection for assessing generation quality
  • Special tokens for retrieval decisions

Practical Takeaways:

  • Not every query needs retrieval
  • Self-critique can improve output quality
  • Training for retrieval decisions is possible

Reading Time: 2 hours

Link: https://arxiv.org/abs/2310.11511


Efficiency and Optimization

LoRA: Low-Rank Adaptation of Large Language Models (2021)

Citation: Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models”

Why Read It: LoRA made fine-tuning large models practical by training only a small number of parameters. Now standard for LLM adaptation.

Key Contributions:

  • Low-rank decomposition for weight updates
  • No additional inference latency
  • Dramatic reduction in trainable parameters

Practical Takeaways:

  • Fine-tune 7B+ models on consumer GPUs
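  • Small ranks usually suffice: r = 8 to 16 is a common default, and the paper shows even lower ranks working for some weight matrices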
  • Can merge adapters back into base weights
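The parameter savings and the merge step can be illustrated with plain arithmetic. This sketch follows the paper's update form W + (alpha / r)·B·A on nested lists; the 4096 x 4096 layer size and the helper names are assumptions chosen for illustration.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for full fine-tuning vs. a rank-`rank` LoRA update."""
    return d_out * d_in, rank * (d_out + d_in)

def merge_lora(W, B, A, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), so inference uses a single matrix."""
    scale = alpha / rank
    return [
        [W[i][j] + scale * sum(B[i][r] * A[r][j] for r in range(rank))
         for j in range(len(W[0]))]
        for i in range(len(W))
    ]

# A hypothetical 4096 x 4096 projection adapted with rank 8.
full, lora = lora_param_counts(4096, 4096, rank=8)

# Tiny merge example: B is 2 x 1, A is 1 x 2, so B @ A is a rank-1 2 x 2 update.
merged = merge_lora([[0.0, 0.0], [0.0, 0.0]], B=[[1.0], [2.0]], A=[[3.0, 4.0]], alpha=2, rank=1)
```

Here the adapter trains 65,536 parameters instead of 16,777,216 for that one matrix, and after merging there is no extra matrix multiply at inference time.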

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2106.09685


QLoRA: Efficient Finetuning of Quantized LLMs (2023)

Citation: Dettmers et al., “QLoRA: Efficient Finetuning of Quantized LLMs”

Why Read It: Combines 4-bit quantization with LoRA for even more efficient fine-tuning. Enabled fine-tuning 65B models on single GPUs.

Key Contributions:

  • 4-bit NormalFloat quantization
  • Double quantization for memory savings
  • Paged optimizers for memory spikes

Practical Takeaways:

  • Fine-tune 65B models on single 48GB GPU (per the paper; modern 70B models also fit)
  • Quality close to full fine-tuning
  • Standard tool for resource-constrained fine-tuning

Reading Time: 2 hours

Link: https://arxiv.org/abs/2305.14314


Speculative Decoding (2022-2023)

Citation: Leviathan et al., “Fast Inference from Transformers via Speculative Decoding”

Why Read It: The paper reports 2-3x inference speedups while keeping the output distribution identical to the target model's. Understanding why it works helps with other optimizations.

Key Contributions:

  • Draft-then-verify paradigm
  • Acceptance probability analysis
  • Parallel verification of multiple tokens

Practical Takeaways:

  • Works best when draft model is well-aligned with target
  • Speedup depends on acceptance rate
  • No quality degradation (exact sampling)
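The accept-or-resample rule is what makes the method exact. The sketch below applies it to a single token with hand-written toy distributions: the draft proposes from q, the proposal is accepted with probability min(1, p/q), and a rejection is replaced by a sample from the normalized residual max(0, p - q). The paper drafts several tokens and verifies them in one parallel pass; this example is a simplified, single-token illustration with hypothetical names.

```python
import random

def speculative_step(p_target, q_draft, rng):
    """Return (token, accepted) with token distributed exactly as p_target."""
    tokens = list(range(len(p_target)))
    x = rng.choices(tokens, weights=q_draft)[0]  # draft model proposes x ~ q
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True
    # Rejected: resample from the part of p that q under-covers.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    return rng.choices(tokens, weights=residual)[0], False

rng = random.Random(0)
p_target = [0.6, 0.3, 0.1]  # toy target distribution
q_draft = [0.2, 0.5, 0.3]   # toy draft distribution
n_trials = 20000
counts = [0, 0, 0]
accepted = 0
for _ in range(n_trials):
    token, ok = speculative_step(p_target, q_draft, rng)
    counts[token] += 1
    accepted += ok
freqs = [c / n_trials for c in counts]
accept_rate = accepted / n_trials
```

The empirical frequencies track p_target rather than q_draft, and the acceptance rate approaches the sum of min(p, q) over tokens (0.6 here), which is why a draft model that closely matches the target yields larger speedups.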

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2211.17192


Agents and Reasoning

ReAct: Synergizing Reasoning and Acting in Language Models (2022)

Citation: Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models”

Why Read It: ReAct is the foundation for most LLM agent architectures. Interleaves reasoning traces with actions.

Key Contributions:

  • Thought-action-observation loop
  • Reasoning traces improve action quality
  • Generalizable agent architecture

Practical Takeaways:

  • Chain-of-thought + tool use > either alone
  • Explicit reasoning helps debugging
  • Foundation for production agents

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2210.03629


Chain-of-Thought Prompting Elicits Reasoning (2022)

Citation: Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”

Why Read It: Chain-of-thought prompting dramatically improves reasoning. Simple technique with large impact.

Key Contributions:

  • Demonstrated reasoning improvements from step-by-step thinking
  • Showed emergent capability at scale
  • Simple prompting technique

Practical Takeaways:

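  • Few-shot exemplars that include worked reasoning steps improve multi-step tasks (the zero-shot “Let’s think step by step” trigger comes from Kojima et al., listed under Prompt Engineering)
  • Works best on reasoning tasks such as arithmetic, commonsense, and symbolic problems
  • Gains appear mainly in very large models (around 100B parameters in the paper); smaller models often produce fluent but incorrect chains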

Reading Time: 1 hour

Link: https://arxiv.org/abs/2201.11903


Toolformer: Language Models Can Teach Themselves to Use Tools (2023)

Citation: Schick et al., “Toolformer: Language Models Can Teach Themselves to Use Tools”

Why Read It: Shows how to train models to use external tools. Foundation for understanding tool-use training.

Key Contributions:

  • Self-supervised tool use training
  • API call insertion and filtering
  • Works with various tools

Practical Takeaways:

  • Models can learn when tools help
  • Training data generation is key
  • Different from pure prompting approaches

Reading Time: 2 hours

Link: https://arxiv.org/abs/2302.04761


Safety and Alignment

Constitutional AI: Harmlessness from AI Feedback (2022)

Citation: Bai et al., “Constitutional AI: Harmlessness from AI Feedback”

Why Read It: Describes Anthropic’s approach to alignment using principles. Alternative to pure human feedback.

Key Contributions:

  • Using AI feedback for alignment
  • Constitutional principles for behavior
  • Reduced reliance on human labelers

Practical Takeaways:

  • Principles can guide behavior
  • AI feedback can complement human feedback
  • Scalable alignment approach

Reading Time: 2 hours

Link: https://arxiv.org/abs/2212.08073


Direct Preference Optimization (DPO, 2023)

Citation: Rafailov et al., “Direct Preference Optimization: Your Language Model is Secretly a Reward Model”

Why Read It: DPO simplifies RLHF by directly optimizing preferences. Increasingly common alternative to PPO.

Key Contributions:

  • Closed-form solution for preference learning
  • No separate reward model needed
  • Simpler and more stable than PPO

Practical Takeaways:

  • Often preferred over PPO for simplicity
  • Still needs preference data
  • Good default for alignment fine-tuning
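The DPO objective for one preference pair reduces to a logistic loss on a difference of log-probability ratios. This sketch assumes the four sequence log-probabilities have already been computed by the policy and the frozen reference model; the numbers passed in are made up, and beta = 0.1 is just a commonly used value.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * margin) for one (chosen, rejected) pair."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return math.log1p(math.exp(-beta * margin))  # equals -log(sigmoid(beta * margin))

# Policy identical to the reference: margin is 0, loss is log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy has shifted probability toward the chosen answer: loss drops.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

No reward model appears anywhere: the implicit reward is the scaled log-ratio between policy and reference, which is the sense of the paper's subtitle.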

Reading Time: 2 hours (math-heavy)

Link: https://arxiv.org/abs/2305.18290


Multimodal

Learning Transferable Visual Models From Natural Language Supervision (CLIP, 2021)

Citation: Radford et al., “Learning Transferable Visual Models From Natural Language Supervision”

Why Read It: CLIP connected vision and language understanding. Foundation for multimodal models.

Key Contributions:

  • Contrastive learning across modalities
  • Zero-shot image classification
  • Web-scale training data

Practical Takeaways:

  • Text-image alignment enables new applications
  • Contrastive learning is powerful for embedding spaces
  • CLIP encoders are reused in generative systems such as DALL-E 2 and Stable Diffusion

Reading Time: 2 hours

Link: https://arxiv.org/abs/2103.00020


Visual Instruction Tuning (LLaVA, 2023)

Citation: Liu et al., “Visual Instruction Tuning”

Why Read It: LLaVA shows how to create multimodal instruction-following models. Practical for vision-language tasks.

Key Contributions:

  • Visual instruction tuning methodology
  • GPT-4 assisted data generation
  • Strong performance with simple architecture

Practical Takeaways:

  • Vision encoders + LLMs work well together
  • Instruction data quality matters
  • Relatively simple to replicate

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2304.08485


Evaluation

HELM: Holistic Evaluation of Language Models (2022)

Citation: Liang et al., “Holistic Evaluation of Language Models”

Why Read It: HELM provides comprehensive evaluation methodology. Useful for understanding what metrics matter.

Key Contributions:

  • Multi-scenario evaluation framework
  • Standardized evaluation methodology
  • Metrics beyond accuracy (calibration, fairness, etc.)

Practical Takeaways:

  • Single metrics don’t capture model quality
  • Consider accuracy, calibration, robustness, fairness, efficiency
  • Evaluation methodology should be explicit

Reading Time: Long paper; skim for methodology

Link: https://arxiv.org/abs/2211.09110


Judging LLM-as-a-Judge (2023)

Citation: Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”

Why Read It: Establishes when LLM judges are reliable. Essential for automated evaluation.

Key Contributions:

  • MT-Bench multi-turn benchmark
  • Analysis of LLM judge reliability
  • Position bias and other issues identified

Practical Takeaways:

  • LLM judges correlate well with humans for many tasks
  • Position bias is real—randomize order
  • Strong models make better judges

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2306.05685


Systems and Infrastructure

Efficient Memory Management for Large Language Model Serving (PagedAttention/vLLM, 2023)

Citation: Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention”

Why Read It: Introduces PagedAttention, the key innovation in vLLM. Essential for understanding modern LLM serving.

Key Contributions:

  • Virtual memory for KV cache
  • Non-contiguous memory allocation
  • Block-level sharing for parallel sequences
  • 2-4x throughput improvement

Practical Takeaways:

  • KV cache memory management is critical
  • Block-based allocation solves fragmentation
  • Block-based KV cache management has since been widely adopted by LLM serving systems

Reading Time: 2 hours

Link: https://arxiv.org/abs/2309.06180


TensorRT-LLM: A High-Performance Framework for Large Language Model Inference

Citation: NVIDIA TensorRT-LLM Technical Blog/Documentation

Why Read It: Understanding TensorRT-LLM optimizations helps with production performance engineering.

Key Contributions:

  • Kernel fusion and optimization
  • Multi-GPU inference patterns
  • INT8/FP8 quantization
  • Inflight batching

Practical Takeaways:

  • Maximum performance on NVIDIA hardware
  • Compilation-based optimization
  • Trade-off between optimization time and inference speed

Reading Time: 1.5 hours (documentation + blog posts)


Orca: A Distributed Serving System for Transformer-Based Generative Models (2022)

Citation: Yu et al., “Orca: A Distributed Serving System for Transformer-Based Generative Models”

Why Read It: Introduces iteration-level scheduling, the idea later popularized as continuous batching and now common in LLM serving systems.

Key Contributions:

  • Iteration-level scheduling
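  • Continuous batching (vs. static batching): requests join and leave the running batch at every iteration
  • Selective batching: operations that do not depend on sequence length are batched across requests, while attention is handled per request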

Practical Takeaways:

  • Static batching wastes significant compute
  • Iteration-level scheduling maximizes utilization
  • Influenced vLLM, TGI, and many later serving systems
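A toy simulation shows why static batching wastes compute. It counts decoding iterations for a queue of requests under two policies: static batches that run until their longest request finishes, and iteration-level scheduling that refills a slot as soon as a request completes. The request lengths and function names are hypothetical, and the model ignores prefill, memory limits, and per-iteration cost differences.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size))

def iteration_level_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every iteration."""
    queue, running, steps = deque(lengths), [], 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        # Every running request emits one token; finished requests leave the batch.
        running = [r - 1 for r in running if r > 1]
        steps += 1
    return steps

output_lengths = [2, 8, 2, 8]  # tokens to generate per request (toy values)
static_steps = static_batching_steps(output_lengths, batch_size=2)
continuous_steps = iteration_level_steps(output_lengths, batch_size=2)
```

With these lengths the static policy needs 16 iterations and the iteration-level policy needs 12, because short requests no longer hold a slot idle while a long neighbor finishes.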

Reading Time: 2 hours

Link: https://www.usenix.org/conference/osdi22/presentation/yu


Data and Features

Hidden Technical Debt in Machine Learning Systems (2015)

Citation: Sculley et al., “Hidden Technical Debt in Machine Learning Systems,” NeurIPS 2015

Why Read It: Classic paper on ML systems debt. Still highly relevant. Required reading for ML engineers.

Key Contributions:

  • Identified ML-specific technical debt
  • Glue code, pipeline jungles, dead features
  • Configuration debt, feedback loops

Practical Takeaways:

  • Most ML code isn’t the model
  • Data and pipeline issues dominate
  • Proactive debt management is essential

Reading Time: 1.5 hours

Link: https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf


Automating Large-Scale Data Quality Verification (2018)

Citation: Schelter et al., “Automating Large-Scale Data Quality Verification”

Why Read It: Foundation for data quality in ML pipelines. Basis for Amazon’s Deequ library.

Key Contributions:

  • Declarative data quality rules
  • Incremental validation
  • Anomaly detection for data

Practical Takeaways:

  • Data testing is as important as code testing
  • Automation is essential at scale
  • Historical profiles enable anomaly detection

Reading Time: 1.5 hours

Link: https://www.vldb.org/pvldb/vol11/p1781-schelter.pdf


Data Management Challenges in Production Machine Learning (2017)

Citation: Polyzotis et al., “Data Management Challenges in Production Machine Learning”

Why Read It: Survey of data challenges in production ML. Provides vocabulary and frameworks for thinking about data issues.

Key Contributions:

  • Taxonomy of data management challenges
  • Training-serving skew analysis
  • Data lifecycle management

Practical Takeaways:

  • Data issues are the primary source of ML bugs
  • Validation at system boundaries is critical
  • Reproducibility requires data versioning

Reading Time: 2 hours

Link: https://dl.acm.org/doi/10.1145/3035918.3054782


Long Context and Memory

LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models (2023)

Citation: Chen et al., “LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models”

Why Read It: Shows how to efficiently extend context length. Relevant as context windows grow.

Key Contributions:

  • Shifted sparse attention for training
  • Efficient context extension
  • Position interpolation techniques

Practical Takeaways:

  • Context extension is cheaper than training from scratch
  • Sparse attention during training, dense during inference
  • Context lengths up to 100K tokens are achievable (demonstrated on Llama 2 7B in the paper)

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2309.12307


Retrieval-Augmented Generation vs Long-Context LLMs: A Fair Comparison (2024)

Citation: Various comparison studies

Why Read It: Understanding when to use RAG vs. long context is crucial for system design.

Key Contributions:

  • Empirical comparison of RAG vs. stuffing context
  • Cost-performance trade-offs
  • Task-dependent recommendations

Practical Takeaways:

  • Long context isn’t always better than RAG
  • Cost considerations favor RAG at scale
  • Hybrid approaches often optimal

Reading Time: 1.5 hours


Quantization

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (2022)

Citation: Frantar et al., “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers”

Why Read It: GPTQ enables running large models on consumer hardware. Essential for deployment.

Key Contributions:

  • One-shot quantization method
  • Per-column quantization
  • Minimal accuracy loss at 4-bit

Practical Takeaways:

  • 4-bit quantization is practical
  • Post-training quantization is convenient
  • Some accuracy trade-off at extreme compression

Reading Time: 2 hours (math-heavy)

Link: https://arxiv.org/abs/2210.17323


AWQ: Activation-aware Weight Quantization (2023)

Citation: Lin et al., “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration”

Why Read It: AWQ reports better accuracy than GPTQ in several of its evaluated settings. Understanding both helps with quantization decisions.

Key Contributions:

  • Activation-aware importance weights
  • No retraining needed
  • Better preservation of salient weights

Practical Takeaways:

  • Salient weights matter more—protect them
  • AWQ is often better than GPTQ, but results vary by model and kernel, so benchmark both on your workload
  • 4-bit with minimal quality loss

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2306.00978


Embeddings and Retrieval

E5: Text Embeddings by Weakly-Supervised Contrastive Pre-training (2022)

Citation: Wang et al., “Text Embeddings by Weakly-Supervised Contrastive Pre-training”

Why Read It: E5 embeddings are widely used. Understanding training methodology helps with embedding selection.

Key Contributions:

  • Weakly-supervised training on web data
  • Strong zero-shot performance
  • Instruction-tuned variants

Practical Takeaways:

  • Input prefixes matter: E5 models expect “query: ” and “passage: ” prefixes, and omitting them can hurt retrieval quality
  • Quality of training data matters enormously
  • E5 variants available for different use cases

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2212.03533


ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction (2020)

Citation: Khattab and Zaharia, “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT”

Why Read It: ColBERT offers different retrieval trade-offs than single-vector approaches. Important for advanced retrieval.

Key Contributions:

  • Late interaction for better relevance
  • Efficient MaxSim operation
  • Balance between single-vector and cross-encoder

Practical Takeaways:

  • Token-level matching improves relevance
  • More storage but better accuracy than single-vector
  • Good for precision-critical applications
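Late interaction with MaxSim can be written in a few lines: for each query token embedding, take its best match among the document's token embeddings, then sum those maxima. The vectors below are toy stand-ins for contextualized token embeddings, and the function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens, doc_tokens):
    """Sum over query tokens of the maximum similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # covers both query tokens
doc_b = [[1.0, 0.0]]                          # covers only the first
score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

Document A scores 2.0 and document B scores 1.0. The storage cost noted above follows directly from this design, since every document token keeps its own vector instead of being pooled into one.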

Reading Time: 2 hours

Link: https://arxiv.org/abs/2004.12832


Mixture of Experts

Switch Transformers: Scaling to Trillion Parameter Models (2021)

Citation: Fedus et al., “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”

Why Read It: MoE architectures are used in Mixtral and DeepSeek models and are widely reported, though not officially confirmed, in GPT-4. Understanding them helps interpret model behavior.

Key Contributions:

  • Simplified MoE routing
  • Trillion-parameter models with manageable compute
  • Expert capacity balancing

Practical Takeaways:

  • MoE enables larger models with less compute
  • Load balancing is critical
  • Not all parameters active for each token

Reading Time: 2.5 hours

Link: https://arxiv.org/abs/2101.03961


Mixtral of Experts (2023)

Citation: Jiang et al. (Mistral AI), “Mixtral of Experts” (arXiv, January 2024; model released December 2023)

Why Read It: Practical MoE model that outperforms larger dense models. Represents production-ready MoE.

Key Contributions:

  • Sparse MoE with 8 experts per layer and 2 active per token (about 47B total parameters, about 13B active per token)
  • Strong performance per compute
  • Open weights

Practical Takeaways:

  • MoE is practical for deployment
  • Router overhead is manageable
  • Memory usage is full model size

Reading Time: 1 hour (short paper/blog)

Link: https://arxiv.org/abs/2401.04088


Production and Operations

Machine Learning Operations (MLOps): Overview, Definition, and Architecture (2022)

Citation: Kreuzberger et al., “Machine Learning Operations (MLOps): Overview, Definition, and Architecture”

Why Read It: Comprehensive survey of MLOps practices. Good for understanding the field holistically.

Key Contributions:

  • MLOps definition and principles
  • Architecture patterns
  • Automation levels

Practical Takeaways:

  • MLOps is more than just ML + DevOps
  • Automation is progressive
  • Monitoring and feedback are essential

Reading Time: 2 hours

Link: https://arxiv.org/abs/2205.02302


Monitoring Machine Learning Models in Production (2019)

Citation: Google Cloud ML Documentation and Blog Series

Why Read It: Practical guidance on production ML monitoring from Google’s experience.

Key Contributions:

  • Model monitoring best practices
  • Drift detection approaches
  • Alert design

Practical Takeaways:

  • Monitor inputs, outputs, and outcomes
  • Statistical tests for drift detection
  • Human-in-the-loop for edge cases

Reading Time: 2 hours (multiple articles)


Prompt Engineering and In-Context Learning

Large Language Models are Zero-Shot Reasoners (2022)

Citation: Kojima et al., “Large Language Models are Zero-Shot Reasoners”

Why Read It: Shows that “Let’s think step by step” dramatically improves reasoning without examples.

Key Contributions:

  • Zero-shot chain-of-thought prompting
  • Simple trigger phrases enable reasoning
  • Analysis of why it works

Practical Takeaways:

  • Add “Let’s think step by step” for reasoning tasks
  • No examples needed for basic CoT
  • Works better with larger models

Reading Time: 1 hour

Link: https://arxiv.org/abs/2205.11916


Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? (2022)

Citation: Min et al., “Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?”

Why Read It: Analyzes what actually matters in few-shot examples. Surprising findings about label importance.

Key Contributions:

  • Format matters more than correct labels
  • Input distribution of examples is key
  • Label space specification is important

Practical Takeaways:

  • Examples teach format and task structure
  • Random labels work surprisingly well
  • Distribution coverage matters

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2202.12837


Calibrate Before Use: Improving Few-Shot Performance of Language Models (2021)

Citation: Zhao et al., “Calibrate Before Use: Improving Few-Shot Performance of Language Models”

Why Read It: Addresses biases in few-shot learning. Practical for improving prompt reliability.

Key Contributions:

  • Identified majority label, recency, and common token biases
  • Calibration technique to correct biases
  • Significant accuracy improvements

Practical Takeaways:

  • Example order affects predictions
  • Calibration via content-free inputs
  • Randomize and average when possible
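Contextual calibration can be sketched in its simplest form: measure the label probabilities the prompt assigns to a content-free input such as “N/A”, divide each test-time probability by that baseline, and renormalize. The probabilities below are invented for illustration, and the function name is hypothetical.

```python
def calibrate(probs, content_free_probs):
    """Divide label probabilities by the content-free baseline and renormalize."""
    adjusted = [p / cf for p, cf in zip(probs, content_free_probs)]
    total = sum(adjusted)
    return [a / total for a in adjusted]

# The prompt is biased toward label 0 even when the input carries no content.
content_free = [0.7, 0.3]
raw = [0.6, 0.4]  # model output for a real test input
calibrated = calibrate(raw, content_free)
```

The raw prediction favors label 0, but after removing the prompt's built-in bias the calibrated prediction favors label 1.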

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2102.09690


Code Generation

Evaluating Large Language Models Trained on Code (Codex, 2021)

Citation: Chen et al., “Evaluating Large Language Models Trained on Code”

Why Read It: Introduces HumanEval benchmark and Codex. Foundation for understanding code LLMs.

Key Contributions:

  • HumanEval benchmark
  • Analysis of code generation capabilities
  • pass@k evaluation methodology

Practical Takeaways:

  • Functional correctness matters, not just syntax
  • Multiple samples improve pass rate
  • Sampling temperature should match k: low temperature works best for pass@1, higher temperature for large k where diversity helps
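The pass@k metric is usually computed with the unbiased estimator from the paper: generate n samples per problem, count the c that pass the unit tests, and compute 1 - C(n - c, k) / C(n, k). The sketch below implements that formula for a single problem; averaging over problems is left out.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples, of which c pass the tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
```

With 3 of 10 samples passing, pass@1 is 0.3, and pass@5 is higher because any one of five draws may succeed. Estimating from n > k samples gives lower variance than drawing exactly k samples once.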

Reading Time: 2 hours

Link: https://arxiv.org/abs/2107.03374


Self-Instruct: Aligning Language Models with Self-Generated Instructions (2022)

Citation: Wang et al., “Self-Instruct: Aligning Language Models with Self-Generated Instructions”

Why Read It: Shows how to generate instruction data using models themselves. Practical for data generation.

Key Contributions:

  • Self-instruction generation pipeline
  • Filtering for quality
  • Reduced human annotation needs

Practical Takeaways:

  • Models can generate their own training data
  • Quality filtering is essential
  • Useful for domain-specific fine-tuning

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2212.10560


Robustness and Safety

Universal and Transferable Adversarial Attacks on Aligned Language Models (2023)

Citation: Zou et al., “Universal and Transferable Adversarial Attacks on Aligned Language Models”

Why Read It: Important for understanding LLM vulnerabilities. Motivates safety measures.

Key Contributions:

  • Automated adversarial suffix generation
  • Transfer across models
  • Bypasses safety training

Practical Takeaways:

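  • Safety training is not robust to adversarially optimized inputs
  • Perplexity filtering can flag the unnatural suffixes (a defense explored in follow-up work), but it is not sufficient on its own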
  • Defense requires multiple layers

Reading Time: 2 hours

Link: https://arxiv.org/abs/2307.15043


Red Teaming Language Models with Language Models (2022)

Citation: Perez et al., “Red Teaming Language Models with Language Models”

Why Read It: Describes automated red teaming approaches. Practical for safety testing.

Key Contributions:

  • LLM-based red teaming methodology
  • Diverse attack generation
  • Scaling red team efforts

Practical Takeaways:

  • Automation enables broader coverage
  • Diversity of attacks matters
  • Continuous red teaming needed

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2202.03286


Recent Advances (2024-2025)

The pace of development in 2024-2025 has been extraordinary. The entries below cover the most influential releases and techniques from this period; several are model cards or announcements rather than peer-reviewed papers.

Llama 3 and Open Foundation Models

Citation: Meta AI, “Llama 3” (April 2024) and “Llama 3.1” (July 2024); see also “The Llama 3 Herd of Models” technical report (arXiv:2407.21783)

Why Read It: Llama 3 pushed open-weight models to near-parity with proprietary alternatives. The Llama 3.1 70B and 405B models are competitive with GPT-4 on many benchmarks.

Key Contributions:

  • Scaling to 405B parameters with open weights (Llama 3.1, July 2024)
  • 128K context window in Llama 3.1 (up from 8K in the original Llama 3)
  • Improved training data curation and quality
  • Better instruction-following and reasoning

Practical Takeaways:

  • Open models are viable for production use
  • 70B models hit the sweet spot for many applications
  • Data quality matters as much as scale

Link: https://ai.meta.com/blog/meta-llama-3/


Claude 3 and Constitutional AI Advances

Citation: Anthropic, “Claude 3 Model Card” (2024)

Why Read It: Claude 3 (Opus, Sonnet, Haiku) demonstrated significant capability improvements while maintaining safety. Introduces improved vision capabilities.

Key Contributions:

  • Native multimodal understanding
  • Improved long-context performance (200K tokens)
  • Better instruction following and reduced refusals

Practical Takeaways:

  • Tiered model offerings enable cost optimization
  • Long context enables new use cases
  • Vision capabilities are production-ready

Gemini and Long Context

Citation: Google DeepMind, “Gemini: A Family of Highly Capable Multimodal Models” (2023-2024)

Why Read It: Gemini introduced native multimodality and pushed context windows to 1M+ tokens.

Key Contributions:

  • Native multimodal training from the start
  • 1M+ token context window (Gemini 1.5)
  • Strong performance across modalities

Practical Takeaways:

  • Truly long context changes application architecture
  • Native multimodal may outperform adapter approaches
  • Long context isn’t free: cost and latency grow with every token in the prompt

Link (Gemini 1.0): https://arxiv.org/abs/2312.11805
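
The cost point can be made concrete with a little arithmetic. The sketch below assumes standard full self-attention, where the score matrix has one entry per pair of tokens, and a hypothetical flat per-token input price. The function names and the `price_per_million_tokens` parameter are illustrative, not any provider's actual API or rates.

```python
def attention_pairs(seq_len: int) -> int:
    """Query-key pairs scored by full self-attention (per head, per layer)."""
    return seq_len * seq_len


def prompt_cost(prompt_tokens: int, price_per_million_tokens: float) -> float:
    """Input cost under a hypothetical flat per-token price."""
    return prompt_tokens / 1_000_000 * price_per_million_tokens


if __name__ == "__main__":
    short, long = 8_000, 1_000_000
    # Token count grows 125x, but attention pairs grow 125^2 = 15,625x.
    print(long // short)
    print(attention_pairs(long) // attention_pairs(short))
    # Hypothetical price of 2.0 currency units per million input tokens.
    print(prompt_cost(long, 2.0))
```

Even when billing is linear in tokens, the quadratic term tends to show up as latency, so it is worth measuring both before committing to a very long prompt design.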


Test-Time Compute and Reasoning (o1-style)

Citation: Various papers on inference-time scaling, including OpenAI’s o1 approach

Why Read It: This line of work marks a shift toward spending more compute at inference time (letting the model think longer) rather than only training larger models.

Key Contributions:

  • Chain-of-thought at inference time improves reasoning
  • Self-consistency and verification during generation
  • Trading inference compute for capability

Practical Takeaways:

  • On reasoning-heavy tasks, longer thinking time can partly substitute for a larger model
  • Useful for high-stakes decisions worth the latency
  • Different cost model than traditional inference
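
Self-consistency, one of the techniques listed above, can be sketched as sampling several answers and keeping the most common one. This is a minimal illustration under stated assumptions: `sample_answer` is a hypothetical stand-in for a call to a model at non-zero temperature, and the canned answers below replace real model output.

```python
from collections import Counter
from typing import Callable


def self_consistency(sample_answer: Callable[[], str], n_samples: int = 5) -> str:
    """Sample n_samples answers and return the majority answer."""
    answers = [sample_answer() for _ in range(n_samples)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


if __name__ == "__main__":
    # Canned final answers standing in for five sampled reasoning chains.
    canned = iter(["42", "41", "42", "40", "42"])
    print(self_consistency(lambda: next(canned), n_samples=5))  # prints 42
```

Each extra sample costs a full generation, which is the "different cost model" in practice: accuracy is bought with a multiple of the usual inference bill.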

Mixture of Experts at Scale (Mixtral, DeepSeek)

Citation: Mistral AI, “Mixtral of Experts” (2024); DeepSeek-V2 (2024)

Why Read It: MoE architectures achieved excellent performance-per-compute, making very capable models more accessible.

Key Contributions:

  • Efficient expert routing at scale
  • Mixtral 8x7B matches or outperforms Llama 2 70B on most benchmarks while activating only about 13B of its 47B parameters per token
  • DeepSeek-V2 pushes efficiency further

Practical Takeaways:

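  • MoE delivers the quality of a larger model at the per-token compute of a much smaller dense model
  • Only a few experts are active per token, but every expert must stay loaded, so memory use is still that of the full model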
  • Great for throughput-focused deployments

Link (Mixtral): https://arxiv.org/abs/2401.04088
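
The routing idea and the compute-versus-memory distinction can be sketched in a few lines. This is an illustrative top-k router in plain Python, not Mixtral's or DeepSeek's actual implementation; the function names and the parameter counts in the usage section are made up for the example.

```python
import math
from typing import List, Tuple


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits: List[float], k: int = 2) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and normalize their gate weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    gates = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, gates))


def active_fraction(num_experts: int, k: int,
                    expert_params: float, shared_params: float) -> float:
    """Share of all parameters that take part in one token's forward pass."""
    total = shared_params + num_experts * expert_params
    active = shared_params + k * expert_params
    return active / total


if __name__ == "__main__":
    # One token's router scores over four hypothetical experts.
    print(top_k_route([0.1, 2.0, -1.0, 2.0], k=2))
    # Made-up sizes: 8 experts of 5 units each plus 2 units of shared weights.
    print(active_fraction(num_experts=8, k=2, expert_params=5.0, shared_params=2.0))
```

The fraction returned by `active_fraction` governs per-token compute only. Memory still has to hold `total`, which is why MoE models suit deployments with ample memory and a focus on throughput.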


Structured Output and JSON Mode

Citation: OpenAI “Structured Outputs” announcement; various implementations

Why Read It: Native structured output support changed how we build LLM applications. It is more reliable than extracting JSON from free-form text through prompting alone.

Key Contributions:

  • Output that is guaranteed to parse as valid JSON
  • Schema enforcement during generation through constrained decoding
  • Reduced post-processing and error handling

Practical Takeaways:

  • Use structured outputs for any programmatic consumption
  • Dramatically reduces parsing errors
  • Simplifies application code
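
Even with schema enforcement on the provider side, a thin validation layer at the application boundary is a common defensive pattern. The sketch below uses only the Python standard library. The `PaperSummary` schema and the `parse_summary` helper are hypothetical names chosen for this example, not part of any vendor SDK.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class PaperSummary:
    title: str
    year: int
    tags: List[str]


def parse_summary(raw: str) -> PaperSummary:
    """Parse and validate model output; raises ValueError on any problem."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = {"title", "year", "tags"} - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    title, year, tags = data["title"], data["year"], data["tags"]
    if not isinstance(title, str):
        raise ValueError("title must be a string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(year, bool) or not isinstance(year, int):
        raise ValueError("year must be an integer")
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        raise ValueError("tags must be a list of strings")
    return PaperSummary(title=title, year=year, tags=tags)


if __name__ == "__main__":
    raw = '{"title": "Mixtral of Experts", "year": 2024, "tags": ["moe"]}'
    print(parse_summary(raw))
```

With schema-enforced generation the error branches should rarely fire, but keeping them means the same code path also works with models or providers that only offer prompt-based JSON.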

Model Merging and Composition

Citation: TIES-Merging (2023), DARE (2024), Model Soups

Why Read It: Merging fine-tuned models without additional training enables combining capabilities efficiently.

Key Contributions:

  • Task arithmetic with model weights
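  • Merging techniques that reduce interference between models (for example, resolving sign conflicts and dropping redundant weight changes)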
  • Combining specialist models into generalists

Practical Takeaways:

  • Cheaper than multi-task training
  • Useful for domain adaptation
  • Quality varies—evaluate merged models carefully

Link (TIES): https://arxiv.org/abs/2306.01708
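
Task arithmetic, the simplest of these ideas, can be sketched with plain Python lists standing in for weight tensors. This shows only the basic add-the-deltas scheme. TIES and DARE add further steps (such as trimming and sign resolution, or random dropping and rescaling) that are not shown here. All names and numbers are illustrative.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def task_vector(base: Weights, finetuned: Weights) -> Weights:
    """Difference between a fine-tuned model and its base, per parameter."""
    return {name: [f - b for f, b in zip(finetuned[name], base[name])]
            for name in base}


def merge(base: Weights, finetuned_models: List[Weights], scale: float = 1.0) -> Weights:
    """Add the scaled task vectors of several fine-tunes back onto the base."""
    merged = {name: list(values) for name, values in base.items()}
    for model in finetuned_models:
        for name, delta in task_vector(base, model).items():
            merged[name] = [m + scale * d for m, d in zip(merged[name], delta)]
    return merged


if __name__ == "__main__":
    base = {"w": [1.0, 2.0]}
    math_model = {"w": [1.5, 2.0]}  # hypothetical fine-tune A
    code_model = {"w": [1.0, 1.0]}  # hypothetical fine-tune B
    print(merge(base, [math_model, code_model]))  # prints {'w': [1.5, 1.0]}
```

All models must share the same base and architecture for the subtraction to be meaningful. The `scale` factor is usually tuned on held-out data, which ties back to the advice to evaluate merged models carefully.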


Long Context Advances

Citation: Ring Attention (2024), LongRoPE, various context extension techniques

Why Read It: Understanding how context windows extended from 4K to 1M+ tokens informs architectural decisions.

Key Contributions:

  • Distributed attention across devices
  • Position interpolation for length extension
  • Memory-efficient attention variants

Practical Takeaways:

  • Long context has real costs (quadratic attention)
  • Retrieval may still beat very long context for some tasks
  • Test actual performance at your target length, not just the advertised maximum

Link (Ring Attention): https://arxiv.org/abs/2310.01889
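
Position interpolation for length extension can be illustrated with rotary position embeddings (RoPE). The sketch below shows the simple linear variant, which rescales positions so that a longer sequence maps back into the range seen during training. LongRoPE uses a more elaborate non-uniform rescaling that is not shown. Function names and lengths are assumptions made for this example.

```python
from typing import List


def rope_angles(position: float, head_dim: int, base: float = 10000.0) -> List[float]:
    """Rotation angles RoPE applies at a position, one per pair of dimensions."""
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def interpolate_position(position: int, train_len: int, target_len: int) -> float:
    """Linearly squeeze positions so target_len maps into the trained range."""
    scale = min(1.0, train_len / target_len)
    return position * scale


if __name__ == "__main__":
    train_len, target_len = 4096, 16384  # illustrative lengths
    pos = 12000                          # beyond the trained range
    squeezed = interpolate_position(pos, train_len, target_len)
    print(squeezed)                      # prints 3000.0, inside the trained range
    print(rope_angles(squeezed, head_dim=8)[:2])
```

Squeezing positions reduces the resolution between neighbouring tokens, which is one reason extended models are typically fine-tuned briefly at the new length and should be tested there.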


Inference Optimization

Citation: Medusa (2024), EAGLE, Speculative Decoding advances

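Why Read It: Speculative decoding methods such as Medusa and EAGLE report roughly 2-3x inference speedups with little or no loss in output quality, which matters directly for serving cost and latency.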

Key Contributions:

  • Multiple draft heads for speculation
  • Higher draft acceptance rates through draft architectures that reuse the target model’s internal features
  • Production-ready implementations

Practical Takeaways:

  • Speculative decoding is production-ready
  • Works best when draft model matches target well
  • Evaluate for your specific models and use cases

Link (Medusa): https://arxiv.org/abs/2401.10774
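
The core draft-then-verify loop can be sketched with toy models. This is the classic greedy form with a separate draft model, not Medusa's multi-head design or EAGLE. Sampling-based variants use an acceptance rule that is not shown here. The `target` and `draft` functions are hypothetical stand-ins for real next-token predictors.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int = 4) -> List[int]:
    """One round of greedy speculative decoding; returns the accepted tokens."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    context = list(prefix)
    for _ in range(k):
        token = draft(context)
        proposed.append(token)
        context.append(token)

    # 2. The target checks each proposal. A real system scores all k
    #    positions in a single batched forward pass; this loop is for clarity.
    accepted: List[int] = []
    context = list(prefix)
    for token in proposed:
        expected = target(context)
        if token != expected:
            accepted.append(expected)  # keep the target's token and stop
            return accepted
        accepted.append(token)
        context.append(token)

    # 3. Every draft token matched, so the target adds one bonus token.
    accepted.append(target(context))
    return accepted


if __name__ == "__main__":
    def target(ctx: List[int]) -> int:
        return (ctx[-1] + 1) % 10            # toy target: count upward

    def draft(ctx: List[int]) -> int:
        return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10  # wrong after a 3

    print(speculative_step(target, draft, prefix=[0], k=4))  # prints [1, 2, 3, 4]
```

The output is identical to what the target would produce alone. The speedup comes from accepting several tokens per target pass, which is why a draft that agrees with the target more often helps.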


Multimodal and Video Understanding

Citation: GPT-4V(ision) system card, LLaVA-NeXT, video understanding papers

Why Read It: Vision-language models matured significantly in 2024, enabling new applications.

Key Contributions:

  • Improved visual reasoning and OCR
  • Video understanding through frame sampling
  • Better integration of visual and textual reasoning

Practical Takeaways:

  • Vision is production-ready for many use cases
  • Frame sampling strategy matters for video
  • Combine with specialized vision models when needed
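
A simple baseline for frame sampling is to take evenly spaced frames from the middle of equal-length segments. The helper below is a hypothetical sketch of that strategy. Real pipelines may prefer scene-change detection or denser sampling around events of interest.

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Indices of num_samples frames, each from the middle of an equal segment."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        return list(range(total_frames))  # short clip: keep every frame
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    print(uniform_frame_indices(total_frames=100, num_samples=4))  # prints [12, 37, 62, 87]
```

Uniform sampling can miss short events entirely, so the number of samples should be chosen with the shortest event you need to detect in mind.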

Note: The field continues to move quickly. For the latest developments, follow:

  • arxiv.org/list/cs.CL and cs.LG
  • Papers With Code
  • Major lab blogs (OpenAI, Anthropic, Google DeepMind, Meta AI)

Organizing Your Reading

Paper Notes Template

For each paper you read in depth, capture:

Paper: [Title]
Authors: [Names]
Link: [URL]
Date Read: [Date]

One-sentence summary:
[What is the key contribution?]

Key techniques:

- [Technique 1]
- [Technique 2]

Practical applications:

- [How I might use this]

Questions/limitations:

- [What's unclear or limited?]

Related papers:

- [Connected work]

Building a Personal Library

Organize what you read into three tiers:

  1. Core references: Papers you’ll cite repeatedly (~20 papers)
  2. Deep dives: Papers you’ve studied carefully (~50 papers)
  3. Awareness: Papers you’ve skimmed for context (~200 papers)

Reading Schedule

A sustainable approach for practicing engineers:

  • Daily: Skim 2-3 abstracts (15 minutes)
  • Weekly: 1 paper read carefully (2-3 hours)
  • Monthly: One deep dive into a foundational paper (4+ hours)

Reading Strategy Recommendations

For Engineers New to AI

  1. Start with “Attention Is All You Need” for foundations
  2. Read the BERT and GPT-2 papers to understand encoder-only and decoder-only models
  3. Study InstructGPT for alignment concepts
  4. Then explore specific areas as needed

For Production-Focused Engineers

Priority order:

  1. FlashAttention (performance)
  2. LoRA/QLoRA (fine-tuning)
  3. RAG paper (retrieval)
  4. ReAct (agents)
  5. DPO (alignment)

For Research-Adjacent Roles

Deep dive into:

  1. Scaling laws papers
  2. Latest arxiv releases in areas of focus
  3. Benchmark papers for evaluation methodology

Time-Efficient Reading

  • Read abstract and introduction first
  • Skip to experiments/results if time-limited
  • Return to methods when implementing
  • Keep a paper notes system

Staying Current

Daily Sources

  • arxiv.org/list/cs.CL (NLP)
  • arxiv.org/list/cs.LG (machine learning)
  • Papers With Code newsletter

Weekly Summary Sources

  • ML Reddit communities
  • The Batch (Andrew Ng’s newsletter)
  • Import AI newsletter

Monthly Deep Dives

  • Pick 1-2 important papers for careful study
  • Implement key concepts when possible
  • Discuss with colleagues

Paper Finding Resources

Primary Sources

arxiv.org: Primary preprint server. Categories:

  • cs.CL (Computation and Language)
  • cs.LG (Machine Learning)
  • cs.AI (Artificial Intelligence)
  • stat.ML (Machine Learning - Statistics)

Papers With Code: Papers with implementation links. Good for finding reproducible work.

Semantic Scholar: Academic search with citation analysis. Good for finding influential papers.

Google Scholar: Broad academic search. Useful for citation counts.

Curated Lists

Awesome-LLM: GitHub repository tracking LLM papers

NLP Progress: Task-specific paper tracking

ML Papers Explained: Accessible paper summaries

Conference Proceedings

Major venues for AI/ML papers:

  • NeurIPS: Neural Information Processing Systems (December)
  • ICML: International Conference on Machine Learning (July)
  • ICLR: International Conference on Learning Representations (May)
  • ACL/EMNLP/NAACL: NLP conferences
  • CVPR/ICCV: Computer vision

Industry Sources

  • OpenAI Blog: Research announcements
  • Anthropic Research: Constitutional AI and safety
  • Google AI Blog: DeepMind and Google Research
  • Meta AI Blog: LLaMA and other releases
  • Hugging Face Blog: Open source developments

Building Research Intuition

Evaluating Paper Quality

Questions to ask when reading:

  1. Is the problem well-motivated? Why should we care?
  2. Is the approach novel? What’s the key insight?
  3. Are the experiments convincing? Proper baselines? Ablations?
  4. Are the claims appropriate? Does evidence support conclusions?
  5. Is it reproducible? Code available? Clear methodology?

Identifying Important Papers

Signals of importance:

  • High citation count relative to age
  • Adopted by practitioners
  • Spawned follow-up work
  • Changed how people think about the problem
  • Has open-source implementation

Common Paper Patterns

Scaling paper: Shows how performance improves with scale (data, parameters, compute). Example: Scaling Laws for Neural Language Models.

Architecture paper: Introduces new model design. Example: Attention Is All You Need.

Technique paper: Introduces training or inference technique. Example: LoRA.

Benchmark paper: Introduces evaluation methodology. Example: HELM.

Analysis paper: Studies existing models/techniques. Example: Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

Application paper: Applies ML to specific domain. Example: AlphaFold.


Reading Difficult Papers

When the Math is Hard

  1. First pass: Skip math, understand intuition
  2. Second pass: Follow math at high level
  3. Third pass: Work through key derivations
  4. Implementation: Implement to solidify understanding

When the System is Complex

  1. Architecture diagram: Draw it yourself
  2. Data flow: Trace a single example through
  3. Compare to simpler systems: What’s added/different?

When Results are Surprising

  1. Check experimental setup: Are comparisons fair?
  2. Look for ablations: What’s actually contributing?
  3. Find replications: Have others reproduced?

Creating Value from Paper Reading

For Your Team

  • Share weekly paper summaries
  • Present papers at reading groups
  • Write internal “paper digests”
  • Connect papers to current projects

For Your Career

  • Build expertise in specific areas
  • Contribute to survey papers
  • Write blog posts explaining papers
  • Implement papers and open-source your implementations

For the Community

  • Write accessible explanations
  • Create educational content
  • Point out limitations/issues
  • Propose extensions and improvements