Appendix C: Paper Reading List (Annotated)

This appendix provides an annotated reading list of essential papers for AI engineers. Papers are organized by topic and ordered by importance within each section. Most entries include a summary of key contributions, practical takeaways, and a reading time estimate.

Foundational Papers

These papers establish the conceptual foundations that every AI engineer should understand.

Attention Is All You Need (2017)

Citation: Vaswani et al., “Attention Is All You Need,” NeurIPS 2017

Why Read It: This paper introduced the Transformer architecture that underlies virtually all modern LLMs. Understanding self-attention, multi-head attention, and positional encoding is fundamental.

Key Contributions:

Self-attention mechanism that processes sequences without recurrence
Multi-head attention for attending to different representation subspaces
Scaled dot-product attention with the √d_k scaling factor
Positional encoding to inject sequence order information

Practical Takeaways:

Attention complexity is O(n²) with sequence length—this drives context window limitations
Encoder-decoder architecture is optional; decoder-only (GPT) and encoder-only (BERT) variants emerged
The position-wise feedforward layers hold most of each block’s parameters (about two-thirds when the hidden width is 4× the model dimension)

Reading Time: 2-3 hours for initial read, deeper understanding takes longer

Link: https://arxiv.org/abs/1706.03762
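As a rough illustration of the scaled dot-product attention listed under Key Contributions, the following minimal sketch computes softmax(QKᵀ/√d_k)V in plain Python. The function names and the list-of-lists representation are choices made for this example, not the paper's code, and the sketch omits masking, multiple heads, and the learned projections.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v, all as lists of lists."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # One score per key; looping over every (query, key) pair is the O(n^2) cost.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value vectors.
        output.append([
            sum(w * v[j] for w, v in zip(weights, V))
            for j in range(len(V[0]))
        ])
    return output

# Toy inputs (made up for illustration): one query, two keys, one-hot values.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
result = scaled_dot_product_attention(Q, K, V)
```

Because the value vectors here are one-hot, the output row equals the attention weights, which makes it easy to see that the query attends more to the key it matches.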

BERT: Pre-training of Deep Bidirectional Transformers (2018)

Citation: Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” NAACL 2019

Why Read It: BERT demonstrated that pre-training on large text corpora, followed by fine-tuning, dramatically improves performance on downstream tasks. This transfer learning paradigm now dominates NLP.

Key Contributions:

Masked language modeling (MLM) pre-training objective
Bidirectional context (unlike left-to-right GPT)
Next sentence prediction (NSP) task (later shown less important)
Fine-tuning paradigm for transfer learning

Practical Takeaways:

Pre-training + fine-tuning is more effective than training from scratch
Bidirectional models excel at understanding tasks (classification, extraction)
But they can’t generate text autoregressively

Reading Time: 2 hours

Link: https://arxiv.org/abs/1810.04805

Language Models are Unsupervised Multitask Learners (GPT-2, 2019)

Citation: Radford et al., “Language Models are Unsupervised Multitask Learners”

Why Read It: GPT-2 demonstrated that scaling language models enables zero-shot task performance. It helped establish the paradigm of prompting models rather than fine-tuning them.

Key Contributions:

Zero-shot learning through prompting
Demonstration that scale improves capability
Decoder-only architecture for generation

Practical Takeaways:

Larger models can perform tasks they weren’t explicitly trained for
Prompting is a form of “soft” programming
Model capacity matters significantly

Reading Time: 1.5 hours

Link: https://openai.com/index/better-language-models/

Training language models to follow instructions (InstructGPT, 2022)

Citation: Ouyang et al., “Training language models to follow instructions with human feedback”

Why Read It: This paper describes how to align language models to follow human instructions using RLHF. It’s the foundation of ChatGPT and similar assistants.

Key Contributions:

Three-stage alignment process: supervised fine-tuning, reward modeling, PPO
Demonstration that a small aligned model can beat a much larger unaligned one: labelers preferred outputs from the 1.3B InstructGPT model over those from the 175B GPT-3
Human preference data collection methodology

Practical Takeaways:

Alignment is as important as capability
Human feedback is expensive but valuable
Reward hacking is a real risk

Reading Time: 2-3 hours

Link: https://arxiv.org/abs/2203.02155

Scaling and Training

Scaling Laws for Neural Language Models (2020)

Citation: Kaplan et al., “Scaling Laws for Neural Language Models”

Why Read It: Establishes power-law relationships between model size, data, compute, and loss. Essential for planning training runs and resource allocation.

Key Contributions:

Power-law scaling of loss with parameters, data, and compute
Optimal allocation of compute budget between model size and data
Early stopping recommendations

Practical Takeaways:

Doubling parameters cuts loss by a roughly constant factor (about 5% with the paper’s exponent of ~0.076), not by a constant absolute amount
Compute-optimal training has specific ratios
These laws guide billion-dollar training decisions

Reading Time: 2 hours (heavy on math)

Link: https://arxiv.org/abs/2001.08361
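To make the power-law relationship concrete, here is a small sketch of the parameter-scaling law L(N) = (N_c / N)^α_N. The constants are the approximate fitted values reported in the paper; the function name is invented for this example, and the formula only applies when data and compute are not the limiting factors.

```python
ALPHA_N = 0.076  # approximate exponent for parameter scaling reported by Kaplan et al.
N_C = 8.8e13     # approximate fitted constant (non-embedding parameters) from the paper

def loss_from_params(n_params):
    """L(N) = (N_c / N) ** alpha_N, assuming data and compute are not bottlenecks."""
    return (N_C / n_params) ** ALPHA_N

small = loss_from_params(1e9)
large = loss_from_params(2e9)

# The ratio depends only on the exponent, not on the starting model size.
ratio = large / small
print(f"loss ratio after doubling parameters: {ratio:.3f}")
```

The ratio is the same whether you go from 1B to 2B or from 100B to 200B parameters, which is what makes the law useful for extrapolating from small runs.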

Training Compute-Optimal Large Language Models (Chinchilla, 2022)

Citation: Hoffmann et al., “Training Compute-Optimal Large Language Models”

Why Read It: Revised scaling laws showing that models should be trained on more data than previously thought. Influenced all subsequent LLM training.

Key Contributions:

Revised optimal data-to-parameters ratio (~20 tokens per parameter)
Showed GPT-3 was undertrained for its size
Chinchilla (70B) outperformed Gopher (280B) with same compute

Practical Takeaways:

More data often beats more parameters
Optimal ratios changed industry training practices
Data quality and quantity both matter enormously

Reading Time: 2 hours

Link: https://arxiv.org/abs/2203.15556
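The ~20 tokens-per-parameter rule can be turned into a back-of-the-envelope planner. This sketch assumes the common approximation that training compute is C ≈ 6·N·D FLOPs (N parameters, D tokens); the function name and constants are illustrative, and the paper's own fits are more nuanced than a single fixed ratio.

```python
import math

TOKENS_PER_PARAM = 20.0       # rule of thumb from the entry above
FLOPS_PER_PARAM_TOKEN = 6.0   # assumed approximation: C ≈ 6 * N * D

def compute_optimal_allocation(compute_flops):
    """Split a FLOP budget into (parameters, tokens) under the two assumptions above."""
    # C = 6 * N * D and D = 20 * N  =>  N = sqrt(C / 120)
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# Budget implied by Chinchilla's own configuration: 70B parameters, 1.4T tokens.
budget = 6.0 * 70e9 * 1.4e12
n_params, n_tokens = compute_optimal_allocation(budget)
print(f"params ≈ {n_params:.3g}, tokens ≈ {n_tokens:.3g}")
```

Feeding in Chinchilla's own budget recovers its 70B / 1.4T configuration, which is a quick sanity check that the two assumptions are mutually consistent.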

FlashAttention: Fast and Memory-Efficient Exact Attention (2022)

Citation: Dao et al., “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”

Why Read It: FlashAttention is now the default attention implementation in most major training and inference stacks. Understanding its memory hierarchy optimization is valuable for performance engineering.

Key Contributions:

Fused attention kernels that minimize HBM reads/writes
Tiling approach that fits computation in SRAM
Significant speedup and memory reduction

Practical Takeaways:

Memory bandwidth, not compute, is often the bottleneck
Algorithm-level optimization can beat hardware upgrades
Use FlashAttention (or an equivalent fused attention kernel) in production whenever your hardware and framework support it

Reading Time: 2-3 hours (systems-heavy)

Link: https://arxiv.org/abs/2205.14135
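The tiling idea rests on computing softmax incrementally, one block of keys at a time, while keeping only a running maximum and a running normalizer. The sketch below shows that online-softmax bookkeeping for a single query in plain Python and checks it against a naive reference. It is not the FlashAttention kernel itself (which is a fused GPU kernel that also tiles queries and manages SRAM explicitly); the function names and block size are assumptions made for illustration.

```python
import math

def naive_attention(q, K, V):
    """Reference: materialize all scores, then softmax, then weight the values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    z = sum(p)
    return [sum(pi * v[j] for pi, v in zip(p, V)) / z for j in range(len(V[0]))]

def tiled_attention(q, K, V, block_size=2):
    """Process keys/values block by block without storing the full score vector."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), block_size):
        k_block = K[start:start + block_size]
        v_block = V[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new reference maximum.
        correction = math.exp(running_max - new_max)
        p = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(p)
        acc = [a * correction + sum(pi * v[j] for pi, v in zip(p, v_block))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Toy data, made up for the comparison.
q = [0.5, -1.0]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.3, 0.3]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
reference = naive_attention(q, K, V)
tiled = tiled_attention(q, K, V, block_size=2)
```

Both functions return the same result up to floating-point error, which is the sense in which the tiled computation is "exact" attention rather than an approximation.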

Retrieval and RAG

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020)

Citation: Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”

Why Read It: Introduced RAG, which is now fundamental to production LLM applications. Combines retrieval with generation for better factuality.

Key Contributions:

Architecture combining retriever and generator
Marginalization over retrieved documents
Significant improvements on knowledge tasks

Practical Takeaways:

Retrieval grounds generation in facts
Quality of retrieved context directly impacts output quality
RAG is more efficient than putting everything in context

Reading Time: 2 hours

Link: https://arxiv.org/abs/2005.11401

Dense Passage Retrieval for Open-Domain Question Answering (2020)

Citation: Karpukhin et al., “Dense Passage Retrieval for Open-Domain Question Answering”

Why Read It: DPR showed that learned dense retrieval can outperform BM25. Foundation for modern semantic search.

Key Contributions:

Bi-encoder architecture for dense retrieval
In-batch negatives for efficient training
Strong QA results with simple architecture

Practical Takeaways:

Dense retrieval captures semantic similarity that BM25 misses
Hard negatives improve retrieval quality
Hybrid (dense + sparse) often works best

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2004.04906
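The in-batch negatives trick can be sketched as a softmax cross-entropy over a batch-by-batch similarity matrix, where passage i is the positive for query i and every other passage in the batch serves as a negative. The vectors below stand in for bi-encoder outputs; the function names and toy values are assumptions for illustration, not the DPR codebase.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_negative_loss(query_vecs, passage_vecs):
    """Mean negative log-likelihood; passage i is the positive for query i."""
    total = 0.0
    for i, q in enumerate(query_vecs):
        # Score the query against every passage in the batch.
        scores = [dot(q, p) for p in passage_vecs]
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]
    return total / len(query_vecs)

# Toy embeddings: each query points in the same direction as its positive passage.
queries = [[5.0, 0.0], [0.0, 5.0]]
passages = [[1.0, 0.0], [0.0, 1.0]]

aligned_loss = in_batch_negative_loss(queries, passages)
mismatched_loss = in_batch_negative_loss(queries, list(reversed(passages)))
```

The loss is small when each query scores its own passage highest and large when the pairing is wrong, so a batch of B pairs yields B − 1 negatives per query without any extra encoding work.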

Self-RAG: Learning to Retrieve, Generate, and Critique (2023)

Citation: Asai et al., “Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection”

Why Read It: Advanced RAG that teaches models to decide when to retrieve and self-critique. Shows direction of RAG evolution.

Key Contributions:

Model learns when retrieval is needed
Self-reflection for assessing generation quality
Special tokens for retrieval decisions

Practical Takeaways:

Not every query needs retrieval
Self-critique can improve output quality
Training for retrieval decisions is possible

Reading Time: 2 hours

Link: https://arxiv.org/abs/2310.11511

Efficiency and Optimization

LoRA: Low-Rank Adaptation of Large Language Models (2021)

Citation: Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models”

Why Read It: LoRA made fine-tuning large models practical by training only a small number of parameters. Now standard for LLM adaptation.

Key Contributions:

Low-rank decomposition for weight updates
No additional inference latency
Dramatic reduction in trainable parameters

Practical Takeaways:

Fine-tune 7B+ models on consumer GPUs
r=8-16 typically sufficient
Can merge adapters back into base weights

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2106.09685
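The low-rank update and the adapter merge can be illustrated with a few lines of arithmetic. In this sketch the frozen weight W (d × k) is updated as W + (α/r)·B·A, with B of shape d × r and A of shape r × k. The helper names and the 4096 × 4096 layer size are assumptions chosen for the example.

```python
def matmul(A, B):
    """Plain-Python matrix product for small lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_merge(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, folding the adapter into the base weight."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def lora_trainable_params(d, k, r):
    """Parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)

# Parameter savings for one hypothetical 4096 x 4096 projection at rank 8.
d, k, r = 4096, 4096, 8
full_params = d * k
lora_params = lora_trainable_params(d, k, r)
reduction = full_params // lora_params

# Tiny merge example with rank 1.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [0.0]]
A = [[0.0, 2.0]]
merged = lora_merge(W, A, B, alpha=2.0, r=1)
```

After merging, inference uses a single weight matrix of the original shape, which is why LoRA adds no latency once adapters are folded in.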

QLoRA: Efficient Finetuning of Quantized LLMs (2023)

Citation: Dettmers et al., “QLoRA: Efficient Finetuning of Quantized LLMs”

Why Read It: Combines 4-bit quantization with LoRA for even more efficient fine-tuning. Enabled fine-tuning 65B models on single GPUs.

Key Contributions:

4-bit NormalFloat quantization
Double quantization for memory savings
Paged optimizers for memory spikes

Practical Takeaways:

Fine-tune 65B models on single 48GB GPU (per the paper; modern 70B models also fit)
Quality close to full fine-tuning
Standard tool for resource-constrained fine-tuning

Reading Time: 2 hours

Link: https://arxiv.org/abs/2305.14314

Speculative Decoding (2022-2023)

Citation: Leviathan et al., “Fast Inference from Transformers via Speculative Decoding”

Why Read It: Speculative decoding provides 2-3x inference speedup with no quality loss. Understanding why it works helps with other optimizations.

Key Contributions:

Draft-then-verify paradigm
Acceptance probability analysis
Parallel verification of multiple tokens

Practical Takeaways:

Works best when draft model is well-aligned with target
Speedup depends on acceptance rate
No quality degradation (exact sampling)

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2211.17192
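The "exact sampling" claim follows from the acceptance rule: a drafted token x is accepted with probability min(1, p(x)/q(x)), and on rejection a replacement is drawn from the normalized residual max(0, p − q). The sketch below implements that rule for a single token position using small hand-written distributions; the function names and the toy distributions are assumptions for illustration, and a real system applies this across several drafted tokens verified in one target-model pass.

```python
import random

def speculative_accept(token, p_target, q_draft, rng):
    """Accept the drafted token or resample; returns (token, was_accepted)."""
    accept_prob = min(1.0, p_target[token] / q_draft[token])
    if rng.random() < accept_prob:
        return token, True
    # Rejected: sample from the normalized residual max(0, p - q).
    residual = {t: max(0.0, p_target[t] - q_draft.get(t, 0.0)) for t in p_target}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0], False

def sample_one(p_target, q_draft, rng):
    """Draft one token from q, then apply the accept/resample rule."""
    tokens = list(q_draft)
    drafted = rng.choices(tokens, weights=[q_draft[t] for t in tokens], k=1)[0]
    return speculative_accept(drafted, p_target, q_draft, rng)[0]

# Toy distributions over a three-token vocabulary (made up for the example).
p_target = {"a": 0.6, "b": 0.3, "c": 0.1}
q_draft = {"a": 0.3, "b": 0.5, "c": 0.2}

rng = random.Random(0)
n_samples = 20000
counts = {t: 0 for t in p_target}
for _ in range(n_samples):
    counts[sample_one(p_target, q_draft, rng)] += 1
freqs = {t: c / n_samples for t, c in counts.items()}
```

The empirical frequencies track the target distribution rather than the draft distribution, even though every candidate token came from the draft model first. The closer q is to p, the more often tokens are accepted, which is where the speedup comes from.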

Agents and Reasoning

ReAct: Synergizing Reasoning and Acting in Language Models (2022)

Citation: Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models”

Why Read It: ReAct is the foundation for most LLM agent architectures. Interleaves reasoning traces with actions.

Key Contributions:

Thought-action-observation loop
Reasoning traces improve action quality
Generalizable agent architecture

Practical Takeaways:

Chain-of-thought + tool use > either alone
Explicit reasoning helps debugging
Foundation for production agents

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2210.03629

Chain-of-Thought Prompting Elicits Reasoning (2022)

Citation: Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”

Why Read It: Chain-of-thought prompting dramatically improves reasoning. Simple technique with large impact.

Key Contributions:

Demonstrated reasoning improvements from step-by-step thinking
Showed emergent capability at scale
Simple prompting technique

Practical Takeaways:

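Adding worked reasoning steps to few-shot exemplars improves multi-step tasks (the zero-shot “Let’s think step by step” trigger comes from Kojima et al., listed under Prompt Engineering and In-Context Learning)
Works best on arithmetic, commonsense, and symbolic reasoning tasks
Gains appear mainly in very large models (around 100B parameters); smaller models often produce fluent but incorrect reasoning chains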

Reading Time: 1 hour

Link: https://arxiv.org/abs/2201.11903

Toolformer: Language Models Can Teach Themselves to Use Tools (2023)

Citation: Schick et al., “Toolformer: Language Models Can Teach Themselves to Use Tools”

Why Read It: Shows how to train models to use external tools. Foundation for understanding tool-use training.

Key Contributions:

Self-supervised tool use training
API call insertion and filtering
Works with various tools

Practical Takeaways:

Models can learn when tools help
Training data generation is key
Different from pure prompting approaches

Reading Time: 2 hours

Link: https://arxiv.org/abs/2302.04761

Safety and Alignment

Constitutional AI: Harmlessness from AI Feedback (2022)

Citation: Bai et al., “Constitutional AI: Harmlessness from AI Feedback”

Why Read It: Describes Anthropic’s approach to alignment using principles. Alternative to pure human feedback.

Key Contributions:

Using AI feedback for alignment
Constitutional principles for behavior
Reduced reliance on human labelers

Practical Takeaways:

Principles can guide behavior
AI feedback can complement human feedback
Scalable alignment approach

Reading Time: 2 hours

Link: https://arxiv.org/abs/2212.08073

Direct Preference Optimization (DPO, 2023)

Citation: Rafailov et al., “Direct Preference Optimization: Your Language Model is Secretly a Reward Model”

Why Read It: DPO simplifies RLHF by directly optimizing preferences. Increasingly common alternative to PPO.

Key Contributions:

Closed-form expression for the optimal policy, which turns preference learning into a simple classification-style loss
No separate reward model needed
Simpler and more stable than PPO

Practical Takeaways:

Often preferred over PPO for simplicity
Still needs preference data
Good default for alignment fine-tuning

Reading Time: 2 hours (math-heavy)

Link: https://arxiv.org/abs/2305.18290
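For a single preference pair, the DPO objective reduces to a logistic loss on the difference of log-probability ratios between the policy and a frozen reference model. The sketch below computes that per-example loss from four sequence log-probabilities; the argument names and the example values are assumptions for illustration, and a training loop would average this over a batch and backpropagate through the policy terms.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)]) for one preference pair."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) = log(1 + exp(-x)), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Policy identical to the reference: no preference signal yet, so the loss is log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-4.0, -9.0, -5.0, -7.0)
```

Note that no reward model appears anywhere: the log-ratio against the reference plays the role of an implicit reward, and beta controls how far the policy may drift from the reference.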

Multimodal

Learning Transferable Visual Models From Natural Language Supervision (CLIP, 2021)

Citation: Radford et al., “Learning Transferable Visual Models From Natural Language Supervision”

Why Read It: CLIP connected vision and language understanding. Foundation for multimodal models.

Key Contributions:

Contrastive learning across modalities
Zero-shot image classification
Web-scale training data

Practical Takeaways:

Text-image alignment enables new applications
Contrastive learning is powerful for embedding spaces
CLIP encoders are reused in later generative systems such as DALL-E 2 and Stable Diffusion

Reading Time: 2 hours

Link: https://arxiv.org/abs/2103.00020

Visual Instruction Tuning (LLaVA, 2023)

Citation: Liu et al., “Visual Instruction Tuning”

Why Read It: LLaVA shows how to create multimodal instruction-following models. Practical for vision-language tasks.

Key Contributions:

Visual instruction tuning methodology
GPT-4 assisted data generation
Strong performance with simple architecture

Practical Takeaways:

Vision encoders + LLMs work well together
Instruction data quality matters
Relatively simple to replicate

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2304.08485

Evaluation

HELM: Holistic Evaluation of Language Models (2022)

Citation: Liang et al., “Holistic Evaluation of Language Models”

Why Read It: HELM provides comprehensive evaluation methodology. Useful for understanding what metrics matter.

Key Contributions:

Multi-scenario evaluation framework
Standardized evaluation methodology
Metrics beyond accuracy (calibration, fairness, etc.)

Practical Takeaways:

Single metrics don’t capture model quality
Consider accuracy, calibration, robustness, fairness, efficiency
Evaluation methodology should be explicit

Reading Time: Long paper; skim for methodology

Link: https://arxiv.org/abs/2211.09110

Judging LLM-as-a-Judge (2023)

Citation: Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”

Why Read It: Establishes when LLM judges are reliable. Essential for automated evaluation.

Key Contributions:

MT-Bench multi-turn benchmark
Analysis of LLM judge reliability
Position bias and other issues identified

Practical Takeaways:

LLM judges correlate well with humans for many tasks
Position bias is real—randomize order
Strong models make better judges

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2306.05685
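One way to act on the position-bias finding is to query the judge twice with the answers swapped and only declare a winner when both orderings agree. The sketch below assumes a hypothetical `judge(question, first, second)` callable that returns "first", "second", or "tie"; in practice that callable would wrap an LLM API call, which is stubbed out here with toy functions.

```python
def judge_both_orders(judge, question, answer_a, answer_b):
    """Return "A", "B", or "tie"; a win counts only if it holds in both orders."""
    forward = judge(question, answer_a, answer_b)
    backward = judge(question, answer_b, answer_a)
    # Map positional verdicts back to the underlying answers.
    forward_winner = {"first": "A", "second": "B"}.get(forward, "tie")
    backward_winner = {"first": "B", "second": "A"}.get(backward, "tie")
    return forward_winner if forward_winner == backward_winner else "tie"

# Stub judges for illustration only (a real judge would be an LLM call).
def position_biased_judge(question, first, second):
    return "first"  # always prefers whatever is shown first

def length_judge(question, first, second):
    return "first" if len(first) >= len(second) else "second"

biased_verdict = judge_both_orders(position_biased_judge, "q", "x", "yy")
consistent_verdict = judge_both_orders(length_judge, "q", "short", "a much longer answer")
```

A purely position-driven judge collapses to a tie under this scheme, while a judge with a consistent preference keeps its verdict. The cost is two judge calls per comparison.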

Systems and Infrastructure

Efficient Memory Management for Large Language Model Serving (PagedAttention/vLLM, 2023)

Citation: Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention”

Why Read It: Introduces PagedAttention, the key innovation in vLLM. Essential for understanding modern LLM serving.

Key Contributions:

Virtual memory for KV cache
Non-contiguous memory allocation
Block-level sharing for parallel sequences
2-4x throughput improvement

Practical Takeaways:

KV cache memory management is critical
Block-based allocation solves fragmentation
Block-based KV cache management is now common across LLM serving systems

Reading Time: 2 hours

Link: https://arxiv.org/abs/2309.06180
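The virtual-memory analogy can be sketched as a block table: each sequence maps its logical token positions onto fixed-size physical blocks that need not be contiguous, and blocks return to a shared free list when the sequence finishes. The class below is a toy bookkeeping model with invented names, not vLLM's implementation; it tracks only block ownership and omits the actual tensors, copy-on-write sharing, and preemption.

```python
class PagedKVCache:
    """Toy block-table allocator for KV cache slots."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        # Allocate a new block only when every existing block is full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("seq-1")   # 3 tokens -> 2 blocks
cache.append_token("seq-2")       # 1 token  -> 1 block
blocks_free_while_running = len(cache.free_blocks)
cache.free("seq-1")
blocks_free_after = len(cache.free_blocks)
```

Because memory is reserved one block at a time, the waste per sequence is bounded by a single partially filled block instead of a worst-case maximum-length reservation.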

TensorRT-LLM: A High-Performance Framework for Large Language Model Inference

Citation: NVIDIA TensorRT-LLM Technical Blog/Documentation

Why Read It: Understanding TensorRT-LLM optimizations helps with production performance engineering.

Key Contributions:

Kernel fusion and optimization
Multi-GPU inference patterns
INT8/FP8 quantization
Inflight batching

Practical Takeaways:

Maximum performance on NVIDIA hardware
Compilation-based optimization
Trade-off between optimization time and inference speed

Reading Time: 1.5 hours (documentation + blog posts)

Orca: A Distributed Serving System for Transformer-Based Generative Models (2022)

Citation: Yu et al., “Orca: A Distributed Serving System for Transformer-Based Generative Models”

Why Read It: Introduces iteration-level scheduling, the idea behind what is now usually called continuous batching and a standard feature of LLM serving systems.

Key Contributions:

Iteration-level scheduling
Continuous batching (vs. static batching)
Selective batching for different request lengths

Practical Takeaways:

Static batching wastes significant compute
Iteration-level scheduling maximizes utilization
Influenced vLLM, TGI, and most later serving systems

Reading Time: 2 hours

Link: https://www.usenix.org/conference/osdi22/presentation/yu

Data and Features

Hidden Technical Debt in Machine Learning Systems (2015)

Citation: Sculley et al., “Hidden Technical Debt in Machine Learning Systems,” NeurIPS 2015

Why Read It: Classic paper on ML systems debt. Still highly relevant. Required reading for ML engineers.

Key Contributions:

Identified ML-specific technical debt
Glue code, pipeline jungles, dead features
Configuration debt, feedback loops

Practical Takeaways:

Most ML code isn’t the model
Data and pipeline issues dominate
Proactive debt management is essential

Reading Time: 1.5 hours

Link: https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf

Automating Large-Scale Data Quality Verification (2018)

Citation: Schelter et al., “Automating Large-Scale Data Quality Verification”

Why Read It: Foundation for data quality in ML pipelines. Basis for Amazon’s Deequ library.

Key Contributions:

Declarative data quality rules
Incremental validation
Anomaly detection for data

Practical Takeaways:

Data testing is as important as code testing
Automation is essential at scale
Historical profiles enable anomaly detection

Reading Time: 1.5 hours

Link: https://www.vldb.org/pvldb/vol11/p1781-schelter.pdf

Data Management Challenges in Production Machine Learning (2017)

Citation: Polyzotis et al., “Data Management Challenges in Production Machine Learning”

Why Read It: Survey of data challenges in production ML. Provides vocabulary and frameworks for thinking about data issues.

Key Contributions:

Taxonomy of data management challenges
Training-serving skew analysis
Data lifecycle management

Practical Takeaways:

Data issues are the primary source of ML bugs
Validation at system boundaries is critical
Reproducibility requires data versioning

Reading Time: 2 hours

Link: https://dl.acm.org/doi/10.1145/3035918.3054782

Long Context and Memory

LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models (2023)

Citation: Chen et al., “LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models”

Why Read It: Shows how to efficiently extend context length. Relevant as context windows grow.

Key Contributions:

Shifted sparse attention for training
Efficient context extension
Position interpolation techniques

Practical Takeaways:

Context extension is cheaper than training from scratch
Sparse attention during training, dense during inference
100K+ context is achievable

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2309.12307

Retrieval-Augmented Generation vs Long-Context LLMs: A Fair Comparison (2024)

Citation: Various comparison studies

Why Read It: Understanding when to use RAG vs. long context is crucial for system design.

Key Contributions:

Empirical comparison of RAG vs. stuffing context
Cost-performance trade-offs
Task-dependent recommendations

Practical Takeaways:

Long context isn’t always better than RAG
Cost considerations favor RAG at scale
Hybrid approaches often optimal

Reading Time: 1.5 hours

Quantization

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (2022)

Citation: Frantar et al., “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers”

Why Read It: GPTQ enables running large models on consumer hardware. Essential for deployment.

Key Contributions:

One-shot quantization method
Column-by-column weight quantization with second-order (Hessian-based) error compensation
Minimal accuracy loss at 4-bit

Practical Takeaways:

4-bit quantization is practical
Post-training quantization is convenient
Some accuracy trade-off at extreme compression

Reading Time: 2 hours (math-heavy)

Link: https://arxiv.org/abs/2210.17323

AWQ: Activation-aware Weight Quantization (2023)

Citation: Lin et al., “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration”

Why Read It: AWQ often outperforms GPTQ. Understanding both helps with quantization decisions.

Key Contributions:

Activation-aware importance weights
No retraining needed
Better preservation of salient weights

Practical Takeaways:

Salient weights matter more—protect them
AWQ often better than GPTQ
4-bit with minimal quality loss

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2306.00978

Embeddings and Retrieval

E5: Text Embeddings by Weakly-Supervised Contrastive Pre-training (2022)

Citation: Wang et al., “Text Embeddings by Weakly-Supervised Contrastive Pre-training”

Why Read It: E5 embeddings are widely used. Understanding training methodology helps with embedding selection.

Key Contributions:

Weakly-supervised training on web data
Strong zero-shot performance
Instruction-tuned variants

Practical Takeaways:

Instruction prefix can improve retrieval
Quality of training data matters enormously
E5 variants available for different use cases

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2212.03533

ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction (2020)

Citation: Khattab and Zaharia, “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT”

Why Read It: ColBERT offers different retrieval trade-offs than single-vector approaches. Important for advanced retrieval.

Key Contributions:

Late interaction for better relevance
Efficient MaxSim operation
Balance between single-vector and cross-encoder

Practical Takeaways:

Token-level matching improves relevance
More storage but better accuracy than single-vector
Good for precision-critical applications

Reading Time: 2 hours

Link: https://arxiv.org/abs/2004.12832
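The MaxSim operation behind late interaction is compact enough to write out: for each query token embedding, take its maximum similarity over all document token embeddings, then sum those maxima. The sketch below uses raw dot products on toy vectors; the function names and data are assumptions for illustration, and real ColBERT uses normalized BERT token embeddings with an index for candidate generation.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_token_vecs, doc_token_vecs):
    """Sum over query tokens of the best-matching document token similarity."""
    return sum(
        max(dot(q, d) for d in doc_token_vecs)
        for q in query_token_vecs
    )

# Toy token embeddings (made up): the query has two "tokens".
query = [[1.0, 0.0], [0.0, 1.0]]
doc_full_match = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_partial_match = [[1.0, 0.0], [0.2, 0.1]]

score_full = maxsim_score(query, doc_full_match)
score_partial = maxsim_score(query, doc_partial_match)
```

Document token vectors can be precomputed and stored offline, which is why this costs more storage than a single-vector index but far less compute than running a cross-encoder over every candidate.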

Mixture of Experts

Switch Transformers: Scaling to Trillion Parameter Models (2021)

Citation: Fedus et al., “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”

Why Read It: MoE architectures are used in Mixtral and DeepSeek models and are widely reported (though not officially confirmed) to underlie GPT-4. Understanding them helps interpret model behavior.

Key Contributions:

Simplified MoE routing
Trillion-parameter models with manageable compute
Expert capacity balancing

Practical Takeaways:

MoE enables larger models with less compute
Load balancing is critical
Not all parameters active for each token

Reading Time: 2.5 hours

Link: https://arxiv.org/abs/2101.03961

Mixtral of Experts (2023)

Citation: Jiang et al. (Mistral AI), “Mixtral of Experts” (arXiv:2401.04088, January 2024; model weights released December 2023)

Why Read It: Practical MoE model that outperforms larger dense models. Represents production-ready MoE.

Key Contributions:

8 experts per layer with 2 routed per token (about 47B total parameters, roughly 13B active per token)
Strong performance per compute
Open weights

Practical Takeaways:

MoE is practical for deployment
Router overhead is manageable
Memory footprint is the full model size, since all experts must be loaded even though only 2 run per token

Reading Time: 1 hour (short paper/blog)

Link: https://arxiv.org/abs/2401.04088
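Top-2 routing can be sketched in a few lines: a router scores every expert, the two highest-scoring experts run, and their outputs are mixed using a softmax over just those two scores. In the sketch below the experts are hypothetical stand-in functions (each simply scales its input) and the router logits are supplied directly; in a real model both come from learned weights.

```python
import math

def top2_moe(x, router_logits, experts):
    """Route input x to the two highest-scoring experts and mix their outputs."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:2]
    # Softmax over the selected logits only, so the two gates sum to 1.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    z = sum(exps)
    gates = [e / z for e in exps]
    # Only the selected experts are evaluated; the other six do no work.
    outputs = [experts[i](x) for i in top]
    mixed = [sum(g * out[j] for g, out in zip(gates, outputs))
             for j in range(len(x))]
    return mixed, top

# Hypothetical experts: expert i multiplies its input by i.
experts = [lambda v, s=i: [s * a for a in v] for i in range(8)]
router_logits = [0.1, -1.0, 3.0, 0.0, 0.5, 3.0, -2.0, 1.0]

output, selected = top2_moe([2.0], router_logits, experts)
```

All eight experts must still be resident in memory because the selection changes per token, which is why the compute cost tracks the active parameters while the memory cost tracks the total.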

Production and Operations

Machine Learning Operations (MLOps): Overview, Definition, and Architecture (2022)

Citation: Kreuzberger et al., “Machine Learning Operations (MLOps): Overview, Definition, and Architecture”

Why Read It: Comprehensive survey of MLOps practices. Good for understanding the field holistically.

Key Contributions:

MLOps definition and principles
Architecture patterns
Automation levels

Practical Takeaways:

MLOps is more than just ML + DevOps
Automation is progressive
Monitoring and feedback are essential

Reading Time: 2 hours

Link: https://arxiv.org/abs/2205.02302

Monitoring Machine Learning Models in Production (2019)

Citation: Google Cloud ML Documentation and Blog Series

Why Read It: Practical guidance on production ML monitoring from Google’s experience.

Key Contributions:

Model monitoring best practices
Drift detection approaches
Alert design

Practical Takeaways:

Monitor inputs, outputs, and outcomes
Statistical tests for drift detection
Human-in-the-loop for edge cases

Reading Time: 2 hours (multiple articles)

Prompt Engineering and In-Context Learning

Large Language Models are Zero-Shot Reasoners (2022)

Citation: Kojima et al., “Large Language Models are Zero-Shot Reasoners”

Why Read It: Shows that “Let’s think step by step” dramatically improves reasoning without examples.

Key Contributions:

Zero-shot chain-of-thought prompting
Simple trigger phrases enable reasoning
Analysis of why it works

Practical Takeaways:

Add “Let’s think step by step” for reasoning tasks
No examples needed for basic CoT
Works better with larger models

Reading Time: 1 hour

Link: https://arxiv.org/abs/2205.11916

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? (2022)

Citation: Min et al., “Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?”

Why Read It: Analyzes what actually matters in few-shot examples. Surprising findings about label importance.

Key Contributions:

Format matters more than correct labels
Input distribution of examples is key
Label space specification is important

Practical Takeaways:

Examples teach format and task structure
Random labels work surprisingly well
Distribution coverage matters

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2202.12837

Calibrate Before Use: Improving Few-Shot Performance of Language Models (2021)

Citation: Zhao et al., “Calibrate Before Use: Improving Few-Shot Performance of Language Models”

Why Read It: Addresses biases in few-shot learning. Practical for improving prompt reliability.

Key Contributions:

Identified majority label, recency, and common token biases
Calibration technique to correct biases
Significant accuracy improvements

Practical Takeaways:

Example order affects predictions
Calibration via content-free inputs
Randomize and average when possible

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2102.09690
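The calibration idea is to measure the model's label probabilities on a content-free input (such as "N/A") and divide them out of each real prediction. The sketch below applies that correction to hand-written probabilities; the function name and the numbers are assumptions for illustration, and it renormalizes by the sum rather than applying the paper's exact formulation, which leaves the predicted label unchanged.

```python
def calibrate(label_probs, content_free_probs):
    """Divide out the bias measured on a content-free input, then renormalize."""
    adjusted = {
        label: label_probs[label] / content_free_probs[label]
        for label in label_probs
    }
    total = sum(adjusted.values())
    return {label: value / total for label, value in adjusted.items()}

# Made-up numbers: the prompt makes the model lean "positive" even on "N/A".
content_free = {"positive": 0.7, "negative": 0.3}
raw_prediction = {"positive": 0.6, "negative": 0.4}

calibrated = calibrate(raw_prediction, content_free)
raw_label = max(raw_prediction, key=raw_prediction.get)
calibrated_label = max(calibrated, key=calibrated.get)
```

Here the raw prediction says "positive", but it is less positive than the model's baseline lean for this prompt, so the calibrated prediction flips to "negative".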

Code Generation

Evaluating Large Language Models Trained on Code (Codex, 2021)

Citation: Chen et al., “Evaluating Large Language Models Trained on Code”

Why Read It: Introduces HumanEval benchmark and Codex. Foundation for understanding code LLMs.

Key Contributions:

HumanEval benchmark
Analysis of code generation capabilities
pass@k evaluation methodology

Practical Takeaways:

Functional correctness matters, not just syntax
Multiple samples improve pass rate
Temperature affects code quality

Reading Time: 2 hours

Link: https://arxiv.org/abs/2107.03374
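The pass@k metric is usually computed with the unbiased estimator from the paper: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k randomly chosen samples passes as 1 − C(n−c, k)/C(n, k). A minimal sketch, with a function name chosen for this example:

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a correct one.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

single_try = pass_at_k(n=10, c=3, k=1)
five_tries = pass_at_k(n=10, c=3, k=5)
```

With 3 correct samples out of 10, pass@1 is simply the fraction correct, while pass@5 is much higher. Per-problem estimates are then averaged across the benchmark.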

Self-Instruct: Aligning Language Models with Self-Generated Instructions (2022)

Citation: Wang et al., “Self-Instruct: Aligning Language Models with Self-Generated Instructions”

Why Read It: Shows how to generate instruction data using models themselves. Practical for data generation.

Key Contributions:

Self-instruction generation pipeline
Filtering for quality
Reduced human annotation needs

Practical Takeaways:

Models can generate their own training data
Quality filtering is essential
Useful for domain-specific fine-tuning

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2212.10560

Robustness and Safety

Universal and Transferable Adversarial Attacks on Aligned Language Models (2023)

Citation: Zou et al., “Universal and Transferable Adversarial Attacks on Aligned Language Models”

Why Read It: Important for understanding LLM vulnerabilities. Motivates safety measures.

Key Contributions:

Automated adversarial suffix generation
Transfer across models
Bypasses safety training

Practical Takeaways:

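Safety alignment isn’t robust to adversarially optimized inputs
Perplexity filtering can catch the gibberish-like suffixes this attack produces, but it is not a complete defense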
Defense requires multiple layers

Reading Time: 2 hours

Link: https://arxiv.org/abs/2307.15043

Red Teaming Language Models with Language Models (2022)

Citation: Perez et al., “Red Teaming Language Models with Language Models”

Why Read It: Describes automated red teaming approaches. Practical for safety testing.

Key Contributions:

LLM-based red teaming methodology
Diverse attack generation
Scaling red team efforts

Practical Takeaways:

Automation enables broader coverage
Diversity of attacks matters
Continuous red teaming needed

Reading Time: 1.5 hours

Link: https://arxiv.org/abs/2202.03286

Recent Advances (2024-2025)

The pace of development in 2024-2025 has been extraordinary. These papers represent the most impactful advances during this period.

Llama 3 and Open Foundation Models

Citation: Meta AI, “Llama 3” (April 2024) and “Llama 3.1” (July 2024); see also “The Llama 3 Herd of Models” technical report (arXiv:2407.21783)

Why Read It: Llama 3 pushed open-weight models to near-parity with proprietary alternatives. The Llama 3.1 70B and 405B models are competitive with GPT-4 on many benchmarks.

Key Contributions:

Scaling to 405B parameters with open weights (Llama 3.1, July 2024)
128K context window in Llama 3.1 (up from 8K in the original Llama 3)
Improved training data curation and quality
Better instruction-following and reasoning

Practical Takeaways:

Open models are viable for production use
70B models hit the sweet spot for many applications
Data quality matters as much as scale

Link: https://ai.meta.com/blog/meta-llama-3/

Claude 3 and Constitutional AI Advances

Citation: Anthropic, “Claude 3 Model Card” (2024)

Why Read It: Claude 3 (Opus, Sonnet, Haiku) demonstrated significant capability improvements while maintaining safety. It also introduced vision (image input) capabilities to the Claude family.

Key Contributions:

Native multimodal understanding
Improved long-context performance (200K tokens)
Better instruction following and reduced refusals

Practical Takeaways:

Tiered model offerings enable cost optimization
Long context enables new use cases
Vision capabilities are production-ready

Gemini and Long Context

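Citation: Google DeepMind, “Gemini: A Family of Highly Capable Multimodal Models” (2023-2024); the 1M+ token context window is described in the separate Gemini 1.5 technical report (arXiv:2403.05530)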

Why Read It: Gemini introduced native multimodality and pushed context windows to 1M+ tokens.

Key Contributions:

Native multimodal training from the start
1M+ token context window (Gemini 1.5)
Strong performance across modalities

Practical Takeaways:

Truly long context changes application architecture
Native multimodal may outperform adapter approaches
Context window isn’t free—understand cost implications

Link: https://arxiv.org/abs/2312.11805

Test-Time Compute and Reasoning (o1-style)

Citation: Various papers on inference-time scaling, including OpenAI’s o1 approach

Why Read It: These works mark a paradigm shift toward spending more compute at inference time (thinking longer) rather than only training larger models.

Key Contributions:

Chain-of-thought at inference time improves reasoning
Self-consistency and verification during generation
Trading inference compute for capability

Practical Takeaways:

Longer thinking time can substitute for larger models
Useful for high-stakes decisions worth the latency
Different cost model than traditional inference

Mixture of Experts at Scale (Mixtral, DeepSeek)

Citation: Mistral AI, “Mixtral of Experts” (2024); DeepSeek-V2 (2024)

Why Read It: MoE architectures achieved excellent performance-per-compute, making very capable models more accessible.

Key Contributions:

Efficient expert routing at scale
Mixtral 8x7B outperforms larger dense models
DeepSeek-V2 pushes efficiency further

Practical Takeaways:

MoE enables larger total parameter counts at similar per-token compute
Only a subset of parameters is active per token, but all experts must still fit in memory
Great for throughput-focused deployments

Link (Mixtral): https://arxiv.org/abs/2401.04088
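A minimal sketch of top-k expert routing and the resulting gap between total and active parameters. The helper names and the parameter counts in the example are illustrative assumptions, not figures from the Mixtral or DeepSeek-V2 papers.

```python
import math
from typing import List, Tuple

def top_k_route(router_logits: List[float], k: int = 2) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]  # stable softmax
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

def param_counts(shared: float, per_expert: float,
                 n_experts: int, k: int) -> Tuple[float, float]:
    """Return (total, active-per-token) parameters for an MoE model."""
    total = shared + n_experts * per_expert   # must all be held in memory
    active = shared + k * per_expert          # used to process one token
    return total, active

# Illustrative numbers in billions of parameters (assumed, not from a paper).
print(param_counts(shared=2.0, per_expert=5.0, n_experts=8, k=2))  # (42.0, 12.0)
print(top_k_route([0.1, 2.0, -1.0, 2.0, 0.5, 0.0, 0.3, -0.2], k=2))
```

Compute per token follows the active count, while memory follows the total count, which is the distinction the takeaways above rely on.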

Structured Output and JSON Mode

Citation: OpenAI “Structured Outputs” announcement; various implementations

Why Read It: Native structured output support changed how we build LLM applications—more reliable than prompt-based JSON extraction.

Key Contributions:

Output guaranteed to parse as JSON and conform to the supplied schema (within the supported schema subset)
Schema enforcement during generation
Reduced post-processing and error handling

Practical Takeaways:

Use structured outputs for any programmatic consumption
Dramatically reduces parsing errors
Simplifies application code
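Even when a provider enforces the schema during generation, a cheap check at the consuming boundary is a reasonable safeguard. The sketch below uses only the standard library; `TICKET_SCHEMA` and `parse_structured` are hypothetical names, and the flat name-to-type mapping is a simplification of real JSON Schema.

```python
import json
from typing import Any, Dict

# Hypothetical schema: field name -> expected Python type after json.loads.
TICKET_SCHEMA: Dict[str, type] = {"title": str, "priority": int, "tags": list}

def parse_structured(raw: str, schema: Dict[str, type]) -> Dict[str, Any]:
    """Parse raw model output and check required keys and value types."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) if invalid
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, expected in schema.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        if not isinstance(data[key], expected):
            raise ValueError(f"field {key!r} should be {expected.__name__}")
    return data

raw_output = '{"title": "Login fails", "priority": 2, "tags": ["auth"]}'
ticket = parse_structured(raw_output, TICKET_SCHEMA)
print(ticket["priority"])  # 2
```

With prompt-based JSON extraction, the error branches above fire regularly; with schema-enforced generation they should become rare, which is where the reduction in error-handling code comes from.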

Model Merging and Composition

Citation: TIES-Merging (2023), DARE (2024), Model Soups

Why Read It: Merging fine-tuned models without additional training enables combining capabilities efficiently.

Key Contributions:

Task arithmetic with model weights
Merging techniques that reduce interference between models (for example, pruning small updates and resolving sign conflicts)
Combining specialist models into generalists

Practical Takeaways:

Cheaper than multi-task training
Useful for domain adaptation
Quality varies—evaluate merged models carefully

Link (TIES): https://arxiv.org/abs/2306.01708
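Task arithmetic, the simplest of these ideas, can be sketched as adding scaled weight differences back onto the base model. This is plain task arithmetic only: TIES and DARE add further steps on top of it that are not shown here. Weights are represented as a hypothetical dict of flat float lists for illustration.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]

def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Per-parameter difference between a fine-tuned model and its base."""
    return {name: [f - b for f, b in zip(finetuned[name], base[name])]
            for name in base}

def merge(base: Weights, finetuned_models: List[Weights], scale: float = 1.0) -> Weights:
    """merged = base + scale * sum of task vectors."""
    merged = {name: list(values) for name, values in base.items()}
    for ft in finetuned_models:
        tv = task_vector(ft, base)
        for name in merged:
            merged[name] = [m + scale * d for m, d in zip(merged[name], tv[name])]
    return merged

base = {"w": [1.0, 2.0]}
model_a = {"w": [1.5, 2.0]}   # hypothetical specialist A
model_b = {"w": [1.0, 1.0]}   # hypothetical specialist B
print(merge(base, [model_a, model_b]))  # {'w': [1.5, 1.0]}
```

In this toy case the two specialists changed different parameters, so the merge is clean. When task vectors overlap with opposite signs they partly cancel, which is the interference the cited methods try to reduce.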

Long Context Advances

Citation: Ring Attention (2024), LongRoPE, various context extension techniques

Why Read It: Understanding how context windows extended from 4K to 1M+ tokens informs architectural decisions.

Key Contributions:

Distributed attention across devices
Position interpolation for length extension
Memory-efficient attention variants

Practical Takeaways:

Long context has real costs (quadratic attention)
Retrieval may still beat very long context for some tasks
Test actual task performance at your target length, not just the advertised context window

Link (Ring Attention): https://arxiv.org/abs/2310.01889
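Position interpolation can be sketched for rotary position embeddings (RoPE) as rescaling positions so that a longer sequence maps back into the range seen during training. This is the uniform, linear variant; other methods in this family rescale non-uniformly. Function names and lengths are illustrative assumptions.

```python
from typing import List

def rope_angles(position: float, dim: int, base: float = 10000.0) -> List[float]:
    """Rotation angle for each pair of dimensions at a given position."""
    return [position * base ** (-2 * i / dim) for i in range(dim // 2)]

def interpolated_position(position: int, train_len: int, target_len: int) -> float:
    """Linearly squeeze positions in [0, target_len) into [0, train_len)."""
    return position * train_len / target_len

# Extending a model trained on 2,048 positions to 8,192 (illustrative lengths).
last = interpolated_position(8191, train_len=2048, target_len=8192)
print(last < 2048)                  # True: stays inside the trained range
print(rope_angles(last, dim=8)[0])  # first (fastest-rotating) angle
```

The model then sees fractional positions it was not trained on, which is why these methods are typically paired with a short fine-tuning run at the new length.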

Inference Optimization

Citation: Medusa (2024), EAGLE, Speculative Decoding advances

Why Read It: These methods report roughly 2-3x inference speedups with little or no quality loss, which matters directly for serving cost and latency.

Key Contributions:

Multiple draft heads for speculation
Better acceptance rates through architecture
Production-ready implementations

Practical Takeaways:

Speculative decoding is production-ready
Works best when draft model matches target well
Evaluate for your specific models and use cases

Link (Medusa): https://arxiv.org/abs/2401.10774
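The accept-or-resample rule behind standard draft-model speculative sampling can be sketched as follows. This is the basic rule for a single drafted token, not Medusa's multi-head variant; the distributions `p` (target) and `q` (draft) are toy values chosen for illustration.

```python
import random
from typing import List

def accept_or_resample(p: List[float], q: List[float],
                       draft_token: int, rng: random.Random) -> int:
    """Accept the drafted token with probability min(1, p/q); otherwise
    resample from the normalized positive part of (p - q)."""
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token
    # Rejection only happens when p < q at the drafted token, so some other
    # token has p > q and the residual weights are not all zero.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual, k=1)[0]

rng = random.Random(0)
p = [0.5, 0.3, 0.2]   # toy target-model distribution
q = [0.2, 0.2, 0.6]   # toy draft-model distribution
counts = [0, 0, 0]
for _ in range(10_000):
    drafted = rng.choices(range(3), weights=q, k=1)[0]
    counts[accept_or_resample(p, q, drafted, rng)] += 1
print([c / 10_000 for c in counts])  # empirical frequencies, close to p
```

The output follows the target distribution regardless of how poor the draft is, so draft quality affects only the acceptance rate and therefore the speedup.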

Multimodal and Video Understanding

Citation: GPT-4V technical report, LLaVA-NeXT, video understanding papers

Why Read It: Vision-language models matured significantly in 2024, enabling new applications.

Key Contributions:

Improved visual reasoning and OCR
Video understanding through frame sampling
Better integration of visual and textual reasoning

Practical Takeaways:

Vision is production-ready for many use cases
Frame sampling strategy matters for video
Combine with specialized vision models when needed
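As a baseline for the frame-sampling takeaway, uniform sampling picks one frame from the middle of each equal-length segment of the video. This is a minimal sketch with a hypothetical function name; real pipelines may instead sample by scene change or motion.

```python
from typing import List

def uniform_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Return evenly spread frame indices, one from the middle of each segment."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)  # cannot sample more than exist
    segment = total_frames / num_samples
    return [int(segment * (i + 0.5)) for i in range(num_samples)]

print(uniform_frame_indices(100, 4))  # [12, 37, 62, 87]
```

Uniform sampling can miss short events entirely, so it is worth comparing against a denser or content-aware strategy on your own videos.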

Note: The field continues to move quickly. For the latest developments, follow:

arxiv.org/list/cs.CL and cs.LG
Papers With Code
Major lab blogs (OpenAI, Anthropic, Google DeepMind, Meta AI)

Organizing Your Reading

Paper Notes Template

For each paper you read in depth, capture:

Paper: [Title]
Authors: [Names]
Link: [URL]
Date Read: [Date]

One-sentence summary:
[What is the key contribution?]

Key techniques:

- [Technique 1]
- [Technique 2]

Practical applications:

- [How I might use this]

Questions/limitations:

- [What's unclear or limited?]

Related papers:

- [Connected work]

Building a Personal Library

Core references: Papers you’ll cite repeatedly (~20 papers)
Deep dives: Papers you’ve studied carefully (~50 papers)
Awareness: Papers you’ve skimmed for context (~200 papers)

Reading Schedule

Sustainable approach for practicing engineers:

Weekly: 1 paper read carefully (2-3 hours)
Daily: Skim 2-3 abstracts (15 minutes)
Monthly: One deep dive into a foundational paper (4+ hours)

Reading Strategy Recommendations

For Engineers New to AI

Start with “Attention Is All You Need” for foundations
Read the BERT and GPT-2 papers to understand encoder-only versus decoder-only models
Study InstructGPT for alignment concepts
Then explore specific areas as needed

For Production-Focused Engineers

Priority order:

1. FlashAttention (performance)
2. LoRA/QLoRA (fine-tuning)
3. RAG paper (retrieval)
4. ReAct (agents)
5. DPO (alignment)

For Research-Adjacent Roles

Deep dive into:

1. Scaling laws papers
2. Latest arxiv releases in areas of focus
3. Benchmark papers for evaluation methodology

Time-Efficient Reading

Read abstract and introduction first
Skip to experiments/results if time-limited
Return to methods when implementing
Keep a paper notes system

Staying Current

Daily Sources

arxiv.org/list/cs.CL (NLP)
arxiv.org/list/cs.LG (machine learning)
Papers With Code newsletter

Weekly Summary Sources

ML Reddit communities
The Batch (Andrew Ng’s newsletter)
Import AI newsletter

Monthly Deep Dives

Pick 1-2 important papers for careful study
Implement key concepts when possible
Discuss with colleagues

Additional Recommended Papers by Topic

Transformers and Architecture

Paper	Year	Key Contribution
An Image is Worth 16x16 Words (ViT)	2020	Vision transformers
RoBERTa	2019	Optimized BERT pretraining
T5	2019	Text-to-text framework
PaLM	2022	Scaling and emergent abilities
LLaMA	2023	Open foundation models
Mistral 7B	2023	Efficient open models

Training and Optimization

Paper	Year	Key Contribution
Adam Optimizer	2014	Adaptive learning rates
Dropout	2014	Regularization technique
Batch Normalization	2015	Training stability
Layer Normalization	2016	Sequence model normalization
Mixed Precision Training	2017	Efficient training
ZeRO	2019	Memory-efficient distributed training

Retrieval and Search

Paper	Year	Key Contribution
BM25	1994	Term frequency retrieval
Sentence-BERT	2019	Sentence embeddings
Approximate Nearest Neighbors Oh Yeah (Annoy)	2013	Efficient vector search
Hierarchical Navigable Small World (HNSW)	2016	Graph-based ANN

Evaluation and Benchmarks

Paper	Year	Key Contribution
GLUE/SuperGLUE	2018/2019	NLU benchmarks
SQuAD	2016	Question answering benchmark
MMLU	2020	Multi-task language understanding
BIG-bench	2022	Diverse capability evaluation
TruthfulQA	2021	Factuality evaluation

Safety and Ethics

Paper	Year	Key Contribution
On the Dangers of Stochastic Parrots	2021	LLM risks and biases
Predictability and Surprise	2022	Emergent capabilities
Sleeper Agents	2024	Deceptive model behavior

Paper Finding Resources

Primary Sources

arxiv.org: Primary preprint server. Categories:

cs.CL (Computation and Language)
cs.LG (Machine Learning)
cs.AI (Artificial Intelligence)
stat.ML (Machine Learning - Statistics)

Papers With Code: Papers with implementation links. Good for finding reproducible work.

Semantic Scholar: Academic search with citation analysis. Good for finding influential papers.

Google Scholar: Broad academic search. Useful for citation counts.

Curated Lists

Awesome-LLM: GitHub repository tracking LLM papers

NLP Progress: Task-specific paper tracking

ML Papers Explained: Accessible paper summaries

Conference Proceedings

Major venues for AI/ML papers:

NeurIPS: Neural Information Processing Systems (December)
ICML: International Conference on Machine Learning (July)
ICLR: International Conference on Learning Representations (May)
ACL/EMNLP/NAACL: NLP conferences
CVPR/ICCV: Computer vision

Industry Sources

OpenAI Blog: Research announcements
Anthropic Research: Constitutional AI and safety
Google AI Blog: DeepMind and Google Research
Meta AI Blog: LLaMA and other releases
Hugging Face Blog: Open source developments

Building Research Intuition

Evaluating Paper Quality

Questions to ask when reading:

Is the problem well-motivated? Why should we care?
Is the approach novel? What’s the key insight?
Are the experiments convincing? Proper baselines? Ablations?
Are the claims appropriate? Does evidence support conclusions?
Is it reproducible? Code available? Clear methodology?

Identifying Important Papers

Signals of importance:

High citation count relative to age
Adopted by practitioners
Spawned follow-up work
Changed how people think about the problem
Has open-source implementation

Common Paper Patterns

Scaling paper: Shows how performance improves with scale (data, parameters, compute). Example: Scaling Laws for Neural Language Models.

Architecture paper: Introduces new model design. Example: Attention Is All You Need.

Technique paper: Introduces training or inference technique. Example: LoRA.

Benchmark paper: Introduces evaluation methodology. Example: HELM.

Analysis paper: Studies existing models/techniques. Example: Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

Application paper: Applies ML to specific domain. Example: AlphaFold.

Reading Difficult Papers

When the Math is Hard

First pass: Skip math, understand intuition
Second pass: Follow math at high level
Third pass: Work through key derivations
Implementation: Implement to solidify understanding

When the System is Complex

Architecture diagram: Draw it yourself
Data flow: Trace a single example through
Compare to simpler systems: What’s added/different?

When Results are Surprising

Check experimental setup: Are comparisons fair?
Look for ablations: What’s actually contributing?
Find replications: Have others reproduced?

Creating Value from Paper Reading

For Your Team

Share weekly paper summaries
Present papers at reading groups
Write internal “paper digests”
Connect papers to current projects

For Your Career

Build expertise in specific areas
Contribute to survey papers
Write blog posts explaining papers
Implement papers and open source

For the Community

Write accessible explanations
Create educational content
Point out limitations/issues
Propose extensions and improvements

# Appendix C: Paper Reading List (Annotated) {.unnumbered} This appendix provides an annotated reading list of essential papers for AI engineers. Papers are organized by topic and ordered by importance within each section. Each entry includes a summary of key contributions, practical takeaways, and reading time estimate. --- ## Foundational Papers These papers establish the conceptual foundations that every AI engineer should understand. ### Attention Is All You Need (2017) **Citation**: Vaswani et al., "Attention Is All You Need," NeurIPS 2017 **Why Read It**: This paper introduced the Transformer architecture that underlies virtually all modern LLMs. Understanding self-attention, multi-head attention, and positional encoding is fundamental. **Key Contributions**: - Self-attention mechanism that processes sequences without recurrence - Multi-head attention for attending to different representation subspaces - Scaled dot-product attention with the √d_k scaling factor - Positional encoding to inject sequence order information **Practical Takeaways**: - Attention complexity is O(n²) with sequence length—this drives context window limitations - Encoder-decoder architecture is optional; decoder-only (GPT) and encoder-only (BERT) variants emerged - The feedforward network after attention is often where capacity resides **Reading Time**: 2-3 hours for initial read, deeper understanding takes longer **Link**: https://arxiv.org/abs/1706.03762 --- ### BERT: Pre-training of Deep Bidirectional Transformers (2018) **Citation**: Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," NAACL 2019 **Why Read It**: BERT demonstrated that pre-training on large text corpora, followed by fine-tuning, dramatically improves performance on downstream tasks. This transfer learning paradigm now dominates NLP. 
**Key Contributions**: - Masked language modeling (MLM) pre-training objective - Bidirectional context (unlike left-to-right GPT) - Next sentence prediction (NSP) task (later shown less important) - Fine-tuning paradigm for transfer learning **Practical Takeaways**: - Pre-training + fine-tuning is more effective than training from scratch - Bidirectional models excel at understanding tasks (classification, extraction) - But they can't generate text autoregressively **Reading Time**: 2 hours **Link**: https://arxiv.org/abs/1810.04805 --- ### Language Models are Unsupervised Multitask Learners (GPT-2, 2019) **Citation**: Radford et al., "Language Models are Unsupervised Multitask Learners" **Why Read It**: GPT-2 demonstrated that scaling language models enables zero-shot task performance. It introduced the paradigm of prompting models rather than fine-tuning them. **Key Contributions**: - Zero-shot learning through prompting - Demonstration that scale improves capability - Decoder-only architecture for generation **Practical Takeaways**: - Larger models can perform tasks they weren't explicitly trained for - Prompting is a form of "soft" programming - Model capacity matters significantly **Reading Time**: 1.5 hours **Link**: https://openai.com/index/better-language-models/ --- ### Training language models to follow instructions (InstructGPT, 2022) **Citation**: Ouyang et al., "Training language models to follow instructions with human feedback" **Why Read It**: This paper describes how to align language models to follow human instructions using RLHF. It's the foundation of ChatGPT and similar assistants. 
**Key Contributions**: - Three-stage alignment process: supervised fine-tuning, reward modeling, PPO - Demonstration that smaller aligned models outperform larger unaligned models - Human preference data collection methodology **Practical Takeaways**: - Alignment is as important as capability - Human feedback is expensive but valuable - Reward hacking is a real risk **Reading Time**: 2-3 hours **Link**: https://arxiv.org/abs/2203.02155 --- ## Scaling and Training ### Scaling Laws for Neural Language Models (2020) **Citation**: Kaplan et al., "Scaling Laws for Neural Language Models" **Why Read It**: Establishes power-law relationships between model size, data, compute, and loss. Essential for planning training runs and resource allocation. **Key Contributions**: - Power-law scaling of loss with parameters, data, and compute - Optimal allocation of compute budget between model size and data - Early stopping recommendations **Practical Takeaways**: - Double parameters → ~constant loss reduction - Compute-optimal training has specific ratios - These laws guide billion-dollar training decisions **Reading Time**: 2 hours (heavy on math) **Link**: https://arxiv.org/abs/2001.08361 --- ### Training Compute-Optimal Large Language Models (Chinchilla, 2022) **Citation**: Hoffmann et al., "Training Compute-Optimal Large Language Models" **Why Read It**: Revised scaling laws showing that models should be trained on more data than previously thought. Influenced all subsequent LLM training. 
**Key Contributions**: - Revised optimal data-to-parameters ratio (~20 tokens per parameter) - Showed GPT-3 was undertrained for its size - Chinchilla (70B) outperformed Gopher (280B) with same compute **Practical Takeaways**: - More data often beats more parameters - Optimal ratios changed industry training practices - Data quality and quantity both matter enormously **Reading Time**: 2 hours **Link**: https://arxiv.org/abs/2203.15556 --- ### FlashAttention: Fast and Memory-Efficient Exact Attention (2022) **Citation**: Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" **Why Read It**: FlashAttention is now standard in all production systems. Understanding its memory hierarchy optimization is valuable for performance engineering. **Key Contributions**: - Fused attention kernels that minimize HBM reads/writes - Tiling approach that fits computation in SRAM - Significant speedup and memory reduction **Practical Takeaways**: - Memory bandwidth, not compute, is often the bottleneck - Algorithm-level optimization can beat hardware upgrades - Always use FlashAttention in production **Reading Time**: 2-3 hours (systems-heavy) **Link**: https://arxiv.org/abs/2205.14135 --- ## Retrieval and RAG ### Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020) **Citation**: Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" **Why Read It**: Introduced RAG, which is now fundamental to production LLM applications. Combines retrieval with generation for better factuality. 
**Key Contributions**: - Architecture combining retriever and generator - Marginalization over retrieved documents - Significant improvements on knowledge tasks **Practical Takeaways**: - Retrieval grounds generation in facts - Quality of retrieved context directly impacts output quality - RAG is more efficient than putting everything in context **Reading Time**: 2 hours **Link**: https://arxiv.org/abs/2005.11401 --- ### Dense Passage Retrieval for Open-Domain Question Answering (2020) **Citation**: Karpukhin et al., "Dense Passage Retrieval for Open-Domain Question Answering" **Why Read It**: DPR showed that learned dense retrieval can outperform BM25. Foundation for modern semantic search. **Key Contributions**: - Bi-encoder architecture for dense retrieval - In-batch negatives for efficient training - Strong QA results with simple architecture **Practical Takeaways**: - Dense retrieval captures semantic similarity that BM25 misses - Hard negatives improve retrieval quality - Hybrid (dense + sparse) often works best **Reading Time**: 1.5 hours **Link**: https://arxiv.org/abs/2004.04906 --- ### Self-RAG: Learning to Retrieve, Generate, and Critique (2023) **Citation**: Asai et al., "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection" **Why Read It**: Advanced RAG that teaches models to decide when to retrieve and self-critique. Shows direction of RAG evolution. 
**Key Contributions**: - Model learns when retrieval is needed - Self-reflection for assessing generation quality - Special tokens for retrieval decisions **Practical Takeaways**: - Not every query needs retrieval - Self-critique can improve output quality - Training for retrieval decisions is possible **Reading Time**: 2 hours **Link**: https://arxiv.org/abs/2310.11511 --- ## Efficiency and Optimization ### LoRA: Low-Rank Adaptation of Large Language Models (2021) **Citation**: Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" **Why Read It**: LoRA made fine-tuning large models practical by training only a small number of parameters. Now standard for LLM adaptation. **Key Contributions**: - Low-rank decomposition for weight updates - No additional inference latency - Dramatic reduction in trainable parameters **Practical Takeaways**: - Fine-tune 7B+ models on consumer GPUs - r=8-16 typically sufficient - Can merge adapters back into base weights **Reading Time**: 1.5 hours **Link**: https://arxiv.org/abs/2106.09685 --- ### QLoRA: Efficient Finetuning of Quantized LLMs (2023) **Citation**: Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs" **Why Read It**: Combines 4-bit quantization with LoRA for even more efficient fine-tuning. Enabled fine-tuning 65B models on single GPUs. **Key Contributions**: - 4-bit NormalFloat quantization - Double quantization for memory savings - Paged optimizers for memory spikes **Practical Takeaways**: - Fine-tune 65B models on single 48GB GPU (per the paper; modern 70B models also fit) - Quality close to full fine-tuning - Standard tool for resource-constrained fine-tuning **Reading Time**: 2 hours **Link**: https://arxiv.org/abs/2305.14314 --- ### Speculative Decoding (2022-2023) **Citation**: Leviathan et al., "Fast Inference from Transformers via Speculative Decoding" **Why Read It**: Speculative decoding provides 2-3x inference speedup with no quality loss. 
Understanding why it works helps with other optimizations. **Key Contributions**: - Draft-then-verify paradigm - Acceptance probability analysis - Parallel verification of multiple tokens **Practical Takeaways**: - Works best when draft model is well-aligned with target - Speedup depends on acceptance rate - No quality degradation (exact sampling) **Reading Time**: 1.5 hours **Link**: https://arxiv.org/abs/2211.17192 --- ## Agents and Reasoning ### ReAct: Synergizing Reasoning and Acting in Language Models (2022) **Citation**: Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models" **Why Read It**: ReAct is the foundation for most LLM agent architectures. Interleaves reasoning traces with actions. **Key Contributions**: - Thought-action-observation loop - Reasoning traces improve action quality - Generalizable agent architecture **Practical Takeaways**: - Chain-of-thought + tool use > either alone - Explicit reasoning helps debugging - Foundation for production agents **Reading Time**: 1.5 hours **Link**: https://arxiv.org/abs/2210.03629 --- ### Chain-of-Thought Prompting Elicits Reasoning (2022) **Citation**: Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" **Why Read It**: Chain-of-thought prompting dramatically improves reasoning. Simple technique with large impact. **Key Contributions**: - Demonstrated reasoning improvements from step-by-step thinking - Showed emergent capability at scale - Simple prompting technique **Practical Takeaways**: - "Let's think step by step" improves complex tasks - Works best on reasoning tasks - Smaller models need examples; larger can do zero-shot **Reading Time**: 1 hour **Link**: https://arxiv.org/abs/2201.11903 --- ### Toolformer: Language Models Can Teach Themselves to Use Tools (2023) **Citation**: Schick et al., "Toolformer: Language Models Can Teach Themselves to Use Tools" **Why Read It**: Shows how to train models to use external tools. 
Foundation for understanding tool-use training. **Key Contributions**: - Self-supervised tool use training - API call insertion and filtering - Works with various tools **Practical Takeaways**: - Models can learn when tools help - Training data generation is key - Different from pure prompting approaches **Reading Time**: 2 hours **Link**: https://arxiv.org/abs/2302.04761 --- ## Safety and Alignment ### Constitutional AI: Harmlessness from AI Feedback (2022) **Citation**: Bai et al., "Constitutional AI: Harmlessness from AI Feedback" **Why Read It**: Describes Anthropic's approach to alignment using principles. Alternative to pure human feedback. **Key Contributions**: - Using AI feedback for alignment - Constitutional principles for behavior - Reduced reliance on human labelers **Practical Takeaways**: - Principles can guide behavior - AI feedback can complement human feedback - Scalable alignment approach **Reading Time**: 2 hours **Link**: https://arxiv.org/abs/2212.08073 --- ### Direct Preference Optimization (DPO, 2023) **Citation**: Rafailov et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" **Why Read It**: DPO simplifies RLHF by directly optimizing preferences. Increasingly common alternative to PPO. **Key Contributions**: - Closed-form solution for preference learning - No separate reward model needed - Simpler and more stable than PPO **Practical Takeaways**: - Often preferred over PPO for simplicity - Still needs preference data - Good default for alignment fine-tuning **Reading Time**: 2 hours (math-heavy) **Link**: https://arxiv.org/abs/2305.18290 --- ## Multimodal ### Learning Transferable Visual Models From Natural Language Supervision (CLIP, 2021) **Citation**: Radford et al., "Learning Transferable Visual Models From Natural Language Supervision" **Why Read It**: CLIP connected vision and language understanding. Foundation for multimodal models. 
**Key Contributions**: - Contrastive learning across modalities - Zero-shot image classification - Web-scale training data **Practical Takeaways**: - Text-image alignment enables new applications - Contrastive learning is powerful for embedding spaces - Foundation for DALL-E, Stable Diffusion, etc. **Reading Time**: 2 hours **Link**: https://arxiv.org/abs/2103.00020 --- ### Visual Instruction Tuning (LLaVA, 2023) **Citation**: Liu et al., "Visual Instruction Tuning" **Why Read It**: LLaVA shows how to create multimodal instruction-following models. Practical for vision-language tasks. **Key Contributions**: - Visual instruction tuning methodology - GPT-4 assisted data generation - Strong performance with simple architecture **Practical Takeaways**: - Vision encoders + LLMs work well together - Instruction data quality matters - Relatively simple to replicate **Reading Time**: 1.5 hours **Link**: https://arxiv.org/abs/2304.08485 --- ## Evaluation ### HELM: Holistic Evaluation of Language Models (2022) **Citation**: Liang et al., "Holistic Evaluation of Language Models" **Why Read It**: HELM provides comprehensive evaluation methodology. Useful for understanding what metrics matter. **Key Contributions**: - Multi-scenario evaluation framework - Standardized evaluation methodology - Metrics beyond accuracy (calibration, fairness, etc.) **Practical Takeaways**: - Single metrics don't capture model quality - Consider accuracy, calibration, robustness, fairness, efficiency - Evaluation methodology should be explicit **Reading Time**: Long paper; skim for methodology **Link**: https://arxiv.org/abs/2211.09110 --- ### Judging LLM-as-a-Judge (2023) **Citation**: Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" **Why Read It**: Establishes when LLM judges are reliable. Essential for automated evaluation. 
**Key Contributions**: - MT-Bench multi-turn benchmark - Analysis of LLM judge reliability - Position bias and other issues identified **Practical Takeaways**: - LLM judges correlate well with humans for many tasks - Position bias is real—randomize order - Strong models make better judges **Reading Time**: 1.5 hours **Link**: https://arxiv.org/abs/2306.05685 --- ## Systems and Infrastructure ### Efficient Memory Management for Large Language Model Serving (PagedAttention/vLLM, 2023) **Citation**: Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" **Why Read It**: Introduces PagedAttention, the key innovation in vLLM. Essential for understanding modern LLM serving. **Key Contributions**: - Virtual memory for KV cache - Non-contiguous memory allocation - Block-level sharing for parallel sequences - 2-4x throughput improvement **Practical Takeaways**: - KV cache memory management is critical - Block-based allocation solves fragmentation - Foundation for all modern serving systems **Reading Time**: 2 hours **Link**: https://arxiv.org/abs/2309.06180 --- ### TensorRT-LLM: A High-Performance Framework for Large Language Model Inference **Citation**: NVIDIA TensorRT-LLM Technical Blog/Documentation **Why Read It**: Understanding TensorRT-LLM optimizations helps with production performance engineering. **Key Contributions**: - Kernel fusion and optimization - Multi-GPU inference patterns - INT8/FP8 quantization - Inflight batching **Practical Takeaways**: - Maximum performance on NVIDIA hardware - Compilation-based optimization - Trade-off between optimization time and inference speed **Reading Time**: 1.5 hours (documentation + blog posts) --- ### Orca: A Distributed Serving System for Transformer-Based Generative Models (2022) **Citation**: Yu et al., "Orca: A Distributed Serving System for Transformer-Based Generative Models" **Why Read It**: Introduces continuous batching, now standard in all LLM serving systems. 
**Key Contributions**: - Iteration-level scheduling - Continuous batching (vs. static batching) - Selective batching for different request lengths **Practical Takeaways**: - Static batching wastes significant compute - Iteration-level scheduling maximizes utilization - Influenced vLLM, TGI, and all modern systems **Reading Time**: 2 hours **Link**: https://www.usenix.org/conference/osdi22/presentation/yu --- ## Data and Features ### Hidden Technical Debt in Machine Learning Systems (2015) **Citation**: Sculley et al., "Hidden Technical Debt in Machine Learning Systems," NeurIPS 2015 **Why Read It**: Classic paper on ML systems debt. Still highly relevant. Required reading for ML engineers. **Key Contributions**: - Identified ML-specific technical debt - Glue code, pipeline jungles, dead features - Configuration debt, feedback loops **Practical Takeaways**: - Most ML code isn't the model - Data and pipeline issues dominate - Proactive debt management is essential **Reading Time**: 1.5 hours **Link**: https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf --- ### Automating Large-Scale Data Quality Verification (2018) **Citation**: Schelter et al., "Automating Large-Scale Data Quality Verification" **Why Read It**: Foundation for data quality in ML pipelines. Basis for Amazon's Deequ library. **Key Contributions**: - Declarative data quality rules - Incremental validation - Anomaly detection for data **Practical Takeaways**: - Data testing is as important as code testing - Automation is essential at scale - Historical profiles enable anomaly detection **Reading Time**: 1.5 hours **Link**: https://www.vldb.org/pvldb/vol11/p1781-schelter.pdf --- ### Data Management Challenges in Production Machine Learning (2017) **Citation**: Polyzotis et al., "Data Management Challenges in Production Machine Learning" **Why Read It**: Survey of data challenges in production ML. Provides vocabulary and frameworks for thinking about data issues. 
**Key Contributions**: - Taxonomy of data management challenges - Training-serving skew analysis - Data lifecycle management **Practical Takeaways**: - Data issues are the primary source of ML bugs - Validation at system boundaries is critical - Reproducibility requires data versioning **Reading Time**: 2 hours **Link**: https://dl.acm.org/doi/10.1145/3035918.3054782 --- ## Long Context and Memory ### LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models (2023) **Citation**: Chen et al., "LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models" **Why Read It**: Shows how to efficiently extend context length. Relevant as context windows grow. **Key Contributions**: - Shifted sparse attention for training - Efficient context extension - Position interpolation techniques **Practical Takeaways**: - Context extension is cheaper than training from scratch - Sparse attention during training, dense during inference - 100K+ context is achievable **Reading Time**: 1.5 hours **Link**: https://arxiv.org/abs/2309.12307 --- ### Retrieval-Augmented Generation vs Long-Context LLMs: A Fair Comparison (2024) **Citation**: Various comparison studies **Why Read It**: Understanding when to use RAG vs. long context is crucial for system design. **Key Contributions**: - Empirical comparison of RAG vs. stuffing context - Cost-performance trade-offs - Task-dependent recommendations **Practical Takeaways**: - Long context isn't always better than RAG - Cost considerations favor RAG at scale - Hybrid approaches often optimal **Reading Time**: 1.5 hours --- ## Quantization ### GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (2022) **Citation**: Frantar et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers" **Why Read It**: GPTQ enables running large models on consumer hardware. Essential for deployment. 
**Key Contributions**: - One-shot quantization method - Per-column quantization - Minimal accuracy loss at 4-bit **Practical Takeaways**: - 4-bit quantization is practical - Post-training quantization is convenient - Some accuracy trade-off at extreme compression **Reading Time**: 2 hours (math-heavy) **Link**: https://arxiv.org/abs/2210.17323 --- ### AWQ: Activation-aware Weight Quantization (2023) **Citation**: Lin et al., "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration" **Why Read It**: AWQ often outperforms GPTQ. Understanding both helps with quantization decisions. **Key Contributions**: - Activation-aware importance weights - No retraining needed - Better preservation of salient weights **Practical Takeaways**: - Salient weights matter more—protect them - AWQ often better than GPTQ - 4-bit with minimal quality loss **Reading Time**: 1.5 hours **Link**: https://arxiv.org/abs/2306.00978 --- ## Embeddings and Retrieval ### E5: Text Embeddings by Weakly-Supervised Contrastive Pre-training (2022) **Citation**: Wang et al., "Text Embeddings by Weakly-Supervised Contrastive Pre-training" **Why Read It**: E5 embeddings are widely used. Understanding training methodology helps with embedding selection. **Key Contributions**: - Weakly-supervised training on web data - Strong zero-shot performance - Instruction-tuned variants **Practical Takeaways**: - Instruction prefix can improve retrieval - Quality of training data matters enormously - E5 variants available for different use cases **Reading Time**: 1.5 hours **Link**: https://arxiv.org/abs/2212.03533 --- ### ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction (2020) **Citation**: Khattab and Zaharia, "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT" **Why Read It**: ColBERT offers different retrieval trade-offs than single-vector approaches. Important for advanced retrieval. 
**Key Contributions**:

- Late interaction for better relevance
- Efficient MaxSim operation
- Balance between single-vector and cross-encoder

**Practical Takeaways**:

- Token-level matching improves relevance
- More storage but better accuracy than single-vector
- Good for precision-critical applications

**Reading Time**: 2 hours

**Link**: https://arxiv.org/abs/2004.12832

---

## Mixture of Experts

### Switch Transformers: Scaling to Trillion Parameter Models (2021)

**Citation**: Fedus et al., "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity"

**Why Read It**: MoE architectures are used in Mixtral, reportedly in GPT-4, and are likely to appear in future models. Understanding them helps interpret model behavior.

**Key Contributions**:

- Simplified MoE routing
- Trillion-parameter models with manageable compute
- Expert capacity balancing

**Practical Takeaways**:

- MoE enables larger models with less compute
- Load balancing is critical
- Not all parameters active for each token

**Reading Time**: 2.5 hours

**Link**: https://arxiv.org/abs/2101.03961

---

### Mixtral of Experts (2023)

**Citation**: Mistral AI, "Mixtral of Experts"

**Why Read It**: Practical MoE model that outperforms larger dense models. Represents production-ready MoE.

**Key Contributions**:

- 8x7B MoE with 2 experts active
- Strong performance per compute
- Open weights

**Practical Takeaways**:

- MoE is practical for deployment
- Router overhead is manageable
- Memory usage is full model size

**Reading Time**: 1 hour (short paper/blog)

**Link**: https://arxiv.org/abs/2401.04088

---

## Production and Operations

### Machine Learning Operations (MLOps): Overview, Definition, and Architecture (2022)

**Citation**: Kreuzberger et al., "Machine Learning Operations (MLOps): Overview, Definition, and Architecture"

**Why Read It**: Comprehensive survey of MLOps practices. Good for understanding the field holistically.

**Key Contributions**:

- MLOps definition and principles
- Architecture patterns
- Automation levels

**Practical Takeaways**:

- MLOps is more than just ML + DevOps
- Automation is progressive
- Monitoring and feedback are essential

**Reading Time**: 2 hours

**Link**: https://arxiv.org/abs/2205.02302

---

### Monitoring Machine Learning Models in Production (2019)

**Citation**: Google Cloud ML Documentation and Blog Series

**Why Read It**: Practical guidance on production ML monitoring from Google's experience.

**Key Contributions**:

- Model monitoring best practices
- Drift detection approaches
- Alert design

**Practical Takeaways**:

- Monitor inputs, outputs, and outcomes
- Statistical tests for drift detection
- Human-in-the-loop for edge cases

**Reading Time**: 2 hours (multiple articles)

---

## Prompt Engineering and In-Context Learning

### Large Language Models are Zero-Shot Reasoners (2022)

**Citation**: Kojima et al., "Large Language Models are Zero-Shot Reasoners"

**Why Read It**: Shows that "Let's think step by step" dramatically improves reasoning without examples.

**Key Contributions**:

- Zero-shot chain-of-thought prompting
- Simple trigger phrases enable reasoning
- Analysis of why it works

**Practical Takeaways**:

- Add "Let's think step by step" for reasoning tasks
- No examples needed for basic CoT
- Works better with larger models

**Reading Time**: 1 hour

**Link**: https://arxiv.org/abs/2205.11916

---

### Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? (2022)

**Citation**: Min et al., "Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?"

**Why Read It**: Analyzes what actually matters in few-shot examples. Surprising findings about label importance.

**Key Contributions**:

- Format matters more than correct labels
- Input distribution of examples is key
- Label space specification is important

**Practical Takeaways**:

- Examples teach format and task structure
- Random labels work surprisingly well
- Distribution coverage matters

**Reading Time**: 1.5 hours

**Link**: https://arxiv.org/abs/2202.12837

---

### Calibrate Before Use: Improving Few-Shot Performance of Language Models (2021)

**Citation**: Zhao et al., "Calibrate Before Use: Improving Few-Shot Performance of Language Models"

**Why Read It**: Addresses biases in few-shot learning. Practical for improving prompt reliability.

**Key Contributions**:

- Identified majority label, recency, and common token biases
- Calibration technique to correct biases
- Significant accuracy improvements

**Practical Takeaways**:

- Example order affects predictions
- Calibration via content-free inputs
- Randomize and average when possible

**Reading Time**: 1.5 hours

**Link**: https://arxiv.org/abs/2102.09690

---

## Code Generation

### Evaluating Large Language Models Trained on Code (Codex, 2021)

**Citation**: Chen et al., "Evaluating Large Language Models Trained on Code"

**Why Read It**: Introduces HumanEval benchmark and Codex. Foundation for understanding code LLMs.

**Key Contributions**:

- HumanEval benchmark
- Analysis of code generation capabilities
- pass@k evaluation methodology

**Practical Takeaways**:

- Functional correctness matters, not just syntax
- Multiple samples improve pass rate
- Temperature affects code quality

**Reading Time**: 2 hours

**Link**: https://arxiv.org/abs/2107.03374

---

### Self-Instruct: Aligning Language Models with Self-Generated Instructions (2022)

**Citation**: Wang et al., "Self-Instruct: Aligning Language Models with Self-Generated Instructions"

**Why Read It**: Shows how to generate instruction data using models themselves. Practical for data generation.

**Key Contributions**:

- Self-instruction generation pipeline
- Filtering for quality
- Reduced human annotation needs

**Practical Takeaways**:

- Models can generate their own training data
- Quality filtering is essential
- Useful for domain-specific fine-tuning

**Reading Time**: 1.5 hours

**Link**: https://arxiv.org/abs/2212.10560

---

## Robustness and Safety

### Universal and Transferable Adversarial Attacks on Aligned Language Models (2023)

**Citation**: Zou et al., "Universal and Transferable Adversarial Attacks on Aligned Language Models"

**Why Read It**: Important for understanding LLM vulnerabilities. Motivates safety measures.

**Key Contributions**:

- Automated adversarial suffix generation
- Transfer across models
- Bypasses safety training

**Practical Takeaways**:

- Alignment isn't robust to optimization
- Perplexity filtering helps
- Defense requires multiple layers

**Reading Time**: 2 hours

**Link**: https://arxiv.org/abs/2307.15043

---

### Red Teaming Language Models with Language Models (2022)

**Citation**: Perez et al., "Red Teaming Language Models with Language Models"

**Why Read It**: Describes automated red teaming approaches. Practical for safety testing.

**Key Contributions**:

- LLM-based red teaming methodology
- Diverse attack generation
- Scaling red team efforts

**Practical Takeaways**:

- Automation enables broader coverage
- Diversity of attacks matters
- Continuous red teaming needed

**Reading Time**: 1.5 hours

**Link**: https://arxiv.org/abs/2202.03286

---

## Recent Advances (2024-2025)

The pace of development in 2024-2025 has been extraordinary. These papers represent the most impactful advances during this period.

### Llama 3 and Open Foundation Models

**Citation**: Meta AI, "Llama 3" (April 2024) and "Llama 3.1" (July 2024); see also "The Llama 3 Herd of Models" technical report (arXiv:2407.21783)

**Why Read It**: Llama 3 pushed open-source models to near-parity with proprietary alternatives.
The Llama 3.1 70B and 405B models are competitive with GPT-4 on many benchmarks.

**Key Contributions**:

- Scaling to 405B parameters with open weights (Llama 3.1, July 2024)
- 128K context window in Llama 3.1 (up from 8K in the original Llama 3)
- Improved training data curation and quality
- Better instruction-following and reasoning

**Practical Takeaways**:

- Open models are viable for production use
- 70B models hit the sweet spot for many applications
- Data quality matters as much as scale

**Link**: https://ai.meta.com/blog/meta-llama-3/

---

### Claude 3 and Constitutional AI Advances

**Citation**: Anthropic, "Claude 3 Model Card" (2024)

**Why Read It**: Claude 3 (Opus, Sonnet, Haiku) demonstrated significant capability improvements while maintaining safety. Introduces improved vision capabilities.

**Key Contributions**:

- Native multimodal understanding
- Improved long-context performance (200K tokens)
- Better instruction following and reduced refusals

**Practical Takeaways**:

- Tiered model offerings enable cost optimization
- Long context enables new use cases
- Vision capabilities are production-ready

---

### Gemini and Long Context

**Citation**: Google DeepMind, "Gemini: A Family of Highly Capable Multimodal Models" (2023-2024)

**Why Read It**: Gemini introduced native multimodality and pushed context windows to 1M+ tokens.

**Key Contributions**:

- Native multimodal training from the start
- 1M+ token context window (Gemini 1.5)
- Strong performance across modalities

**Practical Takeaways**:

- Truly long context changes application architecture
- Native multimodal may outperform adapter approaches
- Context window isn't free; understand the cost implications

**Link**: https://arxiv.org/abs/2312.11805

---

### Test-Time Compute and Reasoning (o1-style)

**Citation**: Various papers on inference-time scaling, including OpenAI's o1 approach

**Why Read It**: Covers the paradigm shift toward spending more compute at inference time (thinking longer) rather than only training larger models.

**Key Contributions**:

- Chain-of-thought at inference time improves reasoning
- Self-consistency and verification during generation
- Trading inference compute for capability

**Practical Takeaways**:

- Longer thinking time can substitute for larger models
- Useful for high-stakes decisions worth the latency
- Different cost model than traditional inference

---

### Mixture of Experts at Scale (Mixtral, DeepSeek)

**Citation**: Mistral AI, "Mixtral of Experts" (2024); DeepSeek-V2 (2024)

**Why Read It**: MoE architectures achieved excellent performance-per-compute, making very capable models more accessible.

**Key Contributions**:

- Efficient expert routing at scale
- Mixtral 8x7B outperforms larger dense models
- DeepSeek-V2 pushes efficiency further

**Practical Takeaways**:

- MoE enables larger effective models on same hardware
- Not all parameters are active per token, but memory is still full model size
- Great for throughput-focused deployments

**Link (Mixtral)**: https://arxiv.org/abs/2401.04088

---

### Structured Output and JSON Mode

**Citation**: OpenAI "Structured Outputs" announcement; various implementations

**Why Read It**: Native structured output support changed how we build LLM applications: it is more reliable than prompt-based JSON extraction.

**Key Contributions**:

- Guaranteed valid JSON output
- Schema enforcement during generation
- Reduced post-processing and error handling

**Practical Takeaways**:

- Use structured outputs for any programmatic consumption
- Dramatically reduces parsing errors
- Simplifies application code

---

### Model Merging and Composition

**Citation**: TIES-Merging (2023), DARE (2024), Model Soups

**Why Read It**: Merging fine-tuned models without additional training enables combining capabilities efficiently.

**Key Contributions**:

- Task arithmetic with model weights
- Interference-free merging techniques
- Combining specialist models into generalists

**Practical Takeaways**:

- Cheaper than multi-task training
- Useful for domain adaptation
- Quality varies; evaluate merged models carefully

**Link (TIES)**: https://arxiv.org/abs/2306.01708

---

### Long Context Advances

**Citation**: Ring Attention (2024), LongRoPE, various context extension techniques

**Why Read It**: Understanding how context windows extended from 4K to 1M+ tokens informs architectural decisions.

**Key Contributions**:

- Distributed attention across devices
- Position interpolation for length extension
- Memory-efficient attention variants

**Practical Takeaways**:

- Long context has real costs (quadratic attention)
- Retrieval may still beat very long context for some tasks
- Test actual performance at length, not just capability

**Link (Ring Attention)**: https://arxiv.org/abs/2310.01889

---

### Inference Optimization

**Citation**: Medusa (2024), EAGLE, Speculative Decoding advances

**Why Read It**: 2-3x inference speedup without quality loss is practically important.

**Key Contributions**:

- Multiple draft heads for speculation
- Better acceptance rates through architecture
- Production-ready implementations

**Practical Takeaways**:

- Speculative decoding is production-ready
- Works best when draft model matches target well
- Evaluate for your specific models and use cases

**Link (Medusa)**: https://arxiv.org/abs/2401.10774

---

### Multimodal and Video Understanding

**Citation**: GPT-4V technical report, LLaVA-NeXT, video understanding papers

**Why Read It**: Vision-language models matured significantly in 2024, enabling new applications.

**Key Contributions**:

- Improved visual reasoning and OCR
- Video understanding through frame sampling
- Better integration of visual and textual reasoning

**Practical Takeaways**:

- Vision is production-ready for many use cases
- Frame sampling strategy matters for video
- Combine with specialized vision models when needed

---

**Note**: The field continues to move quickly. For the latest developments, follow:

- arxiv.org/list/cs.CL and cs.LG
- Papers With Code
- Major lab blogs (OpenAI, Anthropic, Google DeepMind, Meta AI)

---

## Organizing Your Reading

### Paper Notes Template

For each paper you read in depth, capture:

```
Paper: [Title]
Authors: [Names]
Link: [URL]
Date Read: [Date]

One-sentence summary: [What is the key contribution?]

Key techniques:
- [Technique 1]
- [Technique 2]

Practical applications:
- [How I might use this]

Questions/limitations:
- [What's unclear or limited?]

Related papers:
- [Connected work]
```

### Building a Personal Library

1. **Core references**: Papers you'll cite repeatedly (~20 papers)
2. **Deep dives**: Papers you've studied carefully (~50 papers)
3. **Awareness**: Papers you've skimmed for context (~200 papers)

### Reading Schedule

Sustainable approach for practicing engineers:

- **Weekly**: 1 paper read carefully (2-3 hours)
- **Daily**: Skim 2-3 abstracts (15 minutes)
- **Monthly**: One deep dive into a foundational paper (4+ hours)

---

## Reading Strategy Recommendations

### For Engineers New to AI

1. Start with "Attention Is All You Need" for foundations
2. Read BERT and GPT-2 papers for encoder/decoder understanding
3. Study InstructGPT for alignment concepts
4. Then explore specific areas as needed

### For Production-Focused Engineers

Priority order:

1. FlashAttention (performance)
2. LoRA/QLoRA (fine-tuning)
3. RAG paper (retrieval)
4. ReAct (agents)
5. DPO (alignment)

### For Research-Adjacent Roles

Deep dive into:

1. Scaling laws papers
2. Latest arxiv releases in areas of focus
3. Benchmark papers for evaluation methodology

### Time-Efficient Reading

- Read abstract and introduction first
- Skip to experiments/results if time-limited
- Return to methods when implementing
- Keep a paper notes system

---

## Staying Current

### Daily Sources

- arxiv.org/list/cs.CL (NLP)
- arxiv.org/list/cs.LG (machine learning)
- Papers With Code newsletter

### Weekly Summary Sources

- ML Reddit communities
- The Batch (Andrew Ng's newsletter)
- Import AI newsletter

### Monthly Deep Dives

- Pick 1-2 important papers for careful study
- Implement key concepts when possible
- Discuss with colleagues

---

## Additional Recommended Papers by Topic

### Transformers and Architecture

| Paper | Year | Key Contribution |
|-------|------|------------------|
| An Image is Worth 16x16 Words (ViT) | 2020 | Vision transformers |
| RoBERTa | 2019 | Optimized BERT pretraining |
| T5 | 2019 | Text-to-text framework |
| PaLM | 2022 | Scaling and emergent abilities |
| LLaMA | 2023 | Open foundation models |
| Mistral 7B | 2023 | Efficient open models |

### Training and Optimization

| Paper | Year | Key Contribution |
|-------|------|------------------|
| Adam Optimizer | 2014 | Adaptive learning rates |
| Dropout | 2014 | Regularization technique |
| Batch Normalization | 2015 | Training stability |
| Layer Normalization | 2016 | Sequence model normalization |
| Mixed Precision Training | 2017 | Efficient training |
| ZeRO | 2019 | Memory-efficient distributed training |

### Retrieval and Search

| Paper | Year | Key Contribution |
|-------|------|------------------|
| BM25 | 1994 | Term frequency retrieval |
| Sentence-BERT | 2019 | Sentence embeddings |
| Approximate Nearest Neighbors Oh Yeah (Annoy) | 2013 | Efficient vector search |
| Hierarchical Navigable Small World (HNSW) | 2016 | Graph-based ANN |

### Evaluation and Benchmarks

| Paper | Year | Key Contribution |
|-------|------|------------------|
| GLUE/SuperGLUE | 2018/2019 | NLU benchmarks |
| SQuAD | 2016 | Question answering benchmark |
| MMLU | 2020 | Multi-task language understanding |
| BIG-bench | 2022 | Diverse capability evaluation |
| TruthfulQA | 2021 | Factuality evaluation |

### Safety and Ethics

| Paper | Year | Key Contribution |
|-------|------|------------------|
| On the Dangers of Stochastic Parrots | 2021 | LLM risks and biases |
| Predictability and Surprise | 2022 | Emergent capabilities |
| Sleeper Agents | 2024 | Deceptive model behavior |

---

## Paper Finding Resources

### Primary Sources

**arxiv.org**: Primary preprint server. Categories:

- cs.CL (Computation and Language)
- cs.LG (Machine Learning)
- cs.AI (Artificial Intelligence)
- stat.ML (Machine Learning - Statistics)

**Papers With Code**: Papers with implementation links. Good for finding reproducible work.

**Semantic Scholar**: Academic search with citation analysis. Good for finding influential papers.

**Google Scholar**: Broad academic search. Useful for citation counts.

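For readers who would rather script the daily abstract skim than browse listing pages, the arXiv categories above can also be queried through arXiv's public query API. The sketch below is one possible approach, not a recommended pipeline: it builds a query URL for the newest submissions in a set of categories and parses titles from the Atom feed the API returns. The function names `build_query_url` and `parse_entries` are illustrative helpers invented for this example, and the choice to sort by submission date is an assumption about what a daily skim needs.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Public arXiv query endpoint; responses are Atom XML.
ARXIV_API = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(categories, max_results=10):
    """Return a URL listing the newest submissions in the given categories.

    Hypothetical helper: combines categories with OR and sorts newest first.
    """
    search = " OR ".join(f"cat:{c}" for c in categories)
    params = {
        "search_query": search,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def parse_entries(atom_xml):
    """Extract (title, id_url) pairs from an Atom feed given as a string."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        raw_title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        title = " ".join(raw_title.split())  # collapse line breaks inside titles
        link = entry.findtext("atom:id", default="", namespaces=ATOM_NS)
        entries.append((title, link))
    return entries
```

To use it, fetch `build_query_url(["cs.CL", "cs.LG"])` with any HTTP client (for example `urllib.request.urlopen`) and pass the decoded response body to `parse_entries`. Keeping the URL construction and the parsing separate from the network call makes both easy to test offline.
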
### Curated Lists

**Awesome-LLM**: GitHub repository tracking LLM papers

**NLP Progress**: Task-specific paper tracking

**ML Papers Explained**: Accessible paper summaries

### Conference Proceedings

Major venues for AI/ML papers:

- **NeurIPS**: Neural Information Processing Systems (December)
- **ICML**: International Conference on Machine Learning (July)
- **ICLR**: International Conference on Learning Representations (May)
- **ACL/EMNLP/NAACL**: NLP conferences
- **CVPR/ICCV**: Computer vision

### Industry Sources

- **OpenAI Blog**: Research announcements
- **Anthropic Research**: Constitutional AI and safety
- **Google AI Blog**: DeepMind and Google Research
- **Meta AI Blog**: LLaMA and other releases
- **Hugging Face Blog**: Open source developments

---

## Building Research Intuition

### Evaluating Paper Quality

Questions to ask when reading:

1. **Is the problem well-motivated?** Why should we care?
2. **Is the approach novel?** What's the key insight?
3. **Are the experiments convincing?** Proper baselines? Ablations?
4. **Are the claims appropriate?** Does evidence support conclusions?
5. **Is it reproducible?** Code available? Clear methodology?

### Identifying Important Papers

Signals of importance:

- High citation count relative to age
- Adopted by practitioners
- Spawned follow-up work
- Changed how people think about the problem
- Has open-source implementation

### Common Paper Patterns

**Scaling paper**: Shows performance improves with scale (data, parameters, compute). Example: Scaling Laws.

**Architecture paper**: Introduces new model design. Example: Attention Is All You Need.

**Technique paper**: Introduces training or inference technique. Example: LoRA.

**Benchmark paper**: Introduces evaluation methodology. Example: HELM.

**Analysis paper**: Studies existing models/techniques. Example: What Makes In-Context Learning Work.

**Application paper**: Applies ML to specific domain. Example: AlphaFold.


---

## Reading Difficult Papers

### When the Math is Hard

1. **First pass**: Skip math, understand intuition
2. **Second pass**: Follow math at high level
3. **Third pass**: Work through key derivations
4. **Implementation**: Implement to solidify understanding

### When the System is Complex

1. **Architecture diagram**: Draw it yourself
2. **Data flow**: Trace a single example through
3. **Compare to simpler systems**: What's added/different?

### When Results are Surprising

1. **Check experimental setup**: Are comparisons fair?
2. **Look for ablations**: What's actually contributing?
3. **Find replications**: Have others reproduced?

---

## Creating Value from Paper Reading

### For Your Team

- Share weekly paper summaries
- Present papers at reading groups
- Write internal "paper digests"
- Connect papers to current projects

### For Your Career

- Build expertise in specific areas
- Contribute to survey papers
- Write blog posts explaining papers
- Implement papers and open source

### For the Community

- Write accessible explanations
- Create educational content
- Point out limitations/issues
- Propose extensions and improvements