Machine Learning Papers

🏆 Best ML Papers of All Time

The most influential machine learning papers — curated by impact, novelty, and field-defining significance.

107 landmark papers · Organized by year · Updated April 2026

🏅 Hall of Fame — Most Cited

#1 · 📚 172.4k citations
Attention Is All You Need · 99 · General ML
Google Research · 2017
#2 · 📚 164.7k citations
Adam: A Method for Stochastic Optimization · 93 · General ML
University of Amsterdam · 2014
#3 · 📚 112.7k citations
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · 93 · NLP
Google AI Language · 2018
#4 · 📚 110.4k citations
Very Deep Convolutional Networks for Large-Scale Image Recognition · 92 · Vision
University of Oxford · 2014
#5 · 📚 93.5k citations
U-Net: Convolutional Networks for Biomedical Image Segmentation · 91 · Vision
University of Freiburg · 2015
#6 · 📚 71.7k citations
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks · 95 · Vision
Microsoft Research · 2015
#7 · 📚 60.3k citations
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale · 95 · Vision
Google Research · 2020
#8 · 📚 56.4k citations
Language Models are Few-Shot Learners · 96 · NLP
OpenAI · 2020
#9 · 📚 46.4k citations
Learning Transferable Visual Models From Natural Language Supervision · 92 · Vision
OpenAI · 2021
#10 · 📚 46.4k citations
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift · 95 · General ML
Google · 2015

2025

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning · NLP · 76
DeepSeek-AI, Daya Guo, Dejian Yang ... · cs.CL
General reasoning represents a long-standing and formidable challenge in artificial intelligence. Recent breakthroughs, exemplified by large language models (LLMs) and chain-of-thought prompting, have achieved considerable success on foundational reasoning tasks. However, this su...

DeepSeek; o1-level reasoning via RL; open weights; major milestone

📚 5k citations
Kimi k1.5: Scaling Reinforcement Learning with LLMs · General ML · 80
Kimi Team, Angang Du, Bofei Gao ... · cs.AI
Language model pretraining with next token prediction has proved effective for scaling compute but is limited to the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promis...

Moonshot AI; RL-based reasoning with long + short CoT; competitive with o1

📚 846 citations

2024

The Llama 3 Herd of Models · NLP · 80
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri ... · cs.AI
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense...

Meta; 8B/70B/405B; strong multilingual/code; most-adopted open-weight model family of 2024

📚 14k citations
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis · Vision · 83
Patrick Esser, Sumith Kulal, Andreas Blattmann ... · cs.CV
Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that ...

Esser et al., Stability AI; multimodal diffusion transformer; improved text rendering

📚 3k citations
Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context · NLP · 80
Gemini Team, Petko Georgiev, Ving Ian Lei ... · cs.CL
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and ...

Google DeepMind; 1M-token context window; strong multimodal reasoning; function calling

📚 3k citations
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone · NLP · 83
Marah Abdin, Jyoti Aneja, Hany Awadalla ... · cs.CL
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU a...

Abdin et al., Microsoft; 3.8B matches much larger models; efficient edge-deployable LLM

📚 2k citations
Mixtral of Experts · NLP · 67
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux ... · landmark
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two expe...

Jiang et al.; sparse MoE; outperforms dense 70B at fraction of cost

📚 2k citations
Gemma: Open Models Based on Gemini Research and Technology · NLP · 84
Gemma Team, Thomas Mesnard, Cassidy Hardin ... · cs.CL
This work introduces Gemma, a family of lightweight, state-of-the art open models built from the research and technology used to create Gemini models. Gemma models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. We rele...

Google DeepMind; open-weight models distilled from Gemini; widely fine-tuned base

📚 976 citations
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model · NLP · 80
DeepSeek-AI, Aixin Liu, Bei Feng ... · landmark
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts i...

DeepSeek; MLA attention; efficient MoE; competitive open weights

DeepSeek-V3 Technical Report · NLP · 80
DeepSeek-AI, Aixin Liu, Bei Feng ... · cs.CL
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, w...

DeepSeek; 671B MoE; $6M training cost; matched proprietary frontier

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens · NLP · 83
Yiran Ding, Li Lyna Zhang, Chengruidong Zhang ... · landmark
Large context window is a desirable feature in large language models (LLMs). However, due to high fine-tuning costs, scarcity of long texts, and catastrophic values introduced by new token positions, current extended context windows are limited to around 128k tokens. This paper i...

Ding et al.; Microsoft; LongRoPE; extends the usable context window beyond 2M tokens

2023

GPT-4 Technical Report · General ML · 68
OpenAI, Josh Achiam, Steven Adler ... · cs.CL
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks...

OpenAI; multimodal GPT-4; frontier model; bar-setting benchmark results

📚 23k citations
Llama 2: Open Foundation and Fine-Tuned Chat Models · NLP · 77
Hugo Touvron, Louis Martin, Kevin Stone ... · cs.CL
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform ope...

Touvron et al.; Meta; commercial open-weights with RLHF

📚 16k citations
Segment Anything · Vision · 81
Alexander Kirillov, Eric Mintun, Nikhila Ravi ... · cs.CV
We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting...

Kirillov et al.; Meta; promptable segmentation; billion-mask dataset

📚 13k citations
Visual Instruction Tuning · Vision · 75
Haotian Liu, Chunyuan Li, Qingyang Wu ... · landmark
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to ge...

Liu et al.; open-source multimodal instruction-following

📚 9k citations
Direct Preference Optimization: Your Language Model is Secretly a Reward Model · General ML · 92
Rafael Rafailov, Archit Sharma, Eric Mitchell ... · landmark
While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect...

Rafailov et al.; simpler RLHF alternative; widely adopted

📚 8k citations
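The DPO objective in the entry above is compact enough to sketch: given the log-likelihoods of the chosen and rejected responses under the policy and under a frozen reference model, the loss is a logistic loss on the scaled log-ratio margin, with no separate reward model. A minimal sketch (the function name and example log-probs are illustrative, not from the paper's code):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair: push the policy to prefer the
    chosen response (logp_w) over the rejected one (logp_l), measured
    relative to a frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin); minimized by widening the margin
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy still matches the reference, the margin is 0 and the
# loss is the uninformative log 2:
assert abs(dpo_loss(-10.0, -12.0, -10.0, -12.0) - math.log(2)) < 1e-9
```

In practice `logp_*` are summed token log-probabilities of each full response, and `beta` trades off staying close to the reference against fitting the preferences.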
DINOv2: Learning Robust Visual Features without Supervision · Vision · 83
Maxime Oquab, Timothée Darcet, Théo Moutakanni ... · cs.CV
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual feat...

Oquab et al., Meta; curated pretraining + self-supervised; universal vision backbone

📚 7k citations
Mamba: Linear-Time Sequence Modeling with Selective State Spaces · General ML · 76
Albert Gu, Tri Dao · landmark
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, a...

Gu & Dao; SSM alternative to Transformer; linear scaling in sequence length

📚 6k citations
QLoRA: Efficient Finetuning of Quantized LLMs · General ML · 90
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman ... · landmark
We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained languag...

Dettmers et al.; 4-bit quantized LoRA; democratized LLM fine-tuning

📚 4k citations
Toolformer: Language Models Can Teach Themselves to Use Tools · NLP · 84
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì ... · landmark
Language models (LMs) exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller models ex...

Schick et al.; Meta; self-supervised tool-use learning

📚 3k citations
Mistral 7B · NLP · 60
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch ... · landmark
We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-q...

Jiang et al.; efficient 7B; sliding window attention; widely deployed

📚 3k citations
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control · Robotics · 84
Anthony Brohan, Noah Brown, Justice Carbajal ... · cs.RO
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot obser...

Brohan et al.; Google; VLM directly outputs robot actions

📚 3k citations
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning · Systems · 80
Tri Dao · landmark
Scaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, and video generation. The atten...

Dao; further 2x improvement over FlashAttention

📚 2k citations
Voyager: An Open-Ended Embodied Agent with Large Language Models · Robotics · 80
Guanzhi Wang, Yuqi Xie, Yunfan Jiang ... · cs.AI
We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager consists of three key components: 1) an automatic curriculum th...

Wang et al.; Minecraft agent; LLM as controller with skill library

📚 1k citations
Ring Attention with Blockwise Transformers for Near-Infinite Context · Systems · 71
Hao Liu, Matei Zaharia, Pieter Abbeel · landmark
Transformers have emerged as the architecture of choice for many state-of-the-art AI models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands imposed by Transformers limit their ability to handle long sequences, thereby posing...

Liu et al.; distributed ring attention; million-token context

📚 443 citations
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis · Vision · 83
Dustin Podell, Zion English, Kyle Lacey ... · cs.CV
We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention contex...

Podell et al.; improved Stable Diffusion

LLaMA: Open and Efficient Foundation Language Models · NLP · 88
Hugo Touvron, Thibaut Lavril, Gautier Izacard ... · landmark
We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to propriet...

Touvron et al.; Meta; open-weights foundation; sparked open-source LLM movement

Tree of Thoughts: Deliberate Problem Solving with Large Language Models · NLP · 83
Shunyu Yao, Dian Yu, Jeffrey Zhao ... · landmark
Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic l...

Yao et al.; systematic search over reasoning chains

Efficient Memory Management for Large Language Model Serving with PagedAttention · Systems · 83
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang ... · landmark
High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently,...

Kwon et al.; PagedAttention; near-zero KV cache waste; production LLM serving

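The PagedAttention idea in the entry above manages the KV cache like paged virtual memory: fixed-size blocks are mapped to a sequence on demand and returned to a shared pool when the request finishes, so a sequence wastes at most one partially filled block. A toy allocator sketch (names and sizes are illustrative; the real system stores K/V tensors in the blocks and also shares blocks across sequences):

```python
class PagedKVCache:
    """Toy sketch of PagedAttention block management."""
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # seq_id -> (block ids, token count)

    def append_token(self, seq_id):
        blocks, n = self.tables.get(seq_id, ([], 0))
        if n % self.block_size == 0:         # last block full: map a new one
            blocks = blocks + [self.free.pop()]
        self.tables[seq_id] = (blocks, n + 1)

    def release(self, seq_id):
        blocks, _ = self.tables.pop(seq_id)  # finished request
        self.free.extend(blocks)             # blocks return to the shared pool

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):  # a 6-token request needs ceil(6/4) = 2 blocks
    cache.append_token("req-0")
print(len(cache.tables["req-0"][0]), len(cache.free))  # 2 6
```

Contrast this with contiguous pre-allocation, which would reserve space for the maximum possible sequence length per request up front.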
Are Emergent Abilities of Large Language Models a Mirage? · NLP · 75
Rylan Schaeffer, Brando Miranda, Sanmi Koyejo · landmark
Recent work claims that large language models display emergent abilities, abilities not present in smaller-scale models that are present in larger-scale models. What makes emergent abilities intriguing is two-fold: their sharpness, transitioning seemingly instantaneously from not...

Schaeffer et al.; argues apparent emergent abilities are largely artifacts of metric choice

Gemini: A Family of Highly Capable Multimodal Models · General ML · 80
Gemini Team, Rohan Anil, Sebastian Borgeaud ... · landmark
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to o...

Google DeepMind; multimodal Gemini; matched GPT-4 on many benchmarks

Code Llama: Open Foundation Models for Code · NLP · 83
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle ... · landmark
We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide mul...

Rozière et al.; Meta; open-weights code LLM; extends Llama 2 for code

Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks · NLP · 78
Zhaofeng Wu, Linlu Qiu, Alexis Ross ... · landmark
The impressive performance of recent language models across a wide range of tasks suggests that they possess a degree of abstract reasoning skills. Are these skills general and transferable, or specialized to specific tasks seen during pretraining? To disentangle these effects, w...

Wu et al.; counterfactual task variants show LLM skills often fail to transfer beyond pretraining defaults

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration · Systems · 80
Ji Lin, Jiaming Tang, Haotian Tang ... · landmark
Large language models (LLMs) have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce the cloud computing cost and protect users' privacy. However, the astronomical model size and the limited hard...

Lin et al.; better quantization by protecting salient weights

Improved Baselines with Visual Instruction Tuning · Vision · 70
Haotian Liu, Chunyuan Li, Yuheng Li ... · cs.CV
Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, na...

Liu et al.; CLIP + LLM with simple MLP projection; strong VQA baseline

2022

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models · NLP · 92
Jason Wei, Xuezhi Wang, Dale Schuurmans ... · cs.CL
We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large languag...

Wei et al.; showed reasoning emerges with step-by-step prompting

📚 17k citations
Hierarchical Text-Conditional Image Generation with CLIP Latents · Vision · 88
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol ... · cs.CV
Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, ...

OpenAI; landmark text-to-image system

📚 9k citations
PaLM: Scaling Language Modeling with Pathways · NLP · 78
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin ... · landmark
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further ou...

Chowdhery et al.; Google; 540B params; chain-of-thought abilities

📚 8k citations
ReAct: Synergizing Reasoning and Acting in Language Models · NLP · 78
Shunyu Yao, Jeffrey Zhao, Dian Yu ... · cs.CL
While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studie...

Yao et al.; interleaved reasoning and tool use; foundation of agents

📚 7k citations
Robust Speech Recognition via Large-Scale Weak Supervision · Audio · 79
Alec Radford, Jong Wook Kim, Tao Xu ... · landmark
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are ofte...

Radford et al.; OpenAI; standard ASR; 680k hours weak supervision

📚 7k citations
Self-Consistency Improves Chain of Thought Reasoning in Language Models · NLP · 78
Xuezhi Wang, Jason Wei, Dale Schuurmans ... · landmark
Chain-of-thought prompting combined with pre-trained large language models has achieved encouraging results on complex reasoning tasks. In this paper, we propose a new decoding strategy, self-consistency, to replace the naive greedy decoding used in chain-of-thought prompting. It...

Wang et al.; majority-vote sampling over CoT paths

📚 6k citations
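The decoding strategy in the entry above amounts to sampling several chains of thought and majority-voting over their final answers. A minimal sketch (the helper name and sample strings are hypothetical; a real pipeline would parse each answer out of a full model completion):

```python
from collections import Counter

def self_consistent_answer(samples):
    """Majority vote over final answers from sampled chain-of-thought
    completions, the core of self-consistency decoding."""
    answers = [s.strip() for s in samples if s.strip()]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)  # answer plus its vote share

# Hypothetical final answers from five temperature-sampled completions:
samples = ["18", "18", "17", "18", "21"]
print(self_consistent_answer(samples))  # ('18', 0.6)
```

The vote share doubles as a crude confidence signal; greedy decoding corresponds to a single sample with no vote.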
Flamingo: a Visual Language Model for Few-Shot Learning · Vision · 84
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc ... · cs.CV
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural inn...

Alayrac et al.; DeepMind; few-shot VLM from frozen LLM

📚 6k citations
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness · Systems · 89
Tri Dao, Daniel Y. Fu, Stefano Ermon ... · landmark
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, ...

Dao et al.; 2-4x speedup; enabled longer contexts; universally adopted

📚 4k citations
Self-Instruct: Aligning Language Models with Self-Generated Instructions · NLP · 68
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra ... · landmark
Large "instruction-tuned" language models (i.e., finetuned to respond to instructions) have demonstrated a remarkable ability to generalize zero-shot to new tasks. Nevertheless, they depend heavily on human-written instruction data that is often limited in quantity, diversity, an...

Wang et al.; bootstrapped instruction data; enabled Alpaca

📚 3k citations
Training Compute-Optimal Large Language Models · General ML · 88
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch ... · landmark
We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keepin...

Hoffmann et al.; revised scaling laws; data matters as much as params

📚 3k citations
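The headline result above reduces to simple arithmetic under the common C ≈ 6·N·D approximation for training FLOPs: with data scaled at roughly 20 tokens per parameter, a compute budget fixes both model and data size. A back-of-envelope sketch, not the paper's full parametric fit:

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Rough compute-optimal sizing: C ~= 6*N*D with D ~= 20*N,
    so solving 6*N*(20*N) = C gives N = sqrt(C/120)."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# At roughly the Chinchilla training budget of ~5.8e23 FLOPs:
n, d = chinchilla_optimal(5.76e23)
print(f"{n:.2e} params, {d:.2e} tokens")  # ~7e10 params, ~1.4e12 tokens
```

This recovers the paper's headline configuration (a 70B model on 1.4T tokens) to within rounding; the 20 tokens/parameter ratio is a widely quoted rule of thumb, not an exact constant.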
InstructPix2Pix: Learning to Follow Image Editing Instructions · Vision · 83
Tim Brooks, Aleksander Holynski, Alexei A. Efros · cs.CV
We propose a method for editing images from human instructions: given an input image and a written instruction that tells the model what to do, our model follows these instructions to edit the image. To obtain training data for this problem, we combine the knowledge of two large ...

Brooks et al., UC Berkeley; text-guided image editing; enabled fine-grained image control

📚 3k citations
Constitutional AI: Harmlessness from AI Feedback · General ML · 63
Yuntao Bai, Saurav Kadavath, Sandipan Kundu ... · landmark
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided throu...

Bai et al.; Anthropic; RLAIF; scalable safety

📚 3k citations
Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models · NLP · 87
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao ... · landmark
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive ne...

Srivastava et al.; Google; 204-task collaborative LLM benchmark

📚 2k citations
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers · Systems · 79
Elias Frantar, Saleh Ashkboos, Torsten Hoefler ... · cs.LG
Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference...

Frantar et al.; 3/4-bit quantization with minimal quality loss; widely used

📚 2k citations
AudioLM: a Language Modeling Approach to Audio Generation · Audio · 84
Zalán Borsos, Raphaël Marinier, Damien Vincent ... · landmark
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers...

Borsos et al.; Google; language model for audio tokens

📚 889 citations
DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking · Biology · 69
Gabriele Corso, Hannes Stärk, Bowen Jing ... · landmark
Predicting the binding structure of a small molecule ligand to a protein -- a task known as molecular docking -- is critical to drug design. Recent deep learning methods that treat docking as a regression problem have decreased runtime compared to traditional search-based methods...

Corso et al.; MIT; frames molecular docking as generative modeling with a diffusion model over ligand poses

📚 686 citations
Training Language Models to Follow Instructions with Human Feedback · NLP · 89
Long Ouyang, Jeff Wu, Xu Jiang ... · cs.CL
Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. I...

Ouyang et al.; RLHF for LLMs; precursor to ChatGPT

General ML · 80
Yuan Xie, Shaohan Huang, Tianyu Chen ... · landmark
Sparsely Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead. MoE converts dense layers into sparse experts, and utilizes a gated routing network to make experts conditionally activated. However, as t...

Xie et al.; analysis of sparsely activated Mixture-of-Experts routing at scale

NLP · 75
Yu Feng, Ben Zhou, Haoyu Wang ... · landmark
Temporal reasoning is the task of predicting temporal relations of event pairs. While temporal reasoning models can perform reasonably well on in-domain benchmarks, we have little idea of these systems' generalizability due to existing datasets' limitations. In this work, we intr...

Feng et al.; benchmark probing the generalizability of temporal reasoning systems

RT-1: Robotics Transformer for Real-World Control at Scale · Robotics · 64
Anthony Brohan, Noah Brown, Justice Carbajal ... · landmark
By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fie...

Brohan et al.; Google; large-scale robot transformer; real manipulation

2021

Learning Transferable Visual Models From Natural Language Supervision · Vision · 92
Alec Radford, Jong Wook Kim, Chris Hallacy ... · cs.CV
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly...

Radford et al.; zero-shot transfer; most influential vision-language model

📚 46k citations
High-Resolution Image Synthesis with Latent Diffusion Models · Vision · 92
Robin Rombach, Andreas Blattmann, Dominik Lorenz ... · cs.CV
By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image gene...

Rombach et al.; enabled open-source text-to-image at scale

📚 24k citations
LoRA: Low-Rank Adaptation of Large Language Models · General ML · 85
Edward J. Hu, Yelong Shen, Phillip Wallis ... · landmark
An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3...

Hu et al.; standard PEFT method; enables consumer fine-tuning

📚 18k citations
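The LoRA parameterization in the entry above is small enough to sketch directly: the pretrained weight stays frozen and only a rank-r pair of matrices is trained, so a rank-2 adapter on an 8x16 layer trains 48 parameters instead of 128 (a numpy sketch; shapes, names, and the scaling convention follow the paper but the numbers are illustrative):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """y = x @ (W + scale * B @ A).T: frozen weight W plus a low-rank
    update B @ A of rank r, the only trained part."""
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 16, 2            # rank-2 adapter on an 8x16 layer
W = rng.normal(size=(d_out, d_in))   # frozen pretrained weight
A = rng.normal(size=(r, d_in))       # trainable, random init
B = np.zeros((d_out, r))             # trainable, zero init -> no-op at start
x = rng.normal(size=(1, d_in))
# With B = 0 the adapted layer matches the frozen layer exactly:
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)
```

The zero init of B is the reason fine-tuning starts from exactly the pretrained behavior, and the B @ A product can later be merged back into W for inference at no extra cost.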
Masked Autoencoders Are Scalable Vision Learners · Vision · 83
Kaiming He, Xinlei Chen, Saining Xie ... · landmark
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric en...

He et al.; Meta; high masking ratio MAE; efficient ViT pretraining

📚 11k citations
Evaluating Large Language Models Trained on Code · NLP · 81
Mark Chen, Jerry Tworek, Heewoo Jun ... · landmark
We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctnes...

Chen et al.; OpenAI; code generation benchmark

📚 9k citations
Emerging Properties in Self-Supervised Vision Transformers · Vision · 86
Mathilde Caron, Hugo Touvron, Ishan Misra ... · landmark
In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the...

Caron et al.; Meta; self-distillation; strong visual features without labels

📚 9k citations
Zero-Shot Text-to-Image Generation · Vision · 85
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh ... · cs.CV
Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during...

OpenAI; first large-scale text-to-image model

📚 6k citations
Finetuned Language Models Are Zero-Shot Learners · NLP · 89
Jason Wei, Maarten Bosma, Vincent Y. Zhao ... · landmark
This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning -- finetuning language models on a collection of tasks described via instructions -- substantially improves zero-shot performance on unseen tasks...

Wei et al.; instruction tuning; zero-shot generalization

📚 5k citations
SoundStream: An End-to-End Neural Audio Codec · Audio · 82
Neil Zeghidour, Alejandro Luebs, Ahmed Omran ... · arxiv
We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed by a fully convolutional encoder/decoder network and a res...

Pioneered neural audio codec architecture (encoder + RVQ + adversarial training) that became the foundation for EnCodec, DAC, and Moshi.

📚 1k citations
General ML · 59
Hoang-Son Nguyen, Yiran He, Hoi-To Wai · landmark
Recently, the stability of graph filters has been studied as one of the key theoretical properties driving the highly successful graph convolutional neural networks (GCNs). The stability of a graph filter characterizes the effect of topology perturbation on the output of a graph ...

Nguyen et al.; theoretical analysis of graph filter stability under topology perturbations

📚 9 citations
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows · Vision · 88
Ze Liu, Yutong Lin, Yue Cao ... · landmark
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the...

Liu et al.; Microsoft; shifted-window attention; general-purpose hierarchical vision backbone

2020

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale · Vision · 95
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov ... · cs.CV
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components ...

Dosovitskiy et al.; Transformer for vision; displaced CNN backbones

📚 60k citations
Language Models are Few-Shot Learners · NLP · 96
Tom B. Brown, Benjamin Mann, Nick Ryder ... · cs.CL
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of th...

Brown et al.; 175B params; in-context learning; paradigm shift

📚 56k citations
Denoising Diffusion Probabilistic Models · Vision · 89
Jonathan Ho, Ajay Jain, Pieter Abbeel · cs.LG
We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a...

Ho et al.; launched the diffusion model era

📚 29k citations
End-to-End Object Detection with Transformers · Vision · 83
Nicolas Carion, Francisco Massa, Gabriel Synnaeve ... · landmark
We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly e...

Carion et al.; detection as set prediction; replaced anchors

📚 18k citations
Measuring Massive Multitask Language Understanding · NLP · 82
Dan Hendrycks, Collin Burns, Steven Basart ... · landmark
We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving abil...

Hendrycks et al.; 57-domain knowledge benchmark; standard LLM eval

📚 8k citations
Scaling Laws for Neural Language Models · General ML · 90
Jared Kaplan, Sam McCandlish, Tom Henighan ... · cs.LG
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural ...

Kaplan et al.; power-law compute/data/parameter tradeoffs

📚 7k citations
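The paper's central finding is a power-law fit of loss against scale. A minimal sketch of the parameter-count law, using the constants reported in the paper (alpha_N ≈ 0.076, N_c ≈ 8.8e13); the helper name is ours.

```python
def loss_from_params(n_params: float, alpha_n: float = 0.076, n_c: float = 8.8e13) -> float:
    """Kaplan et al. parameter-scaling law: L(N) ≈ (N_c / N) ** alpha_N,
    the predicted cross-entropy loss (nats/token) for N non-embedding parameters."""
    return (n_c / n_params) ** alpha_n

# Doubling model size shrinks predicted loss by a constant factor of 2 ** -alpha_n.
for n in (1e8, 1e9, 1e10):
    print(f"{n:.0e} params -> predicted loss {loss_from_params(n):.3f}")
```

Analogous power laws hold for dataset size and training compute, which is what makes the tradeoffs plannable in advance.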
NLP88
Nisan Stiennon, Long Ouyang, Jeff Wu ... · cs.CL
As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these...

Stiennon et al.; OpenAI; early RLHF demonstration on summarization

📚 3k citations
NLP86
Kevin Clark, Minh-Thang Luong, Quoc V. Le ... · cs.CL
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require larg...

Clark et al.; compute-efficient pretraining

General ML80
Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma ... · cs.LG
Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SD...

Song et al.; unified view of score-matching & diffusion

Audio82
Alexei Baevski, Henry Zhou, Abdelrahman Mohamed ... · cs.CL
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and sol...

Baevski et al.; Meta; self-supervised speech; standard baseline

NLP80
Patrick Lewis, Ethan Perez, Aleksandra Piktus ... · landmark
Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowl...

Lewis et al.; Meta; grounded generation; production standard

General ML83
Ruibin Xiong, Yunchang Yang, Di He ... · cs.LG
The Transformer is widely used in natural language processing tasks. To train a Transformer however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but will slow down the optimization and bring more hyper-...

Xiong et al.; Pre-LN Transformer; removed the warm-up requirement

2019

NLP74
Yinhan Liu, Myle Ott, Naman Goyal ... · cs.CL
Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have sign...

Liu et al.; showed BERT was undertrained

📚 29k citations
NLP74
Colin Raffel, Noam Shazeer, Adam Roberts ... · cs.LG
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, me...

Raffel et al.; text-to-text framing for NLP

📚 25k citations
NLP76
Zhilin Yang, Zihang Dai, Yiming Yang ... · cs.CL
With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects depende...

Yang et al.; autoregressive BERT alternative

📚 9k citations
Systems89
Mohammad Shoeybi, Mostofa Patwary, Raul Puri ... · landmark
Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techn...

Shoeybi et al.; NVIDIA; tensor parallelism; standard multi-GPU training

📚 3k citations
Systems85
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase ... · landmark
Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelisms exhibit fundamental limitations to fit these models into limited device memory, while obtaining com...

Rajbhandari et al.; Microsoft; partitioned optimizer state / gradients / params

📚 2k citations

2018

NLP93
Jacob Devlin, Ming-Wei Chang, Kenton Lee ... · cs.CL
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly co...

Devlin et al.; transformed NLP; bidirectional language models

📚 113k citations💬 Reddit🎬 ▶ Video 1 · ▶ Video 2
General ML92
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel ... · cs.LG
Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which nece...

Haarnoja et al.; state-of-the-art continuous control

📚 11k citations
NLP85
Alex Wang, Amanpreet Singh, Julian Michael ... · landmark
For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset. In pursuit of t...

Wang et al.; standard NLP benchmark suite

📚 8k citations
General ML88
Jonathan Frankle, Michael Carbin · landmark
Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures...

Frankle & Carbin; sparse subnetworks; influential pruning theory

📚 4k citations
Vision55
Joseph Redmon, Ali Farhadi · landmark
We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320x320 YOLOv3 runs in 22 ms at 28.2 m...

Redmon & Farhadi; real-time detection; widely deployed

2017

General ML99
Ashish Vaswani, Noam Shazeer, Niki Parmar ... · cs.CL
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architect...

Vaswani et al.; most cited ML paper ever; foundation of modern AI

📚 172k citations💬 Reddit · HN🎬 ▶ Video 1 · ▶ Video 2
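The core of the architecture is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal single-head NumPy sketch (function name ours; the full model adds multiple heads, projections, and masking):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in Vaswani et al."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarity logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # numerically stable row softmax
    return weights @ V                               # convex combination of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

The 1/sqrt(d_k) scaling keeps the logits in a range where the softmax retains usable gradients as the key dimension grows.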
General ML85
John Schulman, Filip Wolski, Prafulla Dhariwal ... · cs.LG
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient method...

Schulman et al.; OpenAI; default RL algorithm for LLM alignment

📚 26k citations
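PPO optimizes a clipped surrogate objective over the probability ratio between new and old policies. A minimal sketch of that loss (function name and toy inputs are ours):

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate (Schulman et al.):
    L = E[ min(r * A, clip(r, 1 - eps, 1 + eps) * A) ], negated for minimization.
    Clipping removes the incentive to push r far outside [1 - eps, 1 + eps]."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -np.minimum(unclipped, clipped).mean()

# With ratio inside the clip range the loss is just -r * A; outside, it saturates.
print(ppo_clip_loss(np.array([1.0]), np.array([2.0])))  # -2.0
print(ppo_clip_loss(np.array([2.0]), np.array([1.0])))  # -1.2 (clipped at 1.2)
```

The min with the clipped term is what lets PPO take several gradient epochs per batch without the policy collapsing, which is a key reason it became the default for RLHF.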
General ML88
Petar Veličković, Guillem Cucurull, Arantxa Casanova ... · landmark
We present graph attention networks (GATs), novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. By stacking layers in ...

Veličković et al.; attention on graphs; widely cited

📚 25k citations
General ML18
Chelsea Finn, Pieter Abbeel, Sergey Levine · landmark
We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning. The goal...

Finn et al.; gradient-based meta-learning; few-shot adaptation

📚 14k citations

2016

General ML92
Thomas N. Kipf, Max Welling · landmark
We present a scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs. We motivate the choice of our convolutional architecture via a localized first-order appro...

Kipf & Welling; standard graph neural network baseline

📚 34k citations
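Each GCN layer applies the symmetric-normalized propagation rule H' = sigma(D_hat^(-1/2) A_hat D_hat^(-1/2) H W), with A_hat = A + I. A minimal NumPy sketch on a toy graph (function name and toy sizes are ours):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step (Kipf & Welling):
    H' = ReLU(D_hat^(-1/2) A_hat D_hat^(-1/2) H W), A_hat = A + I (self-loops)."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)                      # aggregate, project, ReLU

# Toy 3-node path graph, 2 input features, 4 hidden units.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.random.default_rng(1).normal(size=(3, 2))
W = np.random.default_rng(2).normal(size=(2, 4))
print(gcn_layer(A, H, W).shape)  # (3, 4)
```

Each layer mixes a node's features with its immediate neighbors', so k stacked layers give each node a k-hop receptive field.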
Audio94
Aaron van den Oord, Sander Dieleman, Heiga Zen ... · cs.SD
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently tr...

Oord et al.; DeepMind; autoregressive raw waveform; landmark TTS

📚 8k citations

2015

Vision91
Olaf Ronneberger, Philipp Fischer, Thomas Brox · cs.CV
There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently...

Ronneberger et al.; standard architecture for biomedical image segmentation

📚 94k citations
Vision95
Shaoqing Ren, Kaiming He, Ross Girshick ... · landmark
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we i...

Ren et al.; end-to-end detector; standard baseline for years

📚 72k citations
General ML95
Sergey Ioffe, Christian Szegedy · cs.LG
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and...

Made very deep networks trainable; Ioffe & Szegedy

📚 46k citations
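The technique normalizes each feature over the mini-batch, then applies a learned scale and shift. A minimal training-mode NumPy sketch (inference-time running statistics are omitted; function name ours):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization (Ioffe & Szegedy): per-feature normalization over the
    mini-batch, followed by a learned scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(64, 10))
y = batch_norm(x, gamma=np.ones(10), beta=np.zeros(10))
# Per-feature means of y are approximately zero, standard deviations approximately one.
```

At inference time, frameworks replace the batch statistics with running averages collected during training so that single examples can be processed.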
General ML85
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel ... · cs.LG
We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network arc...

Lillicrap et al.; actor-critic for continuous action spaces

📚 15k citations

2014

General ML93
Diederik P. Kingma, Jimmy Ba · cs.LG
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invarian...

Default optimizer for most modern ML; Kingma & Ba

📚 165k citations
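Adam maintains exponential moving averages of the gradient (first moment) and its square (second moment), with bias correction for the zero initialization. A minimal sketch of one update using the paper's default hyperparameters (function name ours):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba)."""
    m = beta1 * m + (1 - beta1) * grad       # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = x^2 (gradient 2x) from x = 3.
theta, m, v = np.array([3.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
print(theta)  # close to 0
```

Dividing by sqrt(v_hat) gives each parameter its own effective step size, which is why Adam tolerates poorly scaled gradients with little tuning.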
Vision92
Karen Simonyan, Andrew Zisserman · cs.CV
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, ...

Established depth as key factor in CNNs

📚 110k citations
NLP93
Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio · cs.CL
Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. Th...

Bahdanau et al.; attention for neural machine translation; precursor to the Transformer

📚 29k citations
NLP92
Ilya Sutskever, Oriol Vinyals, Quoc V. Le · cs.CL
Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general...

Sutskever et al.; foundation of seq2seq NMT

📚 22k citations
Vision96
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza ... · stat.ML
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the t...

Goodfellow et al.; introduced adversarial training

📚 2k citations

2013

Vision90
Matthew D Zeiler, Rob Fergus · cs.CV
Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark. However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a ...

Zeiler & Fergus; visualized what CNNs learn; guided improvements to AlexNet

📚 17k citations
General ML100
Volodymyr Mnih, Koray Kavukcuoglu, David Silver ... · cs.LG
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output...

DeepMind; launched modern deep RL

📚 14k citations
NLP80
Tomas Mikolov, Ilya Sutskever, Kai Chen ... · cs.CL
The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both t...

Mikolov et al.; standard word embeddings for years

2012

General ML88
Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky ... · cs.NE
When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in ...

Hinton et al.; fundamental regularization technique; arXiv 1207.0580
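The idea is to zero each unit with probability p during training. The sketch below uses the modern "inverted" convention (survivors scaled by 1/(1 - p) at training time) rather than the paper's original test-time scaling; the function name is ours.

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p during training and
    scale survivors by 1/(1 - p), so expected activations match inference,
    where the input is returned unchanged."""
    if not training:
        return x
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.ones((4, 8))
print(dropout(x, p=0.5, rng=np.random.default_rng(0)))  # entries are 0.0 or 2.0
```

Because each training pass samples a different mask, the network behaves like an implicit ensemble of thinned subnetworks, which is the regularization effect the paper describes.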