This is a collection of research papers and blogs about OpenAI Strawberry (o1) and reasoning. The repository will be updated continually to track the frontier of LLM reasoning.
OpenAI Docs
Blogs
- [OpenAI] Learning to Reason with LLMs
- [OpenAI] OpenAI o1-mini Advancing cost-efficient reasoning
- [OpenAI] Finding GPT-4’s mistakes with GPT-4
- [Tibor Blaho] Summary of what we have learned during AMA hour with the OpenAI o1 team
- [Nathan Lambert] OpenAI’s Strawberry, LM self-talk, inference scaling laws, and spending more on inference
- [Nathan Lambert] Reverse engineering OpenAI’s o1
- [Andreas Stuhlmüller, jungofthewon] Supervise Process, not Outcomes
Talks
- [Noam Brown] Parables on the Power of Planning in AI: From Poker to Diplomacy
- [Hyung Won Chung] Don’t teach. Incentivize.
- [OpenAI Developers] We’re hosting an AMA for developers from 10–11 AM PT today.
Papers
format:
- [title](paper link) [links]
- author1, author2, and author3...
- publisher
- code
- experimental environments and datasets
Relevant Papers from OpenAI o1 contributors
- Training Verifiers to Solve Math Word Problems
- Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman
- Generative Language Modeling for Automated Theorem Proving
- Stanislas Polu, Ilya Sutskever
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou
- Let’s Verify Step by Step
- Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe
- LLM Critics Help Catch LLM Bugs
- Nat McAleese, Rai Michael Pokorny, Juan Felipe Ceron Uribe, Evgenia Nitishinskaya, Maja Trebacz, Jan Leike
- Self-critiquing models for assisting human evaluators
- William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, Jan Leike
- Scalable Online Planning via Reinforcement Learning Fine-Tuning
- Arnaud Fickinger, Hengyuan Hu, Brandon Amos, Stuart Russell, Noam Brown.
2024
- Planning In Natural Language Improves LLM Search For Code Generation
- Evan Wang, Federico Cassano, Catherine Wu, Yunfeng Bai, Will Song, Vaskar Nath, Ziwen Han, Sean Hendryx, Summer Yue, Hugh Zhang
- Training Language Models to Self-Correct via Reinforcement Learning
- Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra Faust
- To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
- Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett
- An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
- Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, Yiming Yang
- Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling
- Hritik Bansal, Arian Hosseini, Rishabh Agarwal, Vinh Q. Tran, Mehran Kazemi
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
- Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar
- Generative Verifiers: Reward Modeling as Next-Token Prediction
- Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, Rishabh Agarwal
- Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
- Zhenting Qi, Mingyuan Ma, Jiahang Xu, Li Lyna Zhang, Fan Yang, Mao Yang
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
- Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, Azalia Mirhoseini
- Improve Mathematical Reasoning in Language Models by Automated Process Supervision
- Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, Abhinav Rastogi
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning
- Chaojie Wang, Yanchen Deng, Zhiyi Lyu, Liang Zeng, Jujie He, Shuicheng Yan, Bo An
- Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B
- Di Zhang, Xiaoshui Huang, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang
- Self-Rewarding Language Models
- Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston
- Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models
- Zhiyuan Hu, Chumin Liu, Xidong Feng, Yilun Zhao, See-Kiong Ng, Anh Tuan Luu, Junxian He, Pang Wei Koh, Bryan Hooi
- Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
- Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, Noah D. Goodman
- https://github.com/ezelikman/quiet-star
- Advancing LLM Reasoning Generalists with Preference Trees
- Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan et al.
- Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
- Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, and Dong Yu.
- AlphaMath Almost Zero: Process Supervision Without Process
- Guoxin Chen, Minpeng Liao, Chengxi Li, Kai Fan.
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search
- Dan Zhang, Sining Zhoubian, Yisong Yue, Yuxiao Dong, and Jie Tang.
- MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time
- Jikun Kang, Xin Zhe Li, Xi Chen, Amirreza Kazemi, Qianyi Sun, Boxing Chen, Dong Li, Xu He, Quan He, Feng Wen, Jianye Hao, Jun Yao.
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning
- Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P. Lillicrap, Kenji Kawaguchi, Michael Shieh.
- Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
- Zhiyuan Li, Hong Liu, Denny Zhou, Tengyu Ma.
- ReFT: Reasoning with Reinforced Fine-Tuning
- Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, Hang Li
- Chain-of-Thought Reasoning Without Prompting
- Xuezhi Wang, Denny Zhou
2023
- Training Chain-of-Thought via Latent-Variable Inference
- Du Phan, Matthew D. Hoffman, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, Rif A. Saurous
- Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training
- Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, Jun Wang
- Reasoning with Language Model is Planning with World Model
- Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, Zhiting Hu
- Don’t throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding
- Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, Asli Celikyilmaz
- Certified reasoning with language models
- Gabriel Poesia, Kanishk Gandhi, Eric Zelikman, Noah D. Goodman
2022
- Chain of Thought Imitation with Procedure Cloning
- Mengjiao Yang, Dale Schuurmans, Pieter Abbeel, Ofir Nachum.
- STaR: Bootstrapping Reasoning With Reasoning
- Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah D. Goodman
- Solving math word problems with process- and outcome-based feedback
- Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, Irina Higgins
2021
- Scaling Scaling Laws with Board Games
- Andy L. Jones.
- Show Your Work: Scratchpads for Intermediate Computation with Language Models
- Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, Augustus Odena
2017
- Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
- David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, Demis Hassabis.
Projects
Evaluation
- [AryanDLuffy] Codeforces – O1-mini benchmark
- [Dominater069] Codeforces – Analyzing how good O1-Mini actually is