
Summary of large language model research papers in September

Author: deephub

Large Language Models (LLMs) have evolved rapidly this year, and as new generations of models continue to be developed, it is important for researchers and engineers to be aware of the latest advances. This article summarizes some important LLM papers published between September and October.

These papers cover a range of topics, from model optimization and scaling to reasoning, RLHF, benchmarking, and performance enhancement. The final section covers using LLMs as agents for web tasks.

Optimization and scaling

Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning


Large language models (LLMs) like GPT-4 exhibit strong performance on a wide variety of tasks, but this performance often comes with the high cost of paid API services.

In this paper, the authors examine building LLM cascades to reduce the cost of using LLMs, specifically for reasoning tasks (e.g., mathematics, causal reasoning).

Cascaded pipelines follow the theory that simple problems can be solved by weaker but more affordable LLMs, while only challenging problems require more powerful and expensive LLMs.

To make this routing decision, they use the weaker LLM's "answer consistency" as a signal of question difficulty, and propose several methods for answer sampling and consistency checking, including a hybrid method that draws on two representations of thought: chain-of-thought and program-of-thought.

Through experiments on six reasoning benchmark datasets, using GPT-3.5 Turbo and GPT-4 as the weaker and stronger LLMs respectively, they show that the proposed LLM cascades achieve performance comparable to using the stronger LLM alone at only 40% of its cost.
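A minimal sketch of the cascade idea, assuming hypothetical ask_weak_llm / ask_strong_llm helpers (not the paper's code): sample several answers from the cheaper model and escalate to the stronger model only when those samples disagree.

```python
from collections import Counter

def cascade_answer(question, ask_weak_llm, ask_strong_llm,
                   n_samples=5, agreement_threshold=0.6):
    """Route a question through a weak -> strong LLM cascade.

    Answer consistency across samples from the weak model serves as a proxy
    for question difficulty: consistent answers are accepted as-is, while
    inconsistent ones are escalated to the stronger, more expensive model.
    """
    # Sample several answers from the cheaper model (e.g. with temperature sampling).
    samples = [ask_weak_llm(question) for _ in range(n_samples)]

    # Majority vote; the agreement ratio is the consistency signal.
    answer, votes = Counter(samples).most_common(1)[0]
    if votes / n_samples >= agreement_threshold:
        return answer                 # cheap path: the weak model was consistent
    return ask_strong_llm(question)   # hard question: fall back to the strong model
```

The paper's mixture-of-thoughts variant additionally samples both chain-of-thought and program-of-thought solutions from the weaker model and checks consistency across the two representations.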

EcoAssistant: Using LLM Assistant More Affordably and Accurately


Users ask large language models (LLMs) to act as assistants for queries that require external knowledge; they ask about the weather, stock prices, or even specific places in a particular city.

Such queries require the LLM to generate code that calls external APIs, but LLMs rarely produce correct code on the first attempt and must iteratively refine it based on execution results. These iterations drive up the number of queries, which can be expensive.

In this work, the authors contribute EcoAssistant, a framework that enables LLMs to answer code-driven queries more affordably and accurately. EcoAssistant consists of three components (a rough sketch of how they fit together follows the list):

First, it lets the LLM assistant converse with an automated code executor to iteratively refine the code and produce answers based on the execution results.

Second, it uses a hierarchy of LLM assistants, which first try to answer queries with weaker, cheaper LLMs before falling back to stronger ones.

Third, it retrieves solutions from past successful queries as in-context demonstrations to help with subsequent queries.
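A minimal sketch of how these three components could be combined, with every function name a hypothetical placeholder rather than EcoAssistant's actual API:

```python
def eco_assistant(query, assistants, execute_code, retrieve_demos,
                  save_solution, max_turns=3):
    """Answer a code-driven query cheaply: retrieve past solutions as demos,
    then try assistants from cheapest to most expensive, each conversing
    with a code executor for a few refinement turns."""
    demos = retrieve_demos(query)               # component 3: in-context demonstrations

    for assistant in assistants:                # component 2: cheaper models first
        code = assistant.write_code(query, demos)
        for _ in range(max_turns):              # component 1: executor feedback loop
            result = execute_code(code)         # result has .success/.output/.error here
            if result.success:
                save_solution(query, code)      # future queries can reuse this solution
                return result.output
            code = assistant.revise_code(query, code, result.error)
    return None  # every assistant failed; the caller can surface an error
```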

EcoAssistant shows clear advantages in affordability and accuracy, achieving a success rate more than 10 percentage points higher than GPT-4's at less than 50% of GPT-4's cost.

AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model


The authors propose the Any-Modality Augmented Language Model (AnyMAL), a unified model that can reason over diverse input modality signals (text, image, video, audio, IMU motion sensor data) and generate textual responses.

AnyMAL inherits the powerful text-based reasoning capabilities of state-of-the-art LLMs, including LLaMA-2 (70B), and converts modality-specific signals into a joint textual space through pre-trained aligner modules.

To further strengthen the multimodal LLM's capabilities, they fine-tuned the model on a manually collected multimodal instruction set covering a variety of topics and tasks beyond simple question answering. Comprehensive empirical analysis, including human and automatic evaluations, demonstrates state-of-the-art performance on a variety of multimodal tasks.
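Conceptually, each pre-trained aligner maps a frozen modality encoder's output into the LLM's token-embedding space so the signal can be consumed as a short prefix of "soft tokens". A rough PyTorch-style sketch of that idea, with illustrative dimensions and layer choices rather than the AnyMAL architecture:

```python
import torch.nn as nn

class ModalityAligner(nn.Module):
    """Projects frozen modality-encoder features (e.g. image embeddings) into
    the LLM's token-embedding space as a short sequence of soft tokens."""
    def __init__(self, enc_dim=1024, llm_dim=8192, n_tokens=32):
        super().__init__()
        self.n_tokens, self.llm_dim = n_tokens, llm_dim
        self.proj = nn.Linear(enc_dim, n_tokens * llm_dim)

    def forward(self, enc_features):               # (batch, enc_dim)
        soft = self.proj(enc_features)              # (batch, n_tokens * llm_dim)
        return soft.view(-1, self.n_tokens, self.llm_dim)

# The soft tokens are concatenated with the text token embeddings and fed to
# the (largely frozen) LLM, which then generates a textual response.
```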

Reinforcement learning from human feedback (RLHF)

A Long Way to Go: Investigating Length Correlations in RLHF


Reinforcement learning from human feedback (RLHF) has been highly successful in aligning large language models. The open-sourcing of good datasets and reward models has enabled experimentation well beyond generic chat settings, especially toward making systems more "helpful" for tasks such as web question answering, summarization, and multi-turn dialogue. When optimizing for helpfulness, RLHF has consistently been observed to drive models to produce longer outputs.

The paper shows that optimizing for response length is a significant factor behind the improvements RLHF reports in these settings. The authors study the relationship between length and reward for reward models trained on three open-source datasets. They find that length correlates strongly with reward, and that improvements in reward score are driven in large part by shifting the distribution of output lengths.

They then explore whether interventions during RL and during reward model training can achieve the same downstream improvements as RLHF without the increase in length. While such interventions mitigate length growth, they are not uniformly effective across settings.

The paper also finds that even running RLHF with a reward based purely on length reproduces most of the downstream improvements over the initial policy model, suggesting that reward models in these settings still have a long way to go.
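A minimal sketch of the kind of diagnostic this involves: score responses with a reward model and measure how strongly reward correlates with output length. The reward_model and tokenizer below are placeholders, not the paper's code.

```python
import numpy as np
from scipy.stats import pearsonr

def length_reward_correlation(prompts, responses, reward_model, tokenizer):
    """Correlate response length (in tokens) with reward-model score.

    A strong positive correlation suggests the reward model can be partially
    satisfied simply by producing longer outputs.
    """
    lengths = np.array([len(tokenizer.encode(r)) for r in responses])
    rewards = np.array([reward_model(p, r) for p, r in zip(prompts, responses)])
    r, p_value = pearsonr(lengths, rewards)
    return r, p_value
```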

Reasoning

MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning


The recently released GPT-4 Code Interpreter has demonstrated remarkable proficiency in solving challenging mathematical problems, largely thanks to its ability to seamlessly interleave natural-language reasoning with generating code, executing it, and continuing to reason based on the execution output.

The paper proposes a method for fine-tuning open-source language models that enables them to use code to model and derive mathematical equations, thereby improving their mathematical reasoning abilities.

It includes a method for generating a novel dataset of high-quality mathematical problems and their code-based solutions, called MathCodeInstruct. Each solution interleaves natural language, code, and execution results. The authors also introduce a customized supervised fine-tuning and inference approach.

This approach yields the MathCoder models, a family of models capable of generating code-based solutions to challenging mathematical problems. The MathCoder models achieve state-of-the-art scores among open-source LLMs on the MATH (45.2%) and GSM8K (83.9%) datasets, significantly outperforming other open-source alternatives. They not only surpass ChatGPT-3.5 and PaLM-2 on GSM8K and MATH, but also outperform GPT-4 on the competition-level MATH dataset.
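The execution-interleaved inference style can be sketched as a simple loop: generate until a code block appears, run it, append the output, and let the model continue. The regex and helper names below are illustrative, not MathCoder's implementation.

```python
import re

# Match fenced Python blocks (the fence is built here to keep the sketch self-contained).
FENCE = "`" * 3
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)

def solve_with_code(problem, generate, run_python, max_rounds=5):
    """Interleave natural-language reasoning, code, and execution results.

    generate(prompt) returns the model's next chunk of text; run_python(src)
    executes the code in a sandbox and returns its stdout.
    """
    transcript = f"Problem: {problem}\n"
    for _ in range(max_rounds):
        chunk = generate(transcript)
        transcript += chunk
        blocks = CODE_BLOCK.findall(chunk)
        if not blocks:
            return transcript           # the model gave a final natural-language answer
        output = run_python(blocks[-1])
        transcript += f"\nExecution output:\n{output}\n"
    return transcript
```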

Large Language Models Cannot Self-Correct Reasoning Yet

Large language models (LLMs) have become a breakthrough technology with unparalleled text generation capabilities across a variety of applications. However, concerns remain about the accuracy and appropriateness of the content they generate.

Self-correction has been proposed as a remedy for these problems, and this paper examines it critically. At the heart of the study is the concept of intrinsic self-correction, in which an LLM attempts to correct its initial response based solely on its inherent capabilities, without relying on external feedback.

For reasoning, the study shows that LLMs struggle to self-correct their responses without external feedback, and that their performance sometimes even degrades after self-correction. Based on these insights, the authors make recommendations for future research and practical applications in this area.
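Intrinsic self-correction is typically implemented as a critique-then-revise prompt loop like the sketch below (the prompt wording is illustrative); the paper's finding is that, for reasoning tasks, this loop often fails to improve the initial answer and can even make it worse.

```python
def intrinsic_self_correct(question, ask_llm, n_rounds=1):
    """Ask the model to review and revise its own answer without any external
    feedback (no executor, no verifier, no ground truth)."""
    answer = ask_llm(f"Question: {question}\nAnswer step by step.")
    for _ in range(n_rounds):
        critique = ask_llm(
            f"Question: {question}\nYour answer:\n{answer}\n"
            "Review your answer and point out any problems with it."
        )
        answer = ask_llm(
            f"Question: {question}\nYour answer:\n{answer}\n"
            f"Critique:\n{critique}\nBased on the critique, give a final answer."
        )
    return answer
```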

Large Language Models as Analogical Reasoners


Chain-of-thought (CoT) prompting of language models delivers impressive performance on reasoning tasks, but it typically requires labeled exemplars of the reasoning process.

The paper introduces a new prompting method, analogical prompting, which automatically guides the reasoning process of large language models. Analogical reasoning is a cognitive process in which humans draw on relevant past experience to solve new problems. Inspired by it, the method prompts the language model to self-generate relevant exemplars or knowledge in context before proceeding to solve the given problem.

This approach has several advantages: it removes the need to label or retrieve exemplars, offering generality and convenience, and it can tailor the generated exemplars and knowledge to each problem, offering adaptability. Experimental results show that the method outperforms 0-shot CoT and manual few-shot CoT on a variety of reasoning tasks, including mathematical problem solving on GSM8K and MATH, code generation on Codeforces, and other reasoning tasks in BIG-Bench.
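The core of analogical prompting is a single self-contained prompt that asks the model to recall a few relevant exemplars before solving the target problem. A rough template, with wording that paraphrases rather than reproduces the paper's exact prompt:

```python
ANALOGICAL_PROMPT = """Your task is to solve the problem below.

Problem: {problem}

Instructions:
1. Recall three relevant and distinct example problems. For each, describe
   the problem and explain its solution.
2. Then solve the original problem step by step, drawing on those examples.
"""

def build_analogical_prompt(problem):
    # No labeled exemplars are retrieved or provided; the model generates its own.
    return ANALOGICAL_PROMPT.format(problem=problem)
```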

LLM progress and benchmarks

How FaR Are Large Language Models From Agents with Theory-of-Mind?


"To think is to act." Humans can infer the mental states of others through observations—an ability known as theory of mind (ToM)—and then take actual action based on those inferences. Existing question-and-answer benchmarks, such as ToMi, ask the model questions to infer the beliefs of the characters in the story, but do not test whether the model can use those inferences to guide their actions.

The paper proposes a new evaluation paradigm for large language models (LLMs): Thinking for Doing (T4D), which requires models to connect inferences about others' mental states to actions in social scenarios. Experiments on T4D show that LLMs such as GPT-4 and PaLM 2 seem adept at tracking characters' beliefs in stories, but they struggle to translate this ability into strategic action.

The paper introduces a zero-shot prompting framework, Foresee and Reflect (FaR), which provides a reasoning structure that encourages LLMs to anticipate future challenges and reason about potential actions.

FaR improves GPT-4's performance on T4D from 50% to 71%, outperforming other prompting methods. Moreover, FaR generalizes to diverse out-of-distribution story structures and scenarios that also require ToM reasoning to choose an action, consistently outperforming other methods, including few-shot in-context learning.
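Since FaR is a zero-shot prompt structure, it can be approximated with a fixed template that first elicits foresight about each character and then reflection over the available actions. The wording below is a paraphrase of that two-stage idea, not the paper's exact prompt.

```python
FAR_PROMPT = """{observations}

Question: {question}

Foresee: For each character, infer what they are likely to do next and what
challenges they might face, given what they know and believe.

Reflect: Based on those foreseen challenges, reason about which of the
available actions would help the most, then choose one.

Available actions: {actions}
Answer:"""

def build_far_prompt(observations, question, actions):
    return FAR_PROMPT.format(observations=observations,
                             question=question, actions=actions)
```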

SmartPlay: A Benchmark for LLMs as Intelligent Agents


Recent large language models (LLMs) have demonstrated great potential for intelligent agents and next-generation automation, but there is currently no systematic benchmark for assessing LLMs' capabilities as agents.

The paper presents SmartPlay, which is both a challenging benchmark and a methodology for evaluating LLMs as agents. SmartPlay consists of 6 different games, including Rock-Paper-Scissors, the Tower of Hanoi, and Minecraft.

Each game has a unique setting, offering up to 20 evaluation settings and effectively unlimited environment variations. Each game in SmartPlay uniquely challenges a subset of 9 important capabilities of an intelligent LLM agent, including reasoning with object dependencies, planning ahead, spatial reasoning, learning from history, and understanding randomness. Because each game tests a different set of capabilities, the benchmark makes it possible to analyze each capability separately.

SmartPlay not only serves as a rigorous testing ground to evaluate the overall performance of LLM agents, but also serves as a roadmap to identify gaps in current approaches.
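Benchmarks of this kind typically expose each game through a textual observation/action loop in which the LLM plays the role of the policy. A generic sketch of that loop, not SmartPlay's actual API:

```python
def run_episode(env, ask_llm, max_steps=100):
    """Drive one game episode with an LLM acting as the policy.

    `env` is assumed to provide a textual rules manual plus reset()/step()
    methods returning textual observations, a scalar reward, and a done flag.
    """
    observation = env.reset()
    history, total_reward = [], 0.0
    for _ in range(max_steps):
        prompt = (f"{env.manual}\n\nHistory:\n" + "\n".join(history) +
                  f"\nObservation: {observation}\nChoose the next action:")
        action = ask_llm(prompt)
        observation, reward, done = env.step(action)
        history.append(f"Action: {action} -> Reward: {reward}")
        total_reward += reward
        if done:
            break
    return total_reward
```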

Improve LLM performance

FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation


Most large language models (LLMs) are trained once and never updated; as a result, they cannot adapt dynamically to a changing world. This work studies in detail the factuality of LLM-generated text in the context of answering questions that test current world knowledge.

It introduces FreshQA, a new dynamic QA benchmark covering a diverse range of question types, including questions that require fast-changing world knowledge as well as questions with false premises that need to be debunked.

Benchmarking a variety of closed and open-source LLMs under a two-mode evaluation procedure, with human evaluations involving more than 50,000 judgments, reveals the limitations of these models and demonstrates significant room for improvement: for example, all models, regardless of size, struggle with questions involving fast-changing knowledge and false premises.

Motivated by these results, the paper proposes FreshPrompt, a simple few-shot prompting method that substantially improves LLM performance on FreshQA by incorporating relevant, up-to-date information retrieved from a search engine into the prompt.

Experiments show that FreshPrompt outperforms competing search-engine-augmented prompting methods such as Self-Ask (Press et al., 2022) as well as commercial systems such as Perplexity AI. Further analysis of FreshPrompt shows that the number of retrieved pieces of evidence and their order play a key role in the correctness of LLM-generated answers.

In addition, instructing the LLM to generate concise and direct answers helps reduce hallucination compared with encouraging verbose answers.
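The FreshPrompt recipe is essentially: retrieve current search results, order them so the most recent evidence sits closest to the question, and instruct the model to answer concisely. A hedged sketch with a placeholder search function standing in for a real search-engine API:

```python
def build_fresh_prompt(question, search, n_results=5):
    """Build a search-augmented prompt in the spirit of FreshPrompt.

    `search(question)` is a placeholder returning dicts with 'source',
    'date', and 'snippet' keys (e.g. parsed from a search-engine API).
    """
    results = sorted(search(question), key=lambda r: r["date"])[-n_results:]
    evidence = "\n".join(
        f"[{r['date']}] {r['source']}: {r['snippet']}" for r in results
    )
    return (
        f"{evidence}\n\n"
        f"Question: {question}\n"
        "Using the evidence above (prefer the most recent items), "
        "give a short, direct answer:"
    )
```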

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

The ML community is rapidly exploring techniques for prompting language models (LMs) and stacking them into pipelines that solve complex tasks. But existing LM pipelines are often implemented with hard-coded "prompt templates", i.e. long strings discovered through trial and error.

To develop and optimize LM pipelines more systematically, the paper proposes DSPy, a programming model that abstracts an LM pipeline as a text transformation graph, that is, an imperative computation graph in which LMs are invoked through declarative modules. DSPy modules are parameterized, meaning they can learn (by creating and collecting demonstrations) how to apply combinations of prompting, fine-tuning, augmentation, and reasoning techniques.

The authors also designed a compiler that optimizes any DSPy pipeline to maximize a given metric. Two case studies show that concise DSPy programs can express and optimize sophisticated LM pipelines that solve math word problems, handle multi-hop retrieval, answer complex questions, and control agent loops.

Within minutes of compiling, a few lines of DSPy allow GPT-3.5 and Llama2-13b-chat to self-bootstrap pipelines that outperform standard few-shot prompting (generally by over 25% and 65%, respectively) and pipelines with expert-created demonstrations (by up to 5-46% and 16-40%, respectively). On top of that, DSPy programs compiled for open and relatively small LMs, such as the 770M-parameter T5 and Llama2-13b-chat, are competitive with approaches that rely on expert-written prompt chains for proprietary GPT-3.5.
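A hedged sketch of the programming model, using only the core pieces described in the paper; exact API details vary across DSPy versions, so treat this as illustrative rather than canonical.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

class SimpleQA(dspy.Module):
    """A tiny declarative pipeline: one chain-of-thought step from question to answer."""
    def __init__(self):
        super().__init__()
        self.answer = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.answer(question=question)

def exact_match(example, prediction, trace=None):
    # Metric used by the compiler to judge bootstrapped demonstrations.
    return example.answer.lower() in prediction.answer.lower()

# With a configured LM (dspy.settings.configure(lm=...)) and a small trainset of
# dspy.Example items, compiling bootstraps few-shot demos for the module:
# compiled_qa = BootstrapFewShot(metric=exact_match).compile(SimpleQA(), trainset=trainset)
```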

Enable Language Models to Implicitly Learn Self-Improvement From Data


Large language models (LLMs) have demonstrated extraordinary capabilities on open-ended text generation tasks. But the inherent open-endedness of these tasks means there is always room to improve the quality of model responses.

To address this challenge, various approaches to improving LLM performance have been proposed. There is growing interest in enabling LLMs to self-improve their response quality, which reduces the reliance on extensive human annotation effort to collect diverse, high-quality training data. Prompting-based methods have been widely explored for self-improvement because of their effectiveness, efficiency, and convenience.

But these methods usually require explicitly and exhaustively written rubrics as input to the LLM. The paper proposes PIT, an implicit self-improvement framework that learns the improvement goal from human preference data. PIT requires only the preference data used to train reward models, with no additional human effort.

The authors reformulate the training objective of reinforcement learning from human feedback (RLHF): instead of maximizing response quality for a given input, they maximize the quality gap of the response conditioned on a reference response. Trained this way, PIT's implicit improvement goal is better aligned with human preferences. Experiments on two real-world datasets and one synthetic dataset show that the method significantly outperforms prompting-based approaches.
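The reformulation can be sketched as a reward over response pairs rather than single responses; the sketch below is a simplified paraphrase of that setup, with gap_model as a placeholder for the reward model trained on preference data.

```python
def pit_reward(gap_model, prompt, response, reference_response):
    """PIT-style reward: how much better is `response` than `reference_response`?

    gap_model is trained on preference data to score the quality gap between
    two responses to the same prompt, rather than the absolute quality of one.
    """
    return gap_model(prompt, response, reference_response)

# During RL, the policy conditions on (prompt, reference_response) and is
# optimized to maximize pit_reward, i.e. to improve over the reference.
```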

LLM agents

HeaP: Hierarchical Policies for Web Actions using LLMs


Large language models (LLMs) have demonstrated a remarkable ability to perform a range of instruction-following tasks in few-shot and zero-shot settings.

But composing long-horizon, open-world tasks and coping with variation across web interfaces pose significant challenges for these models. The authors address these challenges by using an LLM to decompose web tasks into a collection of sub-tasks, each of which can be solved by a low-level, closed-loop policy.

These policies form a shared grammar across tasks: new web tasks can be expressed as compositions of these policies. The paper proposes a new framework, Hierarchical Policies for Web Actions using LLMs (HeaP), which learns from demonstrations a set of hierarchical LLM prompts for planning high-level tasks and executing them through a sequence of low-level policies.

HeaP was evaluated against baselines on a range of web tasks, including MiniWoB++, WebArena, a simulated airline CRM, and live website interactions, and it was shown to outperform prior work while using less data.
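The hierarchy can be sketched as a high-level planner that emits sub-task calls, each handled by a low-level policy that issues browser actions. All names below are illustrative placeholders, not HeaP's code.

```python
def heap_execute(task, web_state, plan_with_llm, policies, browser):
    """Plan a web task into sub-task calls, then run each with a low-level policy.

    `plan_with_llm` prompts an LLM to decompose the task, e.g. into
    [("fill_text", {...}), ("choose_date", {...})]; `policies` maps those
    policy names to closed-loop functions that observe the page and emit
    clicks/keystrokes until their sub-task is complete.
    """
    plan = plan_with_llm(task, web_state)
    for policy_name, args in plan:
        policy = policies[policy_name]
        policy(browser, **args)   # low-level loop: observe, act, repeat
```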

Written by Youssef Hosni
