RESEARCH · May 2025

"Meta-Reflection" Learning for Small Models

Introducing the ReflectEvo framework — enabling self-evolution of small models through self-generated reflection data without human annotation

Paper (arXiv) | Code (GitHub) | Dataset (HuggingFace)

In the pursuit of stronger reasoning capabilities for large language models (LLMs), a central challenge persists: how can smaller models achieve human-like "self-reflection" and "continuous improvement" without relying on expensive human annotations or distillation from larger models?

Recently, the Language and Interaction Laboratory at the Beijing Institute for General Artificial Intelligence (BIGAI) introduced ReflectEvo, a pioneering "meta-reflection" learning framework. For the first time, this framework systematically leverages reflection data generated by small language models themselves to achieve iterative evolution of reasoning capabilities through self-supervised training, opening a new pathway for efficient, low-cost self-improvement of small models.

Background

From "Generating Answers" to "Reflecting on Errors": Why Do Small Models Need Meta-Reflection?

Self-reflection is a critical component of human cognition — it refers to the active examination and evaluation of one's own behaviors and thought processes. For language models, self-reflection means the ability to review their reasoning paths, identify deviations in intermediate steps, analyze the causes of failures, and propose corrective strategies.

Large language models have already demonstrated some capacity for self-reflection and error correction. However, existing approaches either depend on very large parameter counts or require supervision signals distilled from more powerful models (e.g., GPT-4). For resource-constrained smaller models, training reflection capabilities without high-quality annotated data remains a formidable challenge.

ReflectEvo is proposed precisely to address this core problem: we investigate whether small models can accomplish effective reflection learning through "self-generated data," combining their own low-quality reflections with a small number of successful high-quality reflections to construct a progressively self-optimizing learning loop.

Core Design

Four Key Innovations of ReflectEvo

Automated Pipeline for Reflection Data Construction and Learning (ReflectEvo)

We propose, for the first time, the use of self-generated reflection data from small models for "meta-reflection learning." Through a dual-module architecture comprising a "Generator" and a "Reflector," along with a three-stage reflection instruction protocol — "identify failure, locate error, propose correction" — the system automatically produces structured reflection content. The model forms a human-like learning trajectory of "reflect — improve — reflect again," enabling self-improvement.

Large-Scale Reflection Dataset: ReflectEvo-460K

Using small models, we automatically constructed a large-scale dataset containing 460,000 self-reflection samples, spanning 10 task categories — including mathematics, code, logic, and commonsense reasoning — drawn from 17 distinct data sources. This provides small models with a multi-domain, highly generalizable training foundation.

Fully Self-Supervised Reflection Training Paradigm

We propose four training paradigms based on Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Training relies entirely on data generated by the small models themselves, with no human annotation and no distillation from larger models, which significantly reduces training cost and resource consumption. Small models trained this way can even surpass larger models from the same family.

Reusable and Transferable Reflection Learning Framework

ReflectEvo serves as a plug-and-play reasoning enhancement module that can be transferred across diverse tasks and applied to models of varying architectures and scales. It is well-suited for continual learning and model updating in low-resource environments, demonstrating strong evolutionary potential and generalizability.

Figure 1: End-to-end pipeline of the ReflectEvo framework, encompassing data generation, filtering, and training stages.

Technical Approach

End-to-End Self-Reflection Evolution Pipeline

ReflectEvo is a complete pipeline for collecting self-generated reflection data from small models and using that data to train the models themselves.

1. Generating Reflection Data

The core of the pipeline is the collaboration between a Generator and a Reflector:

Generator: Produces an initial answer with a chain-of-thought reasoning trace for a given problem. If the answer is incorrect, reflection is triggered.
Reflector: Using the same base model as the Generator, it executes two stages:
1. Self-Reflection: Analyzes the initial answer, locates reasoning errors, and diagnoses their causes.
2. Self-Correction: Based on the reflection, generates a revised answer.

A three-stage reflection instruction pool, combined with rejection sampling, keeps the generated reflections diverse and high in quality.
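The Generator and Reflector interplay described above can be sketched roughly as follows. This is a minimal illustration, not the released ReflectEvo code: the callables `generate`, `reflect`, and `correct`, the `ReflectionSample` record, the wording of `INSTRUCTION_POOL`, and the `samples_per_instruction` parameter are all hypothetical names introduced here, and exact string match against a gold answer is an assumed correctness check.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical instruction pool; the wording is illustrative only.
INSTRUCTION_POOL = [
    "Identify whether the previous answer fails and why.",
    "Locate the step where the reasoning goes wrong.",
    "Propose a correction strategy for the next attempt.",
]


@dataclass
class ReflectionSample:
    question: str
    first_answer: str
    instruction: str
    reflection: str
    revised_answer: str
    corrected: bool  # True if the revised answer matches the gold answer


def collect_reflections(
    question: str,
    gold: str,
    generate: Callable[[str], str],
    reflect: Callable[[str, str, str], str],
    correct: Callable[[str, str, str], str],
    instructions: List[str] = INSTRUCTION_POOL,
    samples_per_instruction: int = 1,
) -> List[ReflectionSample]:
    """Collect reflection samples for one question (assumed data layout)."""
    first = generate(question)
    if first == gold:
        return []  # reflection is only triggered by an incorrect first answer

    samples = []
    for instruction in instructions:
        for _ in range(samples_per_instruction):
            # Stage 1: self-reflection on the failed attempt.
            reflection = reflect(question, first, instruction)
            # Stage 2: self-correction conditioned on that reflection.
            revised = correct(question, first, reflection)
            samples.append(
                ReflectionSample(
                    question, first, instruction, reflection, revised,
                    corrected=(revised == gold),
                )
            )
    return samples
```

Recording the `corrected` flag at collection time is a design choice of this sketch: it lets the later filtering stage separate successful from failed reflections without calling the model again.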

Figure 2: The ReflectEvo reflection pipeline.

2. Reflection Data Filtering and Construction

The generated data undergoes rigorous filtering to construct three training subsets:

D+ (High-Quality Set): Retains only samples where the post-reflection answer was corrected successfully.
Dpref (Preference Pair Set): Uses GPT-4o to select superior reflections from D+, constructing pairwise preference data.
D+- (Positive-Negative Contrast Set): Combines successful reflections (positive samples) with failed reflections (negative samples), enabling the model to simultaneously learn "how to succeed" and "how to avoid failure."
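One way to picture the three subsets is the sketch below. It assumes each collected sample is a dictionary with `question`, `reflection`, and `corrected` keys, and that `judge` is a hypothetical callable standing in for the GPT-4o comparison (it returns whichever of two samples it prefers). The pairing rules, in particular pairing only reflections on the same question, are assumptions made for illustration and may differ from the actual pipeline.

```python
from collections import defaultdict
from itertools import combinations


def build_subsets(samples, judge):
    """Split reflection samples into D+, Dpref, and D+- (assumed layout)."""
    # D+: keep only reflections whose corrected answer was right.
    d_plus = [s for s in samples if s["corrected"]]

    # Group successful and failed reflections by question.
    pos, neg = defaultdict(list), defaultdict(list)
    for s in samples:
        (pos if s["corrected"] else neg)[s["question"]].append(s)

    # Dpref: among successful reflections on the same question,
    # let the judge pick the better one of each pair.
    d_pref = []
    for question, group in pos.items():
        for a, b in combinations(group, 2):
            winner = judge(a, b)
            loser = b if winner is a else a
            d_pref.append({
                "prompt": question,
                "chosen": winner["reflection"],
                "rejected": loser["reflection"],
            })

    # D+-: pair each successful reflection with each failed one
    # on the same question.
    d_pos_neg = [
        {"prompt": q, "chosen": p["reflection"], "rejected": n["reflection"]}
        for q in pos
        for p in pos[q]
        for n in neg.get(q, [])
    ]
    return d_plus, d_pref, d_pos_neg
```

The `prompt` / `chosen` / `rejected` layout is used here because it is a common shape for preference data; in practice the prompt would likely also carry the first answer and the reflection instruction.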

3. Reflection Learning on Self-Generated Data

Training is conducted using the constructed datasets:

Supervised Fine-Tuning (SFT): Trains the model on D+ to learn how to leverage reflections for answer improvement.
Direct Preference Optimization (DPO): Performs preference learning on Dpref and D+- to optimize the quality of the model's reflection generation.
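For readers unfamiliar with DPO, the per-pair objective can be written in a few lines. The sketch below implements the standard DPO loss for a single (chosen, rejected) reflection pair from summed sequence log-probabilities under the trained policy and a frozen reference model. The function name and the default `beta=0.1` are assumptions for illustration; the article does not state the hyperparameters used.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected
    reflections under the policy and the frozen reference model.
    """
    # How much more the policy favors the chosen reflection over the
    # rejected one, relative to the reference model, scaled by beta.
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(margin)) computed in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy shifts probability toward the chosen reflection. SFT on D+, by contrast, is ordinary next-token cross-entropy on the successful reflections and needs no reference model.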

4. Inference

At inference time, the trained reflection model iterates "self-reflection, then self-correction" for up to a fixed number of rounds (e.g., two), stopping as soon as the answer is correct or the maximum number of rounds is reached.
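The inference loop can be sketched as below. The callables `generate`, `reflect`, `correct`, and `is_correct` are hypothetical stand-ins: the first three wrap the trained model, and `is_correct` represents whatever correctness feedback is available at inference time (for example, a benchmark's answer checker), which the article does not specify.

```python
def answer_with_reflection(question, generate, reflect, correct,
                           is_correct, max_rounds=2):
    """Run reflect-then-correct rounds until correct or out of rounds.

    Returns the final answer and the number of reflection rounds used.
    """
    answer = generate(question)
    rounds = 0
    while rounds < max_rounds and not is_correct(answer):
        # Diagnose the failed attempt, then revise it.
        reflection = reflect(question, answer)
        answer = correct(question, answer, reflection)
        rounds += 1
    return answer, rounds
```

With `max_rounds=2` this matches the two-round example in the text: a correct first answer uses zero rounds, and an answer that is still wrong after two revisions is returned as-is.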

Dataset and Results

Large-Scale Reflection Data Spanning Multiple Domains

The ReflectEvo-460K dataset integrates tasks from 17 data sources, including LogiQA, MATH, MBPP, and BIG-bench, covering 10 categories such as mathematical reasoning, code generation, logical reasoning, and commonsense question answering, providing the model with extensive learning material.

Figure 3: Distribution of task categories in the ReflectEvo-460K dataset.

Experiments demonstrate that small models trained with ReflectEvo (e.g., Llama-3-8B, Mistral-7B) achieve significant performance improvements across multiple reasoning benchmarks. Their reasoning capabilities can even surpass those of larger models from the same family that have not undergone reflection training. This validates the effectiveness of self-generated reflection data for "meta-reflection" learning.

Figure 4: A ReflectEvo reflection example illustrating how the model identifies errors, locates issues, and proposes corrections.

Outlook

Toward Explainable AI with Metacognitive Capabilities

ReflectEvo is not only an efficient framework for enhancing small-model capabilities but also a systematic exploration of "metacognitive" abilities in language models. It shows that small models can learn and evolve from self-generated "experience," providing a new technical foundation for continual learning in low-resource environments.

Going forward, we will continue to explore the following directions:

🔄

Multi-Round Reflection and Long-Term Evolution

Extending the current single-round reflection to more complex multi-round, long-term self-iterative learning loops.

🧩

Cross-Modal and Cross-Task Generalization

Transferring the reflection learning mechanism to more complex tasks such as multimodal understanding and embodied reasoning.

👁️

Enhanced Explainability and Reliability

Further leveraging structured reflection processes to improve the explainability and trustworthiness of model decisions.

We believe that endowing models with the capacity for "self-reflection" is a critical step toward more intelligent, reliable, and explainable artificial intelligence. All ReflectEvo code, models, and data have been open-sourced, and we invite colleagues in academia and industry to help advance this research direction.

Self-Reflection, Metacognition, Small Language Models, Self-Supervised Learning, Reasoning Enhancement

Authors

Jiaqi Li¹, Xinyi Dong², Yang Liu¹, Zhizhuo Yang², Quansen Wang¹, Xiaobo Wang¹, Song-Chun Zhu¹, Zixia Jia¹†, Zilong Zheng¹†

¹ BIGAI, ² Peking University

† Corresponding authors.