# Recursive-SWE-bench
## Open Source
[PolyForm Noncommercial 1.0.0](https://polyformproject.org/licenses/noncommercial/1.0.0/) · [CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/)
## Evolution Beyond Linear Benchmarking
Recursive-SWE-bench extends the established [**`SWE-bench`**](https://github.com/princeton-nlp/SWE-bench) framework to measure adaptive intelligence in software engineering tasks through recursive evaluation paradigms. While traditional benchmarks measure static, single-pass performance, Recursive-SWE-bench evaluates dynamic problem-solving capabilities across iterative refinement cycles.
**Key innovation**: Benchmark tasks self-modify as models interact with them, creating a feedback loop that more accurately reflects real-world software engineering challenges.
## Why Recursive Benchmarking?
Traditional benchmarks evaluate models using a linear, static framework:
```
Input → Model → Output → Evaluation → Score
```
Real-world engineering is inherently recursive:
```
Problem → Solution → Testing → Feedback → Refinement → New Problem State → ...
```
Recursive-SWE-bench captures this dynamic process (a code sketch of the loop follows this list), measuring:
- **Adaptive reasoning**: How models incorporate feedback into subsequent solution attempts
- **Self-correction**: The ability to identify and fix errors across iterations
- **Learning efficiency**: How quickly models converge on optimal solutions
- **Meta-problem understanding**: Recognition of patterns across related problem states
- **Probabilistic optimization**: Managing uncertainty in problem specifications and solution spaces
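To make the recursive loop concrete, here is a minimal sketch of one evaluation episode in Python. Every name in it (`task.initial_state`, `model.solve`, `task.next_state`, and so on) is illustrative only, not the benchmark's actual API:

```python
# Minimal sketch of one recursive evaluation episode.
# All object and method names here are assumptions for illustration.

def run_episode(task, model, max_iterations=5):
    """Drive the Problem -> Solution -> Feedback -> Refinement loop."""
    history = []
    state = task.initial_state()  # starting problem state
    for _ in range(max_iterations):
        # The model attempts a solution, conditioned on all feedback so far.
        solution = model.solve(state, history)
        # Run tests against the current state and collect diagnostics.
        result = task.evaluate(state, solution)
        history.append((state, solution, result))
        if result.passed:
            break
        # The task self-modifies: the attempt and its feedback
        # produce a new problem state for the next iteration.
        state = task.next_state(state, solution, result)
    return history
```

The key difference from a single-pass harness is the last line of the loop body: the problem state fed to the next attempt depends on what the model just did.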
## Core Innovations
1. **Dynamic Task Evolution**: Tasks transform based on model interactions, generating unique problem sequences for each evaluation run
2. **Recursive Evaluation Metrics**: Performance measured across solution trajectories rather than single attempts
3. **Self-Modifying Test Harnesses**: Evaluation environments that adapt to model capabilities, maintaining consistent challenge levels
4. **Meta-learning Assessment**: Explicit measurement of knowledge transfer between related problems
5. **Feedback Integration Protocols**: Standardized frameworks for delivering actionable feedback to models (one possible feedback shape is sketched below)
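As an illustration of what a standardized feedback record might carry, here is a hypothetical schema; the field names are assumptions, not the package's published format:

```python
# Hypothetical shape of a standardized feedback record.
# Field names are illustrative assumptions, not the actual schema.
from dataclasses import dataclass, field

@dataclass
class Feedback:
    passed: bool                    # did this attempt satisfy the task?
    failing_tests: list[str]        # identifiers of tests that still fail
    diagnostics: list[str]          # compiler/runtime messages shown to the model
    hints: list[str] = field(default_factory=list)  # optional actionable guidance
    iteration: int = 0              # position within the recursive trajectory
```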
## Quick Start
```bash
# Install the package
pip install recursive-swe-bench

# Run a basic evaluation
rswe-bench evaluate --model your-model-name --task-set standard --iterations 5

# Generate a performance report
rswe-bench report --results-dir ./results --visualization recursive-trajectory
```
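For programmatic use, something like the following may mirror the CLI. The `recursive_swe_bench` module path, `evaluate` function, and `summary()` method are assumptions, not a documented API; consult the package documentation for the real entry points:

```python
# Hypothetical Python equivalent of the CLI above; names are assumptions.
from recursive_swe_bench import evaluate  # assumed entry point

results = evaluate(
    model="your-model-name",
    task_set="standard",
    iterations=5,
)
print(results.summary())  # e.g., per-iteration pass rates and convergence stats
```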
## Benchmark Structure
Recursive-SWE-bench organizes tasks into recursive trajectories built from four components (a minimal interface sketch follows the list):
- **Task Generators**: Dynamically create problem instances based on model interaction history
- **Feedback Modules**: Provide standardized assessment of solutions with actionable insights
- **State Trackers**: Maintain the evolving state of problems across solution attempts
- **Meta-Pattern Evaluators**: Assess model ability to identify patterns across problem sequences
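A minimal sketch of how these four components might be expressed as interfaces; the class and method names are illustrative, not the package's published classes:

```python
# Illustrative interfaces for the four components; not the actual API.
from abc import ABC, abstractmethod

class TaskGenerator(ABC):
    @abstractmethod
    def next_task(self, interaction_history):
        """Create the next problem instance from the model's history."""

class FeedbackModule(ABC):
    @abstractmethod
    def assess(self, task, solution):
        """Score a solution and return actionable feedback."""

class StateTracker(ABC):
    @abstractmethod
    def update(self, task, solution, feedback):
        """Record the evolving problem state across attempts."""

class MetaPatternEvaluator(ABC):
    @abstractmethod
    def score_transfer(self, trajectories):
        """Measure pattern recognition across related problem sequences."""
```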
## Task Categories
| Category | Description | Recursive Elements |
|----------|-------------|--------------------|
| Bug Fixing | Identify and resolve issues in existing code | Error patterns transform based on fix attempts |
| Feature Implementation | Add functionality to existing codebases | Requirements evolve as implementation progresses |
| Refactoring | Improve code structure without changing behavior | Complexity dynamically adjusts to refactoring success |
| System Design | Create architecture for complex systems | Design constraints adapt to proposed solutions |
| Test Generation | Create effective test suites | Test coverage requirements shift with implementation |
| Documentation | Create clear technical documentation | Clarity targets adapt to explanation attempts |
## Performance Metrics
Recursive-SWE-bench evaluates models using both traditional and recursive metrics:
### Traditional Metrics
- Pass@k (for varying k; the standard estimator is shown below)
- Execution accuracy
- Code similarity to human solutions
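Pass@k is conventionally computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021): given n samples per problem of which c pass, it estimates the probability that at least one of k drawn samples passes.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated, c: samples that pass, k: budget.
    """
    if n - c < k:
        return 1.0  # every size-k draw must include a passing sample
    # Numerically stable form of 1 - C(n-c, k) / C(n, k).
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```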
### Recursive Metrics
- **Convergence Rate**: How quickly models reach stable solutions (see the sketch after this list)
- **Adaptation Efficiency**: Performance improvements per feedback iteration
- **Transfer Learning Factor**: Performance gains across related problems
- **Learning Curve Area**: Integration of performance across all iterations
- **Probabilistic Solution Quality**: Distribution of solution quality across runs
- **Dynamic Complexity Handling**: Performance across varying problem complexity
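To make the recursive metrics concrete, here is one plausible way to compute three of them from a trajectory of per-iteration scores. The formulas are reasonable readings of the metric names, not the benchmark's official definitions:

```python
# Illustrative computations over one solution trajectory; the exact
# definitions used by the benchmark may differ.
import numpy as np

scores = np.array([0.20, 0.55, 0.70, 0.78, 0.80])  # example per-iteration scores in [0, 1]

# Learning Curve Area: normalized trapezoidal area under the score curve.
learning_curve_area = ((scores[1:] + scores[:-1]) / 2).mean()

# Adaptation Efficiency: average score improvement per feedback iteration.
adaptation_efficiency = np.diff(scores).mean()

# Convergence Rate (one reading): first iteration where the score stops
# changing by more than eps, i.e. the solution has stabilized.
eps = 0.05
deltas = np.abs(np.diff(scores))
converged_at = int(np.argmax(deltas < eps)) + 1 if (deltas < eps).any() else None

print(learning_curve_area, adaptation_efficiency, converged_at)
```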
## Sample Results
Here's how various models perform on Recursive-SWE-bench:
<p align="center">
<img src="docs/assets/performance-comparison.png" alt="Performance Comparison" width="650"/>
</p>
*Note: These preliminary results demonstrate how recursive evaluation reveals capabilities not captured by traditional single-pass benchmarks.*
## Citation
If you use Recursive-SWE-bench in your research, please cite:
```bibtex
@article{recursive2025swebench,
  title={Recursive-SWE-bench: Evaluating Adaptive Programming Intelligence Through Self-Modifying Benchmarks},
  author={Recursive Labs Team},
  journal={arXiv preprint arXiv:2505.12345},
  year={2025}
}
```
## Contributing
We welcome contributions to Recursive-SWE-bench! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
### Key Areas for Contribution
- Additional recursive task generators
- Enhanced feedback mechanisms
- New evaluation metrics
- Integration with more models and frameworks
- Documentation and tutorials
## License
Recursive-SWE-bench is released under the [MIT License](LICENSE).
## Acknowledgments
Recursive-SWE-bench builds on the foundation established by the original SWE-bench from the Princeton NLP group. We are grateful for their pioneering work, which this project extends in new directions.