---
base_model: Qwen/Qwen3.5-2B
datasets:
- KRLabsOrg/tool-output-extraction-swebench
language:
- en
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- code
- tool-output
- pruning
- coding-agents
- extraction
thumbnail: https://raw.githubusercontent.com/KRLabsOrg/squeez/main/assets/squeez_mascot.png
---
# Squeez-2B
**Squeez-2B** is a 2B parameter model fine-tuned from Qwen 3.5 2B for task-conditioned tool-output pruning in coding agents. Given a focused query and one raw tool observation, it extracts the smallest verbatim evidence block the agent should inspect next — removing **92%** of input tokens while retaining **0.86 recall**.
```
Tool output (500 lines) → Squeez → Relevant lines (30 lines) → Agent context
```
- Outperforms zero-shot Qwen 3.5 35B A3B by **+11 recall points**
- Returns verbatim lines only (no rewriting or summarization)
- Works as CLI pipe, Python library, or vLLM server
- Trained on **27 tool types** from real SWE-bench workflows and synthetic multi-ecosystem outputs
**Resources:** [Paper](https://arxiv.org/abs/2604.04979) | [Dataset](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) | [Code & CLI](https://github.com/KRLabsOrg/squeez) | [Blog post](https://huggingface.co/blog/KRLabsOrg/squeez)
## Results
Evaluated on 618 manually curated held-out examples spanning 27 tool types.
| Model | Prec. | Recall | F1 | Compression |
|-------|-------|--------|-----|-------------|
| **Squeez-2B** | **0.80** | **0.86** | **0.80** | 0.92 |
| Qwen 3.5 35B A3B (zero-shot) | 0.74 | 0.75 | 0.73 | 0.92 |
| Kimi K2 (zero-shot) | 0.61 | 0.53 | 0.68 | 0.94 |
| Qwen 3.5 2B (untrained) | 0.42 | 0.53 | 0.55 | 0.82 |
The fine-tuned 2B model is also the most precise system in the comparison, indicating it has learned a tool-specific extraction policy rather than relying on generic instruction following.
### Qualitative patterns
| Pattern | Example | Squeez-2B | Baseline failure |
|---------|---------|-----------|-----------------|
| Precise selection | `git_log`, 21 lines — find one commit | Selects the single correct entry | Qwen 35B picks a plausible but wrong commit |
| Failure-block extraction | Service log, 176 lines — two similar TLS errors | Returns the correct 5-line block | Qwen 35B picks the wrong TLS error (different timestamp) |
| Correct empty prediction | `docker_logs`, 316 lines — no matching evidence | Returns empty output | Qwen 35B generates "No relevant lines found..." |
| Adjacent over-selection | Build output, 110 lines — Dockerfile error | Finds the right error + nearby noise | Qwen 35B misses the Dockerfile error entirely |
On the 59 negative examples in the test set, Squeez-2B correctly returns empty output 80% of the time. Qwen 35B returns empty only 7% of the time.
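For intuition on how numbers like these are computed, here is an illustrative line-level sketch of the three metrics (the paper may define them at the token level; the conventions for empty sets below are assumptions, not taken from the card):

```python
def extraction_metrics(predicted: list[str], gold: list[str], total: int):
    """Line-level precision/recall and compression ratio.

    Illustrative only: the paper may compute these at the token level.
    """
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 1.0  # empty prediction: vacuously precise
    recall = tp / len(ref) if ref else 1.0       # negative example: nothing to recall
    compression = 1 - len(pred) / total          # fraction of input lines removed
    return precision, recall, compression

p, r, c = extraction_metrics(
    predicted=["FAILED tests/test_login.py::test_token_refresh"],
    gold=["FAILED tests/test_login.py::test_token_refresh"],
    total=100,
)
# precision 1.0, recall 1.0, compression 0.99
```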
## Quick Start
### CLI (recommended)
```bash
pip install squeez
# With vLLM server
vllm serve KRLabsOrg/squeez-2b --dtype bfloat16 --max-model-len 16384
export SQUEEZ_SERVER_URL=http://localhost:8000/v1
pytest -q 2>&1 | squeez "find the failure block"
git log --oneline -50 | squeez "find the commit that changed CSRF handling"
cat src/auth/middleware.py | squeez "find the referer validation logic"
```
### Python API
```python
from squeez.inference.extractor import ToolOutputExtractor

# vLLM server
extractor = ToolOutputExtractor(base_url="http://localhost:8000/v1")
# Or local
extractor = ToolOutputExtractor(model_path="KRLabsOrg/squeez-2b")

filtered = extractor.extract(
    task="Find the failing test block",
    tool_output=raw_output,
)
```
### With transformers directly
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "KRLabsOrg/squeez-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": (
        "You prune verbose tool output for a coding agent. "
        "Given a focused extraction query and one tool output, return only the "
        "smallest verbatim evidence block(s) the agent should read next. "
        "Return the kept text inside <kept>...</kept> tags. "
        "Do not rewrite, summarize, or invent lines."
    )},
    {"role": "user", "content": (
        "<task>\n"
        "Find the failing authentication test\n"
        "</task>\n"
        "<tool_output>\n"
        "PASSED tests/test_login.py::test_valid_credentials\n"
        "FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401\n"
        "PASSED tests/test_login.py::test_logout\n"
        "</tool_output>"
    )},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)

response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
# <kept>
# FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401
# </kept>
```
## Input/Output Format
**Input** — Chat messages with system prompt:
- System: extraction instructions (see above)
- User: `<task>{task}</task>` followed by `<tool_output>{raw_output}</tool_output>`

**Output** — Verbatim lines in XML tags:
```
<kept>
{only the lines that matter, copied verbatim}
</kept>
```
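If you call the model without the `squeez` client, a small stdlib helper can recover the kept lines from the raw response. The `<kept>` tag name is an assumption here; match it to whatever tag your served model actually emits:

```python
import re

def parse_kept(response: str) -> list[str]:
    """Extract verbatim kept lines from a raw model response.

    Assumes the model wraps its selection in <kept>...</kept> tags;
    returns [] when the block is absent or empty (a negative prediction).
    """
    m = re.search(r"<kept>\n?(.*?)\n?</kept>", response, re.DOTALL)
    if not m or not m.group(1).strip():
        return []
    return m.group(1).splitlines()

lines = parse_kept("<kept>\nFAILED tests/test_login.py::test_token_refresh\n</kept>")
# lines == ["FAILED tests/test_login.py::test_token_refresh"]
```

Returning an empty list on a missing or empty block mirrors the model's "correct empty prediction" behavior described above.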
## Supported Tool Types (27)
**SWE-bench derived (14):** `read_file` | `grep` | `git_log` | `git_blame` | `git_diff` | `test_output` | `python` | `curl` | `pip_install` | `ls` | `lint_output` | `build_output` | `type_check` | `coverage`
**Synthetic multi-ecosystem (13):** `npm_build` | `tsc` | `npm_install` | `docker_logs` | `docker_build` | `make_cmake` | `kubectl` | `cargo_build` | `go_build` | `mvn_gradle` | `terraform` | `mypy_pyright` | `eslint`
## Training Details
| | |
|---|---|
| **Base model** | Qwen/Qwen3.5-2B |
| **Method** | LoRA (r=16, alpha=32) via Unsloth |
| **Training data** | 10,508 examples (SWE-bench + synthetic) |
| **Epochs** | 3 |
| **Max sequence length** | 20,000 tokens |
| **Learning rate** | 2e-4 |
| **Batch size** | 8 (32 effective with 4x gradient accumulation) |
| **Hardware** | Single NVIDIA A100 80GB |
| **Dataset** | [KRLabsOrg/tool-output-extraction-swebench](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) |
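For readers who want to reproduce the adapter setup, the table's LoRA hyperparameters translate to a `peft`-style config roughly as follows. The `target_modules` list is an assumption (typical Qwen attention/MLP projections); it is not stated in this card, and the actual run used Unsloth's wrapper:

```python
from peft import LoraConfig

# r and lora_alpha come from the table above; target_modules are an
# assumption (typical Qwen projection layers), not confirmed by the card.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```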
## Usage with Coding Agents
Add to your `CLAUDE.md` or agent system prompt:
```
When you invoke a shell command, pipe it through `squeez` and describe what you need.
Examples:
- bun test 2>&1 | squeez "did the tests pass?"
- git log --oneline -50 | squeez "find the commit that broke CSRF"
- cat src/auth/middleware.py | squeez "find the referer validation logic"
```
## Limitations
- Best on software engineering tool output; not designed for general-purpose summarization
- Synthetic data generated by `openai/gpt-oss-120b` — may not fully reflect real-world distributions for all ecosystems
- Evaluates single tool observations, not full agent trajectories
- Max input: 20,000 tokens (training length); can be extended at serving time
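For observations past the 20,000-token training length, one workaround is to pre-chunk the output and extract from each chunk separately. A rough line-aligned sketch; the ~4 characters-per-token heuristic is an approximation of ours, not a figure from the paper:

```python
def chunk_lines(tool_output: str, max_tokens: int = 20_000, chars_per_token: int = 4) -> list[str]:
    """Split an oversized tool output into line-aligned chunks.

    Token counts are approximated as len(text) / chars_per_token; for
    exact budgeting, count with the model's tokenizer instead.
    """
    budget = max_tokens * chars_per_token
    chunks, current, size = [], [], 0
    for line in tool_output.splitlines(keepends=True):
        if current and size + len(line) > budget:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks
```

Each chunk can then be passed through the extractor with the same task, and the kept lines concatenated.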
## License
Apache 2.0
## Citation
```bibtex
@misc{kovacs2026squeeztaskconditionedtooloutputpruning,
title={Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents},
author={Ádám Kovács},
year={2026},
eprint={2604.04979},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2604.04979},
}
```