---
base_model: Qwen/Qwen3.5-2B
datasets:
- KRLabsOrg/tool-output-extraction-swebench
language:
- en
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- code
- tool-output
- pruning
- coding-agents
- extraction
thumbnail: https://raw.githubusercontent.com/KRLabsOrg/squeez/main/assets/squeez_mascot.png
---

![Squeez mascot](https://raw.githubusercontent.com/KRLabsOrg/squeez/main/assets/squeez_mascot.png)

# Squeez-2B

**Squeez-2B** is a 2B-parameter model fine-tuned from Qwen 3.5 2B for task-conditioned tool-output pruning in coding agents. Given a focused query and one raw tool observation, it extracts the smallest verbatim evidence block the agent should inspect next, removing **92%** of input tokens while retaining **0.86 recall**.

```
Tool output (500 lines) → Squeez → Relevant lines (30 lines) → Agent context
```

- Outperforms zero-shot Qwen 3.5 35B A3B by **+11 recall points**
- Returns verbatim lines only (no rewriting or summarization)
- Works as a CLI pipe, Python library, or vLLM server
- Trained on **27 tool types** from real SWE-bench workflows and synthetic multi-ecosystem outputs

**Resources:** [Paper](https://arxiv.org/abs/2604.04979) | [Dataset](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) | [Code & CLI](https://github.com/KRLabsOrg/squeez) | [Blog post](https://huggingface.co/blog/KRLabsOrg/squeez)

## Results

Evaluated on 618 manually curated held-out examples spanning 27 tool types.

| Model | Prec. | Recall | F1 | Compression |
|-------|-------|--------|----|-------------|
| **Squeez-2B** | **0.80** | **0.86** | **0.80** | 0.92 |
| Qwen 3.5 35B A3B (zero-shot) | 0.74 | 0.75 | 0.73 | 0.92 |
| Kimi K2 (zero-shot) | 0.61 | 0.53 | 0.68 | 0.94 |
| Qwen 3.5 2B (untrained) | 0.42 | 0.53 | 0.55 | 0.82 |

The fine-tuned 2B model is also the most precise system in the comparison, indicating that it has learned a tool-specific extraction policy rather than relying on generic instruction following.
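To make the metrics concrete, here is a minimal sketch of how line-level precision, recall, and compression for one example could be computed. The exact metric definitions (set semantics, per-example averaging) are an assumption for illustration, not necessarily how the paper scores them:

```python
def line_metrics(predicted, gold, original):
    """Hypothetical line-level extraction metrics for one example.

    predicted: lines the model kept; gold: annotated evidence lines;
    original: all lines of the raw tool output.
    """
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)  # lines kept that are truly relevant
    precision = tp / len(pred) if pred else 1.0
    recall = tp / len(ref) if ref else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    compression = 1 - len(predicted) / len(original)  # fraction of input removed
    return precision, recall, f1, compression


original = [f"line {i}" for i in range(100)]
gold = ["line 10", "line 11", "line 12"]
predicted = ["line 10", "line 11", "line 40"]  # two hits, one spurious line

p, r, f1, c = line_metrics(predicted, gold, original)
# p ≈ 0.67, r ≈ 0.67, compression = 0.97
```

Under these definitions, keeping 3 of 100 lines yields 0.97 compression, which is the regime the models above operate in (0.82–0.94).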
### Qualitative patterns

| Pattern | Example | Squeez-2B | Baseline failure |
|---------|---------|-----------|------------------|
| Precise selection | `git_log`, 21 lines — find one commit | Selects the single correct entry | Qwen 35B picks a plausible but wrong commit |
| Failure-block extraction | Service log, 176 lines — two similar TLS errors | Returns the correct 5-line block | Qwen 35B picks the wrong TLS error (different timestamp) |
| Correct empty prediction | `docker_logs`, 316 lines — no matching evidence | Returns empty output | Qwen 35B generates "No relevant lines found..." |
| Adjacent over-selection | Build output, 110 lines — Dockerfile error | Finds the right error + nearby noise | Qwen 35B misses the Dockerfile error entirely |

On the 59 negative examples in the test set, Squeez-2B correctly returns empty output 80% of the time; Qwen 35B returns empty only 7% of the time.

## Quick Start

### CLI (recommended)

```bash
pip install squeez

# With vLLM server
vllm serve KRLabsOrg/squeez-2b --dtype bfloat16 --max-model-len 16384
export SQUEEZ_SERVER_URL=http://localhost:8000/v1

pytest -q 2>&1 | squeez "find the failure block"
git log --oneline -50 | squeez "find the commit that changed CSRF handling"
cat src/auth/middleware.py | squeez "find the referer validation logic"
```

### Python API

```python
from squeez.inference.extractor import ToolOutputExtractor

# vLLM server
extractor = ToolOutputExtractor(base_url="http://localhost:8000/v1")

# Or local
extractor = ToolOutputExtractor(model_path="KRLabsOrg/squeez-2b")

filtered = extractor.extract(
    task="Find the failing test block",
    tool_output=raw_output,
)
```

### With transformers directly

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "KRLabsOrg/squeez-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": (
        "You prune verbose tool output for a coding agent. "
        "Given a focused extraction query and one tool output, return only the "
        "smallest verbatim evidence block(s) the agent should read next. "
        "Return the kept text inside tags. "
        "Do not rewrite, summarize, or invent lines."
    )},
    {"role": "user", "content": (
        " Find the failing authentication test "
        " "
        "PASSED tests/test_login.py::test_valid_credentials "
        "FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401 "
        "PASSED tests/test_login.py::test_logout "
    )},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)

response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
#
# FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401
#
```

## Input/Output Format

**Input** — Chat messages with a system prompt:

- System: extraction instructions (see above)
- User: `{task} {raw_output}`

**Output** — Verbatim lines in XML tags:

```
{only the lines that matter, copied verbatim}
```

## Supported Tool Types (27)

**SWE-bench derived (14):**
`read_file` | `grep` | `git_log` | `git_blame` | `git_diff` | `test_output` | `python` | `curl` | `pip_install` | `ls` | `lint_output` | `build_output` | `type_check` | `coverage`

**Synthetic multi-ecosystem (13):**
`npm_build` | `tsc` | `npm_install` | `docker_logs` | `docker_build` | `make_cmake` | `kubectl` | `cargo_build` | `go_build` | `mvn_gradle` | `terraform` | `mypy_pyright` | `eslint`

## Training Details

| | |
|---|---|
| **Base model** | Qwen/Qwen3.5-2B |
| **Method** | LoRA (r=16, alpha=32) via Unsloth |
| **Training data** | 10,508 examples (SWE-bench + synthetic) |
| **Epochs** | 3 |
| **Max sequence length** | 20,000 tokens |
| **Learning rate** | 2e-4 |
| **Batch size** | 8 (32 effective with 4x gradient accumulation) |
| **Hardware** | Single NVIDIA A100 80GB |
| **Dataset** | [KRLabsOrg/tool-output-extraction-swebench](https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench) |

## Usage with Coding Agents

Add to your `CLAUDE.md` or agent system prompt:

```
When you invoke a shell command, pipe it through `squeez` and describe what you need. Examples:
- bun test 2>&1 | squeez "did the tests pass?"
- git log --oneline -50 | squeez "find the commit that broke CSRF"
- cat src/auth/middleware.py | squeez "find the referer validation logic"
```

## Limitations

- Best on software-engineering tool output; not designed for general-purpose summarization
- Synthetic data generated by `openai/gpt-oss-120b` — may not fully reflect real-world distributions for all ecosystems
- Evaluates single tool observations, not full agent trajectories
- Max input: 20,000 tokens (training length); can be extended at serving time

## License

Apache 2.0

## Citation

```bibtex
@misc{kovacs2026squeeztaskconditionedtooloutputpruning,
  title={Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents},
  author={Ádám Kovács},
  year={2026},
  eprint={2604.04979},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2604.04979},
}
```