# Steelman-14B-Ada v0.3
A 14B parameter model for Ada 2022 and SPARK code generation. QLoRA fine-tune of Qwen2.5-Coder-14B on 6,110 compiler-verified instruction pairs across 9 training categories, evaluated on a 754-prompt benchmark spanning 10 task categories. Named after the 1978 DoD Steelman requirements that defined the Ada language.
Built by one person with AI assistance.
## Headline Results

Steelman Eval v4: 754 prompts, 10 Ada task categories, strict GNAT compilation + functional scoring.
| Model | Size | Score |
|---|---|---|
| Steelman R7 v0.3 | 14B (QLoRA) | 62.4% |
| GPT-5.4 | -- | 12.9% |
| Claude Opus 4.6 | -- | 12.7% |
| Gemini 3.1 Pro | -- | 5.3%* |
| Grok 4 | -- | 19.2%* |
*Grok 4 and Gemini 3.1 Pro results are partial (537/754 and 547/754 respectively). Final numbers will be updated when complete.
A 14B QLoRA adapter outperforms every frontier model tested on strict Ada code generation, beating the best complete frontier result (GPT-5.4 at 12.9%) by 49.5 percentage points.
## Per-Category Breakdown (Steelman Eval v4)
| Category | Count | Steelman R7 | Opus 4.6 | GPT-5.4 |
|---|---|---|---|---|
| spark | 53 | 79.2% | 9.4% | 15.1% |
| error_fix | 78 | 73.1% | 14.1% | 0.0% |
| tasking | 54 | 70.4% | 7.4% | 29.6% |
| ada2022 | 47 | 70.2% | 8.5% | 8.5% |
| spec_to_body | 88 | 67.0% | 54.5% | 51.1% |
| multi_file | 57 | 59.6% | 0.0% | 0.0% |
| spec_impl | 91 | 59.1% | 3.8% | 4.4% |
| generics | 55 | 54.5% | 0.0% | 1.8% |
| modification | 124 | 53.8% | 5.8% | 4.8% |
| standard | 107 | 53.3% | 12.1% | 12.1% |
Steelman leads in all 10 categories. The largest gaps are on error_fix (+59pp vs Opus), multi_file (+59.6pp vs both), and generics (+54.5pp vs Opus).
## MultiPL-E HumanEval-Ada (157 problems, pass@1)
Standard code generation benchmark translated to Ada. Measures functional correctness: code must compile, run, and produce correct output.
| Model | Pass@1 | Compile Rate |
|---|---|---|
| Claude Opus 4.6 | 81.5% (128/157) | 83.4% |
| Grok 4 | 77.1% (121/157) | 81.5% |
| GPT-5.4 | 70.7% (111/157) | 80.2% |
| Steelman R7 v0.3 | 45.9% (72/157) | 85.4% (134/157) |
| Gemini 3.1 Pro | 35.7% (56/157) | 37.6% |
Steelman has the highest Ada compile rate of any model tested (85.4%). Frontier models win on algorithmic correctness because HumanEval tests general CS algorithms, not Ada-specific skills; our eval (above) tests what matters for Ada development.
Reproducibility note: HumanEval is a code completion benchmark -- it provides a code prefix and expects the model to continue it. This model uses an Alpaca instruction template for inference. When we ran HumanEval through the Alpaca template (wrapping the code prefix in ### Instruction: / ### Response:), the model scored 8.9% because it treated the code as an instruction rather than code to complete. The 45.9% result above uses a raw completion template ({{ .Prompt }}{{ .Response }} with no system prompt and no instruction wrapping). If you are reproducing these results, you must use a raw template for HumanEval, not the Alpaca template used for instruction-following tasks.
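For reference, a minimal Ollama Modelfile for the raw completion template might look like this (a sketch; the GGUF filename matches the one used elsewhere on this card):

```
FROM ./steelman-r7-coder-base-q8_0.gguf
TEMPLATE "{{ .Prompt }}{{ .Response }}"
PARAMETER temperature 0.0
```

Note the absence of a SYSTEM line and of any `### Instruction:` wrapping: the code prefix passes straight through to the model.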
## What Changed in v0.3

| | v0.2 (R6) | v0.3 (R7) |
|---|---|---|
| Base model | Qwen2.5-Coder-14B-Instruct | Qwen2.5-Coder-14B (base) |
| Training data | 3,018 pairs | 6,110 pairs (all compile-verified) |
| LoRA rank / alpha | 32 / 64 | 64 / 128 (rsLoRA) |
| Template | ChatML | Alpaca |
| Eval | v3 (500 prompts, compile-only) | v4 (754 prompts, compile + functional scoring) |
| Task categories | 8 | 10 (+modification, +spec_impl) |
| Decontamination | Cosine similarity 0.80 | Multi-stage: exact + 10-gram + Levenshtein |
Why the base variant? With 6,110 training examples, the base model learns more effectively than starting from an instruction-tuned checkpoint. Qwen2.5-Coder has Ada in its 92-language training set, making it the strongest available foundation for Ada fine-tuning.
Why the eval numbers look lower than v0.2: Eval v4 is substantially harder than eval v3. It uses -gnatwe (warnings as errors) plus functional correctness scoring (modification tasks require passing test suites, not just compiling). The model is better; the bar is higher.
## What This Model Does
- Generates complete Ada 2022 source files from natural language instructions
- Writes SPARK contracts: preconditions, postconditions, loop invariants, type invariants, ghost code
- Generates package bodies from `.ads` specifications (spec-to-body)
- Modifies existing Ada code: adds features, fixes bugs, refactors (new in R7)
- Handles multi-file Ada projects using `[FILE: name.ext]`/`[/FILE]` delimiters
- Fixes compilation errors given GNAT diagnostics (73.1% accuracy)
- Uses modern Ada 2022 features: expression functions, aspects, quantified expressions, delta aggregates, declare expressions, target name `@`
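As an illustration of the multi-file delimiter convention, a response covering a spec and its body might look like this (the package contents are illustrative, not model output):

```
[FILE: counters.ads]
package Counters is
   procedure Increment (Value : in out Natural)
     with Pre => Value < Natural'Last;
end Counters;
[/FILE]
[FILE: counters.adb]
package body Counters is
   procedure Increment (Value : in out Natural) is
   begin
      Value := Value + 1;
   end Increment;
end Counters;
[/FILE]
```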
## What This Model Does NOT Do
- Not optimized for code completion or fill-in-the-middle (FIM) -- it generates complete files from instructions
- Not a chatbot -- best results come from direct code generation prompts with a system prompt
- SPARK contracts compile but may not prove with `gnatprove` -- formal verification is not part of the training loop
- Algorithmic correctness is weaker than frontier models (45.9% vs 81.5% on HumanEval) -- the model excels at Ada-specific patterns, not general CS puzzles
## Usage

### Ollama (recommended)

Download the GGUF from steelman-14b-ada-GGUF and create a Modelfile:
```
FROM ./steelman-r7-coder-base-q8_0.gguf
# Multi-line templates need triple quotes; {{ .System }} injects the
# SYSTEM prompt into the instruction block, matching the Python template below.
TEMPLATE """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{{ .System }}

{{ .Prompt }}

### Response:
{{ .Response }}"""
SYSTEM "You are an expert Ada 2022 and SPARK programmer."
PARAMETER stop "### Instruction:"
PARAMETER temperature 0.0
PARAMETER num_ctx 32768
```

```shell
ollama create steelman -f Modelfile
ollama run steelman "Write an Ada procedure implementing a producer-consumer pattern with protected objects"
```
### Python (transformers + peft)

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-14B", device_map="auto")
model = PeftModel.from_pretrained(base, "the-clanker-lover/steelman-14b-ada")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-14B")

SYSTEM = "You are an expert Ada 2022 and SPARK programmer."
TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{system}\n\n{instruction}\n\n### Response:\n"
)

prompt = TEMPLATE.format(
    system=SYSTEM,
    instruction="Write an Ada function with SPARK contracts that performs binary search on a sorted array.",
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding: recent transformers versions reject temperature=0.0,
# so do_sample=False is the equivalent deterministic setting.
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
## Training Details
| Setting | Value |
|---|---|
| Base model | Qwen/Qwen2.5-Coder-14B |
| Method | QLoRA (4-bit NF4) via Unsloth + TRL SFTTrainer |
| LoRA rank / alpha | 64 / 128 |
| rsLoRA | Yes |
| LoRA dropout | 0.0 |
| Target modules | q/k/v/o_proj, gate/up/down_proj (all linear layers) |
| Trainable parameters | ~550M (LoRA adapters only) |
| Training data | 6,110 compiler-verified pairs |
| Sequence length | 8,192 tokens |
| Schedule | 3 epochs, cosine learning rate decay, lr 1e-4 |
| Batch size | Effective 32 (4 per device, 8 gradient accumulation) |
| Early stopping | Patience 2 (based on eval loss) |
| Hardware | NVIDIA H100 80GB |
Every training example compiles cleanly with `gnatmake -gnat2022 -gnatwa`. The model never trains on broken code.
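The verification filter can be sketched as follows (a simplification under our own assumptions: the function names and the temporary-directory workflow are illustrative, not the project's published pipeline; only the `gnatmake` flags come from this card):

```python
import pathlib
import subprocess
import tempfile

# Flags stated on this card for dataset verification.
GNAT_FLAGS = ["-gnat2022", "-gnatwa"]

def gnatmake_cmd(source: pathlib.Path) -> list[str]:
    # Build the gnatmake invocation used to check one example.
    return ["gnatmake", *GNAT_FLAGS, str(source)]

def compiles_cleanly(ada_source: str, unit_name: str) -> bool:
    # Hypothetical filter: keep a training pair only if its code
    # compiles with warnings enabled and gnatmake exits with status 0.
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / f"{unit_name}.adb"
        src.write_text(ada_source)
        result = subprocess.run(gnatmake_cmd(src), cwd=tmp, capture_output=True)
        return result.returncode == 0
```

Examples that fail this gate would simply be dropped before training.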
## Training History
| Round | Score | Eval | Dataset | Base Model | Notes |
|---|---|---|---|---|---|
| Baseline | ~35% | v2 (compile) | -- | Qwen2.5-Coder-14B-Instruct | Untuned |
| R1 | 53.8% | v2 (compile) | 1,800 | Qwen2.5-Coder-14B-Instruct | First release |
| R5 (v0.1) | 68.6% | v2 (compile) | 3,430 | Qwen2.5-Coder-14B-Instruct | +task types |
| R6 (v0.2) | 72.0% | v3 (strict compile) | 3,018 | Qwen2.5-Coder-14B-Instruct | Strict-triaged data |
| R7 (v0.3) | 62.4% | v4 (compile + functional) | 6,110 | Qwen2.5-Coder-14B | New base, 2x data, 10 categories |
Note: R7's 62.4% on eval v4 and R6's 72.0% on eval v3 use different scoring methodologies. Eval v4 includes functional correctness scoring (modification tasks must pass test suites) and two new categories (modification, spec_impl). The model improved; the eval got harder.
## Evaluation Methodology

Eval v4 is a 754-prompt benchmark across 10 Ada task categories with three scoring modes:
- Binary (compile-or-not): standard, spec_to_body, error_fix, multi_file, generics, tasking, spark, ada2022
- Tiered (0/0.25/0.5/1.0): modification tasks scored with SWE-bench-style dual-gate tests (new code passes + no regressions)
- Dual-level (compile + test harness): spec_impl tasks scored on both compilation and test suite execution
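One way to read the tiered rubric is sketched below. The exact gate-to-score mapping is our assumption for illustration, not taken from the eval harness; only the score values and the dual-gate idea come from this card:

```python
def tiered_score(compiles: bool, new_tests_pass: bool, no_regressions: bool) -> float:
    """Hypothetical mapping of the 0/0.25/0.5/1.0 rubric: full credit
    requires both gates (new tests pass AND no regressions)."""
    if not compiles:
        return 0.0
    if new_tests_pass and no_regressions:
        return 1.0
    if new_tests_pass or no_regressions:
        return 0.5  # one gate passed
    return 0.25     # compiles, but both gates failed
```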
Compiler: GNAT 13.3.0 (GCC)

Compilation flags:

```shell
gnatmake -gnat2022 -gnatwa -gnata -gnateE -gnateF -gnateV \
  -gnatU -gnatVa -gnatyabehiklprt -gnatwe -cargs -fstack-check
```
Scoring note: Steelman's eval was run via two independent methods (local Ollama Q8_0 and RunPod fp16 transformers) with results within 0.5pp. The 62.4% headline uses the fp16 run. Frontier models were evaluated via OpenRouter API with identical prompts and local compilation scoring.
Decontamination: Multi-stage pipeline following published methodology:
- Exact substring match after whitespace normalization (Lozhkov et al. 2024 / StarCoder2)
- 10-gram overlap (Guo et al. 2024 / DeepSeek-Coder)
- Levenshtein similarity on extracted task content (Riddell et al. 2024 / ACL)
- Instruction template boilerplate stripped before comparison (Lee et al. 2023 / Open-Platypus)
Result: 754 prompts, 0 contaminated against 6,110 training examples.
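The decontamination stages could be sketched roughly as follows (a simplification: `difflib.SequenceMatcher` stands in for true Levenshtein similarity, the 0.8 threshold mirrors the v0.2 setting rather than a published v0.3 value, and template stripping is omitted):

```python
import re
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    # Collapse whitespace and case before any comparison.
    return re.sub(r"\s+", " ", text).strip().lower()

def exact_substring(prompt: str, example: str) -> bool:
    # Stage 1: exact substring match after whitespace normalization.
    p, e = normalize(prompt), normalize(example)
    return p in e or e in p

def ngram_overlap(prompt: str, example: str, n: int = 10) -> bool:
    # Stage 2: any shared 10-gram of tokens flags contamination.
    def grams(t: str) -> set:
        toks = normalize(t).split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    return bool(grams(prompt) & grams(example))

def too_similar(prompt: str, example: str, threshold: float = 0.8) -> bool:
    # Stage 3: sequence similarity as a stand-in for Levenshtein ratio.
    return SequenceMatcher(None, normalize(prompt), normalize(example)).ratio() >= threshold

def contaminated(prompt: str, example: str) -> bool:
    # A prompt is rejected if ANY stage flags it against a training example.
    return (exact_substring(prompt, example)
            or ngram_overlap(prompt, example)
            or too_similar(prompt, example))
```

Each eval prompt would be checked against all 6,110 training examples and dropped if any stage fires.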
Full methodology with citations: METHODOLOGY.md
Eval prompts: eval_v4.json
## Limitations
- Ada/SPARK only -- not a general-purpose coding model
- Algorithmic reasoning is weaker than frontier models. HumanEval-Ada pass@1 is 45.9% vs Opus 4.6's 81.5%. The model excels at Ada-specific patterns (specs, SPARK, error fixing, multi-file) but lags on general CS algorithms.
- SPARK contracts compile but may not prove. `gnatprove` verification is not part of the training loop.
- Synthetically generated training data. All 6,110 examples were produced by language models, not written by Ada developers. Every example is compiler-verified, but there may be non-idiomatic patterns.
- Alpaca template only. This model uses an Alpaca-style instruction template, not ChatML. Using the wrong template will significantly degrade output quality (we measured 8.9% vs 45.9% on HumanEval with the wrong template).
- 14B parameter model. It will miss things a larger model would catch.
## About
This project exists because Ada developers working on safety-critical systems -- avionics, defense, rail, medical devices, space -- couldn't use LLM-assisted tooling. Every major model struggles with Ada's strict type system, SPARK's verification contracts, and the GNAT compiler's error messages.
Steelman is a solo project, built by a self-taught developer using AI assistance at every step -- dataset generation, training pipeline, evaluation harness, and this model card. The entire project was built in about three weeks.
## Related
- GGUF quantization -- Q8_0 for local inference via Ollama or llama.cpp
- Training dataset and eval -- 6,110 compiler-verified pairs + 754-prompt eval with methodology
## Community
This project was shaped by community feedback:
- Fer (Irvise) on forum.ada-lang.io -- recommended the runtime verification flags (`-gnata`, `-gnateE`, `-gnateF`, `-gnatVa`, `-fstack-check`) that became the foundation of strict evaluation. The single most impactful technical input.
- K_Kolomeitsev on r/LocalLLaMA -- identified generics and tasking coverage as priorities, directly shaping the eval categories.
## License
Apache 2.0 (same as base model).
## Citation

```bibtex
@misc{steelman-14b-ada,
  title={Steelman-14B-Ada: QLoRA Fine-Tuning for Ada 2022/SPARK Code Generation},
  author={the-clanker-lover},
  year={2026},
  url={https://huggingface.co/the-clanker-lover/steelman-14b-ada}
}
```