# Steelman-14B-Ada v0.3
A 14B parameter model for Ada 2022 and SPARK code generation. QLoRA fine-tune of Qwen2.5-Coder-14B on 6,110 compiler-verified instruction pairs across 9 training categories, evaluated on a 754-prompt benchmark spanning 10 task categories. Named after the 1978 DoD Steelman requirements that defined the Ada language.
Built by one person with AI assistance.
## Headline Results

Steelman Eval v4: 754 prompts, 10 Ada task categories, strict GNAT compilation + functional scoring.
| Model | Size | Score |
|---|---|---|
| Steelman R7 v0.3 | 14B (QLoRA) | 62.4% |
| GPT-5.4 | -- | 12.9% |
| Claude Opus 4.6 | -- | 12.7% |
| Gemini 3.1 Pro | -- | 5.3%* |
| Grok 4 | -- | 19.2%* |
*Grok 4 and Gemini 3.1 Pro results are partial (537/754 and 547/754 respectively). Final numbers will be updated when complete.
A 14B QLoRA adapter outperforms every frontier model tested on strict Ada code generation, beating the best complete frontier result (GPT-5.4 at 12.9%) by 49.5 percentage points.
## Per-Category Breakdown (Steelman Eval v4)
| Category | Count | Steelman R7 | Opus 4.6 | GPT-5.4 |
|---|---|---|---|---|
| spark | 53 | 79.2% | 9.4% | 15.1% |
| error_fix | 78 | 73.1% | 14.1% | 0.0% |
| tasking | 54 | 70.4% | 7.4% | 29.6% |
| ada2022 | 47 | 70.2% | 8.5% | 8.5% |
| spec_to_body | 88 | 67.0% | 54.5% | 51.1% |
| multi_file | 57 | 59.6% | 0.0% | 0.0% |
| spec_impl | 91 | 59.1% | 3.8% | 4.4% |
| generics | 55 | 54.5% | 0.0% | 1.8% |
| modification | 124 | 53.8% | 5.8% | 4.8% |
| standard | 107 | 53.3% | 12.1% | 12.1% |
Steelman leads in all 10 categories. The largest gaps are on error_fix (+59pp vs Opus), multi_file (+59.6pp vs both), and generics (+54.5pp vs Opus).
## MultiPL-E HumanEval-Ada (157 problems, pass@1)
Standard code generation benchmark translated to Ada. Measures functional correctness: code must compile, run, and produce correct output.
| Model | Pass@1 | Compile Rate |
|---|---|---|
| Claude Opus 4.6 | 81.5% (128/157) | 83.4% |
| Grok 4 | 77.1% (121/157) | 81.5% |
| GPT-5.4 | 70.7% (111/157) | 80.2% |
| Steelman R7 v0.3 | 45.9% (72/157) | 85.4% (134/157) |
| Gemini 3.1 Pro | 35.7% (56/157) | 37.6% |
Steelman has the highest Ada compile rate of any model tested (85.4%). Frontier models win on algorithmic correctness because HumanEval tests general CS algorithms, not Ada-specific skills; our eval (above) tests what matters for Ada development.
Reproducibility note: HumanEval is a code completion benchmark -- it provides a code prefix and expects the model to continue it. This model uses an Alpaca instruction template for inference. When we ran HumanEval through the Alpaca template (wrapping the code prefix in ### Instruction: / ### Response:), the model scored 8.9% because it treated the code as an instruction rather than code to complete. The 45.9% result above uses a raw completion template ({{ .Prompt }}{{ .Response }} with no system prompt and no instruction wrapping). If you are reproducing these results, you must use a raw template for HumanEval, not the Alpaca template used for instruction-following tasks.
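For reference, a minimal Ollama Modelfile for the raw completion template might look like this (a sketch; the GGUF filename matches the one used elsewhere on this card):

```
FROM ./steelman-r7-coder-base-q8_0.gguf
TEMPLATE "{{ .Prompt }}{{ .Response }}"
PARAMETER temperature 0.0
```

Note the absence of a SYSTEM line and of any `### Instruction:` wrapping: the code prefix passes straight through to the model.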
## What Changed in v0.3

| | v0.2 (R6) | v0.3 (R7) |
|---|---|---|
| Base model | Qwen2.5-Coder-14B-Instruct | Qwen2.5-Coder-14B (base) |
| Training data | 3,018 pairs | 6,110 pairs (all compile-verified) |
| LoRA rank / alpha | 32 / 64 | 64 / 128 (rsLoRA) |
| Template | ChatML | Alpaca |
| Eval | v3 (500 prompts, compile-only) | v4 (754 prompts, compile + functional scoring) |
| Task categories | 8 | 10 (+modification, +spec_impl) |
| Decontamination | Cosine similarity 0.80 | Multi-stage: exact + 10-gram + Levenshtein |
Why the base variant? With 6,110 training examples, the base model learns more effectively than starting from an instruction-tuned checkpoint. Qwen2.5-Coder has Ada in its 92-language training set, making it the strongest available foundation for Ada fine-tuning.
Why the eval numbers look lower than v0.2: Eval v4 is substantially harder than eval v3. It uses -gnatwe (warnings as errors) plus functional correctness scoring (modification tasks require passing test suites, not just compiling). The model is better; the bar is higher.
## What This Model Does
- Generates complete Ada 2022 source files from natural language instructions
- Writes SPARK contracts: preconditions, postconditions, loop invariants, type invariants, ghost code
- Generates package bodies from `.ads` specifications (spec-to-body)
- Modifies existing Ada code: adds features, fixes bugs, refactors (new in R7)
- Handles multi-file Ada projects using `[FILE: name.ext]`/`[/FILE]` delimiters
- Fixes compilation errors given GNAT diagnostics (73.1% accuracy)
- Uses modern Ada 2022 features: expression functions, aspects, quantified expressions, delta aggregates, declare expressions, target name `@`
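As an illustration of the multi-file delimiter convention, a response covering a spec and its body might look like this (the package contents are illustrative, not model output):

```
[FILE: counters.ads]
package Counters is
   procedure Increment (Value : in out Natural)
     with Pre => Value < Natural'Last;
end Counters;
[/FILE]
[FILE: counters.adb]
package body Counters is
   procedure Increment (Value : in out Natural) is
   begin
      Value := Value + 1;
   end Increment;
end Counters;
[/FILE]
```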
## What This Model Does NOT Do
- Not optimized for code completion or fill-in-the-middle (FIM) -- it generates complete files from instructions
- Not a chatbot -- best results come from direct code generation prompts with a system prompt
- SPARK contracts compile but may not prove with `gnatprove` -- formal verification is not part of the training loop
- Algorithmic correctness is weaker than frontier models (45.9% vs 81.5% on HumanEval) -- the model excels at Ada-specific patterns, not general CS puzzles
## Usage

### Ollama (recommended)

Download the GGUF from steelman-14b-ada-GGUF and create a Modelfile:
```
FROM ./steelman-r7-coder-base-q8_0.gguf
# Multi-line templates need triple quotes; {{ .System }} injects the
# SYSTEM prompt into the instruction block, matching the Python template below.
TEMPLATE """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{{ .System }}

{{ .Prompt }}

### Response:
{{ .Response }}"""
SYSTEM "You are an expert Ada 2022 and SPARK programmer."
PARAMETER stop "### Instruction:"
PARAMETER temperature 0.0
PARAMETER num_ctx 32768
```

```shell
ollama create steelman -f Modelfile
ollama run steelman "Write an Ada procedure implementing a producer-consumer pattern with protected objects"
```
### Python (transformers + peft)

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-14B", device_map="auto")
model = PeftModel.from_pretrained(base, "the-clanker-lover/steelman-14b-ada")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-14B")

SYSTEM = "You are an expert Ada 2022 and SPARK programmer."
TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{system}\n\n{instruction}\n\n### Response:\n"
)

prompt = TEMPLATE.format(
    system=SYSTEM,
    instruction="Write an Ada function with SPARK contracts that performs binary search on a sorted array.",
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding: recent transformers versions reject temperature=0.0,
# so do_sample=False is the equivalent deterministic setting.
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
## Training Details
| Setting | Value |
|---|---|
| Base model | Qwen/Qwen2.5-Coder-14B |
| Method | QLoRA (4-bit NF4) via Unsloth + TRL SFTTrainer |
| LoRA rank / alpha | 64 / 128 |
| rsLoRA | Yes |
| LoRA dropout | 0.0 |
| Target modules | q/k/v/o_proj, gate/up/down_proj (all linear layers) |
| Trainable parameters | ~550M (LoRA adapters only) |
| Training data | 6,110 compiler-verified pairs |
| Sequence length | 8,192 tokens |
| Schedule | 3 epochs, cosine learning rate decay, lr 1e-4 |
| Batch size | Effective 32 (4 per device, 8 gradient accumulation) |
| Early stopping | Patience 2 (based on eval loss) |
| Hardware | NVIDIA H100 80GB |
Every training example compiles cleanly with `gnatmake -gnat2022 -gnatwa`. The model never trains on broken code.
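The verification filter can be sketched as follows (a simplification under our own assumptions: the function names and the temporary-directory workflow are illustrative, not the project's published pipeline; only the `gnatmake` flags come from this card):

```python
import pathlib
import subprocess
import tempfile

# Flags stated on this card for dataset verification.
GNAT_FLAGS = ["-gnat2022", "-gnatwa"]

def gnatmake_cmd(source: pathlib.Path) -> list[str]:
    # Build the gnatmake invocation used to check one example.
    return ["gnatmake", *GNAT_FLAGS, str(source)]

def compiles_cleanly(ada_source: str, unit_name: str) -> bool:
    # Hypothetical filter: keep a training pair only if its code
    # compiles with warnings enabled and gnatmake exits with status 0.
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / f"{unit_name}.adb"
        src.write_text(ada_source)
        result = subprocess.run(gnatmake_cmd(src), cwd=tmp, capture_output=True)
        return result.returncode == 0
```

Examples that fail this gate would simply be dropped before training.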
## Training History
| Round | Score | Eval | Dataset | Base Model | Notes |
|---|---|---|---|---|---|
| Baseline | ~35% | v2 (compile) | -- | Qwen2.5-Coder-14B-Instruct | Untuned |
| R1 | 53.8% | v2 (compile) | 1,800 | Qwen2.5-Coder-14B-Instruct | First release |
| R5 (v0.1) | 68.6% | v2 (compile) | 3,430 | Qwen2.5-Coder-14B-Instruct | +task types |
| R6 (v0.2) | 72.0% | v3 (strict compile) | 3,018 | Qwen2.5-Coder-14B-Instruct | Strict-triaged data |
| R7 (v0.3) | 62.4% | v4 (compile + functional) | 6,110 | Qwen2.5-Coder-14B | New base, 2x data, 10 categories |
Note: R7's 62.4% on eval v4 and R6's 72.0% on eval v3 use different scoring methodologies. Eval v4 includes functional correctness scoring (modification tasks must pass test suites) and two new categories (modification, spec_impl). The model improved; the eval got harder.
## Evaluation Methodology

Eval v4 is a 754-prompt benchmark across 10 Ada task categories with three scoring modes:
- Binary (compile-or-not): standard, spec_to_body, error_fix, multi_file, generics, tasking, spark, ada2022
- Tiered (0/0.25/0.5/1.0): modification tasks scored with SWE-bench-style dual-gate tests (new code passes + no regressions)
- Dual-level (compile + test harness): spec_impl tasks scored on both compilation and test suite execution
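One way to read the tiered rubric is sketched below. The exact gate-to-score mapping is our assumption for illustration, not taken from the eval harness; only the score values and the dual-gate idea come from this card:

```python
def tiered_score(compiles: bool, new_tests_pass: bool, no_regressions: bool) -> float:
    """Hypothetical mapping of the 0/0.25/0.5/1.0 rubric: full credit
    requires both gates (new tests pass AND no regressions)."""
    if not compiles:
        return 0.0
    if new_tests_pass and no_regressions:
        return 1.0
    if new_tests_pass or no_regressions:
        return 0.5  # one gate passed
    return 0.25     # compiles, but both gates failed
```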
Compiler: GNAT 13.3.0 (GCC)

Compilation flags:

```shell
gnatmake -gnat2022 -gnatwa -gnata -gnateE -gnateF -gnateV \
  -gnatU -gnatVa -gnatyabehiklprt -gnatwe -cargs -fstack-check
```
Scoring note: Steelman's eval was run via two independent methods (local Ollama Q8_0 and RunPod fp16 transformers) with results within 0.5pp. The 62.4% headline uses the fp16 run. Frontier models were evaluated via OpenRouter API with identical prompts and local compilation scoring.
Decontamination: Multi-stage pipeline following published methodology:
- Exact substring match after whitespace normalization (Lozhkov et al. 2024 / StarCoder2)
- 10-gram overlap (Guo et al. 2024 / DeepSeek-Coder)
- Levenshtein similarity on extracted task content (Riddell et al. 2024 / ACL)
- Instruction template boilerplate stripped before comparison (Lee et al. 2023 / Open-Platypus)
Result: 754 prompts, 0 contaminated against 6,110 training examples.
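The decontamination stages could be sketched roughly as follows (a simplification: `difflib.SequenceMatcher` stands in for true Levenshtein similarity, the 0.8 threshold mirrors the v0.2 setting rather than a published v0.3 value, and template stripping is omitted):

```python
import re
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    # Collapse whitespace and case before any comparison.
    return re.sub(r"\s+", " ", text).strip().lower()

def exact_substring(prompt: str, example: str) -> bool:
    # Stage 1: exact substring match after whitespace normalization.
    p, e = normalize(prompt), normalize(example)
    return p in e or e in p

def ngram_overlap(prompt: str, example: str, n: int = 10) -> bool:
    # Stage 2: any shared 10-gram of tokens flags contamination.
    def grams(t: str) -> set:
        toks = normalize(t).split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    return bool(grams(prompt) & grams(example))

def too_similar(prompt: str, example: str, threshold: float = 0.8) -> bool:
    # Stage 3: sequence similarity as a stand-in for Levenshtein ratio.
    return SequenceMatcher(None, normalize(prompt), normalize(example)).ratio() >= threshold

def contaminated(prompt: str, example: str) -> bool:
    # A prompt is rejected if ANY stage flags it against a training example.
    return (exact_substring(prompt, example)
            or ngram_overlap(prompt, example)
            or too_similar(prompt, example))
```

Each eval prompt would be checked against all 6,110 training examples and dropped if any stage fires.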
Full methodology with citations: METHODOLOGY.md
Eval prompts: eval_v4.json
## Limitations
- Ada/SPARK only -- not a general-purpose coding model
- Algorithmic reasoning is weaker than frontier models. HumanEval-Ada pass@1 is 45.9% vs Opus 4.6's 81.5%. The model excels at Ada-specific patterns (specs, SPARK, error fixing, multi-file) but lags on general CS algorithms.
- SPARK contracts compile but may not prove. `gnatprove` verification is not part of the training loop.
- Synthetically generated training data. All 6,110 examples were produced by language models, not written by Ada developers. Every example is compiler-verified, but there may be non-idiomatic patterns.
- Alpaca template only. This model uses an Alpaca-style instruction template, not ChatML. Using the wrong template will significantly degrade output quality (we measured 8.9% vs 45.9% on HumanEval with the wrong template).
- 14B parameter model. It will miss things a larger model would catch.
## About
This project exists because Ada developers working on safety-critical systems -- avionics, defense, rail, medical devices, space -- couldn't use LLM-assisted tooling. Every major model struggles with Ada's strict type system, SPARK's verification contracts, and the GNAT compiler's error messages.
Steelman is a solo project, built by a self-taught developer using AI assistance at every step -- dataset generation, training pipeline, evaluation harness, and this model card. The entire project was built in about three weeks.
## Related
- GGUF quantization -- Q8_0 for local inference via Ollama or llama.cpp
- Training dataset and eval -- 6,110 compiler-verified pairs + 754-prompt eval with methodology
## Community
This project was shaped by community feedback:
- Fer (Irvise) on forum.ada-lang.io -- recommended the runtime verification flags (`-gnata`, `-gnateE`, `-gnateF`, `-gnatVa`, `-fstack-check`) that became the foundation of strict evaluation. The single most impactful technical input.
- K_Kolomeitsev on r/LocalLLaMA -- identified generics and tasking coverage as priorities, directly shaping the eval categories.
## License
Apache 2.0 (same as base model).
## Citation

```bibtex
@misc{steelman-14b-ada,
  title={Steelman-14B-Ada: QLoRA Fine-Tuning for Ada 2022/SPARK Code Generation},
  author={the-clanker-lover},
  year={2026},
  url={https://huggingface.co/the-clanker-lover/steelman-14b-ada}
}
```