SkillRet-Embedding-8B


This is a sentence-transformers model fine-tuned for AI agent skill retrieval. Given a natural-language user request, the model retrieves relevant agent skills from a large skill library.

The model is fine-tuned from Qwen/Qwen3-Embedding-8B on the SkillRet benchmark training split using contrastive learning (MultipleNegativesRankingLoss).

📄 Technical report: SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents (arXiv:2605.05726)

Usage

Sentence Transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ThakiCloud/SkillRet-Embedding-8B", trust_remote_code=True)

query_prompt = "Instruct: Given a skill search query, retrieve relevant skills that match the query\nQuery: "

queries = [
    query_prompt + "Help me set up a CI/CD pipeline for my Python project"
]
skills = [
    "ci-cd-setup | Configure continuous integration and deployment pipelines ...",
    "python-debugging | Debug Python applications using pdb and logging ...",
]

q_emb = model.encode(queries, normalize_embeddings=True)
s_emb = model.encode(skills, normalize_embeddings=True)

# With L2-normalized embeddings, the dot product equals cosine similarity.
similarities = q_emb @ s_emb.T
print(similarities)
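To turn the similarity matrix into a ranked skill list, you can sort each query's row in descending order. A minimal sketch using NumPy, with a toy similarity matrix standing in for the model output above:

```python
import numpy as np

# Toy similarity matrix standing in for q_emb @ s_emb.T
# (1 query x 2 skills); the values are illustrative only.
similarities = np.array([[0.82, 0.31]])

k = 2
# Negate so argsort yields descending order, then keep the top-k per query.
top_k = np.argsort(-similarities, axis=1)[:, :k]
print(top_k)  # ranked skill indices for each query
```

Each row of top_k indexes back into the skills list, so top_k[0][0] is the best-matching skill for the first query.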

Training Details

  • Base model: Qwen3-Embedding-8B (8B parameters)
  • Training data: SkillRet benchmark training split (127,190 query–skill pairs from 63,259 queries and 10,123 skills)
  • Loss: MultipleNegativesRankingLoss (InfoNCE) with cross-GPU negative sharing
  • Hardware: 4× NVIDIA B200 GPUs (DDP)
  • Effective batch size: 80 (20 per device × 4 GPUs)
  • Max sequence length: 8,192 tokens
  • Learning rate: 2e-5
  • Epochs: 1
  • Precision: BF16
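MultipleNegativesRankingLoss treats every other in-batch skill as a negative for each query: query i's positive is skill i, and the objective is InfoNCE over the batch. A minimal NumPy sketch of that underlying objective (toy embeddings and an illustrative temperature scale; the actual training uses the sentence-transformers implementation with cross-GPU negative sharing):

```python
import numpy as np

def info_nce(q, s, scale=20.0):
    """InfoNCE over in-batch negatives: query i's positive is skill i;
    all other skills in the batch act as negatives."""
    # Cosine similarities (rows assumed L2-normalized), temperature-scaled.
    logits = scale * (q @ s.T)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Loss is the mean negative log-probability of the diagonal (positives).
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
q /= np.linalg.norm(q, axis=1, keepdims=True)
# Positives: perturbed copies of the queries, re-normalized.
s = q + 0.1 * rng.normal(size=(4, 8))
s /= np.linalg.norm(s, axis=1, keepdims=True)

loss = info_nce(q, s)
print(loss)
```

Sharing negatives across GPUs enlarges the effective pool of in-batch negatives beyond the per-device batch, which generally tightens this objective.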

Evaluation Results

Evaluated on the SkillRet benchmark test split (4,997 queries, 6,660 skills).

Metric        @5      @10     @15
NDCG          0.8123  0.8345  0.8418
Recall        0.8558  0.9123  0.9355
Completeness  0.7562  0.8463  0.8841
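Recall@k can be read as the fraction of a query's relevant skills that appear in the top-k retrieved list. A toy sketch (illustrative only; the benchmark's exact scoring, including the Completeness metric, is defined in the paper):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant skills that appear in the top-k ranked list."""
    top_k = set(ranked_ids[:k])
    hits = sum(1 for r in relevant_ids if r in top_k)
    return hits / len(relevant_ids)

# Hypothetical ranking for one query, with two relevant skills.
ranked = ["ci-cd-setup", "python-debugging", "docker-build"]
relevant = {"ci-cd-setup", "docker-build"}
print(recall_at_k(ranked, relevant, k=2))  # 0.5 — one of two relevant skills in the top-2
```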

For comparison with the 0.6B variant, see ThakiCloud/SkillRet-Embedding-0.6B.

Intended Use

This model is designed for retrieving agent skills given natural-language user requests. It is part of the SkillRet benchmark submission for evaluating skill retrieval systems for AI agents.

Limitations

  • Optimized for English-language queries and agent skills.
  • Performance may vary on domains outside the SkillRet benchmark distribution.
  • The model retrieves skills but does not execute them.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 5.4.1
  • Transformers: 5.5.4
  • PyTorch: 2.7.1+cu128

Citation

If you use this model or the SkillRet benchmark, please cite:

@article{cho2026skillret,
  title   = {SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents},
  author  = {Cho, Hongcheol and Kang, Ryangkyung and Kim, Youngeun},
  journal = {arXiv preprint arXiv:2605.05726},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.05726}
}

Paper: https://arxiv.org/abs/2605.05726
