Meno-Lite-0.1

A 7B language model that masters the art of reading, not memorizing.

Meno-Lite-0.1 is a 7-billion-parameter large language model purpose-built for Russian-language RAG pipelines, document question answering, information extraction, and summarization. Rather than trying to cram encyclopedic world knowledge into a modest parameter budget, Meno-Lite doubles down on language skills — the ability to parse, transform, and reason over text supplied in context. The result is a compact model that punches well above its weight class on tasks where the answer lies in the input, not in the model's memory.

Key idea. We hypothesize that the capabilities of LLMs decompose into two largely independent axes: world knowledge (facts, dates, entities) and language knowledge (comprehension, extraction, inference, generation). World knowledge scales roughly linearly with parameter count, but language knowledge reaches a surprisingly high plateau even in 7B-class models — provided it is deliberately cultivated. Meno-Lite-0.1 is an empirical test of this hypothesis: by investing training compute exclusively into language skills, we obtain a model that rivals or surpasses much larger systems on context-grounded tasks while remaining deployable on a single consumer GPU.

Model Details

Model Description

Meno-Lite-0.1 is derived from RuadaptQwen2.5-7B-Lite-Beta through a carefully designed two-stage training pipeline (continued pretraining → supervised fine-tuning) that sharpens the model's ability to work with documents rather than from parametric memory. The full lineage is:

Qwen/Qwen2.5-7B-Instruct
  └─► t-tech/T-lite-it-1.0
        └─► RefalMachine/RuadaptQwen2.5-7B-Lite-Beta
              └─► bond005/Meno-Lite-0.1   ◄── you are here

Each ancestor added a layer of Russian-language adaptation; Meno-Lite-0.1 adds a final layer of skill-oriented training focused on information extraction, entity normalization, multi-hop reasoning over long contexts, and instruction following for RAG scenarios. Although the model is primarily oriented toward Russian, it retains strong English performance thanks to bilingual pretraining data (sampled FineWeb-Edu) and English-language SFT examples (MultiHopRAG, MTRAGEval).

  • Developed by: Ivan Bondarenko and colleagues, Novosibirsk State University (NSU)
  • Model type: Causal decoder-only transformer (Qwen2.5 architecture)
  • Parameters: ~7B
  • Language(s): Russian (primary), English (retained)
  • License: Apache 2.0
  • Base model: RefalMachine/RuadaptQwen2.5-7B-Lite-Beta

Model Sources

Motivation: Language Knowledge vs. World Knowledge

Modern LLMs are often evaluated — and marketed — on their ability to recall factual trivia. Yet the vast majority of production deployments do not rely on parametric recall at all: RAG systems, function-calling agents, document assistants, and code-generation copilots all receive the necessary information in context. What these applications demand is not a bigger encyclopedia but a sharper reader.

We formalize this intuition as a two-axis framework:

| Axis | What it captures | Scaling behavior | Examples |
|---|---|---|---|
| World knowledge | Facts, entities, relations memorized during pretraining | Scales roughly linearly with parameters | CheGeKa, MaMuRAMu, ruMMLU |
| Language knowledge | Comprehension, extraction, transformation, reasoning over supplied text | Reaches a high plateau at 7B and above | MultiQ, ruTiE, USE, RAG QA, summarization |

Meno-Lite-0.1 deliberately sacrifices world-knowledge breadth (which is inherently limited at 7B) in favor of maximizing language-knowledge depth. The training data and SFT instructions were curated to reinforce how the model processes text, not what it knows about the world. As the benchmarks below demonstrate, this trade-off pays off handsomely for context-grounded tasks.

Uses

Direct Use

  • RAG pipelines: Meno-Lite-0.1 excels at answering questions when relevant passages are retrieved and injected into the prompt. Its training on multi-hop QA datasets (MultiHopRAG, MTRAGEval, LongContextMultiQ) makes it particularly adept at synthesizing information scattered across multiple chunks.
  • Document QA and summarization: Legal contracts, technical manuals, scientific papers — any scenario where the model must read carefully and respond precisely.
  • Information extraction and entity normalization: SFT on NEREL-based instructions and GPT-4o-mini–generated entity definitions equips the model with robust NER and normalization capabilities.
  • Function calling and agentic workflows: Tasks where the required knowledge arrives via tool outputs rather than parametric memory.

Downstream Use

Meno-Lite-0.1 can serve as a strong starting point for further fine-tuning on domain-specific corpora (e.g., medical, financial, or governmental documents) where context-grounded accuracy is paramount.

Out-of-Scope Use

  • Open-domain factual QA without context: The model was not optimized for parametric recall; do not expect it to outperform larger models on trivia-style benchmarks.
  • Safety-critical applications without human oversight: Like all LLMs, Meno-Lite-0.1 can hallucinate, especially when relevant context is absent.
  • Languages other than Russian and English: While the Qwen2.5 backbone supports many languages, Meno-Lite-0.1 has been validated only on Russian and English.

How to Get Started with the Model

import json
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bond005/meno-lite-0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

SYSTEM_PROMPT = "Вы — полезный ассистент. Отвечайте на вопросы, опираясь на предоставленный контекст."

CHUNKS = [
    "Новосибирский государственный университет (НГУ) был основан в 1959 году в Академгородке.",  # 0
    "12 сентября 1959 года был успешно осуществлён запуск автоматической межпланетной станции «Луна-2». " \
    "14 сентября 1959 года станция «Луна-2» впервые в мире достигла поверхности Луны в районе Моря Дождей " \
    "вблизи кратеров Аристилл, Архимед и Автолик.",  # 1
    "Московский государственный университет имени М. В. Ломоносова (МГУ) был основан в 1755 году. " \
    "Изначально университет располагался в здании Главной аптеки (бывший Земский приказ) на месте " \
    "Государственного исторического музея на Красной площади.",  # 2
]
CONTEXT = "\n\n".join([f"Контекст {idx + 1}:\n```text\n{val}\n```" for idx, val in enumerate(CHUNKS)]) + "\n\nВопрос: "

USER_QUESTION = "Какой университет был основан в том же году, когда впервые в истории рукотворный аппарат достиг поверхности Луны?"
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": CONTEXT + USER_QUESTION + "\n"}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(f"Вопрос: {USER_QUESTION}\nОтвет модели: {response}\n")

ANOTHER_USER_QUESTION = "Через сколько лет после университета в Москве был основан университет в Новосибирске?"
messages2 = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": CONTEXT + ANOTHER_USER_QUESTION + "\n"}
]

text2 = tokenizer.apply_chat_template(messages2, tokenize=False, add_generation_prompt=True)
inputs2 = tokenizer([text2], return_tensors="pt").to(model.device)

outputs2 = model.generate(**inputs2, max_new_tokens=256)
response2 = tokenizer.decode(outputs2[0][inputs2["input_ids"].shape[-1]:], skip_special_tokens=True)
print(f"Вопрос: {ANOTHER_USER_QUESTION}\nОтвет модели: {response2}\n")

FEW_SHOTS_FOR_NER = [
    {
        "role": "system",
        "content": "Вы - эксперт в области анализа текстов и извлечения семантической информации из них."
    },
    {
        "role": "user",
        "content": "Выделите именованные сущности классов ORGANIZATION, PERSON и LOCATION из входного текста и запишите ответ в JSON-формате." \
                   "\n\nВходной текст:\n\n```text\nНаучный сотрудник лаборатории прикладных цифровых технологий Международного " \
                   "научно-образовательного математического центра НГУ Иван Бондаренко рассказал о грантовой программе и о том, как " \
                   "его проект RAGU попал в число победителей.\n```\n"
    },
    {
        "role": "assistant",
        "content": "{\"ORGANIZATION\": [\"лаборатория прикладных цифровых технологий Международного научно-образовательного математического центра НГУ\", " \
                   "\"Международный научно-образовательный математический центр НГУ\", \"НГУ\"], \"PERSON\": [\"Иван Бондаренко\"], \"LOCATION\": []}"
    },
    {
        "role": "user",
        "content": "Выделите именованные сущности классов ORGANIZATION, PERSON и LOCATION из входного текста и запишите ответ в JSON-формате." \
                   "\n\nВходной текст:\n\n```text\nНациональный исследовательский университет «Высшая школа экономики» (НИУ ВШЭ) представил результаты " \
                   "15-го мониторинга качества приема на бюджетные и платные места российских вузов в 2025 году. В группе лидеров " \
                   "10 московских университетов, три питерских и по одному представителю из таких регионов, как Татарстан (Иннополис), " \
                   "Нижний Новгород и Новосибирск (НГУ).\n```\n"
    },
    {
        "role": "assistant",
        "content": "{\"ORGANIZATION\": [\"Национальный исследовательский университет «Высшая школа экономики»\", \"НИУ ВШЭ\", \"НГУ\"], " \
                   "\"PERSON\": [], \"LOCATION\": [\"московский\", \"питерский\", \"Татарстан\", \"Иннополис\", \"Нижний Новгород\", " \
                   "\"Новосибирск\"]}"
    },
    {
        "role": "user",
        "content": "Выделите именованные сущности классов ORGANIZATION, PERSON и LOCATION из входного текста и запишите ответ в JSON-формате." \
                   "\n\nВходной текст:\n\n```text\nПочему китайская ИИ-модель DeepSeek гораздо эффективнее и дешевле западных аналогов?\n```\n"
    },
    {
        "role": "assistant",
        "content": "{\"ORGANIZATION\": [], \"PERSON\": [], \"LOCATION\": [\"китайская\", \"западный\"]}"
    }
]

INPUT_TEXT_FOR_NER = "Станислав Владимирович Дробышевский – российский антрополог, кандидат биологических наук, доцент кафедры антропологии " \
                     "биологического факультета МГУ им. М.В. Ломоносова, научный редактор портала “Антропогенез.ру” и, без сомнения, " \
                     "одна из самых ярких и узнаваемых фигур в российской науке."

# Wrap the input text in the same instruction format as the few-shot examples
ner_request = (
    "Выделите именованные сущности классов ORGANIZATION, PERSON и LOCATION из входного текста и запишите ответ в JSON-формате."
    "\n\nВходной текст:\n\n```text\n" + INPUT_TEXT_FOR_NER + "\n```\n"
)
text3 = tokenizer.apply_chat_template(
    FEW_SHOTS_FOR_NER + [{"role": "user", "content": ner_request}],
    tokenize=False, add_generation_prompt=True
)
inputs3 = tokenizer([text3], return_tensors="pt").to(model.device)

outputs3 = model.generate(**inputs3, max_new_tokens=256)
response3 = json.loads(tokenizer.decode(outputs3[0][inputs3["input_ids"].shape[-1]:], skip_special_tokens=True))
print(f"Входной текст: {INPUT_TEXT_FOR_NER}\nРаспознанные сущности:\n{json.dumps(response3, ensure_ascii=False, indent=4)}\n")

As a result, you will see text similar to the following:

Вопрос: Какой университет был основан в том же году, когда впервые в истории рукотворный аппарат достиг поверхности Луны?
Ответ модели: Новосибирский государственный университет (НГУ) был основан в том же году, когда впервые в истории рукотворный аппарат достиг поверхности Луны.

Вопрос: Через сколько лет после университета в Москве был основан университет в Новосибирске?
Ответ модели: Университет в Новосибирске был основан через 204 года после Московского государственного университета.

Входной текст: Станислав Владимирович Дробышевский – российский антрополог, кандидат биологических наук, доцент кафедры антропологии биологического факультета МГУ им. М.В. Ломоносова, научный редактор портала “Антропогенез.ру” и, без сомнения, одна из самых ярких и узнаваемых фигур в российской науке.
Распознанные сущности:
{
    "ORGANIZATION": [
        "биологический факультет МГУ им. М.В. Ломоносова",
        "МГУ им. М.В. Ломоносова"
    ],
    "PERSON": [
        "Станислав Владимирович Дробышевский"
    ],
    "LOCATION": [
        "российская"
    ]
}

Using vLLM for high-throughput serving:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "bond005/meno-lite-0.1"

tok = AutoTokenizer.from_pretrained(model_name)
llm = LLM(
    model=model_name,
    dtype="bfloat16",
    max_model_len=32768,
    gpu_memory_utilization=0.85
)
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

messages = [
    {
        "role": "system",
        "content": "Вы — Менон, разработанный в Новосибирском государственном университете. Вы — полезный помощник."
    },
    {
        "role": "user",
        "content": "Привет! Расскажи о себе."
    }
]
input_text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([input_text], sampling_params)
print(outputs[0].outputs[0].text)

As a result, you will see text similar to the following:

Привет! Меня зовут Менон, и я — виртуальный помощник, созданный в Новосибирском государственном университете. Я здесь, чтобы помочь вам с различными вопросами и задачами.

Tokenizer Efficiency

An often-overlooked determinant of real-world throughput is tokenizer efficiency: the more characters each token covers, the fewer autoregressive steps are needed to generate text of a given length. Meno-Lite-0.1 inherits the extended tokenizer from RuadaptQwen2.5-7B-Lite-Beta, which dramatically improves Russian-language efficiency compared to the original Qwen2.5 vocabulary.

| Model | Chars/token (RU) | Chars/token (EN) |
|---|---|---|
| Meno-Lite-0.1 | 3.77 | 4.13 |
| RuadaptQwen2.5-7B-Lite-Beta | 3.77 | 4.13 |
| AvitoTech/avibe (8B) | 3.79 | 4.06 |
| t-tech/T-lite-it-2.1 (7B) | 3.74 | 4.14 |
| t-tech/T-lite-it-1.0 (7B) | 2.57 | 4.14 |
| Qwen/Qwen2.5-7B-Instruct | 2.57 | 4.14 |
| GigaChat3-10B-A1.8B | 3.74 | 3.99 |

Meno-Lite-0.1 achieves 3.77 characters per token on Russian text — a 47% improvement over the original Qwen2.5 tokenizer (2.57 chars/token). This translates directly into faster inference and lower serving costs for Russian-language workloads, while English efficiency remains on par with the best models in the class.
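The chars-per-token figures above can be reproduced with a small helper. The sketch below is a simplified illustration: the tokenizer is passed in as a plain callable (with transformers you would pass something like `tokenizer.encode`), and the whitespace splitter plus the two sample strings are toy stand-ins, not the actual measurement corpus.

```python
def chars_per_token(texts, encode):
    """Average number of characters covered by each token."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    return total_chars / total_tokens

# Toy stand-in: whitespace "tokenizer" over two sample strings.
texts = [
    "Новосибирский государственный университет",
    "основан в 1959 году",
]
print(round(chars_per_token(texts, str.split), 2))  # → 8.57
```

With a real subword tokenizer the ratio is much lower (a few characters per token), and the same function lets you compare vocabularies on an identical text sample.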

Evaluation

MERA Benchmark

MERA is the most comprehensive Russian-language benchmark for evaluating LLMs on "strong AI" tasks. It comprises 15 scored tasks (with closed test sets) spanning world knowledge, reasoning, logic, mathematics, coding, and language understanding. The overall score is the mean across tasks (for tasks with multiple metrics, those metrics are averaged first).
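A minimal sketch of that aggregation rule; the task names and scores below are illustrative placeholders, not actual leaderboard data:

```python
def mera_overall(task_scores):
    """Macro-average: average each task's metrics first, then average across tasks."""
    per_task = [sum(metrics) / len(metrics) for metrics in task_scores.values()]
    return sum(per_task) / len(per_task)

# Illustrative numbers only.
scores = {
    "MultiQ": [0.536, 0.403],  # two metrics, averaged first
    "PARus": [0.818],
    "USE": [0.240],
}
print(round(mera_overall(scores), 3))  # → 0.509
```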

Selected models from the MERA leaderboard (sorted by score):

| # | Model | Size | MERA Score |
|---|---|---|---|
| 1 | Human Benchmark | — | 0.852 |
| 3 | BerryLM-MT | 20B | 0.745 |
| 11 | Cotype Pro 2.5 | 32.5B | 0.671 |
| 14 | T-pro-it-2.0 | 32.8B | 0.660 |
| 22 | A-vibe | 8B | 0.618 |
| 25 | RuadaptQwen-32B-instruct | 32B | 0.615 |
| 27 | Qwen2.5-32B-Instruct | 32B | 0.603 |
| 29 | Qwen2.5-72B-Instruct | 72.7B | 0.601 |
| 30 | Meta-Llama-3.1-405B-Instruct | 405B | 0.590 |
| 36 | Meno-Lite-0.1 | 7B | 0.555 |
| 37 | Llama-3.3-70B-Instruct | 70B | 0.555 |
| 38 | Meta-Llama-3.1-70B-Instruct | 70.6B | 0.554 |
| 39 | T-lite-it-1.0 | 7B | 0.552 |
| 44 | RuadaptQwen2.5-7B-Lite-v1 | 7B | 0.536 |
| 49 | GigaChat3-10B-A1.8B | 10B | 0.518 |
| 56 | Qwen2.5-7B-Instruct | 7B | 0.482 |

Key observations:

  • Meno-Lite-0.1 (0.555) matches or exceeds 70B-class models such as Llama-3.3-70B-Instruct (0.555) and Meta-Llama-3.1-70B-Instruct (0.554), despite being 10× smaller.
  • It surpasses its direct ancestor T-lite-it-1.0 (0.552) and significantly outperforms both the base Qwen2.5-7B-Instruct (0.482) and RuadaptQwen2.5-7B-Lite-v1 (0.536).
  • Among 7B-class models, Meno-Lite-0.1 achieves the highest MERA score, demonstrating that targeted skill training can close the gap with much larger architectures.

Task-level breakdown (Meno-Lite-0.1 vs. key comparisons)

| Task | Meno-Lite-0.1 (7B) | T-lite-it-1.0 (7B) | Qwen2.5-7B-Instruct (7B) | A-vibe (8B) | Llama-3.3-70B (70B) |
|---|---|---|---|---|---|
| RWSD | 0.569 | 0.535 | 0.515 | 0.565 | 0.600 |
| PARus | 0.818 | 0.894 | 0.848 | 0.910 | 0.914 |
| RCB | 0.541/0.458 | 0.571/0.533 | 0.562/0.493 | 0.582/0.547 | 0.575/0.380 |
| MultiQ | 0.536/0.403 | 0.523/0.398 | 0.425/0.296 | 0.539/0.410 | 0.573/0.418 |
| ruWorldTree | 0.949/0.760 | 0.964/0.964 | 0.939/0.939 | 0.968/0.968 | 0.954/0.769 |
| ruOpenBookQA | 0.880/0.705 | 0.905/0.905 | 0.845/0.845 | 0.888/0.887 | 0.910/0.735 |
| CheGeKa | 0.346/0.293 | 0.502/0.413 | 0.077/0.048 | 0.168/0.120 | 0.339/0.276 |
| ruTiE | 0.794 | 0.786 | 0.777 | 0.811 | 0.824 |
| USE | 0.240 | 0.147 | 0.219 | 0.371 | 0.298 |
| MathLogicQA | 0.666 | 0.662 | 0.467 | 0.661 | 0.566 |
| ruMultiAr | 0.347 | 0.346 | 0.307 | 0.391 | 0.340 |
| LCS | 0.186 | 0.144 | 0.114 | 0.172 | 0.168 |
| ruModAr | 0.497 | 0.493 | 0.473 | 0.929 | 0.570 |
| MaMuRAMu | 0.749 | 0.775 | 0.711 | 0.761 | 0.802 |
| ruCodeEval | 0.377/0.569/0.622 | 0.082/0.168/0.226 | 0.025/0.071/0.098 | 0.545/0.703/0.732 | 0.139/0.280/0.396 |

Notable strengths of Meno-Lite-0.1:

  • ruCodeEval: A dramatic jump over the ancestor T-lite-it-1.0 (0.082 → 0.377 pass@1), exceeding even the 70B Llama-3.3 (0.139). This suggests that improved language skills transfer to code generation ability.
  • MathLogicQA (0.666): Best among all 7B models and ahead of the 70B Llama-3.3 (0.566), reflecting strong verbal reasoning.
  • MultiQ (0.536/0.403): The multi-hop QA task — central to RAG — shows clear gains over both the base Qwen2.5-7B (0.425/0.296) and T-lite-it-1.0 (0.523/0.398).
  • CheGeKa (0.346/0.293): While this is a world-knowledge task, Meno-Lite still outperforms Qwen2.5-7B (0.077/0.048) by a large margin, suggesting that even factual recall benefits from better language comprehension.

LIBRA Benchmark (Long-Context Understanding)

LIBRA (Long Input Benchmark for Russian Analysis) evaluates models on 21 tasks across four complexity groups, with context lengths from 4K to 128K tokens. We evaluated Meno-Lite-0.1 alongside 7B–14B peers on all four groups.

Simple Information Retrieval

The Passkey and PasskeyWithLibrusec tasks measure a model's ability to locate a short code hidden inside a long distractor text — a prerequisite for any context-grounded application.

| Model | Size | Task | 4K | 8K | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|---|---|---|
| Meno-Lite-0.1 | 7B | Passkey | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.98 |
| Meno-Lite-0.1 | 7B | PasskeyLibrusec | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.95 |
| RuadaptQwen2.5-7B-Lite-Beta | 7B | Passkey | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.98 |
| RuadaptQwen2.5-7B-Lite-Beta | 7B | PasskeyLibrusec | 1.00 | 0.99 | 1.00 | 1.00 | 1.00 | 0.95 |
| T-lite-it-2.1 | 7B | Passkey | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.90 |
| T-lite-it-2.1 | 7B | PasskeyLibrusec | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.93 |
| AvitoTech/avibe | 8B | Passkey | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.90 |
| AvitoTech/avibe | 8B | PasskeyLibrusec | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.94 |
| T-lite-it-1.0 | 7B | Passkey | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.58 |
| T-lite-it-1.0 | 7B | PasskeyLibrusec | 1.00 | 1.00 | 1.00 | 1.00 | 0.86 | 0.45 |
| Qwen2.5-7B-Instruct | 7B | Passkey | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.58 |
| Qwen2.5-7B-Instruct | 7B | PasskeyLibrusec | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.66 |
| Qwen2.5-14B-Instruct | 14B | Passkey | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.58 |
| Qwen2.5-14B-Instruct | 14B | PasskeyLibrusec | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 0.63 |

All models achieve perfect scores up to 32K, and near-perfect scores up to 64K. The differentiation happens at 128K, where Meno-Lite-0.1 and its parent RuadaptQwen2.5-7B-Lite-Beta tie at the top (both 0.98/0.95), ahead of all other compared models. Notably, both substantially outperform the original Qwen2.5-7B-Instruct (0.58/0.66) and even the twice-larger Qwen2.5-14B-Instruct (0.58/0.63) at this extreme length — a direct benefit of the Ruadapt tokenizer and continued-pretraining pipeline.
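To make the task concrete, here is a toy sketch of how a Passkey-style probe can be constructed. The real LIBRA tasks use their own filler texts and prompt formats; the function name and the Russian strings below are illustrative assumptions.

```python
def make_passkey_example(passkey, filler, n_fill, depth):
    """Hide a passkey sentence at a relative depth inside repeated filler text."""
    fillers = [filler] * n_fill
    pos = int(depth * n_fill)  # 0.0 = start of the context, 1.0 = end
    needle = f"Секретный код: {passkey}."
    context = " ".join(fillers[:pos] + [needle] + fillers[pos:])
    question = "Какой секретный код упоминается в тексте?"
    return context, question

context, question = make_passkey_example(
    "7491", "Трава зеленеет, солнышко блестит.", 200, 0.5
)
```

Varying `n_fill` controls the context length and `depth` controls where the needle sits, which is how degradation curves like the table above are swept.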

Multi-hop Question Answering

Multi-hop QA is the task group most directly relevant to RAG, as it requires the model to locate and combine evidence scattered across a long context. Below we show per-length scores to reveal how each model degrades as context grows.

ruBABILongQA1 (single supporting fact):

| Model | Size | 4K | 8K | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|---|---|
| Meno-Lite-0.1 | 7B | 0.72 | 0.65 | 0.67 | 0.61 | 0.51 | 0.36 |
| T-lite-it-1.0 | 7B | 0.74 | 0.71 | 0.71 | 0.64 | 0.52 | 0.34 |
| T-lite-it-2.1 | 7B | 0.77 | 0.76 | 0.63 | 0.58 | 0.52 | 0.44 |
| AvitoTech/avibe | 8B | 0.66 | 0.62 | 0.49 | 0.44 | 0.25 | 0.18 |
| RuadaptQwen2.5-7B-Lite-Beta | 7B | 0.74 | 0.64 | 0.60 | 0.54 | 0.38 | 0.29 |
| Qwen2.5-7B-Instruct | 7B | 0.65 | 0.68 | 0.65 | 0.78 | 0.60 | 0.48 |
| Qwen2.5-14B-Instruct | 14B | 0.90 | 0.89 | 0.80 | 0.77 | 0.55 | 0.38 |

ruBABILongQA2 (two supporting facts):

| Model | Size | 4K | 8K | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|---|---|
| Meno-Lite-0.1 | 7B | 0.34 | 0.20 | 0.19 | 0.16 | 0.07 | 0.02 |
| T-lite-it-1.0 | 7B | 0.29 | 0.13 | 0.07 | 0.06 | 0.10 | 0.04 |
| T-lite-it-2.1 | 7B | 0.44 | 0.42 | 0.33 | 0.24 | 0.15 | 0.05 |
| AvitoTech/avibe | 8B | 0.47 | 0.35 | 0.34 | 0.31 | 0.24 | 0.09 |
| RuadaptQwen2.5-7B-Lite-Beta | 7B | 0.32 | 0.18 | 0.07 | 0.08 | 0.05 | 0.05 |
| Qwen2.5-7B-Instruct | 7B | 0.33 | 0.19 | 0.15 | 0.13 | 0.11 | 0.06 |
| Qwen2.5-14B-Instruct | 14B | 0.61 | 0.55 | 0.41 | 0.34 | 0.22 | 0.11 |

ruBABILongQA3 (three supporting facts):

| Model | Size | 4K | 8K | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|---|---|
| Meno-Lite-0.1 | 7B | 0.24 | 0.17 | 0.17 | 0.15 | 0.08 | 0.11 |
| T-lite-it-1.0 | 7B | 0.20 | 0.10 | 0.10 | 0.16 | 0.08 | 0.07 |
| T-lite-it-2.1 | 7B | 0.23 | 0.25 | 0.20 | 0.12 | 0.08 | 0.11 |
| AvitoTech/avibe | 8B | 0.20 | 0.23 | 0.15 | 0.12 | 0.09 | 0.11 |
| RuadaptQwen2.5-7B-Lite-Beta | 7B | 0.19 | 0.16 | 0.09 | 0.07 | 0.06 | 0.02 |
| Qwen2.5-7B-Instruct | 7B | 0.24 | 0.24 | 0.18 | 0.24 | 0.16 | 0.15 |
| Qwen2.5-14B-Instruct | 14B | 0.37 | 0.35 | 0.33 | 0.25 | 0.18 | 0.20 |

ruBABILongQA4 (two argument relations):

| Model | Size | 4K | 8K | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|---|---|
| Meno-Lite-0.1 | 7B | 0.56 | 0.58 | 0.57 | 0.52 | 0.33 | 0.22 |
| T-lite-it-1.0 | 7B | 0.56 | 0.62 | 0.59 | 0.62 | 0.37 | 0.15 |
| T-lite-it-2.1 | 7B | 0.60 | 0.65 | 0.59 | 0.61 | 0.43 | 0.27 |
| AvitoTech/avibe | 8B | 0.57 | 0.54 | 0.52 | 0.49 | 0.35 | 0.25 |
| RuadaptQwen2.5-7B-Lite-Beta | 7B | 0.59 | 0.58 | 0.55 | 0.47 | 0.27 | 0.22 |
| Qwen2.5-7B-Instruct | 7B | 0.62 | 0.53 | 0.58 | 0.52 | 0.25 | 0.08 |
| Qwen2.5-14B-Instruct | 14B | 0.66 | 0.69 | 0.66 | 0.64 | 0.34 | 0.15 |

ruBABILongQA5 (three argument relations):

| Model | Size | 4K | 8K | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|---|---|
| Meno-Lite-0.1 | 7B | 0.80 | 0.73 | 0.80 | 0.74 | 0.63 | 0.54 |
| T-lite-it-1.0 | 7B | 0.81 | 0.79 | 0.77 | 0.78 | 0.78 | 0.54 |
| T-lite-it-2.1 | 7B | 0.79 | 0.73 | 0.76 | 0.70 | 0.71 | 0.69 |
| AvitoTech/avibe | 8B | 0.73 | 0.74 | 0.76 | 0.73 | 0.66 | 0.59 |
| RuadaptQwen2.5-7B-Lite-Beta | 7B | 0.79 | 0.76 | 0.75 | 0.74 | 0.65 | 0.49 |
| Qwen2.5-7B-Instruct | 7B | 0.81 | 0.79 | 0.76 | 0.80 | 0.82 | 0.69 |
| Qwen2.5-14B-Instruct | 14B | 0.86 | 0.78 | 0.82 | 0.82 | 0.78 | 0.64 |

LibrusecMHQA and ru2WikiMultihopQA (open-domain multi-hop):

| Model | Size | LibrusecMHQA 8K | ru2Wiki 8K | ru2Wiki 16K | ru2Wiki 32K |
|---|---|---|---|---|---|
| Meno-Lite-0.1 | 7B | 0.484 | 0.388 | 0.422 | 0.244 |
| T-lite-it-1.0 | 7B | 0.456 | 0.367 | 0.352 | 0.228 |
| T-lite-it-2.1 | 7B | 0.453 | 0.469 | 0.375 | 0.268 |
| AvitoTech/avibe | 8B | 0.440 | 0.347 | 0.336 | 0.228 |
| RuadaptQwen2.5-7B-Lite-Beta | 7B | 0.432 | 0.367 | 0.367 | 0.252 |
| Qwen2.5-7B-Instruct | 7B | 0.419 | 0.245 | 0.305 | 0.228 |
| Qwen2.5-14B-Instruct | 14B | 0.484 | 0.531 | 0.391 | 0.285 |

LongContextMultiQ (multi-document multi-hop):

| Model | Size | 4K | 8K | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|---|---|
| Meno-Lite-0.1 | 7B | 0.045 | 0.320 | 0.075 | 0.000 | 0.005 | 0.180 |
| T-lite-it-1.0 | 7B | 0.055 | 0.270 | 0.060 | 0.000 | 0.005 | 0.000 |
| T-lite-it-2.1 | 7B | 0.065 | 0.335 | 0.040 | 0.000 | 0.005 | 0.000 |
| AvitoTech/avibe | 8B | 0.060 | 0.360 | 0.085 | 0.070 | 0.005 | 0.000 |
| RuadaptQwen2.5-7B-Lite-Beta | 7B | 0.050 | 0.300 | 0.165 | 0.000 | 0.005 | 0.150 |
| Qwen2.5-7B-Instruct | 7B | 0.055 | 0.260 | 0.035 | 0.000 | 0.005 | 0.000 |
| Qwen2.5-14B-Instruct | 14B | 0.075 | 0.345 | 0.090 | 0.010 | 0.005 | 0.000 |

Key observations on multi-hop QA:

  • Consistent improvement over the ancestral chain. Across nearly all ruBABILong tasks and context lengths, Meno-Lite-0.1 outperforms its direct parent RuadaptQwen2.5-7B-Lite-Beta, confirming that the CPT+SFT pipeline adds genuine multi-hop reasoning capability rather than just superficial instruction following.
  • Best-in-class on real-world multi-hop QA among 7B models. On LibrusecMHQA (0.484) and ru2WikiMultihopQA at 16K (0.422), Meno-Lite-0.1 leads all 7B-class models; at 8K on ru2WikiMultihopQA (0.388), only T-lite-it-2.1 (0.469) scores higher. On LibrusecMHQA it ties with the twice-larger Qwen2.5-14B-Instruct. These tasks — based on Russian literary texts and Wikipedia — are closest to real RAG scenarios.
  • Unique long-context multi-hop ability. On LongContextMultiQ at 128K, Meno-Lite-0.1 is the only model besides its parent to achieve a non-trivial score (0.18 vs. 0.00 for avibe, T-lite-it-1.0, T-lite-it-2.1, Qwen2.5-7B, and Qwen2.5-14B). This suggests that the CPT data selection strategy preserved long-range coherence even as skills were sharpened.
  • Different degradation profiles. On the synthetic ruBABILong tasks, avibe shows a steeper degradation curve on QA1 (single fact: 0.66→0.18 from 4K to 128K) compared to Meno-Lite (0.72→0.36), indicating that Meno-Lite retains better focus as context grows for single-fact retrieval. Conversely, avibe is stronger on QA2 (two facts) across all lengths, reflecting a complementary strength in multi-fact aggregation.

Question Answering and Multiple Choice

These tasks evaluate reading comprehension on Russian literary, scientific, and factual texts.

ruQuALITY (reading comprehension over long narratives):

| Model | Size | 8K | 16K |
|---|---|---|---|
| Meno-Lite-0.1 | 7B | 0.805 | 0.720 |
| T-lite-it-1.0 | 7B | 0.854 | 0.770 |
| T-lite-it-2.1 | 7B | 0.805 | 0.727 |
| AvitoTech/avibe | 8B | 0.732 | 0.677 |
| RuadaptQwen2.5-7B-Lite-Beta | 7B | 0.683 | 0.671 |
| Qwen2.5-7B-Instruct | 7B | 0.732 | 0.634 |
| Qwen2.5-14B-Instruct | 14B | 0.732 | 0.702 |

MatreshkaYesNo (yes/no comprehension questions):

| Model | Size | 4K | 8K | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|---|---|
| Meno-Lite-0.1 | 7B | 0.836 | 0.793 | 0.757 | 0.773 | 0.690 | 0.603 |
| T-lite-it-1.0 | 7B | 0.920 | 0.770 | 0.753 | 0.590 | 0.530 | 0.530 |
| T-lite-it-2.1 | 7B | 0.836 | 0.843 | 0.797 | 0.807 | 0.757 | 0.577 |
| AvitoTech/avibe | 8B | 0.809 | 0.817 | 0.797 | 0.777 | 0.770 | 0.633 |
| RuadaptQwen2.5-7B-Lite-Beta | 7B | 0.809 | 0.827 | 0.790 | 0.747 | 0.687 | 0.593 |
| Qwen2.5-7B-Instruct | 7B | 0.860 | 0.727 | 0.737 | 0.620 | 0.587 | 0.567 |
| Qwen2.5-14B-Instruct | 14B | 0.876 | 0.827 | 0.773 | 0.763 | 0.697 | 0.637 |

MatreshkaNames (entity name extraction from narratives):

| Model | Size | 4K | 8K | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|---|---|
| Meno-Lite-0.1 | 7B | 0.453 | 0.413 | 0.320 | 0.207 | 0.113 | 0.040 |
| T-lite-it-1.0 | 7B | 0.467 | 0.507 | 0.400 | 0.253 | 0.060 | 0.060 |
| T-lite-it-2.1 | 7B | 0.647 | 0.520 | 0.453 | 0.467 | 0.360 | 0.193 |
| AvitoTech/avibe | 8B | 0.647 | 0.513 | 0.473 | 0.400 | 0.273 | 0.153 |
| RuadaptQwen2.5-7B-Lite-Beta | 7B | 0.433 | 0.347 | 0.300 | 0.153 | 0.047 | 0.013 |
| Qwen2.5-7B-Instruct | 7B | 0.480 | 0.460 | 0.373 | 0.327 | 0.167 | 0.113 |
| Qwen2.5-14B-Instruct | 14B | 0.647 | 0.547 | 0.500 | 0.420 | 0.227 | 0.193 |

ruSciAbstractRetrieval (scientific abstract retrieval):

| Model | Size | 4K | 8K | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|---|---|
| Meno-Lite-0.1 | 7B | 0.986 | 0.848 | 0.757 | 0.476 | 0.185 | 0.085 |
| T-lite-it-1.0 | 7B | 0.981 | 0.910 | 0.805 | 0.538 | 0.215 | 0.125 |
| T-lite-it-2.1 | 7B | 0.986 | 0.952 | 0.895 | 0.810 | 0.230 | 0.140 |
| AvitoTech/avibe | 8B | 0.981 | 0.933 | 0.933 | 0.738 | 0.375 | 0.165 |
| RuadaptQwen2.5-7B-Lite-Beta | 7B | 0.986 | 0.862 | 0.738 | 0.476 | 0.185 | 0.095 |
| Qwen2.5-7B-Instruct | 7B | 0.976 | 0.905 | 0.867 | 0.710 | 0.330 | 0.160 |
| Qwen2.5-14B-Instruct | 14B | 0.986 | 0.919 | 0.929 | 0.790 | 0.430 | 0.195 |

LibrusecHistory (historical literary QA):

| Model | Size | 8K | 16K | 32K | 64K |
|---|---|---|---|---|---|
| Meno-Lite-0.1 | 7B | 0.906 | 0.938 | 0.813 | 0.875 |
| T-lite-it-1.0 | 7B | 0.906 | 0.906 | 0.875 | 0.844 |
| T-lite-it-2.1 | 7B | 1.000 | 1.000 | 0.938 | 0.875 |
| AvitoTech/avibe | 8B | 1.000 | 0.938 | 1.000 | 0.938 |
| RuadaptQwen2.5-7B-Lite-Beta | 7B | 0.844 | 0.875 | 0.844 | 0.781 |
| Qwen2.5-7B-Instruct | 7B | 0.938 | 0.906 | 0.906 | 0.750 |
| Qwen2.5-14B-Instruct | 14B | 0.938 | 0.938 | 0.906 | 0.781 |

Key observations on QA tasks:

  • ruQuALITY: near the top of the 7B class. Meno-Lite-0.1 scores 0.805 at 8K (tied with T-lite-it-2.1) and 0.720 at 16K, surpassing avibe (0.732/0.677), Qwen2.5-7B (0.732/0.634), and even Qwen2.5-14B (0.732/0.702); only T-lite-it-1.0 (0.854/0.770) clearly scores higher. The improvement over the direct parent RuadaptQwen2.5-7B-Lite-Beta is substantial (+0.12 at 8K, +0.05 at 16K).
  • MatreshkaYesNo: exceptional stability across context lengths. While T-lite-it-1.0 starts strong at 4K (0.920) but drops to 0.530 at 128K, Meno-Lite-0.1 degrades much more gracefully (0.836→0.603), maintaining a clear advantage at 32K (0.773 vs. 0.590).
  • LibrusecHistory: strong gains over ancestors. Meno-Lite-0.1 outperforms its parent RuadaptQwen2.5-7B-Lite-Beta at every context length except 32K and matches or exceeds the 14B Qwen2.5 at 16K and 64K.

Complex Reasoning and Mathematical Problems

ruQasper (QA over scientific papers):

| Model | Size | 8K | 16K | 32K |
|---|---|---|---|---|
| Meno-Lite-0.1 | 7B | 0.542 | 0.538 | 0.360 |
| T-lite-it-1.0 | 7B | 0.478 | 0.508 | 0.321 |
| T-lite-it-2.1 | 7B | 0.476 | 0.543 | 0.299 |
| AvitoTech/avibe | 8B | 0.507 | 0.524 | 0.388 |
| RuadaptQwen2.5-7B-Lite-Beta | 7B | 0.465 | 0.468 | 0.347 |
| Qwen2.5-7B-Instruct | 7B | 0.454 | 0.436 | 0.346 |
| Qwen2.5-14B-Instruct | 14B | 0.507 | 0.519 | 0.411 |

ruGSM100 (grade-school math in Russian):

| Model | Size | 16K |
|---|---|---|
| AvitoTech/avibe | 8B | 0.31 |
| Qwen2.5-14B-Instruct | 14B | 0.29 |
| Meno-Lite-0.1 | 7B | 0.26 |
| T-lite-it-2.1 | 7B | 0.19 |
| RuadaptQwen2.5-7B-Lite-Beta | 7B | 0.19 |
| T-lite-it-1.0 | 7B | 0.16 |
| Qwen2.5-7B-Instruct | 7B | 0.15 |

ruSciPassageCount (counting relevant passages):

| Model | Size | 4K | 8K | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|---|---|
| Meno-Lite-0.1 | 7B | 0.43 | 0.12 | 0.05 | 0.05 | 0.00 | 0.02 |
| T-lite-it-1.0 | 7B | 0.47 | 0.15 | 0.06 | 0.03 | 0.02 | 0.02 |
| T-lite-it-2.1 | 7B | 0.32 | 0.14 | 0.00 | 0.01 | 0.00 | 0.02 |
| AvitoTech/avibe | 8B | 0.36 | 0.08 | 0.08 | 0.03 | 0.01 | 0.02 |
| RuadaptQwen2.5-7B-Lite-Beta | 7B | 0.50 | 0.21 | 0.13 | 0.01 | 0.01 | 0.02 |
| Qwen2.5-7B-Instruct | 7B | 0.32 | 0.10 | 0.03 | 0.04 | 0.01 | 0.02 |
| Qwen2.5-14B-Instruct | 14B | 0.68 | 0.27 | 0.12 | 0.04 | 0.00 | 0.03 |

Key observations on complex reasoning:

  • ruQasper: best 7B model at 8K. Meno-Lite-0.1 achieves 0.542 at 8K — the highest score among all tested models including the 14B Qwen2.5 (0.507). This task, which requires QA over full scientific papers, is a direct test of document comprehension skill.
  • ruGSM100: strongest 7B model for math. Despite no math-specific training, Meno-Lite (0.26) substantially outperforms all 7B peers (T-lite-it-1.0: 0.16, Qwen2.5-7B: 0.15) and approaches the 14B model (0.29). This supports the hypothesis that improved language comprehension transfers to mathematical reasoning when problems are stated in natural language.

LIBRA Summary

The per-length analysis reveals a nuanced picture:

| Strength | Meno-Lite-0.1 advantage |
|---|---|
| Real-world multi-hop QA (LibrusecMHQA, ru2WikiMHQA) | Leading 7B model; ties with Qwen2.5-14B on LibrusecMHQA |
| Document QA (ruQasper at 8K, ruQuALITY) | Highest ruQasper score among all models at 8K; among the top 7B models on ruQuALITY |
| Comprehension stability (MatreshkaYesNo at 32K) | More graceful degradation than T-lite-it-1.0; competitive with models twice its size |
| Ultra-long retrieval (Passkey at 128K) | 0.98 vs. 0.58 for stock Qwen2.5-7B/14B |
| Long-context multi-hop (LongContextMultiQ at 128K) | Only model (besides its parent) with a non-zero score at 128K |
| Math reasoning from context (ruGSM100) | Best among all 7B models |

The model's profile is well-suited for production RAG pipelines, where contexts typically fall in the 4K–16K range — precisely where Meno-Lite-0.1 shows its strongest performance relative to peers.

Summary of Benchmark Findings

The evaluation on MERA and LIBRA, combined with tokenizer analysis, paints a coherent picture of Meno-Lite-0.1's strengths and trade-offs.

1. MERA (general Russian LLM evaluation): Meno-Lite-0.1 achieves 0.555 — the highest score among all 7B-class models on the leaderboard, matching 70B-class Llama-3.3-70B-Instruct (0.555) and Meta-Llama-3.1-70B-Instruct (0.554). The gap over the base Qwen2.5-7B-Instruct (+0.073) and the direct parent RuadaptQwen2.5-7B-Lite-Beta is substantial, confirming that the CPT+SFT pipeline adds genuine capability rather than superficial instruction tuning.

2. LIBRA (long-context understanding): The per-length analysis reveals that Meno-Lite-0.1 excels in the 4K–16K context range most relevant to production RAG:

  • Best 7B model on real-world multi-hop QA: leading on LibrusecMHQA (0.484, tied with Qwen2.5-14B) and on ru2WikiMultihopQA at 16K (0.422).
  • Highest ruQasper score at 8K among all tested models (0.542), including the 14B Qwen2.5-Instruct (0.507), demonstrating strong scientific document comprehension.
  • Among the strongest 7B models on ruQuALITY at 16K (0.720), surpassing the 14B model (0.702); only the T-lite models score higher.
  • Near-perfect passkey retrieval up to 128K (0.98), far ahead of stock Qwen2.5-7B/14B (0.58).
  • Unique non-zero score on LongContextMultiQ at 128K (0.18) — the only model besides its parent to solve any multi-hop questions at this extreme length.
  • At very long contexts (64K–128K) on certain tasks (MatreshkaNames, ruSciAbstractRetrieval, ruBABILongQA2), models with larger effective context training — such as avibe and T-lite-it-2.1 — retain quality better, reflecting a known trade-off between mid-range precision and ultra-long-range retention.

3. Tokenizer efficiency: With 3.77 Russian characters per token (vs. 2.57 for stock Qwen2.5), Meno-Lite-0.1 generates Russian text ~47% more efficiently, directly reducing inference latency and serving costs.

4. The hypothesis in practice. These results provide empirical support for the language-knowledge vs. world-knowledge decomposition. By concentrating training signal on language skills — comprehension, extraction, multi-hop reasoning, instruction following — a 7B model can match or exceed systems 2×–10× its size on context-grounded tasks. The model's relative weaknesses align with prediction: world-knowledge-heavy tasks (CheGeKa, MaMuRAMu) and pure long-range retrieval tasks beyond 32K remain areas where larger models or specialized long-context training hold an advantage. For the vast majority of RAG, document QA, and agentic deployments — where relevant information is supplied in context and typical chunk sizes fall in the 4K–16K range — Meno-Lite-0.1 offers a compelling combination of quality, speed, and cost efficiency.
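The characters-per-token figure cited above can be measured with a small helper. This is a minimal sketch: `tokenize` stands in for the model's real tokenizer (e.g. one loaded via transformers' `AutoTokenizer`), stubbed here with whitespace splitting so the example stays self-contained.

```python
def chars_per_token(texts, tokenize):
    """Average number of characters encoded per token over a corpus."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    return total_chars / total_tokens

# Toy illustration with whitespace tokenization (a real measurement
# would pass the model's tokenizer and a large Russian text sample):
ratio = chars_per_token(["съешь же ещё этих мягких булок"], str.split)
# 30 characters / 6 tokens = 5.0
```

Comparing this ratio between two tokenizers on the same corpus directly yields the relative token savings reported above.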

Training Details

Training Data

The training data was curated to maximize language-skill acquisition while minimizing reliance on world-knowledge memorization.

Continued Pretraining (CPT) data:

| Source | Language | Description |
|---|---|---|
| FineWeb-Edu (sampled) | EN | High-quality educational web text |
| RuLM subset | RU | Russian web text selected for maximal FineWeb-Edu similarity using gte-multilingual-base embeddings |
| RU FinePDFs-edu | RU | Educational PDF documents in Russian |
| RuREBus (Dialogue'20) | RU | Unlabeled text corpus from the RuREBus shared task |

Supervised Fine-Tuning (SFT) data:

| Source | Language | Description |
|---|---|---|
| NEREL → instructions | RU | Named entity recognition corpus converted to instruction format, plus GPT-4o-mini–generated synthetic entity normalization and definitions |
| LightRAG query logs | RU | GPT-4o–generated queries over Habr articles and the NSU website |
| MultiHopRAG | EN | Multi-hop question answering training dialogs |
| MTRAGEval | EN | Multi-turn RAG evaluation training dialogs |

Training Procedure

Stage 1 — Continued Pretraining (CPT): The model was further pretrained on a balanced mix of Russian and English educational, legal, and scientific-technical texts. The Russian subset was specifically selected to match the quality distribution of FineWeb-Edu, ensuring that the model absorbs high-quality linguistic patterns rather than noisy web crawls.
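The similarity-based selection of the Russian subset can be sketched as an embedding filter. This is a hedged illustration, not the actual pipeline: `embed` stands in for a sentence-embedding model such as gte-multilingual-base, and the centroid and threshold values are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def select_similar(docs, embed, reference_centroid, threshold=0.7):
    """Keep documents whose embedding is close to a reference centroid,
    e.g. the mean embedding of a high-quality corpus like FineWeb-Edu."""
    return [d for d in docs
            if cosine(embed(d), reference_centroid) >= threshold]
```

In practice the centroid would be computed over a sample of the reference corpus, and the threshold tuned to hit a target data budget.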

Stage 2 — Supervised Fine-Tuning (SFT): The SFT stage used a custom instruction set designed to reinforce language skills (extraction, normalization, summarization, multi-hop QA) rather than inject world knowledge. This is the critical distinction: conventional SFT datasets often teach models to recall facts, whereas our instructions teach models to use context.
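A context-grounded SFT example of the kind described above might look like the following sketch. The actual instruction templates used for Meno-Lite-0.1 are not published; the roles and field names here follow the common chat-messages format, and a production pipeline for this model would use Russian prompts.

```python
def build_grounded_example(context: str, question: str, answer: str) -> dict:
    """Wrap a (context, question, answer) triple as a chat-format SFT
    example that teaches the model to answer from context, not memory."""
    system = ("Answer strictly from the provided context. "
              "If the context does not contain the answer, say so.")
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": answer},
        ]
    }
```

The key property is that the target answer is always derivable from the supplied context, so gradient signal rewards reading rather than recall.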

Training Hyperparameters

  • Training regime: bf16 mixed precision

Bias, Risks, and Limitations

  • Hallucination risk: Like all autoregressive LLMs, Meno-Lite-0.1 can generate plausible-sounding but factually incorrect text, especially when relevant context is not provided in the prompt. This is by design — the model was optimized for context-grounded tasks.
  • World knowledge gaps: The model deliberately trades world-knowledge capacity for language skills. It should not be used as a standalone knowledge base.
  • Language coverage: While the model retains good English capabilities, it has been primarily validated on Russian and English. Performance on other languages supported by the Qwen2.5 backbone is untested.
  • Training data biases: The model inherits biases present in its pretraining corpora (FineWeb-Edu, RuLM, Habr) and in the GPT-4o/GPT-4o-mini generations used for synthetic SFT data.
  • Context window: Although the model handles contexts up to 128K tokens in passkey tasks, complex reasoning performance degrades at very long contexts (>32K), consistent with other models in this size class.

Recommendations

  • Always provide relevant context in the prompt for best results.
  • For factual accuracy, use the model within a RAG pipeline with a reliable retrieval system.
  • Validate model outputs in high-stakes domains (legal, medical, financial).
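The RAG recommendation above can be sketched as a minimal prompt-assembly step. `retrieve` is a hypothetical stand-in for a real retrieval system (BM25, a dense retriever, etc.); chunking, reranking, and the model call itself are omitted, and the character budget is an illustrative proxy for staying in the model's 4K–16K sweet spot.

```python
def assemble_rag_prompt(question, retrieve, top_k=4, max_chars=16000):
    """Build a context-grounded prompt from retrieved chunks,
    truncating to a character budget."""
    chunks = retrieve(question)[:top_k]
    context, used = [], 0
    for c in chunks:
        if used + len(c) > max_chars:  # stop before exceeding the budget
            break
        context.append(c)
        used += len(c)
    ctx = "\n\n".join(context)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{ctx}\n\nQuestion: {question}")
```

A production variant would budget in tokens rather than characters and instruct the model (as in the system prompts above) to refuse when the context lacks the answer.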

Technical Specifications

Model Architecture and Objective

  • Architecture: Qwen2.5 (causal decoder-only transformer)
  • Parameters: ~7B
  • Context window: Up to 32,768 tokens (validated up to 128K on retrieval tasks)
  • Vocabulary: Extended tokenizer with improved Russian coverage (~3.77 chars/token for Russian)
  • Objective: Next-token prediction (autoregressive language modeling)
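A back-of-the-envelope estimate shows how a model of this shape lands near 7B parameters. The config values below are illustrative assumptions in the Qwen2.5-7B ballpark, not the published configuration, and the formula ignores grouped-query attention (which shrinks K/V projections), biases, and norms.

```python
def approx_decoder_params(vocab_size, d_model, n_layers, d_ff):
    """Rough parameter count for a decoder-only transformer."""
    embeddings = vocab_size * d_model     # token embedding matrix
    attention = 4 * d_model * d_model     # Q, K, V, O projections
    mlp = 3 * d_model * d_ff              # gated (SwiGLU-style) MLP
    return embeddings + n_layers * (attention + mlp)

total = approx_decoder_params(vocab_size=150_000, d_model=3584,
                              n_layers=28, d_ff=18_944)
# roughly 7.7e9 under these assumed dimensions
```

An untied output head would add another vocab_size × d_model on top of this estimate.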

Compute Infrastructure

Hardware

Training was conducted on NVIDIA GPU infrastructure (10× NVIDIA A100 80 GB GPUs).

Software

Citation

If you use Meno-Lite-0.1 in your research, please cite:

BibTeX:

@misc{bondarenko2025menolite,
  title={Meno-Lite-0.1: A 7B Language Model Optimized for Russian RAG Pipelines},
  author={Ivan Bondarenko},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/bond005/meno-lite-0.1}
}

Glossary

  • RAG (Retrieval-Augmented Generation): A paradigm where relevant documents are retrieved from an external knowledge base and injected into the LLM's context, enabling accurate answers without relying on parametric memory.
  • CPT (Continued Pretraining): An additional pretraining phase applied to an already-trained model, typically on domain-specific or quality-filtered data.
  • SFT (Supervised Fine-Tuning): Training on instruction–response pairs to align the model with desired behaviors.
  • Multi-hop QA: Question answering that requires synthesizing information from multiple passages or reasoning steps.
  • MERA: The most comprehensive Russian-language benchmark for evaluating LLMs, comprising 23 tasks covering world knowledge, logic, causality, AI ethics, and more. The overall leaderboard score is computed over 15 closed-test tasks.
  • LIBRA: Long Input Benchmark for Russian Analysis — 21 tasks for evaluating long-context understanding in Russian, spanning context lengths from 4K to 128K tokens.

Model Card Authors

Ivan Bondarenko (@bond005), Novosibirsk State University

Model Card Contact

For questions, feedback, or collaboration inquiries, please open an issue on the model repository or contact Ivan Bondarenko via Hugging Face.
