---
language:
- ar
license: apache-2.0
tags:
- arabic
- nanochat
- text-generation
- causal-lm
- from-scratch
pipeline_tag: text-generation
model-index:
- name: Fasih-2B
results:
- task:
type: text-generation
name: Arabic Knowledge (ArabicMMLU)
metrics:
- type: accuracy
value: 29.91
name: ArabicMMLU Accuracy
- task:
type: text-generation
name: Arabic Cultural Values (ACVA)
metrics:
- type: accuracy
value: 49.29
name: ACVA Accuracy
- task:
type: text-generation
name: Arabic Math (ArabicGSM8K)
metrics:
- type: accuracy
value: 4.47
name: ArabicGSM8K Accuracy
---

# Fasih-2B (فصيح)
A 2.09-billion-parameter Arabic language model trained from scratch with the [nanochat](https://github.com/karpathy/nanochat) framework.
## Model Details
| Property | Value |
|---|---|
| Parameters | 2,088,768,048 (2.09B) |
| Architecture | NanoChatGPT (Transformer decoder) |
| Layers | 24 |
| Hidden size | 1536 |
| Attention heads | 12 |
| Vocabulary | 65,536 (Arabic BPE) |
| Context length | 2,048 tokens |
| Precision | bfloat16 |
| Language | Arabic |
## Training

### Pretraining
- Dataset: AraMix (8.19B tokens, Chinchilla-optimal)
- Steps: 7,812 steps
- Hardware: 8x NVIDIA H20 GPUs
- Batch size: 1,048,576 tokens
### SFT (Supervised Fine-Tuning)
- Datasets: alpaca-gpt4-arabic (50K) + CIDAR (10K) + ArabicMMLU (x3) + ArabicGSM8K (x4) + Fasih identity conversations
- Total mixture: 90,621 rows
- Steps: 16 (1 epoch)
## Evaluation Results
| Benchmark | Accuracy | Random Baseline |
|---|---|---|
| ArabicMMLU | 29.91% | 25% |
| ACVA | 49.29% | 50% |
| ArabicGSM8K | 4.47% | 0% |
| ChatCORE | 0.0175 | 0.0 |
## Usage

This model uses the nanochat framework. To run it:

```bash
git clone https://github.com/karpathy/nanochat
cd nanochat
# Copy the model files to $NANOCHAT_BASE_DIR/chatsft_checkpoints/d24/
python -m scripts.chat_web -i sft
```
### Chat Format

The model uses nanochat's special-token chat format:

```
<|bos|><|user_start|>مرحبا، من أنت؟<|user_end|><|assistant_start|>
```
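As a minimal sketch, the format above can be assembled as a plain string before tokenization. The special-token names are taken from this card; how the nanochat tokenizer maps them to IDs is framework-specific, so treat this as string assembly only:

```python
# Build a single-turn prompt in nanochat's chat format.
# Special-token strings are from the model card; this does not tokenize.

def build_prompt(user_message: str) -> str:
    return (
        "<|bos|>"
        "<|user_start|>" + user_message + "<|user_end|>"
        "<|assistant_start|>"  # generation continues from here
    )

prompt = build_prompt("مرحبا، من أنت؟")  # "Hello, who are you?"
print(prompt)
```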
## Limitations
- Small model (2B params) — limited knowledge and reasoning compared to larger models
- Trained primarily on Arabic text — limited multilingual capability
- Short context (2048 tokens)
- This is a research/educational model, trained from scratch in roughly 12 hours
## License

Apache 2.0