---
language:
- ar
license: apache-2.0
tags:
- arabic
- nanochat
- text-generation
- causal-lm
- from-scratch
pipeline_tag: text-generation
model-index:
- name: Fasih-2B
  results:
  - task:
      type: text-generation
      name: Arabic Knowledge (ArabicMMLU)
    metrics:
    - type: accuracy
      value: 29.91
      name: ArabicMMLU Accuracy
  - task:
      type: text-generation
      name: Arabic Cultural Values (ACVA)
    metrics:
    - type: accuracy
      value: 49.29
      name: ACVA Accuracy
  - task:
      type: text-generation
      name: Arabic Math (ArabicGSM8K)
    metrics:
    - type: accuracy
      value: 4.47
      name: ArabicGSM8K Accuracy
---

# Fasih-2B (فصيح)

A **2.09 billion parameter** Arabic language model trained **from scratch** using the [nanochat](https://github.com/karpathy/nanochat) framework.

## Model Details

| Property | Value |
|----------|-------|
| Parameters | 2,088,768,048 (2.09B) |
| Architecture | NanoChatGPT (Transformer decoder) |
| Layers | 24 |
| Hidden size | 1536 |
| Attention heads | 12 |
| Vocabulary | 65,536 (Arabic BPE) |
| Context length | 2,048 tokens |
| Precision | bfloat16 |
| Language | Arabic |

## Training

### Pretraining

- **Dataset**: AraMix (8.19B tokens, Chinchilla-optimal)
- **Steps**: 7,812
- **Hardware**: 8x NVIDIA H20 GPUs
- **Batch size**: 1,048,576 tokens

### SFT (Supervised Fine-Tuning)

- **Datasets**: alpaca-gpt4-arabic (50K) + CIDAR (10K) + ArabicMMLU (x3) + ArabicGSM8K (x4) + Fasih identity conversations
- **Total mixture**: 90,621 rows
- **Steps**: 16 (1 epoch)

## Evaluation Results

| Benchmark | Accuracy | Random Baseline |
|-----------|----------|-----------------|
| ArabicMMLU | **29.91%** | 25% |
| ACVA | **49.29%** | 50% |
| ArabicGSM8K | **4.47%** | 0% |
| ChatCORE | **0.0175** | 0.0 |

## Usage

This model uses the nanochat framework.
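As a minimal sketch, a single-turn prompt in the special-token chat format shown in the Chat Format section can be assembled as a plain string before it is passed to nanochat's tokenizer. The token strings below are taken from this card; `build_prompt` itself is an illustrative helper, not part of the nanochat API.

```python
# Sketch: assemble one user turn in nanochat's special-token chat format.
# The token strings come from the Chat Format section of this card;
# build_prompt is a hypothetical helper for illustration only.
BOS = "<|bos|>"
USER_START, USER_END = "<|user_start|>", "<|user_end|>"
ASSISTANT_START = "<|assistant_start|>"

def build_prompt(user_message: str) -> str:
    """Return the raw string the tokenizer should see for one user turn,
    ending with the assistant-start token so the model begins its reply."""
    return f"{BOS}{USER_START}{user_message}{USER_END}{ASSISTANT_START}"

prompt = build_prompt("مرحبا، من أنت؟")  # "Hello, who are you?"
print(prompt)
```

The trailing `<|assistant_start|>` token is what cues the model to generate the assistant's reply; generation would normally be stopped when the model emits its end-of-turn token.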
To use it:

```bash
git clone https://github.com/karpathy/nanochat
cd nanochat
# Copy model files to $NANOCHAT_BASE_DIR/chatsft_checkpoints/d24/
python -m scripts.chat_web -i sft
```

## Chat Format

The model uses nanochat's special-token format:

```
<|bos|><|user_start|>مرحبا، من أنت؟<|user_end|><|assistant_start|>
```

The example user message reads "Hello, who are you?"

## Limitations

- Small model (2B parameters): limited knowledge and reasoning compared to larger models
- Trained primarily on Arabic text: limited multilingual capability
- Short context window (2,048 tokens)
- A research/educational model trained from scratch in ~12 hours

## License

Apache 2.0