---
language:
  - ar
license: apache-2.0
tags:
  - arabic
  - nanochat
  - text-generation
  - causal-lm
  - from-scratch
pipeline_tag: text-generation
model-index:
  - name: Fasih-2B
    results:
      - task:
          type: text-generation
          name: Arabic Knowledge (ArabicMMLU)
        metrics:
          - type: accuracy
            value: 29.91
            name: ArabicMMLU Accuracy
      - task:
          type: text-generation
          name: Arabic Cultural Values (ACVA)
        metrics:
          - type: accuracy
            value: 49.29
            name: ACVA Accuracy
      - task:
          type: text-generation
          name: Arabic Math (ArabicGSM8K)
        metrics:
          - type: accuracy
            value: 4.47
            name: ArabicGSM8K Accuracy
---

# Fasih-2B (فصيح, "Eloquent")

A 2.09-billion-parameter Arabic language model trained from scratch with the [nanochat](https://github.com/karpathy/nanochat) framework.

## Model Details

| Property | Value |
|---|---|
| Parameters | 2,088,768,048 (2.09B) |
| Architecture | NanoChatGPT (Transformer decoder) |
| Layers | 24 |
| Hidden size | 1536 |
| Attention heads | 12 |
| Vocabulary | 65,536 (Arabic BPE) |
| Context length | 2,048 tokens |
| Precision | bfloat16 |
| Language | Arabic |

## Training

### Pretraining

- Dataset: AraMix (8.19B tokens, Chinchilla-optimal)
- Steps: 7,812
- Hardware: 8× NVIDIA H20 GPUs
- Batch size: 1,048,576 tokens
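The 8.19B-token figure follows directly from the step count and the per-step token batch; a quick sanity check (plain arithmetic, not taken from the training code):

```python
steps = 7_812
tokens_per_step = 1_048_576  # batch size in tokens (1,024 × 1,024)

total_tokens = steps * tokens_per_step
print(f"{total_tokens:,}")  # 8,191,475,712 ≈ 8.19B, matching the AraMix figure
```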

### SFT (Supervised Fine-Tuning)

- Datasets: alpaca-gpt4-arabic (50K) + CIDAR (10K) + ArabicMMLU (×3) + ArabicGSM8K (×4) + Fasih identity conversations
- Total mixture: 90,621 rows
- Steps: 16 (1 epoch)

## Evaluation Results

| Benchmark | Score | Random Baseline |
|---|---|---|
| ArabicMMLU | 29.91% | 25% |
| ACVA | 49.29% | 50% |
| ArabicGSM8K | 4.47% | 0% |
| ChatCORE | 0.0175 | 0.0 |

## Usage

This model uses the nanochat framework. To use it:

```sh
git clone https://github.com/karpathy/nanochat
cd nanochat
# Copy the model files to $NANOCHAT_BASE_DIR/chatsft_checkpoints/d24/
python -m scripts.chat_web -i sft
```

## Chat Format

The model uses nanochat's special-token format:

```
<|bos|><|user_start|>مرحبا، من أنت؟<|user_end|><|assistant_start|>
```

(The example user message reads "Hello, who are you?")
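A minimal helper for assembling a prompt in this format (the function name is illustrative, not part of nanochat's API; the token strings come from the template above):

```python
def build_prompt(user_message: str) -> str:
    """Wrap a single user turn in nanochat's special tokens, leaving the
    prompt open at <|assistant_start|> for the model to complete."""
    return (
        "<|bos|>"
        f"<|user_start|>{user_message}<|user_end|>"
        "<|assistant_start|>"
    )

prompt = build_prompt("مرحبا، من أنت؟")  # "Hello, who are you?"
```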

## Limitations

- Small model (2B params): limited knowledge and reasoning compared to larger models
- Trained primarily on Arabic text: limited multilingual capability
- Short context (2,048 tokens)
- A research/educational model, trained from scratch in ~12 hours

## License

Apache 2.0