---
language:
  - ar
license: apache-2.0
tags:
  - arabic
  - nanochat
  - text-generation
  - causal-lm
  - from-scratch
pipeline_tag: text-generation
model-index:
  - name: Fasih-2B
    results:
      - task:
          type: text-generation
          name: Arabic Knowledge (ArabicMMLU)
        metrics:
          - type: accuracy
            value: 29.91
            name: ArabicMMLU Accuracy
      - task:
          type: text-generation
          name: Arabic Cultural Values (ACVA)
        metrics:
          - type: accuracy
            value: 49.29
            name: ACVA Accuracy
      - task:
          type: text-generation
          name: Arabic Math (ArabicGSM8K)
        metrics:
          - type: accuracy
            value: 4.47
            name: ArabicGSM8K Accuracy
---

# Fasih-2B (فصيح, "Eloquent")

A 2.09-billion-parameter Arabic language model trained from scratch with the [nanochat](https://github.com/karpathy/nanochat) framework.

## Model Details

| Property | Value |
|---|---|
| Parameters | 2,088,768,048 (2.09B) |
| Architecture | NanoChatGPT (Transformer decoder) |
| Layers | 24 |
| Hidden size | 1536 |
| Attention heads | 12 |
| Vocabulary | 65,536 (Arabic BPE) |
| Context length | 2,048 tokens |
| Precision | bfloat16 |
| Language | Arabic |

## Training

### Pretraining

- Dataset: AraMix (8.19B tokens, Chinchilla-optimal)
- Steps: 7,812
- Hardware: 8× NVIDIA H20 GPUs
- Batch size: 1,048,576 tokens
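The 8.19B-token figure follows directly from the step count and the per-step token batch; a quick sanity check (plain arithmetic, not taken from the training code):

```python
steps = 7_812
tokens_per_step = 1_048_576  # batch size in tokens (1,024 × 1,024)

total_tokens = steps * tokens_per_step
print(f"{total_tokens:,}")  # 8,191,475,712 ≈ 8.19B, matching the AraMix figure
```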

### SFT (Supervised Fine-Tuning)

- Datasets: alpaca-gpt4-arabic (50K) + CIDAR (10K) + ArabicMMLU (×3) + ArabicGSM8K (×4) + Fasih identity conversations
- Total mixture: 90,621 rows
- Steps: 16 (1 epoch)

## Evaluation Results

| Benchmark | Score | Random Baseline |
|---|---|---|
| ArabicMMLU | 29.91% | 25% |
| ACVA | 49.29% | 50% |
| ArabicGSM8K | 4.47% | 0% |
| ChatCORE | 0.0175 | 0.0 |

## Usage

This model uses the nanochat framework. To use it:

```sh
git clone https://github.com/karpathy/nanochat
cd nanochat
# Copy the model files to $NANOCHAT_BASE_DIR/chatsft_checkpoints/d24/
python -m scripts.chat_web -i sft
```

## Chat Format

The model uses nanochat's special-token format:

```
<|bos|><|user_start|>مرحبا، من أنت؟<|user_end|><|assistant_start|>
```

(The example user message reads "Hello, who are you?")
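A minimal helper for assembling a prompt in this format (the function name is illustrative, not part of nanochat's API; the token strings come from the template above):

```python
def build_prompt(user_message: str) -> str:
    """Wrap a single user turn in nanochat's special tokens, leaving the
    prompt open at <|assistant_start|> for the model to complete."""
    return (
        "<|bos|>"
        f"<|user_start|>{user_message}<|user_end|>"
        "<|assistant_start|>"
    )

prompt = build_prompt("مرحبا، من أنت؟")  # "Hello, who are you?"
```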

## Limitations

- Small model (2B params): limited knowledge and reasoning compared to larger models
- Trained primarily on Arabic text: limited multilingual capability
- Short context (2,048 tokens)
- A research/educational model, trained from scratch in ~12 hours

## License

Apache 2.0