---
language:
- ar
license: apache-2.0
tags:
- arabic
- nanochat
- text-generation
- causal-lm
- from-scratch
pipeline_tag: text-generation
model-index:
- name: Fasih-2B
  results:
  - task:
      type: text-generation
      name: Arabic Knowledge (ArabicMMLU)
    metrics:
    - type: accuracy
      value: 29.91
      name: ArabicMMLU Accuracy
  - task:
      type: text-generation
      name: Arabic Cultural Values (ACVA)
    metrics:
    - type: accuracy
      value: 49.29
      name: ACVA Accuracy
  - task:
      type: text-generation
      name: Arabic Math (ArabicGSM8K)
    metrics:
    - type: accuracy
      value: 4.47
      name: ArabicGSM8K Accuracy
---

# Fasih-2B (فصيح)

A **2.09 billion parameter** Arabic language model trained **from scratch** using the [nanochat](https://github.com/karpathy/nanochat) framework.

## Model Details

| Property | Value |
|----------|-------|
| Parameters | 2,088,768,048 (2.09B) |
| Architecture | NanoChatGPT (Transformer decoder) |
| Layers | 24 |
| Hidden size | 1536 |
| Attention heads | 12 |
| Vocabulary | 65,536 (Arabic BPE) |
| Context length | 2,048 tokens |
| Precision | bfloat16 |
| Language | Arabic |

## Training

### Pretraining

- **Dataset**: AraMix (8.19B tokens, Chinchilla-optimal)
- **Steps**: 7,812
- **Hardware**: 8x NVIDIA H20 GPUs
- **Batch size**: 1,048,576 tokens

### SFT (Supervised Fine-Tuning)

- **Datasets**: alpaca-gpt4-arabic (50K) + CIDAR (10K) + ArabicMMLU (x3) + ArabicGSM8K (x4) + Fasih identity conversations
- **Total mixture**: 90,621 rows
- **Steps**: 16 (1 epoch)

## Evaluation Results

| Benchmark | Accuracy | Random Baseline |
|-----------|----------|-----------------|
| ArabicMMLU | **29.91%** | 25% |
| ACVA | **49.29%** | 50% |
| ArabicGSM8K | **4.47%** | 0% |
| ChatCORE | **0.0175** | 0.0 |

## Usage

This model uses the nanochat framework.
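As a minimal sketch, a single-turn prompt in the special-token chat format shown in the Chat Format section can be assembled as a plain string before it is passed to nanochat's tokenizer. The token strings below are taken from this card; `build_prompt` itself is an illustrative helper, not part of the nanochat API.

```python
# Sketch: assemble one user turn in nanochat's special-token chat format.
# The token strings come from the Chat Format section of this card;
# build_prompt is a hypothetical helper for illustration only.
BOS = "<|bos|>"
USER_START, USER_END = "<|user_start|>", "<|user_end|>"
ASSISTANT_START = "<|assistant_start|>"

def build_prompt(user_message: str) -> str:
    """Return the raw string the tokenizer should see for one user turn,
    ending with the assistant-start token so the model begins its reply."""
    return f"{BOS}{USER_START}{user_message}{USER_END}{ASSISTANT_START}"

prompt = build_prompt("مرحبا، من أنت؟")  # "Hello, who are you?"
print(prompt)
```

The trailing `<|assistant_start|>` token is what cues the model to generate the assistant's reply; generation would normally be stopped when the model emits its end-of-turn token.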
To use it:

```bash
git clone https://github.com/karpathy/nanochat
cd nanochat
# Copy model files to $NANOCHAT_BASE_DIR/chatsft_checkpoints/d24/
python -m scripts.chat_web -i sft
```

## Chat Format

The model uses nanochat's special-token format:

```
<|bos|><|user_start|>مرحبا، من أنت؟<|user_end|><|assistant_start|>
```

The example user message reads "Hello, who are you?"

## Limitations

- Small model (2B parameters): limited knowledge and reasoning compared to larger models
- Trained primarily on Arabic text: limited multilingual capability
- Short context window (2,048 tokens)
- A research/educational model trained from scratch in ~12 hours

## License

Apache 2.0