HeshamHaroon committed on
Commit a79efdc · verified · 1 Parent(s): 6c27b31

Upload Fasih-2B Arabic LLM (nanochat, SFT checkpoint)

Files changed (5)
  1. README.md +105 -0
  2. config.json +40 -0
  3. model.safetensors +3 -0
  4. token_bytes.pt +3 -0
  5. tokenizer.pkl +3 -0
README.md ADDED
@@ -0,0 +1,105 @@
+ ---
+ language:
+ - ar
+ license: apache-2.0
+ tags:
+ - arabic
+ - nanochat
+ - text-generation
+ - causal-lm
+ - from-scratch
+ pipeline_tag: text-generation
+ model-index:
+ - name: Fasih-2B
+   results:
+   - task:
+       type: text-generation
+       name: Arabic Knowledge (ArabicMMLU)
+     metrics:
+     - type: accuracy
+       value: 29.91
+       name: ArabicMMLU Accuracy
+   - task:
+       type: text-generation
+       name: Arabic Cultural Values (ACVA)
+     metrics:
+     - type: accuracy
+       value: 49.29
+       name: ACVA Accuracy
+   - task:
+       type: text-generation
+       name: Arabic Math (ArabicGSM8K)
+     metrics:
+     - type: accuracy
+       value: 4.47
+       name: ArabicGSM8K Accuracy
+ ---
+
+ # Fasih-2B (فصيح)
+
+ A **2.09 billion parameter** Arabic language model trained **from scratch** using the [nanochat](https://github.com/karpathy/nanochat) framework.
+
+ ## Model Details
+
+ | Property | Value |
+ |----------|-------|
+ | Parameters | 2,088,768,048 (2.09B) |
+ | Architecture | NanoChatGPT (Transformer decoder) |
+ | Layers | 24 |
+ | Hidden size | 1536 |
+ | Attention heads | 12 |
+ | Vocabulary | 65,536 (Arabic BPE) |
+ | Context length | 2,048 tokens |
+ | Precision | bfloat16 |
+ | Language | Arabic |
+
+ ## Training
+
+ ### Pretraining
+ - **Dataset**: AraMix (8.19B tokens, Chinchilla-optimal)
+ - **Steps**: 7,812
+ - **Hardware**: 8x NVIDIA H20 GPUs
+ - **Batch size**: 1,048,576 tokens
+
+ ### SFT (Supervised Fine-Tuning)
+ - **Datasets**: alpaca-gpt4-arabic (50K) + CIDAR (10K) + ArabicMMLU (x3) + ArabicGSM8K (x4) + Fasih identity conversations
+ - **Total mixture**: 90,621 rows
+ - **Steps**: 16 (1 epoch)
+
+ ## Evaluation Results
+
+ | Benchmark | Accuracy | Random Baseline |
+ |-----------|----------|----------------|
+ | ArabicMMLU | **29.91%** | 25% |
+ | ACVA | **49.29%** | 50% |
+ | ArabicGSM8K | **4.47%** | 0% |
+ | ChatCORE | **0.0175** | 0.0 |
+
+ ## Usage
+
+ This model uses the nanochat framework. To use it:
+
+ ```bash
+ git clone https://github.com/karpathy/nanochat
+ cd nanochat
+ # Copy model files to $NANOCHAT_BASE_DIR/chatsft_checkpoints/d24/
+ python -m scripts.chat_web -i sft
+ ```
+
+ ## Chat Format
+
+ The model uses nanochat's special token format:
+ ```
+ <|bos|><|user_start|>مرحبا، من أنت؟<|user_end|><|assistant_start|>
+ ```
+
+ ## Limitations
+
+ - Small model (2B params), with limited knowledge and reasoning compared to larger models
+ - Trained primarily on Arabic text, so multilingual capability is limited
+ - Short context window (2,048 tokens)
+ - A research/educational model trained from scratch in ~12 hours
+
+ ## License
+
+ Apache 2.0
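The chat format documented in the README above can be sketched as a small prompt builder. The special-token strings (`<|bos|>`, `<|user_start|>`, etc.) come from the README's Chat Format section; the `build_prompt` helper itself is illustrative and not part of nanochat's API, which renders these tokens through its own tokenizer.

```python
# Sketch: assemble a nanochat-style chat prompt from the special tokens
# shown in the README. build_prompt is a hypothetical helper, not nanochat API.

def build_prompt(turns):
    """turns: list of (role, text) pairs, role in {"user", "assistant"}."""
    parts = ["<|bos|>"]
    for role, text in turns:
        parts.append(f"<|{role}_start|>{text}<|{role}_end|>")
    # Leave the final assistant turn open so the model generates the reply.
    parts.append("<|assistant_start|>")
    return "".join(parts)

prompt = build_prompt([("user", "مرحبا، من أنت؟")])  # "Hello, who are you?"
print(prompt)
```

For the single-turn example above, this reproduces the exact prompt string shown in the Chat Format section.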
config.json ADDED
@@ -0,0 +1,40 @@
+ {
+   "model_type": "nanochat",
+   "architecture": "NanoChatGPT",
+   "num_parameters": 2088768048,
+   "num_parameters_billion": 2.089,
+   "vocab_size": 65536,
+   "n_layer": 24,
+   "n_head": 12,
+   "n_kv_head": 12,
+   "n_embd": 1536,
+   "ff_mult": 4,
+   "sequence_len": 2048,
+   "window_pattern": "SSSL",
+   "dtype": "bfloat16",
+   "language": "ar",
+   "tokenizer_type": "nanochat_bpe",
+   "training": {
+     "framework": "nanochat",
+     "pretrain_steps": 7812,
+     "pretrain_tokens": "8.19B",
+     "pretrain_dataset": "AraMix",
+     "sft_steps": 16,
+     "sft_datasets": [
+       "alpaca-gpt4-arabic",
+       "CIDAR",
+       "ArabicMMLU",
+       "ArabicGSM8K",
+       "fasih_identity"
+     ],
+     "sft_mixture_rows": 90621,
+     "hardware": "8x NVIDIA H20",
+     "val_bpb": 0.4533
+   },
+   "evaluation": {
+     "ArabicMMLU": "29.91%",
+     "ACVA": "49.29%",
+     "ArabicGSM8K": "4.47%",
+     "ChatCORE": 0.0175
+   }
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3517a0033034f2a7d103c62a7e4fd40774c71b3e08b2bde17dcf17865f2a0c91
+ size 4177554864
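A quick sanity check on this LFS pointer: with 2,088,768,048 parameters stored in bfloat16 (2 bytes each), the weights account for almost all of the 4,177,554,864-byte file, and the small remainder is the safetensors metadata header. A minimal sketch, using only numbers from this commit:

```python
# Sanity-check model.safetensors size against the parameter count.
# Both figures are taken from this commit's config.json and LFS pointer.
num_params = 2_088_768_048
bytes_per_param = 2            # bfloat16
file_size = 4_177_554_864      # "size" field of the LFS pointer

weight_bytes = num_params * bytes_per_param
header_bytes = file_size - weight_bytes  # safetensors JSON header, a few KB
print(weight_bytes, header_bytes)  # → 4177536096 18768
```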
token_bytes.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3806dfc6c5b1ec9d2df2b265480ee8ba96c529b4d6b24b9b84be982b21b52fad
+ size 263324
tokenizer.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f922d447008f293dc604661c53225d5909a7a39b7fc3f422492417435f1d14cc
+ size 1095580