HeshamHaroon committed on
Commit a79efdc · verified · 1 Parent(s): 6c27b31

Upload Fasih-2B Arabic LLM (nanochat, SFT checkpoint)

Files changed (5)
  1. README.md +105 -0
  2. config.json +40 -0
  3. model.safetensors +3 -0
  4. token_bytes.pt +3 -0
  5. tokenizer.pkl +3 -0
README.md ADDED
@@ -0,0 +1,105 @@
+ ---
+ language:
+ - ar
+ license: apache-2.0
+ tags:
+ - arabic
+ - nanochat
+ - text-generation
+ - causal-lm
+ - from-scratch
+ pipeline_tag: text-generation
+ model-index:
+ - name: Fasih-2B
+   results:
+   - task:
+       type: text-generation
+       name: Arabic Knowledge (ArabicMMLU)
+     metrics:
+     - type: accuracy
+       value: 29.91
+       name: ArabicMMLU Accuracy
+   - task:
+       type: text-generation
+       name: Arabic Cultural Values (ACVA)
+     metrics:
+     - type: accuracy
+       value: 49.29
+       name: ACVA Accuracy
+   - task:
+       type: text-generation
+       name: Arabic Math (ArabicGSM8K)
+     metrics:
+     - type: accuracy
+       value: 4.47
+       name: ArabicGSM8K Accuracy
+ ---
+
+ # Fasih-2B (فصيح)
+
+ A **2.09 billion parameter** Arabic language model trained **from scratch** using the [nanochat](https://github.com/karpathy/nanochat) framework.
+
+ ## Model Details
+
+ | Property | Value |
+ |----------|-------|
+ | Parameters | 2,088,768,048 (2.09B) |
+ | Architecture | NanoChatGPT (Transformer decoder) |
+ | Layers | 24 |
+ | Hidden size | 1536 |
+ | Attention heads | 12 |
+ | Vocabulary | 65,536 (Arabic BPE) |
+ | Context length | 2,048 tokens |
+ | Precision | bfloat16 |
+ | Language | Arabic |
+
+ ## Training
+
+ ### Pretraining
+ - **Dataset**: AraMix (8.19B tokens, Chinchilla-optimal)
+ - **Steps**: 7,812
+ - **Hardware**: 8x NVIDIA H20 GPUs
+ - **Batch size**: 1,048,576 tokens
+
+ ### SFT (Supervised Fine-Tuning)
+ - **Datasets**: alpaca-gpt4-arabic (50K) + CIDAR (10K) + ArabicMMLU (x3) + ArabicGSM8K (x4) + Fasih identity conversations
+ - **Total mixture**: 90,621 rows
+ - **Steps**: 16 (1 epoch)
+
+ ## Evaluation Results
+
+ | Benchmark | Accuracy | Random Baseline |
+ |-----------|----------|----------------|
+ | ArabicMMLU | **29.91%** | 25% |
+ | ACVA | **49.29%** | 50% |
+ | ArabicGSM8K | **4.47%** | 0% |
+ | ChatCORE | **0.0175** | 0.0 |
+
+ ## Usage
+
+ This model uses the nanochat framework. To use it:
+
+ ```bash
+ git clone https://github.com/karpathy/nanochat
+ cd nanochat
+ # Copy model files to $NANOCHAT_BASE_DIR/chatsft_checkpoints/d24/
+ python -m scripts.chat_web -i sft
+ ```
+
+ ## Chat Format
+
+ The model uses nanochat's special token format:
+ ```
+ <|bos|><|user_start|>مرحبا، من أنت؟<|user_end|><|assistant_start|>
+ ```
+
+ ## Limitations
+
+ - Small model (2B params), with limited knowledge and reasoning compared to larger models
+ - Trained primarily on Arabic text, so multilingual capability is limited
+ - Short context window (2,048 tokens)
+ - A research/educational model trained from scratch in ~12 hours
+
+ ## License
+
+ Apache 2.0
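The chat format documented in the README above can be sketched as a small prompt builder. The special-token strings (`<|bos|>`, `<|user_start|>`, etc.) come from the README's Chat Format section; the `build_prompt` helper itself is illustrative and not part of nanochat's API, which renders these tokens through its own tokenizer.

```python
# Sketch: assemble a nanochat-style chat prompt from the special tokens
# shown in the README. build_prompt is a hypothetical helper, not nanochat API.

def build_prompt(turns):
    """turns: list of (role, text) pairs, role in {"user", "assistant"}."""
    parts = ["<|bos|>"]
    for role, text in turns:
        parts.append(f"<|{role}_start|>{text}<|{role}_end|>")
    # Leave the final assistant turn open so the model generates the reply.
    parts.append("<|assistant_start|>")
    return "".join(parts)

prompt = build_prompt([("user", "مرحبا، من أنت؟")])  # "Hello, who are you?"
print(prompt)
```

For the single-turn example above, this reproduces the exact prompt string shown in the Chat Format section.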
config.json ADDED
@@ -0,0 +1,40 @@
+ {
+   "model_type": "nanochat",
+   "architecture": "NanoChatGPT",
+   "num_parameters": 2088768048,
+   "num_parameters_billion": 2.089,
+   "vocab_size": 65536,
+   "n_layer": 24,
+   "n_head": 12,
+   "n_kv_head": 12,
+   "n_embd": 1536,
+   "ff_mult": 4,
+   "sequence_len": 2048,
+   "window_pattern": "SSSL",
+   "dtype": "bfloat16",
+   "language": "ar",
+   "tokenizer_type": "nanochat_bpe",
+   "training": {
+     "framework": "nanochat",
+     "pretrain_steps": 7812,
+     "pretrain_tokens": "8.19B",
+     "pretrain_dataset": "AraMix",
+     "sft_steps": 16,
+     "sft_datasets": [
+       "alpaca-gpt4-arabic",
+       "CIDAR",
+       "ArabicMMLU",
+       "ArabicGSM8K",
+       "fasih_identity"
+     ],
+     "sft_mixture_rows": 90621,
+     "hardware": "8x NVIDIA H20",
+     "val_bpb": 0.4533
+   },
+   "evaluation": {
+     "ArabicMMLU": "29.91%",
+     "ACVA": "49.29%",
+     "ArabicGSM8K": "4.47%",
+     "ChatCORE": 0.0175
+   }
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3517a0033034f2a7d103c62a7e4fd40774c71b3e08b2bde17dcf17865f2a0c91
+ size 4177554864
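A quick sanity check on this LFS pointer: with 2,088,768,048 parameters stored in bfloat16 (2 bytes each), the weights account for almost all of the 4,177,554,864-byte file, and the small remainder is the safetensors metadata header. A minimal sketch, using only numbers from this commit:

```python
# Sanity-check model.safetensors size against the parameter count.
# Both figures are taken from this commit's config.json and LFS pointer.
num_params = 2_088_768_048
bytes_per_param = 2            # bfloat16
file_size = 4_177_554_864      # "size" field of the LFS pointer

weight_bytes = num_params * bytes_per_param
header_bytes = file_size - weight_bytes  # safetensors JSON header, a few KB
print(weight_bytes, header_bytes)  # → 4177536096 18768
```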
token_bytes.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3806dfc6c5b1ec9d2df2b265480ee8ba96c529b4d6b24b9b84be982b21b52fad
+ size 263324
tokenizer.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f922d447008f293dc604661c53225d5909a7a39b7fc3f422492417435f1d14cc
+ size 1095580