burtenshaw (HF Staff) committed on
Commit eccfa49 · verified · 1 Parent(s): 6cbffae

Extract evaluation results from README

This commit adds structured evaluation results to the model card. The results are formatted using the model-index specification and will be displayed in the model card's evaluation widget.
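In the `model-index` format, each metric sits inside a `results` entry alongside its `task`, `dataset` and `source`. As a rough illustration only (not part of this commit or of any Hugging Face tooling), the Python sketch below assumes the YAML has already been parsed into plain dicts and lists, and flattens it into one row per metric. `MODEL_INDEX` is a hypothetical three-metric excerpt of this commit's data, and `flatten_model_index` is a hypothetical helper.

```python
from typing import Any

# Hypothetical excerpt of this commit's model-index metadata, assumed to be
# already parsed from YAML into plain Python dicts and lists.
MODEL_INDEX: list[dict[str, Any]] = [
    {
        "name": "Qwen3-235B-A22B-Thinking-2507",
        "results": [
            {
                "task": {"type": "text-generation"},
                "dataset": {"name": "Benchmarks", "type": "benchmark"},
                "metrics": [
                    {"name": "MMLU-Pro (Deepseek-R1-0528)", "type": "mmlu-pro", "value": 85.0},
                    {"name": "GPQA (Deepseek-R1-0528)", "type": "gpqa", "value": 81.0},
                    {"name": "CFEval (Deepseek-R1-0528)", "type": "cfeval", "value": 2099.0},
                ],
                "source": {
                    "name": "Model README",
                    "url": "https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507",
                },
            }
        ],
    }
]


def flatten_model_index(model_index: list[dict[str, Any]]) -> list[tuple[str, str, str, float]]:
    """Return one (model name, dataset type, metric type, value) row per metric."""
    rows = []
    for model in model_index:
        for result in model.get("results", []):
            dataset_type = result.get("dataset", {}).get("type", "")
            for metric in result.get("metrics", []):
                rows.append((model["name"], dataset_type, metric["type"], float(metric["value"])))
    return rows


for row in flatten_model_index(MODEL_INDEX):
    print(row)
```

Flattening like this is one simple way to compare or tabulate the entries; the nesting itself is what the evaluation widget reads.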

Files changed (1)
  1. README.md +96 -0
README.md CHANGED
@@ -3,6 +3,102 @@ library_name: transformers
 license: apache-2.0
 license_link: https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507/blob/main/LICENSE
 pipeline_tag: text-generation
+model-index:
+- name: Qwen3-235B-A22B-Thinking-2507
+  results:
+  - task:
+      type: text-generation
+    dataset:
+      name: Benchmarks
+      type: benchmark
+    metrics:
+    - name: MMLU-Pro (Deepseek-R1-0528)
+      type: mmlu-pro
+      value: 85.0
+    - name: MMLU-Redux (Deepseek-R1-0528)
+      type: mmlu-redux
+      value: 93.4
+    - name: GPQA (Deepseek-R1-0528)
+      type: gpqa
+      value: 81.0
+    - name: SuperGPQA (Deepseek-R1-0528)
+      type: supergpqa
+      value: 61.7
+    - name: AIME25 (Deepseek-R1-0528)
+      type: aime25
+      value: 87.5
+    - name: HMMT25 (Deepseek-R1-0528)
+      type: hmmt25
+      value: 79.4
+    - name: LiveBench 20241125 (Deepseek-R1-0528)
+      type: livebench_20241125
+      value: 74.7
+    - name: HLE (OpenAI O3)
+      type: hle
+      value: 20.3
+    - name: LiveCodeBench v6 (25.02-25.05) (Deepseek-R1-0528)
+      type: livecodebench_v6_(25.02-25.05)
+      value: 68.7
+    - name: CFEval (Deepseek-R1-0528)
+      type: cfeval
+      value: 2099.0
+    - name: OJBench (Deepseek-R1-0528)
+      type: ojbench
+      value: 33.6
+    - name: IFEval (Deepseek-R1-0528)
+      type: ifeval
+      value: 79.1
+    - name: Arena-Hard v2$ (Deepseek-R1-0528)
+      type: arena-hard_v2$
+      value: 72.2
+    - name: Creative Writing v3 (Deepseek-R1-0528)
+      type: creative_writing_v3
+      value: 86.3
+    - name: WritingBench (Deepseek-R1-0528)
+      type: writingbench
+      value: 83.2
+    - name: BFCL-v3 (Deepseek-R1-0528)
+      type: bfcl-v3
+      value: 63.8
+    - name: TAU1-Retail (Deepseek-R1-0528)
+      type: tau1-retail
+      value: 63.9
+    - name: TAU1-Airline (OpenAI O4-mini)
+      type: tau1-airline
+      value: 49.2
+    - name: TAU2-Retail (Deepseek-R1-0528)
+      type: tau2-retail
+      value: 64.9
+    - name: TAU2-Airline (Deepseek-R1-0528)
+      type: tau2-airline
+      value: 60.0
+    - name: TAU2-Telecom (Deepseek-R1-0528)
+      type: tau2-telecom
+      value: 33.3
+    - name: MultiIF (Deepseek-R1-0528)
+      type: multiif
+      value: 63.5
+    - name: MMLU-ProX (Deepseek-R1-0528)
+      type: mmlu-prox
+      value: 80.6
+    - name: INCLUDE (Deepseek-R1-0528)
+      type: include
+      value: 79.4
+    - name: PolyMATH (Deepseek-R1-0528)
+      type: polymath
+      value: 46.9
+    - name: Qwen3-235B-A22B (Thinking) (Acc avg)
+      type: qwen3-235b-a22b_(thinking)
+      value: 82.9
+    - name: Qwen3-235B-A22B-Thinking-2507 (Full Attention) (Acc avg)
+      type: qwen3-235b-a22b-thinking-2507_(full_attention)
+      value: 95.4
+    - name: Qwen3-235B-A22B-Thinking-2507 (Sparse Attention) (Acc avg)
+      type: qwen3-235b-a22b-thinking-2507_(sparse_attention)
+      value: 95.5
+    source:
+      name: Model README
+      url: https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
 ---
 
 # Qwen3-235B-A22B-Thinking-2507
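Each metric `type` in this metadata appears to be derived from its display `name`: the trailing parenthetical label (for example `(Deepseek-R1-0528)` or `(Acc avg)`) is dropped, and the rest is lowercased with spaces replaced by underscores. This is an inference from the entries in this commit, not a documented rule of the model-index specification. The hypothetical helper `metric_type_slug` below merely reproduces that pattern so it can be checked against a sample of the (name, type) pairs.

```python
import re


# Hypothetical helper: reproduces the name -> type pattern observed in this
# commit. It is an inferred convention, not part of the model-index spec.
def metric_type_slug(metric_name: str) -> str:
    # Drop only the final parenthetical group, e.g. " (Deepseek-R1-0528)".
    base = re.sub(r"\s*\([^()]*\)$", "", metric_name)
    # Lowercase and turn spaces into underscores; other characters are kept.
    return base.lower().replace(" ", "_")


# (name, type) pairs copied from the metadata in this commit.
OBSERVED = [
    ("MMLU-Pro (Deepseek-R1-0528)", "mmlu-pro"),
    ("LiveBench 20241125 (Deepseek-R1-0528)", "livebench_20241125"),
    ("HLE (OpenAI O3)", "hle"),
    ("LiveCodeBench v6 (25.02-25.05) (Deepseek-R1-0528)", "livecodebench_v6_(25.02-25.05)"),
    ("Arena-Hard v2$ (Deepseek-R1-0528)", "arena-hard_v2$"),
    ("Qwen3-235B-A22B (Thinking) (Acc avg)", "qwen3-235b-a22b_(thinking)"),
]

for name, expected in OBSERVED:
    slug = metric_type_slug(name)
    print(f"{name!r} -> {slug!r} (matches: {slug == expected})")
```

Because only the last parenthetical is stripped, names that carry two groups, such as the LiveCodeBench date range, keep the first one in the slug, which matches what the commit contains.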