Model Description
YModel2 is the most powerful large language model (LLM) trained by SnifferCaptain to date (as of November 23, 2025). It exhibits substantial improvements over the YModel1.x series in inference speed, mathematical reasoning, coding capabilities, and common-sense question answering.
Model Details
- The model incorporates optimization insights from MFA (https://arxiv.org/abs/2412.19255), upgrading Position Embedding Gate Attention (PEGA) to PEGA2. This new version matches or even exceeds the original PEGA in performance while delivering nearly a 3× speedup.
- The model employs GeGLU in its Feed-Forward Network (FFN) blocks (a minimal sketch follows this list).
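For reference, a minimal PyTorch sketch of a GeGLU feed-forward block. The layout (separate bias-free gate/up/down projections) is the standard GeGLU construction, not YModel2's published configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLUFFN(nn.Module):
    """FFN block with a GeGLU gate: down( GELU(gate(x)) * up(x) )."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        # gate_proj and up_proj are parallel linear layers; this is the kind
        # of pair that can be fused into one doubled-width projection (cf. the
        # fusion note under Training Details below).
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.gelu(self.gate_proj(x)) * self.up_proj(x))
```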
Training Details
- The model inherits the self-distillation architecture from YModel1.1, applying an inter-layer cosine similarity loss that encourages the model to compress knowledge into shallower layers (one possible formulation is sketched after this list).
- Parallel linear layers have been fused. Experiments with second-order optimizers confirm that this fusion does not degrade the model’s final performance compared to the unfused counterpart.
- The model was trained with the powerful SiMuon optimizer, an enhanced variant of Muon (https://kellerjordan.github.io/posts/muon) described in https://www.arxiv.org/abs/2507.11005. The SiMuon implementation used here differs from the original Muon in two key aspects (sketched after this list):
  - A sign operation is applied before the L2 norm in the Newton-Schulz (NS) iteration.
  - The number of NS iterations is reduced from 5 to 2.
- The tokenizer and embedding layers are initialized with pre-trained weights from MiniMind2 (https://github.com/jingyaogong/minimind).
- The model was trained on 0.4 billion tokens during pre-training and approximately 7 billion tokens (8.56 million samples) during fine-tuning.
- Pre-training was conducted with a sequence length of 512, using a learning rate of 1e-3 for SiMuon (scaled by 0.2 * sqrt(max(fan_in, fan_out))) and 1e-4 for AdamW, with bf16 automatic mixed precision (AMP) acceleration.
- Supervised fine-tuning (SFT) was performed at the following sequence length / base learning rate combinations: 512/1e-5, 1024/1e-5, 2048/1e-5, and 2048/2e-6, with the SiMuon learning rate set to 66x the base learning rate. bf16 AMP was also used during SFT.
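Below, a hedged sketch of the inter-layer cosine similarity loss. The exact layer pairing is not documented here; pulling every intermediate hidden state toward the final layer's representation is one plausible reading, so treat the formulation, names, and detach choice as assumptions:

```python
import torch
import torch.nn.functional as F

def interlayer_cosine_loss(hidden_states: list[torch.Tensor]) -> torch.Tensor:
    """Self-distillation loss over per-layer hidden states.

    hidden_states: one [batch, seq, dim] tensor per layer. Pulling every
    intermediate layer toward the final layer's representation pressures
    shallow layers to already carry the model's knowledge. The pairing
    scheme is an assumption, not YModel2's documented formulation.
    """
    target = hidden_states[-1].detach()  # final layer acts as the teacher
    loss = hidden_states[0].new_zeros(())
    for h in hidden_states[:-1]:
        # 1 - cosine similarity, averaged over batch and sequence positions
        loss = loss + (1.0 - F.cosine_similarity(h, target, dim=-1)).mean()
    return loss / (len(hidden_states) - 1)
```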
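And a minimal sketch of the SiMuon orthogonalization step plus the learning-rate scaling from the pre-training bullet. The sign pre-step and the 2-step iteration count are the two documented changes; the NS coefficients follow the public Muon reference implementation, and all naming here is hypothetical:

```python
import math
import torch

def simuon_orthogonalize(grad: torch.Tensor, ns_steps: int = 2) -> torch.Tensor:
    """Newton-Schulz (NS) orthogonalization of a 2D gradient, SiMuon-style."""
    a, b, c = 3.4445, -4.7750, 2.0315    # Muon's quintic NS coefficients
    x = grad.sign()                      # SiMuon change 1: sign before the L2 norm
    x = x / (x.norm() + 1e-7)
    transposed = x.size(0) > x.size(1)
    if transposed:                       # the iteration expects a wide matrix
        x = x.T
    for _ in range(ns_steps):            # SiMuon change 2: 2 steps instead of 5
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x

def scaled_lr(base_lr: float, fan_in: int, fan_out: int) -> float:
    # Per-matrix learning-rate scale from the pre-training bullet:
    # 0.2 * sqrt(max(fan_in, fan_out)).
    return base_lr * 0.2 * math.sqrt(max(fan_in, fan_out))
```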
Model Performance
The model was benchmarked on several datasets; the scores are for entertainment and reference only. The results below were obtained with the lm_eval framework:
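A hypothetical reproduction of these numbers via lm_eval's Python API; the repo id and shot count come from this page, and the exact API surface may differ by harness version:

```python
import lm_eval

# 5-shot C-Eval validation run, mirroring the tables below
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=SnifferCaptain/YModel2-s-2",
    tasks=["ceval-valid"],
    num_fewshot=5,
)
print(results["results"])
```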
| Groups | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| ceval-valid | 2 | none | 5 | acc | ↑ | 0.2422 | ± | 0.0117 |
Per-task C-Eval breakdown:
| Tasks | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| ceval-valid | 2 | none | 5 | acc | ↑ | 0.2422 | ± | 0.0117 |
| ceval-valid_accountant | 2 | none | 5 | acc | ↑ | 0.2449 | ± | 0.0621 |
| ceval-valid_advanced_mathematics | 2 | none | 5 | acc | ↑ | 0.1053 | ± | 0.0723 |
| ceval-valid_art_studies | 2 | none | 5 | acc | ↑ | 0.1818 | ± | 0.0682 |
| ceval-valid_basic_medicine | 2 | none | 5 | acc | ↑ | 0.2105 | ± | 0.0961 |
| ceval-valid_business_administration | 2 | none | 5 | acc | ↑ | 0.2424 | ± | 0.0758 |
| ceval-valid_chinese_language_and_literature | 2 | none | 5 | acc | ↑ | 0.3043 | ± | 0.0981 |
| ceval-valid_civil_servant | 2 | none | 5 | acc | ↑ | 0.1489 | ± | 0.0525 |
| ceval-valid_clinical_medicine | 2 | none | 5 | acc | ↑ | 0.3182 | ± | 0.1016 |
| ceval-valid_college_chemistry | 2 | none | 5 | acc | ↑ | 0.2500 | ± | 0.0903 |
| ceval-valid_college_economics | 2 | none | 5 | acc | ↑ | 0.2727 | ± | 0.0606 |
| ceval-valid_college_physics | 2 | none | 5 | acc | ↑ | 0.4211 | ± | 0.1164 |
| ceval-valid_college_programming | 2 | none | 5 | acc | ↑ | 0.2973 | ± | 0.0762 |
| ceval-valid_computer_architecture | 2 | none | 5 | acc | ↑ | 0.3810 | ± | 0.1086 |
| ceval-valid_computer_network | 2 | none | 5 | acc | ↑ | 0.1579 | ± | 0.0859 |
| ceval-valid_discrete_mathematics | 2 | none | 5 | acc | ↑ | 0.3750 | ± | 0.1250 |
| ceval-valid_education_science | 2 | none | 5 | acc | ↑ | 0.3103 | ± | 0.0874 |
| ceval-valid_electrical_engineer | 2 | none | 5 | acc | ↑ | 0.2973 | ± | 0.0762 |
| ceval-valid_environmental_impact_assessment_engineer | 2 | none | 5 | acc | ↑ | 0.1613 | ± | 0.0672 |
| ceval-valid_fire_engineer | 2 | none | 5 | acc | ↑ | 0.2258 | ± | 0.0763 |
| ceval-valid_high_school_biology | 2 | none | 5 | acc | ↑ | 0.1053 | ± | 0.0723 |
| ceval-valid_high_school_chemistry | 2 | none | 5 | acc | ↑ | 0.1579 | ± | 0.0859 |
| ceval-valid_high_school_chinese | 2 | none | 5 | acc | ↑ | 0.1053 | ± | 0.0723 |
| ceval-valid_high_school_geography | 2 | none | 5 | acc | ↑ | 0.3158 | ± | 0.1096 |
| ceval-valid_high_school_history | 2 | none | 5 | acc | ↑ | 0.3500 | ± | 0.1094 |
| ceval-valid_high_school_mathematics | 2 | none | 5 | acc | ↑ | 0.2222 | ± | 0.1008 |
| ceval-valid_high_school_physics | 2 | none | 5 | acc | ↑ | 0.1579 | ± | 0.0859 |
| ceval-valid_high_school_politics | 2 | none | 5 | acc | ↑ | 0.5789 | ± | 0.1164 |
| ceval-valid_ideological_and_moral_cultivation | 2 | none | 5 | acc | ↑ | 0.3158 | ± | 0.1096 |
| ceval-valid_law | 2 | none | 5 | acc | ↑ | 0.1250 | ± | 0.0690 |
| ceval-valid_legal_professional | 2 | none | 5 | acc | ↑ | 0.2174 | ± | 0.0879 |
| ceval-valid_logic | 2 | none | 5 | acc | ↑ | 0.2273 | ± | 0.0914 |
| ceval-valid_mao_zedong_thought | 2 | none | 5 | acc | ↑ | 0.2083 | ± | 0.0847 |
| ceval-valid_marxism | 2 | none | 5 | acc | ↑ | 0.3158 | ± | 0.1096 |
| ceval-valid_metrology_engineer | 2 | none | 5 | acc | ↑ | 0.2083 | ± | 0.0847 |
| ceval-valid_middle_school_biology | 2 | none | 5 | acc | ↑ | 0.3810 | ± | 0.1086 |
| ceval-valid_middle_school_chemistry | 2 | none | 5 | acc | ↑ | 0.2500 | ± | 0.0993 |
| ceval-valid_middle_school_geography | 2 | none | 5 | acc | ↑ | 0.0833 | ± | 0.0833 |
| ceval-valid_middle_school_history | 2 | none | 5 | acc | ↑ | 0.1818 | ± | 0.0842 |
| ceval-valid_middle_school_mathematics | 2 | none | 5 | acc | ↑ | 0.2632 | ± | 0.1038 |
| ceval-valid_middle_school_physics | 2 | none | 5 | acc | ↑ | 0.4737 | ± | 0.1177 |
| ceval-valid_middle_school_politics | 2 | none | 5 | acc | ↑ | 0.2381 | ± | 0.0952 |
| ceval-valid_modern_chinese_history | 2 | none | 5 | acc | ↑ | 0.1739 | ± | 0.0808 |
| ceval-valid_operating_system | 2 | none | 5 | acc | ↑ | 0.1579 | ± | 0.0859 |
| ceval-valid_physician | 2 | none | 5 | acc | ↑ | 0.2041 | ± | 0.0582 |
| ceval-valid_plant_protection | 2 | none | 5 | acc | ↑ | 0.2273 | ± | 0.0914 |
| ceval-valid_probability_and_statistics | 2 | none | 5 | acc | ↑ | 0.2222 | ± | 0.1008 |
| ceval-valid_professional_tour_guide | 2 | none | 5 | acc | ↑ | 0.3103 | ± | 0.0874 |
| ceval-valid_sports_science | 2 | none | 5 | acc | ↑ | 0.1579 | ± | 0.0859 |
| ceval-valid_tax_accountant | 2 | none | 5 | acc | ↑ | 0.2041 | ± | 0.0582 |
| ceval-valid_teacher_qualification | 2 | none | 5 | acc | ↑ | 0.2727 | ± | 0.0679 |
| ceval-valid_urban_and_rural_planner | 2 | none | 5 | acc | ↑ | 0.1739 | ± | 0.0565 |
| ceval-valid_veterinary_medicine | 2 | none | 5 | acc | ↑ | 0.2609 | ± | 0.0936 |
Below are sample Q&A outputs from the model (since the model is very small, a higher repetition penalty is recommended during generation):
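For instance, a minimal sketch of sampling with an increased repetition penalty via Hugging Face transformers; the repo id is taken from this page, while the trust_remote_code flag and the penalty value of 1.3 are illustrative assumptions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "SnifferCaptain/YModel2-s-2"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)

inputs = tok("你知道长江吗?", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    repetition_penalty=1.3,  # raise this knob for a model this small
)
print(tok.decode(out[0], skip_special_tokens=True))
```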
[Q]Do you know the Yangtze River?
[A]The Yangtze River is a cultural phenomenon of ancient China; it has not only shaped the cultural traditions of the Jiangnan region but also had a profound impact on human society. The history of the Yangtze can be traced back to the 4th century BC, when, in Chinese history, as time passed, with the appearance of the Yangtze, the cultural customs of the Jiangnan region gradually came to be regarded as a form of cultural inheritance and spiritual practice.
[Q]What is the highest mountain in the world?
[A]The highest mountain in the world is Mount Everest, located in the Himalayas at an elevation of 8,848 meters. It is located in the Himalayas at an elevation of 8,848 meters and is one of the highest mountains in the world.
[Q]Who proposed the law of universal gravitation?
[A]Universal gravitation is the interaction force between two elementary particles, between which there is an interacting relationship. In the universe, universal gravitation can be described as a star or other stars. Nevertheless, because gravity is extremely strong, its gravity is extremely strong, but because its gravity is extremely strong, it can only rotate at extremely high speed without producing any external force.
[Q]Do you know what the giant panda's main food is?
[A]Giant pandas feed mainly on bamboo; their main foods include but are not limited to:
- Aquatic animals: giant pandas usually grow in trees, such as giant pandas, tigers, and pigs. They live in forests, and aquatic animals are one of the most common food sources.
- Habitat: giant pandas are relatively light in weight, such as giant pandas, giant pandas, and giant pandas. They live in trees and live in forests, such as pandas, giant pandas, and giant pandas.
- Bamboo: bamboo is bamboo; they live in forests, and bamboo can live in trees. Bamboo usually grows in trees, but they live in forests, and bamboo may grow in trees.
- Habitat: bamboo is the most common kind of bamboo; they live in trees, such as branches, grass, and bamboo. They live in forests, and bamboo grows under trees, grows in trees, and reproduces.
- Bamboo: bamboo is a kind of bamboo; bamboo usually grows in trees, while bamboo is bamboo, and bamboo is bamboo. Bamboo usually grows in trees, and bamboo is bamboo. Bamboo usually grows in trees, bamboo is bamboo, and bamboo also grows in trees.
These foods not only enrich bamboo but also have a profound impact on the global ecosystem.
[Q]Why is seawater salty?
[A]Seawater is a kind of waste composed of seawater or other marine organisms, mainly because they play an important role in atmospheric pollution and decomposition processes. When seawater is compressed into seawater, it releases large amounts of carbon dioxide; this carbon dioxide releases large amounts of carbon dioxide, which lowers its surface temperature and thereby reduces buoyancy in the ocean.
The existence of seawater can be divided into the following aspects:
- Temperature: seawater cools very quickly, which makes the seawater temperature evaporate faster, affecting biological systems in the ocean.
- Temperature: the seawater temperature is relatively low, which makes the seawater temperature relatively low, which makes the seawater temperature higher, which is very important for seawater biodiversity.
- Seawater temperature: the seawater temperature is relatively high, which increases the seawater temperature, which is significant for the seawater ecosystem.
- Marine life: the existence of seawater also has an important influence on the survival and reproduction of marine organisms, which is crucial for maintaining the stability and biodiversity of marine ecosystems.
In summary, the seawater temperature is usually caused by direct or indirect factors of seawater weakening buoyancy, which profoundly affects marine ecosystems, marine organisms, marine organisms, and human activities.
[Q]Write a piece of Python quicksort code.
[A]Here is a simple sorting algorithm implemented in Python:
```python
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        return arr

# test code
arr = [3, 5, 7, 9, 8]
sorted_arr = quick_sort(arr)
print(sorted_arr)
```
This code first defines a function named sorted_arr, which takes a list as an argument. It then uses a for loop to iterate over each element of the list and merges them into a result list. Finally, it prints the sorted list.
[Q]As of 14:31 on February 9, 2025, the film 《哪吒之魔童闹海》 had grossed over 7.809 billion yuan in cumulative box office (including presales), surpassing Star Wars: The Rise of Skywalker and becoming the first Asian film to enter the global all-time box-office top 40. This achievement not only marks a breakthrough for Chinese cinema in the international market, but also demonstrates the enormous potential and market appeal of Chinese animated film. Summarize this passage for me in 20 characters.
[A]As of 14:31 on February 9, 2025, the film 《哪吒之魔童闹海》 as of February 9, 2025, the 14th.
[Q]Justice has 4 ferns, 6 palms, and 8 succulent plants in her house. If she wants a total of 25 plants in her home, how many more plants does she need?
[A]Justice has 4 ferns, and 4 ferns, each of them will have 4 ferns. So, the total number of plants in her home is 4 + 8 = 12 ferns. The answer is 12.
[Q]Please write an editorial on environmental protection; it should be substantive and persuasive.
[A]Sure, here is an editorial on environmental protection:
With the rapid development of industrialization and urbanization, environmental pollution has become increasingly serious and increasingly prominent. To protect the environment, we need to take action and reduce negative impacts on the environment at the source. First, the government should strengthen environmental supervision, enact stricter environmental regulations, step up efforts to expand environmental pollution, and promote the development of a green economy. Second, the government should strengthen environmental supervision, promote clean energy, promote clean energy, encourage the use of environmentally friendly energy sources, strengthen environmental supervision, and promote green production methods. Finally, every one of us should contribute to the cause of environmental protection.
In short, environmental protection is everyone's responsibility. Only by working together can we make our planet a better place.