360智脑 Releases Tiny-R1-32B: Approaching Full Deepseek-R1 Performance with 5% of the Parameters
Tiny-R1-32B-Preview, a mid-sized reasoning model jointly developed by the 360智脑 (360 Zhinao) team and Peking University, has been officially released. With only 5% of the parameters, the model approaches the full-strength performance of Deepseek-R1-671B, demonstrating the potential of small models for efficient reasoning.

The model performs particularly well in several key domains. In math, Tiny-R1-32B-Preview scores 78.1 on AIME 2024, close to the original R1 model's 79.8 and well ahead of Deepseek-R1-Distill-Llama-70B's 70.0. In coding and science, it scores 61.6 on LiveCodeBench and 65.0 on GPQA-Diamond respectively, leading the current best open-source 70B model, Deepseek-R1-Distill-Llama-70B, across the board. These results not only demonstrate Tiny-R1-32B-Preview's strong performance but also show that, with only 5% of the parameters, it sharply reduces inference cost and delivers a leap in efficiency.
Evaluation
| Model | Math (AIME 2024) | Coding (LiveCodeBench) | Science (GPQA-Diamond) |
| --- | --- | --- | --- |
| Deepseek-R1-Distill-Qwen-32B | 72.6 | 57.2 | 62.1 |
| Deepseek-R1-Distill-Llama-70B | 70.0 | 57.5 | 65.2 |
| Deepseek-R1 | 79.8 | 65.9 | 71.5 |
| Tiny-R1-32B-Preview (Ours) | 78.1 | 61.6 | 65.0 |
All scores are reported as pass@1.
For AIME 2024 we sample 16 responses per question, and for GPQA-Diamond we sample 4 responses per question; in both cases the average overall accuracy is reported for a stable evaluation.
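For reference only, the snippet below is a minimal sketch of this averaging scheme, not the actual evaluation harness; `generate_response` and `is_correct` are hypothetical stand-ins for the model call and the answer checker.

```python
# Sketch of pass@1 averaged over k sampled responses (illustrative only).
from typing import Callable

def averaged_pass_at_1(
    problems: list[str],
    generate_response: Callable[[str], str],  # hypothetical: samples one response
    is_correct: Callable[[str, str], bool],   # hypothetical: checks one response
    k: int = 16,                              # 16 for AIME 2024, 4 for GPQA-Diamond
) -> float:
    """Average per-sample correctness over k responses per problem."""
    total = 0.0
    for problem in problems:
        correct = sum(
            is_correct(problem, generate_response(problem)) for _ in range(k)
        )
        total += correct / k
    return total / len(problems)
```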
Method
| Model | Math (AIME 2024) | Coding (LiveCodeBench) | Science (GPQA-Diamond) |
| --- | --- | --- | --- |
| Math-Model (Ours) | 73.1 | - | - |
| Code-Model (Ours) | - | 63.4 | - |
| Science-Model (Ours) | - | - | 64.5 |
| Tiny-R1-32B-Preview (Ours) | 78.1 | 61.6 | 65.0 |
Using the 360-LLaMA-Factory training framework, we ran supervised fine-tuning (SFT) of Deepseek-R1-Distill-Qwen-32B on three target domains (math, coding, and science) to produce three domain-specific models. Taking questions from open-source data as seeds, we used DeepSeek-R1 to generate responses for the math, coding, and science tasks separately, creating a specialized model for each domain. We then merged these models with the Mergekit tool from the Arcee team to create Tiny-R1-32B-Preview, which shows strong overall performance.
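The exact merge recipe is not spelled out here; the snippet below is only a hypothetical sketch of what a Mergekit-based merge of the three domain models could look like. The merge method, weights, and local paths are assumptions, not the team's actual configuration.

```python
# Hypothetical Mergekit merge of the three domain-specific models.
# Merge method, weights, and paths are illustrative assumptions.
import subprocess
import textwrap

config = textwrap.dedent("""\
    merge_method: linear        # assumed; Mergekit also supports ties, slerp, etc.
    dtype: bfloat16
    models:
      - model: ./math-model     # SFT'd on math CoT traces
        parameters:
          weight: 1.0
      - model: ./code-model     # SFT'd on coding CoT traces
        parameters:
          weight: 1.0
      - model: ./science-model  # SFT'd on science CoT traces
        parameters:
          weight: 1.0
    """)

with open("merge_config.yaml", "w") as f:
    f.write(config)

# mergekit-yaml <config> <output_dir> is Mergekit's standard CLI entry point.
subprocess.run(["mergekit-yaml", "merge_config.yaml", "./tiny-r1-merged"], check=True)
```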
Data
1. Math: 58.3k CoT traces from open-r1/OpenR1-Math-220k, default subset
2. Coding: 19k CoT traces from open-thoughts/OpenThoughts-114k, coding subset
3. Science: we used R1 to generate 8 CoT traces for each of 7.6k seed examples, for a total of 60.8k CoT traces (a sketch of this sampling step follows the list); the seed examples come from:
   - 2.7k seed examples from simplescaling/data_ablation_full59K, science and health science subsets
   - 4.9k seed examples from open-thoughts/OpenThoughts-114k, science subset
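The generation script is not included here; as a rough sketch, sampling multiple CoT traces per seed question through an OpenAI-compatible endpoint might look like the following. The endpoint, model name, and sampling settings are assumptions, not the team's actual pipeline.

```python
# Hypothetical sketch: sample 8 CoT traces per science seed with DeepSeek-R1
# via an OpenAI-compatible API. Endpoint, model name, and temperature are
# illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

def generate_cot_traces(question: str, n_traces: int = 8) -> list[str]:
    """Sample n_traces independent chain-of-thought responses for one seed."""
    traces = []
    for _ in range(n_traces):
        completion = client.chat.completions.create(
            model="deepseek-reasoner",  # DeepSeek-R1 endpoint name (assumed)
            messages=[{"role": "user", "content": question}],
            temperature=0.6,            # assumed sampling temperature
        )
        traces.append(completion.choices[0].message.content)
    return traces
```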
Code
Install dependencies:

```bash
pip install transformers bitsandbytes -U
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model_name = "qihoo360/TinyR1-32B-Preview"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

prompt = "Is 123 a prime?"
inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.batch_decode(generated_ids)[0]
print(response)
```

Example output:

```
Is 123 a prime? Let me check. To determine if 123 is a prime number, I need to check if it has any divisors other than 1 and itself. A prime number is a number greater than 1 that has no positive divisors other than 1 and itself. So, let's start by checking the divisibility of 123 by some smaller prime numbers.
First, check if 123 is even. Since it ends with a 3, which is odd,
```
A second example, with a math prompt:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model_name = "qihoo360/TinyR1-32B-Preview"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

prompt = "What is the sum of the first 100 natural numbers?"
inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.batch_decode(generated_ids)[0]
print(response)
```

Example output:

```
What is the sum of the first 100 natural numbers? To find the sum of the first 100 natural numbers, we can use the formula for the sum of an arithmetic series. The formula is S = n/2 * (a₁ + aₙ), where n is the number of terms, a₁ is the first term, and aₙ is the last term. In this case, n = 100, a₁ = 1, and aₙ = 100. Plugging these values into
```
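Both snippets above feed the raw prompt string to the tokenizer. If the checkpoint ships a chat template (as the Deepseek-R1 distill bases do), you can instead format the question with `apply_chat_template`; a minimal sketch, reusing the `model` and `tokenizer` loaded above:

```python
# Optional variant (assumes the checkpoint provides a chat template):
# format the question as a chat turn before generation.
messages = [{"role": "user", "content": "Is 123 a prime?"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant prefix for generation
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```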