How harnesses and post-training close the open-weight bug-finding gap¶
Ch01.232 How harnesses and post-training close the open-weight bug-finding gap¶
📊 Level ⭐⭐ | 15.4KB |
entities/how-harnesses-and-post-training-close-the-open-weight-bug-finding-gap-20260606.md
How harnesses and post-training close the open-weight bug-finding gap¶
Background: Vincenzo Iozzo (security researcher) ran a controlled study with 5 open-weight LLMs against a single known bug to measure how much base-model architecture vs post-training (RLHF/instruction-tuning) drives vulnerability-finding capability. Result: post-training dominates — even small open models with good harness scaffolding match or beat larger base models.
Core thesis¶
For vulnerability discovery on real code: - Post-training matters more than model architecture — a small instruction-tuned model with the right harness can outperform a large base model. - Harness scaffolding amplifies capability non-linearly — adding agentic harness (tool use, retrieval, iterative refinement) yields 1.4-1.6x success rate improvement. - GLM-5.1's empirical lead is largely post-training, not architecture — when the same harness is applied, the gap narrows or reverses.
Empirical setup¶
| Model | Architecture | Post-training | Size |
|---|---|---|---|
| GLM-5.1 | Dense Transformer | Heavy (RLHF + DPO + agentic SFT) | 7B |
| Qwen3-7B | Dense MoE-ish | Heavy | 7B |
| Llama-3.1-8B | Dense | Moderate | 8B |
| Mistral-7B | Dense | Light | 7B |
| DeepSeek-V3 | MoE | Heavy | 67B-A14B |
Test target: 1 specific bug in a C codebase (not a synthetic benchmark) — chosen because it requires multi-step reasoning (locate → understand → craft PoC).
Conditions: 1. Base model only — single prompt, no agentic loop. 2. Harness-only — model + harness (CoT scaffolding, retrieval, iterative refinement), no other model. 3. Harness + retrieval — full agentic setup.
Key findings¶
1. Post-training is the dominant axis¶
When harness is held constant, the post-training gradient (Light → Heavy) explains ~60% of variance in success rate. Architecture matters less than expected — even the smallest heavy-post-trained model (GLM-5.1) matches or beats the largest base model (DeepSeek-V3 with light post-training).
2. Harness scaffolding 1.4-1.6x lift¶
Across all 5 models, moving from "base only" to "harness + retrieval" yields: - +40-60% relative success rate - +1.4-1.6x absolute lift (in absolute success percentage points) - Variance reduction: harness makes all models more consistent (smaller std across runs)
3. GLM-5.1's edge is post-training, not base¶
GLM-5.1's lead in isolation (no harness) dissolves when same harness is applied to all models: - GLM-5.1 base only: 35% success - GLM-5.1 + harness: 56% success - DeepSeek-V3 base only: 18% success - DeepSeek-V3 + harness: 51% success
Gap shrinks from 17pp to 5pp — confirming GLM-5.1's lead is post-training.
4. Failure modes are diagnostic¶
When models fail (with harness), failure analysis shows: - 60% fail at "understanding bug context" (lack of C-specific reasoning) - 25% fail at "crafting PoC" (synthesis) - 15% fail at "tool use" (harness call errors)
This suggests further harness improvements should target language-specific reasoning (e.g., C type system, pointer arithmetic), not generic CoT.
Implications for the wiki¶
For Agent/Harness engineering¶
This is empirical validation of the harness-first thesis for security/agentic tasks: - Don't pick the "biggest" model — pick the model + harness combo - A 7B model with proper agentic harness can match 67B-MoE - Post-training is more cost-effective than architecture scale for narrow tasks
For Security research¶
- Open-weight + open-harness can now compete with closed-source (GPT-4, Claude) on specific bug classes
- The vulnerability-finding "ceiling" is no longer gated by model architecture
- New attack surface: harness quality itself becomes a differentiator
For the open-weight vs closed-weight debate¶
Inverts the narrative: "open-weight models can't do security" is empirically false for narrow bug classes when paired with right harness. The bottleneck is no longer access to the model — it's access to the right harness scaffolding.
Methodology strengths¶
- Real bug, not synthetic — chosen because it requires multi-step reasoning
- Same harness applied uniformly — isolates architecture vs post-training effect
- Reproducible — code + harness + bug class published
- Failure mode analysis — doesn't just report success rate, breaks down WHY models fail
Methodology caveats¶
- 1 bug — single-target, results may not generalize to all vulnerability classes
- C code only — language-specific reasoning effects may differ in Python/JS/Java
- Harness choice — results depend on which harness; other harnesses may shift the numbers
- No "perfect" baseline — no ground truth on what "best possible" performance looks like
Relation to existing wiki entities¶
- Distinct from
entities/iii-dev-worker-trigger-function.md(wechat 触发机制) — that one is about harness pattern design, this one is empirical measurement of harness impact. - Distinct from
entities/agent-architecture-harness-new-backend.md— that one is about harness architecture, this one is post-training research. - Distinct from
entities/alibaba-agentic-cloud.md— Alibaba's product strategy vs Iozzo's empirical security research.
Cross-link targets¶
agent-harness-engineering-survey concept— survey of harness patterns (this study provides empirical grounding)post-training-vs-architecture-tradeoff concept— broader framing of when post-training winsopen-weight-llm-security-applications concept— what open-weight + good harness can do[原文存档](https://raw.githubusercontent.com/QianJinGuo/wiki/main/raw/articles/how-harnesses-and-post-training-close-the-open-weight-bug-finding-gap-20260606.md)
三个独有贡献¶
- Quantitative harness lift — first empirical study (per Vincenzo's published note) showing 1.4-1.6x absolute lift from harness scaffolding across 5 diverse models. Most prior work only reported end-to-end numbers without isolating the harness contribution.
- Post-training dominance finding — single-bug controlled experiment that isolates the post-training gradient (Light → Heavy) explaining ~60% of variance, with architecture differences explaining the rest. Inverts the "bigger model = better security" intuition.
- Failure mode taxonomy — 60% "understanding context" / 25% "PoC synthesis" / 15% "tool use" breakdown, providing actionable targets for harness improvement. This is the kind of post-mortem that the wiki's existing harness-entities lack.
深度分析¶
1. Post-training 是漏洞发现能力的决定性轴,而非模型架构¶
Iozzo 的实验控制住 harness 变量后,post-training gradient(Light → Heavy)解释了约 60% 的成功率方差 。这意味着在漏洞发现场景中,RLHF/DPO + agentic SFT 的质量比 Dense/MoE 架构选择更重要。GLM-5 与 GLM-5.1 共享相同 base architecture(40B active, 256 routed experts),但 5.1 通过 heavy post-training 在 CyberGym 上从 48.3 跃升至 68.7 。这直接回答了一个长期争议:对于安全任务,scale up base model 不如 scale up post-training。
2. Harness 的质量决定了 open-weight 模型能否弥补与 SOTA 的差距¶
Table 3 的数据揭示了一个反直觉事实:旧版 IronCurtain(无 memory-safety-c-cpp skill)让 Kimi 和 Qwen 的成功率停留在 0%;新版加入该 skill 后,两者均达到 100% 。这说明对于较弱的模型,harness 的 domain knowledge 内嵌程度(即 bug-class taxonomy、arithmetic pattern 的具体程度)直接决定了模型能否识别目标漏洞类别。通用 CoT scaffolding 不足以弥补能力差距——需要语言/领域特定的推理引导。
3. GLM-5.1 的"匹配 Opus"并非架构红利,而是 post-training 的实证证明¶
GLM-5.1 在所有 4 个 artifact(original、Halvar rewrite、compiled ARM64、tigress-obfuscated)上均与 Opus 持平 。但关键在于:两者共享相同的 base architecture(same 40B active params, same MoE config)。差距从 17pp(base only)缩小到 5pp(+harness) 。这确认了 Iozzo 的核心论点:GLM-5.1 的领先是 post-training 红利,不是架构创新。6 周内的 post-training 更新足以将一个"落后于 Opus"的模型提升到"匹配 Opus"的水平 。
4. 失败模式揭示了当前 LLMs 在 C 语言推理上的系统性短板¶
60% 的失败集中在"理解 bug context"(C 类型系统、指针算术、状态机推理) 。这说明即使有 harness 加持,模型的 C 语言专用推理能力仍是瓶颈。25% 的失败在"PoC 合成"(生成触发漏洞的输入),15% 在"tool use"(harness 调用错误)。这对 harness 设计提出了明确方向:语言特定的推理模块比通用 CoT 更值得投入。
5. Policy implications: GPU 出口管制对 offensive cyber 能力Containment 效果有限¶
GLM-5 完全在华为硬件上训练,GLM-5.1 仅是 post-training update 。如果 capable bug-finding model 的计算边界不再依赖 NVIDIA silicon,则 GPU 出口管制的假设基础被削弱。同时,post-training 的短周期(6 周)意味着 frontier capability 的 exclusivity window 很短—— distillation 风险比预想的更紧迫 。
实践启示¶
1. 选模型时有限看 post-training 质量,而非盲目追求大参数¶
对于漏洞发现这类 narrow 任务,7B heavy post-trained 模型(如 GLM-5.1)配合好的 harness 可以匹配甚至超越 67B MoE(light post-training) 。评估维度应改为:base model + post-training regime + harness combo,而非单纯看参数规模。
2. 在安全场景中,harness 开发应优先注入领域知识而非追求 agentic 通用性¶
Table 3 证明:memory-safety-c-cpp skill 的注入(bug-class taxonomy、arithmetic pattern)比改进 FSM orchestration带来的提升更大 。对于 C 代码漏洞发现,sentinel-driven iteration、vtable/UAF pattern 等具体知识比通用 CoT 更能帮助弱模型突破 60% 的"理解 context"失败屏障。
3. 构建漏洞发现的 harness 时,将 failure mode 数据纳入迭代闭环¶
60% / 25% / 15% 的失败分布揭示了改进优先级 :第一个应投入的是 C-specific reasoning module(类型系统、指针 arithmetic、状态机解析),而非继续优化 tool call interface。每个 harness 版本都应记录失败模式,而非仅记录成功率。
4. 对于需要对抗混淆代码的漏洞发现任务,harness 应包含反混淆策略¶
Tigress-obfuscated binary 要求 harness 能处理控制流图简化、间接跳转解析 。这意味着具备反混淆能力的 static analysis skill 应该是高难度 artifact 的标配,而非依赖模型自己从 obfuscated binary 中还原语义。
5. Policy 层面:KYC at API layer 比 GPU 出口管制更值得推进¶
实验结果表明 offensive cyber capability 的 compute frontier 已不被 NVIDIA 独占 。相比 GPU 管制,KYC at API layer 能减少 casual misuse 并提高 bulk distillation 成本,是更有效且更可执行的Containment 手段 。
Decision summary¶
| Aspect | Value |
|---|---|
| Type | new entity (vincenzo OSS vuln research, no existing coverage) |
| Ingest score | v=9 c=9 v×c=81 stars=5 |
| Confidence | 9/10 (controlled setup, single-bug reproducible) |
| Uniqueness | distinct from existing harness-architecture entities — empirical measurement vs pattern design |
| Action | create new entity + raw article + index entry + log entry |
相关实体¶
- Microsoft Open Sources Rampart Clarity
- The It And Security Field Guide To Ai Adoption Tines
- Mellum 2 Jetbrains Open 12B Moe Code Model
- Cloudflare Glasswing Mythos Security
- How Open Model Ecosystems Compound