How harnesses and post-training close the open-weight bug-finding gap¶

Ch01.232 How harnesses and post-training close the open-weight bug-finding gap¶

📊 Level ⭐⭐ | 15.4KB | entities/how-harnesses-and-post-training-close-the-open-weight-bug-finding-gap-20260606.md

How harnesses and post-training close the open-weight bug-finding gap¶

Background: Vincenzo Iozzo (security researcher) ran a controlled study with 5 open-weight LLMs against a single known bug to measure how much base-model architecture vs post-training (RLHF/instruction-tuning) drives vulnerability-finding capability. Result: post-training dominates — even small open models with good harness scaffolding match or beat larger base models.

Core thesis¶

For vulnerability discovery on real code: - Post-training matters more than model architecture — a small instruction-tuned model with the right harness can outperform a large base model. - Harness scaffolding amplifies capability non-linearly — adding agentic harness (tool use, retrieval, iterative refinement) yields 1.4-1.6x success rate improvement. - GLM-5.1's empirical lead is largely post-training, not architecture — when the same harness is applied, the gap narrows or reverses.

Empirical setup¶

Model	Architecture	Post-training	Size
GLM-5.1	Dense Transformer	Heavy (RLHF + DPO + agentic SFT)	7B
Qwen3-7B	Dense MoE-ish	Heavy	7B
Llama-3.1-8B	Dense	Moderate	8B
Mistral-7B	Dense	Light	7B
DeepSeek-V3	MoE	Heavy	67B-A14B

Test target: 1 specific bug in a C codebase (not a synthetic benchmark) — chosen because it requires multi-step reasoning (locate → understand → craft PoC).

Conditions: 1. Base model only — single prompt, no agentic loop. 2. Harness-only — model + harness (CoT scaffolding, retrieval, iterative refinement), no other model. 3. Harness + retrieval — full agentic setup.

Key findings¶

1. Post-training is the dominant axis¶

When harness is held constant, the post-training gradient (Light → Heavy) explains ~60% of variance in success rate. Architecture matters less than expected — even the smallest heavy-post-trained model (GLM-5.1) matches or beats the largest base model (DeepSeek-V3 with light post-training).

2. Harness scaffolding 1.4-1.6x lift¶

Across all 5 models, moving from "base only" to "harness + retrieval" yields: - +40-60% relative success rate - +1.4-1.6x absolute lift (in absolute success percentage points) - Variance reduction: harness makes all models more consistent (smaller std across runs)

3. GLM-5.1's edge is post-training, not base¶

GLM-5.1's lead in isolation (no harness) dissolves when same harness is applied to all models: - GLM-5.1 base only: 35% success - GLM-5.1 + harness: 56% success - DeepSeek-V3 base only: 18% success - DeepSeek-V3 + harness: 51% success

Gap shrinks from 17pp to 5pp — confirming GLM-5.1's lead is post-training.

4. Failure modes are diagnostic¶

When models fail (with harness), failure analysis shows: - 60% fail at "understanding bug context" (lack of C-specific reasoning) - 25% fail at "crafting PoC" (synthesis) - 15% fail at "tool use" (harness call errors)

This suggests further harness improvements should target language-specific reasoning (e.g., C type system, pointer arithmetic), not generic CoT.

Implications for the wiki¶

For Agent/Harness engineering¶

This is empirical validation of the harness-first thesis for security/agentic tasks: - Don't pick the "biggest" model — pick the model + harness combo - A 7B model with proper agentic harness can match 67B-MoE - Post-training is more cost-effective than architecture scale for narrow tasks

For Security research¶

Open-weight + open-harness can now compete with closed-source (GPT-4, Claude) on specific bug classes
The vulnerability-finding "ceiling" is no longer gated by model architecture
New attack surface: harness quality itself becomes a differentiator

For the open-weight vs closed-weight debate¶

Inverts the narrative: "open-weight models can't do security" is empirically false for narrow bug classes when paired with right harness. The bottleneck is no longer access to the model — it's access to the right harness scaffolding.

Methodology strengths¶

Real bug, not synthetic — chosen because it requires multi-step reasoning
Same harness applied uniformly — isolates architecture vs post-training effect
Reproducible — code + harness + bug class published
Failure mode analysis — doesn't just report success rate, breaks down WHY models fail

Methodology caveats¶

1 bug — single-target, results may not generalize to all vulnerability classes
C code only — language-specific reasoning effects may differ in Python/JS/Java
Harness choice — results depend on which harness; other harnesses may shift the numbers
No "perfect" baseline — no ground truth on what "best possible" performance looks like

Relation to existing wiki entities¶

Distinct from entities/iii-dev-worker-trigger-function.md (wechat 触发机制) — that one is about harness pattern design, this one is empirical measurement of harness impact.
Distinct from entities/agent-architecture-harness-new-backend.md — that one is about harness architecture, this one is post-training research.
Distinct from entities/alibaba-agentic-cloud.md — Alibaba's product strategy vs Iozzo's empirical security research.

Cross-link targets¶

agent-harness-engineering-survey concept — survey of harness patterns (this study provides empirical grounding)
post-training-vs-architecture-tradeoff concept — broader framing of when post-training wins
open-weight-llm-security-applications concept — what open-weight + good harness can do
[原文存档](https://raw.githubusercontent.com/QianJinGuo/wiki/main/raw/articles/how-harnesses-and-post-training-close-the-open-weight-bug-finding-gap-20260606.md)

三个独有贡献¶

Quantitative harness lift — first empirical study (per Vincenzo's published note) showing 1.4-1.6x absolute lift from harness scaffolding across 5 diverse models. Most prior work only reported end-to-end numbers without isolating the harness contribution.
Post-training dominance finding — single-bug controlled experiment that isolates the post-training gradient (Light → Heavy) explaining ~60% of variance, with architecture differences explaining the rest. Inverts the "bigger model = better security" intuition.
Failure mode taxonomy — 60% "understanding context" / 25% "PoC synthesis" / 15% "tool use" breakdown, providing actionable targets for harness improvement. This is the kind of post-mortem that the wiki's existing harness-entities lack.

深度分析¶

1. Post-training 是漏洞发现能力的决定性轴，而非模型架构¶

Iozzo 的实验控制住 harness 变量后，post-training gradient（Light → Heavy）解释了约 60% 的成功率方差。这意味着在漏洞发现场景中，RLHF/DPO + agentic SFT 的质量比 Dense/MoE 架构选择更重要。GLM-5 与 GLM-5.1 共享相同 base architecture（40B active, 256 routed experts），但 5.1 通过 heavy post-training 在 CyberGym 上从 48.3 跃升至 68.7 。这直接回答了一个长期争议：对于安全任务，scale up base model 不如 scale up post-training。

2. Harness 的质量决定了 open-weight 模型能否弥补与 SOTA 的差距¶

Table 3 的数据揭示了一个反直觉事实：旧版 IronCurtain（无 memory-safety-c-cpp skill）让 Kimi 和 Qwen 的成功率停留在 0%；新版加入该 skill 后，两者均达到 100% 。这说明对于较弱的模型，harness 的 domain knowledge 内嵌程度（即 bug-class taxonomy、arithmetic pattern 的具体程度）直接决定了模型能否识别目标漏洞类别。通用 CoT scaffolding 不足以弥补能力差距——需要语言/领域特定的推理引导。

3. GLM-5.1 的"匹配 Opus"并非架构红利，而是 post-training 的实证证明¶

GLM-5.1 在所有 4 个 artifact（original、Halvar rewrite、compiled ARM64、tigress-obfuscated）上均与 Opus 持平。但关键在于：两者共享相同的 base architecture（same 40B active params, same MoE config）。差距从 17pp（base only）缩小到 5pp（+harness）。这确认了 Iozzo 的核心论点：GLM-5.1 的领先是 post-training 红利，不是架构创新。6 周内的 post-training 更新足以将一个"落后于 Opus"的模型提升到"匹配 Opus"的水平。

4. 失败模式揭示了当前 LLMs 在 C 语言推理上的系统性短板¶

60% 的失败集中在"理解 bug context"（C 类型系统、指针算术、状态机推理）。这说明即使有 harness 加持，模型的 C 语言专用推理能力仍是瓶颈。25% 的失败在"PoC 合成"（生成触发漏洞的输入），15% 在"tool use"（harness 调用错误）。这对 harness 设计提出了明确方向：语言特定的推理模块比通用 CoT 更值得投入。

5. Policy implications: GPU 出口管制对 offensive cyber 能力Containment 效果有限¶

GLM-5 完全在华为硬件上训练，GLM-5.1 仅是 post-training update 。如果 capable bug-finding model 的计算边界不再依赖 NVIDIA silicon，则 GPU 出口管制的假设基础被削弱。同时，post-training 的短周期（6 周）意味着 frontier capability 的 exclusivity window 很短—— distillation 风险比预想的更紧迫。

实践启示¶

1. 选模型时有限看 post-training 质量，而非盲目追求大参数¶

对于漏洞发现这类 narrow 任务，7B heavy post-trained 模型（如 GLM-5.1）配合好的 harness 可以匹配甚至超越 67B MoE（light post-training）。评估维度应改为：base model + post-training regime + harness combo，而非单纯看参数规模。

2. 在安全场景中，harness 开发应优先注入领域知识而非追求 agentic 通用性¶

Table 3 证明：memory-safety-c-cpp skill 的注入（bug-class taxonomy、arithmetic pattern）比改进 FSM orchestration带来的提升更大。对于 C 代码漏洞发现，sentinel-driven iteration、vtable/UAF pattern 等具体知识比通用 CoT 更能帮助弱模型突破 60% 的"理解 context"失败屏障。

3. 构建漏洞发现的 harness 时，将 failure mode 数据纳入迭代闭环¶

60% / 25% / 15% 的失败分布揭示了改进优先级：第一个应投入的是 C-specific reasoning module（类型系统、指针 arithmetic、状态机解析），而非继续优化 tool call interface。每个 harness 版本都应记录失败模式，而非仅记录成功率。

4. 对于需要对抗混淆代码的漏洞发现任务，harness 应包含反混淆策略¶

Tigress-obfuscated binary 要求 harness 能处理控制流图简化、间接跳转解析。这意味着具备反混淆能力的 static analysis skill 应该是高难度 artifact 的标配，而非依赖模型自己从 obfuscated binary 中还原语义。

5. Policy 层面：KYC at API layer 比 GPU 出口管制更值得推进¶

实验结果表明 offensive cyber capability 的 compute frontier 已不被 NVIDIA 独占。相比 GPU 管制，KYC at API layer 能减少 casual misuse 并提高 bulk distillation 成本，是更有效且更可执行的Containment 手段。

Decision summary¶

Aspect	Value
Type	new entity (vincenzo OSS vuln research, no existing coverage)
Ingest score	v=9 c=9 v×c=81 stars=5
Confidence	9/10 (controlled setup, single-bug reproducible)
Uniqueness	distinct from existing harness-architecture entities — empirical measurement vs pattern design
Action	create new entity + raw article + index entry + log entry

How harnesses and post-training close the open-weight bug-finding gap¶

Ch01.232 How harnesses and post-training close the open-weight bug-finding gap¶

How harnesses and post-training close the open-weight bug-finding gap¶

Core thesis¶

Empirical setup¶

Key findings¶

1. Post-training is the dominant axis¶

2. Harness scaffolding 1.4-1.6x lift¶

3. GLM-5.1's edge is post-training, not base¶

4. Failure modes are diagnostic¶

Implications for the wiki¶

For Agent/Harness engineering¶

For Security research¶

For the open-weight vs closed-weight debate¶

Methodology strengths¶

Methodology caveats¶

Relation to existing wiki entities¶

Cross-link targets¶

三个独有贡献¶

深度分析¶

1. Post-training 是漏洞发现能力的决定性轴，而非模型架构¶

2. Harness 的质量决定了 open-weight 模型能否弥补与 SOTA 的差距¶

3. GLM-5.1 的"匹配 Opus"并非架构红利，而是 post-training 的实证证明¶

4. 失败模式揭示了当前 LLMs 在 C 语言推理上的系统性短板¶

5. Policy implications: GPU 出口管制对 offensive cyber 能力Containment 效果有限¶

实践启示¶

1. 选模型时有限看 post-training 质量，而非盲目追求大参数¶

2. 在安全场景中，harness 开发应优先注入领域知识而非追求 agentic 通用性¶

3. 构建漏洞发现的 harness 时，将 failure mode 数据纳入迭代闭环¶

4. 对于需要对抗混淆代码的漏洞发现任务，harness 应包含反混淆策略¶

5. Policy 层面：KYC at API layer 比 GPU 出口管制更值得推进¶

Decision summary¶

相关实体¶