Small models also found the vulnerabilities that Mythos found

Introduction

The authors present the results of experiments on several LLMs, including Claude, OpenAI models, and DeepSeek, to assess their ability to detect and exploit security vulnerabilities.

Experiment 1: FreeBSD NFS exploit detection

The researchers isolated a specific vulnerability in the FreeBSD NFS code (CVE-2026-4747) and asked various LLMs to review it for security flaws. All eight models correctly identified the stack buffer overflow and its potential for remote code execution. However, when asked to propose ways to deliver a ROP chain within a 1000-byte payload constraint, none of the models arrived at the exact multi-round RPC approach used by Mythos.

Experiment 2: OpenBSD SACK bug detection

The authors tested the LLMs' ability to detect the subtle 27-year-old OpenBSD TCP SACK vulnerability. GPT-OSS-120b correctly recovered the core public chain in a single API call and proposed the correct mitigation, which matches the actual OpenBSD patch.

Experiment 3: Patched vs unpatched code detection

The researchers tested whether LLMs can tell when a bug has been fixed by running both the patched and unpatched versions of the FreeBSD function through various models. All models detected the vulnerability in the unpatched code, but most also flagged the patched code as vulnerable, which is a false positive.
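
The patched-versus-unpatched comparison comes down to two rates: sensitivity (how often the unpatched code is flagged) and specificity (how often the patched code is left alone). The following is a minimal sketch of that arithmetic, not the authors' evaluation harness. The `Verdict` type, the model names, and the verdict values are all hypothetical placeholders chosen only to illustrate the calculation; they are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """One model's answers for the two versions of the same function."""
    model: str               # hypothetical model label
    flagged_unpatched: bool  # True if the model reported a bug in the vulnerable code
    flagged_patched: bool    # True if the model reported a bug in the fixed code


def sensitivity(verdicts: list[Verdict]) -> float:
    """Fraction of models that flagged the unpatched (truly vulnerable) code."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(v.flagged_unpatched for v in verdicts) / len(verdicts)


def specificity(verdicts: list[Verdict]) -> float:
    """Fraction of models that did NOT flag the patched (fixed) code."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(not v.flagged_patched for v in verdicts) / len(verdicts)


# Hypothetical verdicts for illustration only, not data from the paper.
verdicts = [
    Verdict("model-a", flagged_unpatched=True, flagged_patched=True),
    Verdict("model-b", flagged_unpatched=True, flagged_patched=True),
    Verdict("model-c", flagged_unpatched=True, flagged_patched=False),
    Verdict("model-d", flagged_unpatched=True, flagged_patched=True),
]

print(f"sensitivity: {sensitivity(verdicts):.2f}")
print(f"specificity: {specificity(verdicts):.2f}")
```

With these placeholder values every model catches the bug, yet only one of four accepts the fix, which mirrors the pattern the summary describes: high sensitivity alongside low specificity.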

Conclusion

The authors conclude that current LLMs have made significant progress in detecting security vulnerabilities but still cannot conceive novel constrained-delivery mechanisms, a step that requires a higher level of creativity and engineering expertise. They suggest that the gap between "can reason about exploitation" and "can independently conceive a novel constrained-delivery mechanism" is an important capability boundary for LLMs.

Takeaways

Current LLMs, including small ones, can reliably detect these vulnerabilities when given the isolated code.
However, they often fail to recognize when code has been fixed (poor specificity).
LLMs can reason fluently about exploitation but lack the creative engineering step required to construct novel exploits.
The full system (scaffold and triage layer) is essential for catching errors and false positives.

Overall, this paper provides valuable insights into the capabilities and limitations of LLMs in security-related tasks, highlighting areas where further research and development are needed.
