NEW YORK – Defenders are used to having a grace period of a few hours to days, or even weeks, to mitigate a vulnerability before a public exploit appears. If an AI could sift through the roughly 130 Common Vulnerabilities and Exposures (CVEs) released per day in minutes and create working exploits, that “grace period” may no longer apply.

The system we have built uses a multi-stage pipeline: it (1) analyzes CVE advisories and code patches, (2) creates both vulnerable test applications and exploit code, and (3) validates exploits by testing them against vulnerable and patched versions to eliminate false positives. Scaling this up would allow an AI to process the daily stream of 130+ CVEs far faster (and more cost-efficiently) than human researchers, write Efi Weiss and Nahman Khayet.
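As a rough illustration of step (3), the core check fits in a few lines: an exploit only counts if it fires against the vulnerable version and not against the patched one. This is a minimal sketch; `run_exploit` and the app handles are hypothetical stand-ins, not the authors' code.

```python
# Minimal sketch of the validation in step (3). run_exploit() and the app handles are
# hypothetical stand-ins: an exploit is kept only if it works on the vulnerable version
# and fails on the patched one, which filters out false positives.
def validate_exploit(exploit, vulnerable_app, patched_app, run_exploit) -> bool:
    works_on_vulnerable = run_exploit(exploit, vulnerable_app).success
    works_on_patched = run_exploit(exploit, patched_app).success
    # Succeeding on both usually means the test measures something other than the bug.
    return works_on_vulnerable and not works_on_patched
```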

Intro

Since Large Language Models entered our lives, people have continuously tried to find the most complex thing they can make them do. Some uses are good, like researching protein folding or planning your vacation. Others, like copycat phishing sites, are not.

A “cyber-security holy grail” is to have an LLM autonomously exploit a system. There is already some interesting traction in the field: XBow reached 1st place on HackerOne, and Pattern Labs shared a cool experiment with GPT-5.

But what about building actual exploits? LLMs suffer a collapse in accuracy after a certain amount of reasoning. If we can arm an AI with deterministic exploits, the reasoning chain becomes simpler and more accurate.

The methodology we chose going into this was as follows (a code skeleton of these steps appears after the list):

  1. Data preparation – Use the advisory and the repository to understand how to create an exploit. This is a good job for LLMs: advisories are mostly text, and the same is true for code. The advisory will usually contain good hints to guide the LLM.
  2. Context enrichment – Prompt the LLM in guided steps to build a rich context around exploitation: how is the payload constructed? What is the flow to the vulnerability?
  3. Evaluation loop – Create an exploit and an example “vulnerable app”, and test them against each other until the exploit works.
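Here is that skeleton. Every name and field below is illustrative rather than taken from the real pipeline, which is a chain of prompts rather than three tidy functions.

```python
# Illustrative skeleton of the three-step methodology; names and fields are assumptions.
from dataclasses import dataclass


@dataclass
class Advisory:
    cve_id: str
    description: str
    repo_url: str
    vulnerable_version: str
    patched_version: str


@dataclass
class ExploitationContext:
    root_cause: str              # what the patch actually fixes
    payload_construction: str    # how a triggering payload is built
    path_to_vulnerability: str   # how input reaches the vulnerable code


def prepare_data(advisory: Advisory) -> str:
    """Step 1: combine the advisory text with the patch extracted from the repository."""
    ...


def enrich_context(analysis: str) -> ExploitationContext:
    """Step 2: guided prompts answering how the payload is built and how the bug is reached."""
    ...


def evaluation_loop(context: ExploitationContext) -> str:
    """Step 3: generate an exploit plus a vulnerable test app and iterate until the exploit works."""
    ...
```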

Stage 0 – The model

SaaS models from OpenAI, Anthropic, or Google usually have guardrails that cause the model to refuse to build PoCs, either explicitly or by providing generic “fill here” templates. We started with `qwen3:8b` hosted locally on our MacBooks and later moved to `gpt-oss:20b` when it was released. This was really useful, as it allowed us to experiment for free until reaching a high level of maturity. A bit later, we found out that given the long step-by-step prompt chain we ended up making, the SaaS models stopped refusing to help :)
Claude Sonnet 4 was the best model for generating the PoCs, as it was the strongest performer at coding (Opus seemed to offer a negligible improvement for roughly five times the cost).
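For the local experimentation phase, driving a model hosted with Ollama can be as simple as pointing an OpenAI-compatible client at the local endpoint. This is a sketch assuming Ollama's defaults; the model tags and prompt are examples, not the authors' exact setup.

```python
# Sketch of driving a locally hosted model during development, assuming Ollama's
# OpenAI-compatible endpoint on its default port; model tags are examples.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored locally

response = client.chat.completions.create(
    model="qwen3:8b",  # later swapped for a larger local or hosted model once prompts matured
    messages=[
        {"role": "system", "content": "You are a vulnerability research assistant."},
        {"role": "user", "content": "Summarize the root cause described in this advisory: ..."},
    ],
)
print(response.choices[0].message.content)
```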

Stage 0.5 – The Agent

We started by interfacing directly with the LLM APIs, but later refactored to pydantic-ai (which is amazing, BTW): type safety is extra important when dealing with something as amorphous as an LLM. Another really important piece was caching. LLMs are slow and expensive, so we implemented a cache layer very early on. This allowed us to speed up testing and only rerun prompts that changed or whose dependencies changed.
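A minimal sketch of such a cache layer, keyed on a hash of the model, the prompt, and the outputs it depends on, so only prompts whose inputs changed get re-run; the storage layout and helper names are our own assumptions, not the authors' implementation.

```python
# Sketch of a prompt cache keyed on the model, the prompt, and the outputs it depends on,
# so only prompts whose inputs changed are re-run. Paths and helper names are assumptions.
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".llm_cache")
CACHE_DIR.mkdir(exist_ok=True)


def cache_key(model: str, prompt: str, dependencies: list[str]) -> str:
    payload = json.dumps({"model": model, "prompt": prompt, "deps": dependencies}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def cached_completion(model: str, prompt: str, dependencies: list[str], call_llm) -> str:
    """Return a cached response when inputs are unchanged, otherwise call the model and store it."""
    path = CACHE_DIR / f"{cache_key(model, prompt, dependencies)}.json"
    if path.exists():
        return json.loads(path.read_text())["response"]
    response = call_llm(model, prompt)  # any client callable, e.g. the local-model sketch above
    path.write_text(json.dumps({"response": response}))
    return response
```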

Stage 1 – CVE → Technical analysis

A 1-day (an exploit for a freshly disclosed, already-patched vulnerability) begins with an advisory release, usually a CVE advisory. Open-source projects on GitHub will usually also have a GitHub Security Advisory (GHSA).

Let’s follow CVE-2025-54887 (we didn’t want a “pass ; DROP TABLE USERS; --” kind of vulnerability, and also Ruby is weird):

The vulnerability is a cryptographic bypass that allows attackers to decrypt invalid JWEs, among other things.

We decided to query the GHSA registry in addition to the NIST one, since the GHSA advisory has more details, like:

  1. The affected git repository
  2. The affected versions and the patched version
  3. A human-readable description of the issue

Such details simplify some steps. We also built a short pipeline that clones the repository and extracts the patch from the given vulnerable and patched versions (with some LLM magic to handle edge cases).
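A sketch of what this advisory-plus-patch extraction step could look like, assuming the GitHub global security advisories REST API and plain git; the tag naming scheme is an assumption, and resolving it for real repositories is exactly where the “LLM magic” comes in.

```python
# Sketch of the advisory + patch extraction, assuming the GitHub global security advisories
# REST API and plain git. The tag naming scheme (vX.Y.Z) is an assumption; edge cases like
# missing tags or monorepos need extra handling.
import subprocess

import requests


def fetch_advisory(ghsa_id: str) -> dict:
    resp = requests.get(f"https://api.github.com/advisories/{ghsa_id}", timeout=30)
    resp.raise_for_status()
    return resp.json()


def extract_patch(repo_url: str, vulnerable_version: str, patched_version: str) -> str:
    subprocess.run(["git", "clone", "--quiet", repo_url, "repo"], check=True)
    diff = subprocess.run(
        ["git", "-C", "repo", "diff", f"v{vulnerable_version}", f"v{patched_version}"],
        check=True, capture_output=True, text=True,
    )
    return diff.stdout
```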

Now we can feed the advisory and the patch to the LLM and ask it to analyze them in steps, guiding itself toward a plan for building the exploit.

We broke the task down into several prompts on purpose, to let us debug the quality of each prompt separately and make iteration easier.

A few snippets:
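For illustration, here is the kind of guided analysis prompt this stage might use; the wording is our own reconstruction, not the authors' actual prompt.

```python
# A plausible reconstruction of one guided analysis prompt; the wording is ours, not the
# authors'. Placeholders are filled from the advisory text and the extracted patch.
ROOT_CAUSE_PROMPT = """\
You are analyzing a vulnerability to understand how it can be triggered.

Advisory:
{advisory_text}

Patch (unified diff):
{patch}

Answer step by step:
1. Which function or check does the patch change, and how did it behave before the fix?
2. What attacker-controlled input reaches that code path, and through which public API?
3. What properties must a payload have to trigger the vulnerable behavior?
"""
```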

After guiding the agent through a thorough enough analysis, we can use a summarized report of the research as context for the next agents.
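That summarized report works best as a typed object rather than free text, for example as a pydantic model; the field names below are assumptions about what such a report might contain.

```python
# Sketch of the summarized research report as a typed model, so downstream agents get
# structured context instead of free text; the field names are assumptions.
from pydantic import BaseModel


class ResearchReport(BaseModel):
    cve_id: str
    root_cause: str                   # what the patch fixes, in a sentence or two
    entry_point: str                  # public API or endpoint that reaches the vulnerable code
    payload_requirements: list[str]   # properties a payload needs to trigger the bug
    suggested_poc_strategy: str
```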

Stage 2 – Test plan

We want to create a working PoC for open-source packages. Anyone who has coded with AI knows that the chances of getting exactly what you expect on the first try are essentially zero. Coding models are not that good at creating working code without evaluation loops. So to generate good exploits, we must create a test environment, a vulnerable app and an exploit, and test them against each other. After each test, we can provide the agent with the results and ask it to refine its approach.
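A minimal sketch of that loop, with `generate_exploit`, `build_app`, `run_exploit`, and `refine_exploit` as hypothetical helpers standing in for the real agents and sandbox:

```python
# Minimal sketch of the evaluation loop: build the vulnerable app and the exploit, run them
# against each other, and feed failures back for refinement. generate_exploit(), build_app(),
# run_exploit() and refine_exploit() are hypothetical helpers, not the authors' code.
MAX_ATTEMPTS = 5


def exploit_loop(report, generate_exploit, build_app, run_exploit, refine_exploit):
    app = build_app(report)            # e.g. a container running the vulnerable package version
    exploit = generate_exploit(report)
    for _ in range(MAX_ATTEMPTS):
        result = run_exploit(exploit, app)
        if result.success:
            return exploit
        # Hand the stdout/stderr back to the agent and ask it to adjust the payload or the app.
        exploit = refine_exploit(exploit, result.logs)
    raise RuntimeError(f"No working exploit after {MAX_ATTEMPTS} attempts")
```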

Read the rest of this column at ValmareLox