On May 11, Google Threat Intelligence Group (GTIG) published its 2026 AI Threat Tracker. Buried in the executive summary is a sentence that should change how every AppSec team thinks about coverage:
“For the first time, GTIG has identified a threat actor using a zero-day exploit that we believe was developed with AI.”
The headline is the zero day. The deeper story is what kind of zero day it was, and why the entire generation of code scanners enterprises currently rely on was structurally incapable of finding it first.
The vulnerability in question
The exploit was a two-factor authentication bypass in a popular open source web-based system administration tool. GTIG's analysis is precise about its nature:
“The vulnerability can be classified as a 2FA bypass, though it requires valid user credentials in the first place. It stems not from common implementation errors like memory corruption or improper input sanitization, but a high-level semantic logic flaw where the developer hardcoded a trust assumption.”
Read that carefully. This was not a buffer overflow. Not a SQL injection. Not a missing input sanitizer. It was a developer who wrote if (user.is_trusted) { skip_2fa() } somewhere in a control flow that was supposed to enforce 2FA universally. The code passed every test. It looked functionally correct. The 2FA logic existed. It just carried a contradictory exception that a human reading the file would catch and a fuzzer never would.
GTIG calls this a semantic logic flaw. We call it application logic exposure. Same animal.
Why scanners didn't catch it first
GTIG's report contains a diagram (Figure 3) comparing LLM vulnerability discovery to fuzzers and static analysis. The conclusion is uncomfortable for the SAST industry:
“While fuzzers and static analysis tools are optimized to detect sinks and crashes, frontier LLMs excel at identifying these types of high-level flaws and hardcoded static anomalies.”
Static analyzers operate on syntactic patterns. They look for taint flowing into a sink, for known dangerous APIs, for missing sanitizers on known-bad inputs. They are very good at this. They are also fundamentally blind to a category of bugs that look like:
```python
def enforce_2fa(user, request):
    if user.role == "admin" and request.source_ip in TRUSTED_RANGES:
        return True  # skip 2FA, this is a trusted admin
    return run_2fa_challenge(user)
```

There is no taint here. No injection. No sink. The grammar is clean. The function does what it says. The flaw is that the function should not exist at all, because policy says all admin access requires 2FA, and TRUSTED_RANGES was a debug constant that should have been removed before shipping.
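For contrast, here is a sketch of what the policy-compliant version of that function looks like once the trust exception is deleted. The helper and field names are illustrative assumptions, not code from the affected project:

```python
def run_2fa_challenge(user):
    # Stand-in for the real challenge flow (TOTP or WebAuthn prompt).
    # Here it just reports whether the user has completed verification.
    return user.get("totp_verified", False)

def enforce_2fa(user, request):
    # Policy: all access, admin or not, goes through the 2FA challenge.
    # No role, source IP, or leftover debug constant may short-circuit it.
    return run_2fa_challenge(user)
```

The fix is not a sanitizer or a patched API call; it is the removal of a branch. That is exactly why pattern-matching scanners had nothing to latch onto.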
This is the entire pattern of business logic vulnerabilities. They emerge from the gap between what the code says and what the system is supposed to enforce. Scanners do not have access to “what the system is supposed to enforce.” LLMs reasoning over the whole codebase, with context about adjacent files, comments, and stated security requirements, increasingly do.
GTIG's framing confirms it:
“They have an increasing ability to perform contextual reasoning, effectively reading the developer's intent to correlate the 2FA enforcement logic with the contradictions of its hardcoded exceptions. This capability can allow models to surface dormant logic errors that appear functionally correct to traditional scanners but are strategically broken from a security perspective.”
The asymmetry just inverted
For a decade the defender argument against AppSec investment ran roughly: attackers cannot scale manual code review across your repos, so as long as you fuzz and SAST the hot paths you are fine on logic bugs. That argument is dead.
GTIG documented APT45 sending thousands of repetitive prompts that recursively analyse CVEs and validate proof-of-concept exploits. Other PRC-nexus clusters are integrating specialised vulnerability corpora (the wooyun-legacy dataset of 85,000+ historical bugs) as in-context priming for Claude Code skills. They are deploying agentic frameworks like Hexstrike and Strix against real targets, with Graphiti-style temporal knowledge graphs to maintain persistent state across the attack surface.
Translation: well-resourced adversaries are now running automated business logic review at scale against your code, against your dependencies, and against any open source component you ship with. They are doing it cheaply. They are doing it continuously. And the class of bug they are best at finding is the exact class your existing scanners cannot see.
What defenders need to change
Three observations from the GTIG report deserve to land in every AppSec roadmap this quarter.
The honest take
There is a temptation in the security industry to read reports like this one as marketing tailwind for whoever sells the closest adjacent product. We will resist most of that.
The honest reading of GTIG's 2026 tracker is that the bar for what counts as adequate AppSec coverage has moved. Detecting CWE-79 and CWE-89 in HTTP handlers is the cost of entry, not the value proposition. The frontier is detecting the logic flaws that emerge from the interaction between intent, policy, and implementation across services that no individual reviewer holds in their head. Attackers now have tools to do that at industrial scale.
Defenders need them too. The neurosymbolic combination of a persistent knowledge graph plus LLM reasoning is one credible answer. It is not the only one. But the era of treating business logic exposure as a “manual pen test problem we'll get to next year” is over, because the adversary side already industrialised it.
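In miniature, the neurosymbolic idea looks like this: an LLM pass extracts policy and implementation facts into a persistent graph, and a symbolic pass surfaces the contradictions. Every name and triple below is made up for the sketch:

```python
# Illustrative facts a persistent knowledge graph might hold, as
# (subject, relation, object) triples.
policy_facts = {
    ("admin_access", "requires", "2fa_challenge"),
    ("user_access", "requires", "2fa_challenge"),
}
code_facts = {
    # e.g. extracted by an LLM pass over enforce_2fa and its callers
    ("admin_access", "bypasses", "2fa_challenge"),  # the TRUSTED_RANGES branch
    ("user_access", "requires", "2fa_challenge"),
}

def contradictions(policy, code):
    """A 'requires' edge in policy contradicted by a 'bypasses' edge in
    code on the same (subject, object) pair is a dormant logic exposure."""
    return sorted(
        (subj, obj)
        for (subj, rel, obj) in policy
        if rel == "requires" and (subj, "bypasses", obj) in code
    )
```

The design choice that matters is persistence: the graph outlives any single prompt, so contradictions can be detected across services and across time rather than within one file's context window.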
GTIG's report is mandatory reading. The full piece is on the Google Cloud blog. The CWE → CAPEC → BLADE mapping work we are publishing in the coming months is part of how we think the defensive side should respond.