On May 11, Google Threat Intelligence Group (GTIG) published its 2026 AI Threat Tracker. Buried in the executive summary is a sentence that should change how every AppSec team thinks about coverage:
“For the first time, GTIG has identified a threat actor using a zero-day exploit that we believe was developed with AI.”
The headline is the zero day. The deeper story is what kind of zero day it was, and why the entire generation of code scanners enterprises currently rely on was structurally incapable of finding it first.
The vulnerability in question
The exploit was a two-factor authentication bypass in a popular open source web-based system administration tool. GTIG's analysis is precise about its nature:
“The vulnerability can be classified as a 2FA bypass, though it requires valid user credentials in the first place. It stems not from common implementation errors like memory corruption or improper input sanitization, but a high-level semantic logic flaw where the developer hardcoded a trust assumption.”
Read that carefully. This was not a buffer overflow. Not a SQL injection. Not a missing input sanitizer. It was a developer who wrote if (user.is_trusted) { skip_2fa() } somewhere in a control flow that was supposed to enforce 2FA universally. The code passed every test. It looked functionally correct. The 2FA logic existed. It just carried a contradictory exception that a human reading the file would catch and a fuzzer never would.
GTIG calls this a semantic logic flaw. We call it application logic exposure. Same animal.
Why scanners didn't catch it first
GTIG's report contains a diagram (Figure 3) comparing LLM vulnerability discovery to fuzzers and static analysis. The conclusion is uncomfortable for the SAST industry:
“While fuzzers and static analysis tools are optimized to detect sinks and crashes, frontier LLMs excel at identifying these types of high-level flaws and hardcoded static anomalies.”
Static analyzers operate on syntactic patterns. They look for taint flowing into a sink, for known dangerous APIs, for missing sanitizers on known-bad inputs. They are very good at this. They are also fundamentally blind to a category of bugs that look like:
```python
def enforce_2fa(user, request):
    if user.role == "admin" and request.source_ip in TRUSTED_RANGES:
        return True  # skip 2FA, this is a trusted admin
    return run_2fa_challenge(user)
```

There is no taint here. No injection. No sink. The grammar is clean. The function does what it says. The flaw is that the function should not exist at all, because policy says all admin access requires 2FA, and TRUSTED_RANGES was a debug constant that should have been removed before shipping.
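For contrast, here is a sketch of what the policy-compliant version of that function looks like once the trust exception is deleted. The helper and field names are illustrative assumptions, not code from the affected project:

```python
def run_2fa_challenge(user):
    # Stand-in for the real challenge flow (TOTP or WebAuthn prompt).
    # Here it just reports whether the user has completed verification.
    return user.get("totp_verified", False)

def enforce_2fa(user, request):
    # Policy: all access, admin or not, goes through the 2FA challenge.
    # No role, source IP, or leftover debug constant may short-circuit it.
    return run_2fa_challenge(user)
```

The fix is not a sanitizer or a patched API call; it is the removal of a branch. That is exactly why pattern-matching scanners had nothing to latch onto.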
This is the entire pattern of business logic vulnerabilities. They emerge from the gap between what the code says and what the system is supposed to enforce. Scanners do not have access to “what the system is supposed to enforce.” LLMs reasoning over the whole codebase, with context about adjacent files, comments, and stated security requirements, increasingly do.
GTIG's framing confirms it:
“They have an increasing ability to perform contextual reasoning, effectively reading the developer's intent to correlate the 2FA enforcement logic with the contradictions of its hardcoded exceptions. This capability can allow models to surface dormant logic errors that appear functionally correct to traditional scanners but are strategically broken from a security perspective.”
The asymmetry just inverted
For a decade the defender argument against AppSec investment ran roughly: attackers cannot scale manual code review across your repos, so as long as you fuzz and SAST the hot paths you are fine on logic bugs. That argument is dead.
GTIG documented APT45 sending thousands of repetitive prompts that recursively analyse CVEs and validate proof-of-concept exploits. Other PRC-nexus clusters are integrating specialised vulnerability corpora (the wooyun-legacy dataset of 85,000+ historical bugs) as in-context priming for Claude Code skills. They are deploying agentic frameworks like Hexstrike and Strix against real targets, with Graphiti-style temporal knowledge graphs to maintain persistent state across the attack surface.
Translation: well-resourced adversaries are now running automated business logic review at scale against your code, against your dependencies, and against any open source component you ship with. They are doing it cheaply. They are doing it continuously. And the class of bug they are best at finding is the exact class your existing scanners cannot see.
What defenders need to change
Three observations from the GTIG report deserve to land in every AppSec roadmap this quarter.
The honest take
There is a temptation in the security industry to read reports like this one as marketing tailwind for whoever sells the closest adjacent product. We will resist most of that.
The honest reading of GTIG's 2026 tracker is that the bar for what counts as adequate AppSec coverage has moved. Detecting CWE-79 and CWE-89 in HTTP handlers is the cost of entry, not the value proposition. The frontier is detecting the logic flaws that emerge from the interaction between intent, policy, and implementation across services that no individual reviewer holds in their head. Attackers now have tools to do that at industrial scale.
Defenders need them too. The neurosymbolic combination of a persistent knowledge graph plus LLM reasoning is one credible answer. It is not the only one. But the era of treating business logic exposure as a “manual pen test problem we'll get to next year” is over, because the adversary side already industrialised it.
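In miniature, the neurosymbolic idea looks like this: an LLM pass extracts policy and implementation facts into a persistent graph, and a symbolic pass surfaces the contradictions. Every name and triple below is made up for the sketch:

```python
# Illustrative facts a persistent knowledge graph might hold, as
# (subject, relation, object) triples.
policy_facts = {
    ("admin_access", "requires", "2fa_challenge"),
    ("user_access", "requires", "2fa_challenge"),
}
code_facts = {
    # e.g. extracted by an LLM pass over enforce_2fa and its callers
    ("admin_access", "bypasses", "2fa_challenge"),  # the TRUSTED_RANGES branch
    ("user_access", "requires", "2fa_challenge"),
}

def contradictions(policy, code):
    """A 'requires' edge in policy contradicted by a 'bypasses' edge in
    code on the same (subject, object) pair is a dormant logic exposure."""
    return sorted(
        (subj, obj)
        for (subj, rel, obj) in policy
        if rel == "requires" and (subj, "bypasses", obj) in code
    )
```

The design choice that matters is persistence: the graph outlives any single prompt, so contradictions can be detected across services and across time rather than within one file's context window.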
GTIG's report is mandatory reading. The full piece is on the Google Cloud blog. The CWE → CAPEC → BLADE mapping work we are publishing in the coming months is part of how we think the defensive side should respond.