Field Notes · 2026-04-28 · 5 min read

We Scanned One Production Codebase. AI Had Quietly Shipped 4,295 Issues.

We ran StableStack against one production codebase last week. Just one. A real app, in real use, written largely with AI assistance over the last year and a half.

The scan finished in under a minute. Here is what it found.

stablestack scan ./codebase
scan complete · 691 files · 47 rules · 58s

4,295 issues found

- ERROR: 25 (block production)
- WARNING: 1,070 (require review)
- INFO: 3,200 (cleanup tier)

By category:

- Code complexity: 2,302
- Type-safety holes: 451
- Memory time-bombs: 122
- Error leakage: 98
- Performance / N+1: 72
- Timezone bugs: 71
- Async footguns: 64
- Race conditions: 35
- Security gaps: 31
- 38 other rules: 1,049

Top 3 priorities surfaced (see below)

That is from one repo. Not a bad repo. Not a junior team. A shipped, paying-customer product written by capable people using the best tools available.

We are sharing this because the pattern is the point. If your team writes code the way most teams now write code (with an AI doing a meaningful share of the typing), you almost certainly have a version of this same report waiting to be run.

What kept showing up

We will not list every rule. A few categories did most of the damage.

**Security gaps that look like normal code.** A sensitive value being written into a log line "for debugging" and never removed. A signature check that silently turns itself off when an environment variable is missing, so the check passes in production by accident. Predictable values being used where unpredictable ones are required. None of these look wrong on the screen. All of them look like working code. That is the problem.
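
As a rough sketch of what these two findings tend to look like (the handler, secret name, and variable names here are invented for illustration, not taken from the scanned repo):

```ts
import { createHmac } from "node:crypto";

// Invented webhook handler showing both patterns side by side.
function handleWebhook(payload: string, signature: string, apiKey: string): void {
  // Pattern 1: a sensitive value written into a log line "for debugging".
  console.log(`webhook received, key=${apiKey}`);

  // Pattern 2: a signature check that fails open when the secret is missing.
  const secret = process.env.WEBHOOK_SECRET;
  if (secret) {
    const expected = createHmac("sha256", secret).update(payload).digest("hex");
    if (expected !== signature) {
      throw new Error("bad signature");
    }
  }
  // If WEBHOOK_SECRET is unset in production, every request sails through.
}
```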

**Async footguns.** Functions that fire off work and then immediately move on, assuming it finished. Functions that update something and then read the old value back. Promises that are spawned and never awaited. The code runs. The code "works." Until it doesn't, intermittently, in a way that is almost impossible to reproduce by hand.
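
A minimal sketch of the pattern, with invented names (`Store`, `saveAndNotify`, `sendEmail` are illustrative, not from the scanned code):

```ts
interface User { id: string; email: string; }

interface Store {
  save(user: User): Promise<void>;
  get(id: string): Promise<User>;
}

declare function sendEmail(user: User): Promise<void>;

async function saveAndNotify(db: Store, user: User): Promise<void> {
  db.save(user);                        // fire-and-forget: a failure here is silently dropped
  const fresh = await db.get(user.id);  // can race the save and read the old record
  sendEmail(fresh);                     // also unawaited: a rejection becomes an unhandled promise

  // The intended version awaits every step:
  //   await db.save(user);
  //   const fresh = await db.get(user.id);
  //   await sendEmail(fresh);
}
```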

**Memory time-bombs.** Database queries with no limit on them. The query is fine when the table has 200 rows. It is a production incident the day the table has 200,000.
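
In query form it is a one-line omission. A sketch, assuming a generic `db.query` helper (invented here):

```ts
interface Db {
  query(sql: string, params?: unknown[]): Promise<unknown[]>;
}

// The time-bomb: loads the whole table. Fine at 200 rows, an incident at 200,000.
async function exportEvents(db: Db): Promise<unknown[]> {
  return db.query("SELECT * FROM events");
}

// A bounded alternative pages through the table with an explicit LIMIT.
async function exportEventsPage(db: Db, limit: number, offset: number): Promise<unknown[]> {
  return db.query("SELECT * FROM events ORDER BY id LIMIT $1 OFFSET $2", [limit, offset]);
}
```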

**Type-safety escape hatches.** Four hundred and fifty-one places where the type system was told "trust me" instead of being given a real answer. Each one is small. Together they undo the entire reason you have a type system.
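
What that looks like in practice, using an invented `Invoice` shape:

```ts
interface Invoice { total: number; currency: string; }

const rawBody = '{"total": 12}'; // note: no currency field

// The escape hatch: the compiler is told "trust me" and nothing is verified.
const unchecked = JSON.parse(rawBody) as Invoice; // currency is silently undefined

// A checked alternative validates the shape before the type is claimed.
function parseInvoice(raw: string): Invoice {
  const v = JSON.parse(raw) as Partial<Invoice>;
  if (typeof v.total !== "number" || typeof v.currency !== "string") {
    throw new Error("not a valid invoice");
  }
  return { total: v.total, currency: v.currency };
}
```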

**Timezone bugs.** Date handling that mixes the user's clock with the server's clock, on the assumption they will agree. They will not.
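
A sketch of the mismatch (the due-date check is invented for illustration):

```ts
// The user means "end of April 28th where they are", but a date-only string
// is parsed as midnight UTC, so the comparison can be off by most of a day
// depending on the user's timezone.
function isOverdue(dueDate: string /* e.g. "2026-04-28" from the client */): boolean {
  return new Date(dueDate).getTime() < Date.now();
}

// The safer convention: store and compare instants in UTC end to end,
// and only convert to the user's local time when rendering.
```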

**Error messages that say too much.** Internal stack traces being returned to end users in API responses. Helpful to a developer. Useful to an attacker.
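
A sketch of the leak in an Express-style handler (the route and `loadOrder` are invented):

```ts
import express from "express";

const app = express();

async function loadOrder(id: string): Promise<{ id: string }> {
  return { id }; // stand-in for a real lookup
}

app.get("/api/orders/:id", async (req, res) => {
  try {
    res.json(await loadOrder(req.params.id));
  } catch (err) {
    // The leak: internal stack traces, file paths, and query text go to the caller.
    res.status(500).json({ error: (err as Error).stack });

    // Safer: log the details server-side, return a generic message.
    //   res.status(500).json({ error: "internal error" });
  }
});
```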

**Race conditions in places that look serial.** Two requests doing the same thing at the same time, racing to overwrite each other.
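
A sketch of the read-modify-write race, with an invented `Counter` store:

```ts
interface Counter {
  get(key: string): Promise<number>;
  set(key: string, value: number): Promise<void>;
}

// Looks serial on the page, but two concurrent requests both read 5,
// both write 6, and one increment is lost. Last writer wins.
async function incrementViews(store: Counter, key: string): Promise<void> {
  const current = await store.get(key);
  await store.set(key, current + 1);
}

// The usual fix is to make the update atomic at the data layer,
// e.g. UPDATE pages SET views = views + 1 WHERE key = $1, or a transaction.
```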

Why AI writes code like this

These are not random bugs. They are a signature.

AI assistants are trained to produce code that looks right. Code that compiles. Code that passes the obvious test. They are not trained to remember that this query has no LIMIT, or that this signature check needs to fail closed instead of fail open, or that this value is sensitive and must never be logged.

They write the happy path beautifully. They do not write the unhappy path at all.

A human reviewing the diff sees clean, plausible code and approves it. Of course they do. The code looks fine. That is what AI is good at.

Not everything we flag is a bug

We are not going to pretend the scanner is right every time. Looking back through this same report, a small number of findings are false positives: patterns that trigger a rule but are not actually wrong in the place they triggered.

We tell customers this directly. A static scan is a starting list, not a verdict. The job after the scan is triage, and we expect the first pass to throw out maybe 5 to 15 percent of findings as not-applicable. We would rather err on the side of telling you about a pattern and letting you decide than miss a real one.

Where this team needs to start

Looking at this report, three findings rise above the rest. They are not the most numerous (the long tail of style and complexity findings is much bigger). They are the ones a security or platform reviewer would draw a circle around and say "fix this first, before you do anything else."

We have flagged them privately for the team that owns the codebase. Two are security issues that need a fix this week. One is a missing piece of project documentation that, if added, would prevent the next several hundred findings from ever being written in the first place.

We are not going to publish what they are. Knowing the specific patterns is a small but real piece of leverage, and we would rather hand it to a team that is going to act on it than print it for everyone whose code happens to look the same.

Your three priorities will be different. They will be specific to the way your codebase has grown, the prompts you have been giving your AI assistant, and the conventions your team has and has not written down. The only way to know what they are is to look.

If you want the same scan on your code

We are running targeted mini-reports for teams who want to see what is in their own repo before committing to anything broader. Pick one rule cluster (type safety, async, memory, security) and we will run it and walk the team through the results. No code leaves your environment.

It takes us about a day. The report takes the scanner about a minute. The hardest part is reading it.

Free with every install. No license key required.

pip install stablestack