PR Churn Pitfalls When Using LLM Tools — A Data-Driven Post-Mortem¶

Published: 2026-03-08 | ~2,800 words | Tags: LLM, code-generation, PR-review, engineering-process

1. The Setup¶

This project is unusual: every line of production code was written by an AI. No human authored a function, test, or configuration file — Claude (Anthropic's coding assistant) produced all of it across 115 pull requests covering features, tests, documentation, CI pipelines, and refactors.

That makes this repository a controlled laboratory. When automated reviewers (CodeRabbit and GitHub Copilot) flag problems, those findings are not noise from an inconsistent human team. They are systematic gaps in AI code generation, repeatable patterns that surface across every work type.

We ran a full audit: 1,830 reviewer findings across 115 PRs, classified into 26 named anti-patterns across six work categories. This post shares what we found and what it cost us.

2. The Problem: What Is PR Churn?¶

PR churn is the iterative back-and-forth between code generation and review feedback that follows a predictable loop:

generate → review → fix → re-review → fix again → ...

Each cycle carries a real cost:

Latency: Every round trip adds minutes to hours of wall-clock time.
Context loss: LLMs don't remember previous sessions. Each fix cycle risks reintroducing problems that were already fixed, because the model no longer has the context of why a change was made.
Compounding errors: A fix in one place can break something adjacent if the surrounding context is no longer in the model's active window.
Review fatigue: Automated reviewers re-flag the same class of issue repeatedly, drowning signal in noise.

The worst outcome is not a bug — it is a pattern of bugs that never stops appearing, because the root cause is in the generation process rather than any specific piece of code.

3. The Data¶

Overall Distribution¶

Work Type	Findings	Share
TEST	634	34%
FEATURE	590	32%
DOCS	343	18%
FIX	142	8%
CI	84	5%
REFACTOR	35	2%
Total	1,830	100%

Tests and features together account for two-thirds of all reviewer feedback. This matters: the generation prompt for a test PR and the generation prompt for a feature PR are very different, yet both produce roughly equal volumes of churn.
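
The two-thirds figure can be checked directly from the table. This is a minimal sketch that uses the stated total of 1,830 as the denominator; the helper name `share` is ours, not part of the audit tooling.

```python
# Counts copied from the table above; TOTAL is the stated grand total.
TOTAL = 1830
counts = {"TEST": 634, "FEATURE": 590}


def share(count: int, total: int = TOTAL) -> float:
    """Return count as a fraction of total."""
    return count / total


combined = share(counts["TEST"] + counts["FEATURE"])
print(f"TEST + FEATURE: {combined:.1%}")  # TEST + FEATURE: 66.9%
```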

Top Anti-Patterns by Frequency¶

Rank	ID	Pattern	Count	Work Type
1	D5	WRONG_FORMAT	139	DOCS
2	D1	INACCURATE_CLAIM	94	DOCS
3	T2	MISSING_CALL_VERIFY	93	TEST
4	F4	SECURITY_VULN	74	FEATURE
5	D2	STALE_REFERENCE	65	DOCS
6	F3	THREAD_SAFETY	64	FEATURE
7	F2	TYPE_ANNOTATION	63	FEATURE
8	T1	WEAK_ASSERTION	54	TEST
9	F1	MISSING_ERROR_HANDLING	53	FEATURE
10	G1	ABSOLUTE_PATH	53	ALL

The single most common finding, D5·WRONG_FORMAT with 139 hits, is also the most preventable. These are markdown lint failures: the model generates documentation without running a lint pass before committing.

4. Root Causes by Category¶

4.1 Test Generation¶

The core failure: tests written to achieve coverage line-count rather than to verify contracts.

T2 · MISSING_CALL_VERIFY (93 findings)¶

A mock is set up and a function is called, but the test never asserts that the mock received the right arguments. A facade forwarding the wrong payload still passes.

# Bad: proves the code ran, not that it called the right thing
def test_send_notification(mock_notifier):
    service.notify_user(user_id=42, message="hello")
    mock_notifier.send.assert_called_once()  # ← only checks call count


# Good: verifies the exact payload was forwarded
def test_send_notification(mock_notifier):
    service.notify_user(user_id=42, message="hello")
    mock_notifier.send.assert_called_once_with(
        recipient=42,
        body="hello",
    )

The question "if this mock was called with the wrong argument, would this test catch it?" is never asked at generation time.
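
That question can be answered mechanically. The sketch below plants a deliberate bug in a hypothetical `notify_user` facade (the function and its `recipient`/`body` keywords are illustrative, mirroring the example above) and shows that the call-count assertion passes while the argument assertion fails.

```python
from unittest.mock import Mock


def notify_user(notifier, user_id: int, message: str) -> None:
    """Hypothetical facade with a planted bug: it forwards the wrong recipient."""
    notifier.send(recipient=0, body=message)  # bug: should be recipient=user_id


def weak_check() -> bool:
    """Call-count assertion only; returns True if the assertion passes."""
    notifier = Mock()
    notify_user(notifier, user_id=42, message="hello")
    notifier.send.assert_called_once()  # passes: a call happened
    return True


def strict_check() -> bool:
    """Argument assertion; returns False when the wrong payload is detected."""
    notifier = Mock()
    notify_user(notifier, user_id=42, message="hello")
    try:
        notifier.send.assert_called_once_with(recipient=42, body="hello")
    except AssertionError:
        return False
    return True


print(weak_check(), strict_check())  # True False
```

The weak test stays green with the bug in place; only the strict test would have flagged it in review.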

T1 · WEAK_ASSERTION (54 findings)¶

# Bad: asserts execution completed, not that behavior was correct
def test_process_file(tmp_path):
    result = processor.process(tmp_path / "test.txt")
    assert result["success"] is True  # ← passes even if output is wrong


# Good: asserts the meaningful side-effect
def test_process_file(tmp_path):
    input_file = tmp_path / "test.txt"
    input_file.write_text("raw content")
    result = processor.process(input_file)
    assert result["success"] is True
    assert result["output_path"].exists()
    assert result["output_path"].read_text() == "processed content"

T9 · RESOURCE_LEAK (~10 findings)¶

LRU-cached database engines and file-backed fixtures were not disposed between tests. On Windows, tmp_path cleanup fails because a cached SQLAlchemy engine holds the file handle open.

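# Bad: cached engine holds file handle across tests
from functools import lru_cache

from sqlalchemy import create_engine


@lru_cache(maxsize=None)
def get_engine(db_url: str):
    return create_engine(db_url)


# Good: clear the cache in a fixture so tmp_path can clean up
import pytest


@pytest.fixture(autouse=True)
def clear_engine_cache():
    # cache_clear() only drops the cached reference; an engine that is still
    # referenced elsewhere also needs engine.dispose() to release the file.
    get_engine.cache_clear()
    yield
    get_engine.cache_clear()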

4.2 Feature Generation¶

The core failure: code generated to the happy path only. Security boundaries, error paths, and concurrency are afterthoughts.

F4 · SECURITY_VULN (74 findings)¶

This was the most severe pattern in feature PRs. Examples from the dataset:

Auth tokens passed as ?token=... query strings — exposed in server access logs, browser history, and proxy logs.
Path inputs accepted without sanitization — directory traversal possible.
get_settings() called directly inside a route handler that was already passed an ApiSettings instance — bypassing the injected config.
Secrets echoed in log output.

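# Bad: token in the URL ends up in access logs, browser history, and proxy logs
from fastapi import APIRouter, Depends, HTTPException, Query
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

router = APIRouter()


@router.get("/files")
async def list_files(token: str = Query(...)):
    if not verify_token(token):  # verify_token is the project's own validator
        raise HTTPException(status_code=401)
    ...


# Good: token in Authorization header, not query string
security = HTTPBearer()  # assumed scheme: parses "Authorization: Bearer <token>"


@router.get("/files")
async def list_files(
    credentials: HTTPAuthorizationCredentials = Depends(security),
):
    if not verify_token(credentials.credentials):
        raise HTTPException(status_code=401)
    ...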
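
The directory traversal item in the list above has an equally small fix. This is a minimal sketch, assuming all user-supplied paths must stay under one base directory; `resolve_inside` is a hypothetical helper name, and `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path


def resolve_inside(base_dir: Path, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting anything that escapes it."""
    base = base_dir.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes base directory: {user_path!r}")
    return candidate
```

Absolute inputs are covered too: joining an absolute path onto `base` discards `base`, so the containment check rejects it unless it already points inside the base directory.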

F3 · THREAD_SAFETY (64 findings)¶

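import fcntl
import json
import os


# Bad: file truncated before lock acquired, so readers can see an empty or partial file
def write_config(path, data):
    with open(path, "w") as f:   # truncates here
        fcntl.flock(f, fcntl.LOCK_EX)   # lock acquired after truncation
        json.dump(data, f)


# Good: open without truncating, lock, then truncate
def write_config(path, data):
    # O_CREAT creates a missing file; plain open(path, "r+") would raise FileNotFoundError
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # advisory lock, released when the file closes
        f.seek(0)
        f.truncate()
        json.dump(data, f)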
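
A different technique from the locking fix above, and one that works without fcntl, is to write to a temporary file and swap it into place. This is a sketch of that alternative, not what the audited project did; `write_config_atomic` is a hypothetical name.

```python
import json
import os
import tempfile


def write_config_atomic(path: str, data: dict) -> None:
    """Write JSON to a temp file in the target directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic on POSIX; overwrites an existing file
    except BaseException:
        # Remove the temp file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Readers see either the old file or the new one, never a truncated one. Concurrent writers are not serialized, though: the last rename wins, so a lock is still needed if writers must not overwrite each other.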

F1 · MISSING_ERROR_HANDLING (53 findings)¶

Feature PRs consistently generated happy-path-only code. An exception raised by a dependency propagates to the caller with no wrapping or user-facing message. This is not a difficult pattern to detect — but the model generates code as if every call will succeed.
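
A minimal sketch of the missing wrapping step, assuming a JSON config loader: `ConfigLoadError` and `load_config` are hypothetical names chosen for illustration, not code from the audited repository.

```python
import json


class ConfigLoadError(Exception):
    """Hypothetical domain error that carries a user-facing message."""


def load_config(path: str) -> dict:
    """Load a JSON config, translating low-level failures into one domain error."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError as exc:
        raise ConfigLoadError(f"Config file not found: {path}") from exc
    except json.JSONDecodeError as exc:
        raise ConfigLoadError(
            f"Config file is not valid JSON: {path} (line {exc.lineno})"
        ) from exc
```

Using `raise ... from exc` keeps the original exception attached as `__cause__`, so the caller gets a readable message without losing the underlying traceback.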

4.3 Documentation¶

The core failure: docs written from memory and intent rather than from reading the actual implementation first.

D5 · WRONG_FORMAT (139 findings — #1 overall)¶

Markdown lint failures: heading level skips (# Title followed immediately by ### Section), nested code fences, missing blank lines around fenced blocks, table formatting errors. These are generated consistently across every docs PR. The model does not run a lint pass before committing.
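
To show how mechanical these checks are, here is a rough sketch of the heading-skip rule alone. It is not a replacement for a real linter: the fence tracking is naive, only `#`-style headings are recognized, and `find_heading_skips` is a name we made up for the example.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+\S")
# Fence markers are built at runtime so this listing contains no literal fence.
FENCE_MARKERS = ("`" * 3, "~" * 3)


def find_heading_skips(markdown: str) -> list[int]:
    """Return 1-based line numbers of headings that jump more than one level deeper."""
    skips: list[int] = []
    previous_level = 0
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith(FENCE_MARKERS):
            in_fence = not in_fence  # naive toggle on every fence line
            continue
        if in_fence:
            continue  # '#' lines inside code blocks are not headings
        match = HEADING.match(line)
        if match is None:
            continue
        level = len(match.group(1))
        if previous_level and level > previous_level + 1:
            skips.append(number)
        previous_level = level
    return skips
```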

D1 · INACCURATE_CLAIM (94 findings)¶

Documented behavior that does not match the implementation:

Method signatures shown in docs that don't exist in the codebase.
Parameter names that are wrong.
File format support claimed (.doc via python-docx) that the library does not actually provide.
Plugin API examples inconsistent with the actual base class interface.

The pattern is consistent: the model writes from its training data's approximation of the API rather than inspecting the real source.
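
One way to force the inspection step is to compare what the docs claim against the live signature. The sketch below assumes the documented parameter names have already been extracted from the docs; `signature_drift` and the stand-in function `organize` are hypothetical.

```python
import inspect


def signature_drift(func, documented_params: list[str]) -> dict[str, list[str]]:
    """Compare parameter names claimed in docs against the real signature."""
    actual = list(inspect.signature(func).parameters)
    return {
        "missing_from_code": [p for p in documented_params if p not in actual],
        "missing_from_docs": [p for p in actual if p not in documented_params],
    }


# Hypothetical function standing in for real project code.
def organize(source_dir: str, dry_run: bool = False) -> None:
    ...


print(signature_drift(organize, ["source", "dry_run"]))
# {'missing_from_code': ['source'], 'missing_from_docs': ['source_dir']}
```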

D6 · CONTRADICTION (29 findings)¶

Two sections of the same document disagree. The classic example from our dataset: a feature described as "complete" in the overview section and as "planned" in the status table on the same page. Another: coverage described as 0% in a metrics table and "comprehensive" in the prose.

4.4 CI / Infrastructure¶

The core failure: CI config generated without reading the existing workflow to understand what already runs, and thresholds written from memory.

C4 · COVERAGE_GATE (25 findings)¶

The coverage threshold in a PR description did not match the value enforced in pyproject.toml or the workflow YAML. Three different numbers appeared across the same PR: 70% in the README, 75% in the PR description, and the actual enforced threshold in configuration.
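
A consistency check for this is short. The sketch below assumes the threshold appears as coverage.py's `fail_under` setting, pytest-cov's `--cov-fail-under` flag, or prose of the form "N% coverage"; other phrasings would need more patterns. The function names are ours.

```python
import re

PATTERNS = [
    re.compile(r"fail_under\s*=\s*(\d+)"),               # coverage.py setting
    re.compile(r"--cov-fail-under[= ](\d+)"),            # pytest-cov flag
    re.compile(r"(\d+)\s*%\s*coverage", re.IGNORECASE),  # prose mention
]


def coverage_thresholds(sources: dict[str, str]) -> dict[str, set[int]]:
    """Map each named text source to the set of thresholds it mentions."""
    found: dict[str, set[int]] = {}
    for name, text in sources.items():
        values = {int(m.group(1)) for p in PATTERNS for m in p.finditer(text)}
        if values:
            found[name] = values
    return found


def thresholds_agree(sources: dict[str, str]) -> bool:
    """True when every mention across all sources states the same number."""
    distinct: set[int] = set()
    for values in coverage_thresholds(sources).values():
        distinct |= values
    return len(distinct) <= 1
```

Feeding it the text of the README, the PR description, pyproject.toml, and the workflow YAML would turn the three-numbers situation above into a failing check instead of a reviewer finding.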

C3 · CACHE_MISCONFIG¶

@lru_cache applied to functions that read environment variables at call time. The cache holds the value from the first call; if the env var changes between invocations (common in tests), the cache returns a stale value.

import os
from functools import lru_cache


# Bad: env var read once and cached forever
@lru_cache(maxsize=None)
def get_config_manager():
    config_dir = os.environ.get("FO_CONFIG_DIR", "~/.config/file-organizer")
    return ConfigManager(config_dir)


# Good: pass the value in, cache the constructed object separately
def get_config_manager(config_dir: str | None = None) -> ConfigManager:
    resolved = config_dir or os.environ.get("FO_CONFIG_DIR", "~/.config/file-organizer")
    return _build_manager(resolved)


@lru_cache(maxsize=None)
def _build_manager(config_dir: str) -> ConfigManager:
    return ConfigManager(config_dir)

5. Patterns Unique to LLMs vs. Human Authors¶

These patterns appear across every work type. They look structural to how LLMs generate code rather than tied to any particular task, although this dataset covers only one model.

5.1 Context Window Limitations Cause Repeated Mistakes¶

A human engineer who fixed a threading bug on Monday will not reintroduce it on Tuesday because they remember writing the fix. An LLM has no persistent memory across sessions. The same class of bug — THREAD_SAFETY, MISSING_CALL_VERIFY, SECURITY_VULN — appears again in the next PR of the same type.

This is not a model quality problem. It is an architectural property: without explicit memory or rules that travel with each generation prompt, the model reverts to its base behavior.

5.2 "Success-Only" Thinking¶

The model generates tests that demonstrate the code executes without error. It does not reliably generate tests that demonstrate the code produces the correct output when given adversarial or edge-case input, because the generation prompt asks "write a test for this function" — not "write a test that would catch a regression in this function's contract."

Humans who have debugged production failures write tests defensively. LLMs write tests to satisfy the stated requirement.

5.3 Overconfident Documentation¶

The model writes documentation from its training data's approximation of the API. It does not read the source file before documenting it. When an API changed after training cutoff — or was never in the training data at all — the documentation is confidently wrong rather than absent.

Humans would not document a method they haven't read. LLMs document what they believe the method should look like.

5.4 Security Boundary Blindness¶

The model generates logic without considering trust boundaries. It does not ask "who controls this input?" before accepting it or "where does this value end up?" before logging it. Auth token placement, input sanitization, and secret handling require reasoning about the deployment context — information the model infers poorly from a function signature alone.
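
The "where does this value end up?" question can be partly enforced in code. As a sketch, a `logging.Filter` can mask secret-looking `key=value` pairs before a record is emitted. The key names in `SECRET_PATTERN` are assumptions and would need to match whatever the project actually logs.

```python
import logging
import re

# Assumed key names; extend to match the project's own parameters.
SECRET_PATTERN = re.compile(r"(token|password|api_key)=([^\s&]+)", re.IGNORECASE)


class RedactSecrets(logging.Filter):
    """Mask the values of secret-looking key=value pairs in log messages."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so secrets passed as %-style arguments are covered too.
        message = record.getMessage()
        record.msg = SECRET_PATTERN.sub(r"\1=[REDACTED]", message)
        record.args = ()  # the message is already fully formatted
        return True  # keep the record, only its text changes
```

Attaching the filter to a handler with `handler.addFilter(RedactSecrets())` covers every logger that writes through that handler. It is a safety net, not a substitute for keeping secrets out of log calls in the first place.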

6. What Actually Works¶

After 115 PRs of data, these interventions produced measurable reductions in churn rate:

6.1 Pre-Generation Checklists¶

The most effective intervention was adding mandatory checklist steps to the generation workflow before the model writes any code:

Source-first docs: "Read the implementation file before writing any docs claim." Reduced D1·INACCURATE_CLAIM incidents in subsequent doc PRs.
Mock verification rule: "Every generated test that uses a mock must assert the mock's arguments, not just its call count." Directly targets T2.
Security boundary check: "Before implementing any input handling, identify the trust boundary: is this input user-controlled?" Targets F4.
Lint before commit: "Run markdownlint before committing any .md file." The highest-ROI intervention for D5·WRONG_FORMAT (139 hits from simple formatting errors).

6.2 Persistent Memory Across Sessions¶

Rule files committed to the repository (e.g. .claude/rules/feature-generation-patterns.md, .claude/rules/test-execution.md) travel with the generation prompt automatically. Unlike a one-time instruction in a chat session, they are present for every subsequent generation.

The key insight: the model will not remember a correction from session N in session N+1. The correction must be written down and loaded at the start of every session.

6.3 Single-Pass Review Response¶

Instead of addressing one reviewer comment at a time — the natural instinct — the protocol is:

Extract all reviewer findings upfront in one operation.
Verify each against current code.
Apply all valid fixes locally in a single pass.
Push once.

Iterative comment-by-comment responses cause compounding context loss: the model fixes comment #3 without the full context of comments #1 and #2, introducing inconsistencies that generate new review findings.
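
The first three steps of the protocol can be sketched as a small planning function. The `Finding` fields and the `still_applies` callback are illustrative assumptions; in practice the callback would re-read the current file and decide whether the comment is still valid.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    """One reviewer comment; field names are illustrative."""
    file: str
    line: int
    pattern_id: str
    message: str


def plan_single_pass(
    findings: list[Finding],
    still_applies: Callable[[Finding], bool],
) -> dict[str, list[Finding]]:
    """Drop stale findings, then group the rest by file so each file is edited once."""
    plan: dict[str, list[Finding]] = defaultdict(list)
    for finding in findings:
        if still_applies(finding):
            plan[finding.file].append(finding)
    # Order bottom-up so earlier edits do not shift later line numbers.
    return {
        name: sorted(items, key=lambda f: f.line, reverse=True)
        for name, items in plan.items()
    }
```

The resulting plan is applied locally in one go and pushed once, which is the fourth step.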

6.4 Automated Pre-Commit Validation¶

A pre-commit hook that runs lint, type-checking, and pattern validation before any commit catches the highest-frequency, lowest-severity findings (D5, G4, G2) before they reach review. This compresses the feedback loop from hours (waiting for automated review) to seconds.

The economics matter: a finding caught at commit time costs seconds. The same finding caught in PR review costs one round-trip cycle.
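
A hook of this kind can be as small as a runner that executes each check and fails if any of them fails. The sketch below is one possible shape; the tool names in the commented wiring are assumptions to be replaced with whatever the project uses.

```python
import subprocess
import sys


def run_checks(commands: list[list[str]]) -> int:
    """Run every check and return 0 only if all of them pass."""
    failed: list[str] = []
    for command in commands:
        try:
            returncode = subprocess.run(command).returncode
        except FileNotFoundError:
            returncode = 127  # the tool is not installed
        if returncode != 0:
            failed.append(" ".join(command))
    for name in failed:
        print(f"check failed: {name}", file=sys.stderr)
    return 1 if failed else 0


# Example wiring (assumed tool choices, adjust per project):
# CHECKS = [["markdownlint", "docs"], ["ruff", "check", "."], ["mypy", "src"]]
# sys.exit(run_checks(CHECKS))
```

The runner deliberately keeps going after a failure, so one commit attempt reports every problem at once instead of one per retry.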

7. Takeaways for Teams Using LLM Tools¶

Automated Review Is Not Optional¶

If you are using LLMs to write code, you need automated reviewers. Not as a quality gate at the end, but as a systematic data source. The findings tell you where your generation process is weak.

Without automated review, you have no data. Without data, you cannot improve the generation process. You are flying blind across a pattern space that repeats every PR.

Rule Files Reduce Repeat Mistakes More Than Prompting¶

A one-time instruction ("always verify mock arguments") degrades across sessions. A rule file loaded at the start of every session does not. The cost of maintaining rule files is low; the benefit compounds with each PR.

The pattern: identify a class of mistake from review data → write a rule → load the rule automatically → measure whether the mistake recurs.

Measure Churn Rate Per PR Type¶

Aggregate churn rate is not actionable. Churn rate by work type is. In this dataset:

Docs PRs had the highest density of findings (343 findings across fewer PRs, dominated by format errors with a simple fix).
Feature PRs had the highest severity findings (security vulnerabilities, thread safety, missing error handling).
Test PRs had the highest stealth risk: 634 findings, many in tests that pass CI without actually verifying behavior.

Each work type needs its own intervention. A mock verification rule does not help with markdown lint failures.
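
Per-type density is a one-line calculation once PR counts per work type are known. Those counts are not broken out in this post, so the sketch takes them as an input; `findings_per_pr` is a hypothetical helper.

```python
def findings_per_pr(
    findings: dict[str, int],
    pr_counts: dict[str, int],
) -> dict[str, float]:
    """Average findings per PR for each work type that has at least one PR."""
    return {
        work_type: findings[work_type] / pr_counts[work_type]
        for work_type in findings
        if pr_counts.get(work_type, 0) > 0
    }


# Finding totals from the distribution table; PR counts must come from the repository.
FINDINGS = {"TEST": 634, "FEATURE": 590, "DOCS": 343, "FIX": 142, "CI": 84, "REFACTOR": 35}
```

Tracking this number per work type over time is what shows whether a given rule or checklist actually reduced churn.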

The Speed Gain Has a Hidden Cost¶

LLMs are fast. A feature that would take a human two days can be prototyped in minutes. But if that prototype generates 15 reviewer findings that require three fix cycles to resolve, the net elapsed time — accounting for review latency, context re-establishment, and re-testing — may exceed the manual baseline.
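
The arithmetic behind that claim can be made explicit with a toy model: one generation step, one initial review, then a fix and a re-review per cycle. The model and the sample inputs below are illustrative assumptions, not measurements from this audit.

```python
def elapsed_hours(
    generation: float,
    review_wait: float,
    fix_time: float,
    fix_cycles: int,
) -> float:
    """Wall-clock hours for generate, review, then fix_cycles rounds of fix and re-review."""
    first_pass = generation + review_wait
    return first_pass + fix_cycles * (fix_time + review_wait)


# Made-up inputs: 15 min to generate, 4 h review latency, 30 min per fix, 3 cycles.
print(elapsed_hours(generation=0.25, review_wait=4.0, fix_time=0.5, fix_cycles=3))  # 17.75
```

In this model, review latency dominates: each extra cycle costs far more than the generation step it follows.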

The actual speed gain from LLM code generation only materializes when churn is low. Investing in pre-generation checklists, rule files, and automated validation is not overhead — it is what makes the speed gain real.

Appendix: Anti-Pattern Reference¶

ID	Name	Count	Category
D5	WRONG_FORMAT	139	DOCS
D1	INACCURATE_CLAIM	94	DOCS
T2	MISSING_CALL_VERIFY	93	TEST
F4	SECURITY_VULN	74	FEATURE
D2	STALE_REFERENCE	65	DOCS
F3	THREAD_SAFETY	64	FEATURE
F2	TYPE_ANNOTATION	63	FEATURE
T1	WEAK_ASSERTION	54	TEST
F1	MISSING_ERROR_HANDLING	53	FEATURE
G1	ABSOLUTE_PATH	53	ALL
G5	NAMING_CONVENTION	43	ALL
G4	UNUSED_CODE	41	ALL
D4	MISSING_SECTION	36	DOCS
F5	HARDCODED_VALUE	36	FEATURE
D6	CONTRADICTION	29	DOCS
T6	PERMISSIVE_FILTER	26	TEST
C4	COVERAGE_GATE	25	CI
T8	BRITTLE_ASSERTION	23	TEST
T3	WRONG_PAYLOAD	20	TEST
D3	BROKEN_EXAMPLE	20	DOCS
T4	BROAD_EXCEPTION	~15	TEST
T10	DEAD_TEST_CODE	~15	TEST
G2	LOGGING_FORMAT	~12	ALL
F6	API_CONTRACT_BROKEN	~12	FEATURE
T9	RESOURCE_LEAK	~10	TEST
F7	RESOURCE_NOT_CLOSED	~10	FEATURE

Source: curdriceaurora/Local-File-Organizer issue #657 — 1,830 findings across 115 PRs.

Issue #657: Full-project PR review audit — source data for this post
Issue #656: Test-generation anti-pattern deep-dive — precursor analysis