Avoiding Production Crises with Vibe Coding Harness

Learn how to prevent production issues caused by AI-generated code using the Vibe Coding Harness framework.

1. The Truth Behind the 2:47 AM Engineer Phone Call

At 2:47 AM, a phone call shattered the silence of the night and woke an engineer from deep sleep: the production environment had crashed, data was severely corrupted, user complaints were flooding in, and no one knew what the problem was or how long it would take to fix.

After a long night, the root cause was found: six weeks earlier, on a Tuesday afternoon, an engineer had used AI to write 50 lines of code, generated in just 4 seconds. The code ran without errors, so it was deployed without further review, testing, or clear production standards, and that shortcut led directly to the midnight crisis.

This isn’t the AI’s fault; it’s a pitfall many engineers fall into. We tend to think coding with AI is all about prompt techniques, forgetting a painful truth: the core of Vibe Coding is 10% prompts and 90% skills, standards, and framework engineering.

Today we will discuss vibe-coding-harness, an open-source tool that helps engineers avoid such midnight crises. It is completely free, open-source, and commercially usable, and it provides a mature AI-assisted development framework so that AI-generated code can go into production without failures.

2. Core Breakdown: 5 BKM Steps of Vibe Coding with Executable Code

The core logic of vibe-coding-harness is simple: do 90% of the preparation work before sending prompts to the AI by setting standards, building the framework, and writing tests, so the AI follows guidelines rather than improvising. The framework consists of 5 core BKMs (Best Known Methods), each with specific steps and code examples that even beginners can apply directly.

1. Understand: Why Does AI-Generated Code Fail in Production?

Many engineers use AI to write code from a one-line prompt like “write a task manager” or “create a URL shortener.” This seems efficient, but it hides significant risks. For instance, the following AI-generated Python task manager runs and even passes a casual test, yet it contains 7 production-level bugs that could crash the system.

import uuid
from datetime import datetime
tasks = {}  # Global state, thread safety issue
def create_task(title):
    id = str(uuid.uuid4())
    tasks[id] = {"id": id, "title": title, "status": "todo", "created": datetime.now()}
    return tasks[id]       # Returns a mutable reference, caller can modify internal state
def complete_task(id):
    if id in tasks:
        tasks[id]["status"] = "done"
    # Silently returns None if the task does not exist, making troubleshooting difficult
def set_status(id, status):
    tasks[id]["status"] = status  # No id check, easy to raise KeyError; status is unrestricted, any string can be input
def list_tasks():
    return list(tasks.values())  # Returns all tasks, including archived ones, which does not meet actual needs

The dangers of these 7 bugs are as follows, each capable of triggering a production crisis:

  • Critical: No persistence, so all tasks are lost when the service restarts; no concurrency protection, so data is corrupted under high load.
  • High Risk: No title validation, so empty titles can be created; no status restriction, so any string can be used as a status.
  • Medium Risk: Mutable references are returned, letting callers modify internal state; there is no exception handling, so a missing id raises KeyError and crashes the service; complete_task fails silently when the task does not exist, making troubleshooting difficult.

The root of the problem is not that AI writes poorly, but that we did not provide AI with clear “production standards”—it does not know to implement persistence, handle concurrency, or validate inputs, leading to “running but unusable” code.

2. 5 BKM Practical Steps to Ensure Zero-Failure AI-Generated Code

The 5 BKMs of vibe-coding-harness essentially help us establish “production rules” for AI, preventing bugs from the source. Each step includes specific operations and code examples that can be directly applied.

BKM1: Set Standards Before Writing Prompts (Product Planning First)

Before opening any AI tool, write a context_spec.md that clearly defines “what to do, how to do it, and what constitutes success.” This is key to preventing AI from improvising. This specification must include 5 core parts, without exception:

# Context Spec: Task Manager v1
## What are we building?
A single-user task manager that supports SQLite persistence, allowing for task creation, deletion, modification, and querying, with data retained after service restarts.
## What are the success criteria?
- Task creation: must appear in the list and not be lost after service restarts.
- Task completion: should not appear in the task list after completion.
- Input validation: must raise a clear ValueError for empty titles or titles exceeding 500 characters.
- Performance requirement: querying the list must take no more than 50 milliseconds with 100 tasks.
## System constraints
- Python 3.11+, only using standard libraries (no additional dependencies).
- Use SQLite (via sqlite3), no ORM frameworks.
- All public methods must have type hints and docstrings.
## Out of scope features (to be implemented in v2)
- User authentication.
- Task pagination.
- REST interface (as a separate service).
## At least 5 edge cases
1. Empty title → raise ValueError.
2. Title exceeding 500 characters → raise ValueError.
3. Querying a nonexistent task ID → return None (no error).
4. Repeatedly marking a task as completed → idempotent operation (no error, no state change).
5. Deleting a nonexistent task → silently succeed (no error).

Remember: if you cannot write this specification, it means you do not yet know what you want to do. At this point, do not send a prompt to AI—otherwise, you will only generate a bunch of unusable “demo code.”

BKM2: Context Engineering, Helping AI Remember Your Project Rules

AI sessions are stateless; each new session, it “forgets” previous rules, easily leading to contradictory code. Context engineering helps AI “remember” fixed rules of the project and the current session’s goals, divided into two core files:

The first is the global context file (CLAUDE.md; the exact name varies slightly by tool), which serves as the project’s “permanent memory” and stores rules the AI cannot infer from the code, such as architectural decisions, invariants, and common commands. Keep it under 200 lines (a 2026 ETH Zurich study found that AI can stably follow at most 150-200 instructions).

# Task Manager — Project Context
## Tech Stack
Python 3.11 · SQLite · pytest
## Core Invariants (absolutely must not be violated)
- Task status only supports TODO, DONE, ARCHIVED (strict enumeration).
- All database operations must use parameterized SQL, no string concatenation (to prevent SQL injection).
- All external calls must be wrapped in exception handling, no bare except.
## Current Iteration Focus
Currently developing: task list query function (filter by status).
Prohibited modifications: /models/task.py (database migration ongoing, until 2026-05-15).
## Common Commands
- make test-unit: only run unit tests (fast, no database dependency).
- make migrate: perform database migration (local environment).
- make lint: code check (ruff + mypy).
## Known Technical Debt (no expansion, planned for Q3 removal)
- /utils/legacy_mapper.py: old data mapping tool, not to be modified for now.

The second is the session context package, akin to a “daily task briefing,” clarifying the current session’s goals, roles, and data models to prevent AI from deviating:

# Session Context — 2026-05-01
## Role
Senior Python engineer, focusing on code robustness, no bare exceptions, all public methods must have docstrings.
## Data Model
Task: { id: uuid4, title: str(1-500), status: TODO|DONE|ARCHIVED,
        created_at: UTC datetime, updated_at: UTC datetime }
## Established Decisions (no re-discussion)
- Use SQLite, no PostgreSQL (v1 does not require complex infrastructure).
- Do not use ORM, only standard library sqlite3.
- Task status is strict enumeration, no arbitrary strings allowed.
## Today's Goal
Only implement list_tasks(status: Optional[TaskStatus]) method, no other features.
## Involved Files
- src/task_manager.py (only modify list_tasks method).
- tests/test_harness.py (only extend the 4th layer of tests).

BKM3: Write Tests Before Writing Code (Harness Engineering)

Many engineers are used to writing the code first and the tests afterwards, but with AI-generated code this approach only leads to pitfalls: tests written after the fact simply ratify whatever the AI produced, and hidden hazards slip through. The correct approach is the opposite: write the test harness first, then let the AI write the code, so the tests act as the gatekeeper.

The test harness consists of 4 layers, progressing from basic to core, ensuring all test cases come from context_spec.md, guaranteeing AI does not miss any edge cases. Below is the complete test code, ready for direct copying:

import pytest
from task_manager import TaskManager, TaskStatus
import time
class VibeSafeHarness:
    def __init__(self, db_path=":memory:"):
        self.tm = TaskManager(db_path)
    
    def _expect_error(self, func, error_type):
        """Assert that executing func raises the specified type of exception"""
        with pytest.raises(error_type):
            func()
    
    def _assert_none(self, value):
        """Assert that the value is None"""
        assert value is None
    
    def _assert_idempotent(self, tm, status):
        """Assert that marking a task as completed is idempotent"""
        task_id = tm.create("Test Idempotency").id
        tm.complete(task_id)
        first_status = tm.get(task_id).status
        tm.complete(task_id)
        second_status = tm.get(task_id).status
        assert first_status == second_status == status
    
    def _assert_fields(self, task, fields):
        """Assert that the task contains all required fields"""
        for field in fields:
            assert hasattr(task, field)
    
    def _assert_type(self, value, expected_type):
        """Assert that the value's type matches the expected type"""
        assert isinstance(value, expected_type)
    
    def _assert_frozen(self, task):
        """Assert that the task is immutable"""
        with pytest.raises(AttributeError):
            task.title = "Modification Test"
    
    def _assert_persists(self, tm):
        """Assert that tasks survive a simulated service restart"""
        # A file-backed database is used here (instead of tm's in-memory one)
        # so that data can outlive a single connection
        db_path = "harness_persistence_test.db"
        task_id = TaskManager(db_path).create("Persistence Test").id
        # Simulate the restart by reopening the same database file
        new_tm = TaskManager(db_path)
        assert new_tm.get(task_id) is not None
    
    def _assert_performance(self, tm, count=100, max_ms=50):
        """Assert that list query performance meets requirements"""
        # Create count test tasks
        for i in range(count):
            tm.create(f"Test Task {i}")
        # Test query time
        start = time.time()
        tm.list_tasks()
        end = time.time()
        elapsed_time = (end - start) * 1000  # Convert to milliseconds
        assert elapsed_time < max_ms
    
    def run_all_tests(self):
        """Run all 4 layers of tests"""
        self.layer_1_smoke()
        self.layer_2_edge_cases()
        self.layer_3_contracts()
        self.layer_4_acceptance()
        print("✅ All tests passed, ready for the next step")
    
    def layer_1_smoke(self):
        """Layer 1: Smoke test (basic usability)"""
        __import__('task_manager')   # the module must import cleanly
        TaskManager(":memory:")      # the constructor must not raise
        print("✓ Smoke test passed")
    
    def layer_2_edge_cases(self):
        """Layer 2: Edge case tests (from context_spec.md)"""
        self._expect_error(lambda: self.tm.create(""), ValueError)
        self._expect_error(lambda: self.tm.create("x" * 501), ValueError)
        self._assert_none(self.tm.get("nonexistent-id"))
        self._assert_idempotent(self.tm, TaskStatus.DONE)
        print("✓ Edge case tests passed")
    
    def layer_3_contracts(self):
        """Layer 3: Contract tests (data model, interface specifications)"""
        task = self.tm.create("Contract Test")
        self._assert_fields(task, ["id", "title", "status", "created_at", "updated_at"])
        self._assert_type(task.status, TaskStatus)
        self._assert_frozen(task)
        print("✓ Contract tests passed")
    
    def layer_4_acceptance(self):
        """Layer 4: Acceptance tests (meet success criteria)"""
        self._assert_persists(self.tm)
        self._assert_performance(self.tm, count=100, max_ms=50)
        print("✓ Acceptance tests passed")

# Run tests
if __name__ == "__main__":
    harness = VibeSafeHarness()
    harness.run_all_tests()

Three key rules must be strictly followed: 1) tests are written by humans, not by the AI (an AI will write tests its own code can pass, which defeats their purpose as validation); 2) tests are written before the code, so the tests define what “correct code” means; 3) test failures must block deployment, with no “deploy first, fix later”.
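
For the third rule, one simple way to enforce the gate is a small deploy script that runs the harness and refuses to continue when it fails. This is a minimal sketch, assuming the harness lives at tests/test_harness.py (as in the directory structure in section 3) and is runnable as a script:

import subprocess
import sys

# Run the harness as a script; any failed assertion makes it exit with a nonzero code
result = subprocess.run([sys.executable, "tests/test_harness.py"])
if result.returncode != 0:
    print("Test harness failed, deployment blocked")
    sys.exit(result.returncode)
print("Test harness passed, safe to continue deploying")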

BKM4: Structured Prompts, Rejecting Improvisation (Prompt Patterns)

Freeform prompts (like “write a list_tasks method”) are a major cause of bad AI-generated code. In production-level development, use structured prompts that clearly define the AI’s role, context, task, and constraints, so the AI only writes code according to the rules.

The structured prompt format is fixed as [ROLE][CONTEXT][TASK][CONSTRAINTS], as shown below, which can be directly replaced and reused:

[ROLE]
Senior Python engineer, focusing on code robustness and maintainability. All public methods must have type hints and docstrings, no bare excepts, and database operations must use parameterized SQL.
[CONTEXT]
Data model: Task: { id: uuid4, title: str(1-500), status: TODO|DONE|ARCHIVED, created_at: UTC datetime, updated_at: UTC datetime }
Database: SQLite (standard library sqlite3), WAL mode enabled, no ORM.
Core constraints: task status is strict enumeration, arbitrary strings are not allowed; all inputs must be validated.
[TASK]
Implement create(title: str) -> Task method in the TaskManager class.
Specific requirements: 1. Validate title (non-empty, trimmed, max 500 characters); 2. Generate uuid4 as task ID; 3. Set created_at and updated_at to current UTC time; 4. Persist the task to SQLite; 5. Return an immutable Task data class.
[CONSTRAINTS]
1. Code must not exceed 60 lines;
2. Validation failures must raise ValueError with clear error messages;
3. Must not use string interpolation or format for SQL, must use parameterized queries;
4. Must pass the 1st and 2nd layer tests in test_harness.py without modifying test code.
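
To make the pattern concrete, here is one possible implementation that would satisfy this prompt. It is only an illustrative sketch: the TaskStatus enum, the frozen Task dataclass, and the table schema are assumptions derived from the session context above, not output produced by the framework.

import sqlite3
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum

class TaskStatus(Enum):
    TODO = "todo"
    DONE = "done"
    ARCHIVED = "archived"

@dataclass(frozen=True)
class Task:
    id: str
    title: str
    status: TaskStatus
    created_at: datetime
    updated_at: datetime

class TaskManager:
    def __init__(self, db_path: str = ":memory:"):
        self._conn = sqlite3.connect(db_path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS tasks ("
            "id TEXT PRIMARY KEY, title TEXT NOT NULL, status TEXT NOT NULL, "
            "created_at TEXT NOT NULL, updated_at TEXT NOT NULL)"
        )

    def create(self, title: str) -> Task:
        """Validate the title, persist the task, and return an immutable Task."""
        cleaned = title.strip()
        if not cleaned:
            raise ValueError("title must not be empty")
        if len(cleaned) > 500:
            raise ValueError("title must not exceed 500 characters")
        now = datetime.now(timezone.utc)
        task = Task(str(uuid.uuid4()), cleaned, TaskStatus.TODO, now, now)
        # Parameterized SQL only, never string interpolation
        self._conn.execute(
            "INSERT INTO tasks (id, title, status, created_at, updated_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (task.id, task.title, task.status.value,
             task.created_at.isoformat(), task.updated_at.isoformat()),
        )
        self._conn.commit()
        return task

Even if the AI produces something like this, it still has to pass all 4 layers of the harness and the double verification in BKM5 before it can ship.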

BKM5: Double Verification, Guarding the Final Defense (LLM-as-Judge + Human-as-Judge)

Even if code passes all 4 layers of tests, it cannot go live directly: tests only check whether the code meets the specification, not whether it contains semantic vulnerabilities (SQL injection risks, sensitive data leaks, logical flaws). Double verification is needed as the last line of defense.

The first step is LLM-as-Judge (automatic AI review): a second AI model performs a semantic review of the code and catches issues the tests miss. Below is the complete review script; install the anthropic SDK (pip install anthropic) and set the ANTHROPIC_API_KEY environment variable, then run it directly:

import anthropic
import sys

def llm_judge(code):
    client = anthropic.Anthropic()  # reads the ANTHROPIC_API_KEY environment variable
    REVIEW_PROMPT = f"""
[ROLE] Senior Python code reviewer, strictly checking for issues in the code, marking line numbers, speaking frankly, and not missing any hidden dangers.
[TASK] Classify issues according to the following levels, returning only JSON format results, without any extra content:
  CRITICAL: Data loss, security vulnerabilities, crash risks, silent data corruption
  MAJOR: Logical errors, contract violations, missing necessary validations
  MINOR: Code style, readability, performance optimization suggestions
[CONSTRAINT] The return format must strictly follow the following JSON structure, fields must not be modified:
{{
  "critical": [{{"line": number, "issue": "issue description", "fix": "fix suggestion"}}],
  "major": [{{"line": number, "issue": "issue description", "fix": "fix suggestion"}}],
  "minor": [{{"line": number, "issue": "issue description", "fix": "fix suggestion"}}],
  "verdict": "SHIP | SHIP WITH MINOR FIXES | DO NOT SHIP",
  "summary": "A one-sentence summary of the review result"
}}
[CODE] {code}
    """
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        messages=[{"role": "user", "content": REVIEW_PROMPT}]
    )
    return response.content[0].text  # the first content block holds the review text

# Read code file and conduct review
if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python llm_judge.py <code_file_path>")
        sys.exit(1)
    with open(sys.argv[1], "r", encoding="utf-8") as f:
        code = f.read()
    result = llm_judge(code)
    print("LLM Review Result:")
    print(result)
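
If you want the review to gate a pipeline automatically instead of only printing the result, a possible follow-up is to parse the verdict. This is a minimal sketch that assumes the model actually returned the JSON structure requested in REVIEW_PROMPT; in practice you would also handle malformed output:

import json

def gate_on_verdict(review_json: str) -> bool:
    """Return False (block the pipeline) on critical issues or a DO NOT SHIP verdict"""
    review = json.loads(review_json)
    if review["verdict"] == "DO NOT SHIP" or review["critical"]:
        print("Blocked by LLM review:", review["summary"])
        return False
    print("LLM review passed:", review["summary"])
    return True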

The second step: Human-as-Judge (human review), which is the final line of defense. Different types of code require different roles for review to avoid missing critical vulnerabilities:

  • New data model/schema changes: reviewed by technical lead, DBA.
  • External API integrations: reviewed by security team, platform team.
  • Code involving permissions, payments, health data: reviewed by product + engineering leads.
  • Modifications to the testing framework: reviewed by senior engineers.

3. Project Directory Structure: Helping AI Understand Your Project Architecture

A reasonable directory structure acts as a “project map” for AI, helping it understand the purpose of each file, avoiding misplaced code and dependency confusion. The standard directory structure for vibe-coding-harness is as follows, ready for reuse:

my-project/
│
├── CLAUDE.md                      ← Global context (loaded in each session, <200 lines)
├── CLAUDE.local.md                ← Personal configuration, not submitted to git
├── context_spec.md                ← Current iteration's functional specification
│
├── skills/                        ← Reusable knowledge modules (team shared)
│   ├── api-design.md              ← API design specifications
│   ├── testing.md                 ← Testing best practices
│   ├── database.md                ← Database operation specifications
│   ├── error-handling.md          ← Exception handling specifications
│   └── security.md                ← Security specifications
│
├── src/                           ← Core code
│   ├── routes/                    ← HTTP interfaces (only handling requests)
│   ├── services/                  ← Business logic (only handling business rules)
│   ├── repositories/              ← Data access (only operating databases)
│   ├── models/                    ← Data models
│   ├── schemas/                   ← Data validation schema
│   └── shared/                    ← Shared utilities (permissions, logging, etc.)
│
├── tests/                         ← Testing directory
│   ├── test_harness.py            ← 4-layer testing framework (written before code)
│   ├── unit/                      ← Unit tests
│   ├── integration/               ← Integration tests
│   └── e2e/                       ← End-to-end tests
│
├── docs/                          ← Documentation directory
│   ├── architecture.md            ← Architecture documentation
│   ├── technical-debt.md          ← Technical debt records (AI logs)
│   └── adr/                       ← Architecture decision records
│
└── commands/                      ← Common command scripts
