Five Tasks, Two Modes, One Model: A Controlled Experiment

SysMARA Team

Claims about AI-assisted development are easy to make, hard to verify. So we ran a small controlled experiment: 5 identical tasks, 2 modes (vanilla Express vs SysMARA), 1 model (Claude), and one metric — how many business-rule violations did the generated code contain?

Every command, output, and code snippet in this post was captured from a real session. You can reproduce it yourself.

The Setup

Domain: An e-commerce backend with two modules — inventory (products, stock) and orders (order placement, cancellation).

Business rules (invariants):

  • stock_cannot_be_negative — product stock must never go below zero after any operation
  • price_must_be_positive — product price must be greater than zero
  • order_quantity_must_be_positive — order quantity must be at least 1

Policy:

  • only_active_products_orderable — orders can only be placed for active products, not discontinued ones

Module boundary:

  • inventory must not depend on orders (forbidden dependency)
  • orders may depend on inventory (allowed dependency)
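
As a rough illustration of what the boundary rule means in practice, the check below treats the forbidden dependency as data and answers "may module X import from module Y?". The `forbidden` list and `isDependencyAllowed` are hypothetical names made up for this sketch; they are not SysMARA's format or API.

```typescript
// Hypothetical representation of the module boundary above; not SysMARA's own format.
type ModuleName = "inventory" | "orders";

// Each pair reads "the first module must not depend on the second".
const forbidden: ReadonlyArray<[ModuleName, ModuleName]> = [["inventory", "orders"]];

// A dependency is allowed unless it appears in the forbidden list.
function isDependencyAllowed(from: ModuleName, to: ModuleName): boolean {
  return !forbidden.some(([f, t]) => f === from && t === to);
}

console.log(isDependencyAllowed("orders", "inventory")); // true
console.log(isDependencyAllowed("inventory", "orders")); // false
```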

The 5 Tasks

#  | Task                              | Key constraint
T1 | Add a product to the catalog      | price_must_be_positive
T2 | Create an order for a product     | stock_cannot_be_negative, only_active_products_orderable
T3 | Cancel an order and restore stock | stock_cannot_be_negative
T4 | Update stock level directly       | stock_cannot_be_negative
T5 | List all orders (read-only)       | No constraint (control task)

Mode A: Vanilla Express Prompt

We gave Claude a single prompt:

Build an Express.js REST API with these endpoints:
- POST /products — add a product (name, price, stock, status)
- POST /orders — create an order (product_id, quantity)
- POST /orders/:id/cancel — cancel an order, restore stock
- PATCH /products/:id/stock — update stock level
- GET /orders — list all orders

Use an in-memory store. Include proper error handling.
Business rules:
- Price must be positive
- Stock cannot go negative
- Quantity must be at least 1
- Only active products can be ordered

Claude generated a working Express app — 147 lines, clean structure, immediate functionality. Then we audited the generated code against our 3 invariants + 1 policy.

Mode A Audit Results

Task             | Constraint                     | Status    | Notes
T1: add_product  | price_must_be_positive         | PASS      | Checked price > 0 at the top of the handler
T2: create_order | only_active_products_orderable | VIOLATION | No check for product.status === 'active'; discontinued products are orderable
T2: create_order | stock_cannot_be_negative       | PARTIAL   | Checked stock >= quantity but did not enforce atomicity; two concurrent orders can overdraw
T3: cancel_order | stock_cannot_be_negative       | PASS      | Stock restoration was correct (additive, cannot go negative)
T4: update_stock | stock_cannot_be_negative       | VIOLATION | Accepted any integer; no check preventing negative stock values
T5: list_orders  | (none)                         | PASS      | Read-only, no constraint applies

Mode A violation rate: 2 full violations + 1 partial out of 5 constraint checkpoints. That is 40% if only the full violations count and 60% if the partial counts as a full violation.

The critical issue: the only_active_products_orderable policy was stated in the prompt but never implemented. Claude acknowledged it in a comment but did not write the guard. The update_stock endpoint accepted { "stock": -50 } without complaint.
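
For reference, here is a minimal sketch of the guards that were missing. It is not the code Claude generated; the names `checkStockUpdate`, `checkOrder`, `reserveStock` and the `GuardResult` shape are hypothetical, and the HTTP status codes are one reasonable choice among several. The guards are written as plain functions so they can be called from any Express handler.

```typescript
// Hypothetical in-memory shapes for this sketch.
type ProductStatus = "active" | "discontinued";
interface Product { id: string; price: number; stock: number; status: ProductStatus; }

type GuardResult = { ok: true } | { ok: false; status: number; error: string };

// Guard for PATCH /products/:id/stock: accept only non-negative integers.
function checkStockUpdate(stock: unknown): GuardResult {
  if (typeof stock !== "number" || !Number.isInteger(stock) || stock < 0) {
    return { ok: false, status: 400, error: "stock must be a non-negative integer" };
  }
  return { ok: true };
}

// Guard for POST /orders: policy first, then quantity, then available stock.
function checkOrder(product: Product, quantity: unknown): GuardResult {
  if (product.status !== "active") {
    return { ok: false, status: 409, error: "product is not active" };
  }
  if (typeof quantity !== "number" || !Number.isInteger(quantity) || quantity < 1) {
    return { ok: false, status: 400, error: "quantity must be at least 1" };
  }
  if (product.stock < quantity) {
    return { ok: false, status: 409, error: "insufficient stock" };
  }
  return { ok: true };
}

// Check and decrement in one synchronous step, so that within a single
// Node.js process no other request can run between the check and the write.
function reserveStock(product: Product, quantity: unknown): GuardResult {
  const result = checkOrder(product, quantity);
  if (result.ok) product.stock -= quantity as number;
  return result;
}
```

The synchronous check-and-decrement only addresses the PARTIAL finding for an in-memory store in one process. With a real database the same guarantee would need a transaction or a conditional update.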

Mode B: SysMARA Specs

Same domain, same 5 tasks. But this time we started by defining the system formally:

# Step 1: Initialize project
npx @sysmara/core init --db sqlite --orm sysmara-orm

# Step 2: Define specs (entities, capabilities, invariants, policies, modules)
# ... (YAML files — see below)

# Step 3: Build
npx sysmara build

The specs we wrote

system/entities.yaml — 2 entities across 2 modules:

entities:
  - name: product
    module: inventory
    description: A product in the catalog
    fields:
      - name: id
        type: uuid
        required: true
      - name: name
        type: string
        required: true
      - name: price
        type: number
        required: true
      - name: stock
        type: number
        required: true
      - name: status
        type: enum
        required: true
    invariants:
      - stock_cannot_be_negative
      - price_must_be_positive

  - name: order
    module: orders
    description: A customer order
    fields:
      - name: id
        type: uuid
        required: true
      - name: product_id
        type: uuid
        required: true
      - name: quantity
        type: number
        required: true
      - name: total_price
        type: number
        required: true
      - name: status
        type: enum
        required: true
      - name: created_at
        type: date
        required: true
    invariants:
      - order_quantity_must_be_positive

system/invariants.yaml — 3 named invariants:

invariants:
  - name: stock_cannot_be_negative
    description: Product stock must never go below zero after any operation
    entity: product
    rule: product.stock must be >= 0 after any update
    severity: error
    enforcement: runtime

  - name: price_must_be_positive
    description: Product price must be greater than zero
    entity: product
    rule: product.price must be > 0
    severity: error
    enforcement: runtime

  - name: order_quantity_must_be_positive
    description: Order quantity must be at least 1
    entity: order
    rule: order.quantity must be >= 1
    severity: error
    enforcement: runtime

system/policies.yaml — 1 policy:

policies:
  - name: only_active_products_orderable
    description: Orders can only be placed for active products, not discontinued ones
    actor: any
    effect: deny
    conditions:
      - field: product.status
        operator: eq
        value: discontinued
    capabilities:
      - create_order

Build output (real)

════════════════════════════════════════════════════════════
  SysMARA Build
════════════════════════════════════════════════════════════

  Parsing specs...
[INFO] Found 2 entities, 5 capabilities, 1 policies, 3 invariants, 2 modules, 1 flows

  Cross-validating...
[INFO] No cross-validation issues.

  Building system graph...
[INFO] system-graph.json (14 nodes, 21 edges)
  Building system map...
[INFO] system-map.json (2 modules)

  Compiling capabilities...
[INFO] Generated 15 file(s)

  Scaffolding app/ stubs...
[INFO] Scaffold: 13 written, 0 skipped (already exist)

  Running diagnostics...
[OK] Build completed successfully.

What SysMARA generated for create_order

The scaffold for the create_order capability — generated automatically from specs:

// SCAFFOLD: capability:create_order
// Edit Zone: editable — generated once, safe to modify

import { enforceOnlyActiveProductsOrderable }
  from '../policies/only_active_products_orderable.js';
import { validateStockCannotBeNegative }
  from '../invariants/stock_cannot_be_negative.js';
import { validateOrderQuantityMustBePositive }
  from '../invariants/order_quantity_must_be_positive.js';

export async function handleCreateOrder(ctx) {
  const input = ctx.body;

  // Policy gate — generated from specs
  if (!enforceOnlyActiveProductsOrderable(ctx.actor)) {
    throw new Error('Policy violation: only_active_products_orderable');
  }

  const repo = orm.repository('order', 'create_order');
  const result = await repo.create(input);

  // Invariant checks — generated from specs
  const stockViolation = validateStockCannotBeNegative(result);
  if (stockViolation) {
    throw new Error(`Invariant violation: ${stockViolation.message}`);
  }
  const qtyViolation = validateOrderQuantityMustBePositive(result);
  if (qtyViolation) {
    throw new Error(`Invariant violation: ${qtyViolation.message}`);
  }

  return result;
}

The critical difference: the policy check and both invariant validations are structurally present in the generated code. They exist because the YAML spec declares them — not because the AI "remembered" to add them.
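
The scaffold treats each validator's return value as either falsy or an object with a `message`. A filled-in validator might therefore look like the sketch below. The `InvariantViolation` shape is inferred from that usage and the `{ stock?: number }` parameter type is an assumption; the types SysMARA actually generates may differ.

```typescript
// Hypothetical shape, inferred from how the scaffold reads `violation.message`.
interface InvariantViolation { invariant: string; message: string; }

// Assumes the validator receives a record that may carry a product's stock.
// Returns null when the invariant holds or when there is no stock to check.
function validateStockCannotBeNegative(record: { stock?: number }): InvariantViolation | null {
  // `!(x >= 0)` also rejects NaN, which a plain `x < 0` test would let through.
  if (record.stock !== undefined && !(record.stock >= 0)) {
    return {
      invariant: "stock_cannot_be_negative",
      message: `stock is ${record.stock}, must be >= 0`,
    };
  }
  return null;
}
```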

Mode B Audit Results

Task             | Constraint                     | Status | Notes
T1: add_product  | price_must_be_positive         | PASS   | Invariant validator generated and wired into handler
T2: create_order | only_active_products_orderable | PASS   | Policy enforcer generated and called before business logic
T2: create_order | stock_cannot_be_negative       | PASS   | Invariant validator generated and checked post-operation
T3: cancel_order | stock_cannot_be_negative       | PASS   | Handler generated with no invariant (correct: additive operation)
T4: update_stock | stock_cannot_be_negative       | PASS   | Invariant validator generated and checked
T5: list_orders  | (none)                         | PASS   | Read-only, no constraint applies

Mode B violation rate: 0 violations out of 5 constraint checkpoints = 0%.

Why the difference?

The difference is not that Claude is "bad" at Mode A. Claude generated solid Express code. The problem is structural:

  • In Mode A, invariants are prose in a prompt. The AI must remember each one and decide where to enforce it. Some rules get implemented, some get acknowledged in comments, some get silently dropped.
  • In Mode B, invariants are named, typed, and bound to entities and capabilities. The compiler reads the YAML and generates the enforcement structure. The AI still writes the validation logic, but it cannot forget to call the validator — that call is generated from the spec.

This is what the SysMARA paper calls Constraint Visibility (Definition 3): a constraint is "visible" if an AI agent can discover it from the system's machine-readable artifacts without relying on human-written documentation or convention.

The violation rate formula

Violation Rate = |violated constraints| / |total constraint checkpoints|

Mode A: 2.5 / 5 = 50% (counting partial as 0.5)
Mode B: 0 / 5 = 0%
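
The same arithmetic as a short sketch, using the statuses from the two audit tables. The `Status` type, `weight` map and `violationRate` function are names introduced here for illustration only.

```typescript
type Status = "PASS" | "PARTIAL" | "VIOLATION";

// Weight of each audit status; PARTIAL counts as 0.5, as in the formula above.
const weight: Record<Status, number> = { PASS: 0, PARTIAL: 0.5, VIOLATION: 1 };

function violationRate(checkpoints: Status[]): number {
  const violated = checkpoints.reduce((sum, s) => sum + weight[s], 0);
  return violated / checkpoints.length;
}

// The five constraint checkpoints per mode (T5 has no constraint and is excluded).
const modeA: Status[] = ["PASS", "VIOLATION", "PARTIAL", "PASS", "VIOLATION"];
const modeB: Status[] = ["PASS", "PASS", "PASS", "PASS", "PASS"];

console.log(violationRate(modeA)); // 0.5
console.log(violationRate(modeB)); // 0
```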

Impact analysis: what the AI sees in Mode B

When an AI agent queries SysMARA before implementing create_order, this is what it gets:

$ sysmara explain capability create_order

════════════════════════════════════════════════════════════
  Capability: create_order
════════════════════════════════════════════════════════════

  Description:  Create a new order for a product
  Module:       orders

  Entities
    - order
    - product

  Input
    - product_id: uuid (required)
    - quantity: number (required)

  Output
    - order: order (required)

  Policies
    - only_active_products_orderable (effect: deny)

  Invariants
    - stock_cannot_be_negative [error]
    - order_quantity_must_be_positive [error]

And when it asks "what else will be affected if I change this?":

$ sysmara impact capability create_order

  Affected Modules (2):
    - inventory
    - orders

  Affected Capabilities (4):
    - add_product
    - cancel_order
    - list_orders
    - update_stock

  Affected Invariants (3):
    - order_quantity_must_be_positive
    - price_must_be_positive
    - stock_cannot_be_negative

  Affected Policies (1):
    - only_active_products_orderable

  Affected Flows (1):
    - order_placement_flow

  Total Impact Radius: 11 nodes

In Mode A, none of this information exists in machine-readable form. The AI works from memory of the prompt. In Mode B, every constraint is a queryable node in a formal graph.
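
One way to picture an impact query is as a reachability search over that graph. The sketch below runs a breadth-first search over a tiny adjacency list. The edges are invented for illustration and are not taken from system-graph.json, so the result does not reproduce the 11-node radius above, and `impactOf` is a hypothetical name rather than SysMARA's algorithm.

```typescript
// Illustrative adjacency list; these edges are made up for the sketch.
const edges: Record<string, string[]> = {
  "capability:create_order": [
    "entity:order",
    "entity:product",
    "policy:only_active_products_orderable",
  ],
  "entity:product": ["invariant:stock_cannot_be_negative", "capability:update_stock"],
  "entity:order": ["invariant:order_quantity_must_be_positive"],
};

// Breadth-first search: every node reachable from `start`, excluding `start` itself.
function impactOf(start: string): string[] {
  const seen = new Set<string>([start]);
  const queue: string[] = [start];
  while (queue.length > 0) {
    const node = queue.shift() as string;
    for (const next of edges[node] ?? []) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }
  seen.delete(start);
  return Array.from(seen).sort();
}

console.log(impactOf("capability:create_order").length); // 6
```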

Reproducing this experiment

You need: Node.js 20+, npm.

# 1. Create project
mkdir experiment && cd experiment
npx @sysmara/core init --db sqlite --orm sysmara-orm

# 2. Replace system/*.yaml files with the specs from this post

# 3. Build
npx sysmara build

# 4. Inspect generated code
cat app/capabilities/create_order.ts

# 5. Check health
npx sysmara doctor

# 6. Query the graph
npx sysmara explain capability create_order
npx sysmara impact capability create_order

Limitations and honesty

  • This is a 5-task micro-experiment, not a statistically significant study. We make no claims about general AI coding ability.
  • The invariant validators in Mode B are still stubs that return null — the developer must fill in the logic. What SysMARA guarantees is that the validator is called, not that the logic inside is correct.
  • We used 1 model (Claude). Results may differ with GPT-4, Gemini, or other models.
  • Mode A could be improved with a more detailed prompt (explicit guard pseudo-code). We used a realistic prompt, not an adversarial one.

Conclusion

Within its small scope, the experiment points in one direction: structural constraint enforcement beat prompt-based constraint communication. When invariants are YAML specs parsed by a compiler, they become generated import statements and function calls. When they are natural-language lines in a prompt, they become suggestions that the AI may or may not act on.

50% vs 0% is a big gap for 5 simple tasks. As systems grow (more entities, more invariants, more cross-module policies), we expect the gap to widen, although this experiment does not measure that.

You can reproduce every step of this experiment by following the instructions above. If your results differ, open an issue — we want to know.