We Vibed 200K Lines of Code. Then We Discovered CPU Profiling.
We built a 200,000-line Electron app using only LLMs with no engineering background. After three months, CPU usage hit 109%. Here is how we taught the LLM to profile, diagnose, and fix it in an autonomous loop — 38 fixes, renderer down to 2%.
We built Callipso — an Electron app for terminal orchestration and voice routing — over three months using Claude Code. No engineering background. No CS degree. Every line of code was written by an LLM, guided by voice commands and natural language instructions. 1,657 commits. 200,000 lines of code. 10 terminal adapters, a 3D WebGPU visualizer, a voice pipeline with on-device speech-to-text, 650+ IPC channels.
It worked. It shipped. People used it.
Then one day we noticed the fans were spinning.
The Force Quit Menu
The first sign was the Apple menu. Top-left of the screen, Force Quit. Callipso was listed in red — the system was telling us something was wrong. We did not know what CPU usage meant beyond "the app is doing too much." We did not know where to look.
We mentioned this to Claude Code. It told us to open Activity Monitor — the app that shows every running process and how much CPU, memory, and energy each one consumes. We did not know Activity Monitor existed. We had been using the Force Quit menu as our only diagnostic tool.
Activity Monitor changed everything. We could see that "Electron Helper (Renderer)" was consuming 109% CPU. Not 9%. Not 19%. 109% — more than an entire CPU core, all the time, even when the app was sitting idle with nobody touching it.
What We Did Not Know
We did not know what a profiler was. We did not know the difference between CPU and memory. We did not know that Electron runs multiple processes — a main process, a renderer process, a GPU process, a network process — and that each one shows up as a separate line in Activity Monitor. We did not know what "idle CPU" even meant as a concept. If the app is doing nothing, why would it use any CPU at all?
We did not know about caching. Three months of vibe coding had produced code that called localStorage.getItem + JSON.parse 450 times every 10 seconds — per terminal, per render cycle, hitting Chromium's internal SQLite database on every access. The LLM had not added caching because we never asked for caching. We did not know caching was a thing you needed.
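For anyone as green as we were: a read cache is just an in-memory map sitting in front of the slow storage. A minimal sketch of the pattern (illustrative names, not Callipso's actual code):

```js
// In-memory cache in front of localStorage. Reads hit the Map;
// only a cache miss or a write touches Chromium's SQLite-backed storage.
const cache = new Map();

function getItemCached(key) {
  if (!cache.has(key)) {
    const raw = localStorage.getItem(key); // slow path: hits SQLite
    cache.set(key, raw === null ? null : JSON.parse(raw));
  }
  return cache.get(key); // fast path: plain memory lookup
}

function setItemCached(key, value) {
  localStorage.setItem(key, JSON.stringify(value));
  cache.set(key, value); // invalidate-on-write: keep the cache coherent
}
```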
We did not know about CSS animation costs. Smooth opacity transitions on text elements trigger font re-rasterization on every display frame. A backdrop-filter: blur(12px) on an opaque background is invisible to humans but the GPU still runs the blur shader every frame. We had both. We did not know to avoid either.
These are things every software engineer knows. We are not software engineers. We are vibecoders. We build by describing what we want and letting the LLM figure out how. For three months, that worked perfectly — until the shortcuts caught up with us.
Idle CPU: The Easy Wins
The first round of fixes targeted things that were burning CPU while the app was doing nothing. Claude Code could identify these without any special tooling — just reading the code and reasoning about what runs continuously.
Ten optimizations in one commit:
- Undebounced iTerm2 name-change events. Claude Code updates terminal titles rapidly during tool use, and each update triggered a full DOM rebuild of all terminals. Fix: a 500ms debounce — at most 2 rebuilds per second instead of dozens (a minimal sketch follows this list).
- CSS opacity animation with smooth easing on text. Chromium re-rasterized every text glyph on every display frame through its Rust font renderer. Fix: `steps(8, end)` — only 8 repaints per animation cycle instead of 60.
- Full-window `backdrop-filter: blur(12px)` on an opaque container. The entire app shell was being blurred by the GPU every frame, behind an opaque background. Invisible but expensive. Fix: remove it.
- Hidden BrowserWindows composited at full frame rate. The recording window was preloaded but hidden, yet the GPU compositor still ran at 60fps. Fix: `setFrameRate(1)` when hidden, restored to 60 before showing.
- Waveform canvas animation loop running forever. A `requestAnimationFrame` loop with `shadowBlur` ran at 60fps even when not recording. Fix: auto-stop when idle, auto-start on data arrival, throttle to 6fps.
- More CSS animations with smooth easing. Three decorative animations (recording dot pulse, streaming indicator, typing bounce) all used smooth interpolation. Fix: `steps()` on all three.
- Space visualizer sub-modules not paused on tab switch. A 60fps label renderer and a 1-second health monitor ran even when the 3D tab was not visible. Fix: wire them into the parent module's pause/resume lifecycle.
- localStorage reads hitting SQLite ~45 times per second. No caching on any `localStorage.getItem` call. Fix: an in-memory read cache, invalidated on write.
- Task text in the render hash. Every tool-use event changed the task text, which changed the hash, which triggered a full DOM rebuild. Fix: exclude task text from the hash and patch it inline instead.
- State update debounce at 16ms. Every state mutation fired a renderer update within one frame. Fix: increase it to 200ms to batch rapid events.
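Two of the fixes above are debounces — the title-change events and the state updates. A minimal sketch of the pattern, with `rebuildTerminals` as a hypothetical stand-in for the real handler:

```js
// Collapse a burst of events: the wrapped function runs once,
// waitMs after the last event in the burst.
function debounce(fn, waitMs) {
  let timer = null;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

// Dozens of iTerm2 title-change events per second collapse into
// a single DOM rebuild once the burst quiets down.
const onTitleChange = debounce(rebuildTerminals, 500);
```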
Result: renderer CPU dropped from 109% to approximately 2% idle. One commit, ten fixes, a roughly 55x improvement.
The Spike Problem
With idle CPU solved, a new pattern appeared. The app sat quietly at 2% — then spiked to 40-50% every time we sent a prompt to Claude Code. The spike lasted a few hundred milliseconds and disappeared. We could see it visually in Activity Monitor: the CPU column jumped, then dropped back.
Callipso hooks into Claude Code's lifecycle — a prompt-start hook fires every time a prompt is submitted. That hook triggers a cascade of events: the HTTP server receives the hook payload, the session manager registers or updates the session, the poll applicator diffs the terminal state, IPC sends the update to the renderer, and the renderer rebuilds the UI. Every link in that chain runs code, and some of that code was expensive. But we could not tell which part of the cascade was the bottleneck just by watching the CPU number spike and fall.
We needed to see inside those milliseconds.
The Road to the Right Tool
Getting from "we can see a spike in Activity Monitor" to "the LLM can autonomously diagnose it" took three failed attempts and one breakthrough. Each failure taught us what we actually needed.
Attempt 1: macOS sample command
The LLM tried the macOS sample command — a built-in tool that captures native call stacks for a process over a fixed duration:
```bash
sample <PID> 5 -f /tmp/sample.txt
```
It recorded 5 seconds of stack traces, then parsed the output looking for hot symbols like `fontations_ffi` (font rendering) or `CVDisplayLink` (GPU compositor). This had worked for the idle CPU problem — a constant 109% over 5 seconds is impossible to miss.
But for the prompt-start spike, 5 seconds was too broad. The spike lasted 200-500 milliseconds — 4-10% of the sample window. The hot functions were buried in 4.5 seconds of idle noise, averaged away into insignificance.
Attempt 2: Instruments
We tried Instruments — Apple's visual profiling tool from Xcode. It was actually promising: we could see the spike on the timeline, select just that region, and drill into the call tree for those exact milliseconds. The data was right there.
But the workflow was manual. Open Instruments, record, visually find the spike, select it, read the call tree, copy-paste the relevant functions into the LLM conversation. Every iteration required a human in the loop. For one spike, that is tolerable. For dozens of fix-rebuild-re-profile cycles, it was too slow.
Attempt 3: ps polling + contentTracing — close but flat
We needed the LLM to do everything itself. Two ideas came together:
First, a ps polling loop running every 0.5 seconds to detect when the spike happened:
```bash
# Poll every 0.5s for 20s; print any Electron process above 2% CPU.
for i in $(seq 1 40); do
  ts=$(date +%H:%M:%S)
  ps aux | grep "[E]lectron" | while IFS= read -r line; do
    cpu=$(echo "$line" | awk '{print $3}')
    if [ "$(echo "$cpu > 2.0" | bc)" = "1" ]; then
      printf "%s CPU:%5s%%\n" "$ts" "$cpu"
    fi
  done
  sleep 0.5
done
```
Second, a profiling endpoint using Electron's contentTracing API — the first time we wrote code specifically so the LLM could observe the app's performance:
```bash
curl -s -X POST http://localhost:3110/dev/trace-cpu \
  -H "Content-Type: application/json" -d '{"durationMs":20000}'
```
Run both in parallel. The ps poll finds the spike timing. The profiler captures all function activity during that window. The LLM matches the two and finds the hot functions.
This almost worked. We had the timing, and we had the function names. But every function appeared as a flat list — no parent-child relationships. We could see that spawn() ran and clipboardWatcher.poll() ran, but not that poll() called spawn(). Without the tree, we could identify suspicious functions but could not trace the chain from symptom to root cause.
The Breakthrough: CDP with Call Trees
We replaced contentTracing with CDP (Chrome DevTools Protocol) — the same protocol Chrome DevTools uses internally. Electron exposes it via webContents.debugger. We rewrote the endpoint:
```js
const fs = require('fs');

const dbg = this.mainWindow.webContents.debugger;
dbg.attach('1.3');                                  // attach to the renderer over CDP
await dbg.sendCommand('Profiler.enable');
// Sample the call stack every 100 microseconds — 10,000 samples per second.
await dbg.sendCommand('Profiler.setSamplingInterval', { interval: 100 });
await dbg.sendCommand('Profiler.start');

setTimeout(async () => {
  const { profile } = await dbg.sendCommand('Profiler.stop');
  dbg.detach();                                     // release the debugger when done
  fs.writeFileSync('/tmp/callipso-cpu-trace.json', JSON.stringify(profile));
}, durationMs);
```
CDP gave us what contentTracing could not: hierarchical call trees with parent-child relationships. Every function node has a children array pointing to the functions it called, and a sample count — the number of times the V8 profiler caught that function executing during the recording window.
More samples = more time spent in that function = bigger CPU cost.
Now we had all three pieces working together: ps polling to find the spike timing, the CDP profiler to capture the full call tree, and the ability to run both in parallel. The LLM starts both streams, we use the app, and after 20 seconds it has everything it needs — the exact moment of the spike and the full hierarchy of what was running during it.
How the LLM Reads the Profile
The V8 profiler samples the call stack every 100 microseconds — 10,000 times per second. After a 20-second recording, there are approximately 200,000 samples. Each sample is a snapshot: "at this moment, function X was running, called by function Y, called by function Z."
The LLM reads the profile as structured JSON. It sorts functions by sample count and walks the call tree:
```
clipboardWatcher.poll() — 2,000 samples
└── spawn('pbpaste') — 1,800 samples
    └── (native child process) — 1,750 samples
```
It can now reason: "90% of poll()'s time is spent spawning a shell command. Electron has a native clipboard API that does not spawn a process. Replace spawn('pbpaste') with clipboard.readText()."
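In code terms, that swap looks something like this — a sketch of the before and after, not Callipso's exact source:

```js
// Before (what the profiler caught): every clipboard poll forks a process.
const { spawn } = require('child_process');
function readClipboardViaSpawn(callback) {
  let out = '';
  const child = spawn('pbpaste');          // ~1,800 of 2,000 samples lived here
  child.stdout.on('data', (chunk) => { out += chunk; });
  child.on('close', () => callback(out));
}

// After: Electron's clipboard module reads in-process — no fork, no shell.
const { clipboard } = require('electron');
const text = clipboard.readText();
```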
The two streams meet here. Stream 1 tells the LLM: "the spike happened at 14:32:05." Stream 2 has the full function data for the entire 20-second window. The functions with the most samples around that timestamp are what caused the spike.
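Reading the profile programmatically takes very little code. A minimal sketch of the ranking step, based on the node shape the Chrome DevTools Protocol defines (each node carries a `callFrame`, a `hitCount` of self samples, and the ids of its children):

```js
const fs = require('fs');
const profile = JSON.parse(fs.readFileSync('/tmp/callipso-cpu-trace.json', 'utf8'));

const byId = new Map(profile.nodes.map((n) => [n.id, n]));

// Total samples = self samples plus everything this function called.
function totalSamples(node) {
  return (node.hitCount || 0) +
    (node.children || []).reduce((sum, id) => sum + totalSamples(byId.get(id)), 0);
}

// Rank by self samples — the "more samples = more CPU" rule in action.
const hottest = [...profile.nodes]
  .sort((a, b) => (b.hitCount || 0) - (a.hitCount || 0))
  .slice(0, 10);

for (const node of hottest) {
  console.log(node.callFrame.functionName || '(anonymous)',
    'self:', node.hitCount || 0, 'total:', totalSamples(node));
}
```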
The Autonomous Loop
With both streams working, the optimization loop runs without manual intervention:
1. The LLM starts both streams — `ps` polling and the V8 profiler — via HTTP endpoints in the dev harness.
2. We use the app normally — send a prompt, switch terminals, record voice.
3. Both streams finish. The LLM reads the `ps` output, finds the spike timestamp, reads the profiler output, and scopes to that timeframe.
4. The LLM identifies the hottest functions, reads their source code in the codebase, and understands why they are expensive.
5. The LLM writes a fix, rebuilds the app, and restarts it.
6. Back to step 1 — re-profile to verify the fix worked.
This loop ran dozens of times across two days. Each iteration targeted a specific function that the profiler identified. The LLM did not guess. Every fix was data-driven — backed by sample counts and call trees.
Microsoft shipped a similar loop in Visual Studio for .NET apps, and Meta built KernelEvolve for GPU kernel optimization. Both validate the pattern: LLM + profiler feedback loop = autonomous performance engineering.
What the Multi-Process Technique Reveals
Electron apps have 4+ processes. The V8 profiler only sees JavaScript. But ps sees total process CPU — including native Chromium rendering, GPU compositing, and child process startup.
The gap between what V8 reports and what ps reports reveals an entire category of costs invisible to JavaScript profilers:
| V8 profiler says | ps says | The gap is |
|---|---|---|
| 99.7% idle | 6-8% CPU | Blink layout/paint/composite |
| 98.8% idle | 6-8% CPU | Native spawn(), readFileSync() |
| 95% idle | 33% CPU | Child process model loading |
This technique — profiling multiple processes simultaneously and cross-referencing the gap — is how we found that CSS animations, backdrop-filter, and hidden BrowserWindows were burning GPU cycles that no JavaScript profiler would ever show.
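The cross-reference itself can be computed from the same profile JSON. A sketch (paths illustrative): V8 tags non-JavaScript samples with special frames like `(idle)` and `(program)`, so the JS-busy percentage falls out directly, and the difference from the `ps` number is the native cost.

```js
const fs = require('fs');
const profile = JSON.parse(fs.readFileSync('/tmp/callipso-cpu-trace.json', 'utf8'));

// Samples attributed to V8's special non-JS frames.
const idleIds = new Set(
  profile.nodes
    .filter((n) => ['(idle)', '(program)'].includes(n.callFrame.functionName))
    .map((n) => n.id)
);

const total = profile.samples.length;
const idle = profile.samples.filter((id) => idleIds.has(id)).length;
const jsBusyPct = ((total - idle) / total) * 100;

// e.g. V8 says 0.3% busy while ps reports 7% CPU:
// the missing ~6.7% is native layout/paint/composite work V8 cannot see.
console.log(`JS busy: ${jsBusyPct.toFixed(1)}% — compare against the ps %CPU`);
```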
The Full Tally
Across two days, the loop produced 38+ individual fixes:
| Category | Fixes | Impact |
|---|---|---|
| Renderer idle CPU | 10 | 109% to ~2% |
| DOM diffing (morphdom) | 4 | Eliminated full DOM rebuilds |
| Phantom processes | 2 | ~8% CPU from invisible DevTools renderer |
| Recording window | 5 | Native audio levels, opaque window, frame cap |
| Main process | 8 | -60% main process CPU |
| Native overlay | 3 | Rounded corners via native API, not CSS |
| Startup spike | 1 | Identified as irreducible V8/Chromium cost |
| Transient spike tooling | 1 | Auto-sampling script for startup diagnostics |
One finding deserves special mention: the startup spike (80-92% CPU for ~3 seconds on launch) turned out to be irreducible V8 and Chromium bootstrapping — `_register_external_reference_worker`, `v8::Private::New`, `NSPerformVisuallyAtomicChange`. No app-level fix exists. Knowing when to stop optimizing is part of the process.
What Broke Along the Way
Performance optimization is not free. Three of the changes introduced regressions that took additional commits to fix.
The render hash timing bug. We excluded task text from the render hash to avoid full DOM rebuilds on every tool-use event. But the hash was updated before the actual render happened (inside a requestAnimationFrame callback). If the RAF was dropped — window hidden, timing race — the hash already said "done" and future updates were silently skipped. The DOM got stuck showing stale terminals. Fix: commit the hash after the render succeeds, never before.
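The shape of the fix, sketched (`computeHash` and `renderTerminals` are hypothetical names):

```js
let committedHash = null;

function scheduleRender(state) {
  const hash = computeHash(state);        // hypothetical: hash of render-relevant state
  if (hash === committedHash) return;     // nothing changed since the last *completed* render
  requestAnimationFrame(() => {
    renderTerminals(state);               // hypothetical: the actual DOM rebuild
    committedHash = hash;                 // commit AFTER the render succeeds, never before
  });
}
```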
morphdom stale elements. We replaced innerHTML with morphdom for efficient DOM diffing. morphdom silently failed to remove some elements — stale keyed items stayed in the DOM even though the data said they should be gone. We had to add a post-render integrity check (count DOM elements vs expected count) and a periodic watchdog that self-corrects within 15 seconds.
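The integrity check is simple in sketch form (`terminalList` and `currentTerminals` are hypothetical names):

```js
// Post-render check: does the DOM agree with the data?
function verifyRender(container, expectedCount) {
  const actual = container.querySelectorAll('[data-terminal-id]').length;
  if (actual !== expectedCount) {
    rebuildAllTerminals();                // hypothetical: force a full re-render
  }
}

// Watchdog: even a missed check self-corrects within 15 seconds.
setInterval(() => verifyRender(terminalList, currentTerminals().length), 15000);
```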
Completion timer flicker. morphdom was destroying and recreating timer elements on every diff instead of preserving them. The timer visually flickered. Fix: pre-render the timer into the HTML so morphdom treats it as existing content.
All three regressions were caught within hours and fixed. The net result — 38 targeted fixes with 3 minor regressions, all resolved — is a ratio we would take again.
What We Learned
Vibe coding works until it doesn't. For three months, every feature shipped correctly. The LLM wrote functional code — it just did not write efficient code, because we never asked for efficiency. We never said "cache this" or "debounce that" or "do not animate text with smooth easing." The LLM optimizes for what you ask. We asked for features.
The LLM can fix what the LLM created. The same tool that wrote the inefficient code was able to identify and fix it — once it had observability. The gap was not intelligence. The gap was instrumentation. Without a profiler, the LLM was guessing. With a profiler, it was surgical.
Expose observability as HTTP endpoints. The critical decision was building profiling into the app as HTTP APIs that the LLM can call programmatically. POST /dev/trace-cpu to start recording, GET /dev/trace-cpu to read the results. The LLM never opens Chrome DevTools. It never reads a flame chart. It reads structured JSON with function names, sample counts, and call trees. This is what makes the loop autonomous — the LLM can trigger, read, reason, and act without human intermediation.
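The endpoint itself can be tiny. A sketch using Node's built-in `http` module — `startCpuProfile` is assumed to be the CDP recording function shown earlier, writing its JSON to a known path:

```js
const http = require('http');
const fs = require('fs');

http.createServer((req, res) => {
  if (req.method === 'POST' && req.url === '/dev/trace-cpu') {
    let body = '';
    req.on('data', (chunk) => { body += chunk; });
    req.on('end', () => {
      const { durationMs = 20000 } = JSON.parse(body || '{}');
      startCpuProfile(durationMs);        // assumed: the CDP snippet from earlier
      res.end(JSON.stringify({ recording: true, durationMs }));
    });
  } else if (req.method === 'GET' && req.url === '/dev/trace-cpu') {
    res.setHeader('Content-Type', 'application/json');
    res.end(fs.readFileSync('/tmp/callipso-cpu-trace.json'));
  } else {
    res.statusCode = 404;
    res.end();
  }
}).listen(3110); // the dev-harness port from earlier in this story
```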
Know what a sample is. The V8 profiler checks "what function is running right now?" 10,000 times per second. Functions that appear in many samples used the most CPU. That is the entire mental model. Everything else — call trees, parent-child relationships, self time vs total time — is built on top of that one concept.
Profile all processes, not just one. The gap between what V8 reports and what ps reports is where the native costs hide. CSS animations, GPU compositing, child process spawning — none of these appear in a JavaScript profiler. You need both views to see the full picture.
We went from the Force Quit menu to an autonomous profiling loop in two days. The app went from 109% idle CPU to 2%. If you are vibe coding a large Electron app and you have not looked at Activity Monitor yet — open it. You might be surprised.