Your Brain on Vibe Coding: Flashcards From Your Own Commits

Callipso turns your commits into spaced-repetition flashcards with a voice-based free-recall loop. The learning science behind buying back cognitive debt.

Callipso Team · June 11, 2026 · 13 min read

There is an accusation circulating about vibe coders, usually phrased as a question that is not really a question: are you actually learning anything anymore? The ammunition behind it is the MIT Media Lab study "Your Brain on ChatGPT" — participants who delegated essay writing to an LLM showed the weakest brain connectivity of any group, could not quote from their own essays minutes after writing them, and felt the least ownership of the result. The authors named the pattern cognitive debt, and the term escaped the lab immediately, because every agentic coder recognized themselves in it.

Here is the uncomfortable part: the accusation describes the real workflow accurately. The agents got good. You see a bug, you describe it with zero understanding of its cause, the agent one-shots the fix, the tests pass, you ship. A year ago you at least read the chain of thought; now you mostly do not, because you do not need to. Deep reasoning about the task happened — just not in your head.

So, concisely: what is true in the accusation, and what is wrong. True: delegating a cognitive task measurably reduces your engagement during the task and your memory and ownership of what it produced. The EEG data shows it, the older cognitive-offloading literature predicted it, and denying it is wishful. Wrong: nothing in the data shows your faculties decaying. Low brain engagement while someone else does the work is not damage — it is what delegation looks like, working as designed. What the data actually shows is encoding that never happened: you cannot remember what you never worked to understand. And the study's most overlooked result points straight at the lever: participants who did the effortful work first and brought the LLM in afterward kept strong recall and rich neural activation. The harm is not the tool. It is the ordering — tool-instead-of-effort skips encoding; effort-then-tool consolidates it.

Ordering is something you can engineer. So we built a feature that engineers it back in: a learning loop inside Callipso that turns commits you did not write into concepts you deliberately own, at a cost of about four minutes per concept. The rest of this post is what it does, and why every piece of it is shaped by the learning-science literature rather than by vibes.

The struggle got decoupled from the work

For the entire history of programming, learning came bundled with shipping. Fixing a bug yourself was simultaneously the work — mandatory, you had to deliver — and the struggle that built your mental model, which came along free as a by-product. Nobody ever budgeted time for the struggle, because it was packaged inside labor you had to perform anyway.

Coding agents broke that bundle. The work now gets done without you — which means the struggle is no longer included. It has become a separable, optional, pure-cost investment: something you must deliberately choose to buy, minutes at a time, when the old workflow handed it to you for free.

Most takes on AI and cognition miss this economic structure. The question is not "should you struggle for two hours like before" — that trade is gone and it is not coming back. The question is: what is the highest learning yield per deliberately spent minute, now that the minutes are no longer free? The flashcards tab is our answer to that question.

How seriously to take those studies

Receipts and caveats, in one block. The MIT result is a preprint — Kosmyna et al., 2025, not yet peer-reviewed — with 54 participants, only 18 by the fourth session, testing ChatGPT only, on essay writing only. The "47% drop in brain activity" figure that circulated in the press is a journalistic reading, not a claim in the paper. The complementary survey evidence — Gerlich (2025), 666 participants, AI use negatively correlated with critical-thinking scores, mediated by cognitive offloading — is correlational, and the causal arrow could run either way: minds less inclined to reflect may simply delegate more. None of it demonstrates durable decline; all of it is consistent with skipped encoding. We treat these studies as a warning about ordering, not a verdict on the tool — and the effort-first finding is the part we engineered around.

The loop

There is a new tab in Callipso for this. After an agent lands a commit you did not write, you can point at it and say: teach me this one. The agent already knows two things: what the commit actually did, and what you currently know, because it maintains a running model of your knowledge from every session you have run. It starts with a question, cold: where do you think this one came from? You guess, usually wrong, on purpose; the failed attempt primes the encoding that follows. Then it writes an explanation calibrated to your level, built on analogies from a domain you know deeply. You read those analogies out loud, not in your head. Producing what you read strengthens encoding over silent reading, so the rule is: aloud on input, every time. After about twenty seconds the text hides, your microphone opens, and you dictate everything you can reconstruct from memory. The agent diffs your recall against the source, tells you exactly which piece you dropped, and you go again. Two or three rounds, usually under four minutes. Only then does it mint spaced-repetition flashcards, which you can edit or delete, and schedule them across the coming weeks.

ONE CONCEPT · ~4 MINUTES
PRETEST: guess first, fail
EXPLANATION: your level, your analogies
READ ALOUD ~20s: then the text hides
DICTATE RECALL: free recall, by voice
GAP MIRROR: "you lost why it's O(1)", not a re-read
(2–3 rounds)
CARDS → SPACED REVIEWS: you edit or delete; the schedule does the rest

Cards are retrieval cues. The loop is what gives them something to retrieve.

Two decks live in the tab. A bundled vocabulary deck covers the floor every vibe coder walks on without naming it — what a radio button is versus a combo box versus a split button, what a toast is, what a modal owes you. And a personal deck, generated on demand from your own repository: the ring buffer the agent put in your recording pipeline, the reason your state manager follows domain-driven design, why the bug from this morning was a race and what class of race it was. The cards are anchored in commits you shipped, in an application you run every day.
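
For readers who have not met the ring buffer used as a running example in this post, here is a minimal Python sketch. It is an illustration only, not the code in Callipso's recording pipeline, and the class name `RingBuffer` and its methods are hypothetical. A write stores one item at the head index and advances that index modulo the capacity, so its cost is constant however full the buffer is, and the oldest item is overwritten once the buffer wraps.

```python
class RingBuffer:
    """Fixed-capacity buffer that overwrites its oldest item when full (illustrative)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity
        self._head = 0   # index where the next write lands
        self._size = 0   # number of valid items, never exceeds capacity

    def write(self, item) -> None:
        # O(1): one assignment and one modulo step, no shifting or resizing.
        self._slots[self._head] = item
        self._head = (self._head + 1) % len(self._slots)
        self._size = min(self._size + 1, len(self._slots))

    def snapshot(self) -> list:
        """Return the stored items from oldest to newest."""
        start = (self._head - self._size) % len(self._slots)
        return [self._slots[(start + i) % len(self._slots)] for i in range(self._size)]


buf = RingBuffer(3)
for sample in ["a", "b", "c", "d"]:
    buf.write(sample)
print(buf.snapshot())  # ['b', 'c', 'd']
```

The fourth write lands in slot 0 and replaces "a". That is the wrap-around behavior a personal-deck card might ask you to explain.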

The recall loop sits between the explanation and the cards, and it is deliberately not skippable by default. We evaluated the fully frictionless version — agent knows your knowledge state, you say "make me cards," cards appear, you never read an explanation at all. We rejected it, and not on vibes: a flashcard is a retrieval cue for something already encoded. Skip the encoding and the card retrieves nothing; you end up memorizing a word-chain detached from meaning, fragile and useless for transfer. The loop is the encoding event. The cards only maintain what it deposits.

Why each piece is shaped this way

| Design decision | Mechanism it recruits | Key evidence |
| --- | --- | --- |
| Guess the cause before the explanation | Pretesting effect | Richland, Kornell & Kao 2009; Pan & Sana 2021 (d ≈ 0.30 over post-study testing) |
| Explanation calibrated to your knowledge model | Prior-knowledge anchoring | Pretests potentiate most when material connects to existing schemas |
| Analogies from a domain you master | Elaborative encoding, self-reference | Deep schemas give new concepts more to bind to |
| Read the explanation aloud, not silently | Production effect (encoding) | MacLeod et al. 2010; Forrin & MacLeod 2017: aloud > silent, self-production does the work |
| Read ~20s, then the text hides | Free recall (hardest retrieval form) | Roediger & Karpicke 2006: retrieval beats re-study at a delay |
| Dictate the recall by voice | Output speed + low transcription load | Speaking outruns typing; writing taxes working memory with spelling codes |
| Agent mirrors the gap, not the full text | Corrective feedback on errors | Failed retrieval only pays when followed by correction |
| 2–3 rounds in one sitting | Repeated retrieval | Karpicke & Roediger 2008: repeated retrieval drives long-term retention |
| Cards reviewed over days and weeks | Spacing effect | Cepeda et al. 2006 meta-analysis: spaced beats massed |
| You edit, add, or delete the generated cards | Generation effect (partial recovery) | Slamecka & Graf 1978; Bertsch et al. 2007 (d ≈ 0.40) |

Every row of that table is load-bearing. Here is each one, in order.

The pretest comes first because failing is an encoding event. Before the agent explains anything, it asks you the question cold: this bug — where do you think it came from? You will usually be wrong. That is the point. In Richland, Kornell and Kao's experiments, attempting to answer questions before reading a text improved later memory over spending the same time studying — even on items the pretest failed to retrieve. Pan and Sana's direct comparison across four experiments found pretesting beat post-study retrieval practice with an advantage around d = 0.30 at delays up to 48 hours. A wrong guess costs twenty seconds and primes the encoding that follows. It also punctures the fluency illusion: a question you just failed cannot lie to you about what you know.

The explanation is calibrated to your knowledge model because encoding needs an anchor. New material sticks in proportion to how much existing structure it can attach to — and the pretesting literature carries the matching caveat: prequestions potentiate learning most when the material is semantically related to what you already know. The agent's running model of your level exists to aim the explanation at that attachment surface — nothing re-explained that you already own, nothing assumed that you lack. The honest flip side: for a concept entirely foreign to you, with no schema to hook onto, every technique on this page yields less and the loop takes more rounds. That is not a bug in the method; it is what learning something genuinely new costs.

The analogies come from a domain you master because elaboration is binding. Translating a ring buffer into the vocabulary of a domain you know deeply — music theory, kitchen workflow, whatever your native expertise is — is elaborative encoding: the new concept inherits the dense web of associations the old domain already carries. And because the cards are anchored in your own commits, in an application you run daily, they recruit the self-reference effect — material processed in relation to yourself is remembered better than neutral material. One warning, expanded below: the analogy must sit at the edge of what you can follow, not at its comfortable center, or it does the understanding for you.

You read the explanation aloud, because production is encoding. Reading aloud yields measurably better memory than reading silently — the production effect (MacLeod, Gopie, Hourihan, Neary & Ozubko, 2010). The benefit holds across modes: vocalizing tops writing and typing, all of which top silent reading. And Forrin and MacLeod (2017) showed reading aloud even beats hearing a recording of your own voice — so it is the act of self-production, not the sound, that does the work. The standard account is distinctiveness: an articulated item carries extra dimensions — the motor act of saying it, the sound of your own voice saying it — that make its memory trace easier to find later. Your microphone is about to open for the recall phase anyway; spending the twenty-second read speaking instead of skimming is free.

The text hides because free recall is the strongest test there is. Hiding the explanation is what converts reading into retrieval. Free recall — reproduce everything you can, no cues, any order — is the most demanding form of retrieval practice, and retrieval is the family of strategies that reliably beats re-study at a delay (Roediger and Karpicke, 2006). The hiding is enforcement, not theater: with the text gone, re-reading — the low-yield strategy every learner defaults to because it feels productive — is physically off the table.

Voice is the recall channel because the bottleneck is transcription, not memory. The production effect earned its keep at the read-aloud step — it is an encoding-side finding, and we will not over-claim it for output. The output-side argument is more mundane and more decisive: speaking is several times faster than typing, so a fixed recall window captures more of what you actually know; and writing loads working memory with spelling and motor planning that dictation does not, so the keyboard siphons effort away from remembering into transcribing. Callipso already had the entire local STT stack — the same pipeline that routes your voice to terminals opens for recall, and your dictation lands as text the agent can diff.

The agent mirrors the gap instead of re-showing the text. After your recall, the agent does not redisplay the explanation. It tells you precisely what you dropped: you reconstructed the wrap-around behavior but lost why writes stay O(1). The pretesting literature is unambiguous that unsuccessful retrieval only produces its benefit when followed by corrective feedback — and a targeted correction of the missing piece is worth more than a full re-read, because re-reading is exactly the passive activity the testing effect outperforms.
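
To make the gap mirror concrete, here is a deliberately crude Python sketch. It assumes the explanation has already been reduced to a short list of key points, each with a few trigger words, and it reports only the points the recall never touched. The names `KeyPoint` and `find_gaps` are hypothetical, and plain keyword matching is a stand-in for illustration. The post does not describe how Callipso's agent actually compares a recall against its source.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyPoint:
    label: str           # what the learner is told if this point is missing
    keywords: frozenset  # any one of these in the recall counts as covered


def find_gaps(key_points, recall_transcript):
    """Return labels of key points the recall did not touch (crude keyword match)."""
    words = set(re.findall(r"[a-z0-9()]+", recall_transcript.lower()))
    return [kp.label for kp in key_points if not (kp.keywords & words)]


points = [
    KeyPoint("wrap-around behavior", frozenset({"wrap", "wraps", "overwrite", "overwrites"})),
    KeyPoint("why writes stay O(1)", frozenset({"constant", "o(1)"})),
]
recall = "The buffer has a fixed size and the head wraps to the start when it reaches the end."
print(find_gaps(points, recall))  # ['why writes stay O(1)']
```

The design point is the return value. The learner gets back the one missing piece, not the full explanation, so the next round is another retrieval attempt and not a re-read.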

You go two or three rounds because retention tracks repeated retrieval. Karpicke and Roediger's 2008 Science paper isolated the variable: once material can be recalled, further studying adds almost nothing — further retrieval is what cements it. Each round is also a fresh production event: you re-encode what you just reconstructed, in your own words, in your own voice. The rounds stop when your recall covers the concept, not at a fixed count.

The cards run on a spaced schedule because spacing is the most replicated effect in the field. Cepeda and colleagues' 2006 meta-analysis, spanning hundreds of comparisons, found spaced practice beating massed practice for long-term retention essentially always. The in-session rounds carve the trace; the scheduler then lets it decay just enough that the next review is effortful again — the desirable difficulty, in Bjork's term, that re-strengthens it. Massing feels better and performs worse; the scheduler exists so you never have to fight that illusion yourself.
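
The post does not say which scheduling algorithm Callipso uses, so the following Python sketch assumes the simplest expanding-interval rule: double the gap after a successful review, reset it to one day after a failure. The `Card` fields, the `review` function, and the one-day starting interval are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Card:
    prompt: str
    due: date
    interval_days: int = 1  # assumed starting gap, not a Callipso setting


def review(card: Card, today: date, recalled: bool) -> Card:
    """Reschedule a card after one review (assumed doubling rule)."""
    # Success widens the gap so the next retrieval is effortful again;
    # failure resets it so the trace is rebuilt before it is spaced out.
    card.interval_days = card.interval_days * 2 if recalled else 1
    card.due = today + timedelta(days=card.interval_days)
    return card


day = date(2026, 6, 12)
card = Card("Why do ring-buffer writes stay O(1)?", due=day)
for outcome in [True, True, False, True]:
    review(card, day, outcome)
    print(day, "->", card.due, f"(interval {card.interval_days}d)")
    day = card.due  # simulate reviewing exactly on the due date
```

Real schedulers usually adjust the growth factor per card based on review history. The fixed doubling here is only meant to show the shape of the idea: gaps grow while recall succeeds and collapse when it fails.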

You edit the cards because generation is the one effect we had to claw back. Self-generated material is remembered better than received material — the generation effect, d ≈ 0.40 in Bertsch's meta-analysis — and in our pipeline the model writes the cards, which surrenders that benefit at the authoring step. But the literature cuts both ways: learners who write their own questions often test themselves on the wrong material, missing what matters. The hybrid keeps both halves — the model proposes with full coverage of the concept, and the act of rejecting, rewording, or sharpening a card is a small generation event the human keeps. Answering the card later is, of course, full retrieval practice either way.

What flashcards cannot do

Honesty section, because this is where most "AI learning tool" claims quietly overreach.

Cards are excellent for declarative knowledge — what a ring buffer is, what idempotency means, the vocabulary layer. They are structurally weak for conditional knowledge — recognizing the situation where a ring buffer is the right call, which is a different mental object acquired through repeated confrontation with cases, not through recitation of definitions. Manu Kapur's productive-failure research is the sharpest statement of the gap: in his studies, the number of solution attempts students generated while struggling with a problem before instruction predicted their later conceptual understanding (r ≈ .82) and transfer (r ≈ .88) — correlations that simply did not exist in the direct-instruction condition. The depth lives in the struggle, and a four-minute loop is a compressed, partial substitute, not the thing itself.

Our card generator is therefore instructed to bias toward the conditional layer wherever the source material allows: not "what is a ring buffer" but "this recording bug — what class of error was it, and what symptom gives that class away next time?" For a commit, the highest-value card is rarely the syntax of the fix; it is the recognition pattern. We still tell you plainly: if your goal is to write these systems unaided, this feature will not get you there. If your goal is what we would call orchestrator literacy — knowing enough to direct agents, evaluate their output, and decide what to ask for — the vocabulary-plus-recognition layer is most of the job, and it is the layer this loop serves well.

There is also a calibration trap we engineered against. An agent that knows your level can make analogies too smooth — and a perfectly smooth analogy does the understanding for you, reintroducing the fluency illusion through the back door. The generation prompt targets the edge of your model of the domain, not its comfortable center: the analogy you can just barely follow, in Bjork's sense of a desirable difficulty.

Does AI-assisted coding actually make you dumber?

The defensible answer in one block: the evidence shows that delegating a cognitive task to an AI reduces mental engagement during that task and reduces retention and ownership of its output — demonstrated by EEG in the MIT Media Lab preprint and correlated with weaker critical-thinking scores in survey work, most strongly among younger, heavier users. The evidence does not show durable, irreversible decline; the studies are short-term, partly correlational, and the headline study is an unreviewed preprint with 54 participants. The pattern they do support is older than AI — cognitive offloading, the Google effect, GPS and spatial memory: a tool that removes friction also removes the exercise the friction provided. The actionable variable is ordering. Effort-first then tool consolidates; tool-instead-of-effort skips encoding. Vibe coding defaults you into the second configuration. This feature exists to let you buy back the first one, a few minutes per concept, for the concepts you choose.

Why not learn everything? Because exposure is the curriculum

A year of vibe coding exposes you to an enormous conceptual surface — the way a child is immersed in spoken language for years before writing a word. Exposure is real input; it builds familiarity and a map of what exists. But the linguistics literature itself warns against stopping there: comprehensible input alone is contested, and Swain's output hypothesis argues the strongest learning happens when you are forced to produce and notice your own gaps. Immersion tells you a concept exists. Production tells you whether you own it.

So the tab is deliberately a pull system, not a curriculum. The agent never assigns homework. You decide — per commit, per concept — whether this one deserves a slot in your memory or remains safely delegated. Deciding not to learn something is a legitimate move; knowing what to leave delegated is itself an orchestration skill. The feature's job is to make the cost of the deliberate choice as low as the science allows: a pretest, a calibrated explanation, two or three rounds of spoken free recall, and a deck that quietly does the spacing for you. It sits alongside Rewind Time in the same product thesis — agents took the labor; what is left to engineer is the human's attention, memory, and judgment. You can try both in the current build of Callipso.

The cards are just the receipt. The four minutes are the purchase.
