Prototyping a language-learning game helped me understand Spec Kit

Building a language-learning game with GitHub's Spec Kit showed me where spec-driven AI development helps, and where it's too heavy for prototyping.

If you’ve been job hunting lately, you already know the question. Somewhere between the portfolio review and the culture chat, someone asks what you think about AI and how you fold it into your design work. I wanted an honest answer, not a rehearsed one. Finding it pushed me into deeper experimentation than building my own app had ever forced on me.

The comfort of building alone

Working on your own project is comfortable in a way that hides things. Nobody challenges how you build. No colleague stress-tests your choices in review.

For me, that meant I’d been spot-using AI: reaching for it when a task interested me, ignoring it when I’d rather do the work myself. And it’s easy to justify. AI still isn’t as sharp as a skilled human, and when it goes off the rails, I can step in and take over. That convenience is exactly what kept me from learning what the tool can actually do.

Learning a tool by pushing it to its edge

So I set myself a constraint: make something genuinely useful without touching the design or the code myself. I want to be clear that I don’t think humans belong outside the loop. This wasn’t a bet on hands-off AI. It was a way to find the edges.

I learned this lesson in design school. Photoshop is a poor tool for page design, and I know that because I spent a semester trying to lay out pages in it. Illustrator is a poor tool for book and magazine layouts, and I know that because I tried that too. Pushing a tool past where it’s comfortable is how you map its limits. That’s the spirit I brought here.

There’s an even more extreme version of this that I skipped. Some engineers now orchestrate whole fleets of agents by voice, using dictation tools like Wispr Flow to direct them and have them check each other’s work. That’s a frontier I’d like to test someday, but my focus is prototyping, so I started somewhere smaller.

What I wanted to learn

What I really wanted to understand was what good, reliable input to an AI looks like. I didn’t want vibe-coded guesses. I wanted prototypes that met specific needs.

For that I leaned on Spec Kit, GitHub’s open framework for writing detailed specifications that an AI agent can build from. I’d watched other developers rely on it, and I wanted to see how well it held up for real prototyping work.

The experiment

I kept the surface area deliberately small. I set up the repository the way I always do, with React Router, TypeScript, and the usual stack, so that the only new variable was Spec Kit itself. Then I specified the game and how it should work, one Spec Kit stage at a time.

I also fenced off the UI. I told the AI to build with shadcn, a set of ready-made, accessible React components, and required it to use those instead of inventing its own. I’m under no illusion that LLMs are good at designing components from scratch right now. If they were, tools like Lovable probably wouldn’t exist. Constraining the visual building blocks kept the experiment from snagging on a weakness I already knew about, so I could stay focused on what I was actually testing: the quality of the input.

The game is called Blanko, a fill-in-the-blank tool for practicing spelling and recall in a foreign language. You paste in a passage, pick a difficulty, fill in the missing words, and get scored, with every word you missed shown next to its correct spelling. The whole thing lives in a public repo, including everything Spec Kit fed the AI: the constitution, the spec, the clarifications, and the plan. You can also play the live build.

Screenshot: Blanko start screen (paste passage + difficulty). Paste in text in your target language and choose your difficulty.

Screenshot: Blanko game screen (fill-in-blanks). Fill in the blanks.

Screenshot: Blanko results screen (score + missed words). Review your work. Your game session is kept on your device to review later.
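To make the paste, blank, fill, and score loop concrete, here is a minimal TypeScript sketch of how it could work. It is not the code in the Blanko repo. Every name in it (makeBlanks, scoreAttempt, BLANK_RATIO) and the blanking percentages are assumptions for illustration, and it skips details a real build would need, such as random word selection and punctuation handling.

```typescript
// Hypothetical sketch: none of these names or ratios come from the Blanko repo.
type Difficulty = "easy" | "medium" | "hard";

// Assumed share of words to hide at each difficulty.
const BLANK_RATIO: Record<Difficulty, number> = {
  easy: 0.2,
  medium: 0.4,
  hard: 0.6,
};

interface Token {
  text: string;
  blank: boolean;
}

interface Miss {
  given: string;
  correct: string;
}

// Split a passage on whitespace and mark evenly spaced words as blanks.
function makeBlanks(passage: string, difficulty: Difficulty): Token[] {
  const words = passage.split(/\s+/).filter((w) => w.length > 0);
  const target = Math.round(words.length * BLANK_RATIO[difficulty]);
  const step = target > 0 ? words.length / target : Infinity;
  const blankIndexes = new Set<number>();
  for (let i = 0; i < target; i++) {
    blankIndexes.add(Math.floor(i * step));
  }
  return words.map((text, i) => ({ text, blank: blankIndexes.has(i) }));
}

// Compare answers to the hidden words in order, ignoring case and outer spaces.
// Each miss keeps the player's answer next to the correct spelling.
function scoreAttempt(
  tokens: Token[],
  answers: string[],
): { correct: number; total: number; misses: Miss[] } {
  const blanks = tokens.filter((t) => t.blank);
  const misses: Miss[] = [];
  blanks.forEach((token, i) => {
    const given = (answers[i] ?? "").trim();
    if (given.toLowerCase() !== token.text.toLowerCase()) {
      misses.push({ given, correct: token.text });
    }
  });
  return { correct: blanks.length - misses.length, total: blanks.length, misses };
}

// Ten words on "easy" hides two of them: "el" and "casa".
const demoTokens = makeBlanks("el gato duerme en la casa grande de mi abuela", "easy");
const demoResult = scoreAttempt(demoTokens, ["el", "caza"]);
console.log(demoResult.correct, demoResult.total, demoResult.misses);
```

Even a sketch this small raises the questions a clarify step is meant to force: whether accents count as spelling errors, and whether blanks should be random or evenly spaced.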

How the process felt

Spec Kit structures the work as a sequence of stages, each producing a durable artifact before any code gets written: constitution, then spec, then clarify, then plan, then implement. Walking through it felt like sitting in a good planning session with skilled engineers and product managers. Structured. Heavyweight. Comfortable.

That comfort is worth examining, because I’ve got 20 years of design and engineering behind me, and the detail is reassuring to someone like me. The code was rarely the hard part on the teams I’ve worked with. Writing clear specs was always the thing that tripped us up. In that light, Spec Kit didn’t hand me something new. It wrapped the work we’ve always done as designers and developers into a process you can talk through with an AI agent, on your own.

The planning stages were the strongest part. Forcing ambiguity to the surface before any code exists, especially during the clarify step, is where this approach earned its keep. Implementation is where it got rougher and needed more steering. Even with shadcn written into the constitution, the first pass drifted into its own visual design, and I had to push it back onto the components I’d asked for. That’s a small but telling example: a constraint living in the spec isn’t the same as a constraint the model honors on the first try. My one small gripe is discoverability: the stages aren’t obvious up front, though the documentation clears it up quickly, if you’re the kind of person who reads documentation.

Too heavy for a prototype

My honest conclusion is that Spec Kit is heavier than a prototype needs. It’s genuinely useful for thinking an idea all the way through, and I’d reach for it on a project I intended to ship and maintain.

But the prototype I set out to build didn’t need that level of detail. To test that, I gave the same brief to four web-client setups and let each build its own version without Spec Kit: ChatGPT, Claude, Claude told not to use its built-in frontend design skill, and Google AI Studio. Each got the identical prompt unless noted. The Spec Kit version did run better, yet any of these would have served as a workable prototype.

Screenshot: Blanko built with ChatGPT (web client, no Spec Kit)

Screenshot: Blanko built with Claude (web client, no Spec Kit)

Screenshot: Blanko built with Claude, told not to use its built-in frontend design skill (web client, no Spec Kit)

Screenshot: Blanko built with Google AI Studio (web client, no Spec Kit)

Why you might not need all that detail

I also suspect that, as an engineer, you don’t need to spell out that much for a modern model. Spec Kit arrived alongside an earlier generation of LLMs, and today’s models are very good at inference. They draw on your existing codebase, your documentation, and their own understanding of how software gets written.

With solid patterns in place and a stronger opening prompt, I think you can capture most of Spec Kit’s benefit with far fewer markdown files to manage.

A lighter alternative

So I started building toward that lighter approach. I wrote a small issue-auditing skill that scores prompts across five dimensions: success definition, assessment rubric, scope boundary, context and inputs, and a quality gate. The idea is to catch a thin, under-specified brief before the agent fills the gaps with bad inferences.
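As a rough illustration of what scoring across those five dimensions could look like, here is a small TypeScript sketch. It is not the skill itself, which this post doesn’t reproduce. The 0 to 2 rating scale, the readiness threshold, and the names (auditBrief, BriefRatings) are all assumptions, and the ratings are passed in by hand where a real skill would presumably have the model judge the brief.

```typescript
// Hypothetical sketch: the scale, threshold, and names are assumptions,
// not the author's skill and not Zietsman's framework.
const DIMENSIONS = [
  "successDefinition",
  "assessmentRubric",
  "scopeBoundary",
  "contextAndInputs",
  "qualityGate",
] as const;

type Dimension = (typeof DIMENSIONS)[number];
type Rating = 0 | 1 | 2; // assumed scale: 0 = missing, 1 = thin, 2 = clear
type BriefRatings = Record<Dimension, Rating>;

interface AuditResult {
  total: number;
  max: number;
  gaps: Dimension[]; // dimensions the brief leaves out entirely
  ready: boolean;
}

// Assumed bar: nothing missing, and at least 7 of 10 points overall.
const READY_THRESHOLD = 7;

function auditBrief(ratings: BriefRatings): AuditResult {
  const total = DIMENSIONS.reduce<number>((sum, d) => sum + ratings[d], 0);
  const gaps = DIMENSIONS.filter((d) => ratings[d] === 0);
  return {
    total,
    max: DIMENSIONS.length * 2,
    gaps,
    ready: gaps.length === 0 && total >= READY_THRESHOLD,
  };
}

// A brief with a clear goal and context, but no rubric and no quality gate.
const thinBrief = auditBrief({
  successDefinition: 2,
  assessmentRubric: 0,
  scopeBoundary: 1,
  contextAndInputs: 2,
  qualityGate: 0,
});
console.log(thinBrief.ready, thinBrief.gaps);
```

The list of gaps is the useful output here. It names what to add to the brief before an agent starts guessing.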

The dimensions aren’t mine. They come from Christo Zietsman’s work on what separates a real specification from a bare instruction. In his study of AI governance prompts, Zietsman treats these prompts as “executable specifications” and shows, across a corpus of real files, how often they’re missing the parts that make an agent’s output checkable.

An agent working from an instruction can satisfy it in almost any plausible way. An agent working from a specification has explicit criteria it can be measured against.

— Christo Zietsman, Structural Quality Gaps in Practitioner AI Governance Prompts (2026)

Two caveats matter. The paper is a preprint and hasn’t been peer reviewed. And I don’t think we truly understand yet how to structure input so an AI reliably produces what we intended. On that last point, Zietsman is careful himself: his framework checks whether a brief is well-formed, not whether it’s correct. I treat the five dimensions as a good first step, not a finished answer.

Last thoughts

One last thing, almost an aside. Blanko was my wife’s idea. I had a functional prototype of it in under two hours, and that included learning Spec Kit, setting up the repo, prompting, and reading through the documentation it generated.

Speed was never the point of the experiment. But that kind of speed changes what’s possible. It buys you room to try three directions before lunch instead of committing to one and hoping.

If you take anything from this, make it a question for your own work. Where are you still writing heavy, detailed instructions that your tools have quietly outgrown? Strip one back, watch what the model infers on its own, and adjust from there. You may find your less guided prompt works just fine.