
Building a D&D Dungeon Master out of Claude: Teaching It to Draw

Lotte-Sara Laan

Updated June 26, 2026

11 minutes

Before and after: the same prompt with and without explicit gender and age constraints. Same three characters, same prompt, same reference image. Left: no explicit gender or age constraints, and the party comes back as two women and a child. Right: with the MALE and not a child lines added. This whole post is about that gap.

The first time I generated a scene with Corrin in it, I got back a sweet-looking little boy in adventuring gear.

I tried again. Child. Again. Child. Every single time.

Corrin is a halfling — short, around three feet tall, but a weathered adult who's seen things. The model knew exactly one thing that's three feet tall with a human face: a kid. So that's what it kept drawing.

In Part 1 I built a Dungeon Master out of Claude. It ran the game, tracked the rules, and wrote each session up as a story. Then I wanted pictures. For the generation itself I reused an off-the-shelf skill (an open-source ai-image-creator that routes prompts to models like Gemini and FLUX). The easy part was wiring it in. The hard part was learning to write prompts that actually produced what I meant. I had character portraits already. I figured I'd hand the image model a portrait, describe the scene, and get back a picture of that character in that scene.

That is not how image models work. I learned this the way I learn most things: by trying, breaking it, and seeing what other people had already figured out.

The first surprise: the halflings were children

The fix for Corrin was more words:

clearly middle-aged, weathered face, deep-set knowing eyes,
a man in their forties at least, not young, not a boy or child.

None of that should be necessary if the reference and the description "just worked." But the model isn't matching your reference. It's reaching for its nearest concept and you have to push it off that concept by force.

The second surprise: my male characters kept coming back as women

One of the player characters is Lylnyler, a male half-elf. I had a portrait of him. I put that portrait in as a reference, asked for a scene, and got back a perfectly nice picture of a female elf.

I tried again. Female elf. I tried a third time and got a man, then regenerated to tweak the lighting, and he was a woman again.

This was baffling at first. I had given it a picture. A clear, male picture. How do you look at a reference photo of a man and draw a woman?

But that's the thing: the reference image is not an instruction. It's a suggestion, weighed against everything the model already believes. And the model has clearly seen a great many elves, a lot of them feminine. Its prior for "elf" leans female, and that prior was quietly outvoting the photo I handed it, especially on regeneration.

The fix is now a literal line in my image instructions, and I find it slightly funny that this is what it took:

MALE [race] — masculine features, MALE face, NOT female.

Shouting "NOT female" at a computer feels ridiculous. But it works, because it stops being a gentle hint and becomes an explicit constraint that can hold its own against the prior. The picture is a suggestion. The words are the argument.

The pattern across both: image models carry strong priors, and those priors will quietly overrule your explicit intent unless you push back, hard and specifically. This is the same lesson as the text side from Part 1, in a different medium. The model fills the gap with whatever it already believes, and your job is to leave it no gap to fill.
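If you build prompts in code, a constraint like this can be generated rather than remembered. What follows is a minimal sketch, not the pipeline's actual implementation: the function name `identity_constraints` and its parameters are hypothetical, and the wording of the generated lines is adapted from the two fixes above.

```python
def identity_constraints(race: str, gender: str, adult: bool = True) -> str:
    """Build blunt, explicit constraint lines for one character.

    Hypothetical helper: it only formats text and calls no image model.
    """
    opposite = {"male": "female", "female": "male"}
    features = {"male": "masculine", "female": "feminine"}
    g = gender.lower()
    if g not in opposite:
        raise ValueError(f"unsupported gender: {gender!r}")

    # State the gender twice and negate the prior explicitly.
    lines = [
        f"{g.upper()} {race}, {features[g]} features, "
        f"{g.upper()} face, NOT {opposite[g]}."
    ]
    if adult:
        # Short adult characters otherwise tend to come back as children.
        lines.append("clearly an adult, not young, not a child.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(identity_constraints("halfling", "male"))
```

The point of the design is that the constraint is attached to the character, not to the scene, so it cannot be forgotten on a regeneration.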

Reference images don't behave the way you'd think

Once I stopped expecting the portrait to do the work, I started learning how references actually behave. Almost none of it was intuitive.

More references make the faces worse, not better. Past about four reference images, the face-matching dilutes and everyone starts to blur toward a generic average. If one character's face really has to match, I put their reference first and describe the rest from text.
A character facing away still gets a face reference — but the prompt has to win the argument. The reference tells the model who the character is; the prompt has to explicitly tell it where they're looking. "Her back is toward the party, NOT facing the camera" has to be in there, or the model will twist them around to show the face anyway. The reference says who. The composition instructions say where.
Rooms have to be generated empty. If I use a busy location image as a reference, the model copies the people who were already in it. So locations get generated as empty rooms first, then populated.
Props become the subject if you let them. List a magic item early in the prompt and the model makes it the hero of the image. Small things have to be described as small, late, and explicitly minor, or they take over the frame.
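Two of these rules are mechanical enough to sketch in code: put the must-match face first and cap the reference list, and push props to the end of the prompt with explicit "minor" wording. This is an illustration under assumptions, not the real agent. `Reference`, `pick_references`, `assemble_prompt`, and `MAX_REFERENCES` are hypothetical names, and the cap of four is the rule of thumb from this post, not a documented model limit.

```python
from dataclasses import dataclass

MAX_REFERENCES = 4  # rule of thumb from this post, not a model-documented limit


@dataclass
class Reference:
    name: str
    image_path: str
    must_match: bool = False


def pick_references(refs: list[Reference]) -> tuple[list[Reference], list[Reference]]:
    """Return (kept, dropped): must-match references first, capped at MAX_REFERENCES.

    Dropped characters should be described in text only.
    """
    # sorted() is stable, so the original order is preserved within each group.
    ordered = sorted(refs, key=lambda r: not r.must_match)
    return ordered[:MAX_REFERENCES], ordered[MAX_REFERENCES:]


def assemble_prompt(scene: str, composition: str, minor_props: list[str]) -> str:
    """Scene first, composition second, props last and explicitly minor."""
    parts = [scene, composition]
    for prop in minor_props:
        parts.append(f"Small background detail, not the focus: {prop}.")
    return "\n".join(parts)


if __name__ == "__main__":
    party = [
        Reference("Corrin", "corrin.png"),
        Reference("Lylnyler", "lylnyler.png", must_match=True),
    ]
    kept, dropped = pick_references(party)
    print([r.name for r in kept])
    print(assemble_prompt(
        "The party stands in a stone chamber.",
        "Her back is toward the party, NOT facing the camera.",
        ["a goldfish bowl on the desk"],
    ))
```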

The single change that helped most was this: every reference image gets a written description that travels with it. For each character, location, and prop, I keep a short text file next to the image that describes what it is. "A male half-elf with dark messy hair, sharp pointed ears, dark studded leather armor with plates and buckles across the chest." Whenever I generate a new scene with that character in it, I paste that description into the prompt alongside the reference image, word for word.

It sounds redundant. You're giving it the picture and a description of the picture. But it's the difference between "here's a photo, good luck" and "here's a photo, and here is exactly what to take from it." The image shows the model a face; the words tell it which parts are load-bearing. Left to itself, the model might decide the important thing about the reference is the lighting, or the pose, or the background, and quietly drop the pointed ears. The paired description pins down what actually has to carry over. The reference says who; the description says what matters.
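The pairing can be enforced rather than left to discipline. Here is a minimal sketch, assuming each image has a sidecar text file with the same name and a `.txt` extension (for example `lylnyler.png` next to `lylnyler.txt`). That naming scheme and the function names are assumptions for illustration; the post only says the description lives in a short text file next to the image.

```python
from pathlib import Path


def load_description(image_path: Path) -> str:
    """Read the sidecar description for a reference image, verbatim.

    Assumes the description lives at the same path with a .txt suffix.
    """
    sidecar = image_path.with_suffix(".txt")
    if not sidecar.exists():
        # Fail loudly: a reference without its description is the
        # "here's a photo, good luck" case this is meant to prevent.
        raise FileNotFoundError(f"no description file for {image_path.name}")
    return sidecar.read_text(encoding="utf-8").strip()


def prompt_with_references(scene: str, image_paths: list[Path]) -> str:
    """Prefix the scene prompt with one described line per reference image."""
    blocks = [
        f"Reference {i} ({path.name}): {load_description(path)}"
        for i, path in enumerate(image_paths, start=1)
    ]
    return "\n".join(blocks + [scene])
```

Refusing to build a prompt when the description is missing is a deliberate choice: it turns a silent quality problem into an immediate error.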

None of these are things I'd have guessed. They're things I found out by generating a bad image, staring at it, and working out what the model had actually heard.

A worked example: one image, five tries

Let me show you the whole process on a single picture, because the picture is the cover of Part 1 and it fought me the entire way. The scene: the party stands in the crime lord Xanathar's lair. Xanathar is a floating ball of eyes the size of a draft horse. He keeps a pet goldfish on his desk. My son's character, the warlock Lylnyler, ignores the monster completely and stares at the fish.

Simple enough to describe. Here is what it took to actually get it.

Tries one and two: he looked away. I asked for Lylnyler staring at the fish. I got Lylnyler staring off into the middle distance. I added "looking down at the bowl," and he looked down, but still not at anything. The model kept turning his face toward the camera, because a visible face is what it defaults to. The instruction said "look at the fish." The prior said "show me your good side."

Try 2: asked him to look at the fish; he looks past it. The model favors a camera-friendly face over the eyeline I asked for.

Try three: it duplicated the furniture. So I changed tactics. Instead of describing his gaze, I described the camera: move to the side, show him in profile, put the bowl in front of him. The model did not move the camera. Instead it added a second desk with a second fishbowl right in front of him, so that he'd have something to look at without anyone having to re-block the scene. It solved my request in the laziest geometrically valid way it could find.

Try 3: I asked it to move the camera. It cloned the desk and bowl instead. The model takes the path of least resistance.

Try four: right pose, wrong staging. I got him standing in profile, looking at a bowl. But I'd been describing the scene wrong the whole time. In the actual scene he's standing across the room, looking at the distant desk where the fish and the monster both are. I'd had him sitting at the desk with the bowl under his nose. My fault, not the model's, but worth saying: half of these failures were me not specifying what I actually meant.

Try 4: correct gaze, wrong blocking. This one was my mistake, not the model's.

Try five: the staging works, but the armor is on backwards. Finally the composition was right. Lylnyler standing in the chamber, the desk and the goldfish and the floating monster grouped in the distance, his attention on the fish. But because we now saw him from behind, it became obvious the model had rendered his studded leather armor as if it were on frontwards. Chest straps on his back.

Try 5: the composition is finally right, but seen from behind, the chest straps of his armor are on his back. So close.

By this point I did not want to regenerate. Five tries in, I finally had the composition I wanted and I was not going to roll the dice on it again.

The fix: edit, don't regenerate. This is the technique I wish I'd known on try one. Instead of asking for a fresh image, I fed the model the image I already had, plus a back-view reference of the character's armor, and told it to change only the armor and leave everything else untouched. It did. Same chamber, same pose, same composition, correct armor.

The keeper. Composition from try five, armor fixed with a targeted edit pass instead of a regeneration.

That last move is the one with the most carryover to real work. When you have an output that's 90% right, a targeted edit ("change only this, keep everything else") is almost always safer than generating again and hoping the 90% survives.
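As a sketch of what "change only this, keep everything else" looks like when written down as a request, here is a small data structure. The class `EditRequest` and its fields are hypothetical, the file names in the usage lines are placeholders, and it builds only the prompt text and the ordered image list. How those get passed to an image model depends on the tool, which is not assumed here.

```python
from dataclasses import dataclass, field


@dataclass
class EditRequest:
    """A targeted edit: one existing image, one change, everything else frozen."""

    base_image: str                     # the output that is already mostly right
    change: str                         # the single thing to alter
    extra_references: list[str] = field(default_factory=list)

    def prompt(self) -> str:
        return (
            f"Edit the provided image. Change ONLY this: {self.change} "
            "Keep everything else exactly as it is: "
            "same chamber, same pose, same composition."
        )

    def images(self) -> list[str]:
        # Base image first, so it is unambiguous which image is being edited.
        return [self.base_image, *self.extra_references]


if __name__ == "__main__":
    req = EditRequest(
        base_image="try5.png",
        change="The armor is seen from behind; match the back-view reference.",
        extra_references=["armor_back.png"],
    )
    print(req.prompt())
    print(req.images())
```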

What it taught me about the day job

I'm a cloud consultant. I don't generate fantasy art for clients. But the habits this dragged out of me are the same ones I use when I'm building anything with a model.

The reference image that loses to the prior is the same as the example in a prompt that loses to the model's training. Showing isn't telling. If it matters, say it, explicitly, as a constraint, not a hint.

The "NOT female" line and the "not a boy or child" line are ugly, and they work. Good model instructions often look like that: blunt, repetitive, slightly paranoid, written in the scar tissue of everything the model got wrong before. I've stopped being embarrassed by prompts that read like a list of "do not."

And the edit pass beats the re-roll. When a model gives you something mostly right, change the one thing; don't regenerate the whole thing. That applies just as well to a config file or a piece of generated code as it does to a goldfish.

In Part 3: I gave the whole thing a voice, and the model told me a bug was a law of physics.

Sidebar: my image-prompt rules, all earned the hard way

MALE [race], masculine features, NOT female. It feminizes male characters.

not young, not a boy or child. It draws short characters as kids.

Must-match face first; four references maximum. Faces dilute past four refs.

Back-to-camera characters still get a face reference — but explicitly instruct the composition. The model will twist them around otherwise.

Generate locations empty, then populate. It copies whoever was in the reference room.

Pair every reference image with a written description, fed verbatim. The reference says who; the description says what matters.

Describe props as small, late, and minor. Props become the subject if listed early.

To fix one detail, edit the image, don't regenerate. Re-rolling risks the 90% that was already right.

Credits. Image generation uses the open-source ai-image-creator skill by centminmod (MIT), which handles model routing, reference images, and aspect ratios. The scene-image-generator agent that drives it, and all the prompt rules in the sidebar above, are my own, written one failed image at a time.

All images in this post, including the header and every figure, are my own, generated with the pipeline described above.
