The concept seemed simple enough. A tiger stepping out of an oil painting. Front half: photorealistic fur. Back half: visible brushstrokes on canvas. Paint dripping at the transition point where one reality becomes another. One prompt. One image. Two visual languages coexisting in a single frame.
I ran that prompt through seven different AI image models. Same words, same concept, every time. What came back wasn't seven versions of the same thing. It was five completely different interpretations of what "emerging from a painting" means. And a hard lesson about what these models can and can't actually do.
The Concept: Cross-Medium Breach
The idea is straightforward in theory. Take a subject that exists in one artistic medium (oil paint on canvas) and show it transitioning into another medium: photorealistic photography. The front half should look like you could reach out and touch real fur. The back half should clearly be flat brushstrokes on a canvas surface. And somewhere in the middle, one becomes the other.
I'd been testing material transformation prompts in Adobe Firefly for months. Faces constructed from fire and ice. Bridges built of candle wax. Architectures made of chocolate. Those all worked beautifully. So when I sat down to test "a tiger constructed from oil paint, transitioning to real," I expected results in the same ballpark.
I was wrong.
What Firefly Did Instead
Firefly gave me women with paint on them. Not women emerging from paintings. Women standing in galleries with red and white paint smeared across their shoulders and arms. Fashion editorial body paint photography. Every single time.
I tried four different approaches.
- Removed the canvas entirely and described a half-paint, half-real split down the center of the figure. Got body paint.
- Switched from a human subject to a tiger to eliminate the body-art genre entirely. Got a beautiful hyper-realistic painting of a tiger in a gallery, but still just one visual language.
- Reversed the direction completely, describing "a painting that is becoming three-dimensional." Got stunning trompe l'oeil: a painted face pushing out from a cracked canvas with gorgeous impasto texture. Technically impressive. But both sides were still oil paint.
Four attempts. Sixteen images. Not one instance of Firefly rendering two visual languages simultaneously.
The irony is that some of these "failures" are striking on their own terms. That trompe l'oeil face pressing outward from a cracked canvas? I'd put it in a portfolio. Firefly didn't do what I asked, but it invented something adjacent that has its own weird beauty. Which is a finding worth documenting, even if it wasn't the finding I was looking for.

The best wrong answer of the session. Firefly invented something beautiful. It just wasn't what I asked for.
The Fix: Ask Someone Else
After confirming that Firefly's limitation wasn't a prompt problem (I'd tried every angle I could think of), I pivoted. Same prompt, different models. Seven in total, four images each, scored on the same rubric I use for all my testing: Visual Quality (30%), Prompt Alignment (25%), Consistency (15%), Uniqueness (15%), and X Engagement Potential (15%).
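For anyone who wants to replicate the scoring, the rubric is just a weighted average. Here's a minimal sketch; the weights come from the article, but the dictionary keys, function name, and the example per-category scores are my own illustration:

```python
# Rubric weights from the article (must sum to 1.0).
WEIGHTS = {
    "visual_quality": 0.30,
    "prompt_alignment": 0.25,
    "consistency": 0.15,
    "uniqueness": 0.15,
    "engagement": 0.15,
}

def composite(scores: dict) -> float:
    """Weighted composite on a 0-10 scale."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# Hypothetical per-category scores for one model's four-image set.
example = {
    "visual_quality": 9.5,
    "prompt_alignment": 9.0,
    "consistency": 8.5,
    "uniqueness": 9.0,
    "engagement": 8.5,
}
print(round(composite(example), 2))  # weighted average of the five categories
```

Because Visual Quality and Prompt Alignment together carry 55% of the weight, a model that nails both can absorb weaker consistency or uniqueness scores and still land near the top of the table.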
Here's what happened.
Five Interpretations of One Prompt
The results sorted into five distinct categories, and this was the most interesting discovery of the entire session. The models didn't just perform better or worse. Each one understood the concept differently.
GPT Image 1.5 and the Nano Banana family (both Pro and base) actually rendered the dual-language breach. Photorealistic fur dissolving into visible brushstrokes. A clear transition zone. Two visual languages coexisting in one frame. This is what I asked for, and only these three models delivered it.
Imagen 4 went full action movie. The tiger wasn't transitioning from paint to fur. It was physically smashing through the canvas like a wall. Debris flying, surface cracking, fragments everywhere. Both sides photorealistic. No medium transition at all, but visually explosive. If I'd asked for "tiger breaking through a painting," this would have been the winner.
Firefly Image 5 committed fully to trompe l'oeil. The face pushed outward from the canvas in three dimensions, but everything remained painted. Consistent, aesthetically strong, conceptually wrong.
Flux 2 Pro and Runway Gen4 couldn't make up their minds. Some images attempted the breach, others just rendered paintings in galleries, and one was essentially a photorealistic tiger standing near a canvas with some paint splashed on the floor. The inconsistency was the finding: these models didn't understand the concept well enough to fail in a coherent direction.

Imagen heard "emerging" and chose violence. Both sides photorealistic. No medium transition. Pure action movie.
The Numbers
| Model | Composite Score | Did it work? |
|---|---|---|
| GPT Image 1.5 | 9.08 | Yes |
| Nano Banana Pro | 9.00 | Yes |
| Nano Banana 2 | 8.88 | Yes |
| Imagen 4 | 8.33 | No (physical smash) |
| Firefly Image 5 | 7.48 | No (trompe l'oeil) |
| Flux 2 Pro | 7.33 | Sort of (2 of 4) |
| Runway Gen4 | 6.43 | Barely (1 of 4) |
The gap between the top tier and everything else isn't about image quality. Visual Quality scores were comparable across all seven models. The differentiator was Prompt Alignment: whether the model produced what was actually requested. The three breach achievers scored 8.50 to 9.00 on alignment. Everyone else scored 5.38 to 6.63. When every model makes beautiful images, the one that makes your beautiful image wins.
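A quick back-of-the-envelope check on that claim: with alignment weighted at 25%, the reported alignment gap alone predicts most of the composite gap. The score ranges below come from the paragraph above; the variable names are mine, and the calculation assumes the other categories scored roughly equally across models:

```python
ALIGNMENT_WEIGHT = 0.25  # Prompt Alignment's share of the composite

# Alignment score ranges reported for the two tiers.
top_tier = (8.50, 9.00)   # the three models that rendered the breach
the_rest = (5.38, 6.63)   # everyone else

# If all other categories were equal, the composite gap attributable
# to alignment is the alignment gap scaled by its weight.
min_gap = (top_tier[0] - the_rest[1]) * ALIGNMENT_WEIGHT
max_gap = (top_tier[1] - the_rest[0]) * ALIGNMENT_WEIGHT
print(f"composite gap from alignment alone: {min_gap:.2f} to {max_gap:.2f}")
```

That predicted range of roughly half a point to just under a point brackets the actual gap in the table between the lowest breach achiever (8.88) and the highest non-achiever (8.33), which is consistent with alignment, not image quality, doing the separating.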

Session peak: 9.30. You can trace where individual fur strands become paint strokes.
Why This Matters Beyond Pretty Tigers
Here's the thing I keep coming back to. I've spent months building prompt engineering techniques in Firefly. "Constructed from" language, gradient backgrounds, atmospheric haze, dark environments for self-illuminating materials. Those patterns are proven. Hundreds of images. Scores consistently above 8.0.
And none of it mattered for this concept. Not because the techniques were wrong, but because the model couldn't do what I was asking regardless of how I asked. The "constructed from" pattern that works brilliantly for fire, ice, and crystal fails for oil paint. Paint is something that actually goes on skin in real life. Firefly's training data includes thousands of body paint editorial photos, and that association is stronger than any prompt language I could write.
The lesson is uncomfortable but useful: prompt engineering has a ceiling, and that ceiling is model capability. You can be the best prompt writer in the world, but if the model can't render two visual languages simultaneously, no combination of words will make it happen. Sometimes the answer isn't a better prompt. It's a different model.

Same score, different approach. GPT kept the tiger attached to the canvas. Nano Banana let it walk away.
What I'd Do Differently
If I were starting this test over, I'd skip the human subject entirely. The body-paint genre trap cost me eight images and two rounds of iteration before I identified the real problem. I'd also run the multi-model comparison earlier: I spent an hour fighting Firefly when I could have confirmed the limitation in fifteen minutes by testing one alternative model.
But the failures are what make this article possible. A clean success story ("I wrote a prompt and it worked") teaches you nothing. The path through body-paint misinterpretation, trompe l'oeil invention, genre traps, and material prerequisites is where the actual insights live. That's the research.
The prompt used for all seven models is included below. Try it yourself and see which interpretation your model of choice produces.
Copy-paste prompt:
A painted tiger emerging three-dimensionally from a large oil painting canvas in a gallery, the tiger's head and front paws are photorealistic fur and muscle while its hindquarters dissolve into visible brushstrokes of orange and black paint still flat on the canvas surface, thick oil paint dripping from where three-dimensional fur meets painted canvas, dimly lit gallery with single spotlight, 85mm lens, shallow depth of field, atmospheric haze, professional photography
