I started this session with a thunderstorm in a lightbulb. Twelve models. Four images each. Forty-eight scored generations of the same prompt, the same concept, the same table.

By the end I had scored 176 images across 5 containers and 5 ecosystems. And the biggest quality jump had nothing to do with which model I used.

Switching from the thunderstorm lightbulb to a forest snow globe raised the cross-model average from 8.54 to 9.07. That 0.53-point jump is larger than the gap between the best and worst finalist models on the original lightbulb prompt. The thing I changed last turned out to matter most.

This article covers the full arc. Three rounds, seven key findings, and a principle that changed how I approach every prompt I write now. If you've been chasing model upgrades hoping for better results, the data says you're optimizing the wrong variable.

Round 1: Twelve Models, One Lightbulb

The concept was simple. A clear glass lightbulb on a dark wooden table, a complete miniature thunderstorm raging inside. Clouds, lightning, rain, puddles at the base, light casting through the glass onto the wood. Macro product photography, 85mm lens, shallow depth of field.

Here is the prompt I used across all twelve models:

A clear glass lightbulb sitting on a dark wooden table, inside the bulb a complete miniature thunderstorm is raging with tiny dark clouds and bright lightning bolts illuminating the glass from within, rain falling inside the bulb with tiny puddles collecting at the bottom, the lightning casts dramatic light through the glass onto the table surface, macro photography, 85mm lens, shallow depth of field, dark moody background, atmospheric haze, professional product photography

Twelve models. Four images each. Scored on a weighted five-dimension rubric: Visual Quality (30%), Prompt Alignment (25%), Consistency (15%), Uniqueness (15%), and X Engagement Potential (15%).
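For anyone who wants to replicate the scoring, the rubric collapses into one number by a straightforward weighted sum. Here is a minimal sketch: the weights come from the rubric above, while the dimension keys and the example scores are placeholders I made up for illustration.

```python
# Minimal sketch of the weighted five-dimension rubric.
# Weights come from the rubric described above; the example scores are hypothetical.
WEIGHTS = {
    "visual_quality": 0.30,
    "prompt_alignment": 0.25,
    "consistency": 0.15,
    "uniqueness": 0.15,
    "x_engagement": 0.15,
}

def weighted_score(scores: dict[str, float]) -> float:
    """Collapse per-dimension scores (0-10 scale) into one weighted score."""
    return round(sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS), 2)

# Hypothetical example image: strong render, good alignment, average uniqueness.
example = {
    "visual_quality": 9.2,
    "prompt_alignment": 9.0,
    "consistency": 8.5,
    "uniqueness": 8.0,
    "x_engagement": 8.5,
}
print(weighted_score(example))  # 8.76
```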

The full roster: GPT Image 1.5 (@ChatGPTapp), Firefly Image 5 (@AdobeFirefly), Flux 2 Pro, Flux 1 Kontext Max, Flux 1 Kontext Pro, Flux 1.1 Pro, Flux 1.1 Pro Ultra, Flux 1.1 Pro Ultra Raw (all @bfl_ml), Nano Banana 2, Nano Banana Pro (both @NanoBanana), Imagen 4 (@GoogleAIStudio), and Runway Gen 4 Image (@runwayml).

The top five:

Rank  Model                              Avg Score  Best Single
1     GPT Image 1.5 (@ChatGPTapp)        8.75       9.15
2     Flux 1.1 Pro (@bfl_ml)             8.50       8.88
3     Flux 2 Pro (@bfl_ml)               8.31       8.70
4     Firefly Image 5 (@AdobeFirefly)    8.31       8.58
5     Nano Banana 2 (@NanoBanana)        8.30       8.55

The bottom three: Flux 1.1 Pro Ultra (7.83), Flux 1.1 Pro Ultra Raw (7.46), Runway Gen 4 Image (@runwayml, 7.36).

GPT won. That was expected. What was not expected was the containment breach spectrum.

Same prompt, same concept. One model sealed the storm. Another let lightning escape. A third turned the bulb into a landscape. The words were identical. The stories were not.

Same words produced six entirely different narratives across the twelve models. Firefly kept the storm perfectly sealed inside clean glass. Flux 2 Pro leaked water through the base. The Nano Banana family let lightning escape onto the table. Flux 1.1 Pro Ultra Raw abandoned containment entirely and rendered a landscape world inside the glass. Runway shrunk the storm into a decorative prop.

The gap between the top and bottom was not about rendering quality. It was about how each model interpreted "inside." Top models compressed a full atmosphere into the glass. Bottom models placed small objects inside a container. World vs. decoration. That distinction predicted scores more reliably than any other factor.

Forty-eight images scored in a single afternoon. The ranking was clear. But the ranking was about to stop mattering.

The other discovery worth noting here: premium tiers made things worse. Flux 1.1 Pro scored 8.50. The Ultra variant scored 7.83. Ultra Raw scored 7.46. A clean downward line from base model to premium tier. I had seen the same regression in video generation testing with different models. "Premium" in the model name is marketing, not a quality guarantee.

Round 2: Same Storm, Different Containers

I narrowed the field to four models (GPT Image 1.5 (@ChatGPTapp), Firefly Image 5 (@AdobeFirefly), Flux 1.1 Pro (@bfl_ml), Nano Banana Pro (@NanoBanana)) and kept the thunderstorm but changed the container. Mason jar. Snow globe. Coffee cup. Pocket watch. Four containers across four models. Sixty-four more images.

The cross-model averages by container:

Container        Cross-Model Avg
Pocket Watch     8.67
Snow Globe       8.54
Coffee Cup       8.45
Mason Jar        8.35
Lightbulb (R1)   8.54
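For clarity on how the table numbers work: each cross-model average is the mean of the four finalist models' per-container averages, and the "range" I lean on later is simply best minus worst. A quick sketch, using the two pocket watch figures stated in this article plus two placeholder values for the other models:

```python
from statistics import mean

# Per-model averages on one container. The GPT and Firefly figures are stated in
# the article; the Flux and Nano Banana Pro values are placeholders for illustration.
pocket_watch = {
    "GPT Image 1.5": 9.07,
    "Firefly Image 5": 8.29,
    "Flux 1.1 Pro": 8.60,      # placeholder
    "Nano Banana Pro": 8.72,   # placeholder
}

cross_model_avg = round(mean(pocket_watch.values()), 2)
spread = round(max(pocket_watch.values()) - min(pocket_watch.values()), 2)
print(cross_model_avg, spread)  # 8.67 0.78
```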

The pocket watch, the container with no stock photography precedent for "weather inside a timepiece," scored highest. And the model rankings shifted depending on which object they were rendering.

The biggest swing belonged to Nano Banana Pro. On the lightbulb, it placed 6th out of 12 with an 8.26 average. On the pocket watch, it produced what was then the session's highest-scoring single image, a 9.23. A 0.97-point swing from its lightbulb average to its pocket-watch peak: same model, same storm, different container.

A 6th-place model produced the best single image of the first two rounds. The only thing that changed was the container.

GPT's advantage actually grew on unfamiliar containers. On the pocket watch, GPT averaged 9.07 while Firefly averaged 8.29, a gap of 0.78 points. On the snow globe (a stock-familiar concept), the gap shrank to 0.46. Models that reason from physics scale to any container. Models that match training data patterns hit a ceiling when the container has no precedent.

The coffee cup introduced something else entirely. An opaque container forced every model to solve a problem transparent glass never posed: how do you show weather inside something you cannot see through? Four models produced four completely different solutions. Firefly floated the storm clouds above the rim. GPT pushed them deep into the cup, billowing over the edge. The creative problem-solving was more varied than any transparent container produced.

The pocket watch had no training data to lean on. That turned out to be its advantage.

Round 3: Same Globe, Different Worlds

Round 2 established that the snow globe offered the highest cross-model consistency (only 0.49 range across all four models). So I locked the container and changed what lived inside it. Coral reef. Ancient forest. Galaxy. Desert sandstorm. Plus the original thunderstorm from Round 2 for comparison.

Sixty-four more images. Four ecosystems across four models. The results were unanimous.

Ecosystem          Cross-Model Avg
Forest             9.07
Coral Reef         8.79
Galaxy             8.75
Desert Sandstorm   8.73
Thunderstorm       8.54

Forest won across all four models; not one ranked it anywhere but first. The cross-model average of 9.07 made it the session's peak concept. GPT averaged 9.41 on the forest globe, with three individual images tying at 9.43 for the session's highest scores. Firefly hit 9.10, its best concept average of the entire session. Even Flux 1.1 Pro reached 8.90.

The thunderstorm, the concept I started the entire session with, finished last. Every colored, living ecosystem outperformed monochrome weather. Miniature worlds work best with worlds, not weather events.

The session champion. 9.43. Ancient forest, volumetric god rays, moss-covered trees pressing against curved glass. Every model agreed: this is where the concept peaks.

Here is the prompt that produced it:

A glass snow globe sitting on a dark wooden table, inside the globe a complete miniature ancient forest with towering moss-covered trees and thick green canopy, golden sunlight filtering through the leaves creating volumetric god rays inside the glass, tiny ferns and mushrooms covering the forest floor, the warm light casts a green-gold glow through the glass onto the table surface, macro photography, 85mm lens, shallow depth of field, dark moody background, atmospheric haze, professional product photography

Why forest? Several factors converged. God rays create visible volumetric light inside glass, which rewards the rendering engine. The green-gold palette produces natural warm-cool contrast against dark backgrounds. Trees pressing against curved glass create compression tension that reads as dramatic at any viewing size. And the concept maps to terrarium photography, which exists in training data but is not oversaturated the way "lightbulb with weather" might be.

God rays through glass. Green light bleeding onto dark wood. The formula that made four competing models agree on something for the first time.

The Finding That Changed My Process

Here is the session mapped as a single progression:

Thunderstorm in a lightbulb: 8.54 cross-model average. Thunderstorm in a pocket watch: 8.67. Forest in a snow globe: 9.07.

The starting point was the weakest version of the concept. Every step that improved quality was a change to what was inside the prompt, not which model rendered it.

The gap between the best model (GPT at 8.75) and the worst surviving model (Nano Banana Pro at 8.26) on the lightbulb was 0.49 points. The gap between the worst ecosystem (thunderstorm at 8.54) and the best ecosystem (forest at 9.07) was 0.53 points. Changing the subject outperformed changing the model.

This does not mean model choice is irrelevant. GPT won every container and every ecosystem. But the order of operations matters. If I had spent the entire session testing 12 models on a thunderstorm in a lightbulb, the best possible result would have been GPT's 8.75. By testing four more containers and four more ecosystems instead, a lower-ranked finalist's best ecosystem (Firefly's forest at 9.10) outscored the best model's starting concept (GPT's lightbulb at 8.75).

Prompt engineering is not just the words. It is the objects. The ecosystems. The containers. The choices you make before you start typing.
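If you want to run your own sweep, every prompt in this session shares one scaffold with two swappable decisions: the container and the world inside it. Here is a rough template, paraphrased from the prompts in this article rather than pulled from any tool; treat the slot names as my own invention.

```python
# Rough template for the session's shared prompt scaffold. The slot names and
# wording are paraphrased from the prompts in this article, not an official format.
SCAFFOLD = (
    "{container} sitting on a dark wooden table, inside {boundary} a complete "
    "miniature {world}, {light} onto the table surface, macro photography, "
    "85mm lens, shallow depth of field, dark moody background, atmospheric haze, "
    "professional product photography"
)

forest_globe = SCAFFOLD.format(
    container="A glass snow globe",
    boundary="the globe",
    world=("ancient forest with towering moss-covered trees and thick green canopy, "
           "golden sunlight filtering through the leaves creating volumetric god rays "
           "inside the glass, tiny ferns and mushrooms covering the forest floor"),
    light="the warm light casts a green-gold glow through the glass",
)
print(forest_globe)
```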

176 images later, the variable that mattered most was the one I almost did not test.

What Comes Next

This session produced enough findings to fill a week. Over the next four articles, I am going to break each one open with the full data.

Tuesday: The container that transformed a 6th-place model into Round 2's top scorer. NB Pro's pocket watch story and what it teaches about stock familiarity.

Wednesday: The forest snow globe formula. Why every model agreed. The god ray principle. And the copy-paste prompt that never scored below 8.78.

Thursday: Why premium AI models keep scoring worse. The Flux regression data, confirmed across image and video generation.

Friday: I put a storm in a coffee cup. Four models. Four completely different solutions. What happens when you give AI a problem it cannot see through.

Every prompt from this session's top performers is in this article. The lightbulb and forest globe are in the sections above. Here are the remaining three you should try.

Best for X Engagement: Coral Reef Snow Globe

A glass snow globe sitting on a dark wooden table, inside the globe a complete miniature coral reef ecosystem with colorful coral formations and tiny tropical fish swimming through crystal clear turquoise water, bioluminescent jellyfish providing soft glowing light from within, light refracting through the water and glass onto the table surface, macro photography, 85mm lens, shallow depth of field, dark moody background, atmospheric haze, professional product photography

Best Unique Concept: Thunderstorm Pocket Watch

An open antique pocket watch sitting on a dark wooden table, inside the watch face a complete miniature thunderstorm is raging with tiny dark clouds and bright lightning bolts illuminating the watch glass from within, rain falling inside the watch with tiny puddles collecting on the watch mechanism, the lightning casts dramatic light through the watch crystal onto the table surface, macro photography, 85mm lens, shallow depth of field, dark moody background, atmospheric haze, professional product photography

Best Warm Palette: Desert Sandstorm Snow Globe

A glass snow globe sitting on a dark wooden table, inside the globe a complete miniature desert landscape with sand dunes and a raging sandstorm with swirling amber dust and tiny lightning bolts within the dust clouds, the warm amber light from the storm casts golden light through the glass onto the table surface, macro photography, 85mm lens, shallow depth of field, dark moody background, atmospheric haze, professional product photography

Score your own results. And if you find an ecosystem that beats the forest, I want to see it.

Glenn is an Adobe Firefly Ambassador and AI creator documenting the craft of prompt engineering at @GlennHasABeard. He publishes The Render newsletter and creates the Stor-AI Time series adapting world folktales through AI-generated video.

This article is the first in a five-part series analyzing one testing session: 176 images, 12 models, 5 containers, 5 ecosystems. Every score uses the same weighted five-dimension rubric applied across all prior research sessions.
