It started with a question from Vedika Khurana (@KhuranaVedika) back in February:
"What is the one thing AI still can't get right that consistently frustrates everyone, and why haven't we fixed it yet? Beyond just 'fingers'!"
I took that question and brought it to my own community. I told them I was writing an article and wanted to quote real creators. Not the polished "AI is amazing" version. The real version. The one where you've burned through your credits trying to get one shot to work and you're staring at the screen wondering if the model is gaslighting you.
The responses came fast.
I'll Go First
Before I get into what the community said, I want to be honest about my own wall.
I produce AI-animated videos for my Stor-AI Time series. Folktale adaptations, narrated, scored, edited in Adobe Premiere. For those videos, I need camera movement. Specifically, I need dolly shots: a smooth push in or pull out, the kind of move you see in every decent film made in the last century.
AI still can't reliably deliver one.
The closest I've gotten is with Kling. And even Kling isn't consistent. Sometimes it interprets a dolly prompt as a zoom, which is a completely different thing visually. Sometimes it drifts sideways. Sometimes it doesn't move at all. I've burned more credits on this single problem than almost anything else in my workflow.
What makes it frustrating isn't that the shot is hard. It's that it's not hard. A dolly-in is one of the most basic camera moves in existence. The fact that AI models struggle to separate "camera moves through space" from "image scales up" tells you something important about what these models actually understand about three-dimensional space.
Spoiler: it isn't much.
Space Is the Problem Nobody Talks About
Stella Noir (@StellaNoir66) named it directly:
"Continuity and the concept of a true three-dimensional space. Both are structural problems. AI needs to fuse with virtual world building of video games, because there, both those things are the most natural in the world."
That's a sharp observation. Game engines understand geometry because they're built on it. An object in Unreal Engine exists in 3D space and behaves accordingly. AI image and video models are generating plausible visual patterns based on training data. They're not modeling a world. They're predicting what a world would look like. That's a different thing, and the gap shows up the moment you ask for any kind of spatial reasoning.
Frontier Modal (@FrontierModal) put it plainly: "Camera control. It doesn't really understand three dimensional space, so you have to iterate and oversimplify."
That word, oversimplify, is one I've lived with in my own workflow. You learn to strip your prompts down to something the model can approximate, and then you build around it. It works. But it's a workaround, not a solution.
Megatronicjdw (@megatronicjdw) described spending hours trying to get a model to understand size relationships inside a hallway-illusion prompt. The model couldn't grasp that one door should be smaller than another based on distance. Placement, walking direction on stairs, objects that should be far appearing close: basic spatial logic that a child understands visually.

The Body Knows When Something Is Wrong
Related to the spatial problem but distinct from it: the way AI handles bodies in motion.
AI Aimee (@RockGrokAI) offered a characteristically optimistic read on all of this: "The good news is AI is as dumb as it's ever going to be, right now. We ain't seen nothing yet." She's right, and I'll come back to that. But right now, in March 2026, the body mechanics problem is real.
Sonia Snowfrost (@soniasnowfrost) described a specific failure she kept hitting: heads that stay in place while a body spins, or worse, heads that go full horror-movie rotation when they shouldn't. "A lot of my images default to a dance spin if I leave it up to the AI."
DJ Kofi the Cat (@DjKofithecat) tested flips and spins specifically, noting that the torso and legs end up in the wrong place at the wrong time. His theory is that different body parts moving at different speeds breaks the model's ability to track them as part of the same object.
ADFortes (@ad_fortes) looked at this from a different angle, noting that multi-legged or non-standard forms show weird limb crossovers and fades when in motion. The model doesn't maintain a coherent sense of the shape it created. It generates the starting position and the ending position, and what happens in between is anybody's guess.
AbandonedMuse (@abandonedmuse) called out morphing as a specific failure: "If I have two photos of staircases, why can't you just move from one space to the next? Why do you have to do some weird illogical junction?" The model doesn't reason about the most realistic transition. It finds the nearest visual bridge in its training data, and sometimes that bridge makes no sense at all.
AI Doesn't Count. It Approximates.
Fellow creator Lukman Febrianto (@lukmanfebrianto) ran a systematic test of this that's worth reading in full. He tested three image models with two, five, and then ten people in a scene, tracking both face diversity and subject count across each test.
The findings were clear. Models that handled two characters well didn't necessarily hold up when asked to place ten. More importantly, Lukman confirmed something many of us have suspected: AI models don't truly count objects. They approximate visual patterns. Flux 2 Pro produced the most diverse faces in his five-person test but generated only four women, prioritizing natural scene composition over the literal prompt.
This isn't a bug in the traditional sense. The model is doing what it's designed to do. It's generating a plausible scene. It's just not following your instructions the way you think it is.

Character Consistency Across Time
If the counting problem is about a single image, the character consistency problem is about sequences.
Nico Simon Princely (@NSPArtist) described it well: you build a shot starting on a character's face, then cut to their back, then in the next clip they turn around and it's a different person. The model doesn't retain a persistent identity across shots. Each generation is a new roll of the dice.
CreatioN CrypT (@creationcrypt) named both character and prop consistency as their top frustration. It's one of the biggest limiting factors for anyone trying to build a serialized narrative with AI video.
This is something I navigate every time I produce an episode of Stor-AI Time. Maintaining visual identity for a character across multiple shots requires constant reference management. The tools are improving, but consistency across time is still more craft than button-press.
A Note From the Professional Side
Most of the frustrations in this article are creator-facing. But Oranguerillatan (@Oranguerillatan), who works in post-production, raised a point that matters for anyone hoping to use AI output in professional pipelines.
The issue isn't just what AI generates. It's the format it generates in. Most AI video outputs are 8-bit MP4 files. Most image outputs are standard PNG. For a compositor working in Nuke or Flame, that's a problem. Professional VFX pipelines expect footage with wide dynamic range, native 4K resolution, and colorspace compatibility with systems like ACES. AI output often can't be composited seamlessly alongside real footage without visible degradation.
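To make the bit-depth gap concrete, here's a minimal NumPy sketch. It's purely illustrative (not any tool's actual pipeline): it shows why an 8-bit file is a ceiling for a compositor, while a floating-point buffer of the kind an EXR stores keeps highlight detail intact.

```python
import numpy as np

# A hypothetical bright highlight at 4x display "white" (an HDR pixel value).
hdr_value = 4.0

# An 8-bit format clips and quantizes: everything above 1.0 collapses to 255,
# so the highlight detail is gone before the compositor ever sees it.
eight_bit = np.uint8(np.clip(hdr_value, 0.0, 1.0) * 255)

# A float32 channel (what an EXR stores) keeps the value intact,
# leaving room to grade or pull the exposure down later.
float_channel = np.float32(hdr_value)

# 8-bit also caps tonal resolution: a smooth 0-to-1 ramp lands on
# at most 256 distinct levels per channel, which is where banding comes from.
ramp_levels = np.unique((np.linspace(0.0, 1.0, 10_000) * 255).astype(np.uint8)).size

print(eight_bit, float_channel, ramp_levels)  # 255 4.0 256
```

The point of the sketch: once a generator hands you an 8-bit MP4, no amount of downstream skill recovers the values that were clipped and quantized at export.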
There are promising developments in ComfyUI, and some tools can now export floating-point EXRs, which is a step in the right direction. But the gap between "looks good on social media" and "works in a professional VFX pipeline" is still significant. As AI output finds its way into more commercial production contexts, this is the conversation that's going to get louder.
The Things We Haven't Fixed
Weapons without geometry errors. Lip sync that doesn't look mechanical. Rain and tears on faces. Trains on tracks that actually go somewhere. Mirrors that reflect what should be in them. Text that reads correctly. Simple cartoon linework. Camera composition that doesn't center everything.
Uncle Dave (@deebeeeff) had a list that made me laugh out loud, not because it's funny, but because every single item on it was immediately recognizable. The melting butter problem. The train problem. Tears on faces. Each one is a small specific thing that should be easy, and each one reveals the same underlying gap: AI models predict appearance, they don't model reality.
Unleashed (@UnleashedAI99) pointed out weapons as a specific pain point. Bowstrings, swords with incorrect geometry, guns in holsters that don't sit right. The model has seen thousands of weapons. It generates something that looks like a weapon. But the functional logic of how a weapon is held, drawn, or used often isn't there.
Sachin Kamath (@DesigningFlow) raised something that sits a little differently from the rest of the list: AI struggles to represent people with disabilities or conditions like Down syndrome accurately. His read on why is straightforward. Small training datasets. When a model has seen millions of images of one type of face and far fewer of another, the output reflects that imbalance. This isn't a physics problem or a geometry problem. It's a representation problem, and it points to something the broader AI development conversation needs to keep front and center.
Wombat (@Wombat_fsolrtjg) pointed to a related bias: AI models carry a hard binary in how they interpret gender. When a model reads a figure as female based on face shape, hair, or body proportions, it applies a corresponding anatomical template regardless of what else is in the image. A masculine-presenting figure with softer facial features gets "corrected" by the model to match its training assumptions. The dataset doesn't have enough representation of bodies that sit outside that binary, so the model defaults to one side of it every time.

So Where Does That Leave Us?
AI Aimee (@RockGrokAI) said it best: AI is as dumb as it's ever going to be, right now.
Every frustration in this article is a temporary one. The models will improve. The spatial reasoning will get better. Character consistency will become standard. Professional pipelines will eventually get the formats they need. These aren't unsolvable problems. They're just unsolved ones.
In the meantime, the creators who are building now are the ones developing the workarounds, the prompt strategies, the reference management systems, and the creative instincts to get remarkable work out of imperfect tools. That's not a limitation. That's craft.
The fingers problem was solved. The next problems are already being solved. And the ones after that will be solved by people in this community who got frustrated enough to figure it out.
That's where we are. It's not perfect. It's also not bad.
What's the one thing that still costs you the most time, credits, or sanity? Drop it in the comments. Let's keep building this list together.
Glenn Williams is an AI art creator, Adobe Firefly Ambassador, and the author of The Render newsletter. Subscribe at glennwilliams.beehiiv.com