Some weeks, new AI tools show up and you move on. This wasn’t one of those weeks.
Three fresh releases stood out because they solve real problems in a very direct way: fixing missed focus after the fact, turning a still image into an interactive video stage you can “direct” with your mouse, and animating portraits for very long videos without the usual face drift.
Tool #1: Generative Refocusing (change focus after you took the photo)
Everyone has a photo like this: the moment is perfect, but the camera grabbed focus on the wrong thing. A face is soft, the background is sharp, or the subject is fine but the part you wanted to highlight is slightly blurry.
Generative Refocusing is built for that exact pain. It can fix an out-of-focus photo after it’s already in your camera roll, and it also lets you shift focus on purpose for a more cinematic look.
The big idea is simple: it gives you flexible control over focus on a single image, long after capture. You can move focus from foreground to background (or the other way around), blur specific regions, or sharpen areas that were originally soft.
How Generative Refocusing “understands” your photo (the depth map concept)
In the demos, there’s a depth map shown in the corner. Think of it as a 3D hint for a 2D photo. The model uses that depth map to estimate what’s closer, what’s farther, and how blur should behave across space.
That’s why the focus shift looks natural. When it refocuses, it doesn’t just sharpen a cutout. It treats the image like a scene with depth, then adjusts blur according to distance.
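To make that concrete, here’s a minimal sketch of depth-driven refocusing in its most naive form: pick a focal depth, turn each pixel’s distance from that depth into a blur radius, and blend pre-blurred copies of the image accordingly. This is not the tool’s actual method (which is diffusion-based, as described further down), just the underlying intuition in code; the blur-stack size and sigma values are arbitrary choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def naive_refocus(image, depth, focal_depth, max_sigma=8.0):
    """image: HxWx3 float array; depth: HxW map in [0, 1]; focal_depth in [0, 1]."""
    # Blur strength grows with each pixel's distance from the focal plane.
    sigma_map = np.abs(depth - focal_depth) * max_sigma

    # Precompute a small stack of progressively blurred copies of the image.
    sigmas = np.linspace(0.0, max_sigma, num=8)
    stack = np.stack(
        [gaussian_filter(image, sigma=(s, s, 0)) for s in sigmas], axis=0
    )

    # For each pixel, pick the blurred copy closest to its desired blur level.
    idx = np.abs(sigmas[:, None, None] - sigma_map[None]).argmin(axis=0)
    return np.take_along_axis(stack, idx[None, ..., None], axis=0)[0]
```

The diffusion approach described later goes further than this kind of blending, because it can also recover detail that a simple blur stack can never bring back.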
If you’ve read about how modern image enhancement is moving beyond basic sharpening, the broader context is interesting too. This overview of AI methods that improve images explains why “enhancing” a photo is harder than it looks.
What it can do well (and why it feels different)
Generative Refocusing supports a mix of “fix it” work and “create a look” work:
- Re-focus different areas of a single image after the fact.
- Defocus specific parts with controlled blur, instead of blurring the whole background.
- Recover detail in areas that were out of focus, not only shift blur around.
- Keep the original content stable, so the person doesn’t change shape or angle.
That last point matters more than people expect. A lot of AI edits get the job done, but they quietly alter reality in the process.
Examples shown in the demos (what stood out)
The video runs through a bunch of cases, including tricky multi-person shots:
- A Joker portrait example where the focus shifts using the depth map, and the change looks consistent with the scene’s depth.
- A four-person scene where focus hops between front and back, then flips so the background becomes sharp when the foreground is blurred.
- A six-person group shot where the model can jump focus from one person to another without turning faces into mush.
- A photo of five girls where only the middle subject was sharp in the input; the output refocuses so all faces are sharp.
- A “wrong subject” focus mistake where the camera locked onto the girl taking the photo, and the actual posed people were blurred. After processing, the posed group comes into focus and detail is recovered.
- A creative refocus where the subject in front becomes blurred and the amusement rides behind her snap into focus.
This is the kind of tool that’s helpful even if you never touch AI art. It’s closer to “fix my photo” than “generate a new one.”
Why it beat a common alternative in the comparison
The demo compares results against Nano Banana Pro. The key difference wasn’t blur quality. It was image integrity.
- In one comparison, Nano Banana Pro successfully changed focus, but it also changed the man’s posture and facial angle.
- In another, it zoomed out slightly and hallucinated parts of a face.
Generative Refocusing kept the person’s angle and structure consistent, then applied the focus change without rewriting the scene.
If you do use Nano Banana Pro for image work, this internal breakdown is a good companion read: Google Nano Banana Pro image generation guide
How it works (two-stage diffusion pipeline)
The approach described in the video uses a two-stage diffusion setup on a Flux.1 dev backbone:
- Stage 1 (DeblurET): Recovers a sharp, all-in-focus image from the input. It uses diffusion guided by initial deblurring predictions.
- Stage 2 (BokehET): Re-applies customizable bokeh based on your chosen focus plane, blur intensity, and even aperture shapes.
Training also mixes synthetic paired data (to keep geometry consistent) with real bokeh photos that include EXIF metadata, so it can learn lens behavior that simple simulation misses.
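To make the hand-off concrete, here’s a structural sketch of how the two stages fit together. The function name, the callable stand-ins, and the parameters are my framing for illustration, not the project’s actual API; the point is that Stage 1 runs once while Stage 2 can be re-run per focal plane.

```python
from typing import Callable
import numpy as np

Image = np.ndarray  # HxWx3 array

def refocus_pipeline(
    photo: Image,
    deblur_model: Callable[[Image], Image],  # stands in for Stage 1 ("DeblurET")
    bokeh_model: Callable[..., Image],       # stands in for Stage 2 ("BokehET")
    focus_depths: list[float],
    blur_strength: float = 1.0,
    aperture: str = "circle",
) -> list[Image]:
    # Stage 1 runs once: recover a sharp, all-in-focus version of the photo.
    all_in_focus = deblur_model(photo)

    # Stage 2 is re-run per focal plane, which is what makes a "focus pull"
    # across several depths cheap to express on top of one Stage 1 result.
    return [
        bokeh_model(
            all_in_focus,
            focus_depth=d,
            strength=blur_strength,
            aperture=aperture,
        )
        for d in focus_depths
    ]
```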
To explore the official materials, start with the Generative Refocusing project page.
If you’re comparing it with more traditional workflows, Adobe also documents AI sharpening and denoise options inside Photoshop, including partner models: Adobe’s guide to generative AI filters for enhancing images
Tool #2: World Canvas (draw what happens next in a video)
World Canvas flips the usual video generation flow. Instead of typing a prompt and hoping the model guesses your intent, you draw motions directly on the image.
It comes from Ant Group (part of the Alibaba ecosystem), and the core experience is simple: upload a still image, then draw trajectories to control actions and camera motion.
What “drawing control” looks like in practice
The demos make the point quickly:
- A dog reacts to different drawn lines:
  - Draw a circle, and the dog spins.
  - Draw a sharp upward stroke, and it jumps.
- You can add a new character, scale it up, and the dog reacts (like running away from a giant figure).
That’s already fun, but the more useful part is how this becomes a directing tool.
The strongest demos (camera motion, speed control, and multi-step scenes)
A few examples show how far the “draw to direct” idea can go:
- Road scene camera move: draw a forward trajectory and the camera moves down the road.
- Add a runner: draw a path and a person runs through the scene along that line.
- Dragon landing: type a prompt for a dragon, then draw a line from sky to street, and it flies down and lands where you drew.
- Spinning selfie: combine a text prompt (girl spinning while taking a selfie) with a circular gesture to set direction and motion. You can also gesture to make the background rotate.
- Speed control: the model looks at how fast you draw. A slow line produces a walk, a faster stroke produces a run (see the quick sketch after this list).
- Choreography: draw a man’s path to lift a woman and spin her around, then generate the clip from that choreography.
- Mini narrative: spin the girl, draw a puppy’s run-in path, then draw her motion to pick up the puppy.
- Appear and disappear: add a car entering from one side, add an elderly man from the other, and the interaction follows context (the car slows, the man turns back).
- Reference image mixing: drop in two unrelated reference images (a Chinese-style painting character and a white bear), place them in a snowy mountain setting, then direct them with prompts and paths.
- Absurd physics scenes: a boy punches the Eiffel Tower and it collapses; a shark swims out of desert sand like it’s water.
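The speed-control point is easy to picture in code. Each sampled point of a drawn stroke carries a timestamp, so stroke speed is just path length over elapsed time; the sketch below is my own illustration, and the walk/run threshold is an arbitrary placeholder, not World Canvas’s actual rule.

```python
import math

def stroke_speed(points):
    """points: list of (x, y, t) samples captured while the user draws."""
    # Path length: sum of distances between consecutive samples.
    dist = sum(
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0, _), (x1, y1, _) in zip(points, points[1:])
    )
    elapsed = points[-1][2] - points[0][2]
    return dist / elapsed if elapsed > 0 else 0.0

# A quick stroke covering ~150 px in 0.6 s reads as a "run".
gesture = [(0, 0, 0.0), (40, 5, 0.2), (90, 10, 0.4), (150, 12, 0.6)]
print("run" if stroke_speed(gesture) > 150 else "walk")  # pixels/second cutoff
```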
The results aren’t perfect every time, but the control method is the headline. It feels less like “prompting” and more like blocking a scene.
How World Canvas ties text to your drawings
Under the hood, the system uses a diffusion transformer trained with a flow matching framework. The piece that matters for users is the alignment method:
- Spatial-aware cross attention aligns your text captions with your drawn trajectories.
- Trajectory injection represents drawn lines as Gaussian heat maps, then propagates features along those points.
That’s why it can keep multiple motions straight at once, instead of mixing up which object should do what.
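The trajectory-injection part is easy to picture. Here’s a minimal sketch of rasterizing a drawn stroke into per-point Gaussian heat maps; the canvas size, sigma, and sampling are illustrative choices on my part, not World Canvas’s real configuration.

```python
import numpy as np

def trajectory_heatmaps(points, height, width, sigma=6.0):
    """points: list of (x, y) pixel coordinates sampled along the drawn line.
    Returns an array of shape (len(points), height, width), one map per point."""
    ys, xs = np.mgrid[0:height, 0:width]
    maps = []
    for (px, py) in points:
        # Gaussian bump centered on this trajectory point.
        d2 = (xs - px) ** 2 + (ys - py) ** 2
        maps.append(np.exp(-d2 / (2.0 * sigma**2)))
    return np.stack(maps, axis=0)

# Example: a short diagonal stroke rasterized onto a 64x64 canvas.
stroke = [(10, 10), (20, 18), (32, 30), (48, 44)]
print(trajectory_heatmaps(stroke, height=64, width=64).shape)  # (4, 64, 64)
```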
For the official details and releases, see the World Canvas project page.
Setup reality check
The project has released inference code and separate model variants, but this is a 14B-parameter model. VRAM requirements aren’t spelled out in the demo discussion, but the expectation is high-end GPUs if you want to run it locally.
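For a rough sense of scale (my estimate, not an official requirement): the weights alone for a 14B-parameter model in half precision land around 28 GB, before you add activations, the text encoder, or the VAE.

```python
# Back-of-envelope for weight memory only; assumes fp16/bf16 storage.
params = 14e9
bytes_per_param = 2  # fp32 would double this, 8-bit quantization would halve it
print(f"~{params * bytes_per_param / 1e9:.0f} GB of weights")  # ~28 GB
```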
Tool #3: FlashPortrait (fast, long portrait animation that stays consistent)
FlashPortrait targets a very specific job: portrait animation where only the face, head, hair, and eyes move.
It makes a bold claim: infinite portrait animation that runs around six times faster than typical portrait generation tools, while staying stable across long videos.
The team behind it includes Microsoft Research Asia, Tencent, Tongji Lab, and Alibaba Group.
What FlashPortrait does (and what it intentionally doesn’t do)
The easiest way to understand it is by comparing it to broader “full-body copy” tools.
FlashPortrait is closer to “make this photo talk” than “make this person dance.”
In the examples:
- You provide a reference image (the identity to animate).
- You provide a driving video (the motion to copy).
- The output keeps the body still and focuses on:
  - lip movement
  - eye movement
  - head tilt
A big benefit of this narrow scope is that it can ignore junk motion. If the driving video includes hands waving or body shifting, FlashPortrait filters that out and sticks to facial motion.
Why long-form output matters (the 1,700-frame stress test)
The long-form comparison is where FlashPortrait looks strongest.
In tests pushing past about 1,700 frames, other methods start to fall apart. The issues described include:
- color drift
- identity inconsistency
- face distortion
The competing tools named in the comparison include Live Portrait, EmoPortrait, AniPortrait, Fantasia Portrait, and One Animate.
FlashPortrait continues cleanly through the end of the sequence, which is the difference between “cool demo” and “usable for long content.”
To see the official examples and writeup, use the FlashPortrait project page.
How FlashPortrait works (three technical ideas driving quality and speed)
The model is built on a 14B diffusion transformer backbone. The speed and stability come from three pieces:
- Normalized facial expression blocks: these replace standard image cross-attention blocks. The goal is to prevent identity drift caused by the gap between diffusion latents and raw facial embeddings. A face encoder extracts head pose, eyes, emotion, and mouth embeddings separately, then normalizes them to help keep identity stable.
- Weighted sliding window denoising for long videos: long generation uses overlapping windows. The method assigns weights to overlap regions and fuses them, which helps avoid jitter and harsh transitions.
- Adaptive latent prediction acceleration: this is the engine behind the speedup. It uses Taylor expansion to predict future latent steps, then adapts based on how much facial motion is happening. The goal is faster inference without the usual “cheap animation” artifacts.
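Of those three pieces, the sliding-window idea is the easiest to sketch: generate overlapping chunks of latent frames, then cross-fade the overlaps with ramped weights so the seams don’t pop. The code below is my own illustration of that general idea with arbitrary sizes, not FlashPortrait’s exact scheme (and the Taylor-expansion acceleration is a separate mechanism not shown here).

```python
import numpy as np

def fuse_windows(windows, overlap):
    """windows: list of arrays shaped (window_len, ...) generated at a stride
    of window_len - overlap. Returns the fused full-length sequence."""
    window_len = windows[0].shape[0]
    stride = window_len - overlap
    total_len = stride * (len(windows) - 1) + window_len
    feat_shape = windows[0].shape[1:]

    fused = np.zeros((total_len, *feat_shape))
    weight_sum = np.zeros((total_len, *([1] * len(feat_shape))))

    # Triangular per-frame weights: low at window edges, high in the middle,
    # so overlapping windows cross-fade instead of switching abruptly.
    ramp = np.minimum(np.arange(1, window_len + 1),
                      np.arange(window_len, 0, -1)).astype(float)
    ramp = ramp.reshape(window_len, *([1] * len(feat_shape)))

    for i, window in enumerate(windows):
        start = i * stride
        fused[start:start + window_len] += window * ramp
        weight_sum[start:start + window_len] += ramp
    return fused / weight_sum

# Example: four 16-frame windows with an overlap of 4 frames fuse into 52 frames.
chunks = [np.random.rand(16, 8) for _ in range(4)]
print(fuse_windows(chunks, overlap=4).shape)  # (52, 8)
```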
Availability limits (why most people won’t run it locally yet)
The inference code is released, but the specific acceleration code path is still listed as coming soon on the roadmap. The model weights are also large (around 58 GB in total), which makes local runs unrealistic on consumer GPUs right now, unless a smaller or quantized release shows up later.
Why these three AI tools matter (quick recap)
All three tools share a theme: more control, fewer accidental changes.
Here’s the clean way to think about them:
| Tool | What you give it | What you get back | Best for |
|---|---|---|---|
| Generative Refocusing | One photo | Refocused image with controllable blur | Fixing missed focus and directing attention |
| World Canvas | One image, text prompts, drawn paths | Video that follows your gestures | Storyboarding and interactive motion control |
| FlashPortrait | One portrait, one driving video | Long, stable talking-head animation | Long-form facial animation and lip-sync style output |
What I learned after seeing these three tools back-to-back
Seeing these tools in one sitting changed how I think about “AI editing” versus “AI generation.”
With Generative Refocusing, the best use isn’t flashy. It’s rescuing photos you already care about, and doing it without rewriting faces. The comparison against Nano Banana Pro made that tradeoff obvious: some tools will get you the look, but they may also change reality while they’re at it.
With World Canvas, I stopped thinking about prompts as the main control surface. Drawing a motion path is faster than describing it. It also removes the guesswork, because the model doesn’t need to interpret a paragraph of text about movement. It can literally follow the line.
The bigger takeaway: the most useful AI tools aren’t always the ones that generate the wildest outputs. They’re the ones that give you control, and then keep your inputs stable while you edit.
If you like tracking “weeks where everything drops at once,” this roundup connects well with another recent recap: AI agents and video generation breakthroughs 2025
Conclusion
This batch of AI tools doesn’t just chase prettier outputs. It focuses on control, stability, and workflows that map to real creative work.
If you take photos, Generative Refocusing is the one that can save “almost perfect” shots. If you build scenes, World Canvas gives you a direct way to block motion with your hand, not paragraphs of prompts. If you publish long videos with talking faces, FlashPortrait is aimed at staying stable when others drift.
If you try one of them, the most interesting test is simple: use your own imperfect input, not a curated sample, then see how well the tool respects what you gave it.