CVPR dispatch: Alibaba Cloud cracks the agent problem

Can agents handle real-world input, run at scale, and deliver something usable?

Jun 22, 2026

CVPR used to ask how machines see. Now the question is harder: once AI can see, can it understand what it’s looking at, generate something useful, and ship that into a real workflow?

An agent in the wild doesn’t get a clean prompt. It gets a contract with tables, clauses, and annotations jumbled together. A design file on its sixth revision. External references that contradict each other. It has to read, judge, retrieve, generate, revise, and hand off a result.

This is why the industry conversation has shifted from “can the agent answer questions” to “can it close the loop.” The bar is complex inputs, controlled costs, and output someone can actually use.

Alibaba Cloud is one of the most experienced agent practitioners among Chinese tech companies. What they call “full-stack agent support” is a capability stack from compute to application development. Their batch of CVPR papers this year maps neatly onto the three thresholds every agent must cross before it’s production-ready: understanding complex input, running affordably, and delivering usable output.

Understanding: making sense of messy documents

Agent demos look smooth because the input has been pre-cleaned. Real business isn’t like that. Contracts mix tables, clauses, and handwritten annotations. Financial reports interleave text, charts, and footnotes. Technical docs have formulas next to screenshots next to code blocks. And different documents can say contradictory things about the same topic.

“Understanding” for an agent means more than image recognition. It’s three problems stacked on top of each other.

Reading structure. When multimodal models fail at STEM tasks, the usual suspect is weak reasoning. CodePercept（Code-Grounded Visual STEM Perception for MLLM）argues the real bottleneck is earlier: perception. If the model misreads an image’s structure, reasoning on top of that is doomed. Their fix is unusual: use executable code as a verification standard. The model generates code to reconstruct the image’s structure, forcing it to parse the underlying logic rather than guess from surface patterns.

Finding the right evidence. Once an agent can read a single document, it needs to find the right page in a pile of thousands. Evo-Retriever（LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval）

tackles retrieval over visually complex, mixed-format documents. Its key idea: the training process itself should be dynamic. An LLM meta-controller monitors the model in real time, adjusting sample difficulty so that model and training strategy co-evolve. On AstraZeneca’s multimodal knowledge base QA benchmark, retrieval accuracy improved 14.1% over text-only baselines. The technology is being deployed in Alibaba Cloud’s OpenTrek agent platform.

Judging contradictions. Even after finding relevant material, different sources may conflict. CC-VQA（Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering）puts visual information back at the center of conflict resolution. It lets the model weigh external context against its own parametric knowledge through the visual input, without any retraining. Low-relevance context gets compressed via positional encoding; decoding adjusts dynamically based on conflict severity. Customer service, data analysis, and office productivity agents all hit this problem: lots of material, not all of it trustworthy. A useful agent knows what to ignore.

Running affordably: video at production scale

The second threshold is cost. Multimodal capabilities that work in a demo may not survive in a system handling thousands of calls a day. A single task is fine. Sustained operation, with tight latency, throughput, memory, and per-call cost requirements, is where things break.

Video is the worst offender. Understanding video means processing huge numbers of frames and visual tokens. Generating video means repeated computation across many diffusion steps. When an agent calls the model multiple times per task, checking and correcting as it goes, those costs compound fast.

Two papers attack this from opposite ends.

RAPID（Reusing Attention Sparsity with Inter-step Adaptation） targets video generation. Video diffusion models take many steps, and adjacent steps produce highly similar computations. RAPID exploits this by dynamically deciding how much to reuse based on each step’s actual attention sparsity, rather than applying a fixed reuse rule. On Wan2.1-14B and HunyuanVideo, its high-fidelity mode beat existing baselines at the same compute budget. Turbo mode hit 1.79x and 2.01x speedups while maintaining strong visual quality.

EarlyTom（Early Token Compression Completes Fast Video Understanding）targets the other side: video comprehension speed. It compresses video tokens early, letting the model start reasoning before processing every frame. On a single NVIDIA A100 running LLaVA-OneVision-7B, it improved time-to-first-token by up to 2.65x and cut FLOPs by 61%, with accuracy close to the full-token baseline.

One cuts redundant computation in generation, the other compresses redundant information in understanding. Same destination: multimodal capabilities that don’t bankrupt the task chain they sit inside.

Delivering: output that someone can actually use

Many generative AI products are stuck on the last mile. The model produces a result, but the user can’t edit it into anything usable.

Qwen-Image-Layered （Towards Inherent Editability via Layer Decomposition） tackles this head-on. It decomposes a single RGB image into semantically independent RGBA layers, so people, backgrounds, text, and decorative elements can each be edited without touching the rest.

From the team: “Most current editing methods essentially regenerate the entire image, or inpaint locally. Either way, changes ripple. You move a person to the right and the ocean waves change too. We wanted Photoshop-style layers: each element independently editable. And we wanted it end-to-end, layer decomposition inside a single diffusion process, no separate segmentation or inpainting steps.”

“Editable” matters more to industry than “pretty.” Output enters a workflow only when someone downstream can keep working with it.

Wan-Weaver（Interleaved Multi-modal Generation via Decoupled Training）addresses a different delivery challenge: generating interleaved text and images. Joint training of both modalities causes interference; full separation kills coherence. Wan-Weaver decouples text planning from visual consistency, determining narrative structure first, then generating matching visuals. Selected as a CVPR Oral. Its interleaved generation shipped in Wan 2.6, with the 2.7 release focusing on multi-image sets. Future content agents could deliver structured content units with narrative and visual continuity, not just loose assets.

A cluster of digital human papers extends the delivery story further.

OMG-Avatar（One-shot Multi-LOD Gaussian Head Avatar）and MeshLAM（Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction lower the modeling barrier: a drivable 3D head avatar from a single photo. OMG-Avatar uses multi-level-of-detail for different hardware constraints; MeshLAM goes with mesh-plus-texture for faster integration into existing animation and game pipelines.

AnyID（Ultra-Fidelity Universal Identity-Preserving Video Generation from Any Visual References） solves identity consistency across generated video. The problem with single-reference methods is that one still photo doesn’t carry enough 3D information.

From the AnyID team: “A single photo can’t show the other side of the face, or how muscles move with different expressions. The result might look right at a glance, but anyone who knows the person will feel something’s off. We use multiple references, pick one as the anchor, and describe only what should change via a differential prompt. Everything else stays consistent automatically.”

Compared to traditional pipelines built on 3D rendering and skeleton rigging, this is far more accessible: text prompts control background, motion, and styling. It won’t replace high-precision 3D workflows yet, but it opens a faster production path.

PortraitDirector (A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment) handles real-time facial reenactment, decomposing motion into head pose, lip shape, gaze, expression, and emotion, then recomposing them into natural output. Together, these papers trace a complete digital human production chain: modeling, driving, identity preservation, real-time expression.

“Delivering” means the agent doesn’t just generate once. It produces editable files, consumable content, or production assets that keep moving through the pipeline.

Coda

Agents in production keep dying the same ways. They misread the input. The inference bill explodes. They produce something that looks fine in a screenshot but nobody can work with.

“Full-stack agent support” is a phrase that usually points at compute, cloud platforms, and inference services. What Alibaba Cloud’s CVPR papers suggest is that the model side has just as much ground to cover. You need perception that handles messy, real-world documents. You need inference efficient enough to survive thousands of daily calls. And you need output that doesn’t dead-end at a demo, output that someone downstream can actually pick up and keep working with.

Each of these papers goes after a different point on that chain. Together they start to fill in what “full-stack” actually means on the research side. The hard part of agents was never tool-calling. It was everything around it.

Discussion about this post

Ready for more?