VentureBeat | April 30, 2026 1:51 PM PT

One of the key challenges in building effective AI agents is teaching them to choose between using external tools and relying on their internal knowledge. But large language models are often trained to invoke tools blindly, which creates latency bottlenecks, incurs unnecessary API costs, and degrades reasoning by injecting environmental noise. To overcome this challenge, researchers at Alibaba introduced Hierarchical Decoupled Policy Optimization (HDPO), a reinforcement learning framework that trains agents to balance execution efficiency with task accuracy. Metis, a multimodal model they trained with this framework, reduces redundant tool invocations from 98% to just 2% while setting new state-of-the-art reasoning accuracy on key industry benchmarks. The framework helps create AI agents that are not trigger-happy with tools and know when to abstain from using them, enabling more responsive and cost-effective agentic systems.

Current agentic models face what the researchers call a "profound metacognitive deficit": they struggle to decide when to rely on their internal parametric knowledge and when to query an external utility. As a result, they blindly invoke tools and APIs, such as web search or code execution, even when the user's prompt already contains all the necessary information.

In conventional training, accuracy and efficiency are coupled into a single reward, so optimizing one often comes at the expense of the other. To solve this optimization dilemma of coupled rewards, HDPO separates accuracy and efficiency into two independent optimization channels. The accuracy channel focuses on maximizing task correctness, while the efficiency channel optimizes for execution economy. HDPO computes the training signals for these two channels independently and only combines them at the final stage of loss computation. Crucially, the efficiency signal is conditional on the accuracy channel, so an incorrect response is never rewarded simply for being fast or using fewer tools.
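The paper's exact objective is not reproduced here, but the description above maps onto a simple policy-gradient sketch. The function below is a hypothetical illustration: the names `hdpo_loss`, `max_tools`, and `eff_weight` are assumptions, and the REINFORCE-style update stands in for whatever optimizer HDPO actually uses. What it preserves are the structural points from the article: the accuracy and efficiency rewards are computed independently, the efficiency reward is gated on correctness, and the two signals are combined only when forming the final loss.

```python
import torch

def hdpo_loss(logprobs, correct, tool_calls, max_tools=4, eff_weight=0.3):
    """Sketch of a decoupled reward in the spirit of HDPO (names assumed).

    Args:
        logprobs:   (batch,) summed log-probabilities of each sampled trajectory
        correct:    (batch,) 1.0 if the final answer was right, else 0.0
        tool_calls: (batch,) number of tool invocations in the trajectory
    """
    # Accuracy channel: reward task correctness only.
    r_acc = correct

    # Efficiency channel: reward using fewer tools, but gate the term on
    # correctness so a wrong answer is never rewarded for being cheap.
    r_eff = correct * (1.0 - tool_calls.clamp(max=max_tools) / max_tools)

    # The two signals are computed independently and only combined here,
    # at the final loss (REINFORCE-style, no baseline, for brevity).
    reward = r_acc + eff_weight * r_eff
    return -(reward.detach() * logprobs).mean()
```

Gating the efficiency term on `correct` is what avoids the failure mode a coupled reward invites, where the policy learns to answer quickly but wrongly.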
