CHEAPEST Chinese AI models: Baidu ERNIE 4.5, GLM‑4.1V, Tencent Hunyuan A13B, DeepSeek tops AI benchmarks, and more - Week #26
Hello AI Enthusiasts!
Welcome to the Twenty-Sixth edition of "This Week in AI Engineering"!
This week, China launched INSANE new AI models, a German firm rolled out a blazing-fast DeepSeek variant, LangChain published a guide on "Context Engineering" for agents, and only THIS open-source AI model made it into the top-5 list.
As always, we’ll wrap things up with under-the-radar tools and releases that deserve your attention.
Don’t have time to read the newsletter? Listen to it on the go!
The ERNIE 4.5 lineup is making WAVES
ERNIE 4.5 is a new open-weight family of multimodal Mixture-of-Experts models from Baidu, scaling up to 424 billion total parameters with variants that activate 47B or 3B parameters per token. Trained with a novel heterogeneous MoE structure on PaddlePaddle’s optimized infrastructure, the ERNIE 4.5 series delivers strong performance across language, vision, and cross-modal tasks, from math to document understanding to instruction following.
Multimodal MoE + Heterogeneous Design
Modality-Isolated Routing: Each modality (text, image) routes through dedicated experts with shared global parameters, improving mutual learning without interference.
Router Orthogonal & Token-Balanced Loss: Maintains training stability across modalities while ensuring fine-grained balance in attention and routing decisions (see the sketch after this list).
FP8 Mixed-Precision + Intra-Node Parallelism: Enables efficient large-scale training and high inference throughput across distributed environments.
2-bit/4-bit Lossless Quantization: Achieved via convolutional code compression, cutting memory and boosting inference throughput without sacrificing accuracy.
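To make the modality-isolated routing and token-balanced loss ideas concrete, here is a minimal PyTorch-style sketch. It is not Baidu's implementation: the expert counts, dimensions, top-1 routing, and Switch-Transformer-style balance term are all illustrative assumptions.

```python
# Minimal sketch of modality-isolated MoE routing (illustrative only).
# Assumptions: 4 experts per modality, top-1 routing, and a standard
# load-balancing auxiliary loss; ERNIE 4.5's real design is more involved.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityIsolatedMoE(nn.Module):
    def __init__(self, d_model=512, n_experts=4):
        super().__init__()
        self.routers = nn.ModuleDict({
            "text": nn.Linear(d_model, n_experts),
            "image": nn.Linear(d_model, n_experts),
        })
        self.experts = nn.ModuleDict({
            "text": nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)]),
            "image": nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)]),
        })
        self.shared = nn.Linear(d_model, d_model)  # parameters shared across modalities

    def forward(self, tokens, modality):
        logits = self.routers[modality](tokens)   # [n_tokens, n_experts]
        probs = F.softmax(logits, dim=-1)
        top1 = probs.argmax(dim=-1)               # chosen expert per token
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts[modality]):
            mask = top1 == i
            if mask.any():
                out[mask] = expert(tokens[mask])
        # Token-balanced auxiliary loss (Switch-Transformer style): encourages
        # even token counts and routing mass across the experts of this modality.
        frac_tokens = torch.bincount(top1, minlength=probs.size(-1)).float() / tokens.size(0)
        frac_probs = probs.mean(dim=0)
        aux_loss = probs.size(-1) * torch.sum(frac_tokens * frac_probs)
        return out + self.shared(tokens), aux_loss

# Usage: text and image tokens route to their own expert pools.
moe = ModalityIsolatedMoE()
text_out, text_aux = moe(torch.randn(16, 512), "text")
image_out, image_aux = moe(torch.randn(16, 512), "image")
```

The key point is that text and image tokens never compete for the same experts, while the shared projection still lets the modalities inform each other.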
Post-Training for Purpose-Built Intelligence
Unified Preference Optimization (UPO): Combines reinforcement learning and preference-based fine-tuning for instruction-following tasks.
Modality-Specific Tuning: ERNIE 4.5-VL supports both “thinking” and “non-thinking” reasoning modes, tuned separately for perception and logic-heavy tasks.
High MFU Efficiency: Achieves 47% Model FLOPs Utilization on the largest variant, a notable feat for large-scale MoE models.
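If you're wondering what that 47% figure actually measures: Model FLOPs Utilization is the share of the hardware's theoretical peak FLOPs that a training run spends on useful model compute. The throughput, cluster size, and per-GPU peak below are made-up placeholders, not Baidu's numbers.

```python
# Back-of-the-envelope MFU check. All inputs are hypothetical placeholders.
def mfu(active_params, tokens_per_second, peak_flops_per_second):
    # ~6 FLOPs per active parameter per token covers the forward + backward passes.
    achieved = 6 * active_params * tokens_per_second
    return achieved / peak_flops_per_second

# Hypothetical run: 47B active parameters, 1.5M training tokens/s on a
# 1,000-GPU cluster with an assumed 1 PFLOP/s peak per device.
print(f"MFU: {mfu(47e9, 1.5e6, 1000 * 1e15):.1%}")  # -> MFU: 42.3%
```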
Benchmark Dominance at Every Scale
ERNIE-4.5-300B-A47B: Surpasses DeepSeek-V3-671B on 22 of 28 benchmarks. State-of-the-art in world knowledge, multi-step logic, and instruction response.
ERNIE-4.5-21B-A3B: Outperforms Qwen3-30B on BBH and CMATH with 30% fewer parameters, showcasing excellent efficiency-performance tradeoffs.
ERNIE-4.5-VL-424B-A47B: Matches or exceeds OpenAI-o1 on multimodal benchmarks like MathVista, MMMU, and VisualPuzzle while maintaining top-tier perception in RealWorldQA and CV-Bench.
ERNIE-4.5-VL-28B-A3B: Beats Qwen2.5-VL-7B and rivals Qwen2.5-VL-32B across reasoning and perception with fewer active parameters, while supporting both reasoning and standard modes.
Fully Open and Developer-Ready
Apache 2.0 License: All model variants, training code, and inference stacks are open for commercial and academic use.
Toolkit Release: Includes efficient fine-tuning pipelines, quantization utilities, and multi-device deployment support via PaddlePaddle.
Multi-Hardware Support: Optimized for diverse infrastructure setups, including GPU clusters and edge deployments.
ERNIE 4.5 sets a new benchmark for parameter-efficient, multimodal, instruction-following AI, freely available to the global developer and research community.
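If you want to kick the tires, a minimal Hugging Face transformers sketch is below. The repo ID (baidu/ERNIE-4.5-21B-A3B-PT) and the trust_remote_code flag are my assumptions; check the model card for the exact name and loading instructions before running.

```python
# Minimal sketch: loading an ERNIE 4.5 checkpoint with Hugging Face transformers.
# The repo ID and trust_remote_code flag are assumptions; verify them on the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "baidu/ERNIE-4.5-21B-A3B-PT"  # assumed repo ID for the PyTorch weights
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Summarize the ERNIE 4.5 model family in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
print(tokenizer.decode(model.generate(inputs, max_new_tokens=128)[0], skip_special_tokens=True))
```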
The future of AI reasoning
Zhipu AI, in collaboration with Tsinghua University, has released GLM‑4.1V‑9B‑Thinking, a next-gen open-weight vision-language model that pushes the limits of multimodal reasoning. Built on the GLM-4-9B foundation, it introduces a new “thinking paradigm” powered by reinforcement learning and curriculum sampling. The result: state-of-the-art performance among all 10B-class VLMs, even rivaling Qwen‑2.5‑VL‑72B on 18 benchmark tasks, with only 1/8th the parameters.
Thinking Mode for Deep Visual Reasoning
RLCS Fine-Tuning: A custom Reinforcement Learning with Curriculum Sampling framework teaches the model to handle increasingly complex reasoning tasks, step-by-step.
64k Context Length: Extended sequence processing allows long multimodal documents and conversations.
4K Image Input Support: Handles ultra‑high resolution visuals and arbitrary aspect ratios for richer spatial understanding.
Chinese–English Bilingual: Fully supports reasoning in both languages, broadening real-world deployment scenarios.
Benchmark Leadership at Lightweight Scale
GLM-4.1V-9B-Thinking outperforms previous VLMs like CogVLM2 and GLM‑4V across core reasoning and perception tasks.
Achieves parity with or better performance than Qwen-2.5-VL-72B on 18 vision-language benchmarks, a significant step in reasoning-efficient model design.
Delivers top-tier results in mathematics, document understanding, spatial reasoning, and instruction-following at a fraction of the size.
Inference Performance
Inference performance varies significantly with the serving framework.
On a single A100 GPU with the Transformers library, the model needs a minimum of 22 GB of VRAM and generates roughly 14–22 tokens per second in BF16.
With vLLM on the same A100, the same 22 GB of VRAM, and the same BF16 precision, throughput jumps to roughly 60–70 tokens per second.
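For reference, here's roughly what the faster vLLM path looks like. The repo ID (THUDM/GLM-4.1V-9B-Thinking) and the settings are assumptions on my part, and this text-only sketch skips the image-input plumbing; check the model card for the supported vLLM version and multimodal usage.

```python
# Rough sketch of serving GLM-4.1V-9B-Thinking with vLLM (text-only prompt shown).
# The repo ID is assumed; see the Hugging Face card for the exact name and for
# how to pass image inputs, which this sketch omits.
from vllm import LLM, SamplingParams

llm = LLM(
    model="THUDM/GLM-4.1V-9B-Thinking",  # assumed Hugging Face repo ID
    dtype="bfloat16",                    # matches the BF16 numbers quoted above
    max_model_len=8192,                  # keep the KV cache within a single A100's VRAM
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Explain why a paged KV cache speeds up decoding."], params)
print(outputs[0].outputs[0].text)
```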
Open and Ready for Research
GLM‑4.1V‑9B‑Thinking is fully open-sourced on Hugging Face for academic and industrial experimentation.
GLM‑4.1V‑9B‑Base has also been released, giving the community access to a non-fine-tuned version ideal for downstream tuning and architecture studies.
Offers a robust baseline for future work in multimodal reasoning, multilingual instruction-following, and visual agents.
GLM‑4.1V‑Thinking represents a bold step toward intelligent, reasoning-capable VLMs that are compute-efficient, bilingual, and production-ready.
Tencent’s new AI model is a Reasoning POWERHOUSE
Tencent has introduced Hunyuan‑A13B, a new Mixture-of-Experts (MoE) model optimized for reasoning, instruction-following, and long‑context comprehension, without the massive compute footprint. It features 80B total parameters with just 13B active during inference, delivering top-tier performance across math, science, and agent benchmarks while remaining resource-efficient.
Key Features
Efficient MoE Architecture: 13B active parameters out of 80B total, achieving performance parity with much larger models.
Dual Thinking Modes: Supports both fast and slow thinking paradigms for flexible performance tuning.
256K Context Length: Natively handles ultra‑long documents and multi‑step agent interactions.
Agent-Ready: Tops benchmarks like BFCL-v3, τ-Bench, and C3-Bench, showcasing strong planning and decision-making skills.
Fast Inference: Built with Grouped Query Attention (GQA) and multi-quantization support for real-time deployment.
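As a refresher on why GQA helps real-time deployment, here's a compact PyTorch sketch of grouped-query attention: several query heads share each key/value head, so the KV cache shrinks proportionally. The head counts and dimensions are toy values, not Hunyuan-A13B's actual configuration.

```python
# Illustrative grouped-query attention: 8 query heads share 2 KV heads,
# cutting the KV cache by 4x versus standard multi-head attention.
# Shapes are toy values, not Hunyuan-A13B's real configuration.
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_kv_heads):
    # q: [batch, n_q_heads, seq, head_dim]; k, v: [batch, n_kv_heads, seq, head_dim]
    group = q.shape[1] // n_kv_heads
    k = k.repeat_interleave(group, dim=1)  # expand KV heads to match the query heads
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q = torch.randn(1, 8, 128, 64)   # 8 query heads
k = torch.randn(1, 2, 128, 64)   # only 2 KV heads are stored in the cache
v = torch.randn(1, 2, 128, 64)
out = grouped_query_attention(q, k, v, n_kv_heads=2)
print(out.shape)  # torch.Size([1, 8, 128, 64])
```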
Benchmark Dominance at Every Scale
Hunyuan-A13B: Beats Qwen2.5‑72B and Qwen3‑A22B on MMLU (88.17), outperforms Qwen2.5 on BBH and GPQA, and stays highly competitive on MMLU-Pro and MMLU-Redux despite being smaller in size.
Hunyuan-A13B-Instruct: Outclasses Qwen3‑A22B in agentic reasoning (BFCL v3: 78.3 vs. 70.8, ComplexFuncBench: 61.2 vs. 40.6) and leads in instruction tasks like IF-Eval (84.7) and SysBench (76.1), rivaling OpenAI‑o1 and DeepSeek R1 in ZebraLogic.
Hunyuan-A13B-Instruct (Math/Science): Achieves SOTA results on AIME 2024 (87.3), MATH (94.3), and CMATH (91.17), while edging out Qwen and DeepSeek on GPQA-Diamond (71.2) and dominating EvalPlus (78.64 vs. 65.93).
It consistently ranks among the best across multiple science and logic benchmarks, even against larger or more specialized models.
TNG’s DEEPSEEK but on STEROIDS
TNG-Tech has released R1T2-Chimera, the turbocharged successor to the original DeepSeek R1T. Built via Assembly of Experts from three parent models (DeepSeek R1‑0528, R1, and V3‑0324), this new tri‑mind architecture delivers big wins in reasoning accuracy, latency, and consistency, all without sacrificing personality or usability.
What’s New in R1T2
Tri-Mind Assembly: Combines three DeepSeek brains via fine-grained model merging for greater synergy and intelligence (a simplified merge sketch follows this list).
Think Token Fixed: The <think> token inconsistency from R1T is now fully resolved, improving reasoning flow and output alignment.
Speed Sweet Spot: The model hits an optimal balance of speed and intelligence, running approximately 20% faster than R1 and nearly 2× faster than R1‑0528. Beyond raw speed, it also shows significantly improved reasoning over both R1 and the earlier R1T, making it a notable upgrade across major reasoning benchmarks.
Personality Retained: Balanced tone and well-behaved output without needing system prompts.
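The post doesn't spell out the exact merge recipe, but conceptually Assembly of Experts blends tensors from the parent checkpoints. Here's a deliberately oversimplified sketch of per-tensor weighted averaging across three state dicts; the file paths and mixing weights are placeholders, and TNG's actual method selects and weights tensors far more carefully.

```python
# Deliberately simplified sketch of merging three parent checkpoints by
# per-tensor weighted averaging. Paths and mixing weights are placeholders;
# TNG's Assembly-of-Experts method is far more fine-grained than this.
import torch

parents = {
    "r1_0528": "checkpoints/deepseek-r1-0528.pt",   # hypothetical local paths
    "r1": "checkpoints/deepseek-r1.pt",
    "v3_0324": "checkpoints/deepseek-v3-0324.pt",
}
weights = {"r1_0528": 0.5, "r1": 0.25, "v3_0324": 0.25}  # illustrative mix

state_dicts = {name: torch.load(path, map_location="cpu") for name, path in parents.items()}
merged = {}
for key in state_dicts["r1"]:
    merged[key] = sum(weights[name] * sd[key].float() for name, sd in state_dicts.items())

torch.save(merged, "checkpoints/r1t2-chimera-sketch.pt")
```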
Model Positioning Guide
R1T2 is a strong drop-in replacement for the original R1, offering both improved reasoning capabilities and better latency. Compared to R1‑0528, R1T2 is not only faster and more affordable but also sufficient for most tasks unless absolute state-of-the-art performance is required. When stacked against R1T, R1T2 resolves previous tokenization issues, enhances overall intelligence, and retains the approachable qualities of its predecessor, making it the recommended choice in most scenarios. While V3‑0324 remains the fastest model overall, R1T2 is the preferred option when strong reasoning performance is a priority.
Benchmark Leadership at Lightweight Scale
R1T2 outperforms R1T and V3‑0324 across all major reasoning benchmarks, scoring 82.3 on AIME-24, 70.0 on AIME-25, and 77.9 on GPQA‑Diamond, while maintaining lower latency and higher efficiency.
Delivers comparable performance to R1 (AIME-24: 79.8, AIME-25: 70.0) and closes the gap with R1‑0528 (91.4, 87.5), achieving a strong balance between speed, intelligence, and cost.
Surpasses V3‑0324 (59.4 / 49.6 / 68.4) by a wide margin across math and science tasks, establishing R1T2 as the ideal lightweight reasoning model for high-pass-rate use cases.
Key Notes
R1T2 offers a rare balance: it’s faster than R1, smarter than R1T, and more cost-efficient than R1-0528. While it’s not recommended for function-calling-heavy workloads (yet), for general reasoning, long-context debugging, and assistant-style use cases, it hits a new sweet spot.
TNG recommends following Microsoft’s guidelines for DeepSeek-based models (see MAI-DS-R1 on Hugging Face) for responsible deployment and usage.
A Must-Read for AGENT BUILDERS
LangChain just published a detailed breakdown of Context Engineering, the discipline of managing what goes into an LLM’s context window across an agent’s runtime. As agents get more capable and complex, how you write, select, compress, and isolate context is becoming one of the most critical parts of agent performance.
What Is Context Engineering?
Just like an OS manages RAM, context engineering decides what data sits in the LLM’s context window. The goal? Deliver just the right information at each step of the agent’s reasoning path, no more, no less.
“Context engineering is the delicate art and science of filling the context window with just the right information for the next step.”
4 Core Strategies for Managing Agent Context
Write Context: Save key info outside the window to make it accessible later.
Select Context: Pull relevant info back into the window at runtime.
Compress Context: Trim what’s not needed, keep what matters.
Isolate Context: Split context into subagents or environments.
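You don't need a framework to start experimenting with these ideas. The sketch below is a plain-Python illustration of the four strategies (a scratchpad write, a keyword-based select, a truncation-style compress, and an isolated sub-context); it is not LangChain's API, and the file name, matching logic, and size budget are stand-in choices.

```python
# Plain-Python illustration of the four context strategies; not LangChain's API.
# The scratchpad file, keyword matching, and size budget are all stand-in choices.
import json
from pathlib import Path

SCRATCHPAD = Path("scratchpad.json")

def write_context(key, value):
    """Write: persist info outside the context window for later steps."""
    notes = json.loads(SCRATCHPAD.read_text()) if SCRATCHPAD.exists() else {}
    notes[key] = value
    SCRATCHPAD.write_text(json.dumps(notes))

def select_context(query):
    """Select: pull back only the notes that look relevant to the current step."""
    if not SCRATCHPAD.exists():
        return []
    notes = json.loads(SCRATCHPAD.read_text())
    return [v for k, v in notes.items() if any(w in k for w in query.lower().split())]

def compress_context(messages, max_chars=4000):
    """Compress: keep the most recent messages that fit a crude size budget."""
    kept, used = [], 0
    for msg in reversed(messages):
        if used + len(msg) > max_chars:
            break
        kept.append(msg)
        used += len(msg)
    return list(reversed(kept))

def isolate_context(task, subagent):
    """Isolate: give a sub-task its own fresh context instead of sharing one window."""
    return subagent([f"You handle only this sub-task: {task}"])

write_context("plan_step_1", "Scrape the pricing page and extract tier names.")
print(select_context("plan step"))
```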
Why It Matters
As the post explains, context poisoning, confusion, distraction, and clash are real problems that sabotage agent reliability. With long-running tasks and deep tool feedback loops, token sprawl can wreck performance. As Cognition puts it: “Context engineering is effectively the #1 job of engineers building AI agents.” LangSmith complements this with agent tracing, token usage visualization, and evaluation tools for iterative testing.
Final Takeaway
If you’re building agents, context engineering isn’t optional; it’s the core loop. With LangGraph’s orchestration and LangSmith’s evaluation tools, LangChain offers one of the most complete frameworks for mastering this emerging discipline.
If you're building complex AI agents or tool-using workflows, LangChain’s guide is a must-read. It addresses why many agents fail: not because of the model, but because they lack usable context.
The Only Open-Source Model in the TOP 5
The latest rankings from SciArena, a new benchmark for evaluating foundation models on scientific tasks, just dropped, and DeepSeek R1-0528 has secured a top-5 position. It's also the only open-source model in that elite group, standing tall among heavyweights like o3 and Claude-4-Opus.
What Is SciArena?
SciArena is an open, human-in-the-loop benchmarking platform built specifically for scientific inquiry and reasoning. Think of it as Chatbot Arena, but tailored to the world of STEM.
The platform has three parts:
SciArena Platform: Human researchers submit scientific queries and vote on model responses in head-to-head matchups.
Leaderboard: Elo ratings dynamically rank model performance based on community votes (a toy Elo update is sketched after this list).
SciArena-Eval: A meta-evaluation dataset built from human preferences to evaluate model evaluators.
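If you haven't worked with vote-driven leaderboards before, the Elo update behind them fits in a few lines. The K-factor and 1000-point starting ratings below are generic textbook defaults, not SciArena's published parameters.

```python
# Textbook Elo update for head-to-head model votes. The K-factor and the
# 1000-point starting rating are generic defaults, not SciArena's exact settings.
def elo_update(rating_a, rating_b, a_won, k=32):
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1 - score_a) - (1 - expected_a))
    return rating_a, rating_b

# One simulated vote: a researcher prefers model A's answer.
print(elo_update(1000, 1000, a_won=True))  # -> (1016.0, 984.0)
```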
DeepSeek R1‑0528: Punching Above Its Weight
Out of 23 leading foundation models evaluated, R1-0528 performed particularly well in Natural Sciences, landing it among the top 5 performers, and again, it’s the only open-weight model to do so.
Tools & Releases YOU Should Know About
Codiga is a robust AI coding assistant that transforms the development experience through intelligent support, precise autocomplete suggestions, and sophisticated code optimizations. It streamlines the coding process while upholding high code quality, making it a valuable companion for developers looking to write cleaner, faster, and more efficient code with minimal friction.
Trae is a next-generation coding IDE engineered to empower software developers with advanced automation, deep codebase comprehension, and real-time AI assistance. It analyzes entire projects to answer technical questions, generate code from natural language, and provide context-aware suggestions. By embedding intelligence into the development environment itself, Trae accelerates software creation and reduces cognitive load.
Pieces is an on-device copilot that helps developers capture, enrich, and reuse code snippets intelligently. Designed to integrate seamlessly into your workflow, it streamlines collaboration and boosts productivity through contextual awareness, understanding what you're working on and surfacing relevant insights, code references, or reusable components exactly when you need them.
And that wraps up this issue of "This Week in AI Engineering", brought to you by jam.dev— your flight recorder for AI apps! Non-deterministic AI issues are hard to repro, unless you have Jam! Instantly replay the session, prompt + logs to debug ⚡️
Thank you for tuning in! Be sure to share this newsletter with your fellow AI enthusiasts and subscribe to get the latest updates directly in your inbox.
Until next time, happy building!