AI image and photo models are getting INSANELY good, this might change LLMs forever, OpenAI Deep Research API, Google Gemma 3n, Gemini CLI AI Agent and more - Week #25
Hello AI Enthusiasts!
Welcome to the Twenty-Fifth edition of "This Week in AI Engineering"!
This week, OpenAI expanded its API with new Deep Research and Webhooks modules, Google released Gemma 3n for multimodal use on low-resource devices, and Gemini CLI hit the terminal. Meanwhile, Sakana.ai unveiled a new framework for reasoning via reinforcement-based teacher models, Higgsfield dropped a stunning new aesthetic model called Soul, and Black Forest Labs released FLUX.1 Kontext [dev], an open-weights image editor that rivals proprietary tools.
As always, we’ll wrap things up with under-the-radar tools and releases that deserve your attention.
Don’t have time to read the newsletter? Listen to it on the go!
Higgsfield Soul: The Most Aesthetic AI Photo Model
Soul is the newest photo-only model by Higgsfield.ai, and it’s trained specifically to hit magazine-level visual quality out of the box.
AestheticNet Performance
95th Percentile Score on internal AestheticNet benchmarks for texture, lighting, and color fidelity.
Curated Presets: 50+ fashion‑grade styles, from “Quiet Luxury” to “Y2K Retro”
Technical Highlights
Photo‑Only Focus: Unlike generalist diffusion models, Soul is laser‑tuned for still imagery.
Precision Inpainting: Retains facial features and fine details across diverse poses and lighting.
Artistic Control
Preset Library: One‑click application of editorial looks.
Fine‑Tuning Sliders: Adjust contrast, grain, color saturation, and mood.
Key Use Cases
Fashion & Advertising: Rapid generation of campaign stills with consistent branding.
Portraiture Services: On‑demand professional headshots and social media avatars.
E‑Commerce: Product photography with consistent studio‑grade lighting.
FLUX.1 Kontext [dev]: Open Weights, Proprietary-Level Image Editing
Kontext [dev], the newest member of the FLUX.1 family, is now available as an open-weights model that delivers image-editing capabilities comparable to top proprietary tools.
Model Specs & Open Weights
12B Parameters: Optimized for local & global edits.
Open Non‑Commercial License: Weights on Hugging Face with support for ComfyUI, Diffusers, and TensorRT.
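Since the weights ship with Diffusers support, a minimal editing sketch might look like the following (the model id follows the Hugging Face release; the input image and prompt are placeholder assumptions):

```python
import torch
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

# Load the open-weights Kontext [dev] checkpoint (~12B params; bf16
# keeps memory in check on a recent GPU).
pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Dual-conditioning: a source image plus a text instruction.
source = load_image("portrait.png")  # placeholder input image
edited = pipe(
    image=source,
    prompt="Change the jacket to red; keep the face and pose identical",
).images[0]
edited.save("portrait_red_jacket.png")
```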
Editing Capabilities
Iterative In‑Context Edits: Modify images step‑by‑step without drift.
Character Preservation: Maintains subject identity across multiple edits.
Dual‑Conditioning: Text + image prompts for precise control.
Benchmark Results
KontextBench: Outperforms open models (e.g., Bagel, HiDream‑E1) and closed systems (Gemini‑Flash Image) on human preference tests.
Optimized Variants: BF16, FP8, FP4 TensorRT options for speed–quality trade‑offs.
Integration & Variants
Dev: Fully open‑source, research‑focused.
Pro & Max: Commercial tiers offering faster renders (3–5 s), advanced typography, and enterprise SLAs.
Key Use Cases
Creative Toolchains: Embed studio‑grade editing into web and desktop apps.
Rapid Prototyping: Designers can test visual concepts on consumer hardware.
Academic Research: Study flow matching and iterative editing without license barriers.
For developers building creative tooling, Kontext [dev] provides a transparent, tunable base model with openly available weights (under a non-commercial license). Think of it as a Photoshop-grade layer under your AI product.
This Might Change LLMs Forever
Sakana.ai has proposed a novel architecture, Reinforcement Learning Teachers (RLTs) of Test-Time Scaling, which flips traditional RL fine-tuning on its head: instead of rewarding a model for solving problems on its own, it rewards the model for teaching them well.
Learning‑to‑Teach Framework
Prompted with Question + Answer: RLTs receive both the problem and its solution, focusing on crafting clear, step‑by‑step explanations.
Clarity‑Driven Rewards: Teachers are rewarded based on how well a student LLM internalizes the lesson, measured via student log‑probabilities.
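To make the clarity-driven reward concrete, here's an illustrative PyTorch sketch (our own simplification, not Sakana's code) that scores a teacher's explanation by how probable the ground-truth solution becomes under the student:

```python
import torch
import torch.nn.functional as F

def clarity_reward(student_logits: torch.Tensor,
                   solution_ids: torch.Tensor) -> torch.Tensor:
    """Dense RLT-style reward (illustrative simplification).

    student_logits: [T, V] logits the student assigns at the positions
                    where the solution tokens should appear, after it
                    has read the teacher's explanation.
    solution_ids:   [T] ground-truth solution token ids.
    """
    log_probs = F.log_softmax(student_logits, dim=-1)
    # Log-probability of each correct solution token under the student.
    token_lp = log_probs.gather(1, solution_ids.unsqueeze(1)).squeeze(1)
    # Higher mean log-likelihood => clearer explanation => larger reward.
    return token_lp.mean()
```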
Training Process
Dense Reward Signals: Continuous feedback from student performance enables efficient RL on 7B-parameter teacher models.
Distillation‑Ready Outputs: Explanations directly serve as training data for downstream student models.
Performance Benchmarks
Competition Tasks: Students distilled from RLT explanations outperform pipelines that rely on orders-of-magnitude larger teacher LMs.
Zero‑Shot Generalization: Maintains reasoning efficacy on out‑of‑distribution benchmarks without additional tuning.
Key Applications
Cost‑Efficient Reasoning: Build high‑performance reasoning assistants without massive compute or retraining costs.
Curriculum Learning: Automate generation of teaching materials for specialized domains.
On‑Demand Fine‑Tuning: Rapidly adapt student models for new tasks by swapping in different RLT teachers.
It’s still early research, but this could be a breakthrough for cheaper, more scalable logic-intensive systems.
OpenAI API Adds Deep Research & Webhooks
OpenAI just added two powerful capabilities to its developer API, Deep Research and Webhooks, unlocking a whole new layer of intelligence and interactivity for agent-based apps.
Deep Research Models
o3‑deep‑research & o4‑mini‑deep‑research: These models synthesize across hundreds of web sources, returning structured, cited reports instead of snippets.
Autonomous Multistep Reasoning: Agents can now initiate deep dives on complex topics (market research, technical reviews, academic surveys) directly from code.
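In the Python SDK, kicking off a deep research job looks roughly like this (model names from the announcement; the prompt and tool configuration are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Launch a long-running research job in background mode so the call
# returns immediately instead of blocking until the report is done.
job = client.responses.create(
    model="o4-mini-deep-research",
    input="Survey open-weights image-editing models released this year, "
          "with cited sources.",
    background=True,
    tools=[{"type": "web_search_preview"}],  # give the agent a data source
)
print(job.id, job.status)  # e.g. "resp_..." "queued"
```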
Pricing & Performance
o3-deep-research Pricing: $10 per 1M input tokens, $40 per 1M output tokens.
o4-mini-deep-research Pricing: $2 per 1M input tokens, $8 per 1M output tokens.
Latency & Reliability: Designed for background execution, pairing Deep Research with Webhooks to avoid timeouts and network issues.
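To put those numbers in perspective: a report that consumes 200K input tokens and returns a 50K-token output on o3-deep-research costs roughly 0.2 × $10 + 0.05 × $40 = $4; the same job on o4-mini-deep-research comes to about $0.80.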
Webhooks
Event‑Driven Workflows: Receive callbacks when long‑running tasks (e.g., deep research jobs) complete, eliminating the need for polling.
Secure & Scalable: Supports authenticated endpoints and structured payloads, ideal for batch processing, CI/CD pipelines, or CRM triggers.
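A minimal receiver might look like the sketch below (Flask is our choice, not an OpenAI requirement; the event shape follows the announcement, but verify it, and validate signatures, against the current docs):

```python
from flask import Flask, request
from openai import OpenAI

app = Flask(__name__)
client = OpenAI()

@app.route("/openai-webhook", methods=["POST"])
def handle_event():
    event = request.get_json()
    # Assumed event shape: {"type": "response.completed", "data": {"id": ...}}
    if event.get("type") == "response.completed":
        # Fetch the finished deep research report by its response id.
        result = client.responses.retrieve(event["data"]["id"])
        print(result.output_text)
    return "", 200
```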
Key Use Cases
Automated Competitive Analysis: Agents that track and report on new competitor and market developments.
Research Assistants: Build workflows that automatically generate literature reviews or technical audits.
Enterprise Integrations: Tie into ticketing systems or dashboards for on‑demand deep dives.
Together, these tools shift OpenAI’s API toward dynamic, live agent ecosystems, not just static prompting.
Google Releases Gemma 3n: Light, Open, Multimodal
Google has officially dropped Gemma 3n, the newest entry in its lightweight open model family, built on the same core research as Gemini.
Model Architecture
MatFormer Backbone & PLE Caching: Parameter‑efficient layers and per‑layer embedding caches reduce compute and memory footprint.
E2B & E4B Variants: Effective 2B and 4B parameter footprints, optimized for different performance–efficiency trade-offs.
Multimodal & Multilingual
Input Types: Native support for text, images, video, and audio.
Language Coverage: Text support for 140+ languages; multimodal understanding across 35 languages.
Efficiency & On‑Device Performance
Offline Inference: Runs entirely on-device, ideal for privacy‑sensitive or low‑connectivity scenarios.
2 GB RAM Footprint: Enables AI on smartphones, tablets, and edge hardware without cloud dependency.
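For local experimentation, the weights run through Hugging Face transformers; here's a minimal sketch (the model id and chat format follow Google's model card, but double-check the exact identifier, and note the weights are license-gated):

```python
from transformers import pipeline

# Gemma 3n is multimodal, so it loads under the image-text-to-text task;
# text-only prompts work by omitting image content from the message.
pipe = pipeline("image-text-to-text", model="google/gemma-3n-E2B-it")

messages = [{
    "role": "user",
    "content": [{"type": "text",
                 "text": "Summarize why on-device inference helps privacy."}],
}]
out = pipe(text=messages, max_new_tokens=64)
print(out[0]["generated_text"])
```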
Key Use Cases
Mobile Assistants: Local chatbots that understand voice, image, and text queries.
Privacy‑First Apps: Healthcare or finance tools where data never leaves the device.
Field Research: Offline translation and multimodal analysis for remote areas.
Whether you're building local AI assistants, mobile multimodal apps, or multilingual chat interfaces, Gemma 3n is a powerful, open alternative to proprietary multimodal giants.
Gemini CLI Brings AI to the Terminal
Google also quietly launched Gemini CLI, an open-source command-line interface that puts Gemini directly into your dev terminal.
Features & Integrations
Natural‑Language Prompts: Code generation, bug fixes, documentation, research queries.
MCP & Real-Time Data: Supports the Model Context Protocol (MCP) and built-in Google Search grounding to fetch live web data when needed.
Multimodal Extensions: Integrations with Imagen and Veo for image/video generation.
Performance & Limits
60 requests/minute and 1,000 requests/day free (via Gemini Code Assist license).
1M-token context window for complex, multi-step prompts.
Developer Experience & Extensibility
Fully Open‑Source: Explore code, contribute plugins, extend functionality.
ReAct Loop: Reason‑and‑act framework to chain local tools, scripts, and cloud services.
Key Use Cases
Terminal‑First Workflows: Reduce context‑switching for devs who prefer shells.
CI/CD Automation: Scripted AI checks for code quality or task orchestration.
Ad‑hoc Research: Quick content generation and data lookup without leaving the terminal.
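As a concrete example of the CI/CD use case above, here's a minimal Python sketch that shells out to the CLI to review a diff (the -p flag for one-shot, non-interactive prompts follows the project README; adjust to your pipeline):

```python
import subprocess

# Collect the branch diff we want reviewed.
diff = subprocess.run(
    ["git", "diff", "origin/main...HEAD"],
    capture_output=True, text=True, check=True,
).stdout

# One-shot, non-interactive Gemini CLI call; assumes `gemini` is
# installed and authenticated in the CI environment.
review = subprocess.run(
    ["gemini", "-p", f"Review this diff for obvious bugs:\n\n{diff}"],
    capture_output=True, text=True,
)
print(review.stdout)
```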
For engineers tired of context-switching to chat UIs, Gemini CLI is a productivity boost you can script.
Tools & Releases YOU Should Know About
Warp 2.0 is an agentic development environment designed to accelerate software creation using AI. It enables you to spawn and orchestrate multiple agents in parallel, each handling specific tasks in a development workflow. From writing boilerplate code to debugging and documentation, Warp 2.0 abstracts complex development processes into coordinated agent actions, making it ideal for high-velocity engineering teams looking to boost productivity through AI-native workflows.
Gru.ai is an AI developer assistant that supports your daily programming needs—whether it's writing algorithms, debugging runtime errors, testing code, or answering technical questions. Gru.ai acts like a tireless pair programmer, helping you move faster through coding tasks by offering intelligent, context-aware suggestions across a wide range of languages and frameworks. It’s a valuable tool for solo developers and teams looking to reduce friction in the coding lifecycle.
GoCodeo is a full-stack AI development agent that lets you build, test, and deploy complete applications with minimal effort. It integrates seamlessly with Supabase for backend functionality and offers one-click deployment via Vercel, removing the need for manual setup. Whether you're prototyping or building production-ready apps, GoCodeo compresses hours of engineering work into minutes with its intuitive agent-driven automation.
Swimm enhances code comprehension and team collaboration through AI-powered, context-sensitive documentation. By leveraging static analysis and machine-generated explanations, Swimm integrates directly into IDEs like VSCode, JetBrains, IntelliJ, and PyCharm. It helps developers navigate unfamiliar codebases by providing inline documentation that evolves with your code—minimizing onboarding time and reducing the cognitive load of maintaining technical knowledge across teams.
And that wraps up this issue of "This Week in AI Engineering", brought to you by jam.dev, your flight recorder for AI apps! Non-deterministic AI issues are hard to repro, unless you have Jam! Instantly replay the session, prompts, and logs to debug ⚡️
Thank you for tuning in! Be sure to share this newsletter with your fellow AI enthusiasts and subscribe to get the latest updates directly in your inbox.
Until next time, happy building!