The BEST AI image generator, Google Gemma 3n, Mistral's new coding model, new DeepSeek update, and more - Week #21
Hello AI Enthusiasts!
Welcome to the Twenty-First edition of "This Week in AI Engineering"!
This week, Black Forest Labs released FLUX.1 Kontext, a powerhouse text-to-image suite, Gemma 3n debuts as Google’s first open model built on Gemini Nano’s architecture, Mistral’s Codestral Embed sets a new benchmark for code embeddings, DeepSeek R1.1 pushes open-source reasoning with pure RL, LangChain’s LangSmith adds GitHub/CI sync for prompts, and Google Vertex AI expands with cutting-edge document, media, and multimodal models.
With this, we'll also explore some under-the-radar tools that can supercharge your development workflow.
Don’t have time to read the newsletter? Listen to it on the go!
FLUX.1 Kontext Is The BEST AI Image Generator
Black Forest Labs recently released FLUX.1 Kontext, their foundational suite of text-to-image models coupled with context-driven tooling to enhance generation control and fidelity. This suite doesn’t simply generate images; it offers streamlined workflows for inpainting, outpainting, structural conditioning, and image variation, setting a new standard in creative flexibility and output quality.
Flexible & Efficient
Hybrid Architecture & Flow Matching
FLUX.1 Kontext is built on a hybrid multimodal/parallel diffusion transformer backbone with rectified flow matching at its core. Flow matching aligns generated images with target distributions continuously, improving diversity and prompt adherence without requiring discrete denoising schedules.
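To make the flow-matching idea concrete, here is a minimal, illustrative rectified-flow training step in PyTorch. It is a sketch of the general technique, not Black Forest Labs’ actual implementation; the `model(x_t, t)` velocity predictor and the latent shapes are assumptions.

```python
# Illustrative rectified-flow matching loss (not FLUX's actual training code).
# Assumes `model(x_t, t)` predicts the velocity field between data and noise.
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0):
    """x0: a batch of clean latents, shape (B, C, H, W)."""
    noise = torch.randn_like(x0)                   # sample from the target noise distribution
    t = torch.rand(x0.shape[0], device=x0.device)  # continuous time in [0, 1], no discrete schedule
    t_ = t.view(-1, 1, 1, 1)
    x_t = (1.0 - t_) * x0 + t_ * noise             # straight-line interpolation between data and noise
    v_target = noise - x0                          # constant velocity along that path
    v_pred = model(x_t, t)                         # network predicts the velocity
    return F.mse_loss(v_pred, v_target)
```

Because the interpolation path is a straight line, the target velocity is constant along it, which is what lets flow-matching samplers take fewer, larger steps at inference time.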
Rotary Positional Embeddings & Parallel Attention
By employing 3D rotary positional embeddings, FLUX.1 encodes spatial relationships flexibly, preserving structural coherence even under complex edits. Parallel attention layers reduce computational overhead by attending to multiple modalities simultaneously, enabling faster inference and lower latency.
Improved VAE Backbone
FLUX.1’s autoencoder uses 16 latent channels and an adversarial objective to outpace related models in reconstruction. On 4,096 ImageNet samples (256×256), FLUX-VAE achieves a perceptual distance (PDist) of 0.332 ± 0.003, SSIM of 0.896 ± 0.004, and PSNR of 31.1 ± 0.08, all surpassing SD3-VAE and SDXL-VAE baselines.
Multi-Variant Releases
FLUX.1 [pro]: High-throughput, API-optimized for enterprise pipelines. Delivers best-in-class visual fidelity, prompt adherence, and output diversity. Available via BFL API or through Fal.ai, Replicate, Together.ai, Freepik, and Krea.ai.
FLUX.1 [dev]: Open-weight, guidance distilled into a 12B diffusion transformer. Weights on Hugging Face allow local inference or via platforms like Replicate and Mystic; ideal for R&D and academic exploration.
FLUX.1 [schnell]: A 1–4 step latent adversarial diffusion distillation model licensed under Apache 2.0. Integrated with ComfyUI for node-based pipelines, it delivers near–pro level quality on consumer-grade GPUs in low-latency local setups.
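For the open-weight variants, a local run can be as simple as the diffusers sketch below. Treat it as a hedged example: the model id and the step/guidance settings follow the public FLUX.1 [schnell] release, but check the Hugging Face model card for your hardware and license terms.

```python
# Hedged sketch: local inference with FLUX.1 [schnell] via Hugging Face diffusers.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # helps fit consumer GPUs with limited VRAM

image = pipe(
    "a watercolor fox reading a newspaper in a cafe",
    num_inference_steps=4,   # schnell is distilled for 1-4 steps
    guidance_scale=0.0,      # schnell runs without classifier-free guidance
).images[0]
image.save("fox.png")
```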
Strong Benchmark Performance
Unified Text-to-Image & Image-to-Image
FLUX.1 Kontext trains jointly on both T2I and I2I tasks via a rectified flow objective. Single-turn evaluations on the Internal-T2I-Bench (1,000 diverse prompts) show balanced performance across aesthetics, prompt following, typography accuracy, and realism, avoiding the “bakeyness” bias seen in other models. Upgrading from FLUX.1 [pro] to FLUX.1 Kontext [pro] to FLUX.1 Kontext [max] yields consistent gains in each category.
KontextBench – Real-World Multi-Turn Consistency
Black Forest Labs also introduces KontextBench, a 1,026-image benchmark spanning five tasks: local editing (416), global editing (262), text editing (92), style reference (63), and character reference (193). In human evaluations, FLUX.1 Kontext [pro] ranks top in text and local editing and leads in character preservation (measured via AuraFace embeddings), while [max] leads in global editing and style reference.
Inference Latency
For 1024×1024 resolution, FLUX.1 Kontext achieves median text-to-image generation in ~3.2 seconds and image-to-image edits in ~3.8 seconds, matching or exceeding proprietary systems on speed while delivering superior fidelity.
Character & Object Preservation
Iterative editing tests reveal minimal identity drift over six successive edits. AuraFace cosine similarity scores between input and output remain above 0.92 per turn, compared to ~0.80 for comparable models, enabling robust multi-turn narrative workflows.
Inpainting & Outpainting SOTA
FLUX.1 Fill [pro] outperforms Ideogram 2.0 and FLUX-Controlnet-Inpainting in boundary consistency and semantic coherence, while FLUX.1 Fill [dev] offers nearly matching quality with 25% faster inference.
Key Use Cases
Iterative Storyboarding & Narrative Creation
By generating consistent character renditions across multiple turns, e.g., a bird character moving from a bar to a movie theater to grocery shopping, FLUX.1 Kontext enables dynamic storyboard pipelines and rapid concept iteration for entertainment and marketing.
Interactive, Instruction-Driven Editing
Users can remove occlusions (e.g., “remove the thing from her face”), relocate subjects (“take a selfie in Freiburg”), or transform scenes (“make it snow”) with full preservation of character pose, clothing, and photographic style across edits; a minimal API sketch follows this list.
Advanced Visual Cue & Text Editing
Support for bounding-box cues (e.g., “add hats in the boxes”) and embedded text manipulation (e.g., “replace ‘SYNC & BLOOM’ with ‘FLUX & JOY’”) enables precise product photography tasks: extracting garments, creating close-ups, or adjusting textual elements on signage.
Style Transfer & Artistic Variation
With FLUX.1 Canny/Depth modules, designers can restyle architecture renders or character art, preserving edges and depth while applying new textures or lighting. FLUX.1 Redux allows style extraction from an input (“Using this style…”) and generates novel scenes, such as a mirror-piano performance in zero-gravity or a jazz duo of owls on a moonlit bandstand, without compromising artistic consistency.
High-Fidelity Text-to-Image Pipelines
FLUX.1 [pro]/[max] translates detailed creative briefs (storyboards, concept art, editorial visuals) into polished outputs with strong prompt adherence, diverse stylistic palettes, and high resolution (up to 4 MP).
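As referenced in the editing item above, here is a hedged sketch of an instruction-driven Kontext edit through one of the hosted providers (Replicate). The model slug and input field names are assumptions based on the hosted listing, and the source image URL is hypothetical; verify both against the provider’s model page.

```python
# Hedged sketch: instruction-driven editing with FLUX.1 Kontext [pro] on Replicate.
# Model slug and input field names are assumptions; the image URL is hypothetical.
import replicate

output = replicate.run(
    "black-forest-labs/flux-kontext-pro",
    input={
        "prompt": "make it snow, keep the character's pose and clothing unchanged",
        "input_image": "https://example.com/scene.png",
    },
)
print(output)  # typically a URL (or file handle) for the edited image
```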
All FLUX.1 Kontext models comply with Black Forest Labs’ responsible AI policy; usage producing disallowed content is prohibited.
Gemma 3n Is Google’s First Open AI Model Built On Gemini Nano’s Architecture
Google has introduced Gemma 3n, its first open model leveraging Gemini Nano’s architecture. It is available now in early preview so developers can experiment today, and later this year the same technology will power features across Android, Chrome, and other on-device Google ecosystems.
What Makes It Stand Out
Performance & Efficiency: 5B and 8B parameter sizes with DeepMind’s Per-Layer Embeddings (PLE), drastically reducing memory use and delivering the punch of larger models with a memory footprint closer to a 2B or 4B model.
Flexible Inference (Many-in-1): Google’s MatFormer training lets Gemma 3n dynamically trade off between faster, lighter-weight responses and slower, higher-accuracy ones from the same model.
Speed & Footprint: On mobile, it’s ~1.5× faster than Gemma 3 4B, thanks to innovations like Per-Layer Embeddings, key-value cache sharing, and activation quantization.
Multimodal & Multilingual: Understands text, images, audio, and video - and excels in Japanese, German, Korean, Spanish, and French (e.g., WMT24++ benchmarks).
Privacy-First On-Device: Runs locally for enhanced privacy and offline capability, unlocking real-time apps like transcription, translation, and smart interactions.
Getting Started
Developers can start exploring Gemma 3n today through two main options (a quick-start sketch follows the list):
Google AI Studio: Cloud-based, in-browser experience
Google AI Edge: On-device SDK for text and image tasks
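For the AI Studio route, a first call might look like the sketch below using the google-genai Python SDK. The preview model id "gemma-3n-e4b-it" is an assumption; confirm the current name in AI Studio’s model list.

```python
# Hedged sketch: prompting Gemma 3n through Google AI Studio with the google-genai SDK.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
response = client.models.generate_content(
    model="gemma-3n-e4b-it",  # assumed preview id; check AI Studio for the exact name
    contents="Summarize the difference between on-device and cloud inference.",
)
print(response.text)
```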
As with all of Google's models, Gemma 3n was developed with a focus on safety, governance, and responsible AI use. Every step, from data handling to model alignment, was shaped by ethical guidelines and safety standards.
Mistral Codestral Embed Outperforms Cohere And OpenAI’s Models
Mistral AI recently released Codestral Embed - their first embedding model specifically designed for code. And it’s not just another tool in the box; it’s already outperforming the current leaders in the space, including Voyage Code 3, Cohere Embed v4.0, and OpenAI’s large code embedding model.
What sets Codestral Embed apart is its retrieval power on real-world coding tasks. It’s built for developers who need efficient, accurate code search and context retrieval, whether it’s for completions, editing, or explanation.
Flexible & Efficient
Choose embedding dimensions and precision to balance quality vs. storage (e.g., a 256-dimension int8 configuration still beats competitors); see the sketch after this list.
Relevance-ranked dimensions let you trim for storage or speed without steep quality loss.
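Here is a hedged sketch of what that flexibility looks like with the mistralai Python SDK. The `output_dimension` and `output_dtype` options mirror the dimension/precision trade-off described above, but their exact names are assumptions; verify them against Mistral’s current API reference.

```python
# Hedged sketch: embedding code with Codestral Embed via the mistralai SDK.
# The output_dimension/output_dtype parameter names are assumptions.
from mistralai import Mistral

client = Mistral(api_key="YOUR_API_KEY")
resp = client.embeddings.create(
    model="codestral-embed",
    inputs=[
        "def binary_search(arr, target): ...",
        "SELECT name FROM users WHERE age > 30;",
    ],
    output_dimension=256,  # smaller vectors to trade a little quality for storage
    output_dtype="int8",   # lower precision for cheaper indexing
)
vectors = [item.embedding for item in resp.data]
```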
Strong Benchmark Performance
SWE-Bench Lite: Codestral Embed sets a new open-model record on real-world issue-fix retrieval, outpacing Voyage Code 3, Cohere Embed v4.0, and OpenAI’s Text Embedding 3 Large
CodeSearchNet Code → Code: Achieves state-of-the-art mean reciprocal rank for retrieving code snippets from GitHub contexts, surpassing all current code-specialized embedders
CodeSearchNet Doc → Code (Text2Code GitHub): Delivers top precision on docstring-to-code retrieval tasks, outperforming closed-source alternatives in single-pass evaluations
CommitPack (Text2Code GitHub): Leads in mapping commit messages to the correct file modifications, setting a new SOTA on real-world commit retrieval benchmarks
SQL Retrieval (Spider, WikiSQL, Synthetic Text2SQL): Pushes slot-filling accuracy above 90% on natural-language-to-SQL benchmarks, outstripping Voyage Code 3 and Cohere Embed v4.0
Algorithmic Matching (DM Code Contests, APPS, CodeChef, MBPP+): Tops recall metrics across a broad suite of programming-contest and data-science problems, with leading performance in both algorithmic and DS-1000 retrieval tasks
Macro Average: Across all eleven code-retrieval categories, Codestral Embed achieves the highest aggregated score of any publicly available model, cementing its role as the go-to embedder for coding agents and RAG systems
Key Use Cases
Codestral Embed is built with developers in mind and fits into a variety of real-world applications:
Retrieval-Augmented Generation - Pull the right snippets fast for code completions, edits, or documentation suggestions.
Semantic Code Search - Search codebases with natural language or code queries and get relevant results with precision (sketched below).
Duplicate Detection - Identify functionally similar or near-duplicate code, even if it’s written differently.
Semantic Clustering - Group and analyze code by structure or function, helping with repo management, pattern discovery, and auto-documentation.
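To illustrate the semantic code search use case end to end, the sketch below ranks snippets by cosine similarity over Codestral Embed vectors. The `embed()` helper is hypothetical shorthand for the embeddings call shown earlier.

```python
# Hedged sketch: semantic code search over Codestral Embed vectors.
# `embed(texts)` is a hypothetical wrapper returning one vector per input.
import numpy as np

def cosine_top_k(query_vec, corpus_vecs, k=3):
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q                   # cosine similarity against every snippet
    return np.argsort(-scores)[:k]   # indices of the k best matches

snippets = ["def quicksort(xs): ...", "class LRUCache: ...", "async def fetch(url): ..."]
corpus = np.array(embed(snippets))
query = np.array(embed(["function that sorts a list"]))[0]
for idx in cosine_top_k(query, corpus):
    print(snippets[idx])
```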
DeepSeek R1.1, Now With Reinforcement Learning
First-Generation Reasoning Models: DeepSeek-R1-Zero & DeepSeek-R1
DeepSeek’s R1-Zero (pure RL without initial SFT) naturally develops reasoning behaviors, while R1 adds a small SFT phase before RL for coherence.
What’s New?
Reinforcement Learning, No Fine-Tuning First
DeepSeek-R1-Zero is trained using large-scale reinforcement learning (RL) without the usual supervised fine-tuning (SFT) upfront. That’s a big shift. This approach allowed the model to naturally develop reasoning behaviors like step-by-step thinking, self-checking, and reflection, all without human-labeled datasets at the start.
But it wasn’t perfect. DeepSeek-R1-Zero had some quirks: repetition, occasional gibberish, and inconsistent language output. So DeepSeek introduced DeepSeek-R1, which starts with a small SFT phase before diving into RL. This helped polish its reasoning skills while keeping things coherent and readable.
Matching the Best
DeepSeek-R1 performs on par with OpenAI’s o1 models across coding, math, and reasoning tasks. Even more impressive? DeepSeek has open-sourced both R1 and R1-Zero, plus six smaller distilled models based on LLaMA and Qwen that pack a serious punch.
What makes DeepSeek-R1 such a leap forward
It’s the first open-source proof that pure RL (no SFT) can teach LLMs how to reason effectively.
It’s one of the best-performing open models on math and code.
The distilled versions (even as small as 1.5B or 7B) perform better than many competing mid-size models.
The Distilled Lineup
DeepSeek used R1 to generate reasoning-rich data, then trained smaller models on it - resulting in compact but powerful versions that outperform typical distilled models. These include checkpoints based on Qwen2.5 and Llama3, ranging from 1.5B to 70B parameters.
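Trying one of the distilled checkpoints locally is straightforward with transformers. The sketch below assumes the published `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B` repo id; verify it on the Hugging Face Hub before downloading.

```python
# Hedged sketch: running a distilled R1 checkpoint with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed repo id; check the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Prove that the sum of two even numbers is even."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```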
Benchmark Performance
General Knowledge & Reasoning (MMLU Series): 90.8% on MMLU and 84.0% on MMLU-Pro
Scientific QA (GPQA Diamond): 71.5% single-attempt accuracy
Code Generation (LiveCodeBench): Ranks just below OpenAI’s o4-mini and o3, outperforming xAI’s Grok 3 mini and Alibaba’s Qwen 3
Efficiency & Cost: Blended inference cost of $0.96 per 1M tokens ($0.55 input, $2.19 output), throughput of 31.9 tokens/sec with a first-token latency of 3.15s, and a 130K-token context window
Overall Intelligence Index: Ranks at 68 on an aggregated “Intelligence Index,” exceeding the average quality threshold for modern LLMs
All DeepSeek-R1 models, including the distilled ones, are open-source and commercially usable. Just note that some are derived from Qwen and LLaMA models, so they inherit those licenses (Apache 2.0 or LLaMA-specific).
Your LangChain Prompts Are Now Just Like Code
LangChain’s LangSmith platform now lets you treat prompts just like code by automatically syncing prompt definitions to GitHub and triggering your CI/CD pipelines on every update. Whether you’re collaborating on prompt engineering, auditing changes, or rolling out new prompt versions alongside your application code, this feature brings prompt development into your existing software lifecycle.
Flexible & Integrated
LangSmith’s new GitHub/CI sync leverages webhook triggers on prompt commits. You configure a webhook in the LangSmith Console (or via the REST API) that fires whenever a prompt is created or updated. That webhook payload can then:
Commit to GitHub: Push prompt manifests (YAML/JSON) directly into your repo, complete with version history and diffs.
Invoke CI/CD: Kick off GitHub Actions, Jenkins jobs, or any other CI workflow to run validation tests, deploy to staging, or promote to production.
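On the receiving end, a typical CI step might pull the freshly synced prompt and sanity-check it before deployment, as in the hedged sketch below. The prompt name "support-triage" and its `ticket` variable are hypothetical; `Client.pull_prompt` is part of the langsmith SDK.

```python
# Hedged sketch: CI validation step for a LangSmith-managed prompt.
# "support-triage" and its `ticket` variable are hypothetical examples.
from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment
prompt = client.pull_prompt("support-triage")  # returns a chat prompt template

rendered = prompt.format_messages(ticket="Example ticket body")
assert rendered, "prompt rendered no messages"
assert all("{ticket}" not in str(m.content) for m in rendered), "unfilled template variable"
print("Prompt validation passed")
```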
Key Use Cases
Prompt Versioning
Keep prompt definitions versioned alongside application code. Roll back to previous prompt versions using standard Git techniques.
Automated Validation
Trigger unit tests or linting (e.g., prompt-format checks, test generations) on every prompt change to catch errors before they reach production.
Continuous Deployment
Deploy updated prompts to staging or production LLM endpoints automatically as part of your CI/CD pipeline.
Audit & Compliance
Maintain an immutable audit trail of prompt changes for regulatory or internal governance needs.
Google Vertex AI Model Garden
Google’s Vertex AI continues to expand its ecosystem by integrating a diverse set of state-of-the-art models, from document understanding to generative audio, image, and video, giving enterprises the tools they need for advanced AI workflows.
Key Use Cases
Document Automation: Extract structured data at scale with Mistral OCR for invoicing, compliance, and archival.
Conversational AI: Build chatbots and virtual assistants with Claude Opus 4 or Sonnet 4, scaling seamlessly using provisioned throughput (see the sketch after this list).
Retrieval-Augmented Generation: Combine Claude or Gemini 2.5 Pro with your enterprise data in RAG pipelines for accurate, context-rich responses.
Audio Composition: Create background scores, jingles, or narration tracks with Lyria 2.
Image & Video Creation: Produce high-quality images (Imagen 4) and dynamic videos (Veo 3) directly from text prompts.
Healthcare NLP: Leverage MedGemma for medical coding, summarization, and insights extraction.
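As noted in the Conversational AI item above, here is a hedged sketch of calling Claude through Vertex AI using the anthropic SDK’s Vertex client. The project, region, and model id are assumptions; check your Model Garden listing for the exact identifier enabled on your project.

```python
# Hedged sketch: Claude Opus 4 on Vertex AI via the anthropic SDK.
# Project, region, and model id are assumptions for illustration.
from anthropic import AnthropicVertex

client = AnthropicVertex(project_id="my-gcp-project", region="us-east5")
message = client.messages.create(
    model="claude-opus-4@20250514",
    max_tokens=512,
    messages=[{"role": "user", "content": "Draft a welcome reply for a new enterprise customer."}],
)
print(message.content[0].text)
```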
Tools & Releases YOU Should Know About
Replit Ghostwriter is a built-in AI assistant on the Replit online IDE that helps you write, debug, and optimize code collaboratively. Ghostwriter can generate entire functions, explain errors, and suggest performance enhancements in multiple languages. Because it runs directly in the browser, there’s no setup - just code and get suggestions instantly. It’s designed for hobbyists, educators, and full-stack developers who want an all-in-one coding environment with AI superpowers.
Sourcegraph Cody brings AI-driven code search and automation to large codebases. Cody can answer questions about your code, generate complex queries, and create PRs with ready-to-review changes. It integrates with your CI/CD pipeline and supports self-hosted setups for maximum security. With Cody, developer teams can onboard faster, enforce code standards, and reduce time spent digging through repositories, making it perfect for organizations managing monolithic or microservices architectures.
Codeium is a free AI-powered coding agent offering real-time completions, documentation lookup, and code navigation in your IDE. With support for VS Code, JetBrains, and Sublime Text, Codeium helps developers write code faster by generating snippets, refactoring existing functions, and suggesting improvements. It keeps your code proprietary by running inference in a secured environment. Codeium is ideal for startups and open-source contributors looking for a zero-cost AI boost without sacrificing security.
And that wraps up this issue of "This Week in AI Engineering", brought to you by jam.dev - your flight recorder for AI apps! Non-deterministic AI issues are hard to repro, unless you have Jam! Instantly replay the session, prompts, and logs to debug ⚡️
Thank you for tuning in! Be sure to share this newsletter with your fellow AI enthusiasts and subscribe to get the latest updates directly in your inbox.
Until next time, happy building!