Google DeepMind releases Gemma 4 with four model sizes, multimodal capabilities, and Apache 2.0 licensing. The 31B variant ranks #3 on Arena AI at 1452 Elo.
LLM Updates News
Large language model releases and capability changes.
Alibaba releases Qwen3.6 Plus with 1M token context and major efficiency upgrades, available free on OpenRouter.
Anthropic's Claude Code CLI leaked ~512,000 lines of source code via npmβits second such incident in a year, exposing unreleased features like KAIROS autonomous agent mode.
llama.cpp hit 100k GitHub stars in March 2026, marking a milestone for local AI. The C++ library enables running LLMs offline without API costs.
ETH Zurich researchers accuse Google's ICLR 2026 paper TurboQuant of mischaracterizing their RaBitQ work and using unfair CPU vs GPU benchmarks. The controversy exposes tensions around attribution in AI compression research.
Google Research's TurboQuant algorithm compresses LLM memory by 6x while delivering 8x speedup with zero accuracy loss, targeting the KV cache bottleneck in long-context inference.
Mistral AI releases Voxtral TTS, a 3B-parameter open-source text-to-speech model that beats ElevenLabs Flash v2.5 in human preference tests with a 68.4% win rate.
Google Research announces TurboQuant, a memory compression algorithm that reduces LLM KV cache size by 6x with zero accuracy loss and up to 8x speedup on H100 GPUs.
USCC warns China's open-source AI models now lead global usage, surpassing Meta's Llama in downloads and powering 80% of US AI startups.
OpenAI releases GPT-5.4 mini and nano with 54.4% on SWE-Bench Pro, nearly matching flagship performance at half the cost.
NVIDIA releases Nemotron-Cascade-2-30B-A3B, an open-weight 30B MoE model outperforming Qwen3.5 on coding and math benchmarks with only 3B activated parameters.
Alibaba Cloud's Qwen3.5-397B-A17B uses sparse MoE to achieve 8.6x-19x faster inference while matching GPT-5.2 and Claude Opus on benchmarks.
IBM releases Granite 4.0 1B Speech, a compact multilingual ASR model that ranks #1 on OpenASR leaderboard while running on edge devices.
Meta delays Avocado AI model to May 2026 after internal benchmarks showed it couldn't match Google Gemini 3.0, OpenAI, and Anthropic models.
Tesslate's OmniCoder-9B achieves 61% improvement over Qwen3.5-9B on coding benchmarks while running at 40tps on consumer hardware with 8GB VRAM.
llama.cpp's new sampler mechanism gives real control over reasoning tokens in hybrid models like Qwen3 and DeepSeek-R1, but benchmarking shows poorly-tuned budgets can drop HumanEval scores from 94% to 78%.
OpenAI retires GPT-5.1 from ChatGPT today, angering users who prefer its reasoning over GPT-5.4. The new model features 33% fewer errors and 1M token context.
OpenAI releases GPT-5.4 with 1M token context, three variants, and benchmark gains including 75% on OSWorld (beating human performance).
OpenAI releases GPT-5.4 with three variants, 1M token context, steerable reasoning, and 67.3% WebArena score. Reasoning models struggle to hide chain-of-thoughtβa safety feature.
Alibaba's Qwen 3.5 27B posts competitive benchmarks against GPT-5 while running locally at 90 tokens per second, challenging closed models with open weights.
OpenAI released GPT-5.4 with 75% on OSWorld-Verified, beating human baseline of 72.4%. The model features 1M token context and native computer use capabilities.