Google is boosting local artificial intelligence performance by up to three times with the MTP technology announced for its Gemma 4 models. Here are the details.
Google has taken a new performance-focused step for the open-source Gemma 4 models it launched this spring. The company has made Multi-Token Prediction (MTP) drafter models available to developers, with the aim of accelerating local AI inference.
These experimental models rely on speculative decoding, a technique in which a lightweight model predicts upcoming tokens for the main model to check. In this way, text generation can be significantly faster than standard decoding, in which the model produces every token on its own.
Targeting High Performance on Local Hardware
Gemma 4 models share a similar architecture with the infrastructure that underpins Google’s advanced Gemini artificial intelligence technology. Gemini models are optimized to run on custom TPU chips located in Google’s massive data centers.
Gemma, on the other hand, allows users to run this technology on their own local hardware without transferring their data to cloud systems.
With Gemma 4, Google also changed its licensing policy, switching to the Apache 2.0 license. The new license offers far broader usage rights and flexibility than the custom licenses used for previous versions.
However, models running on local systems lack the high-bandwidth memory (HBM) advantage of enterprise hardware. As a result, the processor spends much of its time shuttling parameters from VRAM to the compute units, and its processing cycles are used inefficiently.
How Does MTP Technology Work?
Traditional large language models generate units called tokens autoregressively, that is, one at a time. Each token requires the same amount of computing power, regardless of how trivial its content is.
This is where MTP comes in: a much lighter drafter model produces speculative tokens, easing the load on the heavy main model, which then only has to verify the drafts instead of generating every token itself.
These drafter models are small, with only 74 million parameters in the case of the Gemma 4 E2B drafter, and are optimized specifically for fast speculative token generation. They share the main model's key-value cache, so the context the main model is currently working in does not have to be recomputed.
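To make the draft-and-verify idea concrete, below is a minimal, self-contained Python sketch. It is not Google's MTP implementation: main_model and drafter are toy stand-ins that return made-up probabilities, and the acceptance rule is a simplified greedy check rather than the rejection-sampling scheme used in production speculative decoding.

import random

VOCAB = list(range(100))  # toy vocabulary of 100 token ids

def toy_probs(context, temperature):
    # Stand-in for a forward pass: deterministic pseudo-probabilities per context.
    rng = random.Random(sum(context) * 31 + len(context))
    weights = [rng.random() ** (1.0 / temperature) for _ in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]

def main_model(context):   # heavy model: one expensive pass per call
    return toy_probs(context, temperature=1.0)

def drafter(context):      # light drafter: cheap, slightly less accurate
    return toy_probs(context, temperature=1.3)

def speculative_step(context, k=4):
    """Draft k tokens cheaply, then let the main model check them."""
    draft, ctx = [], list(context)
    for _ in range(k):  # the drafter proposes k tokens one after another
        probs = drafter(ctx)
        token = max(VOCAB, key=lambda t: probs[t])
        draft.append(token)
        ctx.append(token)

    accepted, ctx = [], list(context)
    # In a real system the main model scores all k draft positions in a single
    # batched pass over the shared key-value cache; this loop only mimics the
    # accept/reject logic position by position.
    for token in draft:
        probs = main_model(ctx)
        best = max(VOCAB, key=lambda t: probs[t])
        if token == best:    # main model agrees: keep the cheap draft token
            accepted.append(token)
            ctx.append(token)
        else:                # disagreement: take the main model's token and stop
            accepted.append(best)
            break
    return accepted

print(speculative_step([1, 2, 3]))

Whenever the drafter's guesses match what the heavy model would have produced anyway, several tokens are emitted for roughly the cost of one expensive pass, which is where the speed-up comes from.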
In addition, the E2B and E4B drafter models use a sparse decoding technique to narrow down the set of candidate tokens. Thanks to these techniques, tests on hardware such as the NVIDIA RTX PRO 6000 show waiting times roughly halved without any compromise in output quality.
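The article does not spell out how this sparse decoding step works. As a rough illustration only, the snippet below shows plain top-k filtering, one common way a drafter can restrict itself to a handful of candidate tokens instead of considering the whole vocabulary; the actual Gemma technique may differ.

def top_k_candidates(probs, k=8):
    """Keep only the k most likely token ids and renormalize their probabilities."""
    ranked = sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)[:k]
    total = sum(probs[t] for t in ranked)
    return {t: probs[t] / total for t in ranked}

# Toy distribution over a ten-token vocabulary, just to exercise the function.
toy_dist = [0.30, 0.20, 0.15, 0.10, 0.08, 0.06, 0.05, 0.03, 0.02, 0.01]
print(top_k_candidates(toy_dist, k=3))  # {0: 0.4615..., 1: 0.3077..., 2: 0.2308...}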
How do you think such speed increases on local hardware will change our artificial intelligence usage habits?