NVIDIA has ushered in a new era in AI inference, cutting DeepSeek V4 token costs fivefold through software optimizations on its Blackwell GPUs.
Thanks to extensive inference software optimizations developed for its Blackwell architecture GPUs, NVIDIA reduced the cost per token of the DeepSeek V4 AI model by 5x in just one month. With these software-focused improvements on platforms such as the GB200 and GB300, the company has set a new benchmark for cost per token, a critical metric in artificial intelligence. The gains significantly lower operating costs for companies running AI models and reinforce NVIDIA's leadership across the hardware and software ecosystem.
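At its simplest, cost per token is the hourly cost of the hardware divided by the number of tokens it produces in that hour. The sketch below assumes a fixed GPU-hour price and hypothetical throughput figures (none of them are NVIDIA's published numbers); it shows that, when hardware cost stays constant, a 5x drop in cost per token corresponds to a 5x increase in token throughput.

```python
def cost_per_million_tokens(gpu_hour_cost_usd: float, tokens_per_second: float) -> float:
    """Cost to generate one million tokens, assuming a flat hourly hardware price."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical figures for illustration only, not measured results.
GPU_HOUR_COST = 10.0   # assumed USD per GPU-hour
BASELINE_TPS = 1_000   # assumed tokens/s before software optimization

before = cost_per_million_tokens(GPU_HOUR_COST, BASELINE_TPS)
# Same hardware and price, but software optimizations yield 5x the throughput.
after = cost_per_million_tokens(GPU_HOUR_COST, BASELINE_TPS * 5)

print(f"before: ${before:.2f}/M tokens, after: ${after:.2f}/M tokens, "
      f"reduction: {before / after:.1f}x")
```

Because the hardware price is held constant, every gain in throughput translates directly into a proportionally lower cost per token; this is why purely software-side improvements can move the metric so sharply.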
Software Optimizations Drive Performance Gains
Behind NVIDIA's success lies a multi-layered software strategy designed to use hardware resources as efficiently as possible. By combining three main layers (production operations, application acceleration and infrastructure access), the company keeps the hardware running at full capacity.
This integration minimizes wasted resources, especially in deep learning workloads that demand intense processing power.
This system-level integration radically changes the total cost of ownership in AI projects.
The application acceleration layer gives developers a high-performance runtime environment and speeds up inference through optimizations such as kernel fusion and computation-communication overlap.
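The benefit of computation-communication overlap can be seen in a deliberately idealized timing model: if a step's data transfer runs only after its computation finishes, the two times add up, whereas perfect overlap hides the shorter phase behind the longer one. The per-step timings below are hypothetical, and real systems rarely achieve perfect overlap; this is a sketch of the principle, not of NVIDIA's implementation.

```python
def step_time_ms(compute_ms: float, comm_ms: float, overlap: bool) -> float:
    """Per-step latency under an idealized model.

    Without overlap, communication waits for computation (serial execution).
    With perfect overlap, the shorter phase is hidden behind the longer one.
    """
    if overlap:
        return max(compute_ms, comm_ms)
    return compute_ms + comm_ms


# Hypothetical per-step timings, chosen only for illustration.
compute, comm = 8.0, 5.0
serial = step_time_ms(compute, comm, overlap=False)      # 13.0 ms
overlapped = step_time_ms(compute, comm, overlap=True)   # 8.0 ms

print(f"serial: {serial} ms, overlapped: {overlapped} ms, "
      f"speedup: {serial / overlapped:.3f}x")
```

In this model the speedup is bounded by how balanced the two phases are: the closer communication time is to computation time, the more overlap can save.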
NVIDIA lets developers achieve efficient results directly through high-level software tools, without having to work with complex low-level device instruction sets.
Industry Pioneers Switch to Blackwell Architecture
Many technology companies have begun integrating these NVIDIA software updates into their own platforms. Baseten announced that it increased the number of tokens generated per second on the DeepSeek V4 Pro model by 50 percent using the TensorRT-LLM library. Platforms such as Cognition and Deep Infra, meanwhile, avoid building their own infrastructure from scratch by relying on NVIDIA's ready-made inference frameworks.
These software advantages directly improve how well AI models scale.
Players such as Together AI are achieving lower latency in real-time coding experiences by accelerating model optimization for tools such as Cursor. These collaborations show how NVIDIA is building a software ecosystem in the AI inference market that extends beyond hardware sales.
The Efficiency Race Is Accelerating
The additional advantages of technologies such as NVLink and NVFP4 show that the Blackwell architecture delivers not just higher speed but a leap in efficiency. Total throughput gains of up to 20x improve the commercial sustainability of large language models (LLMs). NVIDIA's strategic move suggests that cost is no longer a barrier for AI developers but has become an opportunity for innovation.
What do you make of NVIDIA's major cost-optimization breakthrough? How much do you think software improvements like these will accelerate the adoption of AI models? Share your opinions with us in the comments.