DeepSeek-R1-0528-Turbo
The DeepSeek R1 0528 Turbo model is a state-of-the-art reasoning model that generates responses very quickly.
The NVIDIA DeepSeek-R1-0528-FP4 model is the quantized version of DeepSeek AI's DeepSeek R1 0528 model, which is an auto-regressive language model that uses an optimized transformer architecture. For more information, please check the DeepSeek R1 0528 model card. The NVIDIA DeepSeek R1 FP4 model is quantized with TensorRT Model Optimizer.
This model is ready for commercial/non-commercial use.
This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case; see the Non-NVIDIA (DeepSeek R1) Model Card.
Architecture Type: Transformers
Network Architecture: DeepSeek R1
Input Type(s): Text
Input Format(s): String
Input Parameters: 1D (One Dimensional): Sequences
Other Properties Related to Input: DeepSeek recommends adhering to its published usage configurations when utilizing the DeepSeek-R1 series models, including for benchmarking, to achieve the expected performance; a sample request illustrating such settings follows this specification block.
Output Type(s): Text
Output Format: String
Output Parameters: 1D (One Dimensional): Sequences
Supported Runtime Engine(s): TensorRT-LLM
Supported Hardware Microarchitecture Compatibility: NVIDIA Blackwell
Preferred Operating System(s): Linux
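The input and output specifications above amount to plain text in, plain text out. The snippet below is a minimal sketch of such a request through an OpenAI-compatible Python client; the endpoint URL, model identifier, API-key variable, and sampling values (temperature 0.6, top-p 0.95) are illustrative assumptions rather than values stated on this card.

```python
# Minimal sketch: text-in, text-out chat request via an OpenAI-compatible client.
# The base_url, model name, API-key variable, and sampling values are assumptions.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed hosted endpoint
    api_key=os.environ["API_KEY"],                   # assumed environment variable
)

response = client.chat.completions.create(
    model="deepseek-ai/deepseek-r1-0528",            # assumed model identifier
    messages=[{"role": "user", "content": "Explain FP4 quantization in one paragraph."}],
    temperature=0.6,   # assumed reasoning-friendly sampling settings
    top_p=0.95,
    max_tokens=1024,
)

print(response.choices[0].message.content)  # a plain string, matching the 1D text output spec
```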
The model is quantized with nvidia-modelopt v0.31.0.
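For context on how such a checkpoint is typically produced, the sketch below outlines FP4 (NVFP4) post-training quantization with TensorRT Model Optimizer. It assumes the NVFP4 default quantization config and a toy calibration loop; the actual calibration data, recipe, and export step used for this model are not described on this card.

```python
# Minimal sketch of NVFP4 post-training quantization with TensorRT Model Optimizer.
# Config name, calibration texts, and source checkpoint are illustrative assumptions.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-0528"  # assumed source checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

calib_texts = ["Explain the Pythagorean theorem.", "Summarize the rules of chess."]  # toy calibration set

def forward_loop(m):
    # Run a few calibration batches so activation ranges can be collected.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        with torch.no_grad():
            m(**inputs)

# Quantize the linear layers' weights and activations to FP4 in place.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
# Exporting a TensorRT-LLM-ready checkpoint is a separate step (e.g. modelopt.torch.export).
```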
Training Dataset:
Data Collection Method by dataset: Hybrid: Human, Automated
Labeling Method by dataset: Hybrid: Human, Automated

Testing Dataset:
Data Collection Method by dataset: Hybrid: Human, Automated
Labeling Method by dataset: Hybrid: Human, Automated

Evaluation Dataset:
Data Collection Method by dataset: Hybrid: Human, Automated
Labeling Method by dataset: Hybrid: Human, Automated
Engine: TensorRT-LLM
Test Hardware: B200
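The entries above correspond to offline inference with TensorRT-LLM on Blackwell GPUs. A minimal sketch using the TensorRT-LLM LLM API follows; the checkpoint location, parallelism setting, and sampling values are placeholders, and a model of this size requires multiple GPUs in practice.

```python
# Minimal sketch of offline inference with the TensorRT-LLM LLM API.
# Checkpoint path, tensor parallelism, and sampling values are placeholders.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="nvidia/DeepSeek-R1-0528-FP4",  # assumed checkpoint location
    tensor_parallel_size=8,               # assumed multi-GPU setting for a model this large
)
sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)

outputs = llm.generate(["Why is the sky blue?"], sampling)
for out in outputs:
    print(out.outputs[0].text)
```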
This model was obtained by quantizing the weights and activations of DeepSeek R1 to FP4 data type, ready for inference with TensorRT-LLM. Only the weights and activations of the linear operators within transformer blocks are quantized. This optimization reduces the number of bits per parameter from 8 to 4, reducing the disk size and GPU memory requirements by approximately 1.6x.
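As a rough sanity check on the ~1.6x figure, the arithmetic below estimates checkpoint sizes from DeepSeek R1's published total of roughly 671B parameters. The reduction is below the ideal 2x because only the linear layers inside transformer blocks are quantized and FP4 scaling factors add overhead; the resulting file sizes are estimates derived from this calculation, not measurements.

```python
# Back-of-the-envelope estimate of the FP8 -> FP4 size reduction described above.
total_params = 671e9              # DeepSeek R1's published total parameter count
fp8_bytes = total_params * 1.0    # 8 bits (1 byte) per parameter
fp4_bytes = fp8_bytes / 1.6       # the card's reported ~1.6x reduction

print(f"FP8 checkpoint: ~{fp8_bytes / 1e9:.0f} GB")  # ~671 GB
print(f"FP4 checkpoint: ~{fp4_bytes / 1e9:.0f} GB")  # ~419 GB
```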