Boost inference performance for Mixtral and Llama 2 models with new Amazon SageMaker containers
AWS Machine Learning
APRIL 8, 2024
Be mindful that LLM token probabilities are generally overconfident without calibration. TensorRT-LLM requires models to be compiled into efficient engines before deployment. Before this API was introduced, the KV cache was recomputed for any newly added requests. For more details, refer to the GitHub repo.
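To see why recomputing the KV cache for newly added requests is costly, here is a minimal sketch (not the TensorRT-LLM implementation; `project_kv` is a hypothetical stand-in for the per-token key/value projection) contrasting decoding with and without a cache:

```python
def project_kv(token):
    # Hypothetical stand-in for computing a token's key/value pair.
    return (token * 2, token * 3)

def decode_without_cache(tokens):
    # Without a KV cache, keys/values for the entire prefix are
    # recomputed at every decoding step: quadratic total work.
    work = 0
    for step in range(1, len(tokens) + 1):
        kv = [project_kv(t) for t in tokens[:step]]
        work += len(kv)
    return work

def decode_with_cache(tokens):
    # With a KV cache, only the newest token's key/value pair is
    # computed each step; earlier entries are reused: linear work.
    cache, work = [], 0
    for t in tokens:
        cache.append(project_kv(t))
        work += 1
    return work

tokens = list(range(8))
print(decode_without_cache(tokens))  # 36 projections (1 + 2 + ... + 8)
print(decode_with_cache(tokens))     # 8 projections (one per token)
```

The same principle is what makes preserving the cache for in-flight requests worthwhile when new requests join a batch.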