Blogs
2026
- Performance improvements with speculative decoding in vLLM for gpt-oss
- Red Hat and NVIDIA: Setting standards for high-performance AI inference
- Red Hat AI tops MLPerf Inference v6.0 with vLLM on Qwen3-VL, Whisper, and GPT-OSS-120B
- Configure NVIDIA Blackwell GPUs for Red Hat AI workloads
- 5 steps to triage vLLM performance
2025
- How to deploy and benchmark vLLM with GuideLLM on Kubernetes
- Autoscaling vLLM with OpenShift AI model serving: Performance validation
- Efficient and reproducible LLM inference with Red Hat: MLPerf Inference v5.1 results
- vLLM or llama.cpp: Choosing the right LLM inference engine for your use case
- How to set up KServe autoscaling for vLLM with KEDA
- Benchmarking with GuideLLM in air-gapped OpenShift clusters
- Ollama vs. vLLM: A deep dive into performance benchmarking
- MLPerf Inference v5.0 results with Supermicro's GH200 Grace Hopper Superchip-based Server and Red Hat OpenShift
- How to run performance and scale validation for OpenShift AI
- Unlocking the Effective Context Length: Benchmarking the Granite-3.1-8b Model
2024
- Achieve better large language model inference with fewer GPUs
- Accelerating generative AI adoption: Red Hat OpenShift AI achieves impressive results in MLPerf inference benchmarks with vLLM runtime
- Generative AI fine-tuning of LLMs: Red Hat and Supermicro showcase outstanding results for efficient Llama-2-70b fine tuning using LoRA in MLPerf Training v4.0
- Sharing is caring: How to make the most of your GPUs (part 2 — Multi-instance GPU)
- Sharing is caring: How to make the most of your GPUs (part 1 — time-slicing)
- Continuous performance and scale validation of Red Hat OpenShift AI model-serving stack
- Evaluating LLM inference performance on Red Hat OpenShift AI
Conference Talks
A Cross-Industry Benchmarking Tutorial for Distributed LLM Inference on Kubernetes
Samuel Monson (Red Hat), Ganesh Kudleppanavar (NVIDIA), Jason Kramberger (Google), Jing Chen (IBM Research)
Routing Stateful AI Workloads in Kubernetes
Maroon Ayoub (IBM Research), Michey Mehta (Red Hat)
Learn How to Run an LLM Inference Performance Benchmark on NVIDIA GPUs
Samuel Monson (Red Hat), Ashish Kamra (Red Hat)
Multi-Node Finetuning LLMs on Kubernetes: A Practitioner's Guide
Ashish Kamra (Red Hat), Boaz Ben Shabat (Red Hat)
Optimizing Load Balancing and Autoscaling for Large Language Model (LLM) Inference on Kubernetes
David Gray (Red Hat)
Efficiently Deploying and Benchmarking LLMs in Kubernetes
Nikhil Palaskar (Red Hat)
Publications
llm-tuna: Hyperparameter Optimization for LLM Inference
An open-source framework that automates vLLM inference hyperparameter optimization using Bayesian search via Optuna, achieving up to 32.9% throughput improvement on mixture-of-experts models.
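For context, a minimal sketch of what Optuna-driven Bayesian search over serving parameters can look like. The two knobs shown (max_num_seqs, gpu_memory_utilization) are real vLLM flags used here only as examples, and benchmark_throughput is a toy stand-in for an actual load test; this is not llm-tuna's implementation.

```python
# Illustrative sketch only, not llm-tuna's implementation: Bayesian search with
# Optuna (default TPE sampler) over two example vLLM serving knobs.
import optuna


def benchmark_throughput(max_num_seqs: int, gpu_memory_utilization: float) -> float:
    # Stand-in for a real measurement (relaunch vLLM with these settings and run
    # a load test, e.g. with GuideLLM). A toy analytic curve keeps the sketch runnable.
    return max_num_seqs * gpu_memory_utilization / (1.0 + max_num_seqs / 512.0)


def objective(trial: optuna.Trial) -> float:
    max_num_seqs = trial.suggest_int("max_num_seqs", 64, 1024, step=64)
    gpu_mem = trial.suggest_float("gpu_memory_utilization", 0.80, 0.95)
    return benchmark_throughput(max_num_seqs, gpu_mem)


study = optuna.create_study(direction="maximize")  # maximize measured tokens/sec
study.optimize(objective, n_trials=25)
print(study.best_params, study.best_value)
```

In practice the objective function would redeploy the vLLM server with each trial's settings and return the measured throughput, which is where the reported gains on mixture-of-experts models come from.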
Other Upstream Projects Maintained
vllm-project/GuideLLM
SLO-aware benchmarking and evaluation platform for LLM deployments that simulates production workloads against OpenAI-compatible and vLLM-native servers (a rough illustration of the kind of load it generates is sketched below).
Maintainer
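To illustrate the kind of workload GuideLLM simulates, here is a hand-rolled sketch: fixed-rate chat requests against a local OpenAI-compatible vLLM endpoint with per-request latency tracking. The endpoint URL and model name are placeholders, and GuideLLM itself handles rate sweeps, SLO tracking, and reporting rather than this naive loop.

```python
# Illustrative only: a crude version of the workload GuideLLM automates --
# constant-rate requests against an OpenAI-compatible endpoint with latency tracking.
# The base_url and model name are assumptions; point them at your own deployment.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

latencies = []
for _ in range(20):
    start = time.perf_counter()
    client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
        max_tokens=128,
    )
    latencies.append(time.perf_counter() - start)
    time.sleep(0.5)  # crude constant-rate pacing; GuideLLM sweeps request rates instead

# Rough p50 latency; GuideLLM reports full latency/throughput distributions against SLOs.
print(f"p50 latency: {sorted(latencies)[len(latencies) // 2]:.2f}s")
```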