Blogs
2026
- Performance improvements with speculative decoding in vLLM for gpt-oss
- Red Hat and NVIDIA: Setting standards for high-performance AI inference
- Red Hat AI tops MLPerf Inference v6.0 with vLLM on Qwen3-VL, Whisper, and GPT-OSS-120B
- Configure NVIDIA Blackwell GPUs for Red Hat AI workloads
- 5 steps to triage vLLM performance
2025
- How to deploy and benchmark vLLM with GuideLLM on Kubernetes
- Autoscaling vLLM with OpenShift AI model serving: Performance validation
- Efficient and reproducible LLM inference with Red Hat: MLPerf Inference v5.1 results
- vLLM or llama.cpp: Choosing the right LLM inference engine for your use case
- How to set up KServe autoscaling for vLLM with KEDA
- Benchmarking with GuideLLM in air-gapped OpenShift clusters
- Ollama vs. vLLM: A deep dive into performance benchmarking
- MLPerf Inference v5.0 results with Supermicro's GH200 Grace Hopper Superchip-based Server and Red Hat OpenShift
- How to run performance and scale validation for OpenShift AI
- Unlocking the Effective Context Length: Benchmarking the Granite-3.1-8b Model
2024
- Achieve better large language model inference with fewer GPUs
- Accelerating generative AI adoption: Red Hat OpenShift AI achieves impressive results in MLPerf inference benchmarks with vLLM runtime
- Generative AI fine-tuning of LLMs: Red Hat and Supermicro showcase outstanding results for efficient Llama-2-70b fine tuning using LoRA in MLPerf Training v4.0
- Sharing is caring: How to make the most of your GPUs (part 2 — Multi-instance GPU)
- Sharing is caring: How to make the most of your GPUs (part 1 — time-slicing)
- Continuous performance and scale validation of Red Hat OpenShift AI model-serving stack
- Evaluating LLM inference performance on Red Hat OpenShift AI
Conference Talks
A Cross-Industry Benchmarking Tutorial for Distributed LLM Inference on Kubernetes
Samuel Monson (Red Hat), Ganesh Kudleppanavar (NVIDIA), Jason Kramberger (Google), Jing Chen (IBM Research)
Routing Stateful AI Workloads in Kubernetes
Maroon Ayoub (IBM Research), Michey Mehta (Red Hat)
Learn How to Run an LLM Inference Performance Benchmark on NVIDIA GPUs
Samuel Monson (Red Hat), Ashish Kamra (Red Hat)
Multi-Node Finetuning LLMs on Kubernetes: A Practitioner's Guide
Ashish Kamra (Red Hat), Boaz Ben Shabat (Red Hat)
Optimizing Load Balancing and Autoscaling for Large Language Model (LLM) Inference on Kubernetes
David Gray (Red Hat)
Efficiently Deploying and Benchmarking LLMs in Kubernetes
Nikhil Palaskar (Red Hat)
Publications
llm-tuna: Hyperparameter Optimization for LLM Inference
An open-source framework that automates vLLM inference hyperparameter optimization using Bayesian search via Optuna, achieving up to 32.9% throughput improvement on mixture-of-experts models.
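For context, a minimal sketch of what Optuna-driven Bayesian search over serving parameters can look like. The two knobs shown (max_num_seqs, gpu_memory_utilization) are real vLLM flags used here only as examples, and benchmark_throughput is a toy stand-in for an actual load test; this is not llm-tuna's implementation.

```python
# Illustrative sketch only, not llm-tuna's implementation: Bayesian search with
# Optuna (default TPE sampler) over two example vLLM serving knobs.
import optuna


def benchmark_throughput(max_num_seqs: int, gpu_memory_utilization: float) -> float:
    # Stand-in for a real measurement (relaunch vLLM with these settings and run
    # a load test, e.g. with GuideLLM). A toy analytic curve keeps the sketch runnable.
    return max_num_seqs * gpu_memory_utilization / (1.0 + max_num_seqs / 512.0)


def objective(trial: optuna.Trial) -> float:
    max_num_seqs = trial.suggest_int("max_num_seqs", 64, 1024, step=64)
    gpu_mem = trial.suggest_float("gpu_memory_utilization", 0.80, 0.95)
    return benchmark_throughput(max_num_seqs, gpu_mem)


study = optuna.create_study(direction="maximize")  # maximize measured tokens/sec
study.optimize(objective, n_trials=25)
print(study.best_params, study.best_value)
```

In practice the objective function would redeploy the vLLM server with each trial's settings and return the measured throughput, which is where the reported gains on mixture-of-experts models come from.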
Other Upstream Projects Maintained
vllm-project/GuideLLM
SLO-aware benchmarking and evaluation platform for LLM deployments that simulates production workloads against OpenAI-compatible and vLLM-native servers (a rough illustration of the kind of load it generates is sketched below).
Maintainer
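To illustrate the kind of workload GuideLLM simulates, here is a hand-rolled sketch: fixed-rate chat requests against a local OpenAI-compatible vLLM endpoint with per-request latency tracking. The endpoint URL and model name are placeholders, and GuideLLM itself handles rate sweeps, SLO tracking, and reporting rather than this naive loop.

```python
# Illustrative only: a crude version of the workload GuideLLM automates --
# constant-rate requests against an OpenAI-compatible endpoint with latency tracking.
# The base_url and model name are assumptions; point them at your own deployment.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

latencies = []
for _ in range(20):
    start = time.perf_counter()
    client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
        max_tokens=128,
    )
    latencies.append(time.perf_counter() - start)
    time.sleep(0.5)  # crude constant-rate pacing; GuideLLM sweeps request rates instead

# Rough p50 latency; GuideLLM reports full latency/throughput distributions against SLOs.
print(f"p50 latency: {sorted(latencies)[len(latencies) // 2]:.2f}s")
```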