Conference Talks

A Cross-Industry Benchmarking Tutorial for Distributed LLM Inference on Kubernetes

Samuel Monson (Red Hat), Ganesh Kudleppanavar (NVIDIA), Jason Kramberger (Google), Jing Chen (IBM Research)

KubeCon + CloudNativeCon Europe 2026, March 24, 2026

Routing Stateful AI Workloads in Kubernetes

Maroon Ayoub (IBM Research), Michey Mehta (Red Hat)

KubeCon + CloudNativeCon North America 2025, November 11, 2025

Learn How to Run an LLM Inference Performance Benchmark on NVIDIA GPUs

Samuel Monson (Red Hat), Ashish Kamra (Red Hat)

DevConf.US 2025, September 20, 2025

Multi-Node Finetuning LLMs on Kubernetes: A Practitioner's Guide

Ashish Kamra (Red Hat), Boaz Ben Shabat (Red Hat)

KubeCon + CloudNativeCon India 2024, December 11, 2024

Optimizing Load Balancing and Autoscaling for Large Language Model (LLM) Inference on Kubernetes

David Gray (Red Hat)

KubeCon + CloudNativeCon North America 2024, November 13, 2024

Efficiently Deploying and Benchmarking LLMs in Kubernetes

Nikhil Palaskar (Red Hat)

DevConf.US 2024, August 14, 2024

Publications

llm-tuna: Hyperparameter Optimization for LLM Inference

An open-source framework that automates vLLM inference hyperparameter optimization using Bayesian search via Optuna, achieving up to 32.9% throughput improvement on mixture-of-experts models.

Thameem Abbas Ibrahim Bathusha, Aanya Sharma, Andy Huynh, R.C. Samaratunga, Ashish Kamra

ACM Web Conference 2026 (WWW '26)
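
For flavor, here is a minimal sketch of the kind of search llm-tuna automates: Optuna's default TPE sampler (a Bayesian-style optimizer) proposing vLLM engine arguments and maximizing measured throughput. The search space and the measure_throughput stub are illustrative assumptions, not llm-tuna's actual code.

```python
"""Sketch of Bayesian hyperparameter search for LLM inference throughput,
in the spirit of llm-tuna. The engine-argument ranges and the benchmark
stub below are assumptions for illustration only."""
import optuna


def measure_throughput(max_num_seqs: int, max_num_batched_tokens: int,
                       enable_chunked_prefill: bool) -> float:
    """Hypothetical benchmark hook: in a real run this would launch a vLLM
    server with these settings, replay a fixed workload, and return
    tokens/second. Stubbed with a toy analytic surface so the sketch runs."""
    penalty = (abs(max_num_seqs - 256) / 256
               + abs(max_num_batched_tokens - 8192) / 8192)
    base = 1500.0 + (200.0 if enable_chunked_prefill else 0.0)
    return base / (1.0 + penalty)


def objective(trial: optuna.Trial) -> float:
    # Candidate vLLM engine arguments (illustrative ranges).
    max_num_seqs = trial.suggest_int("max_num_seqs", 64, 512, step=64)
    max_num_batched_tokens = trial.suggest_int(
        "max_num_batched_tokens", 2048, 16384, step=2048)
    chunked_prefill = trial.suggest_categorical(
        "enable_chunked_prefill", [True, False])
    return measure_throughput(max_num_seqs, max_num_batched_tokens,
                              chunked_prefill)


if __name__ == "__main__":
    # Optuna's default TPE sampler drives the Bayesian search.
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=25)
    print("best throughput:", study.best_value, "with", study.best_params)
```

Running the script prints the best engine-argument combination found after 25 trials; in llm-tuna's setting each trial would instead benchmark a live vLLM deployment.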

Other Upstream Projects Maintained

GuideLLM (vllm-project)

An SLO-aware benchmarking and evaluation platform for LLM deployments that simulates production workloads against OpenAI-compatible and vLLM-native servers.

Maintainer
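
As a rough illustration of what an SLO-aware benchmark measures, the sketch below drives fixed-concurrency traffic at an OpenAI-compatible endpoint and checks a p95 latency SLO. The endpoint URL, model name, and threshold are placeholder assumptions, and this is plain Python against the HTTP API, not GuideLLM's actual CLI or internals.

```python
"""Toy SLO-aware load test against an OpenAI-compatible server.
All constants below are illustrative assumptions."""
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "http://localhost:8000/v1"        # assumed server address
MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # assumed model id
P95_SLO_SECONDS = 2.0                        # assumed latency SLO


def one_request(_: int) -> float:
    """Send one chat completion and return its end-to-end latency."""
    start = time.perf_counter()
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user",
                          "content": "Summarize Kubernetes in one sentence."}],
            "max_tokens": 64,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Fixed-concurrency load: 8 workers, 64 requests total.
    with ThreadPoolExecutor(max_workers=8) as pool:
        latencies = sorted(pool.map(one_request, range(64)))
    p95 = latencies[int(0.95 * len(latencies)) - 1]
    verdict = "met" if p95 <= P95_SLO_SECONDS else "violated"
    print(f"mean={statistics.mean(latencies):.2f}s p95={p95:.2f}s "
          f"SLO {verdict} ({P95_SLO_SECONDS}s)")
```

GuideLLM itself layers richer load patterns and token-level metrics on top of this basic request loop; the sketch only shows the core measure-and-compare-to-SLO idea.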