Speakers
Christopher Nuland, Technical Marketing Manager, Red Hat
This talk shows how we turn a suboptimal reinforcement learning exploration agent into a route-seeking speedrunner by pairing PPO with a panel of model judges orchestrated by llm-d and served with vLLM. The agent plays normally while a narrow planner handles dialog and puzzle moments. llm-d then routes short gameplay clips to a vision judge, a state checker, and a rule judge, whose verdicts become preferences that train a small reward model. The next PPO burst learns with shaped rewards, improving coverage, dialog completion reliability, and key speedrun metrics. We keep it practical for game developers with a YAML-first deployment on OpenShift AI using KServe and vLLM, plus a reusable kit of prompts, rubrics, and metrics. We will showcase Double Dragon and Zelda: Oracle of Seasons on the original Nintendo Game Boy, but the same pattern applies to any game, whether for automated testing or for discovering optimal speedrunning routes.
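To make the judge-to-reward loop concrete, here is a minimal sketch in plain Python. It is not the talk's implementation. The abstract does not say how the three judges' verdicts are combined or what form the reward model takes, so the majority vote, the linear Bradley-Terry reward model, the feature names, and the `beta` shaping weight are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical clip features: [new_tiles_visited, dialog_completed, time_penalty].
# Real features are not described in the abstract.

def panel_preference(votes):
    """Majority vote over judge verdicts ("a" or "b"); returns None on a tie."""
    counts = Counter(votes.values())
    a, b = counts.get("a", 0), counts.get("b", 0)
    if a == b:
        return None
    return "a" if a > b else "b"

def reward(w, features):
    """Linear reward model: r(x) = w . x (an assumed, deliberately small model)."""
    return sum(wj * fj for wj, fj in zip(w, features))

def train_reward_model(pairs, dim, lr=0.5, epochs=200):
    """Fit w on (preferred, rejected) pairs with the Bradley-Terry loss
    -log sigmoid(r(preferred) - r(rejected)), using plain gradient descent."""
    w = [0.0] * dim
    for _ in range(epochs):
        for win, lose in pairs:
            diff = [a - b for a, b in zip(win, lose)]
            margin = reward(w, diff)
            # The loss gradient w.r.t. w is -(1 - sigmoid(margin)) * diff,
            # so the descent step adds (1 - sigmoid(margin)) * diff.
            g = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            w = [wj + lr * g * dj for wj, dj in zip(w, diff)]
    return w

def shaped_reward(env_reward, w, features, beta=0.1):
    """Reward seen by PPO: game reward plus a scaled learned bonus (beta is assumed)."""
    return env_reward + beta * reward(w, features)

# Two hypothetical clips and the verdicts of the three judges.
clip_a = [0.8, 1.0, 0.3]
clip_b = [0.2, 0.0, 0.9]
votes = {"vision": "a", "state": "a", "rule": "b"}

winner = panel_preference(votes)
pairs = [(clip_a, clip_b)] if winner == "a" else [(clip_b, clip_a)]
weights = train_reward_model(pairs, dim=3)
```

In a fuller pipeline, each PPO burst would call something like `shaped_reward` per step or per clip, and the preference pairs would accumulate across many clips rather than the single pair used here.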