Novelty
Tests whether models can treat uncertainty as an action space (answer, abstain, or escalate) instead of collapsing into generic safe answers.
Demo
Compare model behavior across partial-observability cases and inspect when a model should answer, abstain, or escalate.
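One way to picture this comparison is the minimal sketch below. It assumes each case carries a hypothetical gold action (`expected`) and the action the model actually took (`chosen`). The names `Action`, `Case`, `summarize`, and the three demo rows are illustrative assumptions, not the project's actual schema or results.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """The three actions named above."""
    ANSWER = "answer"
    ABSTAIN = "abstain"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Case:
    case_id: str       # hypothetical identifier for a partial-observability case
    expected: Action   # assumed gold label: what the model should have done
    chosen: Action     # what the model actually did


def summarize(cases):
    """Return the agreement rate and a count of (expected, chosen) pairs."""
    confusion = {}
    for case in cases:
        key = (case.expected, case.chosen)
        confusion[key] = confusion.get(key, 0) + 1
    matches = sum(n for (exp, got), n in confusion.items() if exp == got)
    rate = matches / len(cases) if cases else 0.0
    return rate, confusion


# Illustrative rows only; these are not benchmark results.
demo_cases = [
    Case("full-context", Action.ANSWER, Action.ANSWER),
    Case("missing-key-fact", Action.ABSTAIN, Action.ANSWER),
    Case("high-stakes-ambiguity", Action.ESCALATE, Action.ESCALATE),
]

rate, confusion = summarize(demo_cases)
```

The `(expected, chosen)` counts separate the failure modes. An `(ABSTAIN, ANSWER)` entry marks a case where the model answered despite missing information, while an `(ANSWER, ABSTAIN)` entry would mark the generic safe answer described under Novelty.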
Next Upgrade
Replace these demo rows with live benchmark runs and a public leaderboard.