Live project surface

Holup Benchmark Lab

A metacognitive reasoning benchmark for commit, abstain, and escalate decisions.

Novelty

Tests whether models can treat uncertainty as an action space instead of collapsing into generic safe answers.

Demo

Compare model behavior across partial-observability cases and inspect when a model should commit to an answer, abstain, or escalate.

Next Upgrade

Replace these demo rows with live benchmark runs and a public leaderboard.

Decision Lab

Scenario: evidence is partial, contradiction is moderate, downstream risk is high.
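
One way to read a scenario like this is as three graded signals mapped onto the three actions. The sketch below is illustrative only: the `Scenario` fields, the 0-to-1 scales, the threshold values, and the `reference_action` rule are all assumptions made for this example, not the benchmark's actual scoring logic or answer key.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    COMMIT = "commit"
    ABSTAIN = "abstain"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Scenario:
    # Hypothetical 0-to-1 scales; the real benchmark may encode these differently.
    evidence: float       # 0.0 = no evidence, 1.0 = complete evidence
    contradiction: float  # 0.0 = no conflict, 1.0 = severe conflict
    risk: float           # 0.0 = negligible downstream risk, 1.0 = critical


def reference_action(
    s: Scenario,
    evidence_min: float = 0.8,       # assumed threshold
    contradiction_max: float = 0.3,  # assumed threshold
    risk_high: float = 0.7,          # assumed threshold
) -> Action:
    """Toy decision rule: commit only when well supported, else escalate if risky."""
    well_supported = (
        s.evidence >= evidence_min and s.contradiction <= contradiction_max
    )
    if well_supported:
        return Action.COMMIT
    if s.risk >= risk_high:
        # Not enough support and a mistake would be costly: hand off.
        return Action.ESCALATE
    # Not enough support, but low stakes: decline to answer.
    return Action.ABSTAIN


# The scenario above, with "partial", "moderate", and "high" mapped to
# assumed numeric values.
lab_case = Scenario(evidence=0.5, contradiction=0.5, risk=0.9)
print(reference_action(lab_case).value)
```

Under these assumed thresholds the rule returns `escalate` for the scenario above, because the evidence is too thin to commit and the risk is too high to simply abstain. Lowering `risk` in the same case would flip the result to `abstain`, which is the kind of boundary a comparison across cases is meant to expose.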