Risk-Scored Human-in-the-Loop Gates
Score each agent action on reversibility, cost, and scope. Low-risk auto-executes; high-risk pauses with a context summary. Turn the big trust question into many small ones.
The bad pattern in agent workflows is binary: either every action needs human approval (which makes the agent useless) or none do (which makes the agent dangerous). The fix is to score each action on three dimensions — reversibility, cost, and scope — and route by score.
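A minimal sketch of score-and-route, assuming each dimension is normalized to the range 0 to 1 and combined as a weighted sum. The names `ActionRisk`, `WEIGHTS`, `PAUSE_THRESHOLD`, and `route` are hypothetical, as are the numbers; reversibility is stored inverted, as irreversibility, so that a higher value always means more risk.

```python
from dataclasses import dataclass

# Hypothetical weights and threshold; tune them for your own workflow.
WEIGHTS = {"irreversibility": 0.5, "cost": 0.3, "scope": 0.2}
PAUSE_THRESHOLD = 0.5


@dataclass(frozen=True)
class ActionRisk:
    """Each dimension is normalized to [0, 1], where 1 is riskiest."""
    irreversibility: float  # 0 = trivially undone, 1 = cannot be undone
    cost: float             # 0 = free, 1 = at or above the spend ceiling
    scope: float            # 0 = one record, 1 = everything the agent can touch

    def score(self) -> float:
        """Weighted sum of the three dimensions."""
        return (WEIGHTS["irreversibility"] * self.irreversibility
                + WEIGHTS["cost"] * self.cost
                + WEIGHTS["scope"] * self.scope)


def route(risk: ActionRisk) -> str:
    """Low-risk actions auto-execute; everything else pauses for a human."""
    if risk.score() >= PAUSE_THRESHOLD:
        return "pause_for_approval"
    return "auto_execute"


# Illustrative actions with made-up scores.
draft_reply = ActionRisk(irreversibility=0.1, cost=0.0, scope=0.1)
drop_table = ActionRisk(irreversibility=1.0, cost=0.4, scope=0.9)

print(route(draft_reply))  # auto_execute
print(route(drop_table))   # pause_for_approval
```

A weighted sum is only one way to combine the dimensions. Taking the maximum of the three would be more conservative, since a single extreme dimension would then be enough to trigger a pause.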
Low-risk actions auto-execute. High-risk actions pause and send a one-paragraph context summary to whoever owns the approval, with approve/deny buttons. The risk scorer itself is a small classifier that improves as those approve/deny decisions are fed back to it.
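The pause step and the feedback loop could be sketched as follows. Everything here is an assumption for illustration: `ApprovalRequest`, `build_request`, and `RiskScorer` are hypothetical names, the scorer is a tiny logistic model trained one decision at a time, and a denial is treated as the label "risky". A real deployment would also need labels for actions that auto-executed, which this sketch does not collect.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ApprovalRequest:
    action: str
    summary: str
    options: tuple = ("approve", "deny")  # rendered as buttons by the UI


def build_request(action: str, reason: str, impact: str, owner: str) -> ApprovalRequest:
    """Build the one-paragraph context summary sent to the approver."""
    summary = (f"The agent wants to {action}. Reason: {reason}. "
               f"Expected impact: {impact}. Approver: {owner}.")
    return ApprovalRequest(action=action, summary=summary)


@dataclass
class RiskScorer:
    """Tiny logistic model over (irreversibility, cost, scope) features."""
    weights: list = field(default_factory=lambda: [0.0, 0.0, 0.0])
    bias: float = 0.0
    learning_rate: float = 0.5

    def predict(self, features) -> float:
        """Estimated probability that a human would deny this action."""
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, denied: bool) -> None:
        """One gradient step on log loss; denial is the positive label."""
        error = self.predict(features) - (1.0 if denied else 0.0)
        self.weights = [w - self.learning_rate * error * x
                        for w, x in zip(self.weights, features)]
        self.bias -= self.learning_rate * error


request = build_request(
    action="delete 1,200 stale customer records",
    reason="retention policy cleanup",
    impact="irreversible without a backup restore",
    owner="data-platform on-call",
)
scorer = RiskScorer()
features = [1.0, 0.2, 0.8]             # made-up feature values
before = scorer.predict(features)
scorer.update(features, denied=True)   # the approver clicked "deny"
after = scorer.predict(features)
```

After the denial, `after` is higher than `before`, so similar actions become more likely to pause the next time they are scored.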
The architecture turns the question "should we trust the agent?" into a series of much smaller questions: for which classes of action, at what cost ceiling, in which contexts? Those are answerable. The big question isn't.
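Those smaller questions map naturally onto an explicit policy table. The sketch below assumes one rule per action class, each with a cost ceiling and a set of contexts; `AutoApprovalRule`, `POLICY`, `may_auto_execute`, and the example entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoApprovalRule:
    action_class: str     # which class of action
    cost_ceiling: float   # maximum cost that may auto-execute
    contexts: frozenset   # contexts in which the rule applies


# Hypothetical policy; each entry answers one small trust question.
POLICY = [
    AutoApprovalRule("send_email", cost_ceiling=0.0,
                     contexts=frozenset({"internal"})),
    AutoApprovalRule("issue_refund", cost_ceiling=50.0,
                     contexts=frozenset({"support"})),
]


def may_auto_execute(action_class: str, cost: float, context: str) -> bool:
    """True only if some rule covers this class, cost, and context."""
    return any(
        rule.action_class == action_class
        and cost <= rule.cost_ceiling
        and context in rule.contexts
        for rule in POLICY
    )


print(may_auto_execute("issue_refund", 20.0, "support"))    # True
print(may_auto_execute("issue_refund", 200.0, "support"))   # False
print(may_auto_execute("delete_account", 0.0, "support"))   # False
```

Anything not covered by a rule falls through to human approval, so the default is to pause and trust is extended one rule at a time.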