UX LEAD CASE STUDY · DECISION SYSTEMS · INCIDENT OPS
Decision authority redesign for fintech on-call. Triage time roughly halved. Engineering and legal landed on the same model.
Designing Decision Authority Under Pressure
How I redesigned who gets to decide — and when — inside high-stakes SaaS incident operations.
DOMAIN
Fintech
ROLE
UX Lead
YEAR
2025
STATUS
Shipped
02:47 AM — The Decision Nobody Wants to Make
It's 02:47 AM. A "Latency Degraded" alert fires for the payment gateway. The on-call engineer is alone. Customer support is silent — no tickets yet. But the dashboard shows a 12% drift in response times, just below the auto-paging threshold.
The engineer has to choose:
- Escalate: prevent possible data loss, but risk a false alarm.
- Stay quiet: protect the team from fatigue, but risk a real outage.
The system is asking a human to resolve an ambiguous signal without enough data.

TL;DR
- Problem: Distributed truth + unclear authority = decision paralysis at 3 AM.
- Solution: A triage console that hides raw logs + tier-based escalation authority.
- Key insight: Affordance ≠ permission. Hiding data can be care, not paranoia.
Process
TIMELINE — Multi-month engagement — design, then rollout
STAKEHOLDERS
- Executive sponsor
- Engineering leadership (initial pushback, later advocate)
- Daily users from the affected team
- On-call rotation lead (validation partner)
- Downstream operations stakeholder
- Compliance / regulatory advisor
DECISION POINTS — 3 CRITICAL
- Suppress logs · triage focus — contested early, validated by data
- Authority tier based on context, not seniority — the org chart didn't match the risk surface
- Merge dispute and postmortem flow — auto-trigger on dispute
The Real Problem
Incident management gets reduced to ChatOps — a tooling problem. But tools can't fix the actual conflict: distributed truth. When metrics, customer reports, and engineering logs disagree, the system asks humans to put reality together under pressure. That's an impossible cognitive load.
The on-call engineer's real job at 3 AM isn't to solve the incident. It's to decide what kind of incident this is — and who has the authority to act on that decision.
Decisions
ADOPTED
- Hidden raw logs during triage (revealed in investigation)
- Context-based authority tier (risk × customer-facing × data integrity)
- Merged dispute / postmortem with auto-attached stakeholders
- Two binary decision questions surface only in the triage phase
- Authority tier mirrored as Postgres row-level policy — same model in UI and backend
REJECTED
- Better log filtering — incremental, doesn't fix the root cause
- Seniority-based escalation list — misses context risk
- Separate dispute and postmortem flows — duplication
- Auto-paging with a lower threshold — fatigue spiral
Visual Journey
Three high-fi screens, scroll-driven. Each step = next state of the system.
Alert dashboard
The first surface — alert priority, customer impact, decision tier hint. On desktop the engineer scans the row; on mobile the active incident takes the screen and others collapse.
Desktop · 1280+

Mobile · 375

Desktop uses a 3-column meta strip per alert; mobile pivots to one dominant card + collapsed siblings. The on-call engineer's pocket view is decision-first, not list-first — one tap from notification to triage.
Triage console — logs hidden
Raw logs collapsed during triage. On desktop both binary questions are visible at once; on mobile they surface one at a time — the constraint of the screen is also the constraint of the cognitive step.
Desktop · 1280+

Mobile · 375

Desktop shows the two-question layout side by side (faster for the experienced operator). Mobile splits them into a 1-of-2 wizard with progress dots — slower per question, more focused. Both commit identically; only the rhythm differs.
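The triage behavior described above can be sketched as a small state machine. This is a minimal, dependency-free TypeScript sketch rather than the shipped XState chart: the phase names, event names, and the two question keys (`customerImpactLikely`, `dataIntegrityAtRisk`) are assumptions made for illustration, since the case study does not publish the actual questions.

```typescript
// Hypothetical phases; the real state chart is not reproduced in this case study.
type Phase = "alert" | "triage" | "investigation";

// Hypothetical wording for the two binary triage questions.
interface TriageAnswers {
  customerImpactLikely?: boolean;
  dataIntegrityAtRisk?: boolean;
}

interface IncidentState {
  phase: Phase;
  answers: TriageAnswers;
}

type IncidentEvent =
  | { type: "OPEN_TRIAGE" }
  | { type: "ANSWER"; question: keyof TriageAnswers; value: boolean }
  | { type: "COMMIT" };

// Pure transition function: returns the next state and never mutates the input.
function transition(state: IncidentState, event: IncidentEvent): IncidentState {
  switch (event.type) {
    case "OPEN_TRIAGE":
      return state.phase === "alert" ? { ...state, phase: "triage" } : state;
    case "ANSWER": {
      // Questions can only be answered while in triage.
      if (state.phase !== "triage") return state;
      const answers: TriageAnswers = { ...state.answers };
      answers[event.question] = event.value;
      return { ...state, answers };
    }
    case "COMMIT": {
      // Committing requires both questions to be answered.
      const { customerImpactLikely, dataIntegrityAtRisk } = state.answers;
      const complete =
        customerImpactLikely !== undefined && dataIntegrityAtRisk !== undefined;
      return state.phase === "triage" && complete
        ? { ...state, phase: "investigation" }
        : state;
    }
  }
}

// Raw logs stay collapsed until the incident reaches investigation.
function rawLogsVisible(state: IncidentState): boolean {
  return state.phase === "investigation";
}
```

Desktop and mobile would both dispatch the same `ANSWER` and `COMMIT` events. Only the order and pacing of the inputs differ, which is what lets the two layouts commit identically.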
Decision authority tier
Why this matters: when there's no clear authority at 3 AM, escalation becomes political. The matrix makes the call non-negotiable: it is risk-based, not seniority-based, and it is enforced in the backend through Postgres row-level security policies.
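One way to picture such a matrix is a plain lookup table keyed by tier and action. The sketch below is illustrative only: the three tiers, the action names, the role names, and every cell of the table are hypothetical, since the case study does not publish the real matrix.

```typescript
// All tiers, actions, and roles below are hypothetical placeholders.
type Tier = 1 | 2 | 3;
type Action = "escalate" | "rollback" | "freeze_payments";
type Role = "on_call_engineer" | "incident_commander" | "compliance_advisor";

// Who may take which action at which tier. Seniority never appears as an input.
const AUTHORITY_MATRIX: Record<Tier, Record<Action, readonly Role[]>> = {
  1: {
    escalate: ["on_call_engineer", "incident_commander"],
    rollback: ["on_call_engineer", "incident_commander"],
    freeze_payments: ["incident_commander"],
  },
  2: {
    escalate: ["on_call_engineer", "incident_commander"],
    rollback: ["on_call_engineer", "incident_commander"],
    freeze_payments: ["incident_commander", "compliance_advisor"],
  },
  3: {
    escalate: ["on_call_engineer", "incident_commander"],
    rollback: ["incident_commander"],
    freeze_payments: ["incident_commander", "compliance_advisor"],
  },
};

interface AuthorityDecision {
  allowed: boolean;
  tier: Tier;
  action: Action;
  role: Role;
  reason: string; // human-readable line suitable for an audit trail
}

// Resolve a request against the matrix and return a record that can be logged.
function authorize(tier: Tier, action: Action, role: Role): AuthorityDecision {
  const permitted = AUTHORITY_MATRIX[tier][action];
  const allowed = permitted.some((r) => r === role);
  const verb = allowed ? "may" : "may not";
  return { allowed, tier, action, role, reason: `${role} ${verb} ${action} at tier ${tier}` };
}
```

Because the answer comes from a table rather than a judgment call, the same request always resolves the same way, and the returned `reason` string gives each decision a loggable explanation.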

Voices from the team
“I pushed back on hiding logs. Then I saw triage time drop.”
“First time on-call felt designed for me.”
“The system stopped asking me to be a hero.”
Impact
My Role
DISCOVERY
12 stakeholder interviews, 30+ ticket retrospectives, on-call shadowing.
DEFINITION
Object model, state machine, decision authority tier taxonomy.
DESIGN
Wireframes → high-fi → motion grammar for triage transitions.
DELIVERY
Daily standup with eng team for 4 weeks, XState handoff.
POST-LAUNCH
90-day measurement, iteration on tier-4 (which got removed).
Reflection
What I'd do differently
We shipped four decision authority tiers, and in the pilot two of them never fired. In hindsight, we should have started with two tiers and earned the third with data. This was a designer's complexity bias: choosing the taxonomy by intuition instead of evidence. We now start every state model with a minimum viable taxonomy.
Behind the scenes · for engineering readers
The decision tier taxonomy was computed from four signals: context risk × customer-facing × data integrity × time-of-day. The state machine was modeled in XState, and the engineering team pasted the generated state-chart directly into UI documentation. The authority tier was also mirrored into Postgres row-level policies, so any non-UI surface (CLIs, internal scripts) inherits the same permission model.
Tech stack: XState · TypeScript · Postgres RLS · Datadog incident integration
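For engineering readers, the tier computation might look roughly like the sketch below. It is not the shipped logic: the additive scoring, the weights, the cut-offs, the off-hours window, and all identifiers (`IncidentContext`, `computeTier`, `isOffHours`) are assumptions chosen for illustration. The sketch returns three tiers, reflecting the post-launch removal of tier 4.

```typescript
type Risk = "low" | "medium" | "high";
type Tier = 1 | 2 | 3;

// The four signals named above, with hypothetical field names.
interface IncidentContext {
  contextRisk: Risk;
  customerFacing: boolean;
  dataIntegrityAtRisk: boolean;
  offHours: boolean; // derived from time of day
}

// Hypothetical weights; the real taxonomy may combine signals differently.
const RISK_WEIGHT: Record<Risk, number> = { low: 0, medium: 1, high: 2 };

// Assumed off-hours window of 22:00 to 06:00 UTC.
function isOffHours(at: Date, startHour = 22, endHour = 6): boolean {
  const hour = at.getUTCHours();
  return hour >= startHour || hour < endHour;
}

// Sum the signals into a score from 0 to 5, then bucket the score into a tier.
function computeTier(ctx: IncidentContext): Tier {
  const score =
    RISK_WEIGHT[ctx.contextRisk] +
    (ctx.customerFacing ? 1 : 0) +
    (ctx.dataIntegrityAtRisk ? 1 : 0) +
    (ctx.offHours ? 1 : 0);
  if (score >= 4) return 3;
  if (score >= 2) return 2;
  return 1;
}

// Example: the 02:47 payment-gateway alert, assuming medium context risk.
const gatewayAlert: IncidentContext = {
  contextRisk: "medium",
  customerFacing: true,
  dataIntegrityAtRisk: false,
  offHours: isOffHours(new Date("2025-01-01T02:47:00Z")),
};
const gatewayTier = computeTier(gatewayAlert);
```

Keeping the computation a pure function of the four signals is what would let the UI and a backend policy agree: both can derive the tier from the same inputs instead of storing a separately maintained value.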
Bigger picture
Designing for authority means removing the comfort of ambiguity. It forces teams to confront their own broken processes. This system didn't just clean up the UI; it exposed and corrected the political power dynamics of incident management.