UX LEAD CASE STUDY · DECISION SYSTEMS · INCIDENT OPS

Decision authority redesign for fintech on-call. Triage time roughly halved. Engineering and legal landed on the same model.

Designing Decision Authority Under Pressure

How I redesigned who gets to decide — and when — inside high-stakes SaaS incident operations.

DOMAIN

Fintech

ROLE

UX Lead

YEAR

2025

STATUS

Shipped

02:47 AM — The Decision Nobody Wants to Make

It's 02:47 AM. A "Latency Degraded" alert fires for the payment gateway. The on-call engineer is alone. Customer support is silent — no tickets yet. But the dashboard shows a 12% drift in response times, just below the auto-paging threshold.

The engineer has to choose:

  • Escalate: prevent possible data loss, but risk a false alarm.
  • Stay quiet: protect the team from fatigue, but risk a real outage.

The system is asking a human to resolve an ambiguous signal without enough data.

Decision authority tier diagram — three rows mapping risk level to escalation actor.

TL;DR

  • Problem: Distributed truth + unclear authority = decision paralysis at 3 AM.
  • Solution: A triage console that hides raw logs + tier-based escalation authority.
  • Key insight: Affordance ≠ permission. Hiding data can be care, not paranoia.

Process

TIMELINE — Multi-month engagement — design, then rollout

  • Discovery: On-call interviews and ticket retrospectives
  • Definition: Object model and state definition
  • Design: Wireframes through high-fidelity, two rounds of usability
  • Hard call: Drop one tier from launch; earn it later with data
  • Soft launch: Pilot squads first
  • Rollout: Full rollout and measurement

STAKEHOLDERS

  • Executive sponsor
  • Engineering leadership (initial pushback, later advocate)
  • Daily users from the affected team
  • On-call rotation lead — validation partner
  • Downstream operations stakeholder
  • Compliance / regulatory advisor

DECISION POINTS — 3 CRITICAL

  1. Suppress logs · triage focus: contested early, validated by data
  2. Authority tier based on context, not seniority: the org chart didn't match the risk surface
  3. Merge dispute and postmortem flow: auto-trigger on dispute

The Real Problem

Incident management gets reduced to ChatOps — a tooling problem. But tools can't fix the actual conflict: distributed truth. When metrics, customer reports, and engineering logs disagree, the system asks humans to put reality together under pressure. That's an impossible cognitive load.

The on-call engineer's real job at 3 AM isn't to solve the incident. It's to decide what kind of incident this is — and who has the authority to act on that decision.

Decisions

ADOPTED

  • Hidden raw logs during triage (revealed in investigation)
  • Context-based authority tier (risk × customer-facing × data integrity)
  • Merged dispute / postmortem with auto-attached stakeholders
  • Two binary decision questions surface only in the triage phase
  • Authority tier mirrored as Postgres row-level policy — same model in UI and backend
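
One way to read the last adopted decision is that the rule for "who may act at this tier" lives in one place and every surface asks the same question. The TypeScript sketch below illustrates that idea only. The tier numbers, role names, `ALLOWED_ROLES` mapping, and `canCommitDecision` are hypothetical, and the Postgres row-level policy that enforces the rule in the backend is not shown.

```typescript
// Hypothetical tiers and roles; the case study does not publish the real taxonomy.
type Tier = 1 | 2 | 3;
type Role = "on_call" | "rotation_lead" | "incident_commander";

interface Actor {
  id: string;
  role: Role;
}

interface Incident {
  id: string;
  tier: Tier;
}

// Assumed mapping from tier to the roles allowed to commit a decision.
const ALLOWED_ROLES: Record<Tier, readonly Role[]> = {
  1: ["on_call", "rotation_lead", "incident_commander"],
  2: ["rotation_lead", "incident_commander"],
  3: ["incident_commander"],
};

// One predicate for every surface: the UI can use it to enable or disable
// the action, and a backend policy would encode the same table.
function canCommitDecision(actor: Actor, incident: Incident): boolean {
  return ALLOWED_ROLES[incident.tier].some((role) => role === actor.role);
}
```

Under these assumed mappings, an on-call engineer can commit a tier 1 decision alone but not a tier 3 one. The check looks only at the incident's tier and the actor's role for that incident, never at seniority.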

REJECTED

  • Better log filtering — incremental, doesn't fix the root cause
  • Seniority-based escalation list — misses context risk
  • Separate dispute and postmortem flows — duplication
  • Auto-paging with a lower threshold — fatigue spiral

Visual Journey

Three high-fi screens, scroll-driven. Each step = next state of the system.

01

Alert dashboard

The first surface — alert priority, customer impact, decision tier hint. On desktop the engineer scans the row; on mobile the active incident takes the screen and others collapse.

Screens: Alert dashboard at desktop (1280+) and mobile (375).

Desktop uses a 3-column meta strip per alert; mobile pivots to one dominant card + collapsed siblings. The on-call engineer's pocket view is decision-first, not list-first — one tap from notification to triage.

02

Triage console — logs hidden

Raw logs collapsed during triage. On desktop both binary questions are visible at once; on mobile they surface one at a time — the constraint of the screen is also the constraint of the cognitive step.

Screens: Triage console with logs hidden, at desktop (1280+) and mobile (375).

Desktop shows the two-question layout side by side (faster for the experienced operator). Mobile splits them into a 1-of-2 wizard with progress dots — slower per question, more focused. Both commit identically; only the rhythm differs.
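
"Both commit identically" can be sketched as a small state transition function that both layouts feed. This is a hand-rolled reducer written for illustration, not the XState model the team shipped. The question ids `q1` and `q2`, the event names, and the `logsVisible` flag are placeholders, since the case study does not name the real questions.

```typescript
// Placeholder ids for the two binary triage questions.
type QuestionId = "q1" | "q2";
type Answers = Partial<Record<QuestionId, boolean>>;

type TriageState =
  | { phase: "triage"; answers: Answers }
  | { phase: "investigation"; answers: Record<QuestionId, boolean>; logsVisible: true };

type TriageEvent =
  | { type: "ANSWER"; question: QuestionId; value: boolean }
  | { type: "COMMIT" };

const initialState: TriageState = { phase: "triage", answers: {} };

function transition(state: TriageState, event: TriageEvent): TriageState {
  // In this sketch a committed decision is final.
  if (state.phase !== "triage") return state;

  if (event.type === "ANSWER") {
    const answers: Answers = { ...state.answers };
    answers[event.question] = event.value;
    return { phase: "triage", answers };
  }

  // COMMIT is ignored until both questions have an answer.
  const { q1, q2 } = state.answers;
  if (q1 === undefined || q2 === undefined) return state;

  // Raw logs become visible only once triage is committed.
  return { phase: "investigation", answers: { q1, q2 }, logsVisible: true };
}

// Replays a list of events from the initial state.
function run(events: TriageEvent[]): TriageState {
  return events.reduce(transition, initialState);
}
```

The desktop layout would send both `ANSWER` events from one screen and the mobile wizard would send them one step at a time. Either way the same event sequence reaches `transition`, so the committed state is the same.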

03

Decision authority tier

Why this matters: when there's no clear authority at 3 AM, escalation becomes political. The matrix makes the call non-negotiable — risk-based, not seniority-based — and audited via row-level Postgres policy.

Screen: Decision authority tier.

Voices from the team

"I pushed back on hiding logs. Then I saw triage time drop."
VP Engineering

"First time on-call felt designed for me."
Lead engineer, payments team

"The system stopped asking me to be a hero."
Senior on-call rotation lead

Impact

About the numbers below: these are 90-day rolling averages and rough estimates. An NDA limits exact figures, so published values may vary. Adoption at month 6 stayed within 5% of these ranges.
  • Decision time (avg, triage): ~25 min → ~12 min
  • False escalation rate: ~28% → ~14%
  • Postmortem completion: ~62% → ~88%
  • Decision authority clarity score: 5.8 → 9.1 / 10
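
For readers who want the relative change behind each range, the arithmetic is a one-liner. The sketch below uses only the approximate before and after values published above, so the results are approximations too. The `Metric` shape and the `relativeChangePct` helper are illustrative names.

```typescript
interface Metric {
  name: string;
  before: number;
  after: number;
}

// Approximate before/after values as published in the Impact list.
const published: Metric[] = [
  { name: "Decision time (min)", before: 25, after: 12 },
  { name: "False escalation rate (%)", before: 28, after: 14 },
  { name: "Postmortem completion (%)", before: 62, after: 88 },
  { name: "Decision authority clarity score", before: 5.8, after: 9.1 },
];

// Relative change in percent, rounded to the nearest whole number.
function relativeChangePct(before: number, after: number): number {
  return Math.round(((after - before) / before) * 100);
}

const changes = published.map((m) => ({
  name: m.name,
  pct: relativeChangePct(m.before, m.after),
}));
// changes: -52, -50, +42, +57 (in the order listed above)
```

Two of the metrics are already percentages, so relative and absolute change differ. Postmortem completion rose by about 26 percentage points, which is about 42% relative to the starting value.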

My Role

DISCOVERY

12 stakeholder interviews, 30+ ticket retrospectives, on-call shadow.

DEFINITION

Object model, state machine, decision authority tier taxonomy.

DESIGN

Wireframes → high-fi → motion grammar for triage transitions.

DELIVERY

Daily standup with eng team for 4 weeks, XState handoff.

POST-LAUNCH

90-day measurement, iteration on tier-4 (which got removed).

Reflection

What I'd do differently

We shipped four decision authority tiers, and in the pilot two of them never fired. In hindsight, we should have started with two tiers and earned the third with data. That was designer's complexity bias: choosing the taxonomy by intuition instead of evidence. We now start every state model with a minimum viable taxonomy.

Behind the scenes · for engineering readers

The decision tier taxonomy was computed from four signals: context risk × customer-facing × data integrity × time-of-day. The state machine was modeled in XState, and the engineering team pasted the generated state-chart directly into UI documentation. The authority tier was also mirrored into Postgres row-level policies, so any non-UI surface (CLIs, internal scripts) inherits the same permission model.
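
As a rough illustration of deriving a tier from the four signals, the sketch below adds up weighted signals and maps the score to a tier. The real weights and cut-offs are not published, so the weights, thresholds, three-tier range, and the `Signals` and `deriveTier` names are all assumptions. The boolean `offHours` is a simplified stand-in for time-of-day.

```typescript
type Tier = 1 | 2 | 3;

interface Signals {
  contextRisk: "low" | "medium" | "high";
  customerFacing: boolean;
  dataIntegrityAtRisk: boolean;
  offHours: boolean; // simplified stand-in for the time-of-day signal
}

// Assumed weights; the case study does not disclose how signals combine.
const RISK_WEIGHT: Record<Signals["contextRisk"], number> = {
  low: 0,
  medium: 1,
  high: 2,
};

function deriveTier(s: Signals): Tier {
  const score =
    RISK_WEIGHT[s.contextRisk] +
    (s.customerFacing ? 1 : 0) +
    (s.dataIntegrityAtRisk ? 2 : 0) +
    (s.offHours ? 1 : 0);

  // Assumed cut-offs: 0-1 is tier 1, 2-3 is tier 2, 4 and above is tier 3.
  if (score >= 4) return 3;
  if (score >= 2) return 2;
  return 1;
}

// The 02:47 AM scenario under these assumptions: medium risk,
// customer-facing, no known data integrity issue, off hours.
const openingScenario: Tier = deriveTier({
  contextRisk: "medium",
  customerFacing: true,
  dataIntegrityAtRisk: false,
  offHours: true,
});
```

With these made-up weights the opening scenario scores 3 and lands in tier 2. The function takes no seniority or org-chart input, so the tier depends only on the incident's context.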

Tech stack: XState · TypeScript · Postgres RLS · Datadog incident integration

Bigger picture

Designing for authority means removing the comfort of ambiguity. It forces teams to confront their own broken processes. This system didn't just clean up the UI; it exposed and corrected the political power dynamics of incident management.