Vinnie Mehta

Designing AI Agent Behavior for a Family Storytelling App

Translating a trust architecture into embedded, family-grade UX · 7 minute read

Five design decisions in this case study:

  • Autonomy — how the agent earns more of it
  • Confidence — how the agent signals what it’s sure about
  • Questions — how the agent asks the user for input
  • Undo — how the user reverses the agent
  • Trial credit — how the earned $10 works

Overview

Nanosarte turns a child’s artwork and voice into illustrated stories for grandparents living far away. An AI agent called Nano picks up the parent’s creative load — capturing stories, adapting them per recipient, scheduling sends around birthdays and holidays, and eventually handling small gift purchases.

The design problem isn’t what Nano can do. It’s how much the parent will let Nano do. For a product handling children’s content and family relationships, the cost of a trust failure is permanent — one badly-shared story can damage a family connection that doesn’t get repaired.

This case study walks through the trust architecture I designed to make that delegation safe, the onboarding redesign that made it work for non-technical parents, and five specific design decisions in the delegation flow that hold it together.

My role

  • Led the design end-to-end — information architecture, user flows, interaction patterns, design system, decision documentation
  • Defined the delegation model (Collaborator → Delegator → Sponsor) and the progression mechanics that move users between tiers
  • Built the design system (~780 specification lines) that governs every trust surface
  • Ran informal research with five parents to pressure-test the model

Process

My process ran in four steps:

  1. Information architecture — mapped the agent delegation model, user roles, and system structure before any screens existed
  2. Product requirements doc — scope, constraints, success criteria, what’s in v1 vs. deferred
  3. Feature prioritization — what lands first, what earns its place later, what gets cut
  4. Design, iterate, document — every decision captured in a 28-entry structured log with intent, trade-off, resolution, and principle

I used Stitch for rapid low-fidelity exploration, Figma for production craft, and Claude for pressure-testing decisions and maintaining design-system consistency across 780 specification lines. The tools accelerated specific steps. The architecture, the decisions, and the trade-offs are mine.

Problem

Families separated by distance have no good way to share the everyday moments of a child’s life with grandparents and extended family. Existing tools — WhatsApp, FaceTime, Google Photos — weren’t built for this relationship. Kids can’t type. Grandparents struggle with apps. The parent in the middle becomes an exhausting full-time relay.

Nanosarte ran as an e-commerce service where parents uploaded artwork and ordered personalized gifts. Parents still did all the curation, scheduling, translation, and recipient management manually. This project asks a harder question: what if an AI agent did that work on the parent’s behalf — and what would it take for the parent to actually let it?

The three user types span a 70-year age range and three literacy levels:

  • Parents (20s–40s) — the protagonist of this case study; managing the family relay today, skeptical of AI making decisions
  • Kids (3–10) — voice-first, zero-literacy interface required
  • Grandparents (55–80+) — often in a different country, different language; need radical simplicity

This case study focuses on the parent experience. The grandparent response side and Nano’s behavior specification are covered in a companion case study.

Research

Prior research

I built on existing research rather than starting over:

  • 12 months of customer email review — common themes around cultural adaptation, mockup requests, and product-fit concerns
  • Google Analytics — 79.2% new visitors, 52.9% mobile, 55+ age group growing 10% YoY
  • Usability testing (6 users) on the pre-AI product
  • Personas, empathy map, journey map already established

New research for this project

Informal conversations with five parents, 20–30 minutes each, focused on AI agent scenarios. This wasn’t structured research — it was listening. Four themes emerged.

The relay exhausts the parent, not the grandparent. “I’m the reason my mom sees my kids at all. If I’m tired, she doesn’t get photos.” Four of five parents named this specifically.

Cultural resonance matters more than translation. Two families had grandparents in a different country. Translation was rarely the hard part. “Would my mom actually enjoy this, or is it just ‘my kid made art’?”

Reversibility beats preview. When I described preview-before-send, parents agreed it was valuable but said they’d skip it after a while. When I described per-recipient pause and mid-flight stop, every parent leaned in. Being able to undo is a deeper form of trust than being able to review.

Parents don’t want to configure trust — they want to feel their family is cared for. Every parent wanted less to do. A system that asks them to configure autonomy has already failed, because it’s asked them to do the work they came to the product to escape.

External grounding

I read Anthropic’s and OpenAI’s public work on agent autonomy and behavioral specifications, plus HCI literature on calibrated trust. These provided vocabulary for why “auto-share a story” and “auto-buy a gift” are categorically different tasks, and why over-trust and under-trust are both failure modes.

Onboarding: scaling agent behavior from Collaborator to Sponsor

The problem with the conventional approach

My first onboarding build followed enterprise UX conventions for trust-architected products. It had a trust-mode onboarding flow explaining three tiers (Collaborator, Delegator, Sponsor), a value-calibration step with hypothetical scenarios (“What would you do if your child drew a picture of a sad day at school?”), and a governance dashboard showing Nano’s recent decisions and learned preferences.

Every piece had a principled reason. None of it worked for parents.

What failed, specifically

  • Value calibration felt like programming an AI. Parents said “I don’t know how I’d feel about that” and “it depends on who’s asking.” The hypothetical-scenario approach asked them to declare values rather than reveal them through use.
  • The governance dashboard was noise. Parents don’t care what Nano’s approval rate is. They care whether the last story was good. Showing metrics framed the relationship as surveillance.
  • The three-mode explainer was premature. Introducing the trust architecture to a parent who hadn’t yet experienced the product was like listing a friend’s three capability tiers before the friendship started.

What I shipped

Onboarding reduced to three screens — Meet Nano, Your Family, Connect Artwork — ending in under two minutes with Nano’s first prepared story on the dashboard. No value calibration. No trust-mode explainer. No governance dashboard. Nano arrives with common-sense defaults and learns the family’s preferences through contextual questions asked only when content genuinely warrants them.

The trust architecture still exists — it just doesn’t have dedicated screens. Legibility shows up as inline annotations on specific stories. Boundaries show up as the Autopilot Credit runtime. Reversibility shows up as five embedded moments the parent encounters where and when each is relevant. The architecture is the scaffolding. The felt experience is what stays.

Delegation: embedding trust into the flow

Five design decisions make the delegation flow work. Each solves a specific AI product design problem that isn’t unique to Nanosarte.

Autonomy — replaced approval-rate evidence with a natural observation

Action: Replaced the evidence-based progression prompt with a three-sentence agent-voice observation.

Task: My first progression design tracked how often the parent approved Nano’s drafts unchanged, then fed that data back at progression time: “You approved 12 of 13 stories — ready for Delegator mode?” The framework said this was the accuracy play. The psychology was wrong. Parents read the data and felt watched, not reassured. A product handling children’s content can’t frame its relationship with the user as performance surveillance.

I changed the progression model entirely. Nano earns trust by doing good work — preparing drafts proactively, learning from what the parent changes — and the progression prompt references shared experience rather than observation logs. The new prompt is a dashboard card: “The last three stories I prepared, you sent without changes. Want me to start sending the quick ones directly, and only ask you about the trickier ones?”
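The trigger behind that dashboard card can be sketched in a few lines. This is a minimal illustration, not the shipped logic: it assumes Nano keeps a short internal record of whether each prepared story was sent unchanged, and the names `StoryOutcome` and `should_offer_delegation` are hypothetical. The streak of three comes from the prompt wording above. The record stays internal and is never rendered to the parent.

```python
from dataclasses import dataclass


@dataclass
class StoryOutcome:
    """Hypothetical internal record of one prepared story (never shown in the UI)."""
    sent: bool
    edited_before_send: bool


def should_offer_delegation(history: list[StoryOutcome], streak: int = 3) -> bool:
    """Return True when the most recent `streak` stories were sent without changes."""
    if len(history) < streak:
        return False
    recent = history[-streak:]
    return all(s.sent and not s.edited_before_send for s in recent)
```

A single edited draft inside the most recent three resets the offer, so the card only appears after an unbroken run of unchanged sends.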

  • Strategic call: Trust is earned through demonstrated value, not demonstrated surveillance. The agent’s track record is the product itself, not a metrics dashboard.
  • UX craft: Progression compressed from a full-screen evidence modal to a three-sentence dashboard card in the agent’s voice. Shared experience replaces behavioral analytics.
  • Information architecture: Approval tracking removed entirely. Parent behavior never surfaces as visible system state.
  • Trade-off accepted: Less precise trust calibration. In exchange, the parent never feels watched, which is what made them willing to keep using the product.

Autonomy before / after

Result: The progression prompt went from a multi-section ceremony to a natural beat in the flow. The parent reads it in under ten seconds. Trust progression stops feeling like a product upgrade and starts feeling like a relationship milestone.

Confidence — replaced classification badges with verb-first activity states

Action: Hid the three-tier classification from the parent and let the agent’s behavior carry the meaning instead.

Task: Every AI that classifies content faces this choice: show the user the classification label, or let them experience it through how the product behaves. I built both. V1 showed Clear / Borderline / Hold tiers as colored badges on the activity feed. V4 hides them entirely.

Parents in V1 started arguing with the labels: “why is this Borderline, it’s fine.” They treated the labels as editorial judgment rather than behavioral state. The labels leaked backend state into the experience and gave the parent a system to game.

In V4, the three tiers drive three distinct behaviors: Clear content appears with “Sent.” Borderline content surfaces an inline question. Hold content blocks and explains. The parent never sees the label. Typography and natural language carry state — no color coding, no severity indicators.
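As a rough sketch of that mapping, the tier can stay a backend value while the feed entry carries only a behavior and its text. Everything here is illustrative: `Tier`, `feed_entry`, and the dictionary shape are assumed names, and only the “Sent.” wording comes from the design itself.

```python
from enum import Enum
from typing import Optional


class Tier(Enum):
    """Backend classification. Never serialized into the parent-facing feed."""
    CLEAR = "clear"
    BORDERLINE = "borderline"
    HOLD = "hold"


def feed_entry(tier: Tier, question: Optional[str] = None,
               explanation: Optional[str] = None) -> dict:
    """Translate a backend tier into what the parent actually sees."""
    if tier is Tier.CLEAR:
        return {"behavior": "send", "text": "Sent."}
    if tier is Tier.BORDERLINE:
        # Borderline content surfaces an inline question instead of a label.
        return {"behavior": "ask", "text": question}
    # Hold content blocks the send and explains why.
    return {"behavior": "block", "text": explanation}
```

The returned entry has no tier field at all, which is the point: there is no label for the parent to argue with.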

  • Strategic call: Classification is backend architecture, not UI. The parent experiences the agent’s confidence through behavior, not through system labels.
  • UX craft: Three behaviors replace three labels. Verb-first activity states carry meaning through language alone.
  • Information architecture: The feed groups by what Nano did, not by what Nano thought. Tier-based filtering moves to backend analytics the parent never sees.
  • Trade-off accepted: Auditability decreased. A parent can no longer see “Nano classified this with 72% confidence.” Right trade for a non-technical audience.

Confidence before / after

Result: Parents stopped arguing with labels because there were no labels to argue with. They reacted to what Nano did, and Nano calibrated from those reactions — better data from shorter conversations.

Questions — designed the inline-but-blocking pattern

Action: Built a question card that sits between the story draft and the send button — visible, required, never alarming.

Task: When Nano encounters ambiguous content, she needs the parent to answer a question before sending. Two common patterns fail here. Inline notes at the bottom of a draft get scrolled past, especially on a polished-looking story. Hard-blocking modals with warning UI make the parent think something is wrong with content that isn’t actually problematic.

I designed an inline-but-blocking pattern. The question card sits between the story draft preview and the send button. The send button is visible but inactive until the parent answers. The question is framed warmly and specifically — “Emma mentioned feeling lonely. Want me to flag moments like this for you in the future?” — never as a warning. Once the parent answers, send activates immediately.

One question per story, maximum. Only when Nano’s common sense genuinely detects ambiguity. Never manufactured to simulate calibration.
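The gating rule is small enough to sketch as state on the draft. This is an assumed model, not the production code: `Draft`, `ask`, `answer_question`, and `send_enabled` are hypothetical names chosen to mirror the pattern described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Draft:
    """Hypothetical draft state for the inline-but-blocking pattern."""
    story_id: str
    question: Optional[str] = None
    answer: Optional[str] = None

    def ask(self, question: str) -> None:
        # One question per story, maximum.
        if self.question is not None:
            raise ValueError("a story carries at most one question")
        self.question = question

    def answer_question(self, answer: str) -> None:
        if self.question is None:
            raise ValueError("there is no question to answer")
        self.answer = answer

    def send_enabled(self) -> bool:
        # Send stays visible but inactive until any pending question is answered.
        return self.question is None or self.answer is not None
```

A draft with no question is sendable immediately, and answering the question re-enables send without any further step.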

  • Strategic call: Guarantee engagement without creating alarm. The question is the last step before value, not a barrier to it.
  • UX craft: Card sits in the send path, not outside it. Warm specific framing replaces system-alert language. Send disables until answered, re-enables immediately.
  • Information architecture: Maximum one question per story. Only when genuinely ambiguous. The question never appears when Nano is confident or when Nano is blocking.
  • Trade-off accepted: Some parents will find the question slightly annoying in the first few story cycles. Annoyance is the cost of the calibration signal. It decays rapidly once Nano has learned the family’s preferences.

Questions pattern


Undo — distributed reversibility across five embedded moments

Action: Deleted the central Reversibility Log. Distributed undo across five moments the parent encounters where and when each applies.

Task: The trust architecture prescribed reversibility as a dedicated screen — a Reversibility Log listing every recent action with a “revert” button. I built it. It created anxiety instead of dissolving it. A dedicated screen listing everything that might need undoing made the system feel fragile. Parents either ignored the log or got stuck on it.

In V4, reversibility lives in five specific moments, each sitting exactly where the action it reverses happens:

  1. Cancel during generation — while Nano is preparing a draft, a visible cancel control
  2. 60-second undo window — after the parent hits send, a small banner offers undo for one minute
  3. Pause All — a persistent global stop available from the dashboard
  4. Per-recipient pause — granular pause on a specific family member
  5. Teach Nano — feedback on any individual story that shapes future behavior
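Two of these moments reduce to simple checks, sketched below under stated assumptions. The function names are hypothetical, and treating the 60-second boundary as inclusive is a guess. The precedence of Pause All over per-recipient settings is also assumed, following from its description as a global stop.

```python
from datetime import datetime, timedelta

UNDO_WINDOW = timedelta(seconds=60)  # from the text: undo is offered for one minute


def can_undo(sent_at: datetime, now: datetime) -> bool:
    """True while the post-send undo banner should still be offered."""
    elapsed = now - sent_at
    # Inclusive upper bound is an assumption; the text only says "one minute".
    return timedelta(0) <= elapsed <= UNDO_WINDOW


def delivery_allowed(recipient: str, pause_all: bool,
                     paused_recipients: set[str]) -> bool:
    """Pause All stops everything; otherwise check the per-recipient pause."""
    return not pause_all and recipient not in paused_recipients
```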

  • Strategic call: Reversibility embedded in moments is stronger than reversibility surfaced in a log. Users need to undo at the point of doing, not in an audit panel.
  • UX craft: Five distinct surfaces, each with its own interaction pattern: inline cancel control, post-action banner with timer, persistent global button, per-recipient toggle, story-level feedback capsule.
  • Information architecture: Deleted the Reversibility Log from the dashboard entirely. Distributed its function across contextual moments.
  • Trade-off accepted: Reversibility is less discoverable as a single named feature. A parent browsing the app won’t find a “Reversibility Center.” In exchange, the reversibility they actually need is always present at the moment they need it, which is the only kind that matters.

Undo: five embedded moments

Result: Parents who asked for “undo” during research were looking for the ability to stop and redirect future agent behavior, not to audit past behavior. Embedded moments deliver exactly that.

Trial credit — restricted earned credit to preserve the trial mechanic

Action: Restricted the $10 milestone reward to Autopilot runtime only, rather than making it a flexible wallet.

Task: The Story Moments milestone awards the parent $10 for completing ten stories with Nano. My first instinct was to make that a flexible wallet the parent could spend however they wanted. It felt generous.

It would have broken the trial mechanic. A flexible wallet means the parent applies the $10 toward a gift they were planning to buy anyway, and Nano never gets the designed moment to demonstrate Autopilot. The milestone stops being a trust-transfer mechanism and becomes a generic discount coupon.

In V4, the $10 is Autopilot Credit, restricted to Autopilot runtime costs (story delivery and Nano-purchased gifts). For anything outside Autopilot, such as a parent manually ordering a gift in Collaborator mode, the parent pays by credit card at checkout. Two payment surfaces, two clear purposes.
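The split between the two surfaces can be sketched as a routing rule keyed on who initiated the purchase. The enum and function names are hypothetical; the ten-story threshold and the $10 amount come from the milestone described above, and treating the award as a simple threshold check is an assumption.

```python
from decimal import Decimal
from enum import Enum

MILESTONE_STORIES = 10                # from the text: ten stories with Nano
MILESTONE_CREDIT = Decimal("10.00")   # from the text: the earned $10


class Initiator(Enum):
    NANO_AUTOPILOT = "nano_autopilot"
    PARENT_MANUAL = "parent_manual"


class PaymentSurface(Enum):
    AUTOPILOT_CREDIT = "autopilot_credit"
    CARD_CHECKOUT = "card_checkout"


def earned_credit(stories_completed: int) -> Decimal:
    """Credit awarded once the milestone is reached (threshold check is assumed)."""
    if stories_completed >= MILESTONE_STORIES:
        return MILESTONE_CREDIT
    return Decimal("0.00")


def payment_surface(initiator: Initiator) -> PaymentSurface:
    """Earned credit only ever pays for Autopilot activity; the two never mix."""
    if initiator is Initiator.NANO_AUTOPILOT:
        return PaymentSurface.AUTOPILOT_CREDIT
    return PaymentSurface.CARD_CHECKOUT
```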

  • Strategic call: Earned currency pulls users toward the behavior it was earned in. Purchased currency should be flexible. Conflating them sacrifices strategic intent for the illusion of generosity.
  • UX craft: Two payment surfaces, never mixed. Autopilot Credit appears where Autopilot is active. Credit-card checkout appears for user-initiated purchases.
  • Information architecture: Balance displays only on the milestone screen and post-activation dashboard. Not in Settings. Not as a global badge. Where it matters, when it matters.
  • Trade-off accepted: Some parents will feel the restriction and want to use the $10 flexibly. That’s a 30-second friction in one specific scenario that protects a multi-month retention mechanic.

Trial credit before / after

Result: The milestone pulls parents into Autopilot as designed. The trial currency doesn’t leak out. The user’s own money stays separately usable for whatever they want to buy manually.

Failure: designing recovery for when the AI is wrong, unsure, or stuck

Every AI system has failure modes. Most portfolios skip this. I designed for three specific scenarios where Nano either gets something wrong, lacks confidence, or can’t proceed.

Three failure scenarios

Stress case 1 — Nano misreads or flags the wrong thing

What goes wrong: Nano flags emotional content in a story where the child is clearly role-playing, not sharing real distress. The flag is wrong.

Designed recovery:

  • The inline question card uses warm specific framing — never asserting something is wrong, just asking
  • A “Not sure — get help” link below every question routes to human support for this specific decision
  • If the parent answers “it’s fine,” Nano learns from that signal for future similar content
  • The parent never sees the word “hallucination” or “error.” Framing stays “additional support”

Stress case 2 — Parent is confused by what Nano is asking

What goes wrong: Nano asks a contextually appropriate question, but the parent can’t tell what Nano is actually asking for.

Designed recovery:

  • Persistent support icon in the navigation — always visible, never dependent on the current screen
  • Routes to chat, email, or scheduled call with a Nanosarte support person
  • The parent can skip the question and send anyway, with the unanswered question logged for Nano’s reference
  • No blocking indefinitely, no “AI failure” framing

Stress case 3 — Autopilot Credit runs out mid-plan

What goes wrong: Nano has scheduled a story but the credit balance hits zero before the send executes.

Designed recovery:

  • Autopilot pauses automatically — no overdraft, no silent failure, no unexpected charge
  • Parent receives a warm notification explaining what paused and why
  • The pending story stays in the queue; nothing is lost
  • Options: top up now (single tap) or hold the story for manual send later
  • The Autopilot Credit card shows “$0.00 · Paused” clearly — as a pause, not a failure
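A minimal sketch of that recovery, assuming each send has a known cost drawn from the balance. The `Autopilot` class, its method names, the per-story cost, and the “Active” label are all hypothetical; only the “$0.00 · Paused” display and the no-overdraft, nothing-lost behavior come from the design above.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class Autopilot:
    """Hypothetical model of the Autopilot Credit runtime."""
    balance: Decimal
    paused: bool = False
    queue: list = field(default_factory=list)

    def try_send(self, story_id: str, cost: Decimal) -> bool:
        """Send if the balance covers the cost; otherwise pause and keep the story."""
        if self.paused or cost > self.balance:
            self.paused = True  # no overdraft, no silent failure
            if story_id not in self.queue:
                self.queue.append(story_id)  # nothing is lost
            return False
        self.balance -= cost
        if story_id in self.queue:
            self.queue.remove(story_id)
        return True

    def top_up(self, amount: Decimal) -> None:
        """Adding credit resumes Autopilot."""
        self.balance += amount
        self.paused = False

    def status_label(self) -> str:
        state = "Paused" if self.paused else "Active"
        return f"${self.balance:.2f} · {state}"
```

After a top-up, calling `try_send` again for the queued story sends it and removes it from the queue. Holding the story for a manual send would simply leave it queued.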

Why failure design matters

Human-in-the-loop isn’t a banner. It’s what happens when specific things go wrong. These three scenarios demonstrate how the principle holds up under real failure conditions — not as a claim, as a design.

Testing

After the delegation flow shipped, I ran a second round of conversations with the same five parents plus two new ones. Four findings changed what shipped.

The milestone CTA caused a re-read. “Let Nano handle it” made parents pause — handle what, exactly? Renamed to “Activate Autopilot.” More mechanical, matches what the button does, parents stopped hesitating.

Nano’s visual pattern was inconsistent across screens. Two different presentations (warm-surface annotation vs. gold illustration tile) got mixed in one review round. Consolidated in the design system with explicit usage rules — annotation, action, and celebration are now distinct patterns.

An intermediate setup screen broke the flow. I’d designed an Autopilot Setup screen between the milestone CTA and the post-activation dashboard. Reviewing the flow, I realized the CTA was a promise — tapping it should activate, not route to a form. Removed the setup screen entirely and moved any residual config into the post-activation dashboard state.

The “free stories” framing was wrong. “Free from what?” asked one parent, correctly identifying that real API costs exist. Changed to “stories are covered by your balance.” Removed “free” from all product surfaces.

Learning

  • Frameworks are vocabulary, not templates. The trust mechanisms (legibility, boundaries, reversibility) are durable. The surfaces a textbook application produces — dashboards, logs, configuration screens — are not. My job was to keep the mechanisms and rebuild the surfaces.
  • Embedded is harder than explicit. The conventional build was faster because each trust mechanism got its own screen. The shipped version took longer because every mechanism had to find its moment in a flow that already had other work to do. Embedded UX is the harder discipline and the one that separates consumer AI products from ones that explain themselves to death.
  • Users want to stop the system more than they want to audit it. Pause All, per-recipient pause, and the 60-second undo did more for trust than any audit log ever could have.
  • Design-system consistency is recurring work. The system contradicted itself mid-project. Finding and fixing that contradiction was part of the design work, not a housekeeping step.