The Reality of Autonomous AI Employees: What We Actually Built

Beyond the Demo: Real Workflows for Real Problems

When a customer emails asking about their order status at 2 AM, our AI employee doesn't just send a canned response. It executes a multi-step investigation that would take a human agent 3-4 minutes, completing it in under 10 seconds.

Deployment Timeline

Oct 2025: Shadow Mode. Every action observed. Nothing executed.

Nov 2025: Supervised. All writes reviewed and approved.

Dec 2025: Fully autonomous. 95% autonomous across all workflows.

Today: Expanding. New workflow types being added.

The workflow looks like this: First, it searches Shopify's customer database by email to locate all associated orders. Then it retrieves the complete order record—line items, fulfillment status, payment details, shipping address, and crucially, the fulfillment tracking data. Here's where it gets interesting: if the order shows "fulfilled" in Shopify but has multiple fulfillments (one for shipping insurance, one for actual products), the system parses each fulfillment individually to find the tracking number. It then evaluates the shipment status—not just whether something shipped, but whether it's in transit, out for delivery, or actually confirmed delivered by the carrier.

Only after this complete investigation does it respond to the customer with accurate, specific information: "Your order 1234567 is currently in transit in Calgary, AB as of October 13th. Expected delivery is 3-5 business days from that date."
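The multi-fulfillment case can be sketched in a few lines of Python. This is a minimal illustration, not the production code: the `Fulfillment` shape, the status strings, and both function names are assumptions made for the example and do not mirror Shopify's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fulfillment:
    # Hypothetical, simplified record; not Shopify's real fulfillment schema.
    name: str
    tracking_number: Optional[str]
    shipment_status: Optional[str]  # assumed values: "in_transit", "out_for_delivery", "delivered"


# Assumed mapping from status codes to customer-facing wording.
STATUS_LABELS = {
    "in_transit": "in transit",
    "out_for_delivery": "out for delivery",
    "delivered": "confirmed delivered",
}


def find_tracked_fulfillment(fulfillments):
    """Return the first fulfillment that carries a tracking number, or None."""
    for fulfillment in fulfillments:
        if fulfillment.tracking_number:
            return fulfillment
    return None


def describe_shipment(order_number, fulfillments):
    """Build a status sentence from whichever fulfillment has tracking data."""
    tracked = find_tracked_fulfillment(fulfillments)
    if tracked is None:
        return f"Order {order_number} has no tracking information yet."
    label = STATUS_LABELS.get(tracked.shipment_status, "awaiting a carrier update")
    return f"Order {order_number} is {label} (tracking {tracked.tracking_number})."
```

An order with one insurance fulfillment (no tracking) and one product fulfillment resolves to the product fulfillment, which is the behavior the paragraph above describes.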

The Refund Problem: Where Automation Meets Real Money

Processing refunds is where most automation systems fail catastrophically. You can't just click "refund" and hope for the best—there are payment disputes to check, fraud patterns to detect, return policies to enforce, and multiple payment processors to navigate.

Our refund workflow starts with order validation: Is this order real? Is it eligible for refund under the 30-day policy? Are there any pending payment disputes flagged in Stripe or PayPal? The system checks both payment processors because customers use different methods, and a dispute in one system isn't visible in the other.
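As a rough sketch of that validation step, the decision can be reduced to three outcomes. The names below (`RefundDecision`, `decide_refund`, the `open_disputes` mapping) are hypothetical, and sending disputed orders to a human is an assumption based on how disputed orders are treated later in this piece.

```python
from datetime import date, timedelta
from enum import Enum

REFUND_WINDOW_DAYS = 30  # the 30-day policy described above


class RefundDecision(Enum):
    REFUND = "refund"
    STORE_CREDIT = "store_credit"
    ESCALATE = "escalate"


def decide_refund(order_date: date, today: date, open_disputes: dict) -> RefundDecision:
    """Pick a refund path.

    open_disputes maps a processor name ("stripe", "paypal") to True when that
    processor reports a pending dispute. Every processor is checked because a
    dispute in one system is not visible in the other.
    """
    if any(open_disputes.values()):
        return RefundDecision.ESCALATE  # assumption: disputes go to a human
    if today - order_date > timedelta(days=REFUND_WINDOW_DAYS):
        return RefundDecision.STORE_CREDIT
    return RefundDecision.REFUND
```

Under this sketch an order that is exactly 30 days old is still refundable; only orders older than the window fall through to store credit.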

If the order is over 30 days old, the system doesn't just reject it—it generates a unique store credit code using the pattern CUST-[CustomerFirstName]-[OrderNumber], creates a discount in Shopify with precise constraints (minimum purchase equals discount amount, one use per customer, expires in two years), and sends a personalized message explaining the alternative.
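The store credit rules translate directly into a small data structure. The sketch below only builds the record; creating the discount in Shopify is out of scope. `StoreCredit` and `build_store_credit` are illustrative names, and the handling of a February 29 issue date is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class StoreCredit:
    code: str
    amount: Decimal
    minimum_purchase: Decimal
    uses_per_customer: int
    expires_on: date


def build_store_credit(first_name: str, order_number: str, amount: Decimal, issued_on: date) -> StoreCredit:
    """Build a credit following the CUST-[CustomerFirstName]-[OrderNumber] pattern."""
    code = f"CUST-{first_name.strip()}-{order_number}"
    try:
        expires_on = issued_on.replace(year=issued_on.year + 2)
    except ValueError:
        # Issued on Feb 29 and the target year is not a leap year (assumed rule: use Feb 28).
        expires_on = issued_on.replace(year=issued_on.year + 2, day=28)
    return StoreCredit(
        code=code,
        amount=amount,
        minimum_purchase=amount,  # minimum purchase equals the discount amount
        uses_per_customer=1,      # one use per customer
        expires_on=expires_on,    # expires in two years
    )
```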

For approved refunds, it determines the payment processor, executes the refund through the appropriate API (Stripe or PayPal), documents the transaction in Shopify's order timeline, updates the support ticket status, and sends a confirmation email—all without human intervention for standard eligible refunds.
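One way to picture the approved-refund path is as an ordered plan of operations chosen by payment processor. This is a sketch under assumptions: `stripe:create_refund`, `shopify:add_order_note`, and `reamaze:create_message` are operation names that appear elsewhere in this piece, while `paypal:create_refund`, `reamaze:update_ticket_status`, and the dictionary fields are hypothetical.

```python
from decimal import Decimal

SUPPORTED_PROCESSORS = ("stripe", "paypal")


def build_refund_plan(processor: str, amount: Decimal, order_number: str) -> list:
    """Return the ordered operations for a standard, eligible refund."""
    if processor not in SUPPORTED_PROCESSORS:
        raise ValueError(f"Unsupported payment processor: {processor!r}")
    return [
        # 1. Refund through the processor that took the original payment.
        {"op": f"{processor}:create_refund", "order": order_number, "amount": str(amount)},
        # 2. Document the transaction on the order timeline.
        {"op": "shopify:add_order_note", "order": order_number,
         "note": f"Refunded {amount} via {processor}"},
        # 3. Update the support ticket (hypothetical operation name).
        {"op": "reamaze:update_ticket_status", "status": "resolved"},
        # 4. Send the confirmation to the customer.
        {"op": "reamaze:create_message", "template": "refund_confirmation"},
    ]
```

Building the plan as data before anything runs is what makes it possible to review or govern the whole sequence as a unit.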

Order Cancellations: Simple Request, Complex Execution

"Cancel my order" seems straightforward. It's not. The complexity depends entirely on fulfillment status and which warehouse system has the order.

For unfulfilled orders in Shopify's direct system, it's clean: cancel in Shopify, process the refund through the original payment method, update the ticket, notify the customer. Done.

But when orders have already moved to third-party fulfillment partners, the workflow becomes surgical. The system must verify the order hasn't entered the picking stage by checking the fulfillment status with the warehouse partner. Only if the order is still cancellable does it proceed — executing the cancellation with the fulfillment partner, removing the relevant tags in Shopify, processing the refund, and documenting every step.
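The branching can be sketched as a single decision function. The partner status values and the `CancelPath` names are assumptions for illustration; the source only states that an order must not have entered the picking stage, and it does not say what happens to orders that can no longer be cancelled.

```python
from enum import Enum
from typing import Optional

# Hypothetical warehouse statuses that come before picking begins.
PRE_PICKING_STATUSES = frozenset({"received", "queued"})


class CancelPath(Enum):
    DIRECT = "cancel in Shopify, refund, update ticket, notify customer"
    VIA_PARTNER = "cancel with partner, remove Shopify tags, refund, document"
    NOT_CANCELLABLE = "no automatic cancellation"


def choose_cancel_path(shopify_fulfilled: bool, partner_status: Optional[str]) -> CancelPath:
    """Pick a cancellation path.

    partner_status is None when the order never left Shopify's direct system;
    otherwise it is the status reported by the fulfillment partner.
    """
    if partner_status is not None:
        if partner_status in PRE_PICKING_STATUSES:
            return CancelPath.VIA_PARTNER
        return CancelPath.NOT_CANCELLABLE
    if not shopify_fulfilled:
        return CancelPath.DIRECT
    return CancelPath.NOT_CANCELLABLE
```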

Product Information: The Long Tail of Support

Customers ask surprisingly nuanced questions: "What size are your widgets?" "Do they come in different sizes?" "How do I activate the widget I received?"

The challenge isn't answering these questions; it's detecting when customers are actually asking them. People don't say "widget 1." They say "what I bought," or "your product," or "it." Our intent detection system had to learn that "charging issues" definitively means widget 4 (the only product with a charger), while "do not separate label" indicates the widget multi-pack.

The workflow classifies the question, identifies the product being referenced (even from context clues), retrieves the customer's order history to confirm they actually purchased that product, then delivers specific product information—dimensions, usage instructions, warranty details—from a structured knowledge base.
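A deliberately simplified stand-in for the product-identification step is shown below. It uses plain keyword cues rather than a learned intent model, so it illustrates the idea and not the actual system; `PRODUCT_CUES` and `infer_product` are hypothetical names, and the fallback rule for vague references is an assumption.

```python
# Cue phrases taken from the examples above; a real system would learn many more.
PRODUCT_CUES = {
    "widget 4": ("charging", "charger"),
    "widget multi-pack": ("do not separate",),
}


def infer_product(message: str, purchased: list):
    """Guess which product a message refers to, confirmed against order history.

    Returns a product name, or None when the reference stays ambiguous.
    """
    text = message.lower()
    for product, cues in PRODUCT_CUES.items():
        # A cue only counts if the customer actually bought that product.
        if any(cue in text for cue in cues) and product in purchased:
            return product
    # Vague references ("it", "what I bought") resolve only when the
    # customer has exactly one product on record (assumed rule).
    if len(purchased) == 1:
        return purchased[0]
    return None
```

For example, `infer_product("I'm having charging issues", ["widget 1", "widget 4"])` resolves to `"widget 4"`, while a bare "how do I activate it?" from a customer with two different products returns `None` and would need a clarifying question.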

From Governance to Full Autonomy

Here's what makes this production-ready rather than a liability: every write operation is governed before it executes. The AI validates the complete workflow, determines the correct actions, and checks each one against a risk policy before anything runs.

October. We deployed in Shadow Mode — the AI executed every workflow end-to-end with real data, but writes were intercepted and surfaced for review before anything ran. A refund operation would appear in the inbox showing: "Operation 1: stripe:create_refund for $47.99. Reason: Defective product reported within 30-day window. Operation 2: shopify:add_order_note. Operation 3: reamaze:create_message sending confirmation." The team reviewed the complete action plan and approved with one click. Either all three executed atomically or none did — no partial states.
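The "all or none" behavior can be approximated with an approval gate plus compensating actions. How the real system guarantees atomicity across Stripe, Shopify, and Re:amaze is not described here, so the rollback approach below is an assumption, and `Operation` and `execute_plan` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Operation:
    name: str                   # e.g. "stripe:create_refund"
    run: Callable[[], None]     # performs the write
    undo: Callable[[], None]    # compensating action if a later step fails


def execute_plan(operations: List[Operation], approved: bool) -> bool:
    """Run every operation in order, or leave no partial state behind.

    Nothing runs without approval. If any step raises, the steps that
    already completed are undone in reverse order.
    """
    if not approved:
        return False
    completed = []
    try:
        for operation in operations:
            operation.run()
            completed.append(operation)
    except Exception:
        for operation in reversed(completed):
            operation.undo()
        return False
    return True
```

Some writes, such as a sent email, cannot truly be undone, which is one reason to order a plan so the least reversible step runs last.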

November. We moved to supervised execution. Everything ran for real, but every write still surfaced for approval. Every decision the team made — approve, reject, modify — fed back into the governance policy. Every edge case we caught in review became a rule the system learned. We ran this way for a full month.

December. Full autonomy across all existing workflows. Every action class had been through supervised review. The policy was solid. We removed approval requirements across the board — the system now investigates, decides, executes, and confirms without human review on 95% of tickets. The remaining 5% are genuine edge cases the policy flags as high-risk: large refunds, disputed orders, anything that warrants a human judgment call.
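A risk policy of this kind can be sketched as a function that returns the reasons an operation needs a human. The threshold value and the function names are hypothetical; the source names large refunds and disputed orders as high-risk but gives no numbers.

```python
from decimal import Decimal

LARGE_REFUND_THRESHOLD = Decimal("200.00")  # hypothetical cutoff, not from the source


def risk_flags(operation: str, amount: Decimal = Decimal("0"), disputed: bool = False) -> list:
    """Return the reasons (if any) an operation should go to human review."""
    flags = []
    if disputed:
        flags.append("disputed_order")
    if operation.endswith(":create_refund") and amount > LARGE_REFUND_THRESHOLD:
        flags.append("large_refund")
    return flags


def can_auto_execute(operation: str, amount: Decimal = Decimal("0"), disputed: bool = False) -> bool:
    """An operation runs without review only when the policy raises no flags."""
    return not risk_flags(operation, amount, disputed)
```

Returning the flags rather than a bare boolean keeps the reason for each escalation visible to the reviewer.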

Today the focus is expansion. The existing workflows are running. The platform is stable. The work now is adding new workflow types — more complex edge cases, new integrations, broader coverage. The autonomy question is settled. The question now is scale.

The progression from shadow to supervised to autonomous isn't a milestone you hit once. It's a dial you move deliberately, action class by action class, as the system earns it.
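The "dial" can be modeled as a per-action-class setting that only moves one notch at a time. This is an illustrative sketch; `Mode` and `AutonomyDial` are hypothetical names, and defaulting unknown action classes to shadow mode is an assumption.

```python
from enum import Enum


class Mode(Enum):
    SHADOW = 1
    SUPERVISED = 2
    AUTONOMOUS = 3


class AutonomyDial:
    """Tracks how much autonomy each action class has earned."""

    def __init__(self):
        self._modes = {}

    def mode_for(self, action_class: str) -> Mode:
        # Action classes that have never been reviewed start in shadow mode.
        return self._modes.get(action_class, Mode.SHADOW)

    def promote(self, action_class: str) -> Mode:
        """Move one action class up a single step; never skip a stage."""
        current = self.mode_for(action_class)
        if current is not Mode.AUTONOMOUS:
            self._modes[action_class] = Mode(current.value + 1)
        return self.mode_for(action_class)

    def needs_approval(self, action_class: str) -> bool:
        return self.mode_for(action_class) is not Mode.AUTONOMOUS
```

Promoting `"refund"` twice takes it from shadow to autonomous, while a newly added workflow type still starts at shadow regardless of how far the others have progressed.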

Order Modifications: The Dynamic Pricing Problem

When customers need to add items to existing orders, we can't just create a new order—we need to charge them only for the new items while preserving the original order context.

The workflow creates a duplicate of the original order, applies a 100% discount to all original items (marked with reason: "Already Paid"), adds the new product at full price, calculates accurate shipping, generates a custom invoice, and sends it to the customer with context: "This invoice is for adding [Product] to your existing order 1234567. Original items are discounted since you've already paid."
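The pricing arithmetic is easy to check with a small sketch. The `LineItem` shape and `build_add_on_invoice` are hypothetical, and shipping is passed in as an already-calculated amount rather than computed here.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class LineItem:
    title: str
    price: Decimal
    discount_percent: int = 0
    discount_reason: str = ""

    @property
    def amount_due(self) -> Decimal:
        return self.price * (100 - self.discount_percent) / 100


def build_add_on_invoice(original_items, new_items, shipping: Decimal):
    """Duplicate the original order, zero out paid items, and add the new ones.

    Returns the invoice lines and the total the customer still owes.
    """
    lines = [
        LineItem(item.title, item.price, 100, "Already Paid")
        for item in original_items
    ]
    lines += [LineItem(item.title, item.price) for item in new_items]
    total = sum((line.amount_due for line in lines), Decimal("0")) + shipping
    return lines, total
```

With one original item at 47.99, one new item at 19.99, and 5.00 shipping, the customer owes 24.99: the original item stays on the invoice for context but contributes nothing to the total.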

What Actually Matters

This isn't RPA clicking through interfaces. This is an AI employee that understands "I didn't get my package" means: query Shopify for the order, parse the fulfillment data, check carrier tracking, evaluate shipment status, determine if it's lost versus delayed, and either provide accurate delivery information or initiate a lost package investigation.

It's labor that operates on the same information a human agent would use, makes the same decisions a trained agent would make, and executes through the same systems—just 3.4x faster, at 1/14th the cost, 24/7.

The workflows we built aren't impressive because they're complex. They're impressive because they're reliable. Like a human agent starting a shift, the system wakes up on a schedule, checks for new tickets, routes them to the appropriate workflows, executes multi-step investigations and resolutions, and escalates only what genuinely requires human judgment.
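One scheduled pass over the inbox might look like the sketch below. The ticket shape, the `classify` callback, and the convention that a workflow returns `True` when it fully resolves a ticket are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def run_cycle(
    tickets: Iterable[dict],
    classify: Callable[[dict], str],
    workflows: Dict[str, Callable[[dict], bool]],
) -> Tuple[List[int], List[int]]:
    """Route each new ticket to a workflow; escalate whatever is left.

    Returns (resolved_ids, escalated_ids). A ticket is escalated when no
    workflow matches its intent or the workflow reports it could not finish.
    """
    resolved, escalated = [], []
    for ticket in tickets:
        handler = workflows.get(classify(ticket))
        if handler is not None and handler(ticket):
            resolved.append(ticket["id"])
        else:
            escalated.append(ticket["id"])
    return resolved, escalated
```

Escalation is the default outcome in this design: a ticket is only marked resolved when a known workflow explicitly says so.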

That's not traditional automation. That's AI labor.
