ProfileAgent · A/B Testing

Personalized Fashion
Recommendation at Scale

Using synthetic ProfileAgent identities as users, evaluators, and ground truth — so A/B testing needs no real user traffic.

100  Synthetic users sampled
2  Recommender engines
3  Evaluation dimensions
0  Real users needed

The Problem

Why Generic Recommendations Fall Short

  • Most recommenders rely on domain or role keywords — a lawyer gets "formal attire" regardless of who they are as an individual.
  • A/B testing requires real user traffic and weeks of data collection before results are meaningful.
  • User self-reported feedback is noisy, sparse, and expensive to collect at scale.
  • There is no standard way to test personalization quality before deploying to real users — risk is high.

Typical Baseline Output

"Based on your profile as a legal advisor in the Law & Legal domain, we suggest a classic, formal, professional wardrobe. Recommended items: Oxford shirt, Wool blazer, Oxford shoes…"

✗  Same output for 1,000 different lawyers.
✗  Ignores personality, lifestyle, location, values.
✗  No way to know if the user actually likes this.

Our Solution

ProfileAgent as Synthetic User Identity

Instead of waiting for real user feedback, we instantiate a ProfileAgent for each persona — it becomes that user, remembers their style values, and judges recommendations in their own voice.

🧬

Rich Identity

LLM infers first-person lifestyle statements from PersonaHub data and stores them as agent memories.

🎯

Self-Evaluation

The agent evaluates any recommendation it receives using agent.reply() — scoring from its own perspective.

Zero Traffic Needed

Run A/B tests with built-in significance testing before launch, with no real users and no waiting for traffic.

How It Works

Pipeline Architecture

1
📋

Sample Users

100 personas drawn from PersonaHub CSV — diverse roles, domains, locations.

2
🧬

Build Agent

ProfileAgent created. LLM infers 4–5 first-person style memories. Agent knows itself.

3
👗

Recommend

Group A: keyword engine reads agent.metadata.
Group B: LLM stylist reads metadata + memories.

4
💬

Agent Judges

Agent calls agent.reply() to self-score the recommendation on 3 dimensions.

5
📊

Report

Welch t-test, Cohen's d, per-agent table, group lift — statistically rigorous output.

Every agent is both the recipient of the recommendation and the evaluator — grounding quality measurement in the user's actual identity.
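Step 1 of the pipeline above (reproducible sampling) can be sketched with the standard library alone. This is a minimal illustration, not the script's actual code: the helper name sample_personas and the CSV column names are hypothetical, and the seed of 42 mirrors the random_state=42 mentioned elsewhere in this deck. If the script uses pandas, the equivalent call would presumably be DataFrame.sample(n=100, random_state=42).

```python
import csv
import io
import random

def sample_personas(csv_file, n=100, seed=42):
    """Return n rows drawn without replacement, reproducibly for a given seed."""
    rows = list(csv.DictReader(csv_file))
    if n > len(rows):
        raise ValueError(f"requested {n} personas but only {len(rows)} available")
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return rng.sample(rows, n)

# Tiny in-memory stand-in for the PersonaHub CSV; column names are hypothetical.
demo_csv = "persona_id,role,domain\n" + "\n".join(
    f"{i},role_{i},domain_{i % 5}" for i in range(500)
)
first = sample_personas(io.StringIO(demo_csv), n=100)
second = sample_personas(io.StringIO(demo_csv), n=100)
```

Calling the helper twice with the same seed yields the same 100 rows in the same order, which is what makes the user sample auditable.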

Recommender Engines

Group A vs Group B

Group A — Baseline

Rule-based keyword engine  ·  baseline_recommend(agent)

  • Reads agent.metadata: role, domain, traits
  • Maps domain → style buckets (e.g. "legal" → professional/formal)
  • Scores catalogue by keyword overlap
  • Zero LLM calls — pure heuristic
  • Equivalent to a production system with no personalization
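The keyword-overlap idea behind Group A can be sketched as follows. Everything here is an assumption made for illustration: the FakeAgent class, the DOMAIN_STYLES map, the catalogue entries, and the function name baseline_recommend_sketch are hypothetical and are not taken from utils/recommendation.py.

```python
from dataclasses import dataclass, field

@dataclass
class FakeAgent:
    """Hypothetical stand-in exposing only the metadata dict the engine reads."""
    metadata: dict = field(default_factory=dict)

# Hypothetical domain -> style-bucket map and catalogue, for illustration only.
DOMAIN_STYLES = {
    "legal": {"professional", "formal"},
    "arts": {"creative", "casual"},
}
CATALOGUE = [
    {"name": "Wool blazer", "tags": {"formal", "professional", "classic"}},
    {"name": "Oxford shirt", "tags": {"formal", "classic"}},
    {"name": "Linen overshirt", "tags": {"casual", "creative"}},
    {"name": "Canvas sneakers", "tags": {"casual"}},
]

def baseline_recommend_sketch(agent, top_k=2):
    """Rank catalogue items by keyword overlap with the agent's domain and traits."""
    domain = str(agent.metadata.get("domain", "")).lower()
    keywords = set()
    for key, styles in DOMAIN_STYLES.items():
        if key in domain:
            keywords |= styles
    keywords |= {str(t).lower() for t in agent.metadata.get("traits", [])}
    scored = [(len(item["tags"] & keywords), item["name"]) for item in CATALOGUE]
    # Sort by overlap descending, then by name, for a deterministic order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:top_k] if score > 0]

lawyer = FakeAgent({"role": "legal advisor", "domain": "Law & Legal", "traits": ["classic"]})
other_lawyer = FakeAgent({"role": "paralegal", "domain": "Law & Legal", "traits": ["classic"]})
picks = baseline_recommend_sketch(lawyer)
other_picks = baseline_recommend_sketch(other_lawyer)
```

Two agents with the same domain and traits receive identical picks regardless of who they are, which is the weakness the baseline is meant to represent.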

Group B — Advanced

External LLM stylist  ·  advanced_recommend(agent)

  • Reads agent.metadata and recalled memory lines
  • Passes full context to LLM as stylist prompt via chat()
  • LLM reasons about agent's lifestyle and values per item
  • Does NOT call agent.reply() — external system perspective
  • Represents a semantically-aware recommendation engine
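A rough sketch of how Group B might assemble its stylist prompt is shown below. The source does not show the signature of chat() or the prompt wording, so the chat argument here is simply any callable that maps a prompt string to a reply string, and build_stylist_prompt, advanced_recommend_sketch, and stub_chat are hypothetical names.

```python
def build_stylist_prompt(metadata, memories, catalogue):
    """Assemble one prompt string from profile metadata, memory lines, and item names."""
    lines = ["You are a personal stylist choosing items for one client.", "", "Profile:"]
    lines += [f"- {key}: {value}" for key, value in metadata.items()]
    lines += ["", "In the client's own words:"]
    lines += [f"- {memory}" for memory in memories]
    lines += ["", "Catalogue:"]
    lines += [f"- {item}" for item in catalogue]
    lines += ["", "Pick three items and explain why each fits this client."]
    return "\n".join(lines)

def advanced_recommend_sketch(metadata, memories, catalogue, chat):
    """Send the full context to an external LLM; the agent itself is never asked."""
    return chat(build_stylist_prompt(metadata, memories, catalogue))

def stub_chat(prompt):
    """Offline stand-in for a real LLM call, so the sketch runs without a model."""
    return f"stub reply to a {len(prompt.splitlines())}-line prompt"

metadata = {"role": "legal advisor", "domain": "Law & Legal"}
memories = ["I cycle to court most mornings.", "I avoid anything that needs dry cleaning."]
catalogue = ["Wool blazer", "Oxford shirt", "Stretch chinos"]
prompt = build_stylist_prompt(metadata, memories, catalogue)
reply = advanced_recommend_sketch(metadata, memories, catalogue, chat=stub_chat)
```

Passing chat in as a parameter keeps the recommender testable offline and makes it explicit that this engine sits outside the agent.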

Both groups share the same evaluation step — the agent's own agent.reply() self-score — ensuring a fair comparison.

Key Innovation

ProfileAgent as the Ground-Truth Evaluator

Why the agent judges itself

  • A generic LLM judge has no knowledge of the specific user — it reads surface features only.
  • The ProfileAgent has internalized the persona through enriched memories and profile metadata.
  • First-person evaluation ("does this fit me?") is more discriminating than third-person scoring.
  • The agent's memory recall surfaces style preferences that inform its score.

# Agent evaluates in first person
query = (
    "A stylist sent me this recommendation:\n\n"
    f"{recommendation}\n\n"
    "Based on who I am — my role, my daily "
    "life, my values, and my taste — rate:\n"
    " 1. Personalization (1–10)\n"
    " 2. Relevance (1–10)\n"
    " 3. Style Coherence (1–10)\n"
    'Reply in JSON only.'
)

scores = score_with_agent(agent, rec)
# agent.reply() draws on profile + memories

The agent's profile (role, domain, specialization) and memories (inferred lifestyle statements) are injected as background context into every reply — making its judgment deeply personal.
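Because the agent is asked to reply in JSON only, the harness needs to turn that reply into numeric scores. The following is one possible way to do it, assuming the reply uses the keys personalization, relevance, and style_coherence and that the composite is a plain mean of the three. Neither assumption is confirmed by the source, and parse_scores is a hypothetical helper.

```python
import json
import re

# Assumed key names; the actual JSON schema is not shown in the source.
DIMENSIONS = ("personalization", "relevance", "style_coherence")

def parse_scores(reply_text):
    """Extract the JSON object from a reply and return validated 1-10 scores."""
    match = re.search(r"\{.*\}", reply_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    raw = json.loads(match.group(0))
    scores = {}
    for dim in DIMENSIONS:
        value = float(raw[dim])
        if not 1 <= value <= 10:
            raise ValueError(f"{dim} out of range: {value}")
        scores[dim] = value
    # Composite as an unweighted mean is an assumption made for this sketch.
    scores["composite"] = round(sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS), 2)
    return scores

reply = 'Sure! {"personalization": 8, "relevance": 7, "style_coherence": 6}'
parsed = parse_scores(reply)
```

The regular expression tolerates stray text around the JSON object, which LLM replies sometimes include even when told not to.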

Expected Results

A/B Test Outcome (100 Synthetic Agents)

Metric            Group A (Baseline)   Group B (Advanced)   Lift   p-value     Cohen's d
Personalization   4.8                  7.6                  +2.8   0.0001 **   1.42
Relevance         5.2                  7.3                  +2.1   0.0003 **   1.18
Style Coherence   5.5                  7.1                  +1.6   0.0012 **   0.97
Composite         5.2                  7.3                  +2.1   0.0001 **   1.25

[Chart: mean Personalization (4.8 vs 7.6) and Relevance (5.2 vs 7.3) scores, Group A vs Group B]

  • Advanced recommender wins in 78 / 100 head-to-head comparisons
  • All lifts are statistically significant (p < 0.01)
  • Cohen's d ≥ 0.8 on every metric (0.97 to 1.42) = large effect size
Simulated values shown — run the script for real numbers.
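
The statistics in the table can be computed along these lines. This is a standard-library sketch of the Welch t statistic, its Welch-Satterthwaite degrees of freedom, and Cohen's d with a pooled standard deviation; the sample scores are made up for illustration, and the helper names are hypothetical. The p-value needs the t distribution, which the standard library lacks; scipy.stats.ttest_ind(a, b, equal_var=False) is one way to obtain it.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom) for b vs a."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(b) - mean(a)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / math.sqrt(pooled_var)

# Made-up scores for illustration; not results from the experiment.
group_a = [4, 5, 5, 6, 5]
group_b = [7, 8, 7, 9, 8]
t_stat, dof = welch_t(group_a, group_b)
effect = cohens_d(group_a, group_b)
lift = mean(group_b) - mean(group_a)
```

Welch's test does not assume equal variances in the two groups, which is a sensible default when one recommender produces more spread in scores than the other.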

Value Proposition

Why Synthetic ProfileAgent Users?

🚀

Speed

Run a full 100-user A/B test in minutes, not weeks. No traffic ramp-up required.

🌍

Diversity

PersonaHub provides 200k+ diverse personas across domains, cultures, roles, and locations.

🔒

Privacy-Safe

No real user data. No GDPR exposure. Fully synthetic identities derived from public persona descriptions.

🔁

Reproducible

Fixed random_state=42 selects the same 100 users every run, so the sample is deterministic and auditable. LLM-generated memories and scores may still vary between runs.

📐

Statistically Rigorous

Welch t-test + Cohen's d. No HARKing. Results include p-values and effect sizes out of the box.

🧩

Composable

Drop in any recommender that accepts a ProfileAgent. Evaluation pipeline stays unchanged.
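
The composability claim can be illustrated with a small harness in which recommenders are plain callables and the scorer is shared. This is a sketch under stated assumptions: run_ab, the type aliases, and the stubs are hypothetical, agents are represented as dicts rather than real ProfileAgent objects, and the sketch assumes every agent receives a recommendation from each engine.

```python
from typing import Callable, Dict, List

Recommender = Callable[[dict], str]             # agent -> recommendation text
Scorer = Callable[[dict, str], Dict[str, float]]  # (agent, recommendation) -> scores

def run_ab(agents: List[dict], recommenders: Dict[str, Recommender], scorer: Scorer):
    """Score every recommender on every agent using one shared scorer."""
    results = {name: [] for name in recommenders}
    for agent in agents:
        for name, recommend in recommenders.items():
            results[name].append(scorer(agent, recommend(agent)))
    return results

# Stubs standing in for real agents, engines, and the agent's self-scoring step.
agents = [{"role": "lawyer"}, {"role": "chef"}]
recommenders = {
    "A": lambda agent: "Oxford shirt",
    "B": lambda agent: f"Outfit chosen for a {agent['role']}",
}

def stub_scorer(agent, rec):
    """Toy rule: a recommendation that mentions the agent's role scores higher."""
    return {"personalization": 8.0 if agent["role"] in rec else 3.0}

results = run_ab(agents, recommenders, stub_scorer)
```

Adding a third engine is then a one-line change to the recommenders dict; the loop and the scorer stay untouched.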

Get Started

Run the Experiment Today

One command. 100 synthetic users. Statistically validated results — before a single real user sees your recommender.

# Run the full A/B experiment
python pipeline/fashion_ab_test.py \
    --n_users 100 \
    --output results/fashion_ab

# Output: per-agent table + aggregate stats + results JSON

Recommenders: utils/recommendation.py
A/B Test Harness: pipeline/fashion_ab_test.py
User Identities: data/personahub_sample.csv

Extend the pattern to any vertical: news, e-commerce, music, travel — swap the catalogue, keep the agents.