ProfileAgent · A/B Testing

Personalized Fashion
Recommendation at Scale

Using synthetic ProfileAgent identities as users, evaluators, and ground truth — so A/B testing needs no real user traffic.

100

Synthetic users sampled

2

Recommender engines

3

Evaluation dimensions

0

Real users needed

The Problem

Why Generic Recommendations Fall Short

Most recommenders rely on domain or role keywords — a lawyer gets "formal attire" regardless of who they are as an individual.
Conventional A/B testing requires real user traffic and weeks of data collection before results are meaningful.
User self-reported feedback is noisy, sparse, and expensive to collect at scale.
There is no standard way to test personalization quality before deploying to real users — risk is high.

Typical Baseline Output

"Based on your profile as a legal advisor in the Law & Legal domain, we suggest a classic, formal, professional wardrobe. Recommended items: Oxford shirt, Wool blazer, Oxford shoes…"

✗ Same output for 1,000 different lawyers.
✗ Ignores personality, lifestyle, location, values.
✗ No way to know if the user actually likes this.

Our Solution

ProfileAgent as Synthetic User Identity

Instead of waiting for real user feedback, we instantiate a ProfileAgent for each persona — it becomes that user, remembers their style values, and judges recommendations in their own voice.

🧬

Rich Identity

LLM infers first-person lifestyle statements from PersonaHub data and stores them as agent memories.

🎯

Self-Evaluation

The agent evaluates any recommendation it receives using agent.reply() — scoring from its own perspective.

⚡

Zero Traffic Needed

Run statistically tested A/B comparisons before launch, with no real users and no waiting for traffic to accumulate.

How It Works

Pipeline Architecture

1

📋

Sample Users

100 personas drawn from PersonaHub CSV — diverse roles, domains, locations.

›

2

🧬

Build Agent

ProfileAgent created. LLM infers 4–5 first-person style memories. Agent knows itself.

›

3

👗

Recommend

Group A: keyword engine reads agent.metadata.
Group B: LLM stylist reads metadata + memories.

›

4

💬

Agent Judges

Agent calls agent.reply() to self-score the recommendation on 3 dimensions.

›

5

📊

Report

Welch's t-test, Cohen's d, per-agent table, and group lift, reported with p-values and effect sizes.

Every agent is both the recipient of the recommendation and the evaluator — grounding quality measurement in the user's actual identity.
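The five steps above can be read as one loop over agents. The following is a minimal sketch, not the project's code: `run_ab`, `Recommender`, `Scorer`, and the stub functions are hypothetical names, agents are reduced to plain strings, and it assumes every agent receives both recommendations and that the composite is the mean of the three dimensions.

```python
from statistics import mean
from typing import Any, Callable, Dict, List

Recommender = Callable[[Any], str]                # agent -> recommendation text
Scorer = Callable[[Any, str], Dict[str, float]]   # (agent, rec) -> dimension scores

def run_ab(agents: List[Any], engines: Dict[str, Recommender], scorer: Scorer) -> List[dict]:
    """Give every agent each engine's recommendation and let the agent score it."""
    rows = []
    for agent in agents:
        row = {"agent": agent}
        for name, recommend in engines.items():
            scores = scorer(agent, recommend(agent))
            # Assumption: composite score = mean of the scored dimensions.
            row[name] = mean(list(scores.values()))
        rows.append(row)
    return rows

# Stand-ins so the sketch runs without an LLM (purely illustrative).
def stub_baseline(agent):
    return "formal wardrobe"

def stub_advanced(agent):
    return f"wardrobe tailored to {agent}"

def stub_scorer(agent, rec):
    hit = 8.0 if str(agent) in rec else 5.0
    return {"personalization": hit, "relevance": hit, "style_coherence": hit}

rows = run_ab(
    ["lawyer-01", "chef-02"],
    {"A": stub_baseline, "B": stub_advanced},
    stub_scorer,
)
```

Because the engines are passed in as plain callables, swapping in a third recommender only means adding another entry to the `engines` dictionary.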

Recommender Engines

Group A vs Group B

Group A — Baseline

Rule-based keyword engine · baseline_recommend(agent)

Reads agent.metadata: role, domain, traits
Maps domain → style buckets (e.g. "legal" → professional/formal)
Scores catalogue by keyword overlap
Zero LLM calls — pure heuristic
Equivalent to a production system with no personalization
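As an illustration of the keyword-overlap idea, here is a minimal sketch. The mapping table, the catalogue, and the function name `baseline_recommend_sketch` are hypothetical, and it takes a plain metadata dictionary rather than a ProfileAgent; the real `baseline_recommend(agent)` may differ.

```python
# Hypothetical domain -> style-bucket mapping and item catalogue.
DOMAIN_STYLES = {
    "legal": {"professional", "formal"},
    "tech": {"casual", "minimal"},
}
CATALOGUE = {
    "Oxford shirt": {"formal", "professional", "classic"},
    "Wool blazer": {"formal", "classic"},
    "Graphic hoodie": {"casual", "street"},
}

def baseline_recommend_sketch(metadata: dict, top_k: int = 2) -> list:
    """Rank catalogue items by keyword overlap with the domain's style bucket."""
    domain = metadata.get("domain", "").lower()
    wanted = set()
    for key, styles in DOMAIN_STYLES.items():
        if key in domain:  # e.g. "legal" matches "Law & Legal"
            wanted |= styles
    # Traits are treated as extra style keywords (an assumption of this sketch).
    wanted |= {trait.lower() for trait in metadata.get("traits", [])}
    ranked = sorted(
        CATALOGUE,
        key=lambda item: len(CATALOGUE[item] & wanted),
        reverse=True,
    )
    return ranked[:top_k]

picks = baseline_recommend_sketch(
    {"role": "legal advisor", "domain": "Law & Legal", "traits": []}
)
```

Any two personas with the same domain and traits get identical picks here, which is exactly the weakness the baseline is meant to expose.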

Group B — Advanced

External LLM stylist · advanced_recommend(agent)

Reads agent.metadata and recalled memory lines
Passes full context to LLM as stylist prompt via chat()
LLM reasons about agent's lifestyle and values per item
Does NOT call agent.reply() — external system perspective
Represents a semantically-aware recommendation engine
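A rough sketch of how the stylist prompt could be assembled follows. The prompt wording, the helper names, and the `chat` callable's signature (prompt string in, reply string out) are assumptions for illustration; the project's `advanced_recommend(agent)` and `chat()` may look different.

```python
from typing import Callable, List

def build_stylist_prompt(metadata: dict, memories: List[str], catalogue: List[str]) -> str:
    """Combine profile metadata and recalled memory lines into one stylist prompt."""
    profile = ", ".join(f"{key}: {value}" for key, value in metadata.items())
    memory_block = "\n".join(f"- {line}" for line in memories)
    items = ", ".join(catalogue)
    return (
        "You are a personal stylist.\n"
        f"Client profile: {profile}\n"
        f"What the client says about themselves:\n{memory_block}\n"
        f"Catalogue: {items}\n"
        "Pick three items and justify each against the client's lifestyle and values."
    )

def advanced_recommend_sketch(metadata, memories, catalogue, chat: Callable[[str], str]) -> str:
    # The stylist is an external system: it calls chat(), never agent.reply().
    return chat(build_stylist_prompt(metadata, memories, catalogue))

# An echo function stands in for the LLM so the sketch runs offline.
reply = advanced_recommend_sketch(
    {"role": "legal advisor", "domain": "Law & Legal"},
    ["I cycle to court most mornings."],
    ["Oxford shirt", "Wool blazer"],
    chat=lambda prompt: prompt,
)
```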

Both groups share the same evaluation step — the agent's own agent.reply() self-score — ensuring a fair comparison.

Key Innovation

ProfileAgent as the Ground-Truth Evaluator

Why the agent judges itself

A generic LLM judge has no knowledge of the specific user — it reads surface features only.
The ProfileAgent has internalized the persona through enriched memories and profile metadata.
First-person evaluation ("does this fit me?") is more discriminating than third-person scoring.
The agent's memory recall surfaces style preferences that inform its score.

# Agent evaluates in first person
query = (
  "A stylist sent me this recommendation:\n\n"
  f"{recommendation}\n\n"
  "Based on who I am — my role, my daily "
  "life, my values, and my taste — rate:\n"
  "  1. Personalization (1–10)\n"
  "  2. Relevance       (1–10)\n"
  "  3. Style Coherence (1–10)\n"
  "Reply in JSON only."
)
# score_with_agent() sends a query like this through agent.reply(),
# which draws on the agent's profile + memories.
scores = score_with_agent(agent, recommendation)

The agent's profile (role, domain, specialization) and memories (inferred lifestyle statements) are injected as background context into every reply — making its judgment deeply personal.
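Since the agent is asked to reply in JSON only, the reply still has to be parsed defensively. The sketch below shows one way to do that; the key names (`personalization`, `relevance`, `style_coherence`), the clamping to the 1 to 10 range, and the mean-based composite are assumptions, not the documented behavior of `score_with_agent`.

```python
import json
import re

DIMENSIONS = ("personalization", "relevance", "style_coherence")

def parse_scores(reply_text: str) -> dict:
    """Extract the JSON object from a reply and clamp each score to the 1-10 range."""
    match = re.search(r"\{.*\}", reply_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in agent reply")
    raw = json.loads(match.group(0))
    scores = {dim: min(10.0, max(1.0, float(raw[dim]))) for dim in DIMENSIONS}
    # Assumption: the composite is the plain mean of the three dimensions.
    scores["composite"] = sum(scores.values()) / len(DIMENSIONS)
    return scores

# The model wrapped its JSON in chatter and overshot one scale.
example = parse_scores(
    'Sure! {"personalization": 8, "relevance": 7, "style_coherence": 12}'
)
```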

Expected Results

A/B Test Outcome (100 Synthetic Agents)

Metric	Group A — Baseline	Group B — Advanced	Lift	p-value	Cohen's d
Personalization	4.8	7.6	+2.8	0.0001 **	1.42
Relevance	5.2	7.3	+2.1	0.0003 **	1.18
Style Coherence	5.5	7.1	+1.6	0.0012 **	0.97
Composite	5.2	7.3	+2.1	0.0001 **	1.25

[Bar chart: Group A vs Group B mean scores. Personalization 4.8 vs 7.6; Relevance 5.2 vs 7.3.]

✓ Advanced recommender wins 78 of 100 head-to-head comparisons
✓ All lifts are statistically significant (p < 0.01)
✓ Cohen's d ranges from 0.97 to 1.42, above the conventional 0.8 threshold for a large effect
Simulated values shown — run the script for real numbers.
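For readers who want to see what sits behind the p-value and Cohen's d columns, here is a standard-library sketch of the two statistics. The function names and the toy score lists are illustrative only and are not results from the experiment. The p-value itself needs the t distribution; `scipy.stats.ttest_ind(a, b, equal_var=False)` computes Welch's test including the p-value.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, degrees of freedom) for Welch's unequal-variance t-test."""
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(b) - mean(a)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(b) - mean(a)) / pooled

# Toy scores for five agents (not experiment output).
group_a = [4, 5, 5, 6, 5]
group_b = [7, 8, 7, 8, 7]

t, df = welch_t(group_a, group_b)
d = cohens_d(group_a, group_b)
# Head-to-head wins assume the same agent scored both recommendations.
wins = sum(y > x for x, y in zip(group_a, group_b))
```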

Value Proposition

Why Synthetic ProfileAgent Users?

🚀

Speed

Run a full 100-user A/B test in minutes, not weeks. No traffic ramp-up required.

🌍

Diversity

PersonaHub provides 200k+ diverse personas across domains, cultures, roles, and locations.

🔒

Privacy-Safe

No real user data. No GDPR exposure. Fully synthetic identities derived from public persona descriptions.

🔁

Reproducible

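Fixed random_state=42 ensures the same 100 users every run, so the sample is deterministic and auditable. LLM-generated memories and scores can still vary between runs unless the model's sampling is fixed as well.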
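Seeded sampling can be sketched with the standard library alone. The `random_state` name suggests a pandas-style sampler in the real script; the version below is a stand-in using `random.Random`, and the column names and in-memory CSV are hypothetical.

```python
import csv
import io
import random

def sample_personas(rows, n, seed=42):
    """Draw n rows without replacement using a dedicated, seeded generator."""
    return random.Random(seed).sample(rows, n)

# In-memory stand-in for the persona CSV; the columns are made up for the sketch.
csv_text = "persona_id,role\n" + "\n".join(f"{i},role-{i}" for i in range(1000))
rows = list(csv.DictReader(io.StringIO(csv_text)))

first = sample_personas(rows, 100)
second = sample_personas(rows, 100)
```

Using a dedicated `random.Random(seed)` instance rather than the module-level generator keeps the draw independent of any other random calls made elsewhere in the pipeline.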

📐

Statistically Rigorous

Welch's t-test + Cohen's d. No HARKing (hypothesizing after the results are known). Results include p-values and effect sizes out of the box.

🧩

Composable

Drop in any recommender that accepts a ProfileAgent. Evaluation pipeline stays unchanged.

Get Started

Run the Experiment Today

One command. 100 synthetic users. Statistically validated results — before a single real user sees your recommender.

# Run the full A/B experiment
python pipeline/fashion_ab_test.py \
  --n_users 100 \
  --output  results/fashion_ab

# Output: per-agent table + aggregate stats + results JSON
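The two flags in the command above could be parsed along these lines. This is a sketch with `argparse`; the defaults, help strings, and the reading of `--output` as a path prefix are assumptions rather than the script's documented behavior.

```python
import argparse

def parse_args(argv=None):
    """Parse the two flags shown in the command; defaults are assumed."""
    parser = argparse.ArgumentParser(
        description="Fashion A/B test with synthetic agents"
    )
    parser.add_argument("--n_users", type=int, default=100,
                        help="number of personas to sample")
    parser.add_argument("--output", default="results/fashion_ab",
                        help="output path prefix (assumed meaning)")
    return parser.parse_args(argv)

# Passing an explicit list makes the parser testable without a shell.
args = parse_args(["--n_users", "10", "--output", "results/demo"])
```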

Recommenders

utils/recommendation.py

A/B Test Harness

pipeline/fashion_ab_test.py

User Identities

data/personahub_sample.csv

Extend the pattern to any vertical: news, e-commerce, music, travel — swap the catalogue, keep the agents.
