Using synthetic ProfileAgent identities as users, evaluators, and ground truth — so A/B testing needs no real user traffic.
✗ Same output for 1,000 different lawyers.
✗ Ignores personality, lifestyle, location, values.
✗ No way to know if the user actually likes this.
Instead of waiting for real user feedback, we instantiate a ProfileAgent for each persona — it becomes that user, remembers their style values, and judges recommendations in their own voice.
LLM infers first-person lifestyle statements from PersonaHub data and stores them as agent memories.
The agent evaluates any recommendation it receives using agent.reply() — scoring from its own perspective.
Run statistically rigorous A/B tests before launch: no real users, no waiting, no risk.
100 personas drawn from PersonaHub CSV — diverse roles, domains, locations.
ProfileAgent created. LLM infers 4–5 first-person style memories. Agent knows itself.
Group A: keyword engine reads agent.metadata.
Group B: LLM stylist reads metadata + memories.
Agent calls agent.reply() to self-score the recommendation on 3 dimensions.
Welch t-test, Cohen's d, per-agent table, group lift — statistically rigorous output.
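The statistics in the final step can be sketched with the standard library alone. This is a minimal illustration, not the project's script: the function names `welch_t` and `cohens_d` and the sample scores are assumptions made for the example. A p-value additionally needs the t-distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)  # squared standard errors
    t = (mean(b) - mean(a)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation of both groups."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / math.sqrt(pooled_var)


# Hypothetical self-scores (1-10), for illustration only
group_a = [4.0, 5.0, 6.0, 5.0, 4.0]
group_b = [7.0, 8.0, 7.0, 6.0, 8.0]

t, df = welch_t(group_a, group_b)
d = cohens_d(group_a, group_b)
lift = mean(group_b) - mean(group_a)
print(f"lift={lift:.2f}  t={t:.2f}  df={df:.1f}  d={d:.2f}")
```

Welch's variant is the safer default here because the two groups need not share a variance. Cohen's d reports the lift in units of pooled standard deviation, so it stays comparable across the three scoring dimensions.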
Every agent is both the recipient of the recommendation and the evaluator — grounding quality measurement in the user's actual identity.
Rule-based keyword engine · baseline_recommend(agent)
agent.metadata: role, domain, traits
External LLM stylist · advanced_recommend(agent)
agent.metadata and recalled memory lines
chat() · agent.reply(): external system perspective
Both groups share the same evaluation step — the agent's own agent.reply() self-score — ensuring a fair comparison.
The agent's profile (role, domain, specialization) and memories (inferred lifestyle statements) are injected as background context into every reply — making its judgment deeply personal.
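One way the context injection and self-scoring could look is sketched below. Everything here is an assumption for illustration: `PersonaStub` is a stand-in, not the real ProfileAgent class, the prompt wording is invented, and the source does not show how `agent.reply()` accepts input or formats its answer. The three dimension names come from the results table, and the unweighted composite is a guess at how the composite row is formed.

```python
import json
from dataclasses import dataclass, field
from statistics import mean

DIMENSIONS = ("personalization", "relevance", "style_coherence")


@dataclass
class PersonaStub:
    """Hypothetical stand-in for a ProfileAgent; not the real class."""
    metadata: dict
    memories: list = field(default_factory=list)


def build_context(agent):
    """Render profile fields and memories as background context for a reply."""
    profile = "\n".join(f"{key}: {value}" for key, value in agent.metadata.items())
    memories = "\n".join(f"- {line}" for line in agent.memories)
    return f"Profile:\n{profile}\n\nMemories:\n{memories}"


def build_scoring_prompt(agent, recommendation):
    """Ask for a first-person 1-10 score on each dimension, as JSON."""
    keys = ", ".join(DIMENSIONS)
    return (
        f"{build_context(agent)}\n\n"
        f"Recommendation:\n{recommendation}\n\n"
        f"Score it from your own perspective (1-10) as JSON with keys: {keys}."
    )


def parse_scores(reply_text):
    """Parse the JSON reply and add an unweighted composite (an assumption)."""
    raw = json.loads(reply_text)
    scores = {dim: float(raw[dim]) for dim in DIMENSIONS}
    scores["composite"] = mean(list(scores.values()))
    return scores


agent = PersonaStub(
    metadata={"role": "lawyer", "domain": "corporate law"},
    memories=["I prefer understated, tailored clothing."],
)
prompt = build_scoring_prompt(agent, "Navy wool blazer with grey trousers")
# A real run would send the prompt to the agent; here a canned reply is parsed.
scores = parse_scores('{"personalization": 8, "relevance": 7, "style_coherence": 6}')
print(scores)
```

Keeping the context builder separate from the scoring prompt means both groups can reuse the same evaluation step unchanged, which is what makes the comparison fair.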
| Metric | Group A — Baseline | Group B — Advanced | Lift | p-value | Cohen's d |
|---|---|---|---|---|---|
| Personalization | 4.8 | 7.6 | +2.8 | 0.0001 ** | 1.42 |
| Relevance | 5.2 | 7.3 | +2.1 | 0.0003 ** | 1.18 |
| Style Coherence | 5.5 | 7.1 | +1.6 | 0.0012 ** | 0.97 |
| Composite | 5.2 | 7.3 | +2.1 | 0.0001 ** | 1.25 |
✓ Advanced recommender wins in 78 / 100 head-to-head comparisons
✓ All lifts are statistically significant (p < 0.01)
✓ Cohen's d ≥ 0.97 on every metric, above the conventional 0.8 threshold for a large effect
Simulated values shown — run the script for real numbers.
Run a full 100-user A/B test in minutes, not weeks. No traffic ramp-up required.
PersonaHub provides 200k+ diverse personas across domains, cultures, roles, and locations.
No real user data. No GDPR exposure. Fully synthetic identities derived from public persona descriptions.
Fixed random_state=42 selects the same 100 personas on every run, so the sample is reproducible and auditable.
Welch t-test + Cohen's d. No HARKing. Results include p-values and effect sizes out of the box.
Drop in any recommender that accepts a ProfileAgent. Evaluation pipeline stays unchanged.
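Two of the properties above, the seeded sample and the drop-in recommender, can be sketched together. This is a toy under stated assumptions: agents are plain dicts rather than real ProfileAgent objects, `sample_agents`, `run_arm`, and both recommenders are hypothetical names, and the standard library's `random.Random` stands in for whatever sampler the script uses (only the `random_state` parameter name is taken from the text).

```python
import random
from typing import Any, Callable, Dict, List

# Hypothetical types: an "agent" is a plain dict here, not a real ProfileAgent.
Agent = Dict[str, Any]
Recommender = Callable[[Agent], str]
Evaluator = Callable[[Agent, str], float]


def sample_agents(pool: List[Agent], n: int, random_state: int = 42) -> List[Agent]:
    """Draw n agents with a private seeded generator, so reruns match."""
    rng = random.Random(random_state)  # isolated from the global random state
    return rng.sample(pool, n)


def run_arm(agents: List[Agent], recommend: Recommender, evaluate: Evaluator) -> List[float]:
    """Score one recommender; the evaluation step is identical for every arm."""
    return [evaluate(agent, recommend(agent)) for agent in agents]


def keyword_recommender(agent: Agent) -> str:
    """Toy baseline: looks only at the role field."""
    return f"classic outfit for a {agent['role']}"


def memory_recommender(agent: Agent) -> str:
    """Toy alternative: also uses the first stored memory."""
    return f"outfit for a {agent['role']} who says: {agent['memories'][0]}"


def toy_evaluate(agent: Agent, recommendation: str) -> float:
    """Placeholder for the agent's self-score: rewards mentioning a memory."""
    return 8.0 if agent["memories"][0] in recommendation else 5.0


# Synthetic pool standing in for the persona file
pool = [{"role": f"role_{i}", "memories": [f"memory {i}"]} for i in range(1000)]
agents = sample_agents(pool, 100)

scores_a = run_arm(agents, keyword_recommender, toy_evaluate)
scores_b = run_arm(agents, memory_recommender, toy_evaluate)
print(sum(scores_a) / len(scores_a), sum(scores_b) / len(scores_b))
```

Because a recommender is just a callable from agent to text, a new one can be swapped in without touching `run_arm` or the evaluator. The seed fixes which personas are drawn; it does not by itself make LLM-generated scores repeatable.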
One command. 100 synthetic users. Statistically validated results — before a single real user sees your recommender.
utils/recommendation.py
pipeline/fashion_ab_test.py
data/personahub_sample.csv
Extend the pattern to any vertical (news, e-commerce, music, travel): swap the catalogue, keep the agents.