
sanjana
@sanjanayed
Berkeley EECS, Arize Phoenix
ID: 1901698598796214272
17-03-2025 18:14:23
38 Tweets
32 Followers
27 Following


Lot of back and forth on this app today about how "good" GPT-5 is. A real indicator will be putting it to the test on some traces, like Dylan Couzon does here. You can do it on your own data, run experiments side-by-side, and compare how other models match up against GPT-5. After all,



Just made a video tutorial on improving your agent evals: youtube.com/watch?v=zW0vYT… Evals are the guardrails for your LLM apps. They decide what’s “good” and what’s “bad.” But if your evaluator is wrong, it can quietly ship harm. Imagine: you have a recipe bot for people with

arize-phoenix 11.23 gives you the ability to transfer traces to different projects for long-term storage. This now gives you 2 mechanisms (projects, datasets) through which you can preserve data that you want to revisit later. - Set up a project manually that has a



Great talks at the Y Combinator Context Engineering event tonight. Big takeaway: coding agents can work cleanly, but it’s all about context. Correctness, completeness, and size matter. The art lies in using as little of the context window as possible, but doing it tactically.


