sanjana (@sanjanayed) 's Twitter Profile
sanjana

@sanjanayed

Berkeley EECS, Arize Phoenix

ID: 1901698598796214272

Joined: 17-03-2025 18:14:23

38 Tweets

32 Followers

27 Following

sanjana (@sanjanayed) 's Twitter Profile Photo

Lot of back and forth on this app today about how "good" GPT-5 is. A real indicator will be putting it to the test on some traces, like Dylan Couzon does here. You can do it on your own data, run experiments side-by-side, and compare how other models match up against GPT-5. After all,

Arize AI (@arizeai) 's Twitter Profile Photo

New 🍳 cookbooks 🍳 on building custom evals just dropped! Courtesy of sanjana, learn how to instrument tracing, generate realistic examples, and annotate them to create a benchmark dataset with Arize AI or @arizephoenix bit.ly/41yhusQ

Aparna Dhinakaran (@aparnadhinak) 's Twitter Profile Photo

Last time, we introduced Prompt Learning (PL) — showing how it can boost your agents and models without touching weights or reward functions. We showed how PL leveraged natural language feedback (evals + critiques in plain English) to optimize a prompt for generating structured

Priyan Jindal (@priyanjindal) 's Twitter Profile Photo

Just made a video tutorial on improving your agent evals: youtube.com/watch?v=zW0vYT… Evals are the guardrails for your LLM apps. They decide what’s “good” and what’s “bad.” But if your evaluator is wrong, it can quietly ship harm. Imagine: you have a recipe bot for people with

Mikyo (@mikeldking) 's Twitter Profile Photo

arize-phoenix 11.23 gives you the ability to transfer traces to different projects for long-term storage. This now gives you 2 mechanisms (projects, datasets) through which you can preserve data that you want to revisit later. - Set up a project manually that has a

DeepLearning.AI (@deeplearningai) 's Twitter Profile Photo

Building a reliable RAG system doesn’t stop at retrieval and generation; you need observability too. In the Retrieval Augmented Generation course, you'll explore how LLM observability platforms can help you: - Trace prompts through each step of the pipeline - Log and evaluate

Aparna Dhinakaran (@aparnadhinak) 's Twitter Profile Photo

Working with teams running LLM-as-a-judge evals, I’ve noticed a shocking amount of variance on when they use reasoning, CoT, and explanations. Here’s what we’ve seen works best: Explanations make judge models more reliable. They reduce variance across runs, improve agreement

sanjana (@sanjanayed) 's Twitter Profile Photo

Great talks at the Y Combinator Context Engineering event tonight. Big takeaway: coding agents can work cleanly, but it’s all about context. Correctness, completeness, and size matter. The art lies in using as little of the context window as possible, but doing it tactically.

sanjana (@sanjanayed) 's Twitter Profile Photo

This Dev-Agent-Lens is really cool. Keep your CLI as it is & get instant visibility into every Claude Code call. This means full traces, costs, tool calls, and failures. A really clear look into what's happening with Claude Code that I haven't seen anywhere else. Highly