sanjana (@sanjanayed) Twitter Tweets • TwiCopy

sanjana

@sanjanayed

4 months ago

wonder what gpt-5 can do about the em dash apocalypse

Likes: 2, Replies: 0, Reposts: 0

Lot of back and forth on this app today about how "good" GPT-5 is. A real indicator will be putting it to the test on some traces, like Dylan Couzon does here. You can do it on your own data, run experiments side-by-side, and compare how other models match up against gpt-5. After all,

Likes: 9, Replies: 0, Reposts: 2

Arize AI

@arizeai

3 months ago

New 🍳 cookbooks 🍳 on building custom evals just dropped! Courtesy of sanjana, learn how to instrument tracing, generate realistic examples, and annotate them to create a benchmark dataset with Arize AI or @arizephoenix bit.ly/41yhusQ

Likes: 4, Replies: 0, Reposts: 1

Aparna Dhinakaran

@aparnadhinak

3 months ago

Last time, we introduced Prompt Learning (PL) — showing how it can boost your agents and models without touching weights or reward functions. We showed how PL leveraged natural language feedback (evals + critiques in plain English) to optimize a prompt for generating structured

Likes: 103, Replies: 3, Reposts: 18

Priyan Jindal

@priyanjindal

3 months ago

Just made a video tutorial on improving your agent evals: youtube.com/watch?v=zW0vYT… Evals are the guardrails for your LLM apps. They decide what’s “good” and what’s “bad.” But if your evaluator is wrong, it can quietly ship harm. Imagine: you have a recipe bot for people with

Likes: 10, Replies: 1, Reposts: 3

Mikyo

@mikeldking

3 months ago

arize-phoenix 11.23 gives you the ability to transfer traces to different projects for long-term storage. This now gives you 2 mechanisms (projects, datasets) through which you can preserve data that you want to revisit later.
- Set up a project manually that has a

Likes: 3, Replies: 0, Reposts: 2

DeepLearning.AI

@deeplearningai

3 months ago

Building a reliable RAG system doesn’t stop at retrieval and generation; you need observability too. In the Retrieval Augmented Generation course, you'll explore how LLM observability platforms can help you:
- Trace prompts through each step of the pipeline
- Log and evaluate

Likes: 237, Replies: 7, Reposts: 40

Aparna Dhinakaran

@aparnadhinak

3 months ago

Working with teams running LLM-as-a-judge evals, I’ve noticed a shocking amount of variance on when they use reasoning, CoT, and explanations. Here’s what we’ve seen works best: Explanations make judge models more reliable. They reduce variance across runs, improve agreement

Likes: 246, Replies: 5, Reposts: 26

sanjana

@sanjanayed

3 months ago

Great talks at the Y Combinator Context Engineering event tonight. Big takeaway: coding agents can work cleanly, but it’s all about context. Correctness, completeness, and size matter. The art lies in using as little of the context window as possible, but doing it tactically.

Likes: 14, Replies: 1, Reposts: 0

sanjana

@sanjanayed

3 months ago

This Dev-Agent-Lens is really cool. Keep your CLI as it is & get instant visibility into every Claude Code call. This means full traces, costs, tool calls, and failures. A really clear look into what's happening with Claude Code that I haven't seen anywhere else. Highly

Likes: 4, Replies: 0, Reposts: 0

sanjana

@sanjanayed

3 months ago

trace-level evals = seeing every reasoning step, tool call, and output quality check lined up in one place ✅

Likes: 7, Replies: 0, Reposts: 0