Amanda Dsouza (@amanda_dsouza) Twitter Tweets • TwiCopy

Amanda Dsouza

@amanda_dsouza

Applied research scientist @SnorkelAI. Previous: @heyjasperai, @fractalAI. MS (ML) @gtcomputing.

ID: 78238156

Link: https://amy12xx.github.io/ | Joined: 29-09-2009 06:44:59

399 Tweets

196 Followers

468 Following

Fred Sala

@fredsala

6 months ago

The coolest trend for AI is shifting from conversation to action—less talking and more doing. This is also a great opportunity for evals: we need benchmarks that measure utility, including in an economic sense. terminalbench is my favorite effort of this type!

Likes: 33 | Replies: 1 | Retweets: 18

Amanda Dsouza

@amanda_dsouza

6 months ago

Go hear Jason talk about our latest work on dynamic benchmarks at AGI House.

Likes: 2 | Replies: 0 | Retweets: 0

Kobie Crawford

@kobiewon

6 months ago

Great talk by Zhengyang Qi at today’s AGI House hackathon! Covered the BeTal paper from Snorkel AI Research. Paper: arxiv.org/abs/2510.25039

Likes: 5 | Replies: 2 | Retweets: 3

Alex Ratner

@ajratner

6 months ago

Static benchmarks as the gold standard of measurement will increasingly be a thing of the past. The future is dynamic benchmarks - regularly updated in response to evolving failure modes, error analyses, and objectives. Excited to see Snorkel AI Research leading the way here!

Likes: 20 | Replies: 1 | Retweets: 5

Amanda Dsouza

@amanda_dsouza

6 months ago

Very encouraging to see the gains from adaptive environments! Our work on automated benchmark design uses similar ideas of creating environments/benchmarks of controllable difficulty (or more generally, controllable target properties). Interestingly, our results on Tau-bench

Likes: 3 | Replies: 0 | Retweets: 0

Amanda Dsouza

@amanda_dsouza

6 months ago

Not all LLM queries are created alike — and turns out most can be served effectively by local models. That’s a win on several fronts (energy, privacy, cost..). Interesting results by hazyresearch on a dataset of 1M natural queries! Excited to see more from our collab as well

Likes: 9 | Replies: 1 | Retweets: 0

Snorkel AI

@snorkelai

5 months ago

We had a terrific interview with the creators of Terminal Bench 2.0. They unpack: • why terminals → more reliable and powerful agents • key design tradeoffs in TB 2.0 • Creating Harbor to enable eval, RL, and agent workflows at scale • lessons from building a 100+

Likes: 17 | Replies: 1 | Retweets: 5

Changho Shin @ ICLR 2025

@changho_shin_

5 months ago

Thrilled to present CARE, our confounder-aware aggregation method for LLM-as-a-judge, at NeurIPS Reliable ML from Unreliable Data! 📍 Upper Level Room 2 (TBC) 🕐 Poster Session 2, 1:15–2:15 👇 Thread

Likes: 23 | Replies: 1 | Retweets: 14

Changho Shin @ ICLR 2025

@changho_shin_

4 months ago

- One failure mode of coding agents: they often ignore reusability / extensibility. That’s exactly what SlopCodeBench is trying to capture. - Meanwhile, for slop coders (like me 😂), writing reusable and extensible code is learned through long-term punishment, by suffering

Likes: 11 | Replies: 0 | Retweets: 1

Shizhe He

@shizhehe

4 months ago

Holiday read from hazyresearch 🎄: How should you mix and match LLMs in an agentic system? How many bits of information about the context does an agent carry? We use information theory to understand how to choose and scale these models.

Likes: 359 | Replies: 9 | Retweets: 53

Amanda Dsouza

@amanda_dsouza

3 months ago

I'll be presenting our workshop paper at the Agentic Benchmarks and Applications Workshop at #AAAI2026 sites.google.com/view/aaba4et/h… Let's chat about benchmarking, and building high quality environments/datasets.

Likes: 9 | Replies: 0 | Retweets: 2

Justin Bauer

@realjustinbauer

3 months ago

Our paper “Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes” was accepted to #MLSys 2026! We introduce three procedurally generated, verifiable datasets—Counting, Graph, and Spatial Reasoning—to study RLVR under low-data / low-compute

Likes: 16 | Replies: 2 | Retweets: 7

vincent sunn chen

@vincentsunnchen

3 months ago

x.com/i/article/2021…

Likes: 313 | Replies: 16 | Retweets: 78

Armin

@arminpcm

a month ago

We are hiring for multiple junior #Research roles within our research team at Snorkel AI, focusing on the following areas: 1. Evaluations and benchmarking, particularly in domains such as legal and healthcare. 2. Post-training, with an emphasis on data valuation and curriculum

Likes: 310 | Replies: 19 | Retweets: 25