Amanda Dsouza (@amanda_dsouza)'s Twitter Profile
Amanda Dsouza

@amanda_dsouza

Applied research scientist @SnorkelAI. Previous: @heyjasperai, @fractalAI. MS (ML) @gtcomputing.

ID: 78238156

Link: https://amy12xx.github.io/
Joined: 29-09-2009 06:44:59

399 Tweets

196 Followers

468 Following

Fred Sala (@fredsala)

The coolest trend for AI is shifting from conversation to action—less talking and more doing. This is also a great opportunity for evals: we need benchmarks that measure utility, including in an economic sense. terminalbench is my favorite effort of this type!

Alex Ratner (@ajratner)

Static benchmarks as the gold standard of measurement will increasingly be a thing of the past. The future is dynamic benchmarks - regularly updated in response to evolving failure modes, error analyses, and objectives. Excited to see Snorkel AI Research leading the way here!

Amanda Dsouza (@amanda_dsouza)

Very encouraging to see the gains from adaptive environments! Our work on automated benchmark design uses similar ideas of creating environments/benchmarks of controllable difficulty (or more generally, controllable target properties). Interestingly, our results on Tau-bench

Amanda Dsouza (@amanda_dsouza)

Not all LLM queries are created alike — and turns out most can be served effectively by local models. That’s a win on several fronts (energy, privacy, cost..). Interesting results by hazyresearch on a dataset of 1M natural queries! Excited to see more from our collab as well

Snorkel AI (@snorkelai)

We had a terrific interview with the creators of Terminal Bench 2.0. They unpack: • why terminals → more reliable and powerful agents • key design tradeoffs in TB 2.0 • Creating Harbor to enable eval, RL, and agent workflows at scale • lessons from building a 100+

Changho Shin @ ICLR 2025 (@changho_shin_)

Thrilled to present CARE, our confounder-aware aggregation method for LLM-as-a-judge, at NeurIPS Reliable ML from Unreliable Data! 📍 Upper Level Room 2 (TBC) 🕐 Poster Session 2, 1:15–2:15 👇 Thread

Changho Shin @ ICLR 2025 (@changho_shin_)

- One failure mode of coding agents: they often ignore reusability / extensibility. That’s exactly what SlopCodeBench is trying to capture.
- Meanwhile, for slop coders (like me 😂), writing reusable and extensible code is learned through long-term punishment, by suffering

Shizhe He (@shizhehe)

Holiday read from hazyresearch 🎄: How should you mix and match LLMs in an agentic system? How many bits of information about the context does an agent carry? We use information theory to understand how to choose and scale these models.

Amanda Dsouza (@amanda_dsouza)

I'll be presenting our workshop paper at the Agentic Benchmarks and Applications Workshop at #AAAI2026 sites.google.com/view/aaba4et/h… Let's chat about benchmarking and building high-quality environments/datasets.

Justin Bauer (@realjustinbauer)

Our paper “Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes” was accepted to #MLSys 2026! We introduce three procedurally generated, verifiable datasets—Counting, Graph, and Spatial Reasoning—to study RLVR under low-data / low-compute

Armin (@arminpcm)

We are hiring for multiple junior #Research roles within our research team at Snorkel AI, focusing on the following areas: 1. Evaluations and benchmarking, particularly in domains such as legal and healthcare. 2. Post-training, with an emphasis on data valuation and curriculum