Julia Neagu (@juliaaneagu) 's Twitter Profile
Julia Neagu

@juliaaneagu

CEO & Co-Founder @QuotientAI ✨ formerly @GitHub @GitHubCopilot 🤖 reformed physicist 👩‍🔬 ~ opinions are my own ~

ID: 1246476937

Link: https://www.quotientai.co/post/hello-world-were-quotient · Joined: 06-03-2013 16:28:09

936 Tweets

870 Followers

1.1K Following

Julia Neagu (@juliaaneagu) 's Twitter Profile Photo

We don't talk about t̶e̶s̶t̶i̶n̶g̶ eval coverage nearly enough. The entire point of LLMs is to have systems that are super robust to random user interactions and produce useful outputs despite that randomness.

Julia Neagu (@juliaaneagu) 's Twitter Profile Photo

OK - but what does sufficient eval coverage look like? Good eval coverage = converging benchmark metrics -- each net new test case added to the eval benchmark yields diminishing returns.
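A minimal sketch of what "converging benchmark metrics" could look like in practice: re-score the suite as cases are added and stop once the aggregate metric stops moving. The scoring function, tolerance, and window size below are illustrative assumptions, not anything specific to Quotient.

```python
# Illustrative sketch: track the running benchmark metric as test cases are
# added; coverage is "good enough" when new cases barely move the aggregate.
from typing import Callable, Sequence


def running_metric(cases: Sequence[dict], score_case: Callable[[dict], float]) -> list[float]:
    """Return the running mean score after each added test case."""
    means: list[float] = []
    total = 0.0
    for i, case in enumerate(cases, start=1):
        total += score_case(case)
        means.append(total / i)
    return means


def has_converged(means: list[float], tolerance: float = 0.01, window: int = 20) -> bool:
    """True if the last `window` additions each moved the mean by < `tolerance`."""
    if len(means) <= window:
        return False
    recent = means[-window:]
    deltas = [abs(b - a) for a, b in zip(recent, recent[1:])]
    return max(deltas) < tolerance
```

The point is simply that once each net new case shifts the headline metric by less than the tolerance, more cases of the same kind buy little.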

Julia Neagu (@juliaaneagu) 's Twitter Profile Photo

Some of the best consumer AI products out there started off as internal enablement tools. It’s the best choice if you decide pre-prod evals aren’t going to cut it and you don’t want to test in *real* prod (user-facing).

Julia Neagu (@juliaaneagu) 's Twitter Profile Photo

IMO benchmarking on real user data should be the gold standard for LLM providers releasing new models, especially the top 5 ones. It's not hard to build such a benchmark (smaller model trainers are doing it!), so avoiding it for too long in favor of off-the-shelf ones is odd.
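For a sense of what "building such a benchmark" can amount to, here is a hedged sketch: sample and dedupe real user interactions from production logs into a frozen eval set. The JSONL log format, field names, and sample size are assumptions for illustration; a real pipeline would also need consent handling and PII scrubbing.

```python
# Sketch: turn a JSONL log of real user traffic into a fixed eval benchmark.
import json
import random


def build_benchmark(log_path: str, out_path: str, n: int = 500, seed: int = 42) -> None:
    with open(log_path) as f:
        records = [json.loads(line) for line in f]

    # Dedupe on the user prompt so one power user can't dominate the set.
    unique = list({r["prompt"]: r for r in records}.values())

    random.seed(seed)
    sample = random.sample(unique, min(n, len(unique)))

    with open(out_path, "w") as f:
        for r in sample:
            f.write(json.dumps({"input": r["prompt"], "reference": r.get("response")}) + "\n")
```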

Julia Neagu (@juliaaneagu) 's Twitter Profile Photo

"Do I need a reference dataset for evals?" Yes and no. You need to start w/ a set of test inputs. You need to have a definition of good for your system. Reference-based evals need a source-of-truth answer. Reference-free ones (e.g. w/ judge LLM) don't but must be human-aligned.

Julia Neagu (@juliaaneagu) 's Twitter Profile Photo

It's not just "regulated industries" that need to care about AI evals, just as it's not only finance or healthcare developers who write unit tests. Testing before shipping is a good developer practice, whether it's for AI or not.

Julia Neagu (@juliaaneagu) 's Twitter Profile Photo

LLM work so far has been largely solo developers pushing independently. It's all about to get A LOT more collaborative. This small feature feels like it could be a big leap in using LLMs for collaborative work — as every Artifact can be shared, used by others, and “remixed”.

Quotient AI (@quotientai) 's Twitter Profile Photo

🔥 We’re excited to share that Quotient AI has been selected for the Hot AI List by The AI Furnace 🧨🔥 We look forward to what’s next as we continue shaping the future of AI, together. Onwards 🚀
