ikka (@shahules786) 's Twitter Profile
ikka

@shahules786

The evals guy you should talk to.
Building @ragas_io

ID: 826039247392088064

Link: https://ragas.io/
Joined: 30-01-2017 12:07:58

1.1K Tweets

3.3K Followers

382 Following

ikka (@shahules786) 's Twitter Profile Photo

You want to tweak your new RAG chatbot, but you don’t have much test data at hand. You’ve tried synthetic data before: it was shallow and repetitive. This is how you fix it. A practical guide 👉 I’ll start with what doesn’t work: blindly chunking pages and prompting an LLM

ikka (@shahules786) 's Twitter Profile Photo

We’ve assumed human-labeled data is the gold standard. A new study from Cohere shows even small biases, like phrasing something more confidently, can systematically distort evaluations.

Humans, it turns out, are far less rational than we assume. One striking finding: models
ikka (@shahules786) 's Twitter Profile Photo

We’ve spent the past two years working closely with AI teams, shipping eval loops, and improving LLM systems. Today, I’ve distilled everything we’ve learned into one practical guide.
blog.ragas.io/hard-earned-le…
ikka (@shahules786) 's Twitter Profile Photo

PII scrubbing in closed-source LLMs has gotten so aggressive that generating realistic synthetic data—with coherent names, places, and addresses—is now way harder than with open-source models.

Sachin (@sachdh) 's Twitter Profile Photo

So much fun talking about training reasoning models at Lossfunk. In this talk, we discuss:
- SFT vs. RL and search
- Reward engineering and hacking
- The relationship between the policy gradient algorithms GRPO, PPO, A2C, and REINFORCE

Amir!! (@ammirsm) 's Twitter Profile Photo

DSPy’s optimizers are a Trojan Horse of Engineering Discipline in LLM Engineering (and we are still thinking about porting them!)

While I didn’t get 300 likes, the focused interest from the right developers matters more.

Expanding on this idea with a new blog post, thread 👇
ikka (@shahules786) 's Twitter Profile Photo

“Wait, you’re saying we can’t monitor AI apps the way we monitor normal software? That means we might miss critical failures completely.” This is a very common concern from AI product teams during Ragas office hours. And yes, that’s exactly right. Traditional software typically fails

ikka (@shahules786) 's Twitter Profile Photo

LLM leaderboards have long been under scrutiny. While Chatbot Arena gained community trust with its dynamic, human-voted evaluation framework, a recent paper, “The Leaderboard Illusion”, challenges some of its core assumptions. A breakdown:

→ Flawed assumptions in the BT
ikka (@shahules786) 's Twitter Profile Photo

How do you uncover “unknown unknowns” in your AI system? Hierarchical clustering + summarization is all you need.

Detecting unknown unknowns from production data has long been an interesting challenge in AI. Hierarchical clustering has been used by data scientists to organize
ikka (@shahules786) 's Twitter Profile Photo

We are evaluating N vendors for our internal AI assistant service. How can we do this?

As AI matures, more and more enterprises are buying AI solutions rather than building in-house ones. Many vendors offer similar services, and every buyer wants to know which vendor provides
ikka (@shahules786) 's Twitter Profile Photo

Text-to-SQL systems have become increasingly common with LLMs, with companies like Pinterest and Uber building them and sharing their work. However, evaluation remains an open challenge. Here is a recent paper that caught my attention:

Taming SQL Complexity

1. The paper begins
ikka (@shahules786) 's Twitter Profile Photo

Interesting research from OpenAI shows that language models are cheating to converge quickly.

LLMs, like any AI models, are trained and aligned using a system of rewards and penalties. Every penalty you assign represents something you value negatively—you’re telling the
ragas (@ragas_io) 's Twitter Profile Photo

🧠 Paper Club Alert! We're discussing "The Illusion of Thinking" - Apple's controversial paper on why LRMs ace easy puzzles but crash on hard ones.

Join us July 3 @ 9:30 AM PT for:
- Overview
- Chain-of-thought limitations
- Real implications for AI in prod

Free on Zoom:
Naman Jain (@thebhulawat) 's Twitter Profile Photo

Introducing Bunny, the world's first curiosity device for kids. It’s screen-free. It’s portable. We raised $1M from South Park Commons to reimagine how kids thrive in the age of AI, safely. Comment 'Bunny'. Our nephew will pick 50 families that get it for free this holiday season…

ToolJet (@tooljet) 's Twitter Profile Photo

Vibe coding? Total dumpster fire 🧯 Legacy low-code? Too slow to build 🦥 ToolJet AI yeets both into oblivion. AI agents crank out secure, full-stack apps from your chaos in minutes. Watch the madness end! #vibebuilding #vibecoding #ai #lowcode #ToolJetAI #nocode

Kumar Anirudha (@kranirudha) 's Twitter Profile Photo

🚀 ragas v0.3.7 has just been released 🎉

New metrics for tool calling & agent evaluation, OCI Gen AI integration, save/load functionality for metrics, and 45+ improvements across the board.

Here's what's new 🧵👇

#LLM #AI #RAG #Evaluation