ikka (@shahules786) Twitter Tweets • TwiCopy

ikka

7 months ago

You want to tweak your new RAG chatbot, but you don’t have much test data at hand. You’ve tried synthetic data before: it was shallow and repetitive. This is how you fix it. A practical guide 👉 I’ll start with what doesn’t work: blindly chunking pages and prompting an LLM

Likes: 10, Replies: 2, Reposts: 1

ikka

@shahules786

6 months ago

We’ve assumed human-labeled data is the gold standard. A new study from Cohere shows even small biases, like phrasing something more confidently, can systematically distort evaluations. Humans, it turns out, are far less rational than we assume. One striking finding: models

Likes: 4, Replies: 0, Reposts: 0

ikka

@shahules786

6 months ago

We’ve spent the past two years working closely with AI teams, shipping eval loops, and improving LLM systems. Today, I’ve distilled everything we’ve learned into one practical guide. blog.ragas.io/hard-earned-le…

Likes: 20, Replies: 0, Reposts: 4

ikka

@shahules786

6 months ago

Only in SF. Thanks for the souvenir, RunRL team

Likes: 2, Replies: 1, Reposts: 0

ikka

@shahules786

6 months ago

PII scrubbing in closed-source LLMs has gotten so aggressive that generating realistic synthetic data—with coherent names, places, and addresses—is now way harder than with open-source models.

Likes: 1, Replies: 0, Reposts: 0

Sachin

@sachdh

6 months ago

So much fun discussing how to train reasoning models at Lossfunk. In this talk, we discuss:
- SFT vs RL and search
- Reward engineering and reward hacking
- The relationship between different policy gradient algorithms: GRPO, PPO, A2C, and REINFORCE

Likes: 80, Replies: 3, Reposts: 5

Amir!!

@ammirsm

6 months ago

DSPy’s optimizers are a Trojan Horse of Engineering Discipline in LLM Engineering (and we are still thinking about porting them!) While I didn’t get 300 likes, the focused interest from the right developers matters more. Expanding on this idea with a new blog post, thread 👇

Likes: 35, Replies: 5, Reposts: 7

ikka

@shahules786

6 months ago

We are at PyCon US 2025. Come by to solve your evals!

Likes: 10, Replies: 0, Reposts: 1

ikka

@shahules786

6 months ago

“Wait—you’re saying we can’t monitor AI apps the way we monitor normal software? That means we might miss critical failures completely.” This is a very common concern of AI product teams on Ragas office hours. And yes, that’s exactly right. Traditional software typically fails

Likes: 1, Replies: 0, Reposts: 0

ikka

@shahules786

6 months ago

LLM leaderboards have long been under scrutiny. While Chatbot Arena gained community trust with its dynamic, human-voted evaluation framework, a recent paper, “The Leaderboard Illusion”, challenges some of its core assumptions. A breakdown: → Flawed assumptions in the BT

Likes: 1, Replies: 0, Reposts: 1

ikka

@shahules786

6 months ago

Finally met the legend Han. Fun conversation on startups, dev tools, and LLMOps :)

Likes: 25, Replies: 6, Reposts: 0

ikka

@shahules786

6 months ago

How do you uncover “unknown unknowns” in your AI system? Hierarchical clustering + summarization is all you need. Detecting unknown unknowns from production data has long been an interesting challenge in AI. Hierarchical clustering has been used by data scientists to organize

Likes: 6, Replies: 3, Reposts: 0

ikka

@shahules786

5 months ago

We are evaluating N vendors for our internal AI assistant service. How can we do this? As AI matures, more and more enterprises are buying AI solutions rather than building in-house ones. Many vendors offer similar services, and every buyer wants to know which vendor provides

Likes: 2, Replies: 0, Reposts: 1

ikka

@shahules786

5 months ago

Text-to-SQL systems have become increasingly common with LLMs, with companies like Pinterest and Uber building them and sharing their work. However, evaluation remains an open challenge. Here is a recent work that caught my attention: Taming SQL Complexity. 1. The paper begins

Likes: 1, Replies: 0, Reposts: 1

ikka

@shahules786

5 months ago

Interesting research from OpenAI shows that language models cheat in order to converge quickly. LLMs, like any AI models, are trained and aligned using a system of rewards and penalties. Every penalty you assign represents something you value negatively: you’re telling the

Likes: 2, Replies: 0, Reposts: 0

ragas

@ragas_io

5 months ago

🧠 Paper Club Alert! We're discussing "The Illusion of Thinking", Apple's controversial paper on why LRMs ace easy puzzles but crash on hard ones. Join us July 3 @ 9:30 AM PT for:
- Overview
- Chain-of-thought limitations
- Real implications for AI in prod
Free on Zoom:

Likes: 9, Replies: 0, Reposts: 1

Naman Jain

@thebhulawat

2 months ago

Introducing Bunny, the world's first curiosity device for kids. It’s screen-free. It’s portable. We raised $1M from South Park Commons to reimagine how kids thrive in the age of AI, safely. Comment 'Bunny'. Our nephew will pick 50 families that get it for free this holiday season…

Likes: 1.1K, Replies: 513, Reposts: 182

ToolJet

@tooljet

2 months ago

Vibe coding? Total dumpster fire 🧯 Legacy low-code? Too slow to build 🦥 ToolJet AI yeets both into oblivion. AI agents crank out secure, full-stack apps from your chaos in minutes. Watch the madness end! #vibebuilding #vibecoding #ai #lowcode #ToolJetAI #nocode

Likes: 121, Replies: 31, Reposts: 81

Kumar Anirudha

@kranirudha

a month ago

🚀 ragas v0.3.7 has just been released 🎉 New metrics for tool calling & agent evaluation, OCI Gen AI integration, save/load functionality for metrics, and 45+ improvements across the board. Here's what's new 🧵👇 #LLM #AI #RAG #Evaluation

Likes: 6, Replies: 1, Reposts: 3