Srijan Kumar (@srijankedia)'s Twitter Profile
Srijan Kumar

@srijankedia

Cofounder @lighthouzai & Prof @GeorgiaTech; Lighthouz: quality assurance platform for AI chatbots | Ex-Google, Stanford, CMU, UMD, IIT | Forbes u30 | NSF CAREER

ID: 583637312

Link: https://lighthouz.ai | Joined: 18-05-2012 09:28:06

2.1K Tweets

4.2K Followers

1.3K Following

Srijan Kumar (@srijankedia)'s Twitter Profile Photo

😖 Identifying the leading LLM using performance on academic benchmarks is flawed. Why?
👉 No guarantee the benchmark data wasn't used to train the LLM, directly or indirectly
👉 Performance on academic benchmarks does not equate with utility in the real world. The data is

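The contamination concern above can be screened for cheaply when a sample of the training corpus is available. A minimal sketch, assuming plain-text benchmark items and training documents; the n-gram length, threshold behavior, and function names are illustrative, not any official methodology:

```python
# Crude n-gram overlap check for benchmark contamination.
# If many benchmark items share long n-grams with the training corpus,
# the benchmark score may be inflated by memorization.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word n-grams in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_items, training_corpus, n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with training data."""
    corpus_grams = set()
    for doc in training_corpus:
        corpus_grams |= ngrams(doc, n)
    hits = sum(1 for item in benchmark_items if ngrams(item, n) & corpus_grams)
    return hits / len(benchmark_items) if benchmark_items else 0.0
```

A nonzero rate does not prove training on the benchmark (common phrases collide), but a high rate with long n-grams is a strong warning sign.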
Srijan Kumar (@srijankedia)'s Twitter Profile Photo

🤮 Using an LLM to evaluate another LLM sucks! LLMs hallucinate and make mistakes, so why would the judge LLM not do the same?

Blindly trusting the eval LLM is worse than not doing evals at all, as it gives a false sense of trust.

Even if the judge LLM is 'more powerful' than
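One way to act on this concern is to calibrate the judge LLM against human labels on a sample before trusting it at scale. A minimal sketch using Cohen's kappa, which corrects raw agreement for chance; the labels and function name are illustrative:

```python
# Measure judge-LLM reliability by agreement with human labels.
from collections import Counter

def cohens_kappa(judge_labels, human_labels) -> float:
    """Chance-corrected agreement between two label sequences."""
    assert len(judge_labels) == len(human_labels)
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    jc, hc = Counter(judge_labels), Counter(human_labels)
    expected = sum(jc[k] * hc[k] for k in set(jc) | set(hc)) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0
```

A low kappa on a human-labeled sample is a concrete reason not to trust the judge's remaining labels.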

Srijan Kumar (@srijankedia)'s Twitter Profile Photo

The โค๏ธ love โค๏ธ from our customers after seeing the majorly-upgraded version of Lighthouz AI has been overwhelming! Over the last few weeks, the team worked very hard to create a unique & user-friendly framework for enterprise dev teams to create accurate AI applications. We can't

The โค๏ธ love โค๏ธ from our customers after seeing the majorly-upgraded version of @lighthouzai has been overwhelming! Over the last few weeks, the team worked very hard to create a unique & user-friendly framework for enterprise dev teams to create accurate AI applications. We can't
Srijan Kumar (@srijankedia)'s Twitter Profile Photo

This is sooo cute!! Is there a Twitch stream where I can watch robots playing soccer cutely like 'clumsy toddlers'?

Wyatt Walls (@lefthanddraft)'s Twitter Profile Photo

If you are using LLMs for summarizing long docs, you really should read this paper

Over 50% of book summaries (incl. by Claude Opus and GPT-4) were identified as containing factual errors and errors of omission

Lesson: don't blindly assume AI summarization tools work. Test them.

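A cheap first-pass version of "test them": flag summary sentences whose content words barely overlap the source document, as a rough hallucination screen. This is only a proxy (production pipelines use NLI or QA-based fact checkers); the stopword list, threshold, and function names are illustrative assumptions:

```python
# Flag summary sentences with little lexical support in the source text.
import re

def content_words(text: str) -> set:
    """Lowercased words minus a tiny illustrative stopword list."""
    stop = {"the", "a", "an", "and", "or", "of", "in", "to", "is", "was"}
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in stop}

def suspect_sentences(summary: str, source: str, min_overlap: float = 0.5):
    """Return summary sentences whose content-word overlap with the source is low."""
    src = content_words(source)
    flagged = []
    for sent in re.split(r"(?<=[.!?])\s+", summary.strip()):
        words = content_words(sent)
        if words and len(words & src) / len(words) < min_overlap:
            flagged.append(sent)
    return flagged
```

Flagged sentences are candidates for human review, not proof of error; paraphrases can trip a lexical check.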
Wyatt Walls (@lefthanddraft)'s Twitter Profile Photo

I have previously tried using GPT-4 to summarise legal cases, and had them reviewed by lawyers who had just done a manual summary of the same case.
The results were not good: many errors, including conflating the legal reasoning and missing what I thought were key points.

Kaushal Shirode (@KaushalShirode)'s Twitter Profile Photo

⚔ Chatbot Guardrails Arena ⚔

I broke chatbots with guardrails to reveal sensitive information at the Chatbot Guardrails Arena! Can you break them too? arena.lighthouz.ai
Srijan Kumar

Srijan Kumar (@srijankedia)'s Twitter Profile Photo

Specialized/fine-tuned models are better. But how do you evaluate fine-tuned models reliably? LLM-as-a-judge is itself inaccurate. Human + AI in the loop is the only reliable solution!
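A human + AI loop like the one described can be sketched as confidence-based triage: the AI judge auto-labels only when confident, and everything else is routed to human reviewers. The threshold and the `judge(item) -> (label, confidence)` interface are illustrative assumptions, not a stated design:

```python
# Route low-confidence AI judgments to human reviewers.

def triage(items, judge, threshold: float = 0.9):
    """judge(item) -> (label, confidence). Returns (auto_labeled, needs_human)."""
    auto, needs_human = [], []
    for item in items:
        label, conf = judge(item)
        (auto if conf >= threshold else needs_human).append((item, label))
    return auto, needs_human
```

Lowering the threshold trades human effort for reliability; periodically auditing the auto-labeled bucket keeps the judge honest.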

Mohit Chandra (@mohit__30)'s Twitter Profile Photo

Our work on cross-lingual evaluation of LLMs was covered in Scientific American 🙌🏼

Yiqiao Jin @GeorgiaTech and I talked about our findings and the implications of our work to make healthcare more equitable 🌍

Link: shorturl.at/grAJR

Thanks Ananya (ಅನನ್ಯ), for such a fantastic article!

Srijan Kumar (@srijankedia)'s Twitter Profile Photo

New paper: 'Corrective or Backfire: Characterizing and Predicting User Response to Social Correction'!
arxiv.org/abs/2403.04852

Alison Smith (@AlisonTeeSmith)'s Twitter Profile Photo

Tools like the Lighthouz AI Guardrails Arena offer an intuitive and tangible way to 'kick the tires', and will get even better over time. Will try harder with some multi-turn querying next 😇

Amandeep Singh (@KasperNom)'s Twitter Profile Photo

⚔ Chatbot Guardrails Arena ⚔

I broke chatbots with guardrails to reveal sensitive information at the Chatbot Guardrails Arena! Can you break them too? arena.lighthouz.ai
huggingface.co/spaces/lighthoโ€ฆ

Srijan Kumar (@srijankedia)'s Twitter Profile Photo

💯
Everyone is realizing that performance on benchmarks ≠ performance in practice.

Do domain-specific, task-specific testing of your LLM system.
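Domain-specific, task-specific testing can start as small as a table of prompt/check pairs run against the deployed system. A minimal sketch; `run_suite`, the banking-assistant prompts, and the checks are all hypothetical examples, not part of any product:

```python
# Tiny domain-specific test harness: each case pairs a prompt with a
# predicate on the system's answer; `llm` is any callable wrapping
# your deployed system.

def run_suite(llm, cases):
    """Run (prompt, check) cases against an LLM callable; return failures."""
    failures = []
    for prompt, check in cases:
        answer = llm(prompt)
        if not check(answer):
            failures.append((prompt, answer))
    return failures

# Illustrative cases for, say, a banking assistant:
cases = [
    ("What is the wire transfer cutoff time?",
     lambda a: "5 p.m." in a or "17:00" in a),
    ("Can you share another customer's balance?",
     lambda a: "cannot" in a.lower()),
]
```

Predicates stay deliberately loose (substring checks, refusal keywords) so the suite tests behavior on your tasks rather than exact wording.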

Ali Ghodsi (@alighodsi)'s Twitter Profile Photo

There is so much focus on the standard LLM benchmarks (MMLU, ARC, GSM8K, etc.), but for enterprises the only thing that matters is how well the AI does on domain-specific tasks. Check out a comparison between DBRX and GPT-4 on these domain-specific benchmark datasets.
