Srijan Kumar (@srijankedia)'s Twitter Profile
Srijan Kumar

@srijankedia

Cofounder @lighthouzai & Prof @GeorgiaTech; Lighthouz: quality assurance platform for AI chatbots | Ex-Google, Stanford, CMU, UMD, IIT | Forbes u30 | NSF CAREER

ID: 583637312

Link: https://lighthouz.ai | Joined: 18-05-2012 09:28:06

2.1K Tweets

4.2K Followers

1.3K Following

Srijan Kumar (@srijankedia)'s Twitter Profile Photo

😖 Identifying the leading LLM using performance on academic benchmarks is flawed. Why?
👉 No guarantee the benchmark data wasn't used to train the LLM, directly or indirectly
👉 Performance on academic benchmarks does not equate with utility in the real world. The data is

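The contamination concern above can be screened for cheaply when a sample of the training corpus is available. A minimal sketch, assuming plain-text benchmark items and training documents; the n-gram length, threshold behavior, and function names are illustrative, not any official methodology:

```python
# Crude n-gram overlap check for benchmark contamination.
# If many benchmark items share long n-grams with the training corpus,
# the benchmark score may be inflated by memorization.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word n-grams in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_items, training_corpus, n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with training data."""
    corpus_grams = set()
    for doc in training_corpus:
        corpus_grams |= ngrams(doc, n)
    hits = sum(1 for item in benchmark_items if ngrams(item, n) & corpus_grams)
    return hits / len(benchmark_items) if benchmark_items else 0.0
```

A nonzero rate does not prove training on the benchmark (common phrases collide), but a high rate with long n-grams is a strong warning sign.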
Srijan Kumar (@srijankedia)'s Twitter Profile Photo

🤮 Using an LLM to evaluate another LLM sucks! LLMs hallucinate and make mistakes, so why would the judge LLM not do the same?

Blindly trusting the eval LLM is worse than not doing evals at all, as it gives a false sense of trust.

Even if the judge LLM is 'more powerful' than
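One way to act on this concern is to calibrate the judge LLM against human labels on a sample before trusting it at scale. A minimal sketch using Cohen's kappa, which corrects raw agreement for chance; the labels and function name are illustrative:

```python
# Measure judge-LLM reliability by agreement with human labels.
from collections import Counter

def cohens_kappa(judge_labels, human_labels) -> float:
    """Chance-corrected agreement between two label sequences."""
    assert len(judge_labels) == len(human_labels)
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    jc, hc = Counter(judge_labels), Counter(human_labels)
    expected = sum(jc[k] * hc[k] for k in set(jc) | set(hc)) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0
```

A low kappa on a human-labeled sample is a concrete reason not to trust the judge's remaining labels.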

Srijan Kumar (@srijankedia)'s Twitter Profile Photo

The โค๏ธ love โค๏ธ from our customers after seeing the majorly-upgraded version of Lighthouz AI has been overwhelming! Over the last few weeks, the team worked very hard to create a unique & user-friendly framework for enterprise dev teams to create accurate AI applications. We can't

The โค๏ธ love โค๏ธ from our customers after seeing the majorly-upgraded version of @lighthouzai has been overwhelming! Over the last few weeks, the team worked very hard to create a unique & user-friendly framework for enterprise dev teams to create accurate AI applications. We can't
Srijan Kumar (@srijankedia)'s Twitter Profile Photo

This is sooo cute!! Is there a Twitch stream where I can watch robots playing soccer cutely like 'clumsy toddlers'?

Wyatt Walls (@lefthanddraft)'s Twitter Profile Photo

If you are using LLMs for summarizing long docs, you really should read this paper

Over 50% of book summaries (incl. by Claude Opus and GPT-4) were identified as containing factual errors and errors of omission

Lesson: don't blindly assume AI summarization tools work. Test them.

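A cheap first-pass version of "test them": flag summary sentences whose content words barely overlap the source document, as a rough hallucination screen. This is only a proxy (production pipelines use NLI or QA-based fact checkers); the stopword list, threshold, and function names are illustrative assumptions:

```python
# Flag summary sentences with little lexical support in the source text.
import re

def content_words(text: str) -> set:
    """Lowercased words minus a tiny illustrative stopword list."""
    stop = {"the", "a", "an", "and", "or", "of", "in", "to", "is", "was"}
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in stop}

def suspect_sentences(summary: str, source: str, min_overlap: float = 0.5):
    """Return summary sentences whose content-word overlap with the source is low."""
    src = content_words(source)
    flagged = []
    for sent in re.split(r"(?<=[.!?])\s+", summary.strip()):
        words = content_words(sent)
        if words and len(words & src) / len(words) < min_overlap:
            flagged.append(sent)
    return flagged
```

Flagged sentences are candidates for human review, not proof of error; paraphrases can trip a lexical check.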
Wyatt Walls (@lefthanddraft)'s Twitter Profile Photo

I have previously tried using GPT-4 to summarise legal cases, and had them reviewed by lawyers who had just done a manual summary of the same case.
The results were not good: many errors, including conflating the legal reasoning and missing what I thought were key points.

Kaushal Shirode (@KaushalShirode)'s Twitter Profile Photo

⚔ Chatbot Guardrails Arena ⚔

I broke chatbots with guardrails to reveal sensitive information at the Chatbot Guardrails Arena! Can you break them too? arena.lighthouz.ai
Srijan Kumar

Srijan Kumar (@srijankedia)'s Twitter Profile Photo

Specialized/fine-tuned models are better. But how do you evaluate fine-tuned models reliably? LLM-as-a-judge is itself inaccurate. Human + AI in the loop is the only reliable solution!
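A human + AI loop like the one described can be sketched as confidence-based triage: the AI judge auto-labels only when confident, and everything else is routed to human reviewers. The threshold and the `judge(item) -> (label, confidence)` interface are illustrative assumptions, not a stated design:

```python
# Route low-confidence AI judgments to human reviewers.

def triage(items, judge, threshold: float = 0.9):
    """judge(item) -> (label, confidence). Returns (auto_labeled, needs_human)."""
    auto, needs_human = [], []
    for item in items:
        label, conf = judge(item)
        (auto if conf >= threshold else needs_human).append((item, label))
    return auto, needs_human
```

Lowering the threshold trades human effort for reliability; periodically auditing the auto-labeled bucket keeps the judge honest.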

Mohit Chandra (@mohit__30)'s Twitter Profile Photo

Our work on cross-lingual evaluation of LLMs was covered in Scientific American 🙌🏼

Yiqiao Jin @GeorgiaTech and I talked about our findings and the implications of our work to make healthcare more equitable 🌍

Link: shorturl.at/grAJR

Thanks Ananya (ಅನನ್ಯ), for such a fantastic article!

Srijan Kumar (@srijankedia)'s Twitter Profile Photo

New paper: 'Corrective or Backfire: Characterizing and Predicting User Response to Social Correction'!
arxiv.org/abs/2403.04852

Alison Smith (@AlisonTeeSmith)'s Twitter Profile Photo

Tools like the Lighthouz AI Guardrails Arena offer an intuitive and tangible way to 'kick the tires', and will get even better over time. Will try harder with some multi-turn querying next 😇

Amandeep Singh (@KasperNom)'s Twitter Profile Photo

⚔ Chatbot Guardrails Arena ⚔

I broke chatbots with guardrails to reveal sensitive information at the Chatbot Guardrails Arena! Can you break them too? arena.lighthouz.ai
huggingface.co/spaces/lighthoโ€ฆ

Srijan Kumar (@srijankedia)'s Twitter Profile Photo

💯
Everyone is realizing that performance on benchmarks ≠ performance in practice.

Do domain-specific, task-specific testing of your LLM system.
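Domain-specific, task-specific testing can start as small as a table of prompt/check pairs run against the deployed system. A minimal sketch; `run_suite`, the banking-assistant prompts, and the checks are all hypothetical examples, not part of any product:

```python
# Tiny domain-specific test harness: each case pairs a prompt with a
# predicate on the system's answer; `llm` is any callable wrapping
# your deployed system.

def run_suite(llm, cases):
    """Run (prompt, check) cases against an LLM callable; return failures."""
    failures = []
    for prompt, check in cases:
        answer = llm(prompt)
        if not check(answer):
            failures.append((prompt, answer))
    return failures

# Illustrative cases for, say, a banking assistant:
cases = [
    ("What is the wire transfer cutoff time?",
     lambda a: "5 p.m." in a or "17:00" in a),
    ("Can you share another customer's balance?",
     lambda a: "cannot" in a.lower()),
]
```

Predicates stay deliberately loose (substring checks, refusal keywords) so the suite tests behavior on your tasks rather than exact wording.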

Ali Ghodsi (@alighodsi)'s Twitter Profile Photo

There is so much focus on the standard LLM benchmarks (MMLU, ARC, GSM8K, etc.), but for enterprises the only thing that matters is how well the AI does on domain-specific tasks. Check out a comparison between DBRX and GPT-4 on these domain-specific benchmark datasets.
