Devansh Jain (@devanshrjain) 's Twitter Profile
Devansh Jain

@devanshrjain

model routers @ ¬◇ | ai safety @LTIatCMU | ex cs @bitspilaniindia

ID: 3301689482

Joined: 30-07-2015 16:08:00

89 Tweets

166 Followers

816 Following

Harshita Diddee (@ihsrahedid) 's Twitter Profile Photo

Ever wondered which instruction selection strategy to choose for your custom setup? The answer might just be random sampling! In our recent #NAACL Findings paper, we show that popular strategies do not *consistently* beat random selection!
Paper: shorturl.at/77ECJ 1/6
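The paper's takeaway, that uniform random sampling is a strong baseline for instruction selection, is also trivially cheap to implement. A minimal sketch, assuming a toy pool of instruction-tuning examples (the pool contents and subset size below are invented for illustration):

```python
import random

def select_instructions(pool, k, seed=0):
    # Fixed seed so the "random" baseline is reproducible across runs.
    rng = random.Random(seed)
    return rng.sample(pool, k)

# Toy pool of instruction-tuning examples (contents are made up).
pool = [{"instruction": f"task {i}", "output": f"answer {i}"} for i in range(100)]
subset = select_instructions(pool, k=10)
```

Scoring-based strategies would replace `rng.sample` with a ranking heuristic; the paper's point is that such heuristics do not consistently beat this one-liner.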
Dagster (@dagster) 's Twitter Profile Photo

Deploying LLM applications isn't just about the initial setup; it's about continuously managing the technology's rapid evolution. As new models emerge, organizations face the challenge of minimizing technical debt and optimizing model utilization. Check out the full Deep Dive

Tomas Hernando Kofman (@tomas_hk) 's Twitter Profile Photo

We're hiring engineers and researchers to build the future of multi-model AI infrastructure. We're a small, technically elite team backed by Jeff Dean, Julien Chaumond, Ion Stoica, + more. And we guarantee a $50K investment in your next startup for every year you work with us.

MatthewBerman (@matthewberman) 's Twitter Profile Photo

Apply to NotDiamond, they are hiring the best engineers and researchers. I met Tomas and his vision and intelligence blew me away, so I personally invested! Plus, they will invest $50k into the next company you start for every year you work there.

Akhila Yerukola (@akhila_yerukola) 's Twitter Profile Photo

Did you know? Gestures expressing universal concepts, like wishing for luck, vary WIDELY across cultures.
🤞 means luck in the US but is deeply offensive in Vietnam 🚨

📣 We introduce MC-SIGNS, a test bed to evaluate how LLMs/VLMs/T2I models handle such nonverbal cues
📜: arxiv.org/abs/2502.17710
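The benchmark's premise can be sketched as a culture-conditioned lookup: the same gesture maps to different meanings depending on locale. The two entries below come from the tweet's own example; the table and function names are hypothetical stand-ins, not the MC-SIGNS API:

```python
# Toy lookup: the same gesture can be benign in one culture and
# offensive in another. Only the tweet's example is encoded here;
# the real benchmark covers many gestures and cultures.
GESTURE_MEANING = {
    ("crossed_fingers", "US"): "luck",
    ("crossed_fingers", "Vietnam"): "offensive",
}

def is_offensive(gesture, culture):
    # Unknown (gesture, culture) pairs default to not-offensive here;
    # a real evaluation would need to handle missing data explicitly.
    return GESTURE_MEANING.get((gesture, culture)) == "offensive"
```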
Neel Bhandari (@neelbhandari9) 's Twitter Profile Photo

1/ 🚨 New paper alert 🚨
RAG systems excel on academic benchmarks - but are they robust to variations in linguistic style?

We find RAG systems are brittle. Small shifts in phrasing trigger cascading errors, driven by the complexity of the RAG pipeline 🧵
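One way to probe the brittleness described above is to rephrase each query and measure how often the pipeline's answer survives. A minimal sketch, with a deliberately brittle stub standing in for a real RAG pipeline (the paraphrase rules and stub are invented, not the paper's method):

```python
def paraphrase_variants(query):
    # Toy stylistic rewrites; a real eval would use curated or
    # model-generated paraphrases.
    return [
        query,
        query.lower(),
        "could you tell me " + query.rstrip("?").lower() + "?",
    ]

def consistency(rag, query, reference):
    # Fraction of paraphrases for which the pipeline still returns
    # the reference answer - a crude robustness score.
    variants = paraphrase_variants(query)
    return sum(rag(v) == reference for v in variants) / len(variants)

# Stub pipeline that only matches one exact surface form.
def brittle_rag(q):
    return "Paris" if q == "What is the capital of France?" else "unknown"

score = consistency(brittle_rag, "What is the capital of France?", "Paris")
# score < 1.0: small phrasing shifts break the stub pipeline
```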
MatthewBerman (@matthewberman) 's Twitter Profile Photo

Not Diamond is building an incredibly important infrastructure layer for AI: model routing. Today, they make it easier to write prompts once and use them across different LLMs with Prompt Adaptation. I got a preview a few weeks ago (I'm a small investor) and was very
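The core idea of model routing is simple to state: pick a model per prompt instead of hard-coding one. The sketch below uses a crude rule for illustration only; Not Diamond's actual routing is learned, and the model names and threshold here are invented:

```python
# Heuristic router: cheap prompts go to a small model, hard-looking
# prompts to a stronger one. Rule, names, and threshold are made up.
def route(prompt, small="small-fast-model", large="large-capable-model"):
    looks_hard = len(prompt) > 200 or "```" in prompt
    return large if looks_hard else small

choice = route("What's 2+2?")
```

A learned router would replace `looks_hard` with a model trained on per-prompt quality/cost signals, but the call-site contract (prompt in, model name out) stays the same.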

Yong Zheng-Xin (Yong) (@yong_zhengxin) 's Twitter Profile Photo

🧵 Multilingual safety training/eval is now standard practice, but a critical question remains: Is multilingual safety actually solved?

Our new survey with Cohere Labs answers this and dives deep into:
- Language gap in safety research
- Future priority areas

Thread 👇
Akhila Yerukola (@akhila_yerukola) 's Twitter Profile Photo

Thanks Language Technologies Institute | @CarnegieMellon and CMU School of Computer Science for featuring our work!! ✨💫
Our paper on culturally offensive nonverbal gestures is accepted to #ACL2025 main!
Detailed thread 🧵: x.com/akhila_yerukol…
Preprint 📜: arxiv.org/abs/2502.17710
Work done with Saadia Gabriel, Violet Peng, Maarten Sap (he/him)

Sanidhya Vijayvargiya (@sanidhya903) 's Twitter Profile Photo

1/ AI agents are increasingly being deployed for real-world tasks, but how safe are they in high-stakes settings?
🚨 NEW: OpenAgentSafety - a comprehensive framework for evaluating AI agent safety in realistic scenarios across eight critical risk categories.
🧵
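A multi-category safety benchmark ultimately reduces to tallying failures per risk category. A minimal aggregation sketch; the category names and result records below are invented, and OpenAgentSafety defines its own eight categories and scoring:

```python
from collections import Counter

def unsafe_rate_by_category(results):
    # Per-category failure rate: unsafe runs / total runs.
    total = Counter(r["category"] for r in results)
    unsafe = Counter(r["category"] for r in results if not r["safe"])
    return {c: unsafe[c] / total[c] for c in total}

# Toy result records from hypothetical agent runs.
results = [
    {"category": "data-leak", "safe": True},
    {"category": "data-leak", "safe": False},
    {"category": "financial-harm", "safe": True},
]
rates = unsafe_rate_by_category(results)
```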
Andy Liu (@uilydna) 's Twitter Profile Photo

🚨 New Paper: LLM developers aim to align models with values like helpfulness or harmlessness. But when these conflict, which values do models choose to support? We introduce ConflictScope, a fully automated evaluation pipeline that reveals how models rank values under conflict.
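Recovering a value ranking from pairwise conflict outcomes can be sketched with a simple win-rate tally. The outcome data and scoring rule below are invented for illustration; ConflictScope's actual pipeline generates conflict scenarios and judgments automatically:

```python
from collections import Counter

def rank_values(pairwise_wins):
    # pairwise_wins: (winner, loser) pairs from conflict scenarios.
    # Rank values by the fraction of their conflicts the model sided with.
    wins = Counter(w for w, _ in pairwise_wins)
    appearances = Counter()
    for w, l in pairwise_wins:
        appearances[w] += 1
        appearances[l] += 1
    return sorted(appearances, key=lambda v: wins[v] / appearances[v],
                  reverse=True)

# Hypothetical outcomes: which value the model upheld in each conflict.
outcomes = [
    ("harmlessness", "helpfulness"),
    ("harmlessness", "honesty"),
    ("honesty", "helpfulness"),
]
ranking = rank_values(outcomes)
```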
Maarten Sap (he/him) (@maartensap) 's Twitter Profile Photo

Day 3 (Thu Oct 9), 11:00am–1:00pm, Poster Session 5
Poster #13: PolyGuard: A Multilingual Safety Moderation Tool for 17 Languages, led by Priyanshu Kumar, Devansh Jain
Poster #74: Fluid Language Model Benchmarking, led by Valentin Hofmann

Liwei Jiang (@liweijianglw) 's Twitter Profile Photo

(Thu Oct 9, 11:00am–1:00pm) Poster Session 5

Poster #13: PolyGuard: A Multilingual Safety Moderation Tool for 17 Languages; w/ amazing Priyanshu Kumar, Devansh Jain

PolyGuard is among the SOTA multilingual safety moderation tools + we release comprehensive multilingual
Rootly (@rootlyhq) 's Twitter Profile Photo

While Sonnet-4.5 remains a popular choice among developers, our benchmarks show it underperforms GPT-5 on SRE-related tasks when both are run with default parameters. However, using the Not Diamond prompt adaptation platform, Sonnet-4.5 achieved up to a 2x performance

Kshitish Ghate (@ghatekshitish) 's Twitter Profile Photo

🚨 New paper: Reward Models (RMs) are used to align LLMs, but can they be steered toward user-specific value/style preferences?
With EVALUESTEER, we find even the best RMs we tested exhibit their own value/style biases, and are unable to align with a user >25% of the time. 🧵
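Steering a reward model with a user profile amounts to prepending the stated preferences to the judging context and asking the RM to pick between responses. The harness below is a hypothetical sketch, with a stub RM in place of a real one; it is not the EVALUESTEER setup itself:

```python
def steered_preference(rm, profile, prompt, resp_a, resp_b):
    # Condition the reward model on the user's stated preferences.
    judge_input = f"User preferences: {profile}\nPrompt: {prompt}"
    score_a = rm(judge_input, resp_a)
    score_b = rm(judge_input, resp_b)
    return "A" if score_a >= score_b else "B"

# Stub RM that rewards brevity only when the context asks for it;
# a real RM is a learned scorer, and the paper's finding is that
# such steering often fails.
def toy_rm(ctx, response):
    if "concise" in ctx:
        return 1.0 / (1 + len(response))
    return float(len(response))

choice = steered_preference(
    toy_rm, "I prefer concise answers.", "Explain HTTP.",
    "Short answer.", "A very long and detailed answer " * 5)
```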
Letta (@letta_ai) 's Twitter Profile Photo

What if we evaluated agents less like isolated code snippets, and more like humans, where behavior depends on the environment and lived experiences?

🧪 Introducing Letta Evals: a fully open-source evaluation framework for stateful agents
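The distinction being drawn is that a stateful agent carries memory across turns, so later behavior depends on the full interaction history rather than on each test case in isolation. A toy sketch of that property; the agent class and eval loop are stand-ins, not the Letta Evals API:

```python
class ToyStatefulAgent:
    def __init__(self):
        self.memory = []  # persists across turns, unlike a stateless call

    def step(self, message):
        self.memory.append(message)
        # Reply depends on accumulated state, not just this message.
        return f"seen {len(self.memory)} messages"

def run_eval(agent, turns):
    # Feed turns in order against ONE agent instance, so each reply
    # can depend on everything that came before it.
    return [agent.step(t) for t in turns]

replies = run_eval(ToyStatefulAgent(), ["hi", "remember this", "recall"])
```

A stateless harness would construct a fresh agent per turn and lose exactly the history this loop preserves.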