Eran Hirsch (@hirscheran) 's Twitter Profile
Eran Hirsch

@hirscheran

PhD candidate @biunlp; tweets about NLP, ML and research

ID: 2796071287

Link: https://eranhirs.github.io/ | Joined: 07-09-2014 14:32:29

830 Tweets

288 Followers

636 Following

Google Labs (@googlelabs) 's Twitter Profile Photo

We just discovered the 🔥 COOLEST 🔥 trick in Flow that we have to share: Instead of wordsmithing the perfect prompt, you can just... draw it. Take the image of your scene, doodle what you'd like on it (through any editing app), and then briefly describe what needs to happen

Denny Zhou (@denny_zhou) 's Twitter Profile Photo

Slides for my lecture "LLM Reasoning" at Stanford CS 25: dennyzhou.github.io/LLM-Reasoning-… Key points: 1. Reasoning in LLMs simply means generating a sequence of intermediate tokens before producing the final answer. Whether this resembles human reasoning is irrelevant. The crucial
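To make that first point concrete, here is a minimal sketch (my own illustration, assuming only a generic call_llm stand-in rather than any particular API): "reasoning" is just the intermediate tokens the model emits before a marked final answer, which we parse out afterwards.

```python
# Minimal sketch of "reasoning as intermediate tokens". `call_llm` is a
# stand-in for whatever LLM API or local model you actually use.

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (API client, local model, etc.)."""
    raise NotImplementedError

def answer_with_reasoning(question: str) -> tuple[str, str]:
    prompt = (
        f"Question: {question}\n"
        "Think step by step, then give the result on a line starting with "
        "'Final answer:'."
    )
    completion = call_llm(prompt)
    # Everything before the marker is the intermediate "reasoning" tokens;
    # everything after it is the answer we actually return or score.
    reasoning, _, final = completion.partition("Final answer:")
    return reasoning.strip(), final.strip()
```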

Yumo Xu (@yumo_xu) 's Twitter Profile Photo

Excited to share our #ACL2025NLP paper, "CiteEval: Principle-Driven Citation Evaluation for Source Attribution"! 📜 If you're working on RAG, Deep Research and Trustworthy AI, this is for you. Why? Citation quality is

Vishakh Padmakumar (@vishakh_pk) 's Twitter Profile Photo

Maybe don't use an LLM for _everything_?

Last summer, I got to fiddle again with content diversity @AdobeResearch @Adobe and we showed that agentic pipelines that mix LLM-prompt steps with principled techniques can yield better, more personalized summaries
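One hedged illustration of mixing a principled step with an LLM step (a sketch in the spirit of the tweet, not the paper's actual pipeline; embed and call_llm are placeholder functions): select diverse content with maximal marginal relevance, then let the LLM summarize only what was selected.

```python
# Sketch: maximal marginal relevance (a "principled" diversity step) followed
# by an LLM summarization step. `embed` and `call_llm` are placeholders.

import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder: return one embedding vector per text."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call."""
    raise NotImplementedError

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def mmr_select(query: str, sentences: list[str], k: int = 5, lam: float = 0.7) -> list[str]:
    q, S = embed([query])[0], embed(sentences)
    chosen: list[int] = []
    while len(chosen) < min(k, len(sentences)):
        def score(i: int) -> float:
            # Trade off relevance to the query against redundancy with picks so far.
            redundancy = max((cos(S[i], S[j]) for j in chosen), default=0.0)
            return lam * cos(S[i], q) - (1 - lam) * redundancy
        chosen.append(max((i for i in range(len(sentences)) if i not in chosen), key=score))
    return [sentences[i] for i in chosen]

def diverse_summary(query: str, sentences: list[str]) -> str:
    selected = mmr_select(query, sentences)
    prompt = "Summarize the following points for the user:\n" + "\n".join(f"- {s}" for s in selected)
    return call_llm(prompt)
```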
Hunyuan (@tencenthunyuan) 's Twitter Profile Photo

We're thrilled to release & open-source Hunyuan3D World Model 1.0! This model enables you to generate immersive, explorable, and interactive 3D worlds from just a sentence or an image. It's the industry's first open-source 3D world generation model, compatible with CG pipelines

Aviya Maimon (@aviyamaimon) 's Twitter Profile Photo

🚨 New paper alert! 🚨 We propose an IQ Test for LLMs, a new way to evaluate models that goes beyond benchmarks and uncovers their core skills. Think: 🧠🤖 psychometrics for LLMs. 👇 (1/6)
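To unpack the psychometrics analogy with a toy example (my own illustration, not the paper's method; the score matrix is made up): treat models as test takers and benchmarks as test items, and read the first principal component of the score matrix as a rough general-ability factor.

```python
# Toy "g factor" extraction over a made-up model-by-benchmark score matrix.

import numpy as np

# rows = models, columns = benchmarks (illustrative numbers)
scores = np.array([
    [0.81, 0.74, 0.66, 0.59],
    [0.77, 0.70, 0.61, 0.55],
    [0.65, 0.58, 0.52, 0.40],
])

z = (scores - scores.mean(axis=0)) / (scores.std(axis=0) + 1e-9)  # standardize per benchmark
corr = np.corrcoef(z, rowvar=False)                               # benchmark-benchmark correlations
eigvals, eigvecs = np.linalg.eigh(corr)

loading = eigvecs[:, -1]        # loadings on the top factor
g_scores = z @ loading          # each model's rough "general ability" score
print("variance explained:", eigvals[-1] / eigvals.sum())
print("per-model factor scores:", g_scores)
```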

Salesforce AI Research (@sfresearch) 's Twitter Profile Photo

Can LLM agents really understand us? 🤔

Paper: arxiv.org/html/2507.2203…
Code/Data: github.com/SalesforceAIRe…

We introduce UserBench: a gym environment testing how well agents align with nuanced human intent, not just follow commands. Users are messy: vague, evolving, indirect.
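A rough sketch of what a gym-style simulated-user environment can look like (illustrative only; UserBench's real API, user simulator, and scoring differ): preferences are hidden, the agent may ask or recommend, and the reward depends on matching the hidden intent.

```python
# Illustrative gym-style environment with a vague simulated user. The hidden
# preferences stand in for "nuanced human intent"; names and the reward rule
# are made up for this sketch.

class VagueUserEnv:
    def __init__(self, preferences: dict[str, str]):
        self._prefs = preferences  # hidden ground-truth intent

    def reset(self) -> str:
        return "I need to book a trip, nothing too crazy."  # deliberately vague

    def step(self, action: dict) -> tuple[str, float, bool]:
        if action["type"] == "ask" and action["topic"] in self._prefs:
            # The simulated user reveals intent only when asked, one aspect at a time.
            return f"About {action['topic']}: {self._prefs[action['topic']]}", 0.0, False
        if action["type"] == "recommend":
            matched = sum(v in action["text"] for v in self._prefs.values())
            return "Thanks!", matched / len(self._prefs), True
        return "Not sure what you mean.", 0.0, False

# An agent that never asks clarifying questions will usually score poorly here.
env = VagueUserEnv({"budget": "under $800", "dates": "mid October", "style": "quiet beach"})
obs = env.reset()
obs, _, _ = env.step({"type": "ask", "topic": "budget"})
obs, reward, done = env.step({"type": "recommend",
                              "text": "A quiet beach trip in mid October, under $800."})
print(reward, done)  # 1.0 True
```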
Banghua Zhu (@banghuaz) 's Twitter Profile Photo

Hearing from a lot of folks that they still fine-tune Qwen2.5 instead of Qwen3, simply because "it's easier to tune." Qwen2.5 models seem more steerable: easier to adapt for new behaviors or boost specific capabilities, which means more downstream work builds on them. People
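For context, the "easy to tune" workflow usually looks something like this minimal LoRA sketch with transformers + peft (the dataset and hyperparameters below are illustrative assumptions, not a recipe):

```python
# Minimal LoRA fine-tuning sketch for a Qwen2.5 instruct model.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-7B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Freeze the base weights and attach small LoRA adapters instead.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# Toy training data; in practice this would be your SFT dataset.
examples = ["User: Summarize LoRA in one line.\nAssistant: LoRA adds trainable low-rank adapters to frozen weights."]
device = next(model.parameters()).device
opt = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)

model.train()
for text in examples:
    batch = tok(text, return_tensors="pt").to(device)
    out = model(**batch, labels=batch["input_ids"])  # causal LM loss on the whole sequence
    out.loss.backward()
    opt.step()
    opt.zero_grad()
```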

Mosh Levy (@mosh_levy) 's Twitter Profile Photo

Producing reasoning texts boosts the capabilities of AI models, but do we humans correctly understand these texts? Our latest research suggests that we do not.
This highlights a new angle on the "Are they transparent?" debate: they might be, but we misinterpret them. 🧵
Paul Couvert (@itspaulai) 's Twitter Profile Photo

Wow! Chinese lab Tencent Hunyuan has released an open source alternative to Genie 3 🔥 You can generate realistic videos that you can control in real time.
- Long-term consistency
- No need for expensive rendering
- Trained on 1M+ gameplay recordings
Already available ↓

Tomer Ashuach (@tomerashuach) 's Twitter Profile Photo

🚨 New preprint out!

CRISP: Persistent Concept Unlearning via SAEs
LLMs often encode knowledge we want to remove.

CRISP enables persistent, interpretable, precise unlearning while keeping models useful & coherent, tested on bio & cyber safety tasks 🧵👇
📄 arxiv.org/abs/2508.13650
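As a toy illustration of the general SAE-feature-suppression idea (not the CRISP algorithm itself; shapes and feature indices are made up): encode a hidden state with a sparse autoencoder, zero the latents linked to the concept, and decode back before the model continues.

```python
# Toy sketch only: a tiny SAE, a list of concept-linked latent indices, and a
# forward hook that swaps in the feature-suppressed reconstruction.

import torch
import torch.nn as nn

class TinySAE(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)

    def forward(self, h: torch.Tensor, suppress: list[int]) -> torch.Tensor:
        z = torch.relu(self.enc(h))      # sparse latent features
        mask = torch.ones_like(z)
        mask[..., suppress] = 0.0        # knock out the concept-linked features
        return self.dec(z * mask)        # reconstructed hidden state

sae = TinySAE(d_model=768, d_latent=4096)
concept_features = [12, 890, 3031]       # illustrative indices said to track the concept

def hook(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output.
    return sae(output, suppress=concept_features)

# e.g. some_transformer_layer.register_forward_hook(hook)  # hypothetical layer handle
```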
Eviatar Nachshoni (@enachshoni) 's Twitter Profile Photo

🚨 New paper out! 📄
What happens when LLMs & RLMs face conflicting answers to a question? 🤔
They often ignore disagreement and confidently pick one "correct" answer. 🤯
📄 arxiv.org/pdf/2508.12355
#AI #LLM #NLP #MachineLearning
Fan Nie (@fannie1208) 's Twitter Profile Photo

How does it work? [4/n]

🔹 UQ-Dataset: 500 challenging, diverse unsolved questions sourced from Stack Exchange. They are super hard and arise naturally when humans seek answers.
🔹 UQ-Validator: No ground truth → traditional metrics fail. Instead, we designed compound validation
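A hedged sketch of what compound validation without ground truth can look like (the actual UQ-Validator pipeline is more involved; call_llm is a placeholder judge): an answer only counts if several independent checks all accept it.

```python
# Sketch of a "compound validator": conjunction of noisy checks, no gold answers.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM judge that answers 'yes' or 'no'."""
    raise NotImplementedError

def check_self_consistent(question: str, answer: str) -> bool:
    prompt = f"Does this answer contradict itself?\nQ: {question}\nA: {answer}\nAnswer yes or no."
    return call_llm(prompt).strip().lower() == "no"

def check_addresses_question(question: str, answer: str) -> bool:
    prompt = f"Does the answer actually address the question?\nQ: {question}\nA: {answer}\nAnswer yes or no."
    return call_llm(prompt).strip().lower() == "yes"

def check_supports_claims(question: str, answer: str) -> bool:
    prompt = f"Does the answer back its claims with verifiable reasoning or references?\nQ: {question}\nA: {answer}\nAnswer yes or no."
    return call_llm(prompt).strip().lower() == "yes"

CHECKS = [check_self_consistent, check_addresses_question, check_supports_claims]

def compound_validate(question: str, answer: str) -> bool:
    # Each individual check is weak and noisy, but an answer has to clear all
    # of them to count as "plausibly correct".
    return all(check(question, answer) for check in CHECKS)
```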