Yunxiang Zhang (@yunxiangzhang4) Twitter Tweets • TwiCopy

Yunxiang Zhang

@yunxiangzhang4


CS PhD student @UMichCSE, BS @PKU1898, #NLP

ID: 1399732727880949766

Link: https://yunx-z.github.io/

Joined: 01-06-2021 14:21:11

34 Tweets

109 Followers

236 Following

Xin Liu

@xinliu_cs

2 years ago

LLMs often exhibit poorly calibrated confidence, which undermines users' trust in their outputs. Though methods exist for short-form answers, they don't address long-form responses😕 Discover the solution in our #ICLR2024 paper! 📄 arxiv.org/abs/2310.19208 👀

15 likes · 1 reply · 8 reposts

Farima Fatahi (on job market)

@farimafb

a year ago

🌍 How Verifiable Are LM Responses in the Wild? A Three-Way Factuality Benchmark. Meet FactBench, an updatable benchmark for evaluating language models' factuality in real-world scenarios. 🔗 huggingface.co/spaces/launch/… LaunchNLP MichiganAI Computer Science and Engineering at Michigan

20 likes · 1 reply · 12 reposts

Xinliang (Frederick) Zhang

@frederickxzhang

a year ago

Heard of the Alaska-Hawaii merger? 🤔 Wonder if LLMs know it’s pending government approval before it can happen? They stumble, but we’ve got a fix ⚒️! Dive into my #EMNLP2024 work Narrative-of-Thought, a special prompting technique to unlock LLMs’ temporal reasoning

34 likes · 1 reply · 18 reposts

Inderjeet Jayakumar Nair

@inderjeetnair

a year ago

Hi everyone 👋, I will be presenting our work at #EMNLP2024 on automatically optimizing feedback generation systems for improved implementation performance, on 12th Nov, 14:00 - 15:30 in the Generation and Summarization oral session. See you there!

18 likes · 1 reply · 4 reposts

Rohan Paul

@rohanpaul_ai

7 months ago

Evaluating LLM research agents on scientific discovery lacks objective measures for assessing proposed methods. This paper introduces MLRC-BENCH, a benchmark using Machine Learning conference competitions to objectively evaluate agent novelty and effectiveness against human

27 likes · 0 replies · 4 reposts

Muhammad Khalifa

@mkhalifaaaa

6 months ago

🚨Announcing SCALR @ COLM 2025 — Call for Papers!🚨 The 1st Workshop on Test-Time Scaling and Reasoning Models (SCALR) is coming to Conference on Language Modeling in Montreal this October! This is the first workshop dedicated to this growing research area. 🌐 scalr-workshop.github.io

44 likes · 1 reply · 17 reposts

Jie Ruan

@jieruan75

6 months ago

🔍LLMs now give medical diagnoses, legal advice, and even tackle scientific problems. ❓Your LLM sounds smart. But what if it’s just good at faking expertise? 🚀We built ExpertLongBench to find out. 📉And the results? They revealed several concerns.👇 🔗 huggingface.co/spaces/launch/…

19 likes · 1 reply · 11 reposts

Muhammad Khalifa

@mkhalifaaaa

6 months ago

🚨 Deadline for SCALR 2025 Workshop: Test‑time Scaling & Reasoning Models at COLM '25 Conference on Language Modeling is approaching!🚨 scalr-workshop.github.io 🧩 Call for short papers (4 pages, non‑archival) now open on OpenReview! Submit by June 23, 2025; notifications out July 24. Topics

16 likes · 0 replies · 8 reposts

Kai Zou

@zkjzou

5 months ago

🔥 Excited to introduce ManyICLBench (ACL 2025) 🧐 Do many-shot ICL tasks evaluate LCLMs' ability to retrieve the most similar examples or learn from many examples? We carefully analyzed numerous tasks and categorized them. 📄 Paper: arxiv.org/abs/2411.07130 #ACL2025

26 likes · 1 reply · 16 reposts