Greg Durrett (@gregd_nlp)'s Twitter Profile
Greg Durrett

@gregd_nlp

CS professor at UT Austin. I do NLP most of the time. he/him

ID: 938457074278846468

Joined: 06-12-2017 17:16:17

1.1K Tweets

6.0K Followers

760 Following

Ge Gao (@ggaonlp)

RLHF research requires training and hiring annotators to explicitly choose between different model outputs.

What if we could get human preferences from user edits, which are generated naturally in applications like AI writing assistants? arxiv.org/abs/2404.15269
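Not the paper's code, just a minimal sketch of the idea: treat the user's edited text as the preferred output and the original model draft as the rejected one, so edit logs from a writing assistant can be turned into DPO-style preference pairs. The function and field names below are hypothetical.

from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # the user-edited text (preferred)
    rejected: str  # the original model draft

def pairs_from_edit_log(edit_log):
    """edit_log: iterable of (prompt, model_draft, user_edit) tuples."""
    pairs = []
    for prompt, model_draft, user_edit in edit_log:
        # Only edits that actually change the draft carry a preference signal.
        if user_edit.strip() and user_edit != model_draft:
            pairs.append(PreferencePair(prompt, chosen=user_edit, rejected=model_draft))
    return pairs

log = [("Summarize the notes.", "The meeting went well.", "The team agreed to ship v2 by Friday.")]
print(pairs_from_edit_log(log))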

Greg Durrett (@gregd_nlp)

Check out Prasann's work! We think this method has real potential for deployment in alignment settings where human preferences are collected online. It feels like the right way to squeeze the most out of that kind of data!

Yasumasa Onoe (@yasumasa_onoe)

We're excited to announce DOCCI: a new dataset designed to advance vision-language research. DOCCI features 15k images with detailed descriptions crafted to capture complex visual concepts: spatial relations, counting, text, entities, and more.

arxiv.org/pdf/2404.19753

Boyang 'Albert' Li (@AlbertBoyangLi)

🚨New NAACL 2024 Paper 🚨
We trained four vision-language models on 23 source tasks and evaluated them on 29 target tasks to look for patterns and latent factors in vision-language evaluation benchmarks.

arxiv.org/abs/2404.02415
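My own rough illustration of this kind of analysis (not the paper's method, and the numbers are made up): stack model-by-task scores into a matrix and run PCA to see how much of the benchmark variance a few latent factors explain.

import numpy as np
from sklearn.decomposition import PCA

# Rows = models, columns = evaluation tasks; the scores below are invented.
scores = np.array([
    [0.71, 0.64, 0.58, 0.80, 0.45],
    [0.69, 0.61, 0.55, 0.78, 0.43],
    [0.75, 0.70, 0.66, 0.83, 0.52],
    [0.62, 0.52, 0.49, 0.71, 0.38],
])

pca = PCA(n_components=2)
model_factors = pca.fit_transform(scores)  # each model's position in latent-factor space
task_loadings = pca.components_            # how strongly each task loads on each factor
print(pca.explained_variance_ratio_)       # variance explained by each latent factor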

Yoonsang Lee (@yoonsang_)

Can LMs correctly distinguish🔎 confusing entity mentions in multiple documents?

We study how current LMs perform at QA when given ambiguous questions and a document set📚 that requires challenging entity disambiguation.

Work done at UT Austin Computer Science✨ w/ Xi Ye and Eunsol Choi
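A toy sketch of how such an evaluation could be set up (hypothetical prompt format and LM call, not the paper's actual pipeline): hand the model a document set with confusable entity mentions plus an ambiguous question, then check whether the answer picks out the right entity.

def build_prompt(documents, question):
    # Concatenate the document set, then ask the (possibly ambiguous) question.
    doc_block = "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(documents))
    return f"{doc_block}\n\nQuestion: {question}\nAnswer:"

docs = [
    "Michael Jordan, the basketball player, won six NBA championships with the Bulls.",
    "Michael Jordan, the machine learning researcher, is a professor at UC Berkeley.",
]
prompt = build_prompt(docs, "Which university does Michael Jordan work at?")
# answer = my_lm.generate(prompt)   # hypothetical LM call
# hit = "Berkeley" in answer        # crude check that the right entity was resolved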

Yating Wu (@YatingWu96)

LLMs can mimic human curiosity by generating open-ended inquisitive questions given some context, similar to how humans wonder when they read.

But which ones are most important to answer?🤔

We predict the salience of questions, substantially outperforming GPT-4.🌟 🧵1/5
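A minimal sketch of one way to score question salience, assuming a regression head fine-tuned on human salience ratings; the checkpoint name is a placeholder, and this is not the authors' model.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "my-org/question-salience-regressor"  # placeholder, not a real checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=1)  # regression head

def salience_score(context: str, question: str) -> float:
    """Higher = more important for this question to be answered given the context."""
    inputs = tokenizer(context, question, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()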

Greg Durrett (@gregd_nlp)

Check out Liyan's system + benchmark! Strong LLM fact-checking models like MiniCheck will enable response refinement and training for better factuality (work in progress!). LLM-AggreFact collects 10 high-quality labeled datasets of LLM errors from the literature for evaluating these fact-checkers!
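Not MiniCheck's actual interface (see the MiniCheck repo for that); just a generic sketch of document-grounded fact-checking with an off-the-shelf NLI model, where a claim counts as supported if the document entails it.

from transformers import pipeline

# Off-the-shelf NLI model as a stand-in fact-checker (long documents would need chunking).
nli = pipeline("text-classification", model="roberta-large-mnli")

def is_supported(document: str, claim: str, threshold: float = 0.5) -> bool:
    scores = nli({"text": document, "text_pair": claim}, top_k=None)
    entailment = next(s["score"] for s in scores if s["label"] == "ENTAILMENT")
    return entailment >= threshold

print(is_supported("The Eiffel Tower is located in Paris, France.", "The Eiffel Tower is in France."))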

Ryo Kamoi (@RyoKamoi)

📢 New Preprint! Can LLMs detect mistakes in LLM responses?
We introduce ReaLMistake, an error detection benchmark with errors made by GPT-4 & Llama 2.
We evaluated 12 LLMs and show that LLM-based error detectors are unreliable!
w/ Rui Zhang, Wenpeng Yin, Arman Cohan +
arxiv.org/abs/2404.03602
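A bare-bones sketch of the kind of LLM-based error detector being evaluated here (my hypothetical prompt and a generic OpenAI-style chat call, not the benchmark's code).

DETECTOR_PROMPT = """You are checking another model's answer.

Task given to the model:
{task}

Model's response:
{response}

Does the response contain any mistake? Start your answer with "yes" or "no", then explain briefly."""

def detect_error(client, task: str, response: str, model: str = "gpt-4o") -> bool:
    # `client` is assumed to be an OpenAI-compatible chat client (e.g., openai.OpenAI()).
    out = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": DETECTOR_PROMPT.format(task=task, response=response)}],
    )
    return out.choices[0].message.content.strip().lower().startswith("yes")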

Hongli Zhan (@HongliZhan)

🥱Tired of LLMs’ generic “hope you feel better” responses?

🧠Can we dive much deeper and instill cognitive capabilities in them?

Under the right instructions, LLMs (zero-shot) score very highly according to expert psychologist evaluators!

📢arxiv.org/abs/2404.01288

1/🧵

Yekyung Kim (@YekyungKim)

Summarizing long documents (>100K tokens) is a popular use case for LLMs, but how faithful are these summaries? We present FABLES, a dataset of human annotations of faithfulness & content selection in LLM-generated summaries of books.

arxiv.org/abs/2404.01261

🧵below:

Akari Asai @ ICLR2024 🇦🇹 (@AkariAsai)

Greg Durrett Chaitanya Malaviya Abhika Mishra also led a project where we annotated 1k LLM responses (Llama 2 7B & 70B chat and ChatGPT) to diverse instruction-following prompts with span-level hallucinations and hallucination types.
The data is publicly available!
