Dylan X. Hou (@xinminghou) 's Twitter Profile
Dylan X. Hou

@xinminghou

undergrad studying AI at Renmin Univ. of China, NLP researcher, intelligence explorer & trainer, interned @Tencent AI Lab. Carpe Diem🍀

ID: 1549288486435684353

Link: https://dxhou.github.io/ · Joined: 19-07-2022 07:02:19

486 Tweets

496 Followers

2.2K Following

Khanh Nguyen (on job market) (@khanhxuannguyen) 's Twitter Profile Photo

📢 Excited to announce our new paper 

Language-guided world models: A model-based approach to AI control

• We develop LWMs: world models that can read texts to capture new environment dynamics
• These models enable humans to efficiently control agents by providing language
Weijie Su (@weijie444) 's Twitter Profile Photo

New Research (w/ amazing Hangfeng He (@hangfeng_he))

"A Law of Next-Token Prediction in Large Language Models"

LLMs rely on NTP, but their internal mechanisms seem chaotic. It's difficult to discern how each layer processes data for NTP. Surprisingly, we discover a physics-like law on NTP:
Kristina Gligorić (@krisgligoric) 's Twitter Profile Photo

LLMs have been proposed for annotation tasks. But, LLMs are biased and make errors. Can we draw * valid * conclusions from LLM annotations? arxiv.org/abs/2408.15204
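One widely used recipe for the question this tweet raises is prediction-powered inference: annotate the full corpus with the LLM, have humans audit a small random subset, and correct the LLM estimate by the bias measured on that audit set. A minimal sketch under that assumption (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def ppi_mean(llm_all, llm_audited, human_audited):
    """Debiased estimate of a mean label from LLM annotations.

    llm_all:       LLM annotations on the full corpus
    llm_audited:   LLM annotations on a small human-audited subset
    human_audited: gold human labels on that same subset
    """
    # The LLM estimate on everything is cheap but possibly biased;
    # the human-vs-LLM gap on the audit set estimates that bias.
    rectifier = np.mean(np.asarray(human_audited) - np.asarray(llm_audited))
    return float(np.mean(llm_all) + rectifier)
```

The correction keeps the estimate valid even when the LLM labels are systematically off, at the cost of wider confidence intervals when the audit set is small.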

AI at Meta (@aiatmeta) 's Twitter Profile Photo

New research paper from Meta FAIR – Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model.

Chunting Zhou (@violet_zct), Lili Yu (ICLR2025) (@liliyu_lili) and team introduce this recipe for training a multi-modal model over discrete and continuous data. Transfusion combines next token
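The training recipe described here pairs next-token cross-entropy on the discrete (text) stream with a DDPM-style noise-prediction loss on the continuous (image) stream, summed into one objective. A minimal numpy sketch (argument names and the weight `lam` are illustrative, not the paper's exact notation):

```python
import numpy as np

def transfusion_loss(text_logits, text_targets, noise_pred, noise, lam=1.0):
    """One combined training objective over both modalities.

    text_logits:  (N, V) next-token logits for the text positions
    text_targets: (N,) gold next-token ids
    noise_pred, noise: predicted vs. true diffusion noise for image patches
    lam: relative weight of the diffusion term (a hyperparameter)
    """
    # Numerically stable log-softmax, then next-token cross-entropy.
    z = text_logits - text_logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    lm_loss = -logp[np.arange(len(text_targets)), text_targets].mean()
    # Standard epsilon-prediction MSE for the diffusion stream.
    diff_loss = np.mean((noise_pred - noise) ** 2)
    return float(lm_loss + lam * diff_loss)
```

Because both terms backpropagate through one shared transformer, the same weights learn to predict tokens and denoise patches.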
xuan (ɕɥɛn / sh-yen) (@xuanalogue) 's Twitter Profile Photo

Should AI be aligned with human preferences, rewards, or utility functions?

Excited to finally share a preprint that Micah Carroll (@MicahCarroll), Matija (@FranklinMatija), Hal Ashton (@hal_ashton) & I have worked on for almost 2 years, arguing that AI alignment has to move beyond the preference-reward-utility nexus!
Omar Khattab (@lateinteraction) 's Twitter Profile Photo

🔗 Thoughts on Research Impact in AI.

Grad students often ask: how do I do research that makes a difference in the current, crowded AI space?

This is a blogpost that summarizes my perspective in six guidelines for making research impact via open-source artifacts. Link below.
Jillian Fisher (@jrfisher552) 's Twitter Profile Photo

How do biased AI models affect human decision-making? 🤔
Our latest paper, “Biased AI can Influence Political Decision-Making”, uses two interactive tasks which show that exposure to partisan AI can sway opinions—no matter your political stance! 🗳️

Paper: arxiv.org/abs/2410.06415
Rohan Pandey (@rohan99pandey) 's Twitter Profile Photo

1/7 With all the buzz around PhD applications, I've felt that one thing missing from the narrative is the experience of PhDing itself. There's great advice on the application process, but little talk about how it really is.

Nathan Lambert (@natolambert) 's Twitter Profile Photo

One of the first papers studying inference-time personalization. One of the great ways we can make open models better suited to your needs than APIs.

PAD: Personalized Alignment at Decoding-Time
(similar ideas to our social choice position paper from earlier in the year)
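A generic way to read "alignment at decoding time" (a hedged sketch of guided decoding in general, not necessarily PAD's exact algorithm; the per-user reward scores here are hypothetical) is to steer a frozen model's next-token logits with a user-specific score at every step:

```python
import numpy as np

def personalized_step(base_logits, preference_scores, beta=1.0):
    """One greedy decoding step with a user-specific bias.

    base_logits: the frozen model's next-token logits
    preference_scores: per-token scores from a (hypothetical) lightweight
        personalized reward model for this user
    beta: how strongly personalization steers decoding
    """
    # Steer at decode time instead of fine-tuning: shift the logits by a
    # scaled reward signal, then pick the next token as usual.
    steered = base_logits + beta * preference_scores
    return int(np.argmax(steered))
```

Because the base model stays frozen, one checkpoint can serve many users with different reward signals, which is the advantage over per-user fine-tuning.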
Yixin Liu (@yixinliu17) 's Twitter Profile Photo

LLMs are often used to evaluate the instruction-following capabilities of other LLMs – but which LLM should we choose, and how should we use it? 🤔

We're excited to share "ReIFE: Re-evaluating Instruction-Following Evaluation"!  Preprint: arxiv.org/abs/2410.07069

📊 Our study is
Nathan Lambert (@natolambert) 's Twitter Profile Photo

Newest *PO preference tuning paper at least feels substantially different from a lot of the others from earlier in 2024.

TPO: Tree Preference Optimization 
Liao et al

It creates a latent space to rank many options between steps. Like DPO meets tree search / PRMs. With some
Tao Yu (@taoyds) 's Twitter Profile Photo

🍅Excited to see Anthropic (@AnthropicAI) using 🚀our OSWorld🚀 (NeurIPS'24) to benchmark computer use!

🍋OSWorld will soon support parallel cloud running, much faster!

🍓More multimodal agent open-source big projects coming soon from XLANG NLP Lab (@XLangNLP) in Nov - stay tuned!

👇os-world.github.io
Ilia Sucholutsky (@sucholutsky) 's Twitter Profile Photo

So excited to share that this was published in Nature Human Behaviour! 🥳 It's time to build AI thought partners that learn & think *with* people rather than *instead of* people. 🧠🤝🤖 We lay out what that means, why it matters, and how it can be done! nature.com/articles/s4156…

LLM360 (@llm360) 's Twitter Profile Photo

📣Proud to share Web2Code: a Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs was accepted to NeurIPS Conference (@NeurIPSConf) 2024!

About Web2Code:
📸 novel image + html dataset
📈webpage code gen benchmark
🧠CrystalChat-7B-Web2Code

Blog: mbzuai-llm.github.io/webpage2code/
Jing-Jing Li (@drjingjing2026) 's Twitter Profile Photo

🚨 New preprint from my internship at Ai2 (@allen_ai)!

Introducing SafetyAnalyst, an LLM content moderation framework that
📌 builds structured “harm-benefit trees” given a prompt
📌 weighs harms against benefits
📌 delivers interpretable, transparent, and steerable safety decisions
Tao Yu (@taoyds) 's Twitter Profile Photo

🍅Surprising finding: Basic adversarial pop-ups trick state-of-the-art VLMs (e.g., the Anthropic (@AnthropicAI) computer use agent) into clicking 🚩>90%🚩 of the time in OSWorld!

🥝Clear signal: We need more robust safety measures before deploying computer use agents at scale.
Tao Yu (@taoyds) 's Twitter Profile Photo

🤔Static CUA benchmarks enable fast model dev but lack task variety and risk overfitting. 

Computer Agent Arena tests crowdsourced real-world tasks.

OSWorld: 🥇UI-Tars1.5🥈Operator🥉Claude 3.7
CUA Arena: 🥇Claude 3.7🥈Operator🥉UI-Tars1.5

🚀Rankings likely to evolve quickly
DeepSeek (@deepseek_ai) 's Twitter Profile Photo

🚀 DeepSeek-R1-0528 is here!

🔹 Improved benchmark performance
🔹 Enhanced front-end capabilities
🔹 Reduced hallucinations
🔹 Supports JSON output & function calling

✅ Try it now: chat.deepseek.com
🔌 No change to API usage — docs here: api-docs.deepseek.com/guides/reasoni… 🔗

Yongyi Zang (@yongyi_zang) 's Twitter Profile Photo

🚨New Audio Benchmark 🚨We find standard LLMs can solve Music-QA benchmarks by just guessing from text only, + LALMs can still answer well when given noise instead of music!

Presenting RUListening: A fully automated pipeline for making Audio-QA benchmarks *actually* assess