Bill Yuchen Lin 🤖(@billyuchenlin) 's Twitter Profile
Bill Yuchen Lin 🤖

@billyuchenlin

Research @allen_ai. I evaluate (multi-modal) LLMs, build agents, and study the science of LLMs. Previously: @GoogleAI & @MetaAI FAIR @nlp_usc

ID:726053744731807744

Link: http://yuchenlin.xyz · Joined: 29-04-2016 14:21:16

885 Tweets

6.3K Followers

2.0K Following

Thomas Wolf(@Thom_Wolf) 's Twitter Profile Photo

Llama3 was trained on 15 trillion tokens of public data. But where can you find such datasets and recipes??

Here comes the first release of 🍷Fineweb: a high-quality, large-scale filtered web dataset outperforming all current datasets of its scale. We trained 200+ ablation

Xiang Yue(@xiangyue96) 's Twitter Profile Photo

After receiving community feedback, we added @GoogleDeepMind Gemini 1.5 Pro's results. 👇 Gemini 1.5 Pro's vision ability was significantly improved compared to 1.0 Pro and matched GPT-4's performance on our VisualWebBench! 🏆 Its action prediction (e.g., predicting what would
Xiang Yue(@xiangyue96) 's Twitter Profile Photo

🚀Introducing VisualWebBench: A Comprehensive Benchmark for Multimodal Web Page Understanding and Grounding. visualwebbench.github.io

🤔What's this all about? Why this benchmark?
> Back in Nov 2023, when we released MMMU (mmmu-benchmark.github.io), a comprehensive multimodal

Graham Neubig(@gneubig) 's Twitter Profile Photo

There are several LLM benchmarks for web agents, but agents are not the only web application of LLMs. What about more fine-grained web-page understanding?

Our new benchmark VisualWebBench evaluates LLMs on abilities such as OCR, QA, identifying DOM elements, etc.

Bill Yuchen Lin 🤖(@billyuchenlin) 's Twitter Profile Photo

Updates of ⚔️𝕎𝕚𝕝𝕕𝕍𝕚𝕤𝕚𝕠𝕟-𝔸𝕣𝕖𝕟𝕒: We added more models such as @AnthropicAI's Claude3 and @RekaAILabs! Also, many new features for improving user experience and collecting better evaluation data. E.g., we support selecting models for sampling and inputting reasons
Bill Yuchen Lin 🤖(@billyuchenlin) 's Twitter Profile Photo

Anthropic Awesome findings and insights on jailbreaking LLMs! I think that a useful baseline defense method for mitigating many-shot jailbreaking could be our SafeDecoding (linked below). Have you tried that? Btw, if one wants to make it easier, replacing safety fine-tuning with

Gradio(@Gradio) 's Twitter Profile Photo

NEW : 𝐀𝐠𝐞𝐧𝐭🪄𝐋𝐮𝐦𝐨𝐬 is amazing at complex tasks

Lumos is Language Agents with Unified Data Formats, Modular Design, & OS LLMs

Lumos unifies a suite of complex interactive tasks, achieves competitive performance with GPT-4/3.5, OS agents

Task➡️Modular Approach➡️Results

Bill Yuchen Lin 🤖(@billyuchenlin) 's Twitter Profile Photo

DBRX-Base from @databricks also achieves the top position in the URIAL Bench, which tests Base LLMs on the MT-bench with URIAL prompts (3-shot instruction-following examples). Check out the full results here on @huggingface 🤗: huggingface.co/spaces/allenai…

Related Xs:
1️⃣ [URIAL
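A rough sketch of what a URIAL-style 3-shot prompt for a base LLM might look like (the preamble, markers, example pairs, and helper name below are all hypothetical illustrations, not the released URIAL template):

```python
# Hypothetical sketch: URIAL aligns a *base* (untuned) LLM purely via
# in-context examples -- here, 3 instruction-response pairs prepended
# to the new query, so the base model continues in assistant style.
FEWSHOT = [
    ("What is the capital of France?",
     "The capital of France is Paris."),
    ("Give one tip for writing clear code.",
     "Prefer small, well-named functions over long ones."),
    ("How do I boil an egg?",
     "Place the egg in boiling water for 8-10 minutes, then cool it."),
]

def urial_prompt(query, examples=FEWSHOT):
    """Concatenate K instruction-response pairs, then the new query,
    leaving the final answer slot open for the base model to fill."""
    parts = ["Below are conversations between a user and a helpful assistant.\n"]
    for q, a in examples:
        parts.append(f"# Query:\n{q}\n# Answer:\n{a}\n")
    parts.append(f"# Query:\n{query}\n# Answer:\n")
    return "\n".join(parts)
```

The resulting string is fed to the base model as a plain completion prompt; no fine-tuning is involved.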
Bill Yuchen Lin 🤖(@billyuchenlin) 's Twitter Profile Photo

🆕 Check out the recent update of 𝕎𝕚𝕝𝕕𝔹𝕖𝕟𝕔𝕙! We have included a few more models including DBRX-Instruct @databricks and StarlingLM-beta (7B) @NexusflowX which are both super powerful! DBRX-Instruct is indeed the best open LLM; Starling-LM 7B outperforms a lot of even
Da Yin(@Wade_Yin9712) 's Twitter Profile Photo

🪄 𝔸𝕘𝕖𝕟𝕥 𝕃𝕦𝕞𝕠𝕤 is one of the first unified and modular frameworks for training open-source LLM-based agents.

New features:
🤖️Multimodal Reasoning with 𝕃𝕦𝕞𝕠𝕤
🐘 13B-scale 𝕃𝕦𝕞𝕠𝕤 models
🤗 𝕃𝕦𝕞𝕠𝕤 data-explorer demo

@ai2_mosaic @uclanlp

📝:

Costa Huang(@vwxyzjn) 's Twitter Profile Photo

PPO's training curves look like this. Note that several 1B models' KL exploded. From an optimization point of view, there is nothing wrong with them because the RLHF reward kept going up; these 1B models correspond to the 'reward hacking' / over-optimized models.

To
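The over-optimization pattern in these curves can be made concrete with a minimal sketch of the KL-shaped per-token reward commonly used in RLHF PPO (the function name, the value of beta, and the numbers are illustrative, not the exact setup behind these runs):

```python
import numpy as np

def kl_shaped_rewards(rm_score, logp_policy, logp_ref, beta=0.05):
    """Per-token PPO rewards for RLHF: the reward-model score lands on
    the final token, while every token pays a KL penalty that keeps the
    policy near the reference model. If beta is too small, the policy
    can drift far from the reference (KL 'explodes') while the RM reward
    keeps rising -- the reward-hacked / over-optimized regime."""
    # Per-token KL estimate: log pi_policy(token) - log pi_ref(token)
    kl = np.asarray(logp_policy) - np.asarray(logp_ref)
    rewards = -beta * kl
    rewards[-1] += rm_score  # sequence-level RM score on the last token
    return rewards
```

A run where the RM score climbs but the summed KL penalty climbs faster is exactly the "nothing wrong from an optimization point of view" case described above.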

Matthew Finlayson(@mattf1n) 's Twitter Profile Photo

Wanna know gpt-3.5-turbo's embed size? We find a way to extract info from LLM APIs and estimate gpt-3.5-turbo’s embed size to be 4096. With the same trick we also develop 25x faster logprob extraction, audits for LLM APIs, and more!
📄 arxiv.org/abs/2403.09539
Here’s how 1/🧵
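A toy reconstruction of the linear-algebra idea behind the attack (dimensions, the rank threshold, and all numbers here are made up for illustration; the paper's actual API-side procedure differs in detail): a transformer's final logits are W·h with W of shape (vocab, d_embed), so any collection of logit vectors has rank at most d_embed, and collecting enough outputs reveals the hidden size.

```python
import numpy as np

# Simulate an LLM head: logits = W @ h, W of shape (vocab, d_embed).
# Any set of observed logit vectors then lies in a d_embed-dimensional
# subspace, so its numerical rank exposes the (hidden) embedding size.
rng = np.random.default_rng(0)
vocab, d_embed, n_samples = 1000, 64, 200

W = rng.normal(size=(vocab, d_embed))      # unembedding matrix (unknown to attacker)
H = rng.normal(size=(d_embed, n_samples))  # hidden states from many prompts
logits = W @ H                             # (vocab, n_samples): what the API exposes

# Numerical rank via singular values: counts values above a tolerance.
s = np.linalg.svd(logits, compute_uv=False)
rank = int((s > s[0] * 1e-10).sum())       # recovers d_embed
```

In practice one only observes (log)probabilities rather than raw logits, which is part of what the paper's extraction trick works around.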
