Bill Yuchen Lin 🤖(@billyuchenlin) 's Twitter Profile
Bill Yuchen Lin 🤖

@billyuchenlin

Research @allen_ai. I evaluate (multi-modal) LLMs, build agents, and study the science of LLMs. Previously: @GoogleAI & @MetaAI FAIR @nlp_usc

ID:726053744731807744

Link: http://yuchenlin.xyz · Joined: 29-04-2016 14:21:16

885 Tweets

6.3K Followers

2.0K Following

Thomas Wolf(@Thom_Wolf) 's Twitter Profile Photo

Llama3 was trained on 15 trillion tokens of public data. But where can you find such datasets and recipes??

Here comes the first release of 🍷Fineweb: a high-quality, large-scale filtered web dataset outperforming all current datasets of its scale. We trained 200+ ablation

Xiang Yue(@xiangyue96) 's Twitter Profile Photo

After receiving community feedback, we added @GoogleDeepMind Gemini 1.5 Pro's results. 👇 Gemini 1.5 Pro's vision ability was significantly improved compared to 1.0 Pro and matched GPT-4's performance on our VisualWebBench! 🏆 Its action prediction (e.g., predicting what would
Xiang Yue(@xiangyue96) 's Twitter Profile Photo

🚀Introducing VisualWebBench: A Comprehensive Benchmark for Multimodal Web Page Understanding and Grounding. visualwebbench.github.io

🤔What's this all about? Why this benchmark?
> Back in Nov 2023, when we released MMMU (mmmu-benchmark.github.io), a comprehensive multimodal

Graham Neubig(@gneubig) 's Twitter Profile Photo

There are several LLM benchmarks for web agents, but agents are not the only web application of LLMs. What about more fine-grained web-page understanding?

Our new benchmark VisualWebBench evaluates LLMs on abilities such as OCR, QA, identifying DOM elements, etc.

Bill Yuchen Lin 🤖(@billyuchenlin) 's Twitter Profile Photo

Updates of ⚔️𝕎𝕚𝕝𝕕𝕍𝕚𝕤𝕚𝕠𝕟-𝔸𝕣𝕖𝕟𝕒: We added more models such as @AnthropicAI's Claude3 and @RekaAILabs! Also, many new features for improving user experience and collecting better evaluation data. E.g., we support selecting models for sampling and inputting reasons
Bill Yuchen Lin 🤖(@billyuchenlin) 's Twitter Profile Photo

Anthropic Awesome findings and insights on jailbreaking LLMs! I think that a useful baseline defense method for mitigating many-shot jailbreaking could be our SafeDecoding (linked below). Have you tried that? Btw, if one wants to make it easier, replacing safety fine-tuning with

Gradio(@Gradio) 's Twitter Profile Photo

NEW : 𝐀𝐠𝐞𝐧𝐭🪄𝐋𝐮𝐦𝐨𝐬 is amazing at complex tasks

Lumos is Language Agents with Unified Data Formats, Modular Design, & OS LLMs

Lumos unifies a suite of complex interactive tasks, achieves competitive performance with GPT-4/3.5, OS agents

Task➡️Modular Approach➡️Results

Bill Yuchen Lin 🤖(@billyuchenlin) 's Twitter Profile Photo

DBRX-Base from @databricks also achieves the top position in the URIAL Bench, which tests Base LLMs on the MT-bench with URIAL prompts (3-shot instruction-following examples). Check out the full results here on @huggingface 🤗: huggingface.co/spaces/allenai…

Related Xs:
1️⃣ [URIAL
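A rough sketch of what a URIAL-style 3-shot prompt for a base LLM might look like (the preamble, markers, example pairs, and helper name below are all hypothetical illustrations, not the released URIAL template):

```python
# Hypothetical sketch: URIAL aligns a *base* (untuned) LLM purely via
# in-context examples -- here, 3 instruction-response pairs prepended
# to the new query, so the base model continues in assistant style.
FEWSHOT = [
    ("What is the capital of France?",
     "The capital of France is Paris."),
    ("Give one tip for writing clear code.",
     "Prefer small, well-named functions over long ones."),
    ("How do I boil an egg?",
     "Place the egg in boiling water for 8-10 minutes, then cool it."),
]

def urial_prompt(query, examples=FEWSHOT):
    """Concatenate K instruction-response pairs, then the new query,
    leaving the final answer slot open for the base model to fill."""
    parts = ["Below are conversations between a user and a helpful assistant.\n"]
    for q, a in examples:
        parts.append(f"# Query:\n{q}\n# Answer:\n{a}\n")
    parts.append(f"# Query:\n{query}\n# Answer:\n")
    return "\n".join(parts)
```

The resulting string is fed to the base model as a plain completion prompt; no fine-tuning is involved.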
Bill Yuchen Lin 🤖(@billyuchenlin) 's Twitter Profile Photo

🆕 Check out the recent update of 𝕎𝕚𝕝𝕕𝔹𝕖𝕟𝕔𝕙! We have included a few more models including DBRX-Instruct @databricks and StarlingLM-beta (7B) @NexusflowX which are both super powerful! DBRX-Instruct is indeed the best open LLM; Starling-LM 7B outperforms a lot of even
Da Yin(@Wade_Yin9712) 's Twitter Profile Photo

🪄 𝔸𝕘𝕖𝕟𝕥 𝕃𝕦𝕞𝕠𝕤 is one of the first unified and modular frameworks for training open-source LLM-based agents.

New features:
🤖️Multimodal Reasoning with 𝕃𝕦𝕞𝕠𝕤
🐘 13B-scale 𝕃𝕦𝕞𝕠𝕤 models
🤗 𝕃𝕦𝕞𝕠𝕤 data-explorer demo

@ai2_mosaic @uclanlp

📝:

Costa Huang(@vwxyzjn) 's Twitter Profile Photo

PPO's training curves look like this. Note that several 1B models' KL exploded. From an optimization point of view, there is nothing wrong with them because the RLHF reward kept going up; these 1B models correspond to the 'reward hacking' / over-optimized models.

To
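The over-optimization pattern in these curves can be made concrete with a minimal sketch of the KL-shaped per-token reward commonly used in RLHF PPO (the function name, the value of beta, and the numbers are illustrative, not the exact setup behind these runs):

```python
import numpy as np

def kl_shaped_rewards(rm_score, logp_policy, logp_ref, beta=0.05):
    """Per-token PPO rewards for RLHF: the reward-model score lands on
    the final token, while every token pays a KL penalty that keeps the
    policy near the reference model. If beta is too small, the policy
    can drift far from the reference (KL 'explodes') while the RM reward
    keeps rising -- the reward-hacked / over-optimized regime."""
    # Per-token KL estimate: log pi_policy(token) - log pi_ref(token)
    kl = np.asarray(logp_policy) - np.asarray(logp_ref)
    rewards = -beta * kl
    rewards[-1] += rm_score  # sequence-level RM score on the last token
    return rewards
```

A run where the RM score climbs but the summed KL penalty climbs faster is exactly the "nothing wrong from an optimization point of view" case described above.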

Matthew Finlayson(@mattf1n) 's Twitter Profile Photo

Wanna know gpt-3.5-turbo's embed size? We find a way to extract info from LLM APIs and estimate gpt-3.5-turbo’s embed size to be 4096. With the same trick we also develop 25x faster logprob extraction, audits for LLM APIs, and more!
📄 arxiv.org/abs/2403.09539
Here’s how 1/🧵
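A toy reconstruction of the linear-algebra idea behind the attack (dimensions, the rank threshold, and all numbers here are made up for illustration; the paper's actual API-side procedure differs in detail): a transformer's final logits are W·h with W of shape (vocab, d_embed), so any collection of logit vectors has rank at most d_embed, and collecting enough outputs reveals the hidden size.

```python
import numpy as np

# Simulate an LLM head: logits = W @ h, W of shape (vocab, d_embed).
# Any set of observed logit vectors then lies in a d_embed-dimensional
# subspace, so its numerical rank exposes the (hidden) embedding size.
rng = np.random.default_rng(0)
vocab, d_embed, n_samples = 1000, 64, 200

W = rng.normal(size=(vocab, d_embed))      # unembedding matrix (unknown to attacker)
H = rng.normal(size=(d_embed, n_samples))  # hidden states from many prompts
logits = W @ H                             # (vocab, n_samples): what the API exposes

# Numerical rank via singular values: counts values above a tolerance.
s = np.linalg.svd(logits, compute_uv=False)
rank = int((s > s[0] * 1e-10).sum())       # recovers d_embed
```

In practice one only observes (log)probabilities rather than raw logits, which is part of what the paper's extraction trick works around.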
