Evan (@evan_a_frick)'s Twitter Profile
Evan

@evan_a_frick

CS at Berkeley.
ML Research @lmarena_ai @berkeley_ai
ML Engineer @NexusflowX

ID: 1782173693378236416

Link: https://efrick2002.github.io/ | Joined: 21-04-2024 22:25:04

29 Tweets

82 Followers

44 Following

Banghua Zhu (@banghuaz):

🔍 Which reward model characteristics best predict RLHF performance?

We evaluated RMs &amp; LLM-judges on:
- Human preference agreement on Chatbot Arena
- Accuracy in selecting correct code/math answers
- Correlation with Chatbot Arena rankings

Interesting finding: Lower-bound

lmarena.ai (formerly lmsys.org) (@lmarena_ai):

🚨New Chatbot Arena Category: Creative Writing Arena!

Creative writing (~15% of votes) involves originality and artistic expression, and often differs from technical prompts.

Key Findings:
- o1-Mini drops below top models
- Gemini 1.5 Pro/Flash 002 both UP significantly
-
lmarena.ai (formerly lmsys.org) (@lmarena_ai):

Congrats Nexusflow on the latest Athene-V2-72B release, matching top models across hard benchmarks! Now comes the real test: Athene is live in the Arena for human evaluation. Come ask tough prompts at lmarena.ai!

lmarena.ai (formerly lmsys.org) (@lmarena_ai):

Exciting update from Chatbot Arena!

Athene-V2-Chat-72B by Nexusflow (@NexusflowX) debuts as the best open model, matching proprietary models like GPT-4o/Sonnet in technical domains (e.g., math, coding, hard prompts)!

Category ranking:
- Math: #3
- Coding: #7
- Hard Prompt: #6
- Overall: #10
lmarena.ai (formerly lmsys.org) (@lmarena_ai):

Introducing Prompt-to-leaderboard (P2L): a real-time LLM leaderboard tailored exactly to your use case! P2L trains an LLM to generate "prompt-specific" leaderboards, so you can input a prompt and get a leaderboard specifically for that prompt. The model is trained on the 2M

Anastasios Nikolas Angelopoulos (@ml_angelopoulos):

Prompt-to-Leaderboard is one of my favorite projects ever. "Which LLM is best for me and my use-case?" We train an LLM to take in prompts and output a vector of BT regression coefficients: one per model. By converting evaluation into learning, we benefit from scaling laws in
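A minimal sketch of the Bradley-Terry readout described in the tweet above may help. Everything here is illustrative: `p2l_coefficients` is a hypothetical stand-in for the trained P2L model and the model names are placeholders; only the sigmoid win-probability formula is the standard Bradley-Terry relation.

```python
import numpy as np

# Illustrative sketch of a prompt-specific Bradley-Terry leaderboard.
# Placeholder model names, not the real Arena roster.
MODELS = ["model-a", "model-b", "model-c"]

def p2l_coefficients(prompt: str) -> np.ndarray:
    """Hypothetical stand-in: the real P2L system trains an LLM to map a
    prompt to one Bradley-Terry regression coefficient per model."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.normal(size=len(MODELS))

def win_probability(theta: np.ndarray, i: int, j: int) -> float:
    """Bradley-Terry: P(model i beats model j) = sigmoid(theta_i - theta_j)."""
    return float(1.0 / (1.0 + np.exp(-(theta[i] - theta[j]))))

prompt = "Evaluate this MIT Integration Bee quarter-final integral."
theta = p2l_coefficients(prompt)
ranking = np.argsort(-theta)  # prompt-specific leaderboard: sort by coefficient

for rank, k in enumerate(ranking, start=1):
    print(f"#{rank} {MODELS[k]}  theta={theta[k]:+.3f}")

best, runner_up = ranking[0], ranking[1]
print(f"P({MODELS[best]} beats {MODELS[runner_up]}) = "
      f"{win_probability(theta, best, runner_up):.2f}")
```

Sorting the coefficients recovers the per-prompt leaderboard, and the sigmoid of any coefficient difference gives a predicted pairwise win rate, which is what turns a regression trained on preference pairs into a ranking.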

Tianle (Tim) Li (@litianleli):

[NEWEST TECH]

I asked Chatbot Arena's Prompt-to-Leaderboard for the model ranking on a super difficult quarter-final question from the MIT Integration Bee 🐝. I also added "Do Not Explain, only output the final answer!" to the end of the prompt.

See the equation in 🧵
Tianle (Tim) Li (@litianleli):

P2L, trained on 1.5M Chatbot Arena pairwise preference pairs, somehow "understands" that these thinking models hide their CoT. My challenge to you: can you find a harder prompt? 👉 lmarena.ai/p2l

Tianle (Tim) Li (@litianleli):

You cannot deny that we built a unique evaluation platform. It sheds light on models like GPT-4.5 in ways that aren't well captured by other benchmarks. GPT-4.5 has some human-like intelligence, which makes it a great model; in multi-turn it is certainly #1 in my experience.

Tianle (Tim) Li (@litianleli):

🚨 Arena-Hard-v2.0 is here! 🚨

Major Improvements:
- Better Automatic Judges (Gemini-2.5 & GPT-4.1) 🦾
- 500 Fresh Prompts from LMArena🗿
- Tougher Baselines 🏋️
- Multilingual (30+ Langs) 🌎
- Plus Eval for Creative Writing ✍️

Test your model on the hardest prompts from LMArena!
lmarena.ai (formerly lmsys.org) (@lmarena_ai):

📢 We're excited to share that we've raised $100M in seed funding to support LMArena and continue our research on reliable AI. Led by a16z and UC Investments (University of California), we're proud to have the support of those who believe in both the science and the mission. We're