Evan (@evan_a_frick)'s Twitter Profile
Evan

@evan_a_frick

CS at Berkeley.
ML Research @lmarena_ai @berkeley_ai
ML Engineer @NexusflowX

ID: 1782173693378236416

Link: https://efrick2002.github.io/ | Joined: 21-04-2024 22:25:04

29 Tweets

82 Followers

44 Following

Banghua Zhu (@banghuaz):

🔍 Which reward model characteristics best predict RLHF performance?

We evaluated RMs &amp; LLM-judges on:
- Human preference agreement on Chatbot Arena
- Accuracy in selecting correct code/math answers
- Correlation with Chatbot Arena rankings

Interesting finding: Lower-bound

lmarena.ai (formerly lmsys.org) (@lmarena_ai):

🚨New Chatbot Arena Category: Creative Writing Arena!

Creative writing (~15% of votes) involves originality and artistic expression, and often differs from technical prompts.

Key Findings:
- o1-Mini drops below top models
- Gemini 1.5 Pro/Flash 002 both UP significantly
-
lmarena.ai (formerly lmsys.org) (@lmarena_ai):

Congrats Nexusflow on the latest Athene-V2-72B release, matching top models across hard benchmarks! Now comes the real test: Athene is live in the Arena for human evaluation. Come ask tough prompts at lmarena.ai!

lmarena.ai (formerly lmsys.org) (@lmarena_ai):

Exciting update from Chatbot Arena!

Athene-V2-Chat-72B by Nexusflow (@NexusflowX) debuts as the best open model, matching proprietary models like GPT-4o/Sonnet in technical domains (e.g., math, coding, hard prompts)!

Category ranking:
- Math: #3
- Coding: #7
- Hard Prompt: #6
- Overall: #10
lmarena.ai (formerly lmsys.org) (@lmarena_ai):

Introducing Prompt-to-leaderboard (P2L): a real-time LLM leaderboard tailored exactly to your use case! P2L trains an LLM to generate "prompt-specific" leaderboards, so you can input a prompt and get a leaderboard specifically for that prompt. The model is trained on the 2M

Anastasios Nikolas Angelopoulos (@ml_angelopoulos):

Prompt-to-Leaderboard is one of my favorite projects ever. "Which LLM is best for me and my use-case?" We train an LLM to take in prompts and output a vector of BT regression coefficients: one per model. By converting evaluation into learning, we benefit from scaling laws in
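A minimal sketch of the Bradley-Terry readout described in the tweet above may help. Everything here is illustrative: `p2l_coefficients` is a hypothetical stand-in for the trained P2L model and the model names are placeholders; only the sigmoid win-probability formula is the standard Bradley-Terry relation.

```python
import numpy as np

# Illustrative sketch of a prompt-specific Bradley-Terry leaderboard.
# Placeholder model names, not the real Arena roster.
MODELS = ["model-a", "model-b", "model-c"]

def p2l_coefficients(prompt: str) -> np.ndarray:
    """Hypothetical stand-in: the real P2L system trains an LLM to map a
    prompt to one Bradley-Terry regression coefficient per model."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.normal(size=len(MODELS))

def win_probability(theta: np.ndarray, i: int, j: int) -> float:
    """Bradley-Terry: P(model i beats model j) = sigmoid(theta_i - theta_j)."""
    return float(1.0 / (1.0 + np.exp(-(theta[i] - theta[j]))))

prompt = "Evaluate this MIT Integration Bee quarter-final integral."
theta = p2l_coefficients(prompt)
ranking = np.argsort(-theta)  # prompt-specific leaderboard: sort by coefficient

for rank, k in enumerate(ranking, start=1):
    print(f"#{rank} {MODELS[k]}  theta={theta[k]:+.3f}")

best, runner_up = ranking[0], ranking[1]
print(f"P({MODELS[best]} beats {MODELS[runner_up]}) = "
      f"{win_probability(theta, best, runner_up):.2f}")
```

Sorting the coefficients recovers the per-prompt leaderboard, and the sigmoid of any coefficient difference gives a predicted pairwise win rate, which is what turns a regression trained on preference pairs into a ranking.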

Tianle (Tim) Li (@litianleli):

[NEWEST TECH]

I asked Chatbot Arena's Prompt-to-Leaderboard for the model ranking on a super difficult quarter-final question from the MIT Integration Bee 🐝. I also added "Do Not Explain, only output the final answer!" to the end of the prompt.

See the equation in 🧵
Tianle (Tim) Li (@litianleli):

P2L, trained on 1.5M Chatbot Arena pairwise preference pairs, somehow "understands" that these thinking models hide their CoT. My challenge to you: can you find a harder prompt? 👉 lmarena.ai/p2l

Tianle (Tim) Li (@litianleli):

You cannot deny that we built a unique evaluation platform. It sheds light on models like GPT-4.5 in ways that aren't well captured by other benchmarks. GPT-4.5 has some human-like intelligence, which makes it a great model; in multi-turn it is certainly #1 in my experience.

Tianle (Tim) Li (@litianleli):

🚨 Arena-Hard-v2.0 is here! 🚨

Major Improvements:
- Better Automatic Judges (Gemini-2.5 & GPT-4.1) 🦾
- 500 Fresh Prompts from LMArena🗿
- Tougher Baselines 🏋️
- Multilingual (30+ Langs) 🌎
- Plus Eval for Creative Writing ✍️

Test your model on the hardest prompts from LMArena!
lmarena.ai (formerly lmsys.org) (@lmarena_ai):

📢 We're excited to share that we've raised $100M in seed funding to support LMArena and continue our research on reliable AI. Led by a16z and UC Investments (University of California), we're proud to have the support of those who believe in both the science and the mission. We're