Sebastien Bubeck (@SebastienBubeck)'s Twitter Profile
Sebastien Bubeck

@SebastienBubeck

VP GenAI Research, Microsoft AI

ID: 452384386

Website: http://sbubeck.com · Joined: 01-01-2012 19:44:13

1.4K Tweets

34.8K Followers

1.3K Following

Sebastien Bubeck (@SebastienBubeck)

Amazing work on these new benchmarks, keep them coming!!! And notice our little phi-3-mini (3.8B) ahead of 34B models :-). Quite curious to see where phi-3-medium (14B) lands!

Sebastien Bubeck (@SebastienBubeck)

Check out this video if you want to learn more about phi-3!

And yes, yesterday's TikZ unicorn is by phi-3 :-) (14B model)

youtube.com/watch?v=rW9bAi…

Dimitris Papailiopoulos (@DimitrisPapail)

The most surprising finding of this report is hidden in the appendix. Under the best of two prompts, the models don't overfit that much, unlike what the abstract claims.

Here is the original GSM8k vs GSM1k score scatter plot vs the best of two prompts (standard vs CoT-like).

Sebastien Bubeck (@SebastienBubeck)

I'm super excited by the new eval released by Scale AI! They developed an alternative set of 1k GSM8k-like examples that no model has ever seen. Here are the numbers with the alt format (appendix C):

GPT-4-turbo: 84.9%
phi-3-mini: 76.3%

Pretty good for a 3.8B model :-).
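
(For context on how numbers like these are typically produced: GSM8k ground-truth answers end with a "#### <number>" marker, and most eval harnesses score a completion by extracting its final number and comparing it to that marker. The sketch below is a minimal, generic version of that scoring loop in Python; it does not reproduce the Scale AI prompt templates or the appendix C "alt format".)

import re

def extract_final_number(completion: str) -> str | None:
    # GSM8k gold answers end with "#### <number>"; look for the same
    # marker in the model output, falling back to the last number found.
    match = re.search(r"####\s*(-?[\d,.]+)", completion)
    if match:
        return match.group(1)
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None

def normalize(number: str) -> str:
    return number.replace(",", "").rstrip(".")

def exact_match_accuracy(completions: list[str], gold: list[str]) -> float:
    # Fraction of problems whose extracted final answer matches the gold answer.
    hits = sum(
        1
        for c, g in zip(completions, gold)
        if (pred := extract_final_number(c)) is not None
        and normalize(pred) == normalize(g)
    )
    return hits / len(gold)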

Maksym Andriushchenko 🇺🇦 (@maksym_andr)

wow, perhaps the most interesting message of arxiv.org/abs/2405.00332 is not that phi3 overfits (under some prompting template), but that it performs so well for its size even on a held-out dataset!

Min Choi (@minchoi)

Llama 3 surprised everyone less than a week ago, but Microsoft just dropped Phi-3, and it's an incredibly capable small AI model.

We may soon see 7B models that can beat GPT-4. People are already coming up with incredible use cases.

10 wild examples:

Ashpreet Bedi (@ashpreetbedi)

🧙 RAG with Phi-3 on ollama: I don't trust the benchmarks, so I recorded my very first test run. Completely unedited, with each question asked for the first time. First impression is that it is good, very, very good for its size.

Try it yourself: git.new/localrag
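
(The linked repo isn't reproduced here, but for readers who want a feel for what "RAG with Phi-3 on ollama" involves, here is a minimal sketch using the ollama Python client. The model tags "phi3" and "nomic-embed-text" and the toy document list are assumptions for illustration, and it presumes a local ollama server with those models pulled.)

import ollama  # pip install ollama; assumes a local ollama server is running

# Toy corpus; a real setup would chunk and index actual documents.
documents = [
    "phi-3-mini is a 3.8B-parameter small language model from Microsoft.",
    "Retrieval-augmented generation (RAG) grounds answers in retrieved text.",
]

def embed(text: str) -> list[float]:
    # Assumes `ollama pull nomic-embed-text` has been run.
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sum(x * x for x in a) ** 0.5 * sum(y * y for y in b) ** 0.5
    return dot / norm

doc_vectors = [embed(d) for d in documents]

def ask(question: str) -> str:
    # Retrieve the closest document and pass it to phi-3 as context.
    q = embed(question)
    best_doc = max(zip(doc_vectors, documents), key=lambda p: cosine(p[0], q))[1]
    reply = ollama.chat(
        model="phi3",  # assumes `ollama pull phi3`
        messages=[{
            "role": "user",
            "content": f"Answer using only this context:\n{best_doc}\n\nQuestion: {question}",
        }],
    )
    return reply["message"]["content"]

print(ask("How many parameters does phi-3-mini have?"))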

Sebastien Bubeck (@SebastienBubeck)

phi-3 is here, and it's ... good :-).

I made a quick short demo to give you a feel of what phi-3-mini (3.8B) can do. Stay tuned for the open weights release and more announcements tomorrow morning!

(And ofc this wouldn't be complete without the usual table of benchmarks!)
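
(Once the weights are public, trying the model locally should take only a few lines with Hugging Face transformers. The sketch below assumes a release under an ID like microsoft/Phi-3-mini-4k-instruct, which is a guess for illustration, not something stated in the tweet.)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed ID; check the actual release

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True
)

# Reuse the chat template the model ships with and ask for a TikZ unicorn,
# in the spirit of the demo mentioned above.
messages = [{"role": "user", "content": "Draw a unicorn in TikZ."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))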

Microsoft (@Microsoft)

We're excited to announce the launch of Phi-3, a groundbreaking family of small language models that outperform larger models on a range of benchmarks. Learn how these small language models trained on high-quality data are doing more with less: msft.it/6010YHP32
