Sam Paech (@sam_paech) Twitter Tweets • TwiCopy

Sam Paech

@sam_paech

AI tinkerer, maintainer of EQ-Bench

ID: 704133968

Link: https://eqbench.com

Joined: 19-07-2012 01:34:16

726 Tweets

1.1K Followers

145 Following

Sam Paech

@sam_paech

6 months ago

Been seeing some chatter that the new mistral small 3.2 writes a lot like deepseek v3. This analysis of their slop profiles confirms. I think the network representation here makes a bit more sense than the phylo tree, given the complicated nature of model lineages.

26 likes · 1 reply · 1 repost

Sam Paech

@sam_paech

6 months ago

New models benched: baidu/ERNIE-4.5-300B-A47B-PT and openrouter/cypher-alpha:free. This mystery model is an odd one.

59 likes · 3 replies · 5 reposts

Sam Paech

@sam_paech

6 months ago

For a change of pace: some benchmarks Grok-4 didn't get SOTA on.

108 likes · 10 replies · 2 reposts

Sam Paech

@sam_paech

5 months ago

Let me spruik Judgemark real quick cuz I think it's neat: It works by measuring *separability* -- the ability to pin down writing ability in a blind test. The evaluated judge grades a set of models' writing outputs and we calc the error bar overlap. Less overlap: better judge.

36 likes · 3 replies · 0 reposts
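
The separability idea in the Judgemark tweet above can be illustrated with a rough sketch. This is not Judgemark's actual implementation: the function names, the normal-approximation 95% interval, the all-pairs averaging, and the sample scores are all assumptions made for illustration. It only shows the general shape of "grade each model's outputs, build an error bar per model, measure how much the error bars overlap".

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev


def confidence_interval(scores, z=1.96):
    """Approximate 95% error bar around the mean score (assumed method)."""
    m = mean(scores)
    half_width = z * stdev(scores) / sqrt(len(scores))
    return (m - half_width, m + half_width)


def interval_overlap(a, b):
    """Length of the overlap between two intervals; 0.0 if they are disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def mean_ci_overlap(scores_by_model, z=1.96):
    """Average pairwise error-bar overlap across models. Lower means the
    judge separates the models more cleanly."""
    intervals = [confidence_interval(s, z) for s in scores_by_model.values()]
    pairs = list(combinations(intervals, 2))
    return sum(interval_overlap(a, b) for a, b in pairs) / len(pairs)


# Hypothetical scores that two judges gave to the same three models.
sharp_judge = {
    "model_a": [8.0, 8.1, 7.9, 8.0],
    "model_b": [5.0, 5.1, 4.9, 5.0],
    "model_c": [2.0, 2.1, 1.9, 2.0],
}
noisy_judge = {
    "model_a": [8.0, 4.0, 6.0, 7.0],
    "model_b": [5.0, 7.0, 3.0, 6.0],
    "model_c": [2.0, 6.0, 5.0, 4.0],
}

print("sharp judge overlap:", mean_ci_overlap(sharp_judge))
print("noisy judge overlap:", mean_ci_overlap(noisy_judge))
```

Under these assumptions the consistent judge yields disjoint error bars, while the noisy judge's intervals overlap, so the lower number marks the better judge.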