Niels Mündler (@nielstron) 's Twitter Profile
Niels Mündler

@nielstron

Computer scientist. PhDing at @eth. Formal verification, Language Models. Compiling Python to FP @OpShinDev. Ex-Founder.

ID: 923755284

Link: http://blog.nielstron.de
Joined: 03-11-2012 18:24:49

1.1K Tweets

551 Followers

292 Following

Nikola Jovanović @ ICLR 🇸🇬 (@ni_jovanovic) 's Twitter Profile Photo

MathArena goes visual: We evaluated models such as GPT-5 on Math Kangaroo 2025, a recent contest for ages 6-19 where most tasks require visual reasoning. Models struggle the most with tasks for younger kids. For example, they get this task for 1st graders only 3% of the time 🧵

Niels Mündler (@nielstron) 's Twitter Profile Photo

It's a shame, because I really like the approach of this paper, but why did they include a prompt injection in the arXiv version? public shame :(

miru (@miru_why) 's Twitter Profile Photo

@niklassheth @ronusedh @IntologyAI their 'superhuman' AI cleverly assigned all the work to non-default streams, which means the correctness test (which waits on all streams) passes, while the profiling timer (which only waits on the default stream) is tricked into reporting a huge speedup
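
The timing flaw described in this tweet can be sketched with a rough analogy. This is not CUDA code: a plain Python thread stands in for a non-default stream, and every name here (`kernel`, `run_benchmark`, `work_s`) is hypothetical. The point it tries to show is only that a timer which stops before the side work is awaited reports far less time than the work actually takes.

```python
import threading
import time


def kernel(duration_s, done):
    """Stand-in for GPU work queued on a non-default stream (assumption)."""
    time.sleep(duration_s)
    done.set()


def run_benchmark(work_s=0.2):
    done = threading.Event()
    side_stream = threading.Thread(target=kernel, args=(work_s, done))

    start = time.perf_counter()
    side_stream.start()  # asynchronous launch: returns immediately

    # Flawed timer: the "default stream" has no work queued, so waiting on it
    # alone returns at once and the clock stops while the kernel still runs.
    flawed_elapsed = time.perf_counter() - start

    # Correctness check (and a sound timer): wait for every stream first.
    side_stream.join()
    full_elapsed = time.perf_counter() - start

    return flawed_elapsed, full_elapsed, done.is_set()


if __name__ == "__main__":
    flawed, full, finished = run_benchmark()
    print(f"flawed timer: {flawed:.4f}s, full wait: {full:.4f}s, finished: {finished}")
```

In this sketch the result is still correct once everything has been awaited, which is why a correctness test that waits on all streams would pass while the flawed timer shows a large apparent speedup.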

Niels Mündler (@nielstron) 's Twitter Profile Photo

two weeks into #iclr rebuttal, so far 1 desk reject, 1 withdrawal and 1 reviewed paper that actually submitted a rebuttal :( to be clear, the remainder are *not* clear accepts.

Niels Mündler (@nielstron) 's Twitter Profile Photo

there are many good reasons to reject a paper anonymously.
there are no good reasons to change your score after being deanonymized.

Niels Mündler (@nielstron) 's Twitter Profile Photo

I'll be in SF for a week from Dec. 8, and would love to learn about any and all problems you are facing when using LLMs (in particular for code). DM me if you'd like to grab an ice cream and chat 🍨