Tom Hope (@hoper_tom)'s Twitter Profile
Tom Hope

@hoper_tom

Assistant professor and research scientist at AI2 | boosting scientific discovery with AI, NLP, IR, KG, HCI

ID: 282251041

http://www.linkedin.com/in/tom-hope-1433a228 | Joined: 14-04-2011 21:11:44

421 Tweets

1.1K Followers

1.1K Following

Sander van der Linden (@sander_vdlinden)'s Twitter Profile Photo

A new breed of elite misinformers has emerged: They wield power without displaying commensurate responsibility. If you want to understand the fantasy-industrial complex, my review of Renee DiResta's excellent new book *Invisible Rulers* for nature 👇 nature.com/articles/d4158…

Tom Hope (@hoper_tom)'s Twitter Profile Photo

Awesome. You should plug this in with the AI Scientist that does better research than Stanford experts. The combination is expected to lead to Nobel prize inventions by the end of this month (as predicted by Superhuman Forecaster Scientist).

Tom Hope (@hoper_tom)'s Twitter Profile Photo

Herbert Simon (OG AI Scientist pioneer), from a paper some 50 years ago.
Quite a bit of work ahead for all the AI Scientists out there to produce useful and novel instances of 1-7 on a regular basis (let’s start with just *one* real-world example though).

Colin Fraser (@colin_fraser)'s Twitter Profile Photo

OK so what do I think about this? I think, in my opinion, as an amateur paper reader, and with all due respect to the authors, it has a lot of problems. A lot of problems. Especially if you want to interpret the findings as "LLMs can generate more novel ideas than humans".

Colin Fraser (@colin_fraser)'s Twitter Profile Photo

They claim that, in the authors' opinion, the contents of the original ideas were preserved... but were they? You're asking me to have a lot of faith in Claude's ability to re-write a proposal in a way that perfectly preserves its "novelty" and "excitement".

Colin Fraser (@colin_fraser)'s Twitter Profile Photo

All of the AI problem statements are longer than all of the human problem statements, and almost all of the AI titles are longer than the human titles. The AI text uses flowery language, talking about "constellations" and "digital divides"

Colin Fraser (@colin_fraser)'s Twitter Profile Photo

4. The AI is just having the same idea over and over again.
Looking at the titles again, notice how similar the AI-generated ones are:

[3 vaguely physics-sounding words]: [Doing something] [through/for] [something flowery]

Colin Fraser (@colin_fraser)'s Twitter Profile Photo

Maybe each of these is a novel idea, in a way, but presented back to back like this the perception of novelty starts to fade a bit. It has like one basic idea. Make a graph. And maybe that's like a good idea that can be applied in lots of places. But it's one idea.

Tom Hope (@hoper_tom)'s Twitter Profile Photo

Thanks Sam Rodriques Andrew White 🐦‍⬛ ! Two quick questions for now:
1. Is perplexity pro also “super”-“human” per the plot? And PaperQA is thus strong AGI / super-super human?
2. Why the focus on *multiple choice* questions (with just 4 choices!), which is far from how lit review is done?

(((ل()(ل() 'yoav))))👾 (@yoavgo)'s Twitter Profile Photo

Very good thread by Colin on the "LLMs generate novel research ideas" paper. Do read! I will also add a meta-comment, which Colin omitted: I only skimmed the paper, and focused instead on reading the examples of human generated and LLM-generated "research ideas". They both suck.

Ashwinee Panda (@pandaashwinee)'s Twitter Profile Photo

Very good review of the paper. I was very surprised to see that Claude was rewriting the human ideas, and the examples given are sufficient to convince me that this step is making the rewritten human ideas appear worse.

Christian Wolf (@chriswolfvision)'s Twitter Profile Photo

#CVPR2025 changes: "If a reviewer is flagged by an Area Chair as “highly irresponsible”, their paper submissions will be desk rejected per the discretion of the PCs" 👏👏👏 cvpr.thecvf.com/Conferences/20…

Cyril Zakka, MD (@cyrilzakka)'s Twitter Profile Photo

I mean no disrespect to any of the authors in this thread but the claims being made are wild and/or taken out of context. The AgentClinic dataset is based on an online dataset (read contamination) I helped create, from a resource aimed at helping medical students study for the

Adam Rodman (@adamrodmanmd)'s Twitter Profile Photo

Cyril Zakka, MD The sooner these QA databases die for any sort of meaningful medical benchmarking, the better. (that being said, I'm quite impressed at o1's ability at diagnosis)

Wenhao Yu (@wyu_nd)'s Twitter Profile Photo

💡Introducing DSBench: a challenging benchmark to evaluate LLM systems on real-world data science problems. GPT-4o scores only 28% accuracy, while humans achieve 66%. A clear gap, but an exciting challenge for AI advancement! 🧐

Paper: arxiv.org/abs/2409.07703
Project led by our

Ben Bogin (@ben_bogin)'s Twitter Profile Photo

📢 New Benchmark: SUPER for Setting UP and Executing tasks from Research repositories

Reproducibility is crucial in science. We introduce SUPER to evaluate LLMs' capabilities in autonomously running experiments from research repositories. ⬇️

arxiv.org/pdf/2409.07440
