Lili (@lchen915) 's Twitter Profile
Lili

@lchen915

Ph.D. student @mldcmu. Previously undergrad @berkeley_ai

ID: 1361818686512766976

Link: http://lilichen.me · Joined: 16-02-2021 23:24:18

56 Tweets

831 Followers

334 Following

Deepak Pathak (@pathak2206) 's Twitter Profile Photo

Dhruv Batra The story is not as black and white as the tweet claims: should one use the same prompts for all models (what the papers did) or tune the prompt separately for each model (what the blog does; much harder to reproduce)? We already highlighted this in the paper and have a baseline to address

Haoyu Xiong (@haoyu_xiong_) 's Twitter Profile Photo

Your bimanual manipulators might need a Robot Neck 🤖🦒 Introducing Vision in Action: Learning Active Perception from Human Demonstrations. ViA learns task-specific, active perceptual strategies, such as searching, tracking, and focusing, directly from human demos, enabling robust

Mihir Prabhudesai (@mihirp98) 's Twitter Profile Photo

1/ Maximizing confidence indeed improves reasoning. We worked with Shashwat Goel, Nikhil Chandak, and Ameya P. for the past 3 weeks (over a Zoom call and many emails!) and revised our evaluations to align with their suggested prompts/parsers/sampling params. This includes changing

Mihir Prabhudesai (@mihirp98) 's Twitter Profile Photo

Shashwat Goel Nikhil Chandak Ameya P. 2/ We also tried our best to compare against concurrent work in the unsupervised RL space. We find that using confidence as a reward performs better than random rewards (though spurious rewards do indeed improve performance). We also find entropy minimization (reverse KL)

Shashwat Goel (@shashwatgoel7) 's Twitter Profile Photo

Glad we could together improve the scientific discourse around reasoning. Was great to see the authors reach out and incorporate all our feedback!

Mihir Prabhudesai (@mihirp98) 's Twitter Profile Photo

🚨 The era of infinite internet data is ending. So we ask: 👉 What’s the right generative modelling objective when data, not compute, is the bottleneck? TL;DR: ▶️ Compute-constrained? Train Autoregressive models. ▶️ Data-constrained? Train Diffusion models. Get ready for 🤿 1/n

Mihir Prabhudesai (@mihirp98) 's Twitter Profile Photo

In RENT, we showed LLMs can improve without access to answers, by maximizing confidence. In this work, we go further: LLMs can improve without even having the questions. Using self-play, one LLM learns to ask challenging questions, while the other LLM uses confidence to solve them.
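The "maximize confidence" idea in the tweet above can be sketched in a few lines. This is a rough illustration, not the exact RENT objective: it scores a sampled answer by the negative mean per-token entropy of the model's predictive distributions, so no ground-truth answer is needed. The function name, tensor shapes, and torch-based setup are assumptions made for the example.

# Hedged sketch of a confidence-style reward for unsupervised RL on LLM outputs.
# Not the exact RENT objective: it only illustrates "maximize confidence" by
# scoring a sampled answer with the negative mean per-token entropy of the
# model's predictive distributions, so no ground-truth answer is required.
import torch
import torch.nn.functional as F

def confidence_reward(logits: torch.Tensor) -> torch.Tensor:
    # logits: (seq_len, vocab_size) for the tokens the model generated
    log_probs = F.log_softmax(logits, dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (seq_len,)
    return -token_entropy.mean()  # higher reward = more confident (lower entropy)

# Toy usage: random logits stand in for a model's outputs on one sampled answer;
# the resulting scalar would feed a policy-gradient update (e.g., GRPO/PPO).
print(float(confidence_reward(torch.randn(16, 32000))))

In the self-play setup the tweet describes, one would presumably reward the question-asking LLM for generating problems near the solver's confidence frontier while updating the solver with a reward like the one above; that pairing is an inference from the tweet, not a detail it states.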

Mengning Wu (@wumengning54261) 's Twitter Profile Photo

Cool work! It’s impressive to see how this approach performs, and I can imagine that with stronger verifiers or more dynamic environments, the boundaries could be pushed even further.

Deepak Pathak (@pathak2206) 's Twitter Profile Photo

Thrilled to share our latest work on improving LLM reasoning -- without using any real data -- via pure self-play/curiosity. Nostalgic to see the return of circa 2017 ideas from unsupervised RL for robotics/games into LLMs. Check out the thread below! 👇
