Daniil Larionov (@rexhaif) 's Twitter Profile
Daniil Larionov

@rexhaif

#NLProc |

PhD @ Uni Mannheim |

LLMs + Evaluation + Efficiency

ID: 1093781770967924736

Joined: 08-02-2019 08:01:10

655 Tweets

110 Followers

1.1K Following

Xiao Wang (@sandy_wx95) 's Twitter Profile Photo

2/n We introduce a novel contrastive evaluation metric for assessing generated text, and test on two tasks: machine translation (MT) and summarization (SUM).
Xiao Wang (@sandy_wx95) 's Twitter Profile Photo

3/n Achieving higher correlation with human judgments:
• Single Llama 8B: 0.457 -> ContrastScore(Llama 8B, Llama 3B): 0.498 (+9.0%)
• Single Qwen 7B: 0.442 -> ContrastScore(Qwen 7B, Qwen 3B): 0.470 (+6.3%)
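The thread doesn't say which correlation coefficient these numbers are; as an illustration only, a minimal sketch of how such a figure is typically computed, assuming Pearson correlation between metric scores and human judgment scores over the same segments (the function name and choice of Pearson are my assumptions):

```python
def pearson(metric_scores, human_scores):
    # Pearson correlation between automatic metric scores and
    # human judgments for the same set of evaluated segments.
    n = len(metric_scores)
    mx = sum(metric_scores) / n
    my = sum(human_scores) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(metric_scores, human_scores))
    sx = sum((x - mx) ** 2 for x in metric_scores) ** 0.5
    sy = sum((y - my) ** 2 for y in human_scores) ** 0.5
    return cov / (sx * sy)

# Toy check: perfectly aligned rankings give correlation 1.0.
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # -> 1.0
```

In practice, metric papers report such correlations at the segment or system level against datasets like WMT human annotations; a higher value means the metric's scores track human preferences more closely.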
Xiao Wang (@sandy_wx95) 's Twitter Profile Photo

4/n Mitigating biases in likelihood preferences:
• MT: Single Llama 8B: 0.352 -> ContrastScore(Llama 8B, Llama 3B): 0.137 (-61.1%)
• SUM: Single Llama 8B: 0.381 -> ContrastScore(Llama 8B, Llama 3B): 0.240 (-37.0%)
Xiao Wang (@sandy_wx95) 's Twitter Profile Photo

5/n Mitigating length biases in summarization:
• Single Llama 3B: 0.289 -> ContrastScore(Llama 3B, Llama 1B): 0.086 (-70.2%)
• Single Qwen 3B: 0.349 -> ContrastScore(Qwen 3B, Qwen 0.5B): 0.253 (-27.5%)
Xiao Wang (@sandy_wx95) 's Twitter Profile Photo

6/n Improving efficiency with faster processing:
• MT: ContrastScore(Llama 3B, Llama 1B) is 1.5X faster than single Llama 8B
• SUM: ContrastScore(Qwen 3B, Qwen 0.5B) is 1.7X faster than single Qwen 7B
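A back-of-envelope sketch of why a pair of smaller scorers can beat one large model, under the (assumed) simplification that scoring cost scales linearly with parameter count; real measured speedups like the 1.5X and 1.7X above are naturally lower than this naive bound because of batching, memory bandwidth, and overheads:

```python
def naive_speedup(single_params, pair_params):
    # Assumption: autoregressive scoring cost scales ~linearly with
    # the number of parameters touched; ignores all systems effects.
    return single_params / sum(pair_params)

# Llama pair from the thread: 3B + 1B contrasted vs a single 8B scorer.
print(naive_speedup(8e9, (3e9, 1e9)))  # -> 2.0 under this naive model
```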
Xiao Wang (@sandy_wx95) 's Twitter Profile Photo

7/n Ablation Study Insights:
Contrastive formulation matters!
• our subtraction-based contrastive formulation: Llama (8B, 3B) 0.498, Qwen (7B, 3B) 0.470
• original ratio-based formulation from contrastive decoding (Li et al., 2023): Llama (8B, 3B) 0.429, Qwen (7B, 3B) 0.435
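The thread doesn't give the exact formulas, but the distinction can be illustrated with a minimal sketch, assuming the input is per-token log-probabilities of the candidate text under a larger and a smaller model; the function names and the averaging over tokens are my assumptions, not the paper's definitions:

```python
import math

def subtraction_contrast(logp_large, logp_small):
    # Hypothetical subtraction-based score: average per-token difference
    # of the two models' probabilities, taken in probability space.
    diffs = [math.exp(a) - math.exp(b) for a, b in zip(logp_large, logp_small)]
    return sum(diffs) / len(diffs)

def ratio_contrast(logp_large, logp_small):
    # Ratio-based score as in contrastive decoding (Li et al., 2023):
    # average log-probability ratio, i.e. a difference in log space.
    return sum(a - b for a, b in zip(logp_large, logp_small)) / len(logp_large)

# Toy token log-probs under, e.g., Llama 8B (large) and Llama 3B (small).
large = [-0.1, -0.5, -0.3]
small = [-0.4, -1.2, -0.2]
print(subtraction_contrast(large, small), ratio_contrast(large, small))
```

Both scores reward tokens the larger model finds more plausible than the smaller one; they differ only in whether the contrast is taken on probabilities or on log-probabilities, which is what the ablation above compares.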
Xiao Wang (@sandy_wx95) 's Twitter Profile Photo

8/n Case Study Analysis: ContrastScore leverages the probability discrepancy between the two models to align with human judgment.

Overall: ContrastScore achieves higher-quality, less biased, and more efficient evaluation of generated text.
Daniil Larionov (@rexhaif) 's Twitter Profile Photo

I'm at NAACL 2025, presenting my work on prompt optimization for MT evaluation with LLMs. Come say hi at board 50 at 11:00-12:30, Hall 3.
Daniil Larionov (@rexhaif) 's Twitter Profile Photo

Either the reviewers of this paper completely ignored the checklist section where the authors were supposed to disclose that the entire paper was written by an AI agent, or the authors did not disclose it, in which case it should be retroactively desk-rejected.

Manos Zaranis (@manoszaranis) 's Twitter Profile Photo

🚨Meet MF²: Movie Facts & Fibs: a new benchmark for long-movie understanding!
🤔Do you think your model understands movies?

Unlike existing benchmarks, MF² targets memorable events, emotional arcs 💔, and causal chains 🔗 — things humans recall easily, but even top models like