Daniil Larionov (@rexhaif) 's Twitter Profile
Daniil Larionov

@rexhaif

#NLProc |

PhD @ Uni Mannheim |

LLMs + Evaluation + Efficiency

ID: 1093781770967924736

Joined: 08-02-2019 08:01:10

655 Tweets

110 Followers

1.1K Following

Xiao Wang (@sandy_wx95) 's Twitter Profile Photo

2/n We introduce a novel contrastive evaluation metric for assessing generated text, and test on two tasks: machine translation (MT) and summarization (SUM).
Xiao Wang (@sandy_wx95) 's Twitter Profile Photo

3/n Achieving higher correlation with human judgments:
• Single Llama 8B: 0.457 -> ContrastScore(Llama 8B, Llama 3B): 0.498 (+9.0%)
• Single Qwen 7B: 0.442 -> ContrastScore(Qwen 7B, Qwen 3B): 0.470 (+6.3%)
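The thread doesn't say which correlation coefficient these numbers are; as an illustration only, a minimal sketch of how such a figure is typically computed, assuming Pearson correlation between metric scores and human judgment scores over the same segments (the function name and choice of Pearson are my assumptions):

```python
def pearson(metric_scores, human_scores):
    # Pearson correlation between automatic metric scores and
    # human judgments for the same set of evaluated segments.
    n = len(metric_scores)
    mx = sum(metric_scores) / n
    my = sum(human_scores) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(metric_scores, human_scores))
    sx = sum((x - mx) ** 2 for x in metric_scores) ** 0.5
    sy = sum((y - my) ** 2 for y in human_scores) ** 0.5
    return cov / (sx * sy)

# Toy check: perfectly aligned rankings give correlation 1.0.
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # -> 1.0
```

In practice, metric papers report such correlations at the segment or system level against datasets like WMT human annotations; a higher value means the metric's scores track human preferences more closely.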
Xiao Wang (@sandy_wx95) 's Twitter Profile Photo

4/n Mitigating biases in likelihood preferences:
• MT: Single Llama 8B: 0.352 -> ContrastScore(Llama 8B, Llama 3B): 0.137 (-61.1%)
• SUM: Single Llama 8B: 0.381 -> ContrastScore(Llama 8B, Llama 3B): 0.240 (-37.0%)
Xiao Wang (@sandy_wx95) 's Twitter Profile Photo

5/n Mitigating length biases in summarization:
• Single Llama 3B: 0.289 -> ContrastScore(Llama 3B, Llama 1B): 0.086 (-70.2%)
• Single Qwen 3B: 0.349 -> ContrastScore(Qwen 3B, Qwen 0.5B): 0.253 (-27.5%)
Xiao Wang (@sandy_wx95) 's Twitter Profile Photo

6/n Improving efficiency with faster processing:
• MT: ContrastScore(Llama 3B, Llama 1B) is 1.5X faster than single Llama 8B
• SUM: ContrastScore(Qwen 3B, Qwen 0.5B) is 1.7X faster than single Qwen 7B
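A back-of-envelope sketch of why a pair of smaller scorers can beat one large model, under the (assumed) simplification that scoring cost scales linearly with parameter count; real measured speedups like the 1.5X and 1.7X above are naturally lower than this naive bound because of batching, memory bandwidth, and overheads:

```python
def naive_speedup(single_params, pair_params):
    # Assumption: autoregressive scoring cost scales ~linearly with
    # the number of parameters touched; ignores all systems effects.
    return single_params / sum(pair_params)

# Llama pair from the thread: 3B + 1B contrasted vs a single 8B scorer.
print(naive_speedup(8e9, (3e9, 1e9)))  # -> 2.0 under this naive model
```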
Xiao Wang (@sandy_wx95) 's Twitter Profile Photo

7/n Ablation Study Insights:
Contrastive formulation matters!
• our subtraction-based contrastive formulation: Llama (8B, 3B) 0.498, Qwen (7B, 3B) 0.470
• original ratio-based formulation from contrastive decoding (Li et al., 2023): Llama (8B, 3B) 0.429, Qwen (7B, 3B) 0.435
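The thread doesn't give the exact formulas, but the distinction can be illustrated with a minimal sketch, assuming the input is per-token log-probabilities of the candidate text under a larger and a smaller model; the function names and the averaging over tokens are my assumptions, not the paper's definitions:

```python
import math

def subtraction_contrast(logp_large, logp_small):
    # Hypothetical subtraction-based score: average per-token difference
    # of the two models' probabilities, taken in probability space.
    diffs = [math.exp(a) - math.exp(b) for a, b in zip(logp_large, logp_small)]
    return sum(diffs) / len(diffs)

def ratio_contrast(logp_large, logp_small):
    # Ratio-based score as in contrastive decoding (Li et al., 2023):
    # average log-probability ratio, i.e. a difference in log space.
    return sum(a - b for a, b in zip(logp_large, logp_small)) / len(logp_large)

# Toy token log-probs under, e.g., Llama 8B (large) and Llama 3B (small).
large = [-0.1, -0.5, -0.3]
small = [-0.4, -1.2, -0.2]
print(subtraction_contrast(large, small), ratio_contrast(large, small))
```

Both scores reward tokens the larger model finds more plausible than the smaller one; they differ only in whether the contrast is taken on probabilities or on log-probabilities, which is what the ablation above compares.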
Xiao Wang (@sandy_wx95) 's Twitter Profile Photo

8/n Case Study Analysis: ContrastScore leverages the probability discrepancy between the two models to align with human judgment.

Overall: ContrastScore achieves higher-quality, less biased, and more efficient evaluation of generated text.
Daniil Larionov (@rexhaif) 's Twitter Profile Photo

I'm at NAACL 2025, presenting my work on prompt optimization for MT evaluation with LLMs. Come say hi at board 50 at 11:00-12:30, Hall 3.
Daniil Larionov (@rexhaif) 's Twitter Profile Photo

Either the reviewers of this paper completely ignored the checklist section where the authors were supposed to disclose that the entire paper was written by an AI agent, or the authors did not disclose it, in which case it should be retroactively desk-rejected.

Manos Zaranis (@manoszaranis) 's Twitter Profile Photo

🚨Meet MF²: Movie Facts & Fibs: a new benchmark for long-movie understanding!
🤔Do you think your model understands movies?

Unlike existing benchmarks, MF² targets memorable events, emotional arcs 💔, and causal chains 🔗 — things humans recall easily, but even top models like