Trustworthy ML Initiative (TrustML) (@trustworthy_ml)'s Twitter Profile
Trustworthy ML Initiative (TrustML)

@trustworthy_ml

Latest research in Trustworthy ML. Organizers: @JaydeepBorkar @sbmisi @hima_lakkaraju @sarahookr Sarah Tan @chhaviyadav_ @_cagarwal @m_lemanczyk @HaohanWang

ID: 1262375165490540549

Link: https://www.trustworthyml.org · Joined: 18-05-2020 13:31:24

1.7K Tweets

6.0K Followers

64 Following

Maksym Andriushchenko 🇺🇦 (@maksym_andr)

🚨 Are leading safety-aligned LLMs adversarially robust? 🚨

❗In our new work, we jailbreak basically all of them with ≈100% success rate (according to GPT-4 as a semantic judge):
- Claude 1.2 / 2.0 / 2.1 / 3 Haiku / 3 Sonnet / 3 Opus,
- GPT-3.5 / GPT-4,
- R2D2-7B from…

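The thread is truncated here, but attacks in this vein typically search for an adversarial suffix that maximizes the probability of a compliant reply. A minimal random-search sketch, assuming a hypothetical `target_logprob(text)` callable that scores how likely the victim model is to begin with a compliant token (this is an illustration of the general technique, not the paper's exact method):

```python
import random
import string

def random_search_suffix(prompt, target_logprob, n_iters=1000, suffix_len=25):
    # Greedily mutate one suffix character at a time, keeping mutations
    # that raise the attacker's objective. `target_logprob` is a
    # hypothetical stand-in for a victim-model scoring call.
    suffix = random.choices(string.ascii_letters, k=suffix_len)
    best = target_logprob(prompt + "".join(suffix))
    for _ in range(n_iters):
        i = random.randrange(suffix_len)
        old = suffix[i]
        suffix[i] = random.choice(string.ascii_letters + string.digits)
        score = target_logprob(prompt + "".join(suffix))
        if score > best:
            best = score        # keep the improving mutation
        else:
            suffix[i] = old     # revert
    return "".join(suffix), best
```
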
Canyu Chen (@CanyuChen3)

Thanks LLM Security for sharing our new work 'Can LLM-Generated Misinformation Be Detected?'

🔗Project website (paper, dataset, and code): llm-misinformation.github.io

🚨LLM-generated misinformation is one of the most critical risks to AI safety. Then, one fundamental…

Patrick Chao (@patrickrchao)

Are you interested in jailbreaking LLMs? Have you ever wished that jailbreaking research was more standardized, reproducible, or transparent?

Check out JailbreakBench, an open benchmark and leaderboard for jailbreak attacks and defenses on LLMs!

jailbreakbench.github.io
🧵1/n

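The tweet doesn't show usage, but a benchmark of this shape is typically consumed as a loop over harmful behaviors with a judge scoring each response. A generic harness sketch (the `attack`, `generate`, and `judge` callables are hypothetical stand-ins; see jailbreakbench.github.io for the real API):

```python
def evaluate_attack(behaviors, attack, generate, judge):
    # behaviors: list of harmful-behavior strings from the benchmark
    # attack:    callable turning a behavior into an adversarial prompt
    # generate:  callable querying the target LLM
    # judge:     callable returning True if the response is a jailbreak
    successes = 0
    for behavior in behaviors:
        response = generate(attack(behavior))
        successes += bool(judge(behavior, response))
    return successes / len(behaviors)  # attack success rate (ASR)
```
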
SAIL @ Imperial College London (@SAILImperial)

We're recruiting two Research Assistants to join us and work on the security of ML-based personal assistants at Imperial College London. The roles will focus on verification, robustification, and adversarial attacks for AI assistants. rb.gy/mcxvob

Jaemin Cho (@jmin__cho)

Can we adaptively generate training environments with LLMs to help small embodied RL game agents learn useful skills that they are weak at? 🤔

👉 Check out EnvGen, an effective+efficient framework in which an LLM progressively generates and adapts training environments based on…

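The tweet is truncated, but the loop it describes (an LLM proposes training environments, the agent trains, and feedback about weak skills flows back to the LLM) can be sketched generically. All callables here are hypothetical stand-ins, not the paper's API:

```python
def envgen_loop(llm_propose, train_agent, eval_skills, n_cycles=4):
    # llm_propose: LLM call mapping skill-level feedback -> env configs
    # train_agent: trains the small RL agent in the given envs
    #              (in practice, training continues across cycles)
    # eval_skills: returns {skill_name: success_rate} for the agent
    feedback = {}
    for _ in range(n_cycles):
        envs = llm_propose(feedback)   # adapt envs toward weak skills
        agent = train_agent(envs)
        scores = eval_skills(agent)
        feedback = {s: r for s, r in scores.items() if r < 0.5}
    return agent
```
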
Przemyslaw Grabowicz (@przemyslslaw)

The U.S. Supreme Court has ended the use of race in college admissions. Fortunately, there exists a path to fair algorithmic decision-making that differs from the invalidated affirmative action measures, as we discuss in our recent Uncommon Good post:
uncommongood.substack.com/p/fair-machine…

Matthew Finlayson (@mattf1n)

Wanna know gpt-3.5-turbo's embed size? We find a way to extract info from LLM APIs and estimate gpt-3.5-turbo’s embed size to be 4096. With the same trick we also develop 25x faster logprob extraction, audits for LLM APIs, and more!
📄 arxiv.org/abs/2403.09539
Here’s how 1/🧵

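One known primitive behind cheap logprob extraction from top-k-only APIs is the logit-bias trick: bias a chosen token into the top-k, then undo the bias algebraically. A sketch assuming a hypothetical `api_logprobs(prompt, logit_bias)` that returns the top tokens' logprobs (this illustrates the primitive, not necessarily the paper's exact, faster method):

```python
def extract_logprob(prompt, token, api_logprobs, bias=30.0):
    # Hypothetical API: api_logprobs(prompt, logit_bias={token: b})
    # returns {token: logprob} for the top-k tokens after biasing.
    top = api_logprobs(prompt, logit_bias={})
    ref, lp_ref = next(iter(top.items()))   # reference top token; we
    biased = api_logprobs(prompt, logit_bias={token: bias})
    lp_tok_b, lp_ref_b = biased[token], biased[ref]
    # Softmax algebra: biasing token t rescales the partition function,
    # so  log p(t) = log p'(t) - bias + log p(ref) - log p'(ref).
    # Assumes ref stays in the top-k after biasing t.
    return lp_tok_b - bias + lp_ref - lp_ref_b
```
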
Canyu Chen (@CanyuChen3)

🤔Can LLM agents really simulate human behaviors?

🌟Our new paper 'Can Large Language Model Agents Simulate Human Trust Behaviors?' (Project website: camel-ai.org/research/agent…) provides some new insights into this fundamental problem.

✨TLDR: We discover the trust behaviors of…

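The tweet cuts off before the findings, but trust behaviors in agents are commonly probed with Trust-Game-style interactions; that framing is an assumption here, not a claim about the paper. A sketch with a hypothetical `llm(prompt) -> str` callable:

```python
import re

def trust_game_probe(llm, endowment=10):
    # Ask the model, acting as the trustor, how much to send; sent money
    # is typically tripled and the trustee may return some share.
    prompt = (
        f"You are a player in a trust game with ${endowment}. Any amount "
        "you send to the other player is tripled; they may return some of "
        "it. How many dollars do you send? Answer with a single number."
    )
    reply = llm(prompt)
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else None  # amount sent ~ trust
```
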
Sharon Levy (@sharonlevy21)

🧐Are LLM responses to public health questions biased toward specific demographic groups?

In our new interdisciplinary collaboration, we find that disparities exist among model answers for different groups across ages, U.S. locations, and sexes.

Paper: arxiv.org/pdf/2403.04858…

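A disparity audit of this kind usually templates the same health question across demographic framings and compares the answers. A generic sketch (the `llm` and `similarity` callables and the prompt template are hypothetical, not the paper's protocol):

```python
def disparity_probe(llm, similarity, questions, groups):
    # For each question, ask it as posed by each demographic group and
    # measure how far each answer drifts from an unconditioned baseline.
    gaps = {}
    for q in questions:
        baseline = llm(q)                       # unconditioned answer
        for g in groups:
            framed = llm(f"I am {g}. {q}")      # demographic framing
            gaps[(q, g)] = 1.0 - similarity(baseline, framed)
    return gaps  # large values flag demographic-sensitive answers
```
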
Eric Wallace (@Eric_Wallace_)

The final layer of an LLM up-projects from hidden dim → vocab size. The logprobs are thus low rank, and with some clever API queries, you can recover an LLM's hidden dimension (or even the exact layer's weights).

Our new paper is out, a collaboration between lots of friends!

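The low-rank argument suggests a direct test: stack many full-vocab logit vectors into a matrix, whose numerical rank is capped at the hidden dimension because every row is hidden_state @ W_out. A numpy sketch, with the data-collection step abstracted into a hypothetical `collect_logit_vector(i)`:

```python
import numpy as np

def estimate_hidden_dim(collect_logit_vector, n_prompts=8192, tol=1e-4):
    # Each row is one full-vocab logit vector; the rows live in a
    # subspace of dimension <= hidden_dim, so the singular values
    # should drop sharply right at the hidden dimension.
    L = np.stack([collect_logit_vector(i) for i in range(n_prompts)])
    s = np.linalg.svd(L - L.mean(0), compute_uv=False)
    return int((s > tol * s[0]).sum())  # numerical rank ~ hidden dim
```
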
Nicolas Papernot (@NicolasPapernot)

Just one month left before the SaTML Conference, April 9-11 in Toronto! I am excited to hear from Somesh Jha, Deb Raji, Yves-A. de Montjoye, and Sheila McIlraith, as well as the authors of accepted papers and the competition organizing teams!

There's still time to register! satml.org

Przemyslaw Grabowicz (@przemyslslaw)

Our first Uncommon Good post (with Nick Perello) discusses how to train AI systems that do not propagate discrimination, in compliance with legal provisions, based on our research published at ACM FAccT, the AAAI/ACM Conference on AI, Ethics, and Society (AIES), and ICML. Stay tuned!
open.substack.com/pub/uncommongo…

David Wan (@meetdavidwan)

Pointing to an image region should help models focus, but standard VLMs fail to understand visual markers/prompts (e.g., boxes/masks).

🚨Contrastive Region Guidance: Training-free method that increases focus on visual prompts by reducing model priors.

arxiv.org/abs/2403.02325
🧵

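Training-free guidance of this flavor usually contrasts two forward passes, one with the visual prompt visible and one with it removed, then amplifies their difference at decoding time. A sketch of the generic combination (the `model` call signature, the masking step, and `alpha` are illustrative assumptions, not the paper's exact formulation):

```python
import torch

def region_guided_logits(model, image, masked_image, text_ids, alpha=1.5):
    # Two passes: with the marked region visible vs. masked out.
    # Upweighting the difference suppresses the model's priors that
    # are independent of the highlighted region.
    with torch.no_grad():
        logits_with = model(image, text_ids)            # region visible
        logits_without = model(masked_image, text_ids)  # region removed
    return logits_with + alpha * (logits_with - logits_without)
```
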
Javier Rando (@javirandor)

We are announcing the winners of our Trojan Detection Competition on Aligned LLMs!!

🥇 TML Lab (EPFL) (@fra__31, Maksym Andriushchenko 🇺🇦 and Nicolas Flammarion)
🥈 Krystof Mitka
🥉 nev

🧵 With some of the main findings!

Zhuang Liu (@liuzhuang1234)

LLMs are great, but their internals are less explored. I'm excited to share very interesting findings in our paper

“Massive Activations in Large Language Models”

LLMs have very few internal activations with drastically outsized magnitudes, e.g., 100,000x larger than others. (1/n)

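A finding like this is easy to check yourself with forward hooks: record the largest-magnitude hidden activation in each block and compare it to the block's median magnitude. A minimal sketch (gpt2 is just a small runnable stand-in; the paper studies larger LLMs, where the layer attribute path differs, e.g., `model.model.layers` for Llama-family models):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

stats = {}
def make_hook(i):
    def hook(module, inputs, output):
        # Decoder blocks return a tuple whose first item is hidden states.
        h = (output[0] if isinstance(output, tuple) else output).float().abs()
        stats[i] = (h.max().item(), h.median().item())
    return hook

for i, block in enumerate(model.transformer.h):
    block.register_forward_hook(make_hook(i))

with torch.no_grad():
    model(**tok("The quick brown fox jumps over the lazy dog",
                return_tensors="pt"))

for i, (mx, med) in stats.items():
    print(f"block {i:2d}: max |h| = {mx:8.1f}  "
          f"median |h| = {med:.4f}  ratio ~ {mx / med:.0f}x")
```
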
A. Feder Cooper (@afedercooper)

Thrilled to be recognized with a best paper honorable mention at AAAI!

Our paper raises serious questions re: reproducibility + reliability in fairness

We define + mitigate arbitrariness, & find that most fairness benchmarks are actually close-to-fair

This is a BIG 🚩🚩

1/

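One common way to operationalize arbitrariness is self-consistency: train many models on bootstrap resamples of the data and measure how often they agree on each individual's prediction. A sketch under that assumption (the paper's exact definition may differ; binary 0/1 labels assumed):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def self_consistency(X, y, X_test, k=50, seed=0):
    # Fraction of bootstrap-trained models agreeing with the majority
    # prediction per test point; low values mean the prediction is
    # arbitrary (it flips depending on which sample the model saw).
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(k):
        idx = rng.integers(0, len(X), len(X))   # bootstrap resample
        preds.append(DecisionTreeClassifier().fit(X[idx], y[idx])
                     .predict(X_test))
    preds = np.stack(preds)                     # shape (k, n_test)
    maj = (preds.mean(0) >= 0.5).astype(int)    # majority vote
    return (preds == maj).mean(0)               # per-example agreement
```
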
Akshay Chaudhari (@Dr_ASChaudhari)

Our clinical #NLP work was just published in Nature Medicine! We present a framework to adapt & evaluate #LLMs for summarization. Physicians 🩺 prefer #LLM summaries to those of #medical experts❗

Big step to reduce documentation 📚 and focus more on personalized care 🙌

A 🧵

LLM Security (@llm_sec)

Fast Adversarial Attacks on Language Models In One GPU Minute 🌶️

'Our gradient-free targeted attack can jailbreak aligned LMs with high attack success rates within one minute. For instance, BEAST can jailbreak Vicuna-7B-v1.5 under one minute with a success rate of 89%'…

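The quote describes a gradient-free targeted attack; beam search over candidate suffix tokens is the core of that style of attack. A sketch, with `score` as a hypothetical attacker objective (e.g., the logprob of a compliant target continuation under the victim model), not the paper's exact algorithm:

```python
def beam_attack(prompt, candidate_tokens, score, beam_width=5, steps=10):
    # Gradient-free beam search: extend every kept suffix by every
    # candidate token, then keep the beam_width highest-scoring suffixes.
    beams = [("", float("-inf"))]
    for _ in range(steps):
        expanded = [
            (suffix + t, score(prompt + suffix + t))
            for suffix, _ in beams
            for t in candidate_tokens
        ]
        beams = sorted(expanded, key=lambda b: b[1],
                       reverse=True)[:beam_width]
    return beams[0]  # (best adversarial suffix, its score)
```
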
Cas (Stephen Casper) (@StephenLCasper)

🚨New paper🚨 8 Methods to Evaluate Robust Unlearning in LLMs

Unlearning is promising for safer LLMs, but it’s tricky to evaluate. Here, we (1) overview eval techniques, (2) red-team a popular method, and (3) show that ad-hoc evals can be misleading.

arxiv.org/abs/2402.16835

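A recurring theme in this line of work is that a naive eval (verbatim question, check the answer is gone) can pass while paraphrase or other probes still extract the "forgotten" knowledge. A generic harness sketch along those lines (all callables are hypothetical stand-ins, not the paper's eight methods):

```python
def eval_unlearning(model, forget_set, probes):
    # forget_set: (question, answer) pairs the model was unlearned on
    # probes: {name: callable(model, question) -> answer attempt},
    #         e.g., verbatim, paraphrased, translated, jailbreak-style.
    # Robust unlearning should drive leak rates toward zero for *all*
    # probes, not just the verbatim one.
    leak_rates = {}
    for name, probe in probes.items():
        leaks = sum(
            answer.lower() in probe(model, question).lower()
            for question, answer in forget_set
        )
        leak_rates[name] = leaks / len(forget_set)
    return leak_rates
```
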