Trustworthy ML Initiative (TrustML) (@trustworthy_ml)'s Twitter Profile
Trustworthy ML Initiative (TrustML)

@trustworthy_ml

Latest research in Trustworthy ML. Organizers: @JaydeepBorkar @sbmisi @hima_lakkaraju @sarahookr Sarah Tan @chhaviyadav_ @_cagarwal @m_lemanczyk @HaohanWang

ID: 1262375165490540549

Link: https://www.trustworthyml.org · Joined: 18-05-2020 13:31:24

1.7K Tweets

6.0K Followers

64 Following

Maksym Andriushchenko 🇺🇦 (@maksym_andr)

🚨 Are leading safety-aligned LLMs adversarially robust? 🚨

❗In our new work, we jailbreak basically all of them with ≈100% success rate (according to GPT-4 as a semantic judge):
- Claude 1.2 / 2.0 / 2.1 / 3 Haiku / 3 Sonnet / 3 Opus,
- GPT-3.5 / GPT-4,
- R2D2-7B from…

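The thread is truncated here, but attacks in this vein typically search for an adversarial suffix that maximizes the probability of a compliant reply. A minimal random-search sketch, assuming a hypothetical `target_logprob(text)` callable that scores how likely the victim model is to begin with a compliant token (this is an illustration of the general technique, not the paper's exact method):

```python
import random
import string

def random_search_suffix(prompt, target_logprob, n_iters=1000, suffix_len=25):
    # Greedily mutate one suffix character at a time, keeping mutations
    # that raise the attacker's objective. `target_logprob` is a
    # hypothetical stand-in for a victim-model scoring call.
    suffix = random.choices(string.ascii_letters, k=suffix_len)
    best = target_logprob(prompt + "".join(suffix))
    for _ in range(n_iters):
        i = random.randrange(suffix_len)
        old = suffix[i]
        suffix[i] = random.choice(string.ascii_letters + string.digits)
        score = target_logprob(prompt + "".join(suffix))
        if score > best:
            best = score        # keep the improving mutation
        else:
            suffix[i] = old     # revert
    return "".join(suffix), best
```
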
Canyu Chen (@CanyuChen3)

Thanks LLM Security for sharing our new work 'Can LLM-Generated Misinformation Be Detected?'

🔗Project website (paper, dataset, and code): llm-misinformation.github.io

🚨LLM-generated misinformation is one of the most critical risks to AI safety. Then, one fundamental…

Patrick Chao (@patrickrchao)

Are you interested in jailbreaking LLMs? Have you ever wished that jailbreaking research was more standardized, reproducible, or transparent?

Check out JailbreakBench, an open benchmark and leaderboard for jailbreak attacks and defenses on LLMs!

jailbreakbench.github.io
🧵1/n

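The tweet doesn't show usage, but a benchmark of this shape is typically consumed as a loop over harmful behaviors with a judge scoring each response. A generic harness sketch (the `attack`, `generate`, and `judge` callables are hypothetical stand-ins; see jailbreakbench.github.io for the real API):

```python
def evaluate_attack(behaviors, attack, generate, judge):
    # behaviors: list of harmful-behavior strings from the benchmark
    # attack:    callable turning a behavior into an adversarial prompt
    # generate:  callable querying the target LLM
    # judge:     callable returning True if the response is a jailbreak
    successes = 0
    for behavior in behaviors:
        response = generate(attack(behavior))
        successes += bool(judge(behavior, response))
    return successes / len(behaviors)  # attack success rate (ASR)
```
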
SAIL @ Imperial College London (@SAILImperial)

We're recruiting two Research Assistants to join us and work on the security of ML-based personal assistants at Imperial College London. The roles will focus on verification, robustification, and adversarial attacks for AI assistants. rb.gy/mcxvob

Jaemin Cho (@jmin__cho)

Can we adaptively generate training environments with LLMs to help small embodied RL game agents learn useful skills that they are weak at? 🤔

👉 Check out EnvGen, an effective+efficient framework in which an LLM progressively generates and adapts training environments based on…

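The tweet is truncated, but the loop it describes (an LLM proposes training environments, the agent trains, and feedback about weak skills flows back to the LLM) can be sketched generically. All callables here are hypothetical stand-ins, not the paper's API:

```python
def envgen_loop(llm_propose, train_agent, eval_skills, n_cycles=4):
    # llm_propose: LLM call mapping skill-level feedback -> env configs
    # train_agent: trains the small RL agent in the given envs
    #              (in practice, training continues across cycles)
    # eval_skills: returns {skill_name: success_rate} for the agent
    feedback = {}
    for _ in range(n_cycles):
        envs = llm_propose(feedback)   # adapt envs toward weak skills
        agent = train_agent(envs)
        scores = eval_skills(agent)
        feedback = {s: r for s, r in scores.items() if r < 0.5}
    return agent
```
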
Przemyslaw Grabowicz (@przemyslslaw)

The U.S. Supreme Court has ended the use of race in college admissions. Fortunately, there exists a path to fair algorithmic decision-making that differs from the invalidated affirmative action measures, as we discuss in our recent Uncommon Good post:
uncommongood.substack.com/p/fair-machine…

Matthew Finlayson (@mattf1n)

Wanna know gpt-3.5-turbo's embed size? We find a way to extract info from LLM APIs and estimate gpt-3.5-turbo’s embed size to be 4096. With the same trick we also develop 25x faster logprob extraction, audits for LLM APIs, and more!
📄 arxiv.org/abs/2403.09539
Here’s how 1/🧵

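One known primitive behind cheap logprob extraction from top-k-only APIs is the logit-bias trick: bias a chosen token into the top-k, then undo the bias algebraically. A sketch assuming a hypothetical `api_logprobs(prompt, logit_bias)` that returns the top tokens' logprobs (this illustrates the primitive, not necessarily the paper's exact, faster method):

```python
def extract_logprob(prompt, token, api_logprobs, bias=30.0):
    # Hypothetical API: api_logprobs(prompt, logit_bias={token: b})
    # returns {token: logprob} for the top-k tokens after biasing.
    top = api_logprobs(prompt, logit_bias={})
    ref, lp_ref = next(iter(top.items()))   # reference top token; we
    biased = api_logprobs(prompt, logit_bias={token: bias})
    lp_tok_b, lp_ref_b = biased[token], biased[ref]
    # Softmax algebra: biasing token t rescales the partition function,
    # so  log p(t) = log p'(t) - bias + log p(ref) - log p'(ref).
    # Assumes ref stays in the top-k after biasing t.
    return lp_tok_b - bias + lp_ref - lp_ref_b
```
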
Canyu Chen (@CanyuChen3)

🤔Can LLM agents really simulate human behaviors?

🌟Our new paper 'Can Large Language Model Agents Simulate Human Trust Behaviors?' (Project website: camel-ai.org/research/agent…) provides some new insights into this fundamental problem.

✨TLDR: We discover the trust behaviors of…

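The tweet cuts off before the findings, but trust behaviors in agents are commonly probed with Trust-Game-style interactions; that framing is an assumption here, not a claim about the paper. A sketch with a hypothetical `llm(prompt) -> str` callable:

```python
import re

def trust_game_probe(llm, endowment=10):
    # Ask the model, acting as the trustor, how much to send; sent money
    # is typically tripled and the trustee may return some share.
    prompt = (
        f"You are a player in a trust game with ${endowment}. Any amount "
        "you send to the other player is tripled; they may return some of "
        "it. How many dollars do you send? Answer with a single number."
    )
    reply = llm(prompt)
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else None  # amount sent ~ trust
```
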
Sharon Levy (@sharonlevy21)

🧐Are LLM responses to public health questions biased toward specific demographic groups?

In our new interdisciplinary collaboration, we find that disparities exist among model answers for different groups across ages, U.S. locations, and sexes.

Paper: arxiv.org/pdf/2403.04858…

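A disparity audit of this kind usually templates the same health question across demographic framings and compares the answers. A generic sketch (the `llm` and `similarity` callables and the prompt template are hypothetical, not the paper's protocol):

```python
def disparity_probe(llm, similarity, questions, groups):
    # For each question, ask it as posed by each demographic group and
    # measure how far each answer drifts from an unconditioned baseline.
    gaps = {}
    for q in questions:
        baseline = llm(q)                       # unconditioned answer
        for g in groups:
            framed = llm(f"I am {g}. {q}")      # demographic framing
            gaps[(q, g)] = 1.0 - similarity(baseline, framed)
    return gaps  # large values flag demographic-sensitive answers
```
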
Eric Wallace (@Eric_Wallace_)

The final layer of an LLM up-projects from hidden dim → vocab size. The logprobs are thus low rank, and with some clever API queries, you can recover an LLM's hidden dimension (or even the exact layer's weights).

Our new paper is out, a collaboration between lots of friends!

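The low-rank argument suggests a direct test: stack many full-vocab logit vectors into a matrix, whose numerical rank is capped at the hidden dimension because every row is hidden_state @ W_out. A numpy sketch, with the data-collection step abstracted into a hypothetical `collect_logit_vector(i)`:

```python
import numpy as np

def estimate_hidden_dim(collect_logit_vector, n_prompts=8192, tol=1e-4):
    # Each row is one full-vocab logit vector; the rows live in a
    # subspace of dimension <= hidden_dim, so the singular values
    # should drop sharply right at the hidden dimension.
    L = np.stack([collect_logit_vector(i) for i in range(n_prompts)])
    s = np.linalg.svd(L - L.mean(0), compute_uv=False)
    return int((s > tol * s[0]).sum())  # numerical rank ~ hidden dim
```
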
Nicolas Papernot (@NicolasPapernot)

Just one month left before the SaTML Conference, April 9-11 in Toronto! I am excited to hear from Somesh Jha, Deb Raji, Yves-A. de Montjoye, and Sheila McIlraith, as well as the authors of accepted papers and the competition organizing teams!

There's still time to register! satml.org

Przemyslaw Grabowicz (@przemyslslaw)

Our first Uncommon Good post (with Nick Perello) discusses how to train AI systems that do not propagate discrimination, in compliance with legal provisions, based on our research published at ACM FAccT, the AAAI/ACM Conference on AI, Ethics, and Society (AIES), and ICML. Stay tuned!
open.substack.com/pub/uncommongo…

David Wan (@meetdavidwan)

Pointing to an image region should help models focus, but standard VLMs fail to understand visual markers/prompts (e.g., boxes/masks).

🚨Contrastive Region Guidance: Training-free method that increases focus on visual prompts by reducing model priors.

arxiv.org/abs/2403.02325
🧵

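Training-free guidance of this flavor usually contrasts two forward passes, one with the visual prompt visible and one with it removed, then amplifies their difference at decoding time. A sketch of the generic combination (the `model` call signature, the masking step, and `alpha` are illustrative assumptions, not the paper's exact formulation):

```python
import torch

def region_guided_logits(model, image, masked_image, text_ids, alpha=1.5):
    # Two passes: with the marked region visible vs. masked out.
    # Upweighting the difference suppresses the model's priors that
    # are independent of the highlighted region.
    with torch.no_grad():
        logits_with = model(image, text_ids)            # region visible
        logits_without = model(masked_image, text_ids)  # region removed
    return logits_with + alpha * (logits_with - logits_without)
```
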
Javier Rando (@javirandor)

We are announcing the winners of our Trojan Detection Competition on Aligned LLMs!!

🥇 TML Lab (EPFL) (@fra__31, Maksym Andriushchenko 🇺🇦 and Nicolas Flammarion)
🥈 Krystof Mitka
🥉 nev

🧵 With some of the main findings!

Zhuang Liu (@liuzhuang1234)

LLMs are great, but their internals are less explored. I'm excited to share very interesting findings in our paper

“Massive Activations in Large Language Models”

LLMs have very few internal activations with drastically outsized magnitudes, e.g., 100,000x larger than others. (1/n)

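A finding like this is easy to check yourself with forward hooks: record the largest-magnitude hidden activation in each block and compare it to the block's median magnitude. A minimal sketch (gpt2 is just a small runnable stand-in; the paper studies larger LLMs, where the layer attribute path differs, e.g., `model.model.layers` for Llama-family models):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

stats = {}
def make_hook(i):
    def hook(module, inputs, output):
        # Decoder blocks return a tuple whose first item is hidden states.
        h = (output[0] if isinstance(output, tuple) else output).float().abs()
        stats[i] = (h.max().item(), h.median().item())
    return hook

for i, block in enumerate(model.transformer.h):
    block.register_forward_hook(make_hook(i))

with torch.no_grad():
    model(**tok("The quick brown fox jumps over the lazy dog",
                return_tensors="pt"))

for i, (mx, med) in stats.items():
    print(f"block {i:2d}: max |h| = {mx:8.1f}  "
          f"median |h| = {med:.4f}  ratio ~ {mx / med:.0f}x")
```
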
A. Feder Cooper (@afedercooper)

Thrilled to be recognized with a best paper honorable mention at AAAI!

Our paper raises serious questions re: reproducibility + reliability in fairness

We define + mitigate arbitrariness, & find that most fairness benchmarks are actually close-to-fair

This is a BIG 🚩🚩

1/

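One common way to operationalize arbitrariness is self-consistency: train many models on bootstrap resamples of the data and measure how often they agree on each individual's prediction. A sketch under that assumption (the paper's exact definition may differ; binary 0/1 labels assumed):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def self_consistency(X, y, X_test, k=50, seed=0):
    # Fraction of bootstrap-trained models agreeing with the majority
    # prediction per test point; low values mean the prediction is
    # arbitrary (it flips depending on which sample the model saw).
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(k):
        idx = rng.integers(0, len(X), len(X))   # bootstrap resample
        preds.append(DecisionTreeClassifier().fit(X[idx], y[idx])
                     .predict(X_test))
    preds = np.stack(preds)                     # shape (k, n_test)
    maj = (preds.mean(0) >= 0.5).astype(int)    # majority vote
    return (preds == maj).mean(0)               # per-example agreement
```
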
Akshay Chaudhari (@Dr_ASChaudhari)

Our clinical #NLP work was just published in Nature Medicine! We present a framework to adapt & evaluate #LLMs for summarization. Physicians 🩺 prefer #LLM summaries to those of #medical experts❗

Big step to reduce documentation 📚 and focus more on personalized care 🙌

A 🧵

LLM Security (@llm_sec)

Fast Adversarial Attacks on Language Models In One GPU Minute 🌶️

'Our gradient-free targeted attack can jailbreak aligned LMs with high attack success rates within one minute. For instance, BEAST can jailbreak Vicuna-7B-v1.5 under one minute with a success rate of 89%'…

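The quote describes a gradient-free targeted attack; beam search over candidate suffix tokens is the core of that style of attack. A sketch, with `score` as a hypothetical attacker objective (e.g., the logprob of a compliant target continuation under the victim model), not the paper's exact algorithm:

```python
def beam_attack(prompt, candidate_tokens, score, beam_width=5, steps=10):
    # Gradient-free beam search: extend every kept suffix by every
    # candidate token, then keep the beam_width highest-scoring suffixes.
    beams = [("", float("-inf"))]
    for _ in range(steps):
        expanded = [
            (suffix + t, score(prompt + suffix + t))
            for suffix, _ in beams
            for t in candidate_tokens
        ]
        beams = sorted(expanded, key=lambda b: b[1],
                       reverse=True)[:beam_width]
    return beams[0]  # (best adversarial suffix, its score)
```
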
Cas (Stephen Casper) (@StephenLCasper)

🚨New paper🚨 8 Methods to Evaluate Robust Unlearning in LLMs

Unlearning is promising for safer LLMs, but it’s tricky to evaluate. Here, we (1) overview eval techniques, (2) red-team a popular method, and (3) show that ad-hoc evals can be misleading.

arxiv.org/abs/2402.16835

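A recurring theme in this line of work is that a naive eval (verbatim question, check the answer is gone) can pass while paraphrase or other probes still extract the "forgotten" knowledge. A generic harness sketch along those lines (all callables are hypothetical stand-ins, not the paper's eight methods):

```python
def eval_unlearning(model, forget_set, probes):
    # forget_set: (question, answer) pairs the model was unlearned on
    # probes: {name: callable(model, question) -> answer attempt},
    #         e.g., verbatim, paraphrased, translated, jailbreak-style.
    # Robust unlearning should drive leak rates toward zero for *all*
    # probes, not just the verbatim one.
    leak_rates = {}
    for name, probe in probes.items():
        leaks = sum(
            answer.lower() in probe(model, question).lower()
            for question, answer in forget_set
        )
        leak_rates[name] = leaks / len(forget_set)
    return leak_rates
```
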