Trustworthy ML Initiative (TrustML) (@trustworthy_ml)'s Twitter Profile
Trustworthy ML Initiative (TrustML)

@trustworthy_ml

Latest research in Trustworthy ML. Organizers: @JaydeepBorkar @sbmisi @hima_lakkaraju @sarahookr Sarah Tan @chhaviyadav_ @_cagarwal @m_lemanczyk @HaohanWang

ID: 1262375165490540549

Website: https://www.trustworthyml.org · Joined: 18-05-2020 13:31:24

1.7K Tweets

6.0K Followers

64 Following

Yixin Wan (@yixin_wan_)

How to identify bias in language agency? E.g., in texts describing White men as “leading” & Black women as “helping”? 🧐
🔎 String matching? ❌ No!
🔎 Sentiment classifier? ❌ No!
✅ Our agency classifier CAN! It reveals gender, racial, and intersectional bias 🤯
🔗: arxiv.org/abs/2404.10508

Hima Lakkaraju (@hima_lakkaraju)

As we increasingly rely on #LLMs for product recommendations and searches, can companies game these models to enhance the visibility of their products?

Our latest work provides answers to this question & demonstrates that LLMs can be manipulated to boost product visibility!…

Giang Nguyen (@giangnguyen2412)

🚀 Exciting news! Our latest work, CHM-Corr++, has been accepted for presentation at a workshop at CVPR 2024! 🎉

The work lies at the intersection of interactive XAI and human-AI collaboration.

Demo: http://137.184.82.109:7080/
Paper: arxiv.org/abs/2404.05238

Maksym Andriushchenko 🇺🇦 (@maksym_andr)

🚨 Are leading safety-aligned LLMs adversarially robust? 🚨

❗In our new work, we jailbreak basically all of them with ≈100% success rate (according to GPT-4 as a semantic judge):
- Claude 1.2 / 2.0 / 2.1 / 3 Haiku / 3 Sonnet / 3 Opus,
- GPT-3.5 / GPT-4,
- R2D2-7B from…

Canyu Chen (@CanyuChen3)

Thanks to LLM Security for sharing our new work 'Can LLM-Generated Misinformation Be Detected?'

🔗Project website (paper, dataset, and code): llm-misinformation.github.io

🚨 LLM-generated misinformation is one of the most critical risks to AI safety. One fundamental…

Patrick Chao (@patrickrchao)

Are you interested in jailbreaking LLMs? Have you ever wished that jailbreaking research was more standardized, reproducible, or transparent?

Check out JailbreakBench, an open benchmark and leaderboard for jailbreak attacks and defenses on LLMs!

jailbreakbench.github.io
🧵1/n

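For anyone who wants to poke at the benchmark, here is a minimal loading sketch. The Hugging Face dataset name ("JailbreakBench/JBB-Behaviors") and the "behaviors" config are assumptions on my part; jailbreakbench.github.io has the authoritative instructions.

```python
from datasets import load_dataset

# Assumed repo/config names; see jailbreakbench.github.io for the project's
# own loading instructions and the official evaluation harness.
behaviors = load_dataset("JailbreakBench/JBB-Behaviors", "behaviors")

# Inspect whatever splits the benchmark ships with.
for split_name, split in behaviors.items():
    print(split_name, len(split), split.column_names)
```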
SAIL @ Imperial College London (@SAILImperial)

We're recruiting two Research Assistants to join us and work on the security of ML-based personal assistants at Imperial College London. The role will focus on verification, robustification, and adversarial attacks for AI assistants: rb.gy/mcxvob

Jaemin Cho (@jmin__cho)

Can we adaptively generate training environments with LLMs to help small embodied RL game agents learn useful skills that they are weak at? 🤔

👉 Check out EnvGen, an effective+efficient framework in which an LLM progressively generates and adapts training environments based on…

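A rough sketch of the loop the tweet describes, under my own reading: evaluate the agent, find weak skills, ask an LLM for environments targeting them, retrain, repeat. Every helper below (llm_generate_envs, train_agent, evaluate_skills) is a hypothetical placeholder, not EnvGen's actual code.

```python
# Conceptual sketch of an EnvGen-style adaptive curriculum; all helpers are placeholders.

def llm_generate_envs(weak_skills):
    """Ask an LLM for environment configs that exercise the skills the agent is weak at."""
    return [{"target_skill": s, "difficulty": "easy"} for s in weak_skills]

def train_agent(agent, env_configs, steps=1000):
    """Run RL training of the small agent in the generated environments (placeholder)."""
    return agent

def evaluate_skills(agent, skills):
    """Return a per-skill success rate (placeholder)."""
    return {s: 0.5 for s in skills}

agent = object()
skills = ["collect_wood", "make_tool", "defeat_monster"]
for cycle in range(4):
    scores = evaluate_skills(agent, skills)
    weak = [s for s, v in scores.items() if v < 0.6]   # skills the agent struggles with
    envs = llm_generate_envs(weak)                     # LLM adapts the curriculum
    agent = train_agent(agent, envs)                   # train on the new environments
```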
Przemyslaw Grabowicz (@przemyslslaw)

The U.S. Supreme Court has ended the use of race in college admissions. Fortunately, there exists a path to fair algorithmic decision-making that differs from the invalidated affirmative action measures, as we discuss in our recent Uncommon Good post:
uncommongood.substack.com/p/fair-machine…

Matthew Finlayson (@mattf1n)

Wanna know gpt-3.5-turbo's embed size? We find a way to extract info from LLM APIs and estimate gpt-3.5-turbo’s embed size to be 4096. With the same trick we also develop 25x faster logprob extraction, audits for LLM APIs, and more!
📄 arxiv.org/abs/2403.09539
Here’s how 1/🧵

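The core of the trick, in a toy form: the final layer maps a d-dimensional hidden state to vocab-size logits, so full logit vectors collected from the API lie in a d-dimensional subspace and their numerical rank reveals the embed size. The sketch below simulates this with small random weights instead of real API calls; it illustrates the principle, not the paper's method for extracting the vectors.

```python
import numpy as np

# Toy stand-in for an API model: logits = W @ h with a (vocab x hidden) output layer.
vocab, hidden = 2000, 128          # small sizes for speed; same logic at gpt-3.5 scale
rng = np.random.default_rng(0)
W = rng.standard_normal((vocab, hidden))

# "Query the API" more times than the hidden size and stack the full logit vectors.
n_queries = hidden + 64
H = rng.standard_normal((hidden, n_queries))   # unknown hidden states behind each query
logits = W @ H                                 # shape (vocab, n_queries), rank <= hidden

# The numerical rank of the stacked logits estimates the hidden (embed) size.
s = np.linalg.svd(logits, compute_uv=False)
print(int((s > s[0] * 1e-8).sum()))            # prints 128
```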
Canyu Chen (@CanyuChen3)

🤔Can LLM agents really simulate human behaviors?

🌟Our new paper 'Can Large Language Model Agents Simulate Human Trust Behaviors?' (Project website: camel-ai.org/research/agent…) provides some new insights into this fundamental problem.

✨TLDR: We discover the trust behaviors of…

Sharon Levy (@sharonlevy21)

🧐Are LLM responses to public health questions biased toward specific demographic groups?

In our new interdisciplinary collaboration, we find that disparities exist among model answers for different groups across ages, U.S. locations, and sexes.

Paper: arxiv.org/pdf/2403.04858…

Eric Wallace (@Eric_Wallace_)

The final layer of an LLM up-projects from hidden dim -> vocab size. The logprobs are thus low rank, and with some clever API queries, you can recover an LLM’s hidden dimension (or even the exact layer’s weights).

Our new paper is out, a collaboration between a lot of friends!
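A toy numpy check of that claim (the low-rank part and weight recovery up to a linear map), not the paper's actual API attack: the column space of many stacked logit vectors equals the column space of the final projection, so an SVD recovers that layer's weights up to an unknown hidden-by-hidden transform.

```python
import numpy as np

vocab, hidden, n_queries = 2000, 128, 256
rng = np.random.default_rng(1)
W = rng.standard_normal((vocab, hidden))        # the "secret" final-layer weights
H = rng.standard_normal((hidden, n_queries))    # unknown hidden states behind each query
logits = W @ H                                  # low rank: at most `hidden`

# Top-`hidden` left singular vectors span the same column space as W,
# i.e. they give W up to an (unknown) invertible hidden x hidden transform.
U, S, Vt = np.linalg.svd(logits, full_matrices=False)
W_hat = U[:, :hidden]

# Check: W is reproduced (numerically) by a linear map of W_hat.
M, *_ = np.linalg.lstsq(W_hat, W, rcond=None)
print(np.allclose(W_hat @ M, W))                # True
```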

Nicolas Papernot (@NicolasPapernot)

Just one month left before the SaTML Conference, April 9-11 in Toronto! I am excited to hear from Somesh Jha, Deb Raji, Yves-A. de Montjoye, and Sheila McIlraith, as well as the authors of accepted papers and the competition organizing teams!

There's still time to register! satml.org

Przemyslaw Grabowicz (@przemyslslaw)

Our first Uncommon Good post (with Nick Perello) discusses how to train AI systems that do not propagate discrimination, in compliance with legal provisions, based on our research published at ACM FAccT, the AAAI/ACM Conference on AI, Ethics, and Society (AIES), and ICML. Stay tuned!
open.substack.com/pub/uncommongo…

David Wan (@meetdavidwan)

Pointing to an image region should help models focus, but standard VLMs fail to understand visual markers/prompts (e.g., boxes/masks).

🚨Contrastive Region Guidance: Training-free method that increases focus on visual prompts by reducing model priors.

arxiv.org/abs/2403.02325
🧵

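A minimal sketch of the general contrastive-guidance idea behind a method like this: score the next token with the visual prompt present and with the highlighted region removed, then amplify the difference. The vlm_logits helper and the alpha weight are hypothetical placeholders, not the paper's interface; see the arXiv link for the actual formulation.

```python
import numpy as np

def vlm_logits(image_path: str, text: str) -> np.ndarray:
    """Hypothetical stand-in for a VLM's next-token logits; swap in a real model."""
    rng = np.random.default_rng(len(image_path) + len(text))
    return rng.standard_normal(32000)

def contrastive_region_guidance(img_with_prompt, img_region_removed, text, alpha=1.0):
    # Contrast the prediction conditioned on the visual prompt against one where the
    # highlighted region is blacked out, amplifying what the region contributes.
    with_region = vlm_logits(img_with_prompt, text)
    without_region = vlm_logits(img_region_removed, text)
    return (1 + alpha) * with_region - alpha * without_region

adjusted = contrastive_region_guidance("img_boxed.png", "img_masked.png", "What is the man holding?")
print(adjusted.shape)  # (32000,) adjusted next-token logits
```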
Javier Rando (@javirandor)

We are announcing the winners of our Trojan Detection Competition on Aligned LLMs!!

🥇 TML Lab (EPFL) (@fra__31, Maksym Andriushchenko 🇺🇦 and Nicolas Flammarion)
🥈 Krystof Mitka
🥉 nev

🧵 With some of the main findings!

Zhuang Liu (@liuzhuang1234)

LLMs are great, but their internals are less explored. I'm excited to share very interesting findings in our paper

“Massive Activations in Large Language Models”

LLMs have very few internal activations with drastically outsized magnitudes, e.g., 100,000x larger than others. (1/n)

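An easy way to look for this yourself with the Hugging Face transformers library; gpt2 is used here only as a small, locally runnable stand-in for the larger LLMs studied in the paper. The ratio of the largest hidden-state magnitude to the median one is a crude proxy for the "massive activation" effect.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small stand-in; the paper studies much larger models
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("Trustworthy machine learning is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# For each layer, compare the single largest activation magnitude to the median one.
for i, h in enumerate(out.hidden_states):
    a = h.abs()
    print(f"layer {i:2d}: max/median = {(a.max() / a.median()).item():.0f}")
```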
A. Feder Cooper (@afedercooper)

Thrilled to be recognized with best paper honorable mention at AAAI!

Our paper raises serious questions re: reproducibility + reliability in fairness

We define + mitigate arbitrariness, & find that most fairness benchmarks are actually close-to-fair

This is a BIG 🚩🚩

1/
