Maxime Gasse (@maxime_gasse) Twitter Tweets • TwiCopy

Massimo Caccia

2 years ago

🚀 LLAMA3 is the first open-source LLM to ace tasks in Workarena, making it the top OSS virtual knowledge worker! (and believe us we've tested many models and prompting techniques 😅) Watch it excel in a challenging knowledge base task. Kudos to AI at Meta for the amazing model 🎉

43 likes · 5 replies · 13 reposts

Alexandre Lacoste

@alex_lacoste_

2 years ago

🧵) We unexpectedly reach 🥇 on the leaderboard of #WebArena. While 25% is still far from human performance it is a large jump compared to the next best result. The performance gain is largely attributed to #BrowserGym github.com/ServiceNow/Bro… leaderboard: bit.ly/3QjOL5r

54 likes · 5 replies · 19 reposts

Arjun Ashok

@arjunashok37

2 years ago

Introducing TACTiS-2: Better, Faster, Simpler Attentional Copulas for Multivariate Time Series. A highly flexible model for multivariate probabilistic time series prediction. To be presented at ICLR 2024. Find us at Hall B #145 on Thu 9 May 10:45 a.m. - 12:45 p.m. Details🧵👇

43 likes · 2 replies · 23 reposts

Alexandre L.-Piché

@alexpiche_

a year ago

Introducing ReSearch: An iterative self-reflection algorithm that enhances LLM's self-restraint abilities: • Encouraging abstention when uncertain • Producing accurate, informative content when confident Result: Significant accuracy boost for Llama2 7B Chat and Mistral 7B! 🚀

102 likes · 1 reply · 45 reposts

Alexandre Lacoste

@alex_lacoste_

a year ago

Most of our team is at #ICML2024, reach out if you want to meet. We'll be presenting WorkArena and BrowserGym: Poster Session 2 on Tuesday, Hall C 4-9 #610 arxiv.org/abs/2403.07718

24 likes · 5 replies · 16 reposts

Alexandre Lacoste

@alex_lacoste_

a year ago

🧵🚀 New WebAgent Benchmark Alert! 🚀 There is hope for human workers! We released WorkArena++, a new challenging benchmark for WebAgents. Our best agent achieves 0% accuracy on this benchmark, while human evaluators still obtain 94%! 🔗 arxiv.org/abs/2407.05291

85 likes · 2 replies · 29 reposts

Alexandre Drouin

@alexandredrouin

a year ago

Interested in time series forecasting and LLMs? We are looking for visiting researchers to work on context-aided forecasting (example below): * Benchmarking * Multimodal Foundation Models * Agentic forecasting assistants When: Jan '25 - 8 months Details: bit.ly/sc25q1

23 likes · 0 replies · 21 reposts

Massimo Caccia

@masscaccia

a year ago

🚨Internship Alert! Join ServiceNow Research to develop **generalist web agents** that handle complex tasks via browsers—from automating research to managing IT workflows! 🌐 Fine-tune LLMs into agents, explore datasets, and many more —all in Montreal! forms.gle/wHjb5L6A9rNBEW…

64 likes · 1 reply · 26 reposts

🇺🇦 Dzmitry Bahdanau

@dbahdanau

a year ago

🚨 New agent framework! 🚨 My team at ServiceNow Research is releasing TapeAgents: a holistic framework for agent development and optimization. At its core is the tape: a structured agent log. Repo: github.com/ServiceNow/Tap… Paper: servicenow.com/research/TapeA… Why you should care: 🧵

154 likes · 5 replies · 40 reposts

Alexandre Lacoste

@alex_lacoste_

a year ago

@AnthropicAI Early results with Claude 3.5 Sonnet for our new paper. We're probably not even using it right yet and its performance is through the roof, leaving o1-mini in the dust (o1-preview results are coming). See github.com/ServiceNow/Bro… for a growing number of web-UI benchmarks.

19 likes · 0 replies · 7 reposts

Maxime Gasse

@maxime_gasse

a year ago

Really cool results for Claude!

0 likes · 0 replies · 0 reposts

Massimo Caccia

@masscaccia

a year ago

Awesome work! Not necessarily fundamentally new 😅 We've been working on this for a year now, and we have the same demo :p See the GTC talk: nvidia.com/en-us/on-deman…

20 likes · 1 reply · 4 reposts

Arjun Ashok

@arjunashok37

a year ago

(New paper alert!) Forecasting models typically rely on numerical historical data. However, in many cases, numerical data is insufficient and context is key. E.g., In the series below, would you have predicted the drop? Even the best models do not (forecast in blue).

61 likes · 2 replies · 31 reposts

Alexandre Lacoste

@alex_lacoste_

a year ago

🧵-1 We are thrilled to release #AgentLab, a new open-source package for developing and evaluating web agents. This builds on the new #BrowserGym package which supports 10 different benchmarks, including #WebArena.

155 likes · 4 replies · 59 reposts

Alexandre Lacoste

@alex_lacoste_

a year ago

We’re really excited to release this large collaborative work for unifying web agent benchmarks under the same roof. In this TMLR paper, we dive in-depth into #BrowserGym and #AgentLab. We also present some unexpected performances from Claude 3.5-Sonnet

104 likes · 3 replies · 32 reposts

Maxime Gasse

@maxime_gasse

a year ago

Glad to be part of this great collaborative effort 😊

6 likes · 0 replies · 0 reposts

Maxime Gasse

@maxime_gasse

a year ago

How do LLMs deal with misinformation? The answer is: not very well, but a natural resilience seems to emerge with larger models. Check out Mo Samsami's latest work to know more!

7 likes · 1 reply · 4 reposts

Léo Boisvert

@leoboisvert

a year ago

🔥 Fresh off the GPU, new WorkArena-L1 results are in! 🔥 Llama 3.3 70B: 34.5% (↑6.6% from 3.1) Qwen 2.5 32B: 27.9% Even the small models shine: Qwen 2.5 7B (8.2%) doubles Llama 3.1 8B (4%)! ☕️ These models are working harder than me on a Monday morning ☕️

21 likes · 1 reply · 10 reposts