Maxime Gasse (@maxime_gasse)'s Twitter Profile
Maxime Gasse

@maxime_gasse

Senior Research Scientist @ ServiceNow.
Trying to answer the question, can machines think?

ID: 963446045874315264

http://www.maximegasse.com/ 13-02-2018 16:13:32

42 Tweets

101 Followers

22 Following

Massimo Caccia (@masscaccia)

🚀 LLAMA3 is the first open-source LLM to ace tasks in WorkArena, making it the top OSS virtual knowledge worker! (And believe us, we've tested many models and prompting techniques 😅) Watch it excel in a challenging knowledge base task. Kudos to AI at Meta for the amazing model 🎉

Alexandre Lacoste (@alex_lacoste_)

🧵) We unexpectedly reach 🥇 on the leaderboard of #WebArena. While 25% is still far from human performance, it is a large jump compared to the next best result. The performance gain is largely attributed to #BrowserGym github.com/ServiceNow/Bro… leaderboard: bit.ly/3QjOL5r

Arjun Ashok (@arjunashok37)

Introducing TACTiS-2: Better, Faster, Simpler Attentional Copulas for Multivariate Time Series. A highly flexible model for multivariate probabilistic time series prediction. To be presented at ICLR 2024. Find us at Hall B #145 on Thu 9 May 10:45 a.m. - 12:45 p.m. Details🧵👇

Alexandre L.-Piché (@alexpiche_)

Introducing ReSearch: An iterative self-reflection algorithm that enhances LLMs' self-restraint abilities:
• Encouraging abstention when uncertain
• Producing accurate, informative content when confident

Result: Significant accuracy boost for Llama2 7B Chat and Mistral 7B! 🚀

Alexandre Lacoste (@alex_lacoste_)

Most of our team is at #ICML2024; reach out if you want to meet. We'll be presenting WorkArena and BrowserGym: Poster Session 2 on Tuesday, Hall C 4-9 #610 arxiv.org/abs/2403.07718

Alexandre Lacoste (@alex_lacoste_)

🧵🚀 New WebAgent Benchmark Alert! 🚀 There is hope for human workers! We released WorkArena++, a new challenging benchmark for WebAgents. Our best agent achieves 0% accuracy on this benchmark, while human evaluators still obtain 94%! 🔗 arxiv.org/abs/2407.05291

Alexandre Drouin (@alexandredrouin)

Interested in time series forecasting and LLMs?

We are looking for visiting researchers to work on context-aided forecasting (example below):
* Benchmarking
* Multimodal Foundation Models
* Agentic forecasting assistants

When: Jan '25 - 8 months
Details: bit.ly/sc25q1

Massimo Caccia (@masscaccia)

🚨Internship Alert! Join ServiceNow Research to develop **generalist web agents** that handle complex tasks via browsers, from automating research to managing IT workflows!

🌐 Fine-tune LLMs into agents, explore datasets, and much more, all in Montreal!

forms.gle/wHjb5L6A9rNBEW…

🇺🇦 Dzmitry Bahdanau (@dbahdanau)

🚨 New agent framework! 🚨

My team at ServiceNow Research is releasing TapeAgents: a holistic framework for agent development and optimization. At its core is the tape: a structured agent log.

Repo: github.com/ServiceNow/Tap…
Paper: servicenow.com/research/TapeA…

Why you should care: 🧵

Alexandre Lacoste (@alex_lacoste_)

Anthropic Early results with Claude 3.5 Sonnet for our new paper. We're probably not even using it right yet and its performance is through the roof, leaving o1-mini in the dust (o1-preview results are coming).

See github.com/ServiceNow/Bro… for a growing number of web UI benchmarks.

Massimo Caccia (@masscaccia)

Awesome work! Not necessarily fundamentally new 😅 We've been working on this for a year now, and we have the same demo :p See the GTC talk: nvidia.com/en-us/on-deman…

Arjun Ashok (@arjunashok37)

(New paper alert!) Forecasting models typically rely on numerical historical data. However, in many cases, numerical data is insufficient and context is key. E.g., In the series below, would you have predicted the drop? Even the best models do not (forecast in blue).

Alexandre Lacoste (@alex_lacoste_)

🧵-1
We are thrilled to release #AgentLab, a new open-source package for developing and evaluating web agents. This builds on the new #BrowserGym package, which supports 10 different benchmarks, including #WebArena.

Alexandre Lacoste (@alex_lacoste_)

We’re really excited to release this large collaborative work for unifying web agent benchmarks under the same roof.

In this TMLR paper, we dive in-depth into #BrowserGym and #AgentLab. We also present some unexpected performances from Claude 3.5-Sonnet.

Maxime Gasse (@maxime_gasse)

How do LLMs deal with misinformation? The answer is: not very well, but a natural resilience seems to emerge with larger models. Check out Mo Samsami's latest work to learn more!