nickcdryan (@nickcdryan)'s Twitter Profile
nickcdryan

@nickcdryan

nlp, deep learning, NYC

ID: 1656707216781586432

Link: http://nickcdryan.com · Joined: 11-05-2023 17:06:03

382 Tweets

197 Followers

865 Following

nickcdryan (@nickcdryan):

I'm thinking I can settle this very quickly. Basically, does this trained retrieval subnet fetch the right context more often than the standard similarity(encoder(question), encoder(contexts))? x.com/nickcdryan/sta…
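The baseline the tweet refers to is standard dense retrieval: encode the question and each candidate context, then pick the context whose vector is most similar to the question's. A minimal sketch, using a made-up bag-of-words `encode` stand-in where a real system would use a trained text encoder:

```python
import numpy as np

def encode(texts, vocab):
    # Hypothetical encoder stand-in: a bag-of-words vector over a shared
    # vocabulary. A real baseline would use a trained embedding model.
    vecs = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            if tok in vocab:
                vecs[i, vocab[tok]] += 1.0
    return vecs

def retrieve(question, contexts):
    # The baseline: cosine similarity(encoder(question), encoder(contexts)),
    # returning the index of the best-matching context.
    words = sorted({w for t in [question] + contexts for w in t.lower().split()})
    vocab = {tok: j for j, tok in enumerate(words)}
    q = encode([question], vocab)[0]
    c = encode(contexts, vocab)
    sims = c @ q / (np.linalg.norm(c, axis=1) * np.linalg.norm(q) + 1e-9)
    return int(np.argmax(sims))

contexts = ["the cat sat on the mat", "stock prices fell sharply today"]
best = retrieve("where did the cat sit", contexts)  # index of best context
```

The question the tweet poses is whether a trained retrieval subnet beats this similarity-ranking baseline at fetching the right context.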

snimu (@omouamoua):

Results on Mixture of Tokenizers (MoT):

I have calculated the perplexity on the Wikipedia dataset (huggingface.co/datasets/wikim…) for MoT models of different sizes and a token-only baseline, and the MoTs closest in size to the baseline beat it. Thread on why that's relevant.
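Perplexity here is the standard language-modeling metric: the exponential of the mean per-token negative log-likelihood, so lower is better. A minimal sketch with illustrative (made-up) token probabilities:

```python
import math

def perplexity(token_logprobs):
    # pplx = exp( -(1/N) * sum_i log p(token_i | context) )
    # i.e. the exponential of the mean per-token negative log-likelihood.
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# Illustrative natural-log probabilities for a 4-token sequence.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
ppl = perplexity(logprobs)  # geometric mean of 1/p = (2*4*2*4)**0.25 ≈ 2.83
```

Comparing models of different sizes on the same held-out text, as the thread does, makes the metric meaningful: a smaller or similar-sized MoT with lower perplexity is genuinely better at predicting the data.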
nickcdryan (@nickcdryan):

In multiple research domains, I find Grok will always hallucinate a "hypothetical" Grok-specific version and insert it among the real ones, lol.

It's nice enough to always call it "hypothetical," but it's basically saying "yeah, I could do that too if I wanted."
nickcdryan (@nickcdryan):

A good feedback cycle: new models know how to use themselves because "how to use LLMs, their strengths and weaknesses" now appears in the training data. You don't have to warn them nearly as much to stay away from arithmetic, one-shotting big projects, spelling, etc.

nickcdryan (@nickcdryan):

In a better world...

I've included this for most of my recent work.

Doing this helps everyone by:
- trading notes with future researchers
- explaining motivation
- forcing you to keep a log of results
- indicating how thoroughly explored (and possibly how over-hparamed) the
nickcdryan (@nickcdryan):

> score is calculated based on "how often words are used together"
> just pick obscure words that don't get used at all
> rejected
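The "how often words are used together" score the greentext is gaming is typically a simple co-occurrence count (or a statistic built on one, like PMI). A minimal sketch, assuming the simplest version — counting how often a word pair appears in the same sentence — which shows why obscure words trivially score zero:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_score(word_a, word_b, corpus):
    # Count how often the pair of words appears together in one sentence.
    counts = Counter()
    for sentence in corpus:
        toks = set(sentence.lower().split())
        for pair in combinations(sorted(toks), 2):
            counts[pair] += 1
    return counts[tuple(sorted((word_a, word_b)))]

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
common = cooccurrence_score("cat", "sat", corpus)      # pair co-occurs once
obscure = cooccurrence_score("cat", "syzygy", corpus)  # never co-occurs -> 0
```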
nickcdryan (@nickcdryan):

Getting to the heart of the matter here, and fixing it.

I've heard many times that batching is the culprit, but this is the first in-depth explanation I've seen.
nickcdryan (@nickcdryan):

Even simpler: it's just a basic requirement, because there isn't enough time or money to grid-search your idea. And if grid-searching your idea is the only way to make it work, it's probably not worth it anyway.
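The cost argument is just combinatorics: a full grid multiplies the number of values per hyperparameter, so even modest ranges explode into many full training runs. A sketch with made-up hyperparameter ranges:

```python
from itertools import product

# Hypothetical search space: modest value ranges per hyperparameter.
grid = {
    "lr":           [1e-4, 3e-4, 1e-3, 3e-3],
    "batch_size":   [32, 64, 128],
    "weight_decay": [0.0, 0.01, 0.1],
    "warmup_steps": [100, 500, 1000],
}

configs = list(product(*grid.values()))
n_runs = len(configs)  # 4 * 3 * 3 * 3 = 108 full training runs
```

If each run takes hours of GPU time, 108 runs is already out of reach for most side projects, which is the point: an idea that only works under that kind of search likely isn't robust.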