
Justo Hidalgo
@justohidalgo
Would love to have an artist's heart and a scientist's brain... but in the meantime I do have Mickey Mouse's ears and nose! :D
ID: 143981433
15-05-2010 00:00:08
17.17K Tweets
2.2K Followers
533 Following


A 24-trillion-token web dataset with document-level metadata just dropped on Hugging Face. License: apache-2.0. ESSENTIAL-WEB v1.0 collects 24 trillion tokens from Common Crawl. Each document is labeled with a 12-field taxonomy covering topic, page type, complexity, and quality.

BREAKING: The EU Commission has released a mandatory template for AI developers to disclose training data. Unlike the Code of Practice, this is not optional. It could have global fallout, as rights holders abroad might use it to sue over copyright. digital-strategy.ec.europa.eu/en/library/exp…
