End goal: a function that keeps only interesting video platform links. With it, we could collect billions of such links via cc2dataset.
This function is a binary classifier. To evaluate it, we need links that occur naturally in Common Crawl. Criteria:
- Links that do not point to a video downloadable by yt-dlp should be discarded
- The vast majority of "bad" links (e.g. porn) should be discarded
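As a rough illustration of the two criteria, here is a minimal sketch of such a filter and of how it could be scored on labeled links. The names `keep_link`, `evaluate`, `VIDEO_DOMAINS` and `BLOCKED_KEYWORDS` are hypothetical, and the domain and keyword lists are placeholder assumptions, not a curated set. Matching on the domain only approximates "downloadable by yt-dlp"; it does not actually check that a video is present.

```python
from urllib.parse import urlparse

# Hypothetical placeholder lists; real ones would need to be curated.
VIDEO_DOMAINS = {"youtube.com", "youtu.be", "vimeo.com", "dailymotion.com"}
BLOCKED_KEYWORDS = ("porn", "xxx")


def keep_link(url: str) -> bool:
    """Binary classifier: True if the link should be kept."""
    lowered = url.lower()
    # Criterion 2: drop links that look "bad".
    if any(word in lowered for word in BLOCKED_KEYWORDS):
        return False
    # Criterion 1 (approximation): keep only known video platform domains.
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in VIDEO_DOMAINS)


def evaluate(labeled_links):
    """Return (precision, recall) on an iterable of (url, should_keep) pairs."""
    tp = fp = fn = 0
    for url, should_keep in labeled_links:
        kept = keep_link(url)
        if kept and should_keep:
            tp += 1
        elif kept and not should_keep:
            fp += 1
        elif not kept and should_keep:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

The eval set would then be a list of `(url, should_keep)` pairs taken from Common Crawl, and `evaluate` reports how many kept links are good (precision) and how many good links are kept (recall).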
To collect this eval set we can: