End goal: a function that keeps only interesting video platform links. With it, cc2dataset could be used to collect billions of such links.
This function is a binary classifier. To evaluate it we need links that occur naturally in Common Crawl. Criteria:
- links that do not contain a video downloadable by yt-dlp should be discarded
- the vast majority of "bad" links (e.g. porn) should be discarded
To collect this eval set we can:
- take 100k links from 10 random WAT files
- apply N manually written regexes to keep only reasonable platforms
- run yt-dlp to find out which links work and which don't
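A minimal sketch of the last two steps, assuming the links have already been extracted from the WAT files into a list of strings. The patterns in `PLATFORM_PATTERNS` are hypothetical placeholders for the manually written regexes, and `build_eval_set` is an illustrative helper name, not existing code. The yt-dlp check only asks for metadata, so nothing is downloaded.

```python
import re

# Hypothetical allow-list; the real patterns would be written by hand.
PLATFORM_PATTERNS = [
    re.compile(r"^https?://(www\.)?youtube\.com/watch\?v=[\w-]{11}"),
    re.compile(r"^https?://youtu\.be/[\w-]{11}"),
    re.compile(r"^https?://(www\.)?vimeo\.com/\d+"),
]


def matches_platform(url):
    """Return True if the URL matches at least one platform pattern."""
    return any(p.search(url) for p in PLATFORM_PATTERNS)


def is_downloadable(url):
    """Return True if yt-dlp can extract metadata for the URL."""
    import yt_dlp  # imported lazily so the regex step works without it

    opts = {"quiet": True, "no_warnings": True}
    try:
        with yt_dlp.YoutubeDL(opts) as ydl:
            ydl.extract_info(url, download=False)
        return True
    except yt_dlp.utils.DownloadError:
        return False


def build_eval_set(urls, check=is_downloadable):
    """Keep URLs on known platforms and label each one with the check result."""
    return [(u, check(u)) for u in urls if matches_platform(u)]
```

Passing `check` as a parameter makes it possible to try the regex step offline with a stub before running yt-dlp on the full 100k links.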
Once that's done, the actual classifier can be evaluated on this.
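Since the classifier is binary, one way to score it on the eval set is precision and recall. The sketch below assumes the eval set is a list of `(url, label)` pairs where `label` is `True` for links that should be kept; the `evaluate` function and that data layout are assumptions for illustration.

```python
def evaluate(classifier, labeled_urls):
    """Compute confusion counts, precision and recall for a URL classifier.

    classifier: callable taking a URL and returning True (keep) or False.
    labeled_urls: iterable of (url, label) pairs, label True means keep.
    """
    tp = fp = fn = tn = 0
    for url, label in labeled_urls:
        pred = bool(classifier(url))
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
        elif not pred and label:
            fn += 1
        else:
            tn += 1
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}
```

High precision corresponds to the criteria above (few broken or "bad" links kept), while recall measures how many good links survive the filter.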
The classifier can first be implemented as a fine-tuned transformer for quality. To improve speed, regexes can be used.
Related issue: rom1504/cc2dataset#27
Existing code:
- https://huggingface.co/datasets/cc-platform-links/training-data/tree/main first dataset, without regex filtering
- https://github.com/marianna13/cc2dataset/blob/check_platform_urls/examples/check_platform_urls.py beginning of script to get the data
- https://huggingface.co/datasets/ChristophSchuhmann/video-url-detector first classifier
- https://colab.research.google.com/drive/1XiGT_mAxttAfOO0ZVTGIc8q2zdPyvpU7 use first characters of url as cheap classifier
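The notebook is not reproduced here. One possible reading of "first characters of url as cheap classifier" is a lookup table keyed on a fixed-length URL prefix, learned from labeled links. The function names and the `prefix_len` default below are assumptions, not taken from the notebook.

```python
from collections import defaultdict


def fit_prefix_table(labeled_urls, prefix_len=30):
    """Map each URL prefix to the majority label seen for that prefix."""
    counts = defaultdict(lambda: [0, 0])  # prefix -> [negatives, positives]
    for url, label in labeled_urls:
        counts[url[:prefix_len]][int(bool(label))] += 1
    # A prefix is kept only when positives strictly outnumber negatives.
    return {prefix: pos > neg for prefix, (neg, pos) in counts.items()}


def predict(table, url, prefix_len=30, default=False):
    """Classify a URL by its prefix; unseen prefixes fall back to `default`."""
    return table.get(url[:prefix_len], default)
```

For example, a table fitted with `prefix_len=24` on a few YouTube links would accept any URL starting with `https://www.youtube.com/` and reject prefixes it has never seen.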
Next steps:
- create GitHub repo to have a single place to develop this
- keep iterating until it works well