rom1504/video_platform_filter.md

## video_platform_filter.md

      
End goal: have a function that keeps only interesting video platform links.
Having this would enable getting billions of such links via cc2dataset.
This function is a binary classifier.
To evaluate it, we need links that naturally occur in Common Crawl.
Criteria:

- Links that do not contain a video downloadable by yt-dlp should be discarded.
- "Bad" links (e.g. porn) should be discarded in the vast majority of cases.

To collect this eval set, we can:

- take 100k links from 10 random WAT files
- apply N manually written regexes to keep only reasonable platforms
- apply yt-dlp to find out which links work and which don't
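The collection steps above can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: the regex allow-list and the function names (`looks_like_platform_link`, `ytdlp_can_download`, `build_eval_set`) are hypothetical, and it assumes the `yt-dlp` command-line tool is installed and on the `PATH`. Extracting the links from the WAT files is left out.

```python
import re
import subprocess

# Hypothetical allow-list: illustrative patterns only, not the project's regexes.
PLATFORM_PATTERNS = [
    re.compile(r"^https?://(www\.)?youtube\.com/watch\?v=[\w-]{11}"),
    re.compile(r"^https?://youtu\.be/[\w-]{11}"),
    re.compile(r"^https?://(www\.)?vimeo\.com/\d+"),
    re.compile(r"^https?://(www\.)?dailymotion\.com/video/\w+"),
]


def looks_like_platform_link(url: str) -> bool:
    """Cheap first pass: keep only URLs matching a known platform pattern."""
    return any(p.search(url) for p in PLATFORM_PATTERNS)


def ytdlp_can_download(url: str, timeout: int = 60) -> bool:
    """Return True if the yt-dlp CLI resolves the URL (nothing is downloaded)."""
    try:
        result = subprocess.run(
            ["yt-dlp", "--simulate", "--quiet", "--no-warnings", url],
            capture_output=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # Timeout, or yt-dlp is not installed: treat the link as not working.
        return False
    return result.returncode == 0


def build_eval_set(urls, checker=ytdlp_can_download):
    """Regex-filter the URLs, then label each survivor with the checker."""
    candidates = [u for u in urls if looks_like_platform_link(u)]
    return [(u, checker(u)) for u in candidates]
```

The `checker` argument is there so the labeling step can be swapped out (for example with a cached lookup) without touching the regex filter.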

Once that's done, the actual classifier can be evaluated on this eval set.
The classifier can first be implemented as a fine-tuned transformer, for quality.
To improve speed, regexes can be used.
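Evaluating a binary classifier on such a labeled set could look like the sketch below. It assumes the eval set is a list of `(url, label)` pairs where `label` is `True` for links worth keeping; the `evaluate` helper and the choice of precision and recall as metrics are assumptions, not something this note specifies.

```python
def evaluate(classifier, labeled_links):
    """Compute precision and recall of `classifier` on (url, label) pairs."""
    tp = fp = fn = tn = 0
    for url, label in labeled_links:
        pred = bool(classifier(url))
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
        elif not pred and label:
            fn += 1
        else:
            tn += 1
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall,
            "tp": tp, "fp": fp, "fn": fn, "tn": tn}
```

Any callable taking a URL and returning a truthy or falsy value can be passed in, so a transformer-based classifier and a regex-based one can be compared on the same set.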
Related issue: rom1504/cc2dataset#27
Existing code:

- https://huggingface.co/datasets/cc-platform-links/training-data/tree/main (first dataset, without regex filtering)
- https://github.com/marianna13/cc2dataset/blob/check_platform_urls/examples/check_platform_urls.py (beginning of a script to get the data)
- https://huggingface.co/datasets/ChristophSchuhmann/video-url-detector (first classifier)
- https://colab.research.google.com/drive/1XiGT_mAxttAfOO0ZVTGIc8q2zdPyvpU7 (uses the first characters of the URL as a cheap classifier)
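As an illustration of the "first characters of the URL" idea, here is one possible sketch. It is not the notebook's code: the majority-vote lookup table, the `train_prefix_classifier` name, and the default `prefix_len` are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_prefix_classifier(labeled_links, prefix_len=30):
    """Build a classifier that votes by URL prefix (hypothetical approach)."""
    counts = defaultdict(Counter)
    for url, label in labeled_links:
        counts[url[:prefix_len]][bool(label)] += 1
    # A prefix is accepted only if positives strictly outnumber negatives.
    table = {prefix: c[True] > c[False] for prefix, c in counts.items()}

    def classify(url: str) -> bool:
        # Prefixes never seen during training are rejected.
        return table.get(url[:prefix_len], False)

    return classify
```

A lookup like this costs one slice and one dictionary access per URL, which is the appeal of a prefix-based approach when filtering billions of links.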

Next steps:

- create a GitHub repo to have a single place to develop this
- keep iterating until it works well