@rom1504
Last active October 20, 2023 15:51
Filtering URLs to keep only video platform links

End goal: a function that keeps only interesting video platform links. With it, cc2dataset could be used to collect billions of such links.

This function is a binary classifier. To evaluate it we need links that occur naturally in Common Crawl. Criteria:

  • Links that do not contain a video downloadable by yt-dlp should be discarded
  • "Bad" links (e.g. porn) should be discarded in the vast majority of cases

To collect this eval set we can:

  • take 100k links from 10 random WAT files
  • apply N manually written regexes to keep only reasonable platforms
  • apply yt-dlp to find out which links work and which don't
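
The last two steps above could be sketched roughly as follows. This is a minimal sketch, not existing code: the platform allow-list, `is_known_platform`, `is_downloadable` and `build_eval_set` are hypothetical names chosen for illustration, and the real set of regexes is not specified in this note. The yt-dlp check assumes the `yt_dlp` Python package is installed and makes a network request per link.

```python
import re
from urllib.parse import urlparse

# Hypothetical allow-list: the actual platforms to keep are not decided in this note.
PLATFORM_PATTERNS = [
    re.compile(r"(^|\.)youtube\.com$"),
    re.compile(r"(^|\.)youtu\.be$"),
    re.compile(r"(^|\.)vimeo\.com$"),
    re.compile(r"(^|\.)dailymotion\.com$"),
]


def is_known_platform(url: str) -> bool:
    """Return True if the URL's host matches one of the allow-listed platforms."""
    host = (urlparse(url).hostname or "").lower()
    return any(p.search(host) for p in PLATFORM_PATTERNS)


def is_downloadable(url: str) -> bool:
    """Return True if yt-dlp can resolve video metadata for the URL (network call)."""
    import yt_dlp  # imported lazily so the regex step works without yt-dlp installed

    opts = {"quiet": True, "no_warnings": True, "skip_download": True}
    try:
        with yt_dlp.YoutubeDL(opts) as ydl:
            ydl.extract_info(url, download=False)
        return True
    except yt_dlp.utils.DownloadError:
        return False


def build_eval_set(urls, check=is_downloadable):
    """Drop non-platform links, then label each remaining link as working or not."""
    return [(u, check(u)) for u in urls if is_known_platform(u)]
```

Passing a different `check` callable makes it possible to try the regex step offline before running yt-dlp on the full sample.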

Once that's done, the actual classifier can be evaluated on this eval set.
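
As an illustration of what that evaluation could look like, here is a small sketch that scores any `classifier(url) -> bool` against a list of `(url, label)` pairs. The `Metrics` and `evaluate` names and the choice of precision, recall and accuracy are assumptions; the note does not say which metrics will be used.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    precision: float
    recall: float
    accuracy: float


def evaluate(classifier, eval_set):
    """Compare classifier(url) predictions against (url, label) pairs."""
    tp = fp = fn = tn = 0
    for url, label in eval_set:
        pred = bool(classifier(url))
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
        elif not pred and label:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    # Guard against empty denominators so an empty or one-sided set does not crash.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return Metrics(precision, recall, accuracy)
```

Since the criteria ask for bad links to be discarded in the vast majority of cases, precision is probably the number to watch first.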

The classifier can first be implemented as a fine-tuned transformer for quality. To improve speed, regexes can be used.
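
One possible reading of this, sketched below, is a cascade: a cheap regex rejects obvious non-video URLs, and only the remaining URLs are sent to the slower model. Everything here is an assumption for illustration: `FAST_REJECT`, `make_classifier` and the `score` callable (a stand-in for the fine-tuned transformer, returning a score in [0, 1]) are hypothetical, and the 0.5 threshold is arbitrary.

```python
import re
from typing import Callable

# Hypothetical cheap pre-filter: URLs ending in a static-asset extension are rejected
# without calling the model.
FAST_REJECT = re.compile(r"\.(jpg|jpeg|png|gif|css|js|pdf)(\?|$)", re.IGNORECASE)


def make_classifier(score: Callable[[str], float], threshold: float = 0.5):
    """Build a url -> bool classifier that only calls `score` when the regex passes."""

    def classify(url: str) -> bool:
        if FAST_REJECT.search(url):
            return False  # rejected by the regex, model is not called
        return score(url) >= threshold

    return classify
```

The returned function has the same `url -> bool` shape as any other candidate classifier, so a regex-only variant and a model-backed variant can be compared on the same eval set.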

Related issue: rom1504/cc2dataset#27

Existing code:

Next steps:

  • create a GitHub repo to have a single place to develop this
  • keep iterating until it works well