# One liner

```sh
wget --recursive --page-requisites --adjust-extension --span-hosts --convert-links --restrict-file-names=windows --domains yoursite.com --no-parent yoursite.com
```
# Explained

```sh
wget \
     --recursive \                   # Download the whole site.
     --page-requisites \             # Get all assets/elements (CSS/JS/images).
     --adjust-extension \            # Save files with .html on the end.
     --span-hosts \                  # Include necessary assets from offsite as well.
     --convert-links \               # Update links to still work in the static version.
     --restrict-file-names=windows \ # Modify filenames to work in Windows as well.
     --domains yoursite.com \        # Do not follow links outside this domain.
     --no-parent \                   # Don't follow links outside the directory you pass in.
     yoursite.com/whatever/path      # The URL to download
```

(The inline comments are annotations only: a `#` after a continuation backslash breaks the command in a real shell, so copy the one-liner above if you want to run it.)
If you're going to use `--recursive`, you should also set `--level` to cap the recursion depth, and you should probably be polite and use `--wait`.
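A sketch of that polite variant (the depth and delay values here are illustrative assumptions, not from the comment):

```sh
# Cap recursion at 3 levels and pause ~1s between requests;
# --random-wait varies the pause between 0.5x and 1.5x of --wait.
wget --recursive --level=3 --wait=1 --random-wait \
     --page-requisites --adjust-extension --convert-links \
     --no-parent https://yoursite.com/whatever/path
```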
Combining this command with advice from other blog posts, I ended up using:

```sh
wget --mirror --no-clobber --page-requisites --adjust-extension --span-hosts --convert-links --restrict-file-names=windows --domains {{DOMAINS}} --no-parent {{URL}}
```
@fazlearefin thanks
My file names end with `@ver=xx`. How do I fix this?
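That suffix comes from `--restrict-file-names=windows`, which replaces the `?` before a query string with `@` in local file names. One possible cleanup, as a sketch (test on a copy first; files that collide after renaming will overwrite each other):

```sh
# Strip "@ver=xx"-style query suffixes from saved files.
# Assumes no legitimate "@" in your file names; renames in place.
find . -type f -name '*@*' | while read -r f; do
  mv -- "$f" "${f%%@*}"
done
```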
```sh
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://web.archive.org/web/20210628062523/https://www.ps-survival.com/PS/Hydro-Power/index.htm
```

will download the PDFs. But if I change the URL to https://web.archive.org/web/20220118034512/https://ps-survival.com/PS/index.htm, it doesn't recurse down and download the PDFs.

Could someone tell me why that is? I'm trying to download all the PDFs.
@iceguru I'd try using an archive downloader. Wget doesn't play nicely with how they have it set up:

https://github.com/hartator/wayback-machine-downloader
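For reference, a usage sketch based on that tool's README (the site URL is taken from the question above; the `--only` filter uses the README's Ruby-style regex syntax):

```sh
# Install the downloader (a Ruby gem), then fetch only PDFs
# from the archived snapshots of the site.
gem install wayback_machine_downloader
wayback_machine_downloader https://www.ps-survival.com/ --only "/\.pdf$/i"
```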
I'm also eyeing this repo, since it can be hooked up to LLMs directly instead of using wget indirectly: https://pypi.org/project/pywebcopy/
Recently discovered `--random-wait` as an option from here; it should be included to make the crawl look less suspicious: https://gist.github.com/stvhwrd/985dedbe1d3329e68d70
Just realized that `--no-clobber` and `--mirror` conflict, and so one should use `-l inf --recursive` instead? https://stackoverflow.com/questions/13092229/cant-resume-wget-mirror-with-no-clobber-c-f-b-unhelpful
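That matches the wget manual: `--mirror` is shorthand for `-r -N -l inf --no-remove-listing`, and the `-N` timestamping is what clashes with `--no-clobber`. A sketch of the workaround (the URL and the extra flags are illustrative):

```sh
# Swap --mirror for its expansion minus -N so --no-clobber is allowed:
wget --recursive --level=inf --no-clobber --page-requisites \
     --adjust-extension --convert-links --restrict-file-names=windows \
     --no-parent https://example.com/
```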
https://superuser.com/questions/1596117/how-do-you-download-an-entire-website-for-offline-viewing-with-wget
https://www.linuxjournal.com/content/downloading-entire-web-site-wget