@crittermike
Last active December 16, 2024
Download an entire website with wget, along with assets.
# One liner
wget --recursive --page-requisites --adjust-extension --span-hosts --convert-links --restrict-file-names=windows --domains yoursite.com --no-parent yoursite.com
# Explained (the backtick `# ...` bits are command substitutions that expand
# to nothing, so the comments do not break the \ line continuations and the
# block stays runnable)
wget \
    --recursive `# Download the whole site.` \
    --page-requisites `# Get all assets/elements (CSS/JS/images).` \
    --adjust-extension `# Save files with .html on the end.` \
    --span-hosts `# Include necessary assets from offsite as well.` \
    --convert-links `# Update links to still work in the static version.` \
    --restrict-file-names=windows `# Modify filenames to work in Windows as well.` \
    --domains yoursite.com `# Do not follow links outside this domain.` \
    --no-parent `# Do not follow links above the directory you pass in.` \
    yoursite.com/whatever/path # The URL to download
@jeffory-orrok

If you're going to use --recursive, then you need to use --level, and you should probably be polite and use --wait.
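
A hedged sketch combining those flags (the depth of 5 and the 2-second wait are illustrative values, not recommendations from the thread):

wget --recursive --level=5 --wait=2 --page-requisites --adjust-extension --convert-links --no-parent yoursite.com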

@fazlearefin

Combining this command with suggestions from other blog posts, I ended up using:

wget --mirror --no-clobber --page-requisites --adjust-extension --span-hosts --convert-links --restrict-file-names=windows --domains {{DOMAINS}} --no-parent {{URL}}
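
For reference, --mirror is documented shorthand for -r -N -l inf --no-remove-listing, so the command above is equivalent to:

wget -r -N -l inf --no-remove-listing --no-clobber --page-requisites --adjust-extension --span-hosts --convert-links --restrict-file-names=windows --domains {{DOMAINS}} --no-parent {{URL}}

Written out like this, the clash between -N (timestamping) and --no-clobber that comes up later in the thread is easier to spot.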

@BradKML commented Jan 3, 2022

@fazlearefin thanks

@dillfrescott

My file names end with @ver=xx. How do I fix this?
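
For what it's worth, that @ is --restrict-file-names=windows substituting @ for the ? that starts a query string like ?ver=xx. A hedged post-download cleanup sketch (the @ver= pattern is just this example's):

find . -type f -name '*@ver=*' | while read -r f; do
  mv -n "$f" "${f%%@ver=*}"  # -n: skip if two versions would collapse to one name
done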

@iceguru commented Feb 17, 2022

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://web.archive.org/web/20210628062523/https://www.ps-survival.com/PS/Hydro-Power/index.htm

Will download the PDFs.

But if I change the URL to https://web.archive.org/web/20220118034512/https://ps-survival.com/PS/index.htm

it doesn't descend and download the PDFs.

Could someone tell me why that is? I'm trying to download all the PDFs.

@641i130 commented Mar 31, 2023

@iceguru I'd try using an archive downloader. Wget doesn't play nicely with how they have it set up:
https://github.com/hartator/wayback-machine-downloader
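
A hedged usage sketch based on that project's README (the --only regex filter comes from its docs; the site is the one from the question above):

gem install wayback_machine_downloader
wayback_machine_downloader https://www.ps-survival.com --only "/\.pdf$/i"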

@BradKML commented Apr 2, 2023

I am also eyeing this repo, since it can be hooked up to LLMs directly instead of calling wget indirectly. https://pypi.org/project/pywebcopy/

@BradKML commented Apr 2, 2023

Recently discovered --random-wait as an option from here; it should be included to make the crawl look less suspicious: https://gist.github.com/stvhwrd/985dedbe1d3329e68d70
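
For instance (the 1-second base wait is illustrative; per the wget manual, --random-wait multiplies --wait by a random factor between 0.5 and 1.5):

wget --recursive --wait=1 --random-wait --page-requisites --adjust-extension --convert-links --no-parent yoursite.com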

@BradKML commented Apr 10, 2023

Just realized that --no-clobber and --mirror conflict (--mirror implies -N, and wget can't timestamp and not-clobber at the same time), so one should use --recursive -l inf instead? https://stackoverflow.com/questions/13092229/cant-resume-wget-mirror-with-no-clobber-c-f-b-unhelpful
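
A hedged sketch of that resumable alternative: swap --mirror for its expansion minus -N so that --no-clobber is allowed (the other flags are carried over from earlier in the thread):

wget --recursive --level=inf --no-remove-listing --no-clobber --page-requisites --adjust-extension --convert-links --no-parent yoursite.com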
