
@crittermike
Last active December 16, 2024 19:02
Download an entire website with wget, along with assets.
# One liner
wget --recursive --page-requisites --adjust-extension --span-hosts --convert-links --restrict-file-names=windows --domains yoursite.com --no-parent yoursite.com
# Explained (per-flag notes below; a "#" comment after a trailing "\" would
# break the line continuation, so they are not written inline)
#   --recursive                     Download the whole site.
#   --page-requisites               Get all assets/elements (CSS/JS/images).
#   --adjust-extension              Save files with .html on the end.
#   --span-hosts                    Include necessary assets from offsite as well.
#   --convert-links                 Update links to still work in the static version.
#   --restrict-file-names=windows   Modify filenames to work in Windows as well.
#   --domains yoursite.com          Do not follow links outside this domain.
#   --no-parent                     Don't follow links outside the directory you pass in.
wget \
    --recursive \
    --page-requisites \
    --adjust-extension \
    --span-hosts \
    --convert-links \
    --restrict-file-names=windows \
    --domains yoursite.com \
    --no-parent \
    yoursite.com/whatever/path  # The URL to download
@vasili111

@YubinXie

Maybe you need the --convert-links option?

@polly4you

Hi! If I am wrong you can virtually shoot me, but the no-parent option may have been hit by a typo: when I tried ----no-parent the command was not recognized, but after some surgery I ended up with --no-parent and it worked. So if I am right, cool; if I am wrong, I am sorry.

YS: polly4you

@Ornolfr

Ornolfr commented Oct 22, 2020

What if the website requires authorization of some sort? How do we specify some cookies to wget?

@jan-martinek

--no-parent requires a trailing slash, otherwise it works from the parent directory.

As quoted from the docs:

Note that, for HTTP (and HTTPS), the trailing slash is very important to ‘--no-parent’. HTTP has no concept of a “directory”—Wget relies on you to indicate what’s a directory and what isn’t. In ‘http://foo/bar/’, Wget will consider ‘bar’ to be a directory, while in ‘http://foo/bar’ (no trailing slash), ‘bar’ will be considered a filename (so ‘--no-parent’ would be meaningless, as its parent is ‘/’).
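
A minimal illustration of the difference (yoursite.com/docs is a placeholder path, not one from the gist):

wget --recursive --no-parent https://yoursite.com/docs/
# trailing slash: wget treats 'docs' as a directory and --no-parent keeps the crawl inside it

wget --recursive --no-parent https://yoursite.com/docs
# no trailing slash: 'docs' is treated as a filename whose parent is '/', so --no-parent is effectively meaningless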

@imharvol

imharvol commented Mar 5, 2021

What if the website requires authorization of some sort? How do we specify some cookies to wget?

Add

--header='Cookie: KEY=VALUE; KEY=VALUE'

and so on, with the credentials
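
For example, something like this (the cookie names/values and the URL are placeholders; copy the real cookie string from your browser's dev tools after logging in):

wget --recursive --page-requisites --adjust-extension --convert-links --no-parent --header='Cookie: sessionid=YOUR_SESSION_ID; csrftoken=YOUR_TOKEN' https://yoursite.com/members/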

@abdallahoraby

What if the website requires authorization of some sort? How do we specify some cookies to wget?

Add

--header='Cookie: KEY=VALUE; KEY=VALUE'

and so on, with the credentials

worked like a charm THANKS

@swport

swport commented May 11, 2021

How do you also download lazily loaded static chunks (like CSS/JS files that aren't loaded on the initial page load but are requested after the page finishes loading)?

@fengshansi

What if the page contains an external link that I don't want to clone?

@sineausr931

It never occurred to me that wget could do this. Thank you for the slap in the face; it saved me from using httrack or something else unnecessarily.

@jeffory-orrok

If you're going to use --recursive, then you need to use --level, and you should probably be polite and use --wait
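
For instance (the depth and delay are arbitrary example values, not recommendations from the comment above):

wget --recursive --level=5 --wait=2 --page-requisites --convert-links --no-parent yoursite.com/whatever/path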

@fazlearefin

Aggregating this command with other blog posts on the internet, I ended up using

wget --mirror --no-clobber --page-requisites --adjust-extension --span-hosts --convert-links --restrict-file-names=windows --domains {{DOMAINS}} --no-parent {{URL}}

@BradKML

BradKML commented Jan 3, 2022

@fazlearefin thanks

@dillfrescott

My file names end with @ver=xx. How do I fix this?

@iceguru

iceguru commented Feb 17, 2022

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://web.archive.org/web/20210628062523/https://www.ps-survival.com/PS/Hydro-Power/index.htm

will download the PDFs.

But if I change the URL to https://web.archive.org/web/20220118034512/https://ps-survival.com/PS/index.htm

it doesn't go down and download the PDFs.

Could someone tell me why that is? I'm trying to download all the PDFs.

@641i130

641i130 commented Mar 31, 2023

@iceguru I'd try using an archive downloader. Wget doesn't play nicely with how they have it set up:
https://github.com/hartator/wayback-machine-downloader

@BradKML

BradKML commented Apr 2, 2023

I am also eyeing this repo, since it can be hooked up directly to LLMs instead of using wget indirectly. https://pypi.org/project/pywebcopy/

@BradKML

BradKML commented Apr 2, 2023

Recently discovered --random-wait as an option from here; it should be included to make the crawl look less suspicious: https://gist.github.com/stvhwrd/985dedbe1d3329e68d70
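
For example, combined with --wait it varies the delay between requests (the values here are arbitrary):

wget --mirror --page-requisites --convert-links --wait=2 --random-wait --no-parent yoursite.com/whatever/path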

@BradKML

BradKML commented Apr 10, 2023

Just realized that --no-clobber and --mirror conflict, so one should use -l inf --recursive instead? https://stackoverflow.com/questions/13092229/cant-resume-wget-mirror-with-no-clobber-c-f-b-unhelpful
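
A sketch of that alternative: --mirror expands to -r -N -l inf --no-remove-listing, and the -N (timestamping) part is what clashes with --no-clobber, so spelling out the recursion avoids the conflict (the other flags are carried over from the commands above):

wget --recursive --level=inf --no-clobber --page-requisites --adjust-extension --convert-links --no-parent yoursite.com/whatever/path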
