# One liner
wget --recursive --page-requisites --adjust-extension --span-hosts --convert-links --restrict-file-names=windows --domains yoursite.com --no-parent yoursite.com
# Explained
# --recursive: Download the whole site.
# --page-requisites: Get all assets/elements (CSS/JS/images).
# --adjust-extension: Save files with .html on the end.
# --span-hosts: Include necessary assets from offsite as well.
# --convert-links: Update links to still work in the static version.
# --restrict-file-names=windows: Modify filenames to work in Windows as well.
# --domains yoursite.com: Do not follow links outside this domain.
# --no-parent: Don't follow links outside the directory you pass in.
# yoursite.com/whatever/path: The URL to download.
wget \
  --recursive \
  --page-requisites \
  --adjust-extension \
  --span-hosts \
  --convert-links \
  --restrict-file-names=windows \
  --domains yoursite.com \
  --no-parent \
  yoursite.com/whatever/path
Hello, good afternoon... I still don't know how to use this to download an entire website.
This is just using wget; look up how to use wget, there are tons of examples online.
Either way, you need to make sure you have wget installed already:
Debian:
sudo apt-get install wget
CentOS/RHEL:
sudo yum install wget
Here are some usage examples to download an entire site:
Convert links for local viewing:
wget --mirror --convert-links --page-requisites --no-parent -P /path/to/download/to https://example-domain.com
Without converting:
wget --mirror --page-requisites --no-parent -P /path/to/download/to https://example-domain.com
One more example to download an entire site with wget:
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://example.org
Explanation of the various flags:
--mirror – Makes (among other things) the download recursive.
--convert-links – convert all the links (also to stuff like CSS stylesheets) to relative, so it will be suitable for offline viewing.
--adjust-extension – Adds suitable extensions to filenames (html or css) depending on their content-type.
--page-requisites – Download things like CSS style-sheets and images required to properly display the page offline.
--no-parent – When recursing, do not ascend to the parent directory. It is useful for restricting the download to only a portion of the site.
Alternatively, the command above may be shortened:
wget -mkEpnp http://example.org
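For reference, the short flags expand to the long ones used above: -m = --mirror, -k = --convert-links, -E = --adjust-extension, -p = --page-requisites, -np = --no-parent.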
If you still insist on running this as a script, it is a Bash script, so first make it executable:
chmod u+x wget.sh
and then run it:
./wget.sh
If you still can't run the script, edit it by adding this as the first line:
#!/bin/sh
Also, you need to specify in the script the site you want to download. At this point you are really better off just using wget outright.
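For example, a minimal wget.sh (with yoursite.com as a placeholder for the site you actually want) could look like this:
#!/bin/sh
# Mirror the site, including page assets, into the current directory.
wget --recursive --page-requisites --adjust-extension --span-hosts --convert-links --restrict-file-names=windows --domains yoursite.com --no-parent yoursite.com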
1. What about --span-hosts? Should I use it?
2. Why use --mirror instead of --recursive?
2: From the wget manual, under ‘--mirror’:
Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to ‘-r -N -l inf --no-remove-listing’.
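So, per that quote, these two commands are roughly equivalent:
wget --mirror https://example.org
wget -r -N -l inf --no-remove-listing https://example.org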
Thanks for the tips. After I download the website, every time I open the file, it links back to its original website. Any idea how to solve this? Thanks!
@mikecrittenden 👋 Maybe you need the --convert-links option?
Hi, if I am wrong you can virtually shoot me, but watch out for a typo here: ----no-parent (four dashes) is not recognized; after some surgery I ended up with --no-parent (two dashes) and it worked. If I am right, cool; if I am wrong, I am sorry.
YS: polly4you
What if the website requires authorization of some sort? How do we specify some cookies to wget?
--no-parent requires a trailing slash, otherwise it works from the parent directory. As quoted from the docs:
Note that, for HTTP (and HTTPS), the trailing slash is very important to ‘--no-parent’. HTTP has no concept of a “directory”—Wget relies on you to indicate what’s a directory and what isn’t. In ‘http://foo/bar/’, Wget will consider ‘bar’ to be a directory, while in ‘http://foo/bar’ (no trailing slash), ‘bar’ will be considered a filename (so ‘--no-parent’ would be meaningless, as its parent is ‘/’).
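To illustrate against a hypothetical site (example.com and the /docs path are just placeholders):
# trailing slash: 'docs' is treated as a directory, so the crawl stays under /docs/
wget --recursive --no-parent https://example.com/docs/
# no trailing slash: 'docs' is treated as a filename, its parent is '/', so --no-parent restricts nothing
wget --recursive --no-parent https://example.com/docs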
To pass cookies, add --header='Cookie: KEY=VALUE; KEY=VALUE' and so on, with the credentials.
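For example, assuming the site uses a session cookie named sessionid (substitute whatever names and values your browser actually shows):
wget --mirror --convert-links --page-requisites --no-parent --header='Cookie: sessionid=YOUR_SESSION_VALUE' https://example.com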
Worked like a charm, thanks!
How do I also download lazily loaded static chunks (like CSS/JS files that aren't loaded on the initial page load but are requested after the page has finished loading)?
What if the page contains an external link that I don't want to clone?
It never occurred to me that wget could do this. Thank you for the slap in the face; it saved me from using httrack or something else unnecessarily.
A few more things worth enabling, via https://www.linuxjournal.com/content/downloading-entire-web-site-wget:
- Usage of "ignore robots" (-e robots=off)
- Usage of "infinite levels" (-l inf) rather than the default recursion depth
- Usage of "no clobber" (--no-clobber)
If you're going to use --recursive, then you need to use --level, and you should probably be polite and use --wait
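A sketch of that, with arbitrary depth and delay values:
wget --recursive --level=5 --wait=1 --page-requisites --convert-links --no-parent https://example.com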
Aggregating this command with others from blog posts around the internet, I ended up using:
wget --mirror --no-clobber --page-requisites --adjust-extension --span-hosts --convert-links --restrict-file-names=windows --domains {{DOMAINS}} --no-parent {{URL}}
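With the placeholders filled in for a hypothetical example.com, that would look like:
wget --mirror --no-clobber --page-requisites --adjust-extension --span-hosts --convert-links --restrict-file-names=windows --domains example.com --no-parent https://example.com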
@fazlearefin thanks
My file names end with @ver=xx. How do I fix this?
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://web.archive.org/web/20210628062523/https://www.ps-survival.com/PS/Hydro-Power/index.htm
will download the .pdf files.
But if I change the URL to https://web.archive.org/web/20220118034512/https://ps-survival.com/PS/index.htm
it doesn't go down and download the PDFs.
Could someone tell me why that is? I'm trying to download all the PDFs.
@iceguru I'd try using an archive downloader. Wget doesn't play nicely with how they have it set up:
https://github.com/hartator/wayback-machine-downloader
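A rough usage sketch, assuming you have Ruby installed (check that repo's README for the exact options):
gem install wayback_machine_downloader
wayback_machine_downloader https://ps-survival.com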
I am also eyeing this package, since it can be hooked up to LLMs directly instead of using wget indirectly: https://pypi.org/project/pywebcopy/
Recently discovered --random-wait as an option, from here: https://gist.github.com/stvhwrd/985dedbe1d3329e68d70. It should be included to make things less sus.
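For example (per the wget manual, --random-wait varies the delay between 0.5x and 1.5x of the --wait value, so pair it with --wait):
wget --mirror --page-requisites --convert-links --no-parent --wait=1 --random-wait https://example.com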
Just realized that --no-clobber and --mirror conflict, so one should use -l inf --recursive instead? https://stackoverflow.com/questions/13092229/cant-resume-wget-mirror-with-no-clobber-c-f-b-unhelpful
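As a sketch of that alternative, spelling out the recursion by hand and dropping the timestamping that clashes with --no-clobber (example.com is a placeholder):
wget --recursive --level=inf --no-clobber --page-requisites --no-parent https://example.com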
sudo apt-get update