This is a script that creatively uses the curl CLI to download an HTTP resource (colloquially "file"). It saves time & bandwidth whenever possible, but not at the expense of correctness.
- Compares ETags to make sure that an unchanged resource is not transferred again, but a changed resource always is.
- Requests a CE-coded (a.k.a. compressed, e.g. gzipped) representation of the resource, falling back to the "regular" one.
- Supports continuation using conditional requests. In contrast to curl's -C - flag, it works with CE-coded responses and falls back to a "full body" request.
People asked me: Why not use wget for this?
- wget does not store the resource's ETag, so it cannot compare it when re-requesting.
- -c (continuation) and -N (timestamping using Last-Modified) don't work when combined.
- I once witnessed a subtle but significant bug in the -c (continuation) implementation. I can't remember the details anymore, unfortunately.
For all of the above features to work, you need a server that supports
- serving pre-compressed sidecar files (i.e. a statically compressed file next to the original) as CE-coded;
- range requests, both on the "regular" file as well as the pre-compressed one;
- conditional requests, specifically If-Range with ETags.
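One way to check a server against this list is to send a probe request with Accept-Encoding: gzip and Range: bytes=0-0, then inspect the response. The following is a minimal sketch, not part of mirror.mjs. The function name missingServerFeatures is hypothetical, and it assumes the response headers are already available as an object with lower-cased keys. It covers range requests, the CE-coded representation and the presence of an ETag, but not the If-Range evaluation itself.

```javascript
// Hypothetical helper: lists which required features a probe response lacks.
// `status` is the HTTP status code, `headers` an object with lower-cased keys.
function missingServerFeatures(status, headers) {
  const missing = [];
  // A satisfied range request is answered with 206 and a Content-Range header.
  if (status !== 206 || !('content-range' in headers)) {
    missing.push('range requests');
  }
  // The pre-compressed sidecar file must be served as CE-coded.
  if (headers['content-encoding'] !== 'gzip') {
    missing.push('CE-coded representation');
  }
  // Without an ETag, If-Range cannot be used with entity tags.
  if (!('etag' in headers)) {
    missing.push('ETag');
  }
  return missing;
}
```

An empty result means the probe response showed all three features. A server that ignores the Range header answers with 200, so 'range requests' ends up in the list.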
For testing purposes, we create a test file:
yes | head -n 50000000 >/var/www/y.txt
gzip -k /var/www/y.txt
ls -lh /var/www
# -rw-r--r-- 1 j staff 95M Aug 8 16:54 y.txt
# -rw-r--r-- 1 j staff 95K Aug 8 16:54 y.txt.gz
Apart from a recently fixed bug with a wrong ETag, Caddy v2 supports all of this with the following Caddyfile:
localhost:8080 {
	root * /var/www
	file_server browse {
		precompressed gzip
	}
}
I couldn't find much about this topic, but it seems like nginx does not support range requests on pre-compressed files. 😔 A response on the mailing list reads as if the reason is that there are (non-trivial) problems with dynamically compressed responses:
Note also that it's impossible to ungzip a response part if you have not preceding parts from the very start.
This as well applies to many other types of data.
The main problem with Content-Encoding and ranges is that one somehow should be able to reproduce exactly the same entity-body (or at least make sure cache validators would change on entity-body change). This is not something trivial when you compress on the fly with possible different compression options.
I personally think that moving towards using Transfer-Encoding would be a good step for "on the fly" compression. But browser support seems to be not here at all.
TLDR: The following nginx config file enables every aspect but the range requests:
server {
	listen 80 default_server;
	listen [::]:80 default_server;
	server_name _;
	root /var/www;
	gzip_static on;
	gzip_vary on;
	location / {
		try_files $uri $uri/ =404;
	}
}
The following script
- downloads the resource into a temp file (/tmp/mirror-${sha1(url)}) and stores the ETag & response headers next to it (/tmp/mirror-${sha1(url)}.etag & /tmp/mirror-${sha1(url)}-${randomHex()}.headers),
- if applicable, decompresses the file (into /tmp/mirror-${sha1(url)}-${randomHex()}.decompressed),
- copies the decompressed file to the actual destination path (in order to work atomically).
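The "if applicable" in the second step depends on whether the server actually sent a CE-coded response. One way to decide this is to read the header dump that curl writes via -D. The following is a minimal sketch under that assumption, not necessarily how mirror.mjs does it. The function names lastHeaderBlock and contentEncoding are hypothetical. Because curl runs with -L, the dump may contain one header block per redirect hop, so only the last block is considered.

```javascript
// Hypothetical helper: returns the header block of the final response
// from a curl -D dump (blocks are separated by an empty line).
function lastHeaderBlock(dump) {
  const blocks = dump.split(/\r?\n\r?\n/).filter((b) => b.trim() !== '');
  return blocks.length > 0 ? blocks[blocks.length - 1] : '';
}

// Hypothetical helper: returns the lower-cased Content-Encoding value of
// the final response, or null if the response was not CE-coded.
function contentEncoding(dump) {
  for (const line of lastHeaderBlock(dump).split(/\r?\n/)) {
    const m = /^content-encoding:\s*(.+?)\s*$/i.exec(line);
    if (m) return m[1].toLowerCase();
  }
  return null;
}
```

With this, the downloaded file would be piped through gunzip only if contentEncoding(dump) returns 'gzip'. Otherwise it can be copied as-is.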
To demonstrate that it works as intended, we abort it midway through the download:
export LOG_LEVEL=debug
./mirror.mjs 'http://localhost:8080/y.txt' y.txt
# {
# destPath: 'y.txt',
# rawDestPath: '/tmp/mirror-15c86ece76',
# headersPath: '/tmp/mirror-15c86ece76-f57746.headers',
# etagPath: '/tmp/mirror-15c86ece76.etag'
# }
# /tmp/mirror-15c86ece76 does not exist
# /tmp/mirror-15c86ece76 does not exist, downloading "regularly" & saving ETag
# curl http://localhost:8080/y.txt -f -L -s -S -H Accept-Encoding: gzip -D /tmp/mirror-15c86ece76-f57746.headers -o /tmp/mirror-15c86ece76 --etag-save /tmp/mirror-15c86ece76.etag { stdio: [ 'ignore', 'inherit', 'inherit' ] }
# we abort the download half way:
^C
ls -lh /tmp/mirror-15c86ece76
# -rw-r--r-- 1 j staff 40M Aug 9 15:50 mirror-15c86ece76
# and then continue it by re-running the script:
./mirror.mjs 'http://localhost:8080/y.txt' y.txt
# {
# destPath: 'y.txt',
# rawDestPath: '/tmp/mirror-15c86ece76',
# headersPath: '/tmp/mirror-15c86ece76-f57746.headers',
# etagPath: '/tmp/mirror-15c86ece76.etag'
# }
# /tmp/mirror-15c86ece76 exists
# /tmp/mirror-15c86ece76 exists, continuing download
# curl http://localhost:8080/y.txt -f -L -s -S -H Accept-Encoding: gzip -D /tmp/mirror-15c86ece76-f57746.headers -o /tmp/mirror-15c86ece76 -C - -H If-Range: "rg7r1m6dvsyb" { stdio: [ 'ignore', 'inherit', 'inherit' ] }
# curl exited { status: 0, … }
# file is fully downloaded
# downloaded file is CE-coded, decompressing
# gunzip { stdio: [ 22, 23, 'inherit' ] }
# gunzip exited { status: 0, … }
# copying processed download file to destination path
# cp /tmp/mirror-15c86ece76-153154.decompressed y.txt { stdio: [ 'ignore', 'ignore', 'inherit' ] }
# cp exited { status: 0, … }
# done!
# we check if the file has been downloaded correctly:
shasum /var/www/y.txt y.txt
# f1f40059b87621eca87321c4436747d75ecaebbf /var/www/y.txt
# f1f40059b87621eca87321c4436747d75ecaebbf y.txt
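The two curl invocations in the log above share most of their arguments and differ only at the end: the first download saves the ETag, the continuation sends it back via If-Range. The following is a minimal sketch of how such argument lists could be assembled. The function name buildCurlArgs and its parameter names are hypothetical and not taken from mirror.mjs; only the curl flags themselves are copied from the log.

```javascript
// Hypothetical helper: builds the curl argument list for either a
// "regular" download (storedEtag === null) or a continuation.
function buildCurlArgs({ url, headersPath, rawDestPath, etagPath, storedEtag = null }) {
  const args = [
    url,
    '-f', '-L', '-s', '-S',
    '-H', 'Accept-Encoding: gzip', // ask for the CE-coded representation
    '-D', headersPath, // dump the response headers
    '-o', rawDestPath, // write the (possibly CE-coded) body
  ];
  if (storedEtag === null) {
    // "regular" download: let curl store the ETag for later runs
    args.push('--etag-save', etagPath);
  } else {
    // continuation: resume at the current file size, but only if the
    // resource still has the ETag we stored earlier
    args.push('-C', '-', '-H', `If-Range: ${storedEtag}`);
  }
  return args;
}
```

Passing the arguments as an array (e.g. to a child process spawn call) avoids shell quoting issues with the header values.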
Now that we have downloaded the file, let's emulate the file changing on the server by changing the ETag stored locally:
echo '"foo"' >/tmp/mirror-15c86ece76.etag
./mirror.mjs 'http://localhost:8080/y.txt' y.txt
# {
# destPath: 'y.txt',
# rawDestPath: '/tmp/mirror-15c86ece76',
# headersPath: '/tmp/mirror-15c86ece76-f57746.headers',
# etagPath: '/tmp/mirror-15c86ece76.etag'
# }
# /tmp/mirror-15c86ece76 exists
# /tmp/mirror-15c86ece76 exists, continuing download
# curl http://localhost:8080/y.txt -f -L -s -S -H Accept-Encoding: gzip -D /tmp/mirror-15c86ece76-f57746.headers -o /tmp/mirror-15c86ece76 -C - -H If-Range: "foo" { stdio: [ 'ignore', 'inherit', 'inherit' ] }
# curl: (33) HTTP server doesn't seem to support byte ranges. Cannot resume.
# curl exited { status: 33, … }
# file download couldn't be continued, server responded with 200 & full body; starting "regular" download
# curl http://localhost:8080/y.txt -f -L -s -S -H Accept-Encoding: gzip -D /tmp/mirror-15c86ece76-f57746.headers -o /tmp/mirror-15c86ece76 --etag-save /tmp/mirror-15c86ece76.etag { stdio: [ 'ignore', 'inherit', 'inherit' ] }
# curl exited { status: 0, … }
# file is fully downloaded
# downloaded file is CE-coded, decompressing
# gunzip { stdio: [ 22, 23, 'inherit' ] }
# gunzip exited { status: 0, … }
# copying processed download file to destination path
# cp /tmp/mirror-15c86ece76-7cb151.decompressed y.txt { stdio: [ 'ignore', 'ignore', 'inherit' ] }
# cp exited { status: 0, … }
# done!
The script requested a full "regular" (re-)download: the locally stored ETag differs from the server's, so the If-Range condition did not match, and the server responded with 200 & the full body instead of the requested range.
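The server-side rule behind this behavior is small enough to sketch. If-Range with an entity tag uses strong comparison: the range is only honored if both tags are strong and identical, otherwise the server ignores the Range header and sends the full representation. The function name ifRangeStatus below is hypothetical; it models only this one decision, not a complete server.

```javascript
// Weak entity tags are prefixed with W/ and never match under strong comparison.
function isWeak(etag) {
  return etag.startsWith('W/');
}

// Hypothetical model of the server's decision for a request that carries
// both Range and If-Range: 206 (partial) on a match, 200 (full body) otherwise.
function ifRangeStatus(ifRangeEtag, currentEtag) {
  const matches = !isWeak(ifRangeEtag)
    && !isWeak(currentEtag)
    && ifRangeEtag === currentEtag;
  return matches ? 206 : 200;
}
```

For the run above, ifRangeStatus('"foo"', '"rg7r1m6dvsyb"') yields 200. Since curl was asked to resume with -C -, it treats the non-partial response as a failure and exits with status 33.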
If we re-run it without changing the ETag again, it will refrain from re-downloading the file:
./mirror.mjs 'http://localhost:8080/y.txt' y.txt
# {
# destPath: 'y.txt',
# rawDestPath: '/tmp/mirror-15c86ece76',
# headersPath: '/tmp/mirror-15c86ece76-f57746.headers',
# etagPath: '/tmp/mirror-15c86ece76.etag'
# }
# /tmp/mirror-15c86ece76 exists
# /tmp/mirror-15c86ece76 exists, continuing download
# curl http://localhost:8080/y.txt -f -L -s -S -H Accept-Encoding: gzip -D /tmp/mirror-15c86ece76-f57746.headers -o /tmp/mirror-15c86ece76 -C - -H If-Range: "rg7r1m6dvsyb" { stdio: [ 'ignore', 'inherit', 'inherit' ] }
# curl: (22) The requested URL returned error: 416
# curl exited { status: 22, … }
# server-reported size 32601485
# downloaded size 32601485
# file is fully downloaded
# downloaded file is CE-coded, decompressing
# gunzip { stdio: [ 22, 23, 'inherit' ] }
# gunzip exited { status: 0, … }
# copying processed download file to destination path
# cp /tmp/mirror-15c86ece76-fd6f2d.decompressed y.txt { stdio: [ 'ignore', 'ignore', 'inherit' ] }
# cp exited { status: 0, … }
# done!
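In this last run, curl asks for a range starting at the end of the already complete file, so the server answers with 416. A 416 response usually carries a Content-Range header of the form bytes */SIZE, which tells the client the full size of the representation. The following is a minimal sketch of reading that size from the header dump and comparing it with the local file size. The log only shows that the script compares a "server-reported size" with the downloaded size; that the size comes from Content-Range is an assumption, and the function names serverReportedSize and isFullyDownloaded are hypothetical.

```javascript
// Hypothetical helper: extracts SIZE from a "Content-Range: bytes */SIZE"
// header in a curl -D dump. Returns null if no such header is present.
function serverReportedSize(rawHeaders) {
  let size = null;
  for (const line of rawHeaders.split(/\r?\n/)) {
    const m = /^content-range:\s*bytes\s+\*\/(\d+)\s*$/i.exec(line);
    if (m) size = Number(m[1]); // keep the last match (final response)
  }
  return size;
}

// Hypothetical helper: the download counts as complete only if the server
// reported a size and it equals the size of the local file.
function isFullyDownloaded(rawHeaders, localSize) {
  const remoteSize = serverReportedSize(rawHeaders);
  return remoteSize !== null && remoteSize === localSize;
}
```

If the sizes differ, or the header is missing, the safe fallback is a "regular" full download.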