@derhuerst
Last active September 26, 2024 13:17
curl-based HTTP mirroring script

This is a script that creatively uses the curl CLI to download an HTTP resource (colloquially, a "file"). It saves time & bandwidth whenever possible, but not at the expense of correctness.

  • Compares ETags to make sure that an unchanged resource is not transferred again, but a changed resource always is.
  • Requests a CE-coded (a.k.a. compressed, e.g. gzipped) representation of the resource, falling back to the "regular" one.
  • Supports continuation using conditional requests. In contrast to curl's -C - flag on its own, this works with CE-coded responses and falls back to a "full body" request.
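
The features above boil down to a choice between a ranged, conditional request and a plain one. The following is a minimal sketch of that choice, not the script's actual code: `buildCurlArgs` and its parameter names are hypothetical, and only curl flags that also appear in the script are used.

```javascript
// Hypothetical helper: picks the curl flags for one download attempt.
// `partialExists`: is there a partially downloaded file on disk?
// `etag`: the ETag saved by an earlier attempt, or null if there is none.
const buildCurlArgs = ({url, destPath, etagPath, partialExists, etag}) => {
	const args = [
		url,
		'-f', '-L', '-s', '-S',
		'-H', 'Accept-Encoding: gzip', // ask for the CE-coded representation, don't decode it
		'-o', destPath,
	]
	if (partialExists && etag !== null) {
		// resume, but only if the server still has the same representation
		args.push('-C', '-', '-H', `If-Range: ${etag}`)
	} else if (partialExists) {
		// resume without a validator, remember the ETag for next time
		args.push('-C', '-', '--etag-save', etagPath)
	} else {
		// fresh download, remember the ETag for next time
		args.push('--etag-save', etagPath)
	}
	return args
}
```

Sending the saved ETag via If-Range (rather than If-None-Match) is what lets the server either continue the transfer or answer with the full, changed body.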

People asked me: Why not use wget for this?

  • wget does not store the resource's ETag, so it cannot compare it when re-requesting.
  • Its -c (continuation) and -N (timestamping using Last-Modified) flags don't work together.
  • I witnessed some subtle but significant bug in the -c (continuation) implementation once. I can't remember the details anymore, unfortunately.

server side

For all of the above features to work, you need a server that supports

  • serving pre-compressed sidecar files (i.e. a statically compressed file next to the original) as CE-coded;
  • range requests, both on the "regular" file as well as the pre-compressed one;
  • conditional requests, specifically If-Range with ETags.
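
To get a rough idea whether a server meets these requirements, one can look at the headers of a response requested with `Accept-Encoding: gzip`. The sketch below is an illustration only: `inspectHeaders` is a hypothetical helper, and it assumes that the server advertises range support via `Accept-Ranges: bytes`, which is common but optional.

```javascript
// Hypothetical helper: inspects raw response headers (e.g. as dumped by `curl -D`)
// and reports which of the required features the response hints at.
const inspectHeaders = (rawHeaders) => {
	const get = (name) => {
		// header names are case-insensitive, one header per line
		const m = new RegExp(`^${name}:[ \\t]*(.*)$`, 'im').exec(rawHeaders)
		return m ? m[1].trim() : null
	}
	return {
		compressed: get('Content-Encoding') === 'gzip', // pre-compressed file served as CE-coded
		ranges: get('Accept-Ranges') === 'bytes', // range requests advertised
		etag: get('ETag'), // validator needed for If-Range
	}
}
```

A missing `Accept-Ranges` header does not prove that ranges are unsupported, so a real check would still have to send an actual range request.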

For testing purposes, we create a test file:

yes | head -n 50000000 >/var/www/y.txt
gzip -k /var/www/y.txt
ls -lh /var/www
# -rw-r--r--   1 j  staff    95M Aug  8 16:54 y.txt
# -rw-r--r--   1 j  staff    95K Aug  8 16:54 y.txt.gz

Caddy

A recently fixed bug with a wrong ETag aside, Caddy v2 does this with the following Caddyfile:

localhost:8080 {
	root * /var/www
	file_server browse {
		precompressed gzip
	}
}

nginx

I couldn't find much about this topic, but a response on the mailing list reads like nginx does not support range requests on pre-compressed files, because there are (non-trivial) problems with dynamically compressed responses. 😔 From that response:

Note also that it's impossible to ungzip a response part if you have not preceding parts from the very start.

This as well applies to many other types of data.

The main problem with Content-Encoding and ranges is that one somehow should be able to reproduce exactly the same entity-body (or at least make sure cache validators would change on entity-body change). This is not something trivial when you compress on the fly with possible different compression options.

I personally think that moving towards using Transfer-Encoding would be a good step for "on the fly" compression. But browser support seems to be not here at all.

TLDR: The following nginx config file enables every aspect but the range requests:

server {
	listen 80 default_server;
	listen [::]:80 default_server;
	server_name _;

	root /var/www;
	gzip_static on;
	gzip_vary on;

	location / {
		try_files $uri $uri/ =404;
	}
}

usage

The following script

  1. downloads the resource into a temp file (/tmp/mirror-${sha256(url)}, with the hash truncated) and stores the ETag & response headers next to it (/tmp/mirror-${sha256(url)}.etag & /tmp/mirror-${sha256(url)}-${randomHex()}.headers),
  2. if applicable, decompresses the file (into /tmp/mirror-${sha256(url)}-${randomHex()}.decompressed),
  3. copies the decompressed file to the actual destination path (in order to work atomically).
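
The temp file layout from the steps above can be sketched as follows. This is an illustration, not the script's code: `tmpPaths` is a hypothetical helper that takes the URL hash and the random hex string as parameters, whereas the script computes them itself.

```javascript
// Hypothetical sketch of the temp file layout.
// The raw download & the ETag file have deterministic paths, so that a later
// run can find and continue them; the other files get a random component.
const tmpPaths = (prefix, urlHash, randomHex) => ({
	raw: `${prefix}${urlHash}`,
	etag: `${prefix}${urlHash}.etag`,
	headers: `${prefix}${urlHash}-${randomHex}.headers`,
	decompressed: `${prefix}${urlHash}-${randomHex}.decompressed`,
})

const paths = tmpPaths('/tmp/mirror-', '15c86ece76', 'f57746')
console.log(paths.raw) // /tmp/mirror-15c86ece76
```

Only the final copy touches the destination path, which is why an aborted run never leaves a half-written destination file behind.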

demo

To demonstrate that it works as intended, we abort the download midway:

export LOG_LEVEL=debug

./mirror.mjs 'http://localhost:8080/y.txt' y.txt
# {
#   destPath: 'y.txt',
#   rawDestPath: '/tmp/mirror-15c86ece76',
#   headersPath: '/tmp/mirror-15c86ece76-f57746.headers',
#   etagPath: '/tmp/mirror-15c86ece76.etag'
# }
# /tmp/mirror-15c86ece76 does not exist
# /tmp/mirror-15c86ece76 does not exist, downloading "regularly" & saving ETag
# curl http://localhost:8080/y.txt -f -L -s -S -H Accept-Encoding: gzip -D /tmp/mirror-15c86ece76-f57746.headers -o /tmp/mirror-15c86ece76 --etag-save /tmp/mirror-15c86ece76.etag { stdio: [ 'ignore', 'inherit', 'inherit' ] }

# we abort the download halfway:
^C
ls -lh /tmp/mirror-15c86ece76
# -rw-r--r--   1 j  staff    40M Aug  9 15:50 mirror-15c86ece76

# and then continue it by re-running the script:
./mirror.mjs 'http://localhost:8080/y.txt' y.txt
# {
#   destPath: 'y.txt',
#   rawDestPath: '/tmp/mirror-15c86ece76',
#   headersPath: '/tmp/mirror-15c86ece76-f57746.headers',
#   etagPath: '/tmp/mirror-15c86ece76.etag'
# }
# /tmp/mirror-15c86ece76 exists
# /tmp/mirror-15c86ece76 exists, continuing download
# curl http://localhost:8080/y.txt -f -L -s -S -H Accept-Encoding: gzip -D /tmp/mirror-15c86ece76-f57746.headers -o /tmp/mirror-15c86ece76 -C - -H If-Range: "rg7r1m6dvsyb" { stdio: [ 'ignore', 'inherit', 'inherit' ] }
# curl exited { status: 0, … }
# file is fully downloaded
# downloaded file is CE-coded, decompressing
# gunzip { stdio: [ 22, 23, 'inherit' ] }
# gunzip exited { status: 0, … }
# copying processed download file to destination path
# cp /tmp/mirror-15c86ece76-153154.decompressed y.txt { stdio: [ 'ignore', 'ignore', 'inherit' ] }
# cp exited { status: 0, … }
# done!

# we check if the file has been downloaded correctly:
shasum /var/www/y.txt y.txt
f1f40059b87621eca87321c4436747d75ecaebbf  /var/www/y.txt
f1f40059b87621eca87321c4436747d75ecaebbf  y.txt

Now that we have downloaded the file, let's emulate the file changing on the server by changing the ETag stored locally:

echo '"foo"' >/tmp/mirror-15c86ece76.etag

./mirror.mjs 'http://localhost:8080/y.txt' y.txt
# {
#   destPath: 'y.txt',
#   rawDestPath: '/tmp/mirror-15c86ece76',
#   headersPath: '/tmp/mirror-15c86ece76-f57746.headers',
#   etagPath: '/tmp/mirror-15c86ece76.etag'
# }
# /tmp/mirror-15c86ece76 exists
# /tmp/mirror-15c86ece76 exists, continuing download
# curl http://localhost:8080/y.txt -f -L -s -S -H Accept-Encoding: gzip -D /tmp/mirror-15c86ece76-f57746.headers -o /tmp/mirror-15c86ece76 -C - -H If-Range: "foo" { stdio: [ 'ignore', 'inherit', 'inherit' ] }
# curl: (33) HTTP server doesn't seem to support byte ranges. Cannot resume.
# curl exited { status: 33, … }
# file download couldn't be continued, server responded with 200 & full body; starting "regular" download
# curl http://localhost:8080/y.txt -f -L -s -S -H Accept-Encoding: gzip -D /tmp/mirror-15c86ece76-f57746.headers -o /tmp/mirror-15c86ece76 --etag-save /tmp/mirror-15c86ece76.etag { stdio: [ 'ignore', 'inherit', 'inherit' ] }
# curl exited { status: 0, … }
# file is fully downloaded
# downloaded file is CE-coded, decompressing
# gunzip { stdio: [ 22, 23, 'inherit' ] }
# gunzip exited { status: 0, … }
# copying processed download file to destination path
# cp /tmp/mirror-15c86ece76-7cb151.decompressed y.txt { stdio: [ 'ignore', 'ignore', 'inherit' ] }
# cp exited { status: 0, … }
# done!

The script has requested a full "regular" (re-)download because the If-Range condition did not match: the local ETag differs from the server's, so the server responded with 200 and the full body instead of the remaining range.
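
How the outcome of a resumed download attempt can be interpreted is sketched below. `nextStep` is a hypothetical helper, not part of the script; the two exit codes are the ones the script itself handles (22 and 33).

```javascript
// curl exit codes relevant for a resumed (-C -) download
const HTTP_ERROR = 22 // with -f: the server responded with a status >= 400
const RANGE_ERROR = 33 // the server did not honor the range request

// Hypothetical helper: decides what to do after a resumed download attempt.
// `completeLength` is the size reported by the server, or null if unknown.
const nextStep = (exitCode, localSize, completeLength) => {
	if (exitCode === 0) return 'done'
	// 200 & full body instead of a range: the resource has changed, start over
	if (exitCode === RANGE_ERROR) return 'restart'
	// e.g. 416: fine only if we already have every byte
	if (exitCode === HTTP_ERROR && localSize === completeLength) return 'done'
	return 'fail'
}
```

Treating exit code 22 as success only when the sizes match keeps unrelated HTTP errors (e.g. a 404) from being mistaken for a finished download.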

If we re-run it without changing the ETag again, it will refrain from re-downloading the file:

./mirror.mjs 'http://localhost:8080/y.txt' y.txt
# {
#   destPath: 'y.txt',
#   rawDestPath: '/tmp/mirror-15c86ece76',
#   headersPath: '/tmp/mirror-15c86ece76-f57746.headers',
#   etagPath: '/tmp/mirror-15c86ece76.etag'
# }
# /tmp/mirror-15c86ece76 exists
# /tmp/mirror-15c86ece76 exists, continuing download
# curl http://localhost:8080/y.txt -f -L -s -S -H Accept-Encoding: gzip -D /tmp/mirror-15c86ece76-f57746.headers -o /tmp/mirror-15c86ece76 -C - -H If-Range: "rg7r1m6dvsyb" { stdio: [ 'ignore', 'inherit', 'inherit' ] }
# curl: (22) The requested URL returned error: 416
# curl exited { status: 22, … }
# server-reported size 32601485
# downloaded size 32601485
# file is fully downloaded
# downloaded file is CE-coded, decompressing
# gunzip { stdio: [ 22, 23, 'inherit' ] }
# gunzip exited { status: 0, … }
# copying processed download file to destination path
# cp /tmp/mirror-15c86ece76-fd6f2d.decompressed y.txt { stdio: [ 'ignore', 'ignore', 'inherit' ] }
# cp exited { status: 0, … }
# done!
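
In this last run the server answers with 416 because there is nothing left to send, and the script compares the size reported in the response's Content-Range header with the size of the local file. A minimal, self-contained sketch of that comparison, assuming hypothetical helpers `completeLengthOf` and `isComplete` that take the raw headers and the local size as parameters:

```javascript
// Hypothetical helper: extracts the complete length from the
// `Content-Range: bytes */<length>` header of a 416 response,
// or returns null if there is no such header.
const completeLengthOf = (rawHeaders) => {
	const m = /^Content-Range:\s+bytes\s+\*\/(\d+)/im.exec(rawHeaders)
	return m ? parseInt(m[1], 10) : null
}

// true/false if the server reported a length, null if it is unknown
const isComplete = (rawHeaders, localSize) => {
	const completeLength = completeLengthOf(rawHeaders)
	return completeLength === null ? null : localSize === completeLength
}
```
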
#!/usr/bin/env node
// curl-based HTTP mirroring script
// Jannis R <[email protected]>
// from https://gist.github.com/derhuerst/745cf09fe5f3ea2569948dd215bbfe1a
import {parseArgs} from 'node:util'
import {basename} from 'node:path'
import {createHash, randomBytes} from 'node:crypto'
import {
	accessSync, constants,
	readFileSync,
	statSync,
	openSync, closeSync,
	utimesSync,
} from 'node:fs'
import {spawnSync} from 'node:child_process'
import {strictEqual} from 'node:assert'
// curl errors
// HTTP page not retrieved. The requested url was not found or returned another error with the HTTP error code being 400 or above.
const HTTP_PAGE_NOT_RETRIEVED = 22
// HTTP range error. The range "command" didn't work.
const RANGE_CMD_DIDNT_WORK = 33
const args = parseArgs({
	options: {
		help: {
			type: 'boolean',
			short: 'h',
		},
		'tmp-prefix': {
			type: 'string',
		},
		'log-level': {
			type: 'string',
			short: 'l',
		},
		'debug-curl': {
			type: 'boolean',
		},
		'times': {
			type: 'boolean',
		},
	},
	allowPositionals: true,
})
if (args.values.help) {
	process.stdout.write(`\
curl-mirror.mjs [--tmp-prefix …] [--log-level …] [--debug-curl] [--times] <url> <dest-path> [-- curl-opts...]
`)
	process.exit(0)
}
const url = args.positionals[0]
if (!url) {
	process.stderr.write('missing 1st argument: url\n')
	process.exit(1)
}
const destPath = args.positionals[1]
if (!destPath) {
	process.stderr.write('missing 2nd argument: dest-path\n')
	process.exit(1)
}
const additionalCurlArgs = args.positionals.slice(2)
const tmpPrefix = 'tmp-prefix' in args.values
	? args.values['tmp-prefix']
	: `/tmp/${basename(destPath)}.mirror-`
const ERROR = 0
const WARN = 1
const INFO = 2
const DEBUG = 3
const LOG_LEVEL = new Map([
	['warn', WARN],
	['info', INFO],
	['debug', DEBUG],
]).get(args.values['log-level'] || process.env.LOG_LEVEL) || WARN
const DEBUG_CURL = process.env.DEBUG_CURL === 'true' || Boolean(args.values['debug-curl'])
const fileExists = (path) => {
	try {
		accessSync(path, constants.R_OK | constants.W_OK) // check read/write access
		if (LOG_LEVEL >= DEBUG) console.debug(path + ' exists')
		return true
	} catch (err) {
		if (err.code !== 'ENOENT') throw err
	}
	if (LOG_LEVEL >= DEBUG) console.debug(path + ' does not exist')
	return false
}
const exitWithError = (err) => {
	if (LOG_LEVEL >= ERROR) console.error(err)
	process.exit(1)
}
const defaultIsOkExitCode = exitCode => exitCode === 0
const run = (cmd, args, opts, isOkExitCode = defaultIsOkExitCode) => {
	if (LOG_LEVEL >= DEBUG) console.debug(cmd, ...args, opts)
	const proc = spawnSync(cmd, args, opts)
	if (LOG_LEVEL >= DEBUG) console.debug(cmd, 'exited', proc)
	// for some reason, proc.error is not always populated, e.g. if curl failed with status code 22 (HTTP 416)
	if (proc.error) throw proc.error
	// so we mimic https://github.com/sindresorhus/execa/blob/c2114519066057414d47a2bed46f17df2c68219d/lib/error.js here
	if (!isOkExitCode(proc.status)) {
		const _cmd = `${cmd} ${args.join(' ')}`
		const err = new Error(`cmd failed with exit code ${proc.status}: ${_cmd}`)
		// todo: add stdout & stderr to err msg
		err.command = _cmd
		err.exitCode = proc.status
		err.stdout = proc.stdout
		err.stderr = proc.stderr
		err.process = proc
		throw err
	}
	return proc
}
const isFullyDownloaded = (destPath, responseHeaders) => {
	// https://httpwg.org/specs/rfc7233.html#header.content-range
	const unsatisfiedRange = /Content-Range:\s+bytes\s\*\/(.+)/i
	const contentRange = responseHeaders.match(unsatisfiedRange)
	if (!contentRange) return null // unknown
	const completeLength = parseInt(contentRange[1], 10)
	const {size: bytesDownloaded} = statSync(destPath)
	if (LOG_LEVEL >= DEBUG) {
		console.debug('server-reported size', completeLength)
		console.debug('downloaded size', bytesDownloaded)
	}
	return bytesDownloaded === completeLength
}
// modified from parsehttpdate, (c) 2018-2021 Pimm "de Chinchilla" Hogeling, MIT-licensed
// https://github.com/Pimm/parseHttpDate/blob/npm-1.0.11/index.js
// (The number of seconds may start with a 6 because of leap seconds.)
const _httpDatePattern = /^[F-W][a-u]{2}, [0-3]\d (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4} [0-2]\d:[0-5]\d:[0-6]\d GMT$/;
// pattern parts, in order: day of week, date, month, year, hour, minute, second
// The 2nd & 3rd letters of each month abbreviation, concatenated:
// J(an) F(eb) M(ar) A(pr) M(ay) J(un) J(ul) A(ug) S(ep) O(ct) N(ov) D(ec)
const _httpMonthsNames = 'anebarprayunulugepctovec';
// parses e.g. `Tue, 15 Nov 1994 08:12:31 GMT`
const parseHttpDate = (httpDate) => {
	if (false == _httpDatePattern.test(httpDate)) {
		return NaN
	}
	return Date.UTC(
		parseInt(httpDate.substring(12, 16), 10),
		// Skip over the first character of the month abbreviation, as we can safely detect the name by the second and third character only.
		_httpMonthsNames.indexOf(httpDate.substring(9, 11)) >> 1,
		parseInt(httpDate.substring(5, 7), 10),
		parseInt(httpDate.substring(17, 19), 10),
		parseInt(httpDate.substring(20, 22), 10),
		parseInt(httpDate.substring(23, 25), 10)
	)
}
strictEqual(parseHttpDate('Sun, 06 Nov 1994 08:49:37 GMT'), Date.parse('1994-11-06T08:49:37.000Z'))
strictEqual(parseHttpDate('Wed, 21 Oct 2015 07:28:00 GMT'), Date.parse('2015-10-21T07:28:00.000Z'))
// todo: include dest dir in hash?
// todo: include request headers in hash
const urlHash = createHash('sha256').update(url).digest('hex').slice(0, 10)
const tmpFilePath = (random = false, suffix = '') => {
	return [
		tmpPrefix,
		urlHash,
		...(random ? ['-' + randomBytes(3).toString('hex')] : []),
		...(suffix ? ['.' + suffix] : []),
	].join('')
}
const rawDestPath = tmpFilePath()
// random per run, and based on the configured tmp prefix (instead of a hard-coded path)
const headersPath = tmpFilePath(true, 'headers')
const readHeaders = () => {
	return readFileSync(headersPath, {encoding: 'utf8'})
}
const etagPath = tmpFilePath(false, 'etag')
const readEtag = () => {
	try {
		return readFileSync(etagPath, {encoding: 'utf8'}).trim()
	} catch (err) {
		if (err.code === 'ENOENT') return null
		throw err
	}
}
if (LOG_LEVEL >= DEBUG) {
	console.debug({destPath, rawDestPath, headersPath, etagPath})
}
// Because the HTTP RFCs define `Content-Encoding` (CE) as being a property of the entity, range requests *do not* "make sense" on CE-coded files. Therefore continuing an interrupted download is only possible with a *non-CE-coded* representation of the resource. `Transfer-Encoding` would cleanly solve this problem, but unfortunately it is not widely supported in web servers and has no equivalent in HTTP/2 and HTTP/3 (yet?).
// Also, because a CE-coded entity has a different `ETag` than its un-CE-coded equivalent, we *cannot* re-use the CE-coded `ETag` to continue downloading from the un-CE-coded entity, in order to make sure we're still downloading the same "version" of the resource!
// more details:
// - https://github.com/golang/go/issues/30829#issuecomment-476694405
// - https://github.com/httpwg/http2-spec/issues/445
// Thus, we can only use CE-coding when downloading in one go (and start over after an interruption), and support continuation *for non-CE-coded entities only*.
const baseCurlArgs = [
	url,
	'-f', // fail on HTTP errors
	'-L', // follow redirects
	...(DEBUG_CURL
		? ['-v', '-#'] // show headers & one-line progress bar
		: ['-s', '-S'] // silent mode, but show errors
	),
	'-H', 'Accept-Encoding: gzip', // request CE-coded entity (but don't decode it)
	'-D', headersPath, // dump headers into a file
	'-o', rawDestPath,
]
const curlArgs = []
if (fileExists(rawDestPath)) {
	if (LOG_LEVEL >= INFO) console.info(`${rawDestPath} exists, continuing download`)
	// $rawDestPath exists, continue downloading
	curlArgs.push('-C', '-')
	// With an *existing* ETag file and an unfinished download, curl --etag-compare *does not* continue the download, because the server reports 304 Not Modified.
	// related: https://curl.se/mail/archive-2020-03/0049.html
	// curlArgs.push('--etag-compare', etagPath)
	const etag = readEtag()
	if (etag === null) {
		curlArgs.push('--etag-save', etagPath)
	} else {
		curlArgs.push('-H', `If-Range: ${etag}`)
	}
	// todo: `-z $rawDestPath`
} else {
	if (LOG_LEVEL >= INFO) {
		console.info(`${rawDestPath} does not exist, downloading "regularly" & saving ETag`)
	}
	// With an *existing* ETag file and an unstarted download, curl --etag-compare *does not* download, because the server reports 304 Not Modified.
	// related: https://curl.se/mail/archive-2020-03/0049.html
	// curlArgs.push('--etag-compare', etagPath)
	curlArgs.push('--etag-save', etagPath)
}
try {
	const isOkExitCode = (exitCode) => [
		0,
		HTTP_PAGE_NOT_RETRIEVED,
		RANGE_CMD_DIDNT_WORK,
	].includes(exitCode)
	let curlProc = run('curl', [
		...baseCurlArgs,
		...curlArgs,
		...additionalCurlArgs,
	], {
		// todo: on HTTP_PAGE_NOT_RETRIEVED, don't let curl log to stderr
		stdio: ['ignore', 'inherit', 'inherit'],
	}, isOkExitCode)
	let headers = readHeaders()
	if (
		curlProc.status === HTTP_PAGE_NOT_RETRIEVED &&
		!isFullyDownloaded(rawDestPath, headers)
	) {
		throw new Error(`file download couldn't be continued, server responded with 416`)
	}
	// If the etag doesn't match (because the entity has changed) and the server returns the full body with 200, curl refuses to overwrite the whole file.
	if (curlProc.status === RANGE_CMD_DIDNT_WORK) {
		if (LOG_LEVEL >= INFO) {
			console.info(`file download couldn't be continued, server responded with 200 & full body; (re-)starting "regular" download`)
		}
		// We re-run curl here, with a full "regular" download. The server has a new file *anyways*, so we don't need to send an ETag.
		curlProc = run('curl', [
			...baseCurlArgs,
			'--etag-save', etagPath,
			...additionalCurlArgs,
		], {
			stdio: ['ignore', 'inherit', 'inherit'],
		})
		headers = readHeaders()
	}
	if (LOG_LEVEL >= INFO) console.info('file is fully downloaded')
	let processedPath = rawDestPath
	// todo:
	// > The HTTP/1.1 standard also recommends that the servers supporting this content-encoding should recognize x-gzip as an alias, for compatibility purposes.
	const contentEncoding = /Content-Encoding:\s+(.+)/gi.exec(headers)
	if (contentEncoding) {
		const encoding = contentEncoding[1]
		if (encoding !== 'gzip') {
			throw new Error(`invalid/unsupported Content-Encoding: ${encoding}`)
		}
		if (LOG_LEVEL >= INFO) console.info('downloaded file is CE-coded, decompressing')
		const decompressedPath = tmpFilePath(true, 'decompressed')
		const rawDestFd = openSync(processedPath, 'r')
		const decompressedFd = openSync(decompressedPath, 'wx') // fail if exists
		run('gunzip', [], {
			stdio: [
				rawDestFd, // stdin
				decompressedFd, // stdout
				'inherit',
			],
		})
		closeSync(rawDestFd)
		closeSync(decompressedFd)
		processedPath = decompressedPath
	}
	if (LOG_LEVEL >= INFO) console.info('copying processed download file to destination path')
	const runCp = (flags, src, dest) => {
		return run('cp', [...flags, src, dest], {
			stdio: ['ignore', 'ignore', 'inherit'],
		})
	}
	const cpFlags = []
	// use copy-on-write if cp & the file system support it
	try {
		if (process.platform === 'linux') { // GNU/Linux
			// note: cp from GNU coreutils 9+ does this automatically
			// see also https://unix.stackexchange.com/a/152639
			cpFlags.push('--reflink=auto')
		} else if (process.platform === 'darwin') { // macOS
			cpFlags.push('-c')
		}
		runCp(cpFlags, processedPath, destPath)
	} catch (err) {
		if (LOG_LEVEL >= DEBUG) {
			console.debug(`using copy-on-write (${cpFlags.join(' ')}) failed:`, err?.message)
			console.debug('using plain cp instead')
		}
		runCp([], processedPath, destPath)
	}
	const lastModified = /Last-Modified:\s+(.+)/gi.exec(headers)
	if (args.values.times) {
		if (lastModified === null) {
			console.warn('cannot set file mtime: response has no Last-Modified header')
		} else {
			const timeModified = parseHttpDate(lastModified[1])
			if (Number.isNaN(timeModified)) {
				// log the header value itself, not the whole regex match
				console.warn('cannot set file mtime: failed to parse the Last-Modified time:', lastModified[1])
			} else {
				const mtime = Math.ceil(timeModified / 1000)
				if (LOG_LEVEL >= DEBUG) {
					console.debug(`changing atime & mtime to ${mtime} (${new Date(timeModified).toISOString()})`)
				}
				utimesSync(destPath, mtime, mtime)
			}
		}
	}
	if (LOG_LEVEL >= INFO) console.info('mirrored successfully!')
} catch (err) {
	exitWithError(err)
}