Last active
August 6, 2024 00:10
-
-
Save mara004/6fe0ac15d0cf303bed0aea2f22d8531f to your computer and use it in GitHub Desktop.
Safer tar extraction
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# SPDX-FileCopyrightText: 2023 geisserml <[email protected]> | |
# SPDX-License-Identifier: Apache-2.0 OR BSD-3-Clause OR MPL-2.0 | |
# Safer tar extraction (hopefully) preventing CVE-2007-4559 etc. | |
# Tries to use the most elegant strategy available in the caller's python version (>= 3.6) | |
__all__ = ["safer_tar_unpack"] | |
import sys | |
if sys.version_info >= (3, 11, 4): # PEP 706 | |
import shutil | |
def safer_tar_unpack(archive_path, dest_dir): | |
shutil.unpack_archive(archive_path, dest_dir, format="tar", filter="data") | |
else: # workaround | |
import tarfile | |
from pathlib import Path | |
def safer_tar_unpack(archive_path, dest_dir): | |
dest_dir = Path(dest_dir).resolve() | |
with tarfile.open(archive_path) as tar: | |
for m in tar.getmembers(): | |
if not ((m.isfile() or m.isdir()) and dest_dir in (dest_dir/m.name).resolve().parents): | |
raise RuntimeError("Path traversal, symlink or non-file member in tar archive (probably malicious).") | |
tar.extractall(dest_dir) |
I just realized that we could simplify the is_within_dir()
check to dir in path.parents
.
It'd be easier than both current strategies (no version checks), and strictly speaking also more correct, because it does not permit the case that path == dir
.
Also note, a known limitation of the above code is that existing files in dir
may be overwritten by the archive. But this is expected.
Re Comment 2: I updated the code accordingly, however the latest revision is now untested.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Just tested with Jakub Wilk's traversal archives.
Results good. Workaround code correctly blocks all invalid samples.
Same for PEP 706 code, except that it automatically converts absolute to relative paths, i.e. they are extracted into
dest_dir
. This is safe (and probably better), only a behavioral difference.Also confirmed with all strategies that we prevent traversal even if a path with shared prefix exists in the destination directory, while invalid archives get through with the
< (3, 9)
code passage when changingcommonpath()
tocommonprefix()
. This proofs the Trellix patch and pip's code are incorrect (they usecommonprefix()
as of this writing).