Skip to content

Instantly share code, notes, and snippets.

@Kasimir123
Last active July 25, 2023 16:34
Show Gist options
  • Save Kasimir123/9bdc38f2eac9a19f83ef6cc52c7ce82e to your computer and use it in GitHub Desktop.
Save Kasimir123/9bdc38f2eac9a19f83ef6cc52c7ce82e to your computer and use it in GitHub Desktop.
Skeleton of the patch that is being applied to projects vulnerable to CVE-2007-4559. Based off of code from pip's unpacking.py.
import os
def is_within_directory(directory, target):
abs_directory = os.path.abspath(directory)
abs_target = os.path.abspath(target)
prefix = os.path.commonprefix([abs_directory, abs_target])
return prefix == abs_directory
def safe_extract(tar, path=".", members=None, *, numeric_owner=False):
for member in tar.getmembers():
member_path = os.path.join(path, member.name)
if not is_within_directory(path, member_path):
raise Exception("Attempted Path Traversal in Tar File")
tar.extractall(path, members, numeric_owner=numeric_owner)
@jowagner
Copy link

A comment in line 15 that path is ignored for absolute member.name would help. Not many readers will remember the detail "If a component is an absolute path, all previous components are thrown away and joining continues from the absolute path component." from the os.path.join() documentation.

@jowagner
Copy link

Calling getmembers (line 14) is potentially very expensive (large files) and should not work with tar files that cannot be rewound, e.g. sys.stdin.buffer unless the tarfile module makes a temporary copy.

@Kasimir123
Copy link
Author

Hello @jowagner. Could you possibly give us an example of where an absolute path would break this patch? In our testing, absolute paths are also covered since the common prefix won't be the same. I have added a screenshot below of how the paths would look based on default extractall (since the path is set to "." by default) and if the member name was set to "/etc":

image

Regarding any performance hit, each maintainer would have to make their own determination on security vs performance. That's not a decision we can make for them.

@jowagner
Copy link

jowagner commented Nov 3, 2022

I think there is a misunderstanding: I did not mean to imply with "in line 15 [...] path is ignored for absolute member.name" that something is broken. It's purely a suggestion to add a source code comment here.

Re performance: Yes, until the tarfile module provides an interface to hook in a path_check() function or implements similar checks itself you have no other choice than to scan all members in advance. I am pushing for a combination of path_check() hook and safety by default in python/cpython#73974 (comment). Again, it is more an issue of documentation that this check incurs a performance penality (scanning the archive twice) and breaks extracting from tar files that cannot be read twice, e.g. reading from sys.stdin.buffer.

Edit: clearer phrasing

@nsoranzo
Copy link

nsoranzo commented Nov 4, 2022

@Kasimir123 From my testing, your proposed solution would not prevent the directory traversal if the archive contains a symlink to outside of the target directory, e.g.:

~/tmp$ mkdir test_dir
~/tmp$ cd test_dir/
~/tmp/test_dir$ ln -s /tmp foo
~/tmp/test_dir$ touch foo/bar
~/tmp/test_dir$ ll
total 8
drwxr-xr-x 2 soranzon ga002 4096 Nov  4 11:34 ./
drwxr-xr-x 5 soranzon ga002 4096 Nov  4 11:34 ../
lrwxrwxrwx 1 soranzon ga002    5 Nov  4 11:34 foo -> /tmp/
~/tmp/test_dir$ ll foo/bar 
-rw-r--r-- 1 soranzon ga002 0 Nov  4 11:34 foo/bar
~/tmp/test_dir$ tar cf ../test_dir.tar foo foo/bar 

The extraction of the test_dir.tar archive is forbidden by tar:

~/tmp/test_dir$ rm foo /tmp/bar
rm: remove symbolic link 'foo'? y
rm: remove regular empty file '/tmp/bar'? y
~/tmp/test_dir$ tar xf ../test_dir.tar 
tar: foo/bar: Cannot open: Not a directory
tar: Exiting with failure status due to previous errors
~/tmp/test_dir$ ll
total 8
drwxr-xr-x 2 soranzon ga002 4096 Nov  4 11:36 ./
drwxr-xr-x 5 soranzon ga002 4096 Nov  4 11:35 ../
lrwxrwxrwx 1 soranzon ga002    5 Nov  4 11:34 foo -> /tmp/
~/tmp/test_dir$ ll foo/bar
ls: cannot access 'foo/bar': No such file or directory
~/tmp/test_dir$ rm foo
rm: remove symbolic link 'foo'? y

but is not blocked by your safe_extract():

import tarfile

tarfile("../test_dir.tar")
safe_extract(tar)

@mara004
Copy link

mara004 commented Nov 15, 2022

It looks like @TrellixVulnTeam may have started an automated mass submission of this patch.
But seriously, patching thousands of projects with boilerplate code can't be called a proper solution.
What we need is a simple way to extract tars safely using the standard library, like a prevent_traversal option to shutil.unpack_archive().

@mara004
Copy link

mara004 commented Feb 26, 2023

Apart from that, I find the patch both bloated and written in a fashion that is hard to understand.
Here's an alternative suggestion, using pathlib:

(Update: Warning, this code is not correct, see the following comments)

def safe_extract(tar, dest_dir: Path, **kwargs):
    dest_dir = dest_dir.resolve()
    for member in tar.getmembers():
        # if str(dest_dir) != os.path.commonprefix( [dest_dir, (dest_dir/member.name).resolve()] ):
        # if not (dest_dir/member.name).resolve().is_relative_to(dest_dir):  # python >= 3.9
        if not str( (dest_dir/member.name).resolve() ).startswith( str(dest_dir) ):
            raise RuntimeError("Attempted path traversal in tar archive (probably malicious).")
    tar.extractall(dest_dir, **kwargs)

@jowagner
Copy link

jowagner commented Feb 27, 2023

(1) False negatives, e.g. dest_dir = Path('/dest') and member.name = Path('/dest2').
(2) Why require the user to specify the target path twice, once as path and once as dest_dir = Path(path)? You could simply change dest_dir.resolve() to Path(path).resolve(), assuming that's how one initialises Path objects (you didn't show us from where you import this type).

@mara004
Copy link

mara004 commented Feb 27, 2023

(1) False negatives, e.g. dest_dir = Path('/dest') and member.name = Path('/dest2').

Oh, I see. Thanks! (I wrongly assumed str(path) would add a trailing slash to directories.)
The is_relative_to() strategy should work correctly, though?

You could simply change dest_dir.resolve() to Path(path).resolve(), assuming that's how one initialises Path objects (you didn't show us from where you import this type).

Imported from pathlib, of course. Doing Path(dest_dir).resolve() would be an option but isn't necessary if the caller uses pathlib already.

@mara004
Copy link

mara004 commented Mar 2, 2023

@jowagner Erm, doesn't the os.path.commonprefix() strategy have exactly the same false negative, also in @Kasimir123's (and pip's) code above? Shouldn't it rather be os.path.commonpath()?

@mara004
Copy link

mara004 commented Mar 2, 2023

>>> from pathlib import Path
>>> import os.path
>>> p1 = Path("/a/b/c")
>>> p2 = Path("/a/b/c2")
>>> os.path.commonprefix([p1, p2])
'/a/b/c'
>>> os.path.commonprefix([p1, p2]) == str(p1)
True
>>> os.path.commonpath([p1, p2])
'/a/b'
>>> p2.is_relative_to(p1)
False

@mara004
Copy link

mara004 commented Jul 25, 2023

See also PEP 706 for recent progress on this.

Here's an updated safe extraction implementation I'm now using for my projects:
https://gist.github.com/mara004/6fe0ac15d0cf303bed0aea2f22d8531f


And FWIW, some threads confirming my suspicion that os.path.commonprefix() is invalid in the Trellix patch:
https://stackoverflow.com/a/43613742/15547292
http://grokbase.com/t/python/python-ideas/153nyg6qt1/os-path-commonprefix-yes-that-old-chestnut

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment