-
-
Save Kasimir123/9bdc38f2eac9a19f83ef6cc52c7ce82e to your computer and use it in GitHub Desktop.
import os | |
def is_within_directory(directory, target): | |
abs_directory = os.path.abspath(directory) | |
abs_target = os.path.abspath(target) | |
prefix = os.path.commonprefix([abs_directory, abs_target]) | |
return prefix == abs_directory | |
def safe_extract(tar, path=".", members=None, *, numeric_owner=False): | |
for member in tar.getmembers(): | |
member_path = os.path.join(path, member.name) | |
if not is_within_directory(path, member_path): | |
raise Exception("Attempted Path Traversal in Tar File") | |
tar.extractall(path, members, numeric_owner=numeric_owner) |
Calling getmembers
(line 14) is potentially very expensive (large files) and should not work with tar files that cannot be rewound, e.g. sys.stdin.buffer
unless the tarfile module makes a temporary copy.
Hello @jowagner. Could you possibly give us an example of where an absolute path would break this patch? In our testing, absolute paths are also covered since the common prefix won't be the same. I have added a screenshot below of how the paths would look based on default extractall (since the path is set to "." by default) and if the member name was set to "/etc":
Regarding any performance hit, each maintainer would have to make their own determination on security vs performance. That's not a decision we can make for them.
I think there is a misunderstanding: I did not mean to imply with "in line 15 [...] path is ignored for absolute member.name" that something is broken. It's purely a suggestion to add a source code comment here.
Re performance: Yes, until the tarfile module provides an interface to hook in a path_check()
function or implements similar checks itself you have no other choice than to scan all members in advance. I am pushing for a combination of path_check()
hook and safety by default in python/cpython#73974 (comment). Again, it is more an issue of documentation that this check incurs a performance penality (scanning the archive twice) and breaks extracting from tar files that cannot be read twice, e.g. reading from sys.stdin.buffer
.
Edit: clearer phrasing
@Kasimir123 From my testing, your proposed solution would not prevent the directory traversal if the archive contains a symlink to outside of the target directory, e.g.:
~/tmp$ mkdir test_dir
~/tmp$ cd test_dir/
~/tmp/test_dir$ ln -s /tmp foo
~/tmp/test_dir$ touch foo/bar
~/tmp/test_dir$ ll
total 8
drwxr-xr-x 2 soranzon ga002 4096 Nov 4 11:34 ./
drwxr-xr-x 5 soranzon ga002 4096 Nov 4 11:34 ../
lrwxrwxrwx 1 soranzon ga002 5 Nov 4 11:34 foo -> /tmp/
~/tmp/test_dir$ ll foo/bar
-rw-r--r-- 1 soranzon ga002 0 Nov 4 11:34 foo/bar
~/tmp/test_dir$ tar cf ../test_dir.tar foo foo/bar
The extraction of the test_dir.tar
archive is forbidden by tar:
~/tmp/test_dir$ rm foo /tmp/bar
rm: remove symbolic link 'foo'? y
rm: remove regular empty file '/tmp/bar'? y
~/tmp/test_dir$ tar xf ../test_dir.tar
tar: foo/bar: Cannot open: Not a directory
tar: Exiting with failure status due to previous errors
~/tmp/test_dir$ ll
total 8
drwxr-xr-x 2 soranzon ga002 4096 Nov 4 11:36 ./
drwxr-xr-x 5 soranzon ga002 4096 Nov 4 11:35 ../
lrwxrwxrwx 1 soranzon ga002 5 Nov 4 11:34 foo -> /tmp/
~/tmp/test_dir$ ll foo/bar
ls: cannot access 'foo/bar': No such file or directory
~/tmp/test_dir$ rm foo
rm: remove symbolic link 'foo'? y
but is not blocked by your safe_extract()
:
import tarfile
tarfile("../test_dir.tar")
safe_extract(tar)
It looks like @TrellixVulnTeam may have started an automated mass submission of this patch.
But seriously, patching thousands of projects with boilerplate code can't be called a proper solution.
What we need is a simple way to extract tars safely using the standard library, like a prevent_traversal
option to shutil.unpack_archive()
.
Apart from that, I find the patch both bloated and written in a fashion that is hard to understand.
Here's an alternative suggestion, using pathlib:
(Update: Warning, this code is not correct, see the following comments)
def safe_extract(tar, dest_dir: Path, **kwargs):
dest_dir = dest_dir.resolve()
for member in tar.getmembers():
# if str(dest_dir) != os.path.commonprefix( [dest_dir, (dest_dir/member.name).resolve()] ):
# if not (dest_dir/member.name).resolve().is_relative_to(dest_dir): # python >= 3.9
if not str( (dest_dir/member.name).resolve() ).startswith( str(dest_dir) ):
raise RuntimeError("Attempted path traversal in tar archive (probably malicious).")
tar.extractall(dest_dir, **kwargs)
(1) False negatives, e.g. dest_dir = Path('/dest')
and member.name = Path('/dest2')
.
(2) Why require the user to specify the target path twice, once as path
and once as dest_dir = Path(path)
? You could simply change dest_dir.resolve()
to Path(path).resolve()
, assuming that's how one initialises Path
objects (you didn't show us from where you import this type).
(1) False negatives, e.g. dest_dir = Path('/dest') and member.name = Path('/dest2').
Oh, I see. Thanks! (I wrongly assumed str(path)
would add a trailing slash to directories.)
The is_relative_to()
strategy should work correctly, though?
You could simply change dest_dir.resolve() to Path(path).resolve(), assuming that's how one initialises Path objects (you didn't show us from where you import this type).
Imported from pathlib, of course. Doing Path(dest_dir).resolve()
would be an option but isn't necessary if the caller uses pathlib already.
@jowagner Erm, doesn't the os.path.commonprefix()
strategy have exactly the same false negative, also in @Kasimir123's (and pip's) code above? Shouldn't it rather be os.path.commonpath()
?
>>> from pathlib import Path
>>> import os.path
>>> p1 = Path("/a/b/c")
>>> p2 = Path("/a/b/c2")
>>> os.path.commonprefix([p1, p2])
'/a/b/c'
>>> os.path.commonprefix([p1, p2]) == str(p1)
True
>>> os.path.commonpath([p1, p2])
'/a/b'
>>> p2.is_relative_to(p1)
False
See also PEP 706 for recent progress on this.
Here's an updated safe extraction implementation I'm now using for my projects:
https://gist.github.com/mara004/6fe0ac15d0cf303bed0aea2f22d8531f
And FWIW, some threads confirming my suspicion that os.path.commonprefix()
is invalid in the Trellix patch:
https://stackoverflow.com/a/43613742/15547292
http://grokbase.com/t/python/python-ideas/153nyg6qt1/os-path-commonprefix-yes-that-old-chestnut
A comment in line 15 that
path
is ignored for absolutemember.name
would help. Not many readers will remember the detail "If a component is an absolute path, all previous components are thrown away and joining continues from the absolute path component." from theos.path.join()
documentation.