# --------------------------------------------------------------------
# Recursively find pdfs from the directory given as the first argument,
# otherwise search the current directory.
# Use exiftool and qpdf (both must be installed and locatable on $PATH)
# to strip all top-level metadata from PDFs.
#
# Note - This only removes file-level metadata, not any metadata
# in embedded images, etc.
#
# Code is provided as-is, I take no responsibility for its use,
# and I make no guarantee that this code works
# or makes your PDFs "safe," whatever that means to you.
#
# You may need to enable execution of this script before using,
# eg. chmod +x clean_pdf.sh
#
# example:
# clean current directory:
# >>> ./clean_pdf.sh
#
# clean specific directory:
# >>> ./clean_pdf.sh some/other/directory
# --------------------------------------------------------------------
# Color codes so that warnings/errors stick out
GREEN="\033[32m"
RED="\033[31m"
CLEAR="\033[0m"
# loop through all PDFs in first argument ($1),
# or use '.' (this directory) if not given
DIR="${1:-.}"
echo "Cleaning PDFs in directory $DIR"
# use find to locate files, pipe to while read to get the
# whole line instead of space delimited
# (IFS= keeps leading/trailing whitespace in filenames)
# Note -- this will find pdfs recursively!!
# Files already ending in _clean.pdf are skipped so outputs are not re-cleaned
find "$DIR" -type f -name "*.pdf" ! -name "*_clean.pdf" | while IFS= read -r i
do
    # output file as original filename with suffix _clean.pdf
    TMP="${i%.*}_clean.pdf"
    # remove the output file if it already exists
    # (exiftool -o will not overwrite an existing file)
    rm -f -- "$TMP"
    # strip metadata into $TMP, then let qpdf rewrite $TMP in place
    if exiftool -q -q -all:all= "$i" -o "$TMP"; then
        qpdf --linearize --deterministic-id --replace-input "$TMP"
        printf "${GREEN}Processed ${RED}%s ${CLEAR}as ${GREEN}%s${CLEAR}\n" "$i" "$TMP"
    else
        printf "${RED}Failed to process %s${CLEAR}\n" "$i" >&2
    fi
done
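
The loop above reads one path per line, so a filename containing a newline would be split. A minimal sketch of one alternative, assuming a POSIX `find` and `sh`: let `find` hand the paths directly to a small inline script via `-exec ... {} +`. The function name `list_targets` is hypothetical and not part of the gist; it only prints each source PDF next to the output name the script would use, and it does not call exiftool or qpdf.

```shell
# Hypothetical helper (not part of the gist): list each source PDF and the
# _clean.pdf name the script would write, without modifying any file.
# Paths are passed to the inline script as arguments, so spaces and other
# unusual characters in filenames are preserved.
list_targets() {
    find "${1:-.}" -type f -name '*.pdf' ! -name '*_clean.pdf' \
        -exec sh -c '
            for f in "$@"; do
                # same suffix rule as the script: drop the extension, add _clean.pdf
                printf "%s -> %s\n" "$f" "${f%.*}_clean.pdf"
            done
        ' sh {} +
}
```

Running `list_targets some/other/directory` before the real script gives a dry run of which files would be processed and which are skipped because they already end in `_clean.pdf`.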
@sneakers-the-rat just saw this user on reddit recommending the --deterministic-id option from QPDF to achieve cleaner results: https://reddit.com/r/Piracy/comments/12ai3so/how_to_remove_all_metadata_identifiers_when/. From what I understood, this way each cleaned-up file generated from a given source PDF would have the exact same ID (the /ID entry in the PDF trailer). The end result in line 52 would simply be
qpdf --linearize --deterministic-id --replace-input "$TMP"
The QPDF documentation covers --deterministic-id, and there is a thread explaining it more clearly. The same article from Elsvr downloaded through multiple institutional accesses will generate byte-for-byte identical outputs from ExifTool+QPDF when using this method.
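
One way to check the byte-for-byte claim on your own files is to compare two cleaned copies directly. This is only a sketch: the helper names `same_bytes` and `report_match` are hypothetical, and the check assumes both copies were already cleaned with the script. It relies on the standard `cmp` utility and does not call exiftool or qpdf itself.

```shell
# Hypothetical helper: succeed only if two files are byte-for-byte identical.
same_bytes() {
    cmp -s "$1" "$2"
}

# Hypothetical helper: print whether two cleaned copies match.
report_match() {
    if same_bytes "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}
```

For example, `report_match copy_a_clean.pdf copy_b_clean.pdf` (placeholder filenames) prints `different` if anything in the two outputs still differs, such as identifying content inside the pages that this script does not touch.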
@bigfakelaugh yes good addition, edited!