Last active April 22, 2020 02:44
  • Save edwinhu/5ac05d0d261e62fa5655b2bf7bff8082 to your computer and use it in GitHub Desktop.
get and section adv2 brochure item 11s
# curl to get all of the zip files of PDFs
curl[1-112].zip -o "/data/hue/adv2/"
# list files
unzip -l
# extract the part I care about, from "item 11" through "item 12" (IGNORECASE requires gawk)
unzip -p 103705_325511_1_20200131.pdf | pdftotext - - | gawk 'BEGIN{IGNORECASE=1};/^item 11/,/^item 12/'
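The range pattern used above can be tried without any PDFs. The following is a minimal sketch, assuming made-up sample text in place of real `pdftotext` output and a hypothetical helper named `extract_item11`. It swaps gawk's `IGNORECASE` for `tolower()`, so it should also run under other POSIX awk implementations.

```shell
# Hypothetical sample text standing in for pdftotext output.
sample='Item 10 Other Activities
some text
ITEM 11 Code of Ethics
We have adopted a code of ethics.
Item 12 Brokerage Practices
more text'

# Range pattern: print from a line starting with "item 11" through the
# next line starting with "item 12". tolower() makes both matches
# case-insensitive without relying on gawk's IGNORECASE.
extract_item11() {
  awk 'tolower($0) ~ /^item 11/, tolower($0) ~ /^item 12/'
}

printf '%s\n' "$sample" | extract_item11
```

The range is inclusive at both ends, so the "Item 12" heading line is part of the output. If no "item 12" line follows, awk prints through the end of the input.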
# build an index
for f in *.zip; do
  unzip -l "$f" | gawk -v f="$f" '/pdf/ {print f, $NF}' >> index.txt
done
# write text to folder
mkdir -p item11
while read -r z f; do
  unzip -p "$z" "$f" | pdftotext -q - - | gawk 'BEGIN{IGNORECASE=1};/^item 11/,/^item 12/' > "item11/${f%%.*}.txt"
done < index.txt
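To see how one index line turns into an output path, here is a small sketch. The archive name `1.zip` is an assumption for illustration; the PDF name is the one used earlier in the script.

```shell
# Hypothetical index line in the "zipfile member" format written to index.txt.
line='1.zip 103705_325511_1_20200131.pdf'

# read splits on whitespace: the first field goes to z (the archive),
# the remainder of the line goes to f (the PDF inside it).
read -r z f <<EOF
$line
EOF

# ${f%%.*} strips the longest suffix that starts at the first dot,
# leaving the bare file name without its extension.
out="item11/${f%%.*}.txt"
echo "$z -> $out"
```

This prints `1.zip -> item11/103705_325511_1_20200131.txt`. Because `%%` removes everything from the first dot onward, a member name containing extra dots would be cut at the first one.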