@ya-mouse
Last active August 29, 2015 13:58
crowdgorod data mining
#!/bin/sh -e
# The task id is required as the first argument
[ -n "$1" ] || { echo "usage: $0 TASK_ID" >&2; exit 1; }
wd="$PWD/$1"
prj=http://crowdgorod.mos.ru
# If the list of proposals has not been fetched yet,
# create the working directory and download the proposals
if [ ! -f "$wd/.index" ]; then
mkdir -p "$wd"
curl -s -H "$(cat "$PWD/cookies.txt")" "$prj/Tasks/Perform?id=$1" | tee "$wd/.index.html" |
grep -E -A 1 '(solution-title|td-proposal)' | sed -n 's,.*<a href="\([^"][^"]*\)".*>.*,\1,p' > "$wd/.index"
# Fetch and save each proposal in the list
i=0
while read -r u; do
id=${u##*nodeId=}
id=${id##*/}
curl -s -H "$(cat "$PWD/cookies.txt")" "$prj$u" | tee "$wd/$id.html" | awk -v s=0 '
/post-time/ { print $0 }
/rich-text/ {s=1; next}
s == 1 && $1 == "</div>\r" { s=0 }
s == 1 { print $0 }' |
sed 's,.*</div><p>,,; s,</*p>,,g; s,.*post-time">,TITLE:,; s,</time>,,' > "$wd/$id.txt"
printf "."
i=$((i+1))
[ "$((i % 50))" -ne 0 ] || printf "%d" "$i"
done < "$wd/.index"
printf "\n"
fi
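
The script derives each output file name from the proposal link using two parameter expansions. The sketch below isolates that step so it can be tried on its own. The function name `proposal_id` and the two sample paths are hypothetical, chosen only for illustration; the real links on the site may look different.

```shell
#!/bin/sh
# Hypothetical helper: mirrors the two expansions used in the loop above.
# The body runs in a subshell so the temporary variable does not leak.
proposal_id() (
    id=${1##*nodeId=}   # drop everything up to and including "nodeId="
    id=${id##*/}        # otherwise, drop everything up to the last "/"
    printf '%s\n' "$id"
)

# Sample paths are assumptions, not taken from the site.
proposal_id '/Proposals/View?nodeId=123'
proposal_id '/Solutions/456'
```

Each expansion leaves the string unchanged when its pattern does not match, so the same two lines cover both query-string links and path-style links.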