#!/bin/sh
set -e

WIKI_DUMP_FILE_IN=$1
WIKI_DUMP_FILE_OUT=${WIKI_DUMP_FILE_IN%%.*}.txt

# clone the WikiExtractor repository
git clone https://github.com/attardi/wikiextractor.git

# extract and clean the chosen Wikipedia dump
echo "Extracting and cleaning $WIKI_DUMP_FILE_IN to $WIKI_DUMP_FILE_OUT..."
python3 -m wikiextractor.WikiExtractor "$WIKI_DUMP_FILE_IN" --processes 8 -q -o - \
| sed "/^\s*\$/d" \
| grep -v "^<doc id=" \
| grep -v "</doc>\$" \
> "$WIKI_DUMP_FILE_OUT"
echo "Successfully extracted and cleaned $WIKI_DUMP_FILE_IN to $WIKI_DUMP_FILE_OUT"
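To see what the three filters in the pipeline do, here is a minimal sketch that runs them over a few hand-written lines instead of a real dump. The function name clean_extracted and the sample text are assumptions made for illustration, and the sample only imitates the <doc ...> ... </doc> wrapper that the script's grep patterns target. The sketch uses the POSIX class [[:space:]] where the script uses \s, since \s is a GNU sed extension.

```shell
#!/bin/sh
# Hypothetical helper: the same three filters as the script, reading stdin.
clean_extracted() {
    # 1. drop empty and whitespace-only lines
    # 2. drop the opening <doc id=...> line of each article
    # 3. drop the closing </doc> line of each article
    sed '/^[[:space:]]*$/d' \
        | grep -v '^<doc id=' \
        | grep -v '</doc>$'
}

# Hand-written sample, not real WikiExtractor output.
printf '%s\n' \
    '<doc id="1" url="https://example.org/?curid=1" title="Example">' \
    'Example' \
    '' \
    'First paragraph.' \
    '   ' \
    'Second paragraph.' \
    '</doc>' \
    | clean_extracted
# prints:
# Example
# First paragraph.
# Second paragraph.
```

Only the article title and the non-empty text lines survive. Note that any text line that happens to end in </doc> would be dropped as well.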
Hi @sgraaf. Thanks for the useful tutorial. I just wanted to let you know that the $ sign is missing before the WIKI_DUMP_FILE_OUT variable on line 16.
You're entirely right! Thanks for the catch, fixed the issue :)
Hello @sgraaf. I tried to make it work, but the file WikiExtractor.py wasn't found. I changed the path to where the file actually is, wikiextractor/wikiextractor, but it still doesn't work:
File "/home/troux/voy/wikiextractor/wikiextractor/WikiExtractor.py", line 66, in <module>
    from .extract import Extractor, ignoreTag, define_template, acceptedNamespaces
ImportError: attempted relative import with no known parent package
Do you have any idea how to solve this problem?
I have updated the script. Could you try it again?
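For context on the ImportError quoted above: Python raises it when a file that contains a relative import (from .extract import ...) is run directly as a script, because the file then has no parent package. Running the same file as a module with python3 -m, as the script does, gives it one. The sketch below reproduces both cases with a throwaway package. The name demo_pkg and its files are hypothetical and have nothing to do with WikiExtractor's actual layout, and it assumes python3 is on the PATH.

```shell
#!/bin/sh
# Build a throwaway package in a temporary directory (demo_pkg is a made-up name).
workdir=$(mktemp -d)
mkdir "$workdir/demo_pkg"
: > "$workdir/demo_pkg/__init__.py"
echo 'GREETING = "hello"' > "$workdir/demo_pkg/helper.py"
printf '%s\n' 'from .helper import GREETING' 'print(GREETING)' \
    > "$workdir/demo_pkg/main.py"

cd "$workdir"

# Run the file directly: there is no parent package, so the relative import fails.
# Only the last line of the traceback is kept.
direct_out=$(python3 demo_pkg/main.py 2>&1 | tail -n 1)

# Run it as a module of the package: the relative import resolves.
module_out=$(python3 -m demo_pkg.main)

echo "direct: $direct_out"
echo "module: $module_out"

# Clean up the temporary directory.
cd / && rm -rf "$workdir"
```

The first line printed should carry the same "attempted relative import with no known parent package" message as the traceback in the comment, and the second should print hello.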
It seems to work. The extraction takes a lot of time, but that's expected. I will come back if I run into a problem; otherwise, consider the problem fixed!
Thanks.
What is the input for this script?
As commented on my other Gist, it is a .xml.bz2 file. For a complete guide on how to download, extract, clean and pre-process a Wikipedia dump, see this Medium post.
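Since the input is a .xml.bz2 file, it may help to see how the script turns that name into the output name. The expansion ${WIKI_DUMP_FILE_IN%%.*} removes everything from the first dot in the whole argument, directories included. The sketch below shows this on an illustrative file name. The helper names derive_out_name and strip_dump_suffix are hypothetical, and the second function is only a possible alternative, not what the script does.

```shell
#!/bin/sh
# Mirrors the script's assignment: cut at the FIRST dot, then append .txt.
derive_out_name() {
    printf '%s\n' "${1%%.*}.txt"
}

# Hypothetical alternative: remove only a trailing .xml.bz2 suffix.
strip_dump_suffix() {
    printf '%s\n' "${1%.xml.bz2}.txt"
}

derive_out_name "enwiki-latest-pages-articles.xml.bz2"
# prints: enwiki-latest-pages-articles.txt

derive_out_name "./dumps/enwiki-latest-pages-articles.xml.bz2"
# prints: .txt   (the first dot is the leading "./", so the whole path is cut)

strip_dump_suffix "./dumps/enwiki-latest-pages-articles.xml.bz2"
# prints: ./dumps/enwiki-latest-pages-articles.txt
```

With the script as written, passing a bare file name from the current directory avoids the second case.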