Simple bash script to extract and clean a Wikipedia dump. Adapted from: https://github.com/facebookresearch/XLM/blob/master/get-data-wiki.sh
#!/bin/sh
set -e

WIKI_DUMP_FILE_IN=$1
WIKI_DUMP_FILE_OUT=${WIKI_DUMP_FILE_IN%%.*}.txt

# clone the WikiExtractor repository
git clone https://github.com/attardi/wikiextractor.git

# extract and clean the chosen Wikipedia dump
echo "Extracting and cleaning $WIKI_DUMP_FILE_IN to $WIKI_DUMP_FILE_OUT..."
# pipe WikiExtractor's output through sed/grep to drop blank lines and the <doc> wrapper tags
python3 -m wikiextractor.WikiExtractor "$WIKI_DUMP_FILE_IN" --processes 8 -q -o - \
| sed "/^\s*\$/d" \
| grep -v "^<doc id=" \
| grep -v "</doc>\$" \
> "$WIKI_DUMP_FILE_OUT"
echo "Successfully extracted and cleaned $WIKI_DUMP_FILE_IN to $WIKI_DUMP_FILE_OUT"
For this script, what is the input?

As commented on my other Gist, it is a .xml.bz2 file. For a complete guide on how to download, extract, clean and pre-process a Wikipedia dump, see this Medium post.
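To make that concrete, one way to obtain such a .xml.bz2 dump is to fetch it directly from dumps.wikimedia.org; the language and file name below are placeholders, so pick the dump you actually need.

# download the latest English Wikipedia articles dump (example; other languages/dates follow the same pattern)
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2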
It seems to work. The extraction takes a lot of time, but that's expected. I will come back if I run into a problem; otherwise, consider the problem fixed!
Thanks.