Last active
October 23, 2021 21:28
Simple Python script to pre-process a Wikipedia dump
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
from pathlib import Path

from blingfire import text_to_sentences


def main():
    # Input: a plain-text Wikipedia dump, passed as the first CLI argument.
    wiki_dump_file_in = Path(sys.argv[1])
    # Output: same directory, with '_preprocessed' inserted before the suffix.
    wiki_dump_file_out = wiki_dump_file_in.parent / \
        f'{wiki_dump_file_in.stem}_preprocessed{wiki_dump_file_in.suffix}'

    print(f'Pre-processing {wiki_dump_file_in} to {wiki_dump_file_out}...')
    with open(wiki_dump_file_out, 'w', encoding='utf-8') as out_f:
        with open(wiki_dump_file_in, 'r', encoding='utf-8') as in_f:
            for line in in_f:
                # text_to_sentences returns the sentences of `line`,
                # separated by newlines (one sentence per line).
                sentences = text_to_sentences(line)
                out_f.write(sentences + '\n')
    print(f'Successfully pre-processed {wiki_dump_file_in} to {wiki_dump_file_out}...')


if __name__ == '__main__':
    main()
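As a quick illustration of how the script names its output file, here is a minimal sketch that isolates the path logic using only `pathlib`. The helper name `preprocessed_path` and the example filenames are hypothetical and do not appear in the gist.

```python
from pathlib import Path


def preprocessed_path(path_in: Path) -> Path:
    """Return a sibling path with '_preprocessed' inserted before the suffix."""
    return path_in.parent / f"{path_in.stem}_preprocessed{path_in.suffix}"


# Hypothetical input names, chosen only to show the naming rule.
txt_out = preprocessed_path(Path("data/enwiki_text.txt"))
bz2_out = preprocessed_path(Path("data/enwiki.xml.bz2"))
print(txt_out)
print(bz2_out)
```

Note that `Path.suffix` only covers the last extension, so a name with a double extension such as `.xml.bz2` gets `_preprocessed` inserted before `.bz2`, not before `.xml`.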
It's for a .txt file. If you want to extract and clean a .xml.bz2 file, take a look at this script.
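For readers wondering what handling the compressed file involves, the sketch below streams lines out of a `.bz2` file with the standard-library `bz2` module. It covers decompression only and assumes UTF-8 content. It does not strip XML or wiki markup, so it is not a substitute for the script mentioned above. The helper name `iter_bz2_lines` and the sample file are made up for the demo.

```python
import bz2
import tempfile
from pathlib import Path


def iter_bz2_lines(path):
    """Yield decoded text lines from a bz2-compressed file, one at a time."""
    # Text mode ("rt") decompresses on the fly, so the whole file
    # never has to be held in memory at once.
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


# Round-trip demo on a tiny, made-up file (not a real Wikipedia dump).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.xml.bz2"
    with bz2.open(sample, "wt", encoding="utf-8") as f:
        f.write("<page>\n<title>Example</title>\n</page>\n")
    lines = list(iter_bz2_lines(sample))

print(lines)
```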
thank you!
Is the input to this program a wiki.bz2 file or a .txt file?