Convert tiktoken tokenizers to the Hugging Face tokenizers format
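As rough context for what such a conversion starts from (a hypothetical sketch of my own, not the gist's actual script): a tiktoken `Encoding` carries the pretokenization regex, the BPE merge ranks, and the special tokens, which is the information a converter has to re-express in the Hugging Face `tokenizers` format. The attribute names below are tiktoken internals (private, and may change between versions):

```python
import tiktoken

enc = tiktoken.encoding_for_model('gpt-4')

# Internal attributes of tiktoken.Encoding (private; may change):
pat_str = enc._pat_str                  # split/pretokenization regex
mergeable_ranks = enc._mergeable_ranks  # dict: token bytes -> BPE merge rank
special_tokens = enc._special_tokens    # dict: special token string -> id

print(pat_str)
print(len(mergeable_ranks), "mergeable ranks")
print(special_tokens)
```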
I actually forgot to update the gist with my new conversion script, which takes into account the new split pretokenization regex (thanks @gautierdag for pointing that out!). It also sets the default `clean_up_tokenization_spaces` to `False` (thanks @binxuan for pointing that out). So, now it's updated 🤗 👍

I've also validated the GPT-4 tokenizer on the entire XNLI dataset (all languages) with 100% compatibility (both encoding and decoding). 🔥 Code to validate:
```python
import tqdm
from datasets import load_dataset
import tiktoken
from transformers import GPT2TokenizerFast

hf_tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/gpt-4')
og_tokenizer = tiktoken.encoding_for_model('gpt-4')

dataset = load_dataset('xnli', 'all_languages')

for item in tqdm.tqdm(dataset['train']):
    for string in item['premise'].values():
        encoded1 = og_tokenizer.encode(string)
        encoded2 = hf_tokenizer.encode(string)
        assert encoded1 == encoded2, f'encoding "{string}" is incorrect. "{encoded1}" != "{encoded2}"'

        decoded1 = og_tokenizer.decode(encoded1)
        decoded2 = hf_tokenizer.decode(encoded2, skip_special_tokens=True)
        assert decoded1 == decoded2, f'decoding "{string}" is incorrect. "{decoded1}" != "{decoded2}"'
```
Shouldn't `tokenizer_class` be `GPT2Tokenizer` in all cases? That's the concrete Hugging Face class that gets instantiated, i.e. by doing this you can use `hf_tokenizer = AutoTokenizer.from_pretrained('Xenova/gpt-4')` rather than `GPT2TokenizerFast` (which then generates a warning).
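For illustration, here is a minimal sketch (my own, not taken from the gist) of writing that field into the exported `tokenizer_config.json`; with the slow class name recorded there, `AutoTokenizer` still resolves the fast variant when a `tokenizer.json` is present:

```python
import json

# Hypothetical minimal tokenizer_config.json for the converted tokenizer.
# Values follow the suggestion above and the gist's stated default for
# clean_up_tokenization_spaces; all other fields are omitted here.
tokenizer_config = {
    "tokenizer_class": "GPT2Tokenizer",
    "clean_up_tokenization_spaces": False,
}

with open("tokenizer_config.json", "w") as f:
    json.dump(tokenizer_config, f, indent=2)
```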
Just an update on the issue with the case-insensitive group modifier (`?i:`), which causes issues with certain regex implementations (e.g., JS): I think it's reasonable to just replace the problematic section with a longer (but equivalent) version.

Original: `(?i:'s|'t|'re|'ve|'m|'ll|'d)|`

JS-friendly version: `(?:'([sS]|[tT]|[rR][eE]|[vV][eE]|[mM]|[lL][lL]|[dD]))`

Do what you want with it :) In any case, my code is adapted from this comment, with a few modifications.
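A quick way to convince yourself the two variants are interchangeable (my own sanity check, not part of the gist) is to compare them on the contraction suffixes they are meant to match; Python's `re` also supports the scoped `(?i:...)` flag, so both patterns compile there:

```python
import re

# Case-insensitive group vs. the expanded, JS-friendly rewrite
# (the trailing '|' from the full pretokenization pattern is omitted here).
original = re.compile(r"(?i:'s|'t|'re|'ve|'m|'ll|'d)")
js_friendly = re.compile(r"(?:'([sS]|[tT]|[rR][eE]|[vV][eE]|[mM]|[lL][lL]|[dD]))")

samples = ["'s", "'S", "'t", "'T", "'re", "'RE", "'Ve", "'m", "'M", "'ll", "'LL", "'d", "'D", "'x"]
for s in samples:
    # Both patterns should agree on whether each sample is a valid contraction suffix.
    assert (original.fullmatch(s) is not None) == (js_friendly.fullmatch(s) is not None)
print("Both patterns accept exactly the same contraction suffixes.")
```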