@xenova
Last active August 29, 2024
Convert tiktoken tokenizers to the Hugging Face tokenizers format
@xenova (Author) commented on Feb 5, 2024

Just an update on the case-insensitive group modifier (?i:), which causes problems in certain regex implementations (e.g., JavaScript): I think it's reasonable to just replace the problematic section with a longer (but equivalent) version.

Original: (?i:'s|'t|'re|'ve|'m|'ll|'d)|

JS-friendly version: (?:'([sS]|[tT]|[rR][eE]|[vV][eE]|[mM]|[lL][lL]|[dD]))
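
To sanity-check the equivalence, here's a minimal sketch using Python's re module (which also supports the scoped (?i:) syntax, so both patterns can be compared directly) on a few sample suffixes:

import re

# Original pattern: case-insensitive group modifier (unsupported in JS regex)
original = re.compile(r"(?i:'s|'t|'re|'ve|'m|'ll|'d)")
# JS-friendly expansion of the same alternatives
expanded = re.compile(r"(?:'([sS]|[tT]|[rR][eE]|[vV][eE]|[mM]|[lL][lL]|[dD]))")

samples = ["'s", "'S", "'re", "'RE", "'Ve", "'ll", "'D", "'x"]
for s in samples:
    # Both patterns should agree on every sample (match or no match)
    assert bool(original.fullmatch(s)) == bool(expanded.fullmatch(s)), s
print("patterns agree on all samples")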

> For the purposes of adapting/incorporating into other projects, what's the license for this code?

Do what you want with it :) In any case, my code is adapted from this comment, with a few modifications.

@xenova (Author) commented on Mar 27, 2024

I actually forgot to update the gist with my new conversion script, which takes into account the new split pretokenization regex (thanks @gautierdag for pointing that out!).

It also sets the default clean_up_tokenization_spaces to False (thanks @binxuan for pointing that out).
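
For context, here's a minimal sketch of why that default matters: with cleanup enabled, decoding applies English-style detokenization rules and no longer round-trips exactly (the expected outputs are shown as comments):

from transformers import GPT2TokenizerFast

hf_tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/gpt-4')

ids = hf_tokenizer.encode('Hello , world')
# Cleanup rewrites " ," to ",", so decoding is no longer an exact round trip
print(hf_tokenizer.decode(ids, clean_up_tokenization_spaces=True))   # Hello, world
print(hf_tokenizer.decode(ids, clean_up_tokenization_spaces=False))  # Hello , world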

So, now it's updated 🤗 👍 I've also validated the GPT-4 tokenizer on the entire XNLI dataset (all languages) with 100% compatibility (both encoding and decoding). 🔥 Code to validate:

import tqdm
from datasets import load_dataset
import tiktoken
from transformers import GPT2TokenizerFast

# Load the converted Hugging Face tokenizer and the original tiktoken encoding
hf_tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/gpt-4')
og_tokenizer = tiktoken.encoding_for_model('gpt-4')

# XNLI premises span 15 languages, covering a wide range of scripts and whitespace conventions
dataset = load_dataset('xnli', 'all_languages')

for item in tqdm.tqdm(dataset['train']):
    for string in item['premise'].values():
        # Both tokenizers must produce identical token IDs...
        encoded1 = og_tokenizer.encode(string)
        encoded2 = hf_tokenizer.encode(string)

        assert encoded1 == encoded2, f'encoding "{string}" is incorrect. "{encoded1}" != "{encoded2}"'

        # ...and must decode those IDs back to identical strings
        decoded1 = og_tokenizer.decode(encoded1)
        decoded2 = hf_tokenizer.decode(encoded2, skip_special_tokens=True)

        assert decoded1 == decoded2, f'decoding "{string}" is incorrect. "{decoded1}" != "{decoded2}"'

@david-waterworth commented

Shouldn't 'tokenizer_class' be 'GPT2Tokenizer' in all cases? That's the concrete Hugging Face class that gets instantiated, i.e. by setting it this way you can use

from transformers import AutoTokenizer
hf_tokenizer = AutoTokenizer.from_pretrained('Xenova/gpt-4')

rather than GPT2TokenizerFast (which otherwise generates a warning).
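
For reference, a minimal sketch of that change (the local path is hypothetical; adjust it to wherever the converted tokenizer was saved):

import json

# Hypothetical path to the converted tokenizer's config; adjust as needed
path = 'converted-gpt4-tokenizer/tokenizer_config.json'

with open(path) as f:
    config = json.load(f)

# Use the base class name; AutoTokenizer still resolves to the fast variant by default
config['tokenizer_class'] = 'GPT2Tokenizer'

with open(path, 'w') as f:
    json.dump(config, f, indent=2)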
