@xenova
Last active August 29, 2024 03:05
Convert tiktoken tokenizers to the Hugging Face tokenizers format
@gautierdag

gautierdag commented Sep 5, 2023

I think you've got a bug somewhere (probably in the regex):

import tiktoken
from transformers import GPT2TokenizerFast

tst = '\n\n\n\n\ns1232'

# GPT-4 test - fails
tokenizer_tik = tiktoken.encoding_for_model("gpt-4")
tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/gpt-4')

print([(tokenizer.decode([i]), i) for i in tokenizer.encode(tst)])
# [('\n\n\n\n', 1038), ('\n', 198), ('s', 82), ('12', 717), ('32', 843)]
print([(tokenizer_tik.decode([i]), i) for i in tokenizer_tik.encode(tst)])
# [('\n\n\n\n\n', 14963), ('s', 82), ('123', 4513), ('2', 17)]

# GPT-2 test - succeeds
tokenizer_tik = tiktoken.encoding_for_model("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/gpt2')

print([(tokenizer.decode([i]), i) for i in tokenizer.encode(tst)])
# [('\n\n', 628), ('\n\n', 628), ('\n', 198), ('s', 82), ('12', 1065), ('32', 2624)]
print([(tokenizer_tik.decode([i]), i) for i in tokenizer_tik.encode(tst)])
# [('\n\n', 628), ('\n\n', 628), ('\n', 198), ('s', 82), ('12', 1065), ('32', 2624)]

@xenova
Author

xenova commented Sep 5, 2023

Is this a problem with the encoding or decoding? Can you split up your checks please?

Edit: looks like an issue with encoding. Could you still split the tests and send the token IDs? Thanks! Also, could you check with gpt-2 to see whether it's still an issue there?

@gautierdag

Edited the test and added digits as well. I think it's the regex splitting, since GPT-4 / cl100k changes the regex from GPT-3.

@gautierdag

gautierdag commented Sep 5, 2023

To fix it, you should change the pre_tokenizer for cl100k_base to:

"pre_tokenizer": {
      "type": "Sequence",
      "pretokenizers": [
        {
          "type": "Split",
          "pattern": {
            "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
          },
          "behavior": "Removed",
          "invert": True
        },
        {
          "type": "ByteLevel",
          "add_prefix_space": False,
          "trim_offsets": True,
          "use_regex": False
        }
      ]
}

When I do that I pass the simple test above.

Also I don't think "post_processor" is needed - you can keep it null/None.

Edit: I've just run extensive tests with the above "pre_tokenizer" and I confirm it is now equivalent to gpt-4.
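
For reference, a minimal sketch of applying that change with the tokenizers Python API rather than editing the JSON by hand (assumptions: the converted tokenizer was saved locally as tokenizer.json, and a recent tokenizers version that supports use_regex):

from tokenizers import Tokenizer, Regex, pre_tokenizers

# Load the converted tokenizer (hypothetical local path).
tok = Tokenizer.from_file("tokenizer.json")

# cl100k_base-style split pattern from the pre_tokenizer above.
CL100K_PATTERN = (
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

# Split on the pattern, keeping the matches as the pre-tokens (invert=True),
# then apply the byte-level mapping without the built-in GPT-2 regex.
tok.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Split(Regex(CL100K_PATTERN), behavior="removed", invert=True),
    pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False),
])

tok.save("tokenizer.json")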

@xenova
Author

xenova commented Sep 5, 2023

Amazing! Thanks so much @gautierdag! I'll update the script accordingly. If possible, could you share what tests you are running? I'd like to update the regex so that it's compatible with both Python and JavaScript, but the ?i: seems to break in JavaScript.

(see here for my JS demo)

@gautierdag

gautierdag commented Sep 6, 2023

I can't share internal tests unfortunately, I just ran both tokenizers on a large dataset of various different text types and confirmed they were the same.

Yeah, the ?i: regex requires special handling; in Python you'd need to use the regex library (not re) to reproduce its functionality. Here the Python bindings use the underlying Rust regex crate, which seems to work (though I think technically the fancy-regex crate should be used instead). Not sure about JS, sorry 😞!
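
For example, a small sketch using the third-party regex package (the stdlib re module doesn't support \p{...} property classes, so this pattern only compiles with regex):

import regex

# cl100k_base-style split pattern (from the pre_tokenizer above).
CL100K_PATTERN = (
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

# The scoped (?i:...) flag and Unicode property classes work here, so the
# pre-tokenization split groups newlines and digit runs the way tiktoken does.
print(regex.findall(CL100K_PATTERN, "It's DONE\n\n\n\n\ns1232"))
# ['It', "'s", ' DONE', '\n\n\n\n\n', 's', '123', '2']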

@xenova
Author

xenova commented Sep 6, 2023

I can't share internal tests unfortunately, I just ran both tokenizers on a large dataset of various different text types and confirmed they were the same.

No worries!

Yeah, the ?i: regex requires special handling; in Python you'd need to use the regex library (not re) to reproduce its functionality.

Right, I noticed that while playing around with it a bit more yesterday. I suppose the entire regex can be set to case-insensitive mode, no? Do you notice any difference in your tests if ?i: is removed but the entire regex is set to case-insensitive (as opposed to just that first group)?

@gautierdag

Hmm, I changed the regex to:
"(?i)'s|'t|'re|'ve|'m|'ll|'d|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"

Which I think should set the whole expression to case-insensitive. However, this breaks on a weird Unicode character (https://unicode-explorer.com/c/0345). I sadly don't have an explanation for it 🤷

# hfgpt4: HF tokenizer using the scoped (?i:...) pattern
# hfgpt4_case_insensitive: HF tokenizer using the whole-pattern (?i) variant above
character = 'ͅ'  # U+0345 (COMBINING GREEK YPOGEGRAMMENI) fails

t = hfgpt4.encode(character)
print(t)  # [137, 227] - what tiktoken also returns

o = hfgpt4_case_insensitive.encode(character)
print(o)  # [] - the character is dropped entirely

It is equivalent for everything else I tested.

@binxuan

binxuan commented Sep 20, 2023

Hey, thanks for sharing the solution and discussion! Do we have any conclusion on which regex to use to fully replicate tiktoken in Hugging Face? Is this pre_tokenizer setting working? Does removing the post_processor yield different results?

```python
"pre_tokenizer": {
      "type": "Sequence",
      "pretokenizers": [
        {
          "type": "Split",
          "pattern": {
            "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
          },
          "behavior": "Removed",
          "invert": True
        },
        {
          "type": "ByteLevel",
          "add_prefix_space": False,
          "trim_offsets": True,
          "use_regex": False
        }
      ]
}
```

@gautierdag

Hey, thanks for sharing the solution and discussion! Do we have any conclusion on which regex to use to fully replicate tiktoken in Hugging Face? Is this pre_tokenizer setting working? Does removing the post_processor yield different results?

```python
"pre_tokenizer": {
      "type": "Sequence",
      "pretokenizers": [
        {
          "type": "Split",
          "pattern": {
            "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
          },
          "behavior": "Removed",
          "invert": True
        },
        {
          "type": "ByteLevel",
          "add_prefix_space": False,
          "trim_offsets": True,
          "use_regex": False
        }
      ]
}
```

Try it :)

Removing the post_processor shouldn't do anything, since the decoder already handles byte-level decoding. And as long as you've adapted the regex to the GPT-4 version, it should work in Python. I can't speak for JS.

@binxuan

binxuan commented Sep 22, 2023

Yeah I tried on 10M cases and only got 2 unmatched token sequences, which should be fine for common use cases. By the way, I noticed that the decoding sometimes does not yield same results between converted HF and TikToken. For example, I got different text string from this token sequence [12906, 224, 61196, 5619, 248, 15272, 113, 55884, 245, 5619, 111, 73414, 13, 15272, 250, 31584, 107, 24810, 15272, 113, 31584, 107, 65804, 31584, 97, 43411, 105, 5619, 99, 31584, 99, 92911, 15272, 228, 5619, 250, 84736, 86133, 80338, 31584, 107, 55884, 243, 32511, 248, 31584, 107, 24810, 92317, 61196, 32511, 97, 15272, 246, 12906, 225, 5619, 96, 24810, 11, 15272, 248, 44747, 5619, 94, 1174, 15272, 108, 32511, 245, 11, 15272, 99, 31584, 113, 55884, 115, 11, 15272, 107, 24810, 15272, 255, 32511, 113, 61196, 32511, 224, 5619, 244, 35470, 45279, 44747, 5619, 250, 48909, 32511, 117, 44747, 15272, 101, 32511, 117, 44747, 11, 15272, 107, 24810, 84736, 86133, 32511, 108, 31584, 114, 31584, 113, 5619, 255, 12906, 224, 88344, 44747, 5619, 113, 45279, 15272, 97, 31584, 107, 24810, 15272, 113, 31584, 107, 65804, 31584, 97, 44747, 5619, 248, 44747, 15272, 105, 32511, 107, 65804, 55675, 15272, 228, 5619, 96, 39951, 92317, 73753, 92911, 32511, 101, 35470, 85410, 35470, 84736, 73753, 79468, 31584, 97, 65804, 15272, 110, 43411, 117, 5619, 96, 31584, 107, 32511, 97, 85410, 24810, 84736, 73753, 5619, 95, 32511, 243, 32511, 108, 15272, 246, 31584, 107, 32511, 113, 24810, 11, 15272, 97, 31584, 107, 32511, 248, 73414, 15272, 228, 5619, 107, 73753, 5619, 115, 31584, 107, 15272, 110, 55675, 65804, 32511, 224, 79468, 88344, 55675, 45279, 92317, 32511, 224, 5619, 94, 5619, 96, 31584, 107, 32511, 248, 24810, 84736, 86133, 5619, 107, 80338, 31584, 101, 48909, 45279, 32511, 113, 24810, 11, 85410, 55884, 248, 15272, 113, 43411, 114, 55884, 115, 15272, 228, 95048, 35470, 13, 15272, 228, 5619, 96, 39951, 15272, 241, 79468, 32511, 106, 32511, 248, 24810, 15272, 114, 55884, 113, 5619, 253, 15272, 251, 32511, 110, 31584, 107, 32511, 101, 73414, 80338, 45279, 15272, 227, 92911, 31584, 103, 32511, 113, 5619, 100, 44747, 80338, 5619, 248, 85410, 35470, 84736, 73753, 79468, 31584, 97, 65804]

@binxuan

binxuan commented Sep 22, 2023

Nvm, I found it is caused by setting clean_up_tokenization_spaces=True.
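
For anyone else hitting this, a small sketch of the fix (assuming the converted Xenova/gpt-4 tokenizer; clean_up_tokenization_spaces is a standard decode() argument):

import tiktoken
from transformers import GPT2TokenizerFast

hf_tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/gpt-4')
tik_tokenizer = tiktoken.encoding_for_model('gpt-4')

ids = tik_tokenizer.encode("a test , with spaced punctuation .")

# With the default clean_up_tokenization_spaces=True, HF collapses " ," into
# "," (etc.), so the decoded text no longer matches tiktoken's round-trip.
print(hf_tokenizer.decode(ids, clean_up_tokenization_spaces=False))
print(tik_tokenizer.decode(ids))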

@KerfuffleV2

@xenova Thanks for posting this! For the purposes of adapting/incorporating it into other projects, what's the license for this code? (Maybe add a note with license info to the comments at the top?)

@xenova
Author

xenova commented Feb 5, 2024

Just an update on the case-insensitive group modifier (?i:), which causes issues with certain regex implementations (e.g., JS): I think it's reasonable to just replace the problematic section with a longer (but equivalent) version.

Original: (?i:'s|'t|'re|'ve|'m|'ll|'d)|

JS-friendly version: (?:'([sS]|[tT]|[rR][eE]|[vV][eE]|[mM]|[lL][lL]|[dD]))
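
A quick way to sanity-check that the two spellings split text identically (just a sketch with the third-party regex package; finditer/group(0) is used so the extra capturing group in the JS-friendly variant doesn't affect the comparison):

import regex

TAIL = r"|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
ORIGINAL = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)" + TAIL
JS_FRIENDLY = r"(?:'([sS]|[tT]|[rR][eE]|[vV][eE]|[mM]|[lL][lL]|[dD]))" + TAIL

text = "I'LL say it's done, isn'T it?  They'd've...\n\n42"
a = [m.group(0) for m in regex.finditer(ORIGINAL, text)]
b = [m.group(0) for m in regex.finditer(JS_FRIENDLY, text)]
assert a == b, (a, b)  # identical splits on this sample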

For the purposes of adapting/incorporating into other projects, what's the license for this code?

Do what you want with it :) In any case, my code is adapted from this comment, with a few modifications.

@xenova
Author

xenova commented Mar 27, 2024

I actually forgot to update the gist with my new conversion script, which takes into account the new split pretokenization regex (thanks @gautierdag for pointing that out!).

It also sets the default clean_up_tokenization_spaces to False (thanks @binxuan for pointing that out).

So, now it's updated 🤗 👍 I've also validated the GPT-4 tokenizer on the entire XNLI dataset (all languages) with 100% compatibility (both encoding and decoding). 🔥 Code to validate:

import tqdm
from datasets import load_dataset
import tiktoken
from transformers import GPT2TokenizerFast

hf_tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/gpt-4')
og_tokenizer = tiktoken.encoding_for_model('gpt-4')

dataset = load_dataset('xnli', 'all_languages')

for item in tqdm.tqdm(dataset['train']):
    for string in item['premise'].values():
        encoded1 = og_tokenizer.encode(string)
        encoded2 = hf_tokenizer.encode(string)

        assert encoded1 == encoded2, f'encoding "{string}" is incorrect. "{encoded1}" != "{encoded2}"'

        decoded1 = og_tokenizer.decode(encoded1)
        decoded2 = hf_tokenizer.decode(encoded2, skip_special_tokens=True)

        assert decoded1 == decoded2, f'decoding "{string}" is incorrect. "{decoded1}" != "{decoded2}"'

@david-waterworth

Shouldn't 'tokenizer_class' be 'GPT2Tokenizer' in all cases? This is the Hugging Face concrete class that's instantiated, i.e. by doing this you can use

 hf_tokenizer = AutoTokenizer.from_pretrained('Xenova/gpt-4')

rather than GPT2TokenizerFast (which otherwise generates a warning).
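
If that route is taken, a rough sketch of the change (assuming a local copy of the converted repo's tokenizer_config.json):

import json

with open('tokenizer_config.json', encoding='utf-8') as f:
    config = json.load(f)

# AutoTokenizer resolves "GPT2Tokenizer" to GPT2TokenizerFast when a fast
# tokenizer is requested, so this avoids the class-mismatch warning.
config['tokenizer_class'] = 'GPT2Tokenizer'

with open('tokenizer_config.json', 'w', encoding='utf-8') as f:
    json.dump(config, f, indent=2)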
