I think you've got a bug somewhere (probably in the regex):
```python
import tiktoken
from transformers import GPT2TokenizerFast

tst = '\n\n\n\n\ns1232'

# GPT4 test - fail
tokenizer_tik = tiktoken.encoding_for_model("gpt-4")
tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/gpt-4')
print([(tokenizer.decode([i]), i) for i in tokenizer.encode(tst)])
# [('\n\n\n\n', 1038), ('\n', 198), ('s', 82), ('12', 717), ('32', 843)]
print([(tokenizer_tik.decode([i]), i) for i in tokenizer_tik.encode(tst)])
# [('\n\n\n\n\n', 14963), ('s', 82), ('123', 4513), ('2', 17)]

# GPT2 test - success
tokenizer_tik = tiktoken.encoding_for_model("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/gpt2')
print([(tokenizer.decode([i]), i) for i in tokenizer.encode(tst)])
# [('\n\n', 628), ('\n\n', 628), ('\n', 198), ('s', 82), ('12', 1065), ('32', 2624)]
print([(tokenizer_tik.decode([i]), i) for i in tokenizer_tik.encode(tst)])
# [('\n\n', 628), ('\n\n', 628), ('\n', 198), ('s', 82), ('12', 1065), ('32', 2624)]
```
Is this a problem with the encoding or decoding? Can you split up your checks please?
Edit: looks like an issue with encoding. Could you still split the tests and send the token IDs? Thanks! Also, could you check with GPT-2 to see if it's still an issue there?
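For reference, a minimal sketch (not from the original exchange) of what the split checks might look like, testing encoding and decoding separately against the same token IDs:

```python
import tiktoken
from transformers import GPT2TokenizerFast

tst = '\n\n\n\n\ns1232'
hf = GPT2TokenizerFast.from_pretrained('Xenova/gpt-4')
tik = tiktoken.encoding_for_model('gpt-4')

# Encoding check: do both tokenizers produce the same token ids?
hf_ids, tik_ids = hf.encode(tst), tik.encode(tst)
print('encode match:', hf_ids == tik_ids, hf_ids, tik_ids)

# Decoding check: given the *same* ids, do both decode back to the same string?
print('decode match:', hf.decode(tik_ids) == tik.decode(tik_ids))
```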
Edited the test and added digits as well. I think it's the regex splitting, since GPT-4 / cl100k_base changes the regex from the GPT-3 one.
To fix it, you should change the pre_tokenizer for cl100k_base to:
"pre_tokenizer": {
"type": "Sequence",
"pretokenizers": [
{
"type": "Split",
"pattern": {
"Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
},
"behavior": "Removed",
"invert": True
},
{
"type": "ByteLevel",
"add_prefix_space": False,
"trim_offsets": True,
"use_regex": False
}
]
}
When I do that I pass the simple test above.
Also I don't think "post_processor" is needed - you can keep it null/None.
Edit: I've just run extensive tests with the above "pre_tokenizer" and I confirm it is now equivalent to GPT-4.
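For anyone applying this by hand, here's a minimal sketch (not part of the original comment) that patches an already-converted tokenizer.json on disk with the pre_tokenizer above; the file path is just an assumption:

```python
import json

# cl100k_base split pattern from the comment above (raw strings so the
# backslashes survive into the JSON file unchanged)
CL100K_SPLIT_REGEX = (
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}|"
    r" ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

with open("tokenizer.json", encoding="utf-8") as f:  # path is an assumption
    config = json.load(f)

config["pre_tokenizer"] = {
    "type": "Sequence",
    "pretokenizers": [
        {
            "type": "Split",
            "pattern": {"Regex": CL100K_SPLIT_REGEX},
            "behavior": "Removed",
            "invert": True,
        },
        {
            "type": "ByteLevel",
            "add_prefix_space": False,
            "trim_offsets": True,
            "use_regex": False,
        },
    ],
}

with open("tokenizer.json", "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False, indent=2)
```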
Amazing! Thanks so much @gautierdag! I'll update the script accordingly. If possible, could you share what tests you are running? I'd like to update the regex so that it's compatible with both Python and JavaScript, but the `?i:` seems to break in JavaScript.
(see here for my JS demo)
I can't share internal tests unfortunately, I just ran both tokenizers on a large dataset of various different text types and confirmed they were the same.
Yeah, the `?i:` regex requires special handling; in Python you'd need to use the `regex` library (not `re`) to be able to reproduce the regex's functionality. Here the Python bindings use the underlying Rust regex crate, which seems to work (though I think technically the `fancy-regex` crate should be used instead). Not sure about JS, sorry 😞!
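As a concrete reference (my addition, not from the thread), the third-party `regex` module handles both the `\p{...}` classes and the scoped `(?i:...)` group, so the cl100k_base split can be reproduced in plain Python:

```python
import regex  # pip install regex (the stdlib `re` lacks \p{...} support)

CL100K_PATTERN = regex.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}|"
    r" ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

print(CL100K_PATTERN.findall('\n\n\n\n\ns1232'))
# expected, matching tiktoken's pre-tokenization: ['\n\n\n\n\n', 's', '123', '2']
```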
> I can't share internal tests unfortunately, I just ran both tokenizers on a large dataset of various different text types and confirmed they were the same.
No worries!
> Yeah, the `?i:` regex requires special handling; in Python you'd need to use the `regex` library (not `re`) to be able to reproduce the regex's functionality.
Right, I noticed that while playing around with it a bit more yesterday. I suppose the entire regex can be set to case-insensitive mode, no? Do you notice any difference in your tests if `?i:` is removed but the entire regex is set to case-insensitive (as opposed to just that first group)?
mmmh I changed the regex to:
"(?i)'s|'t|'re|'ve|'m|'ll|'d|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
This should set the whole expression to case-insensitive, I think. However, it breaks on a weird Unicode character (https://unicode-explorer.com/c/0345). I sadly don't have an explanation for it 🤷
```python
# hfgpt4 / hfgpt4_case_insensitive: HF tokenizers converted with the original
# (?i:...) regex and the whole-expression (?i) variant above, respectively.
character = 'ͅ'  # U+0345 (COMBINING GREEK YPOGEGRAMMENI) fails
t = hfgpt4.encode(character)
print(t)  # [137, 227] - what tiktoken also returns
o = hfgpt4_case_insensitive.encode(character)
print(o)  # []
```
It is equivalent otherwise for everything else I tested.
Hey, thanks for sharing the solution and discussion! Do we have any conclusion on which regex to use to fully replicate tiktoken in Hugging Face? Is this pre_tokenizer setting working? Does removing post_processor yield different results?
```python "pre_tokenizer": { "type": "Sequence", "pretokenizers": [ { "type": "Split", "pattern": { "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" }, "behavior": "Removed", "invert": True }, { "type": "ByteLevel", "add_prefix_space": False, "trim_offsets": True, "use_regex": False } ] }
> Hey, thanks for sharing the solution and discussion! Do we have any conclusion on which regex to use to fully replicate tiktoken in Hugging Face? Is this pre_tokenizer setting working? Does removing post_processor yield different results?
Try it :)
Removing the post_processor shouldn't do anything, since the decoder already handles byte-level decoding. And as long as you adapted the regex to the GPT-4 version, it should work in Python. I can't speak for JS.
Yeah I tried on 10M cases and only got 2 unmatched token sequences, which should be fine for common use cases. By the way, I noticed that the decoding sometimes does not yield same results between converted HF and TikToken. For example, I got different text string from this token sequence [12906, 224, 61196, 5619, 248, 15272, 113, 55884, 245, 5619, 111, 73414, 13, 15272, 250, 31584, 107, 24810, 15272, 113, 31584, 107, 65804, 31584, 97, 43411, 105, 5619, 99, 31584, 99, 92911, 15272, 228, 5619, 250, 84736, 86133, 80338, 31584, 107, 55884, 243, 32511, 248, 31584, 107, 24810, 92317, 61196, 32511, 97, 15272, 246, 12906, 225, 5619, 96, 24810, 11, 15272, 248, 44747, 5619, 94, 1174, 15272, 108, 32511, 245, 11, 15272, 99, 31584, 113, 55884, 115, 11, 15272, 107, 24810, 15272, 255, 32511, 113, 61196, 32511, 224, 5619, 244, 35470, 45279, 44747, 5619, 250, 48909, 32511, 117, 44747, 15272, 101, 32511, 117, 44747, 11, 15272, 107, 24810, 84736, 86133, 32511, 108, 31584, 114, 31584, 113, 5619, 255, 12906, 224, 88344, 44747, 5619, 113, 45279, 15272, 97, 31584, 107, 24810, 15272, 113, 31584, 107, 65804, 31584, 97, 44747, 5619, 248, 44747, 15272, 105, 32511, 107, 65804, 55675, 15272, 228, 5619, 96, 39951, 92317, 73753, 92911, 32511, 101, 35470, 85410, 35470, 84736, 73753, 79468, 31584, 97, 65804, 15272, 110, 43411, 117, 5619, 96, 31584, 107, 32511, 97, 85410, 24810, 84736, 73753, 5619, 95, 32511, 243, 32511, 108, 15272, 246, 31584, 107, 32511, 113, 24810, 11, 15272, 97, 31584, 107, 32511, 248, 73414, 15272, 228, 5619, 107, 73753, 5619, 115, 31584, 107, 15272, 110, 55675, 65804, 32511, 224, 79468, 88344, 55675, 45279, 92317, 32511, 224, 5619, 94, 5619, 96, 31584, 107, 32511, 248, 24810, 84736, 86133, 5619, 107, 80338, 31584, 101, 48909, 45279, 32511, 113, 24810, 11, 85410, 55884, 248, 15272, 113, 43411, 114, 55884, 115, 15272, 228, 95048, 35470, 13, 15272, 228, 5619, 96, 39951, 15272, 241, 79468, 32511, 106, 32511, 248, 24810, 15272, 114, 55884, 113, 5619, 253, 15272, 251, 32511, 110, 31584, 107, 32511, 101, 73414, 80338, 45279, 15272, 227, 92911, 31584, 103, 32511, 113, 5619, 100, 44747, 80338, 5619, 248, 85410, 35470, 84736, 73753, 79468, 31584, 97, 65804]
Nvm, I found it is caused by setting clean_up_tokenization_spaces=True.
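For illustration (my addition, using the converted `Xenova/gpt-4` repo as the example), passing `clean_up_tokenization_spaces=False` at decode time sidesteps the mismatch:

```python
import tiktoken
from transformers import GPT2TokenizerFast

hf_tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/gpt-4')
tik_tokenizer = tiktoken.encoding_for_model('gpt-4')

ids = hf_tokenizer.encode('hello   world !')
# GPT-2 style cleanup would strip the space before '!'; disabling it matches tiktoken exactly
assert hf_tokenizer.decode(ids, clean_up_tokenization_spaces=False) == tik_tokenizer.decode(ids)
```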
@xenova Thanks for posting this! For the purposes of adapting/incorporating into other projects, what's the license for this code? (Maybe add license info to the comments at the top?)
Just an update on the issue with the case-insensitive group modifier (`?i:`), which causes issues with certain regex implementations (e.g., JS): I think it's reasonable to just replace the problematic section with a longer (but equivalent) version.
Original: `(?i:'s|'t|'re|'ve|'m|'ll|'d)|`
JS-friendly version: `(?:'([sS]|[tT]|[rR][eE]|[vV][eE]|[mM]|[lL][lL]|[dD]))`
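A quick way to sanity-check the two variants against each other in Python (my own sketch, again using the third-party `regex` module):

```python
import regex

# everything after the contraction group is shared between the two variants
TAIL = (
    r"|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}|"
    r" ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)
original = regex.compile(r"(?i:'s|'t|'re|'ve|'m|'ll|'d)" + TAIL)
js_friendly = regex.compile(r"(?:'([sS]|[tT]|[rR][eE]|[vV][eE]|[mM]|[lL][lL]|[dD]))" + TAIL)

sample = "He'S here, they'RE not, I'll stay.\n\n  123456"
# use group(0) so the capturing group in the JS-friendly version doesn't skew the result
orig_tokens = [m.group(0) for m in original.finditer(sample)]
js_tokens = [m.group(0) for m in js_friendly.finditer(sample)]
assert orig_tokens == js_tokens
```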
> For the purposes of adapting/incorporating into other projects, what's the license for this code?
Do what you want with it :) In any case, my code is adapted from this comment, with a few modifications.
I actually forgot to update the gist with my new conversion script, which takes into account the new split pretokenization regex (thanks @gautierdag for pointing that out!).
It also sets the default `clean_up_tokenization_spaces` to `False` (thanks @binxuan for pointing that out).
So, now it's updated 🤗 👍 I've also validated the GPT-4 tokenizer on the entire XNLI dataset (all languages) with 100% compatibility (both encoding and decoding). 🔥 Code to validate:
```python
import tqdm
from datasets import load_dataset
import tiktoken
from transformers import GPT2TokenizerFast

hf_tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/gpt-4')
og_tokenizer = tiktoken.encoding_for_model('gpt-4')

dataset = load_dataset('xnli', 'all_languages')

for item in tqdm.tqdm(dataset['train']):
    for string in item['premise'].values():
        encoded1 = og_tokenizer.encode(string)
        encoded2 = hf_tokenizer.encode(string)
        assert encoded1 == encoded2, f'encoding "{string}" is incorrect. "{encoded1}" != "{encoded2}"'

        decoded1 = og_tokenizer.decode(encoded1)
        decoded2 = hf_tokenizer.decode(encoded2, skip_special_tokens=True)
        assert decoded1 == decoded2, f'decoding "{string}" is incorrect. "{decoded1}" != "{decoded2}"'
```
Shouldn't 'tokenizer_class' be 'GPT2Tokenizer' in all cases? This is the Hugging Face concrete class that's instantiated - i.e. by doing this you can use `hf_tokenizer = AutoTokenizer.from_pretrained('Xenova/gpt-4')` rather than `GPT2TokenizerFast` (which then generates a warning).
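If you want to make that change locally, a minimal sketch (the file path is an assumption) is to edit the converted repo's tokenizer_config.json:

```python
import json

# Point the config at the "slow" class name; AutoTokenizer will still pick the
# fast implementation (GPT2TokenizerFast) by default, but without the warning.
with open('tokenizer_config.json', encoding='utf-8') as f:
    cfg = json.load(f)

cfg['tokenizer_class'] = 'GPT2Tokenizer'

with open('tokenizer_config.json', 'w', encoding='utf-8') as f:
    json.dump(cfg, f, indent=2)
```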
A list of pre-converted tokenizers:
- https://huggingface.co/Xenova/gpt-4
- https://huggingface.co/Xenova/gpt-3.5-turbo
- https://huggingface.co/Xenova/gpt-3.5-turbo-16k
- https://huggingface.co/Xenova/text-davinci-002
- https://huggingface.co/Xenova/text-davinci-003
- https://huggingface.co/Xenova/text-embedding-ada-002
- https://huggingface.co/Xenova/gpt-3
- https://huggingface.co/Xenova/gpt2