Script for the InternLM2 tokenizer to add ChatML tokens and fix null token 354 for ggml conversion
# Launch with PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python tokenizer_fix.py
import sentencepiece.sentencepiece_model_pb2 as model

# Load the original InternLM2 SentencePiece model.
m = model.ModelProto()
m.ParseFromString(open('./tokenizer.model', 'rb').read())

# Overwrite the placeholder pieces at these indices with the ChatML /
# InternLM2 special token strings.
m.pieces[92543].piece = '<|im_start|>'
m.pieces[92542].piece = '<|im_end|>'
m.pieces[92541].piece = '<|action_start|>'
m.pieces[92540].piece = '<|action_end|>'
m.pieces[92539].piece = '<|interpreter|>'
m.pieces[92538].piece = '<|plugin|>'

# Token 354 is null in the original model, which breaks ggml conversion;
# give it a unique placeholder string instead.
m.pieces[354].piece = "[ERROR_NULL_TOKEN_a76Y96a9eX7b]"

# Write the patched model to a new file.
with open('tokenizer_fixed.model', 'wb') as f:
    f.write(m.SerializeToString())
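
As a quick sanity check (not part of the original script), the patched model can be parsed back and the renamed pieces printed. This is a minimal sketch that assumes tokenizer_fixed.model was just written by the script above and is run with the same PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python setting.

# Verify the patched pieces (hypothetical check, not from the original gist).
import sentencepiece.sentencepiece_model_pb2 as model

check = model.ModelProto()
check.ParseFromString(open('tokenizer_fixed.model', 'rb').read())

# The ChatML tokens and the placeholder for token 354 should now appear
# at the patched indices.
for idx in (92543, 92542, 92541, 92540, 92539, 92538, 354):
    print(idx, repr(check.pieces[idx].piece))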