@Norod
Norod / neo2gpt2.py
Last active November 3, 2024 13:06
A script which 'converts' an existing GPT-Neo model into the GPT-2 architecture with a modified target_positions value. The resulting GPT-2 model then needs further training, using the same tokenizer as the original model, to recover from the switch from local to global attention.
import torch
from transformers import GPT2LMHeadModel, GPTNeoForCausalLM, GPT2Config
def convert_neo_to_gpt2(neo_model_path, output_path, target_positions=1024):
    # Load the trained GPT-Neo model
    neo_model = GPTNeoForCausalLM.from_pretrained(neo_model_path)
    # Create a GPT-2 config matching GPT-Neo's structure but with reduced position embeddings
    gpt2_config = GPT2Config(
        vocab_size=neo_model.config.vocab_size,
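The preview cuts off inside the GPT2Config call. Below is a rough, self-contained sketch of how the rest of the conversion could look, assuming the usual GPT-Neo layout (separate q/k/v/out projections without q/k/v biases, Linear MLP layers, learned position embeddings in transformer.wpe) and GPT-2's fused Conv1D weights; it is not the gist's actual code, and the converted model still needs the recovery training described above.

import torch
from transformers import GPT2LMHeadModel, GPTNeoForCausalLM, GPT2Config

def convert_neo_to_gpt2_sketch(neo_model_path, output_path, target_positions=1024):
    neo_model = GPTNeoForCausalLM.from_pretrained(neo_model_path)
    neo_cfg = neo_model.config

    # Mirror GPT-Neo's dimensions, but shrink the positional range.
    gpt2_config = GPT2Config(
        vocab_size=neo_cfg.vocab_size,
        n_positions=target_positions,
        n_embd=neo_cfg.hidden_size,
        n_layer=neo_cfg.num_layers,
        n_head=neo_cfg.num_heads,
        n_inner=neo_cfg.intermediate_size,  # None means 4 * n_embd, as in GPT-Neo
        activation_function=neo_cfg.activation_function,
        layer_norm_epsilon=neo_cfg.layer_norm_epsilon,
    )
    gpt2_model = GPT2LMHeadModel(gpt2_config)

    neo, gpt2 = neo_model.transformer, gpt2_model.transformer
    with torch.no_grad():
        # Token embeddings, plus the learned position embeddings truncated
        # to the new target_positions length.
        gpt2.wte.weight.copy_(neo.wte.weight)
        gpt2.wpe.weight.copy_(neo.wpe.weight[:target_positions])

        for neo_block, gpt2_block in zip(neo.h, gpt2.h):
            gpt2_block.ln_1.load_state_dict(neo_block.ln_1.state_dict())
            gpt2_block.ln_2.load_state_dict(neo_block.ln_2.state_dict())

            attn = neo_block.attn.attention
            # GPT-2 fuses q/k/v into one Conv1D whose weight is (in, 3 * out),
            # so concatenate the transposed Linear weights along dim 1.
            gpt2_block.attn.c_attn.weight.copy_(torch.cat(
                [attn.q_proj.weight.t(), attn.k_proj.weight.t(), attn.v_proj.weight.t()], dim=1))
            gpt2_block.attn.c_attn.bias.zero_()  # GPT-Neo's q/k/v projections have no bias
            gpt2_block.attn.c_proj.weight.copy_(attn.out_proj.weight.t())
            gpt2_block.attn.c_proj.bias.copy_(attn.out_proj.bias)

            # MLP: Linear -> Conv1D means transposing each weight matrix.
            gpt2_block.mlp.c_fc.weight.copy_(neo_block.mlp.c_fc.weight.t())
            gpt2_block.mlp.c_fc.bias.copy_(neo_block.mlp.c_fc.bias)
            gpt2_block.mlp.c_proj.weight.copy_(neo_block.mlp.c_proj.weight.t())
            gpt2_block.mlp.c_proj.bias.copy_(neo_block.mlp.c_proj.bias)

        gpt2.ln_f.load_state_dict(neo.ln_f.state_dict())

    gpt2_model.save_pretrained(output_path)
    return gpt2_model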
@Norod
Norod / obj_mesh_to_spritesheet.py
Created October 8, 2024 09:28
Renders a mesh from several view angles and saves the renders as a single spritesheet image
import trimesh
from PIL import Image
import numpy as np
import io
# If you are running on Apple Silicon, you may need to comment out the
# following lines as described in this GitHub issue
# to avoid running into an issue with the trimesh library:
# https://github.com/mikedh/trimesh/issues/2084#issuecomment-1840072858
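A rough sketch of the rendering loop, in case the preview above is not enough to go on: it assumes the .obj loads as a single trimesh.Trimesh and that trimesh's offscreen renderer (pyglet) is available; the frame count, tile size, and camera distance are illustrative, not the gist's values.

import io
import numpy as np
import trimesh
from PIL import Image

def render_spritesheet(obj_path, out_path="spritesheet.png", frames=8, tile=256):
    mesh = trimesh.load(obj_path)
    scene = mesh.scene()

    images = []
    for i in range(frames):
        # Rotate the camera around the vertical axis, one step per frame.
        angle = 2.0 * np.pi * i / frames
        scene.set_camera(angles=(0.0, angle, 0.0), distance=mesh.scale * 2.0)
        png_bytes = scene.save_image(resolution=(tile, tile))
        images.append(Image.open(io.BytesIO(png_bytes)).convert("RGBA"))

    # Paste the frames left to right into a single-row spritesheet.
    sheet = Image.new("RGBA", (tile * frames, tile))
    for i, img in enumerate(images):
        sheet.paste(img, (i * tile, 0))
    sheet.save(out_path)
    return sheet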
@Norod
Norod / obj_mesh_depth_bucket_disconnect.py
Created October 7, 2024 07:34
Disconnect mesh components in an .obj file based on a number of "depth range" buckets
import numpy as np
import trimesh
def load_mesh(file_path):
    return trimesh.load(file_path)

def compute_depth_ranges(mesh, num_buckets=5):
    # Extract vertex depths (assuming z-coordinate represents depth)
    depths = mesh.vertices[:, 2]
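The preview stops before the actual bucketing. Below is one plausible reading of the description, sketched under the assumption that faces are assigned to buckets by their mean vertex depth and that per-bucket re-indexing is what disconnects the components; the full gist may do this differently.

import numpy as np
import trimesh

def split_by_depth_buckets(file_path, out_path, num_buckets=5):
    mesh = trimesh.load(file_path)

    # Bucket edges over the z (depth) range of the mesh.
    depths = mesh.vertices[:, 2]
    edges = np.linspace(depths.min(), depths.max(), num_buckets + 1)

    # Assign each face to a bucket by the mean depth of its vertices.
    face_depths = depths[mesh.faces].mean(axis=1)
    bucket_ids = np.clip(np.digitize(face_depths, edges) - 1, 0, num_buckets - 1)

    parts = []
    for b in range(num_buckets):
        faces = np.flatnonzero(bucket_ids == b)
        if len(faces):
            # submesh() re-indexes vertices, so buckets share no vertices and
            # end up as disconnected components in the exported file.
            parts.extend(mesh.submesh([faces], append=False))

    combined = trimesh.util.concatenate(parts)
    combined.export(out_path)
    return combined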
@Norod
Norod / combine_tokenizers.py
Created July 25, 2024 15:28
Given two BPE tokenizers, combine them and create a new tokenizer
"""
Given two tokenizers, combine them and create a new tokenizer
Usage: python combine_tokenizers.py --tokenizer1 ./SmolLM-135M --tokenizer2 ./hebrew-14k --save_dir ./combined
Source: https://github.com/huggingface/tokenizers/issues/690#issuecomment-830665989
"""
# Libraries for tokenizer
from pathlib import Path
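The linked issue thread describes a vocabulary-level merge; a minimal sketch of that idea follows, assuming the merge is done by adding tokenizer2's missing tokens to tokenizer1 with add_tokens (which stores them as added tokens rather than new BPE merges). The flag names follow the usage line above.

import argparse
from transformers import AutoTokenizer

def combine(tokenizer1_path, tokenizer2_path, save_dir):
    tok1 = AutoTokenizer.from_pretrained(tokenizer1_path)
    tok2 = AutoTokenizer.from_pretrained(tokenizer2_path)

    # Tokens present in tokenizer2 but missing from tokenizer1.
    new_tokens = set(tok2.get_vocab()) - set(tok1.get_vocab())
    added = tok1.add_tokens(sorted(new_tokens))
    print(f"Added {added} tokens; combined vocab size: {len(tok1)}")

    tok1.save_pretrained(save_dir)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--tokenizer1", required=True)
    parser.add_argument("--tokenizer2", required=True)
    parser.add_argument("--save_dir", required=True)
    args = parser.parse_args()
    combine(args.tokenizer1, args.tokenizer2, args.save_dir)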
@Norod
Norod / Training_a_new_tokenizer_from_an_old_one.ipynb
Created July 18, 2024 13:02
A set of scripts for: training a small tokenizer in a new language, merging the small tokenizer with an existing one, and saving the combined tokenizer and resized model
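The notebook itself is not shown here, so this is only a hedged outline of the described steps using standard Hugging Face APIs (train a small tokenizer on a new-language corpus, merge it into an existing tokenizer, resize the model's embeddings); the model name, corpus, and vocabulary size below are placeholders, not values from the notebook.

from transformers import AutoTokenizer, AutoModelForCausalLM

base_name = "gpt2"          # placeholder base model and tokenizer
corpus = ["..."]            # placeholder iterable of new-language text

base_tok = AutoTokenizer.from_pretrained(base_name)

# 1. Train a small tokenizer in the new language, reusing the old one's settings.
new_tok = base_tok.train_new_from_iterator(corpus, vocab_size=14000)

# 2. Merge: add the new tokenizer's tokens that the base tokenizer lacks.
missing = set(new_tok.get_vocab()) - set(base_tok.get_vocab())
base_tok.add_tokens(sorted(missing))

# 3. Resize the model's embedding matrix to the combined vocabulary and save both.
model = AutoModelForCausalLM.from_pretrained(base_name)
model.resize_token_embeddings(len(base_tok))

base_tok.save_pretrained("./combined")
model.save_pretrained("./combined")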
@Norod
Norod / prepare_jsonl_dataset_file_from_txt_folder.py
Created July 11, 2024 15:33
Create a JSONL dataset by reading and processing lines from text files, concatenating a specified number of text lines into a single JSONL line, encoding newlines as \\n and allowing UTF-8 Unicode characters
import os
import json
from glob import glob
from torch.utils.data import IterableDataset, DataLoader
class BatchProcessedDataset(IterableDataset):
    """
    A dataset which streams and processes lines from files, concatenating a specified number of lines.
    """
    def __init__(self, files, batch_size=4096, lines_per_entry=20):
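A hedged sketch of how the streaming and concatenation might be completed and how the JSONL file could be written; the "text" field name and the file-writing loop are assumptions, and batch_size is kept only to match the signature in the preview.

import json
from glob import glob
from torch.utils.data import IterableDataset

class BatchProcessedDataset(IterableDataset):
    """Streams text files and yields entries of lines_per_entry concatenated lines."""

    def __init__(self, files, batch_size=4096, lines_per_entry=20):
        self.files = files
        self.batch_size = batch_size      # unused in this sketch
        self.lines_per_entry = lines_per_entry

    def __iter__(self):
        buffer = []
        for path in self.files:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    buffer.append(line.rstrip("\n"))
                    if len(buffer) == self.lines_per_entry:
                        yield "\n".join(buffer)
                        buffer = []
        if buffer:
            yield "\n".join(buffer)

if __name__ == "__main__":
    dataset = BatchProcessedDataset(sorted(glob("txt_folder/*.txt")))
    with open("dataset.jsonl", "w", encoding="utf-8") as out:
        for entry in dataset:
            # json.dumps escapes real newlines as \n; ensure_ascii=False keeps UTF-8 readable.
            out.write(json.dumps({"text": entry}, ensure_ascii=False) + "\n")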
@Norod
Norod / heb_tokenize_compare.py
Created May 16, 2024 14:37
Tokenize a Hebrew prompt of length 374 (the current Wikipedia abstract for the term "Cat") with several tokenizers and compare the results
from transformers import AutoTokenizer
from transformers import LlamaTokenizerFast
tokenizer_grok = LlamaTokenizerFast.from_pretrained('Xenova/grok-1-tokenizer')
tokenizer_gemma = AutoTokenizer.from_pretrained("google/gemma-7b-it")
tokenizer_aya101 = AutoTokenizer.from_pretrained("CohereForAI/aya-101")
tokenizer_gpt2 = AutoTokenizer.from_pretrained("gpt2")
# prompt_text='''מודל ראשון בגודל 6-מיליארד פרמטרים מתאמן כרגע על חלק מהדאטסטים שהגבתם, עכשיו כשהמודל על האש אני אתפנה לענות לכולם. מתנצל על העיכוב, קיבלתי המון הודעות ולא ציפיתי לכזו הענות, אתם אדירים!
# שלב הבא: להרכיב דאטהסט אחד ענק מכל הרעיונות והלינקים שצירפתם בשביל האימון המרכזי.'''
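A sketch of how the comparison presumably continues, reusing the tokenizer objects loaded above: encode the same prompt with each tokenizer and print the token counts (fewer tokens means the tokenizer handles Hebrew more efficiently). prompt_text is commented out in the preview, so a placeholder is used here, and the output format is an assumption.

prompt_text = "..."  # placeholder; use the Hebrew prompt commented out above

print(f"Prompt length: {len(prompt_text)}")
for name, tok in [("grok", tokenizer_grok), ("gemma", tokenizer_gemma),
                  ("aya-101", tokenizer_aya101), ("gpt2", tokenizer_gpt2)]:
    print(f"{name}: {len(tok(prompt_text)['input_ids'])} tokens")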
@Norod
Norod / apple_openelm-3b_cuda_gradio-demo.ipynb
Last active April 30, 2024 09:40
apple_openelm-3b_cuda_gradio-demo.ipynb
@Norod
Norod / apple_openelm-270m_cpu_gradio-demo.ipynb
Created April 24, 2024 17:08
Apple_OpenELM-270M_cpu_Gradio-Demo.ipynb
@Norod
Norod / heb_tokenize_compare.py
Created March 18, 2024 09:21
Compare Hebrew tokenization efficiency across various tokenizers (the lower the number, the better)
from transformers import AutoTokenizer
from transformers import LlamaTokenizerFast
#tokenizer_yam = AutoTokenizer.from_pretrained("yam-peleg/Hebrew-Gemma-11B-V2")
tokenizer_grok = LlamaTokenizerFast.from_pretrained('Xenova/grok-1-tokenizer')
tokenizer_gemma = AutoTokenizer.from_pretrained("google/gemma-7b-it")
tokenizer_aya101 = AutoTokenizer.from_pretrained("CohereForAI/aya-101")
tokenizer_gpt2 = AutoTokenizer.from_pretrained("gpt2")
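A hedged guess at how this version might report its numbers, reusing the tokenizers loaded above: encode a Hebrew sample with each tokenizer and print tokens per character, so a lower value means a more efficient Hebrew vocabulary. The sample text and output format are placeholders, not taken from the gist.

sample_text = "..."  # placeholder; any Hebrew passage to compare on

for name, tok in [("grok", tokenizer_grok), ("gemma", tokenizer_gemma),
                  ("aya-101", tokenizer_aya101), ("gpt2", tokenizer_gpt2)]:
    n_tokens = len(tok(sample_text)["input_ids"])
    print(f"{name}: {n_tokens} tokens, {n_tokens / len(sample_text):.3f} tokens per character")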