Let's check out the PR and install AutoGPTQ from source:
git fetch origin pull/625/head:dbrx
git switch dbrx
pip install -vvv --no-build-isolation -e .
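To make sure the editable install from the PR branch is the one Python actually picks up, a quick import check helps (a minimal sketch; the exact version string will depend on the branch):
# Sanity check: confirm the editable install of the PR branch is importable
import auto_gptq
print(auto_gptq.__version__)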
Download the model:
from huggingface_hub import snapshot_download

repo_id = "LnL-AI/dbrx-base-converted-v2-4bit-gptq-gptq-v2"
revision = "main"
local_cache_dir = f"/home/maziyar/.cache/huggingface/hub/models--{repo_id.replace('/', '--')}"

# Download the split GPTQ checkpoint into the local HF cache
snapshot_download(repo_id=repo_id, revision=revision, local_dir=local_cache_dir,
                  local_dir_use_symlinks=True, force_download=False)
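Before combining anything, it doesn't hurt to confirm the snapshot landed where we expect; this is just a plain directory listing of the cache path used above:
# List the downloaded files and their sizes (the split shards should be here)
import os
for name in sorted(os.listdir(local_cache_dir)):
    path = os.path.join(local_cache_dir, name)
    if os.path.isfile(path):
        print(f"{name}: {os.path.getsize(path) / 1e9:.2f} GB")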
Let's put the weights back together via the combine_tensors.sh script:
cd /home/maziyar/.cache/huggingface/hub/models--LnL-AI--dbrx-base-converted-v2-4bit-gptq-gptq-v2/
chmod +x combine_tensors.sh
./combine_tensors.sh
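Assuming the script writes a single combined safetensors file named after the model_basename we pass to AutoGPTQ below (that name is my assumption, check the script's own output), you can verify it exists before loading:
# Check that the combined checkpoint produced by combine_tensors.sh is in place
import os
combined = os.path.join(local_cache_dir, "gptq_model-4bit-128g.safetensors")  # assumed name, taken from model_basename below
if os.path.exists(combined):
    print(f"combined checkpoint: {os.path.getsize(combined) / 1e9:.1f} GB")
else:
    print("combined checkpoint not found -- re-check combine_tensors.sh output")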
Now let's load the model with AutoGPTQ for a quick test:
from transformers import AutoTokenizer, pipeline, TextStreamer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import torch
model_id = "/home/maziyar/.cache/huggingface/hub/models--LnL-AI--dbrx-base-converted-v2-4bit-gptq-gptq-v2/"
# Quantization settings matching how the checkpoint was produced (4-bit, group size 128)
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    damp_percent=0.005,
    desc_act=False,
    static_groups=False,
    sym=True,
    true_sequential=True,
    model_name_or_path=None,
    model_file_base_name=None,
    quant_method="gptq",
    checkpoint_format="gptq",
)
# Load the combined 4-bit checkpoint onto the first GPU
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    trust_remote_code=True,
    device="cuda:0",
    model_basename="gptq_model-4bit-128g",
    quantize_config=quantize_config,
)
tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-base")
streamer = TextStreamer(tokenizer)
input_text = "What does it take to build a great LLM? Respond in 3 bullet points"
messages = [{"role": "user", "content": input_text}]
# apply_chat_template with return_dict=True gives both input_ids and attention_mask
inputs = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=False, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200, streamer=streamer)
print(tokenizer.decode(outputs[0]))
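DBRX is large even at 4-bit, so after the first generation I find it useful to print the peak VRAM the run actually needed; this is plain torch bookkeeping, nothing AutoGPTQ-specific:
# Report peak GPU memory allocated during loading + generation
print(f"peak VRAM on cuda:0: {torch.cuda.max_memory_allocated(0) / 1e9:.1f} GB")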
Now you can have fun!!!
# You can also wrap the loaded model and tokenizer in a text-generation pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=200)
outputs = pipe("What is a large language model?")
print(outputs[0]["generated_text"])