AI & ML interests

Machine learning


VITA-Group@UT Austin (https://vita-group.github.io/)

We revisit classical sparse and low-rank optimization through the lens of modern AI, developing theory-driven algorithms that accelerate training and inference in large-scale models. We also investigate how algebraic and logical structures emerge during learning, uncovering the interplay between neural and symbolic computation across streamlined architectures, reasoning pipelines, and agentic systems. See https://www.vita-group.space/research for our latest research efforts.

Compressed LLM Model Zone

NOTE: All compressed LLMs have been moved to a new repository: compressed-llm.

The models were prepared by VITA-Group. Credits to Ajay Jaiswal, Zhenyu Zhang, Zhangheng Li, Lu Yin, Shiwei Liu, and Junyuan Hong.

License: MIT License

Setup environment

pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117
pip install transformers==4.31.0
pip install accelerate
pip install auto-gptq  # for gptq
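
If something misbehaves later, it is usually an environment mismatch. A quick sanity check (a minimal sketch, not specific to this hub) confirms the installed versions and GPU visibility:

import torch
import transformers

# Expect roughly torch 2.0.0 (cu117 build), transformers 4.31.0, and at least one visible GPU.
print('torch:', torch.__version__)
print('transformers:', transformers.__version__)
print('CUDA available:', torch.cuda.is_available())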

How to use pruned models

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pick a base model, pruning method, and sparsity level from the model table below.
base_model = 'llama-2-7b'
comp_method = 'magnitude_unstructured'
comp_degree = 0.2
model_path = f'vita-group/{base_model}_{comp_method}'
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    revision=f's{comp_degree}',  # the branch name encodes the sparsity level, e.g. 's0.2'
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map='auto',
)
# The pruned models reuse the original Llama-2 tokenizer.
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.cuda()
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
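
The same pattern covers every unstructured pruning checkpoint: the repository name encodes the pruning method and the revision (branch) encodes the sparsity level. The sketch below is illustrative only; it loads each checkpoint in turn and frees it, and assumes enough GPU memory for one fp16 7B model at a time. The method and degree lists mirror the model table further down.

import torch
from transformers import AutoModelForCausalLM

base_model = 'llama-2-7b'
for comp_method in ['magnitude_unstructured', 'sparsegpt_unstructured', 'wanda_unstructured']:
    for comp_degree in [0.1, 0.2, 0.3, 0.5, 0.6]:
        model = AutoModelForCausalLM.from_pretrained(
            f'vita-group/{base_model}_{comp_method}',
            revision=f's{comp_degree}',  # branch name encodes the sparsity level
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            device_map='auto',
        )
        # ... run your evaluation or generation here ...
        del model
        torch.cuda.empty_cache()  # free memory before loading the next checkpoint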

How to use wanda+gptq models

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_path = 'vita-group/llama-2-7b_wanda_2_4_gptq_4bit_128g'
tokenizer_path = 'meta-llama/Llama-2-7b-hf'
model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    # inject_fused_attention=False,  # alternative workaround if needed
    disable_exllama=True,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)
input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
outputs = model.generate(input_ids=input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))
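
Since the 2:4-pruned weights are additionally GPTQ-quantized to 4 bits, the loaded model should occupy far less GPU memory than the fp16 baseline. A rough check (illustrative only; it reports tensors allocated by PyTorch on the default device, not activations or kernel workspace):

import torch

# Rough on-GPU footprint of the currently loaded model.
print(f'GPU memory allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB')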

How to use gptq models

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_path = 'vita-group/vicuna-7b-v1.3_gptq'
tokenizer_path = 'lmsys/vicuna-7b-v1.3'
model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    # inject_fused_attention=False,  # alternative workaround if needed
    disable_exllama=True,
    device_map='auto',
    revision='2bit_128g',  # pick a bit width / group size from the table below
)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)
input_ids = tokenizer('Hello! I am a VITA-compressed-LLM chatbot!', return_tensors='pt').input_ids.to('cuda')
outputs = model.generate(input_ids=input_ids, max_length=128)
print(tokenizer.decode(outputs[0]))
Available models

Base Model      Model Size   Compression Method       Compression Degree
Llama-2         13b          magnitude_semistruct     0.5_2to4
Llama-2         13b          sparsegpt_semistruct     0.5_2to4
Llama-2         7b           magnitude_unstructured   s0.1
Llama-2         7b           magnitude_unstructured   s0.2
Llama-2         7b           magnitude_unstructured   s0.3
Llama-2         7b           magnitude_unstructured   s0.5
Llama-2         7b           magnitude_unstructured   s0.6
Llama-2         7b           sparsegpt_unstructured   s0.1
Llama-2         7b           sparsegpt_unstructured   s0.2
Llama-2         7b           sparsegpt_unstructured   s0.3
Llama-2         7b           sparsegpt_unstructured   s0.5
Llama-2         7b           sparsegpt_unstructured   s0.6
Llama-2         7b           wanda_gptq               4bit_128g
Llama-2         7b           wanda_unstructured       s0.1
Llama-2         7b           wanda_unstructured       s0.2
Llama-2         7b           wanda_unstructured       s0.3
Llama-2         7b           wanda_unstructured       s0.5
Llama-2         7b           wanda_unstructured       s0.6
Llama-2-chat    13b          magnitude_semistruct     0.5_2to4
Llama-2-chat    13b          sparsegpt_semistruct     0.5_2to4
vicuna          13b          magnitude_semistruct     0.5_2to4
vicuna          13b          sparsegpt_semistruct     0.5_2to4
vicuna-v1.3     13b          gptq                     10bit_128g
vicuna-v1.3     13b          gptq                     12bit_128g
vicuna-v1.3     13b          gptq                     14bit_128g
vicuna-v1.3     13b          gptq                     2bit_128g
vicuna-v1.3     13b          gptq                     3bit_128g
vicuna-v1.3     13b          gptq                     4bit_128g
vicuna-v1.3     13b          gptq                     8bit_128g
vicuna-v1.3     7b           gptq                     10bit_128g
vicuna-v1.3     7b           gptq                     12bit_128g
vicuna-v1.3     7b           gptq                     14bit_128g
vicuna-v1.3     7b           gptq                     2bit_128g
vicuna-v1.3     7b           gptq                     3bit_128g
vicuna-v1.3     7b           gptq                     4bit_128g
vicuna-v1.3     7b           gptq                     8bit_128g
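
Each row above maps to a repository and revision following the naming used in the examples: unstructured pruning checkpoints live in vita-group/{base_model}_{comp_method} with the compression degree as the revision, while GPTQ checkpoints take the bit width / group size string as the revision. The helper below is a hypothetical sketch of that mapping; repository names for the 13B and semi-structured entries are assumptions, so verify them on the compressed-llm hub page before relying on them.

def to_repo_and_revision(base_model, comp_method, comp_degree):
    # Hypothetical helper: maps a table row to a (repository, revision) pair,
    # following the naming seen in the examples above.
    if comp_method.endswith('_unstructured'):
        # e.g. ('llama-2-7b', 'wanda_unstructured', 's0.5')
        return f'vita-group/{base_model}_{comp_method}', comp_degree
    if comp_method == 'gptq':
        # e.g. ('vicuna-7b-v1.3', 'gptq', '4bit_128g')
        return f'vita-group/{base_model}_gptq', comp_degree
    # Semi-structured and wanda+gptq checkpoints use their own repository names,
    # e.g. 'vita-group/llama-2-7b_wanda_2_4_gptq_4bit_128g'.
    raise NotImplementedError(comp_method)

print(to_repo_and_revision('llama-2-7b', 'wanda_unstructured', 's0.5'))
print(to_repo_and_revision('vicuna-7b-v1.3', 'gptq', '4bit_128g'))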

Citations

If you use models from this hub, please consider citing our papers.

@article{jaiswal2023emergence,
  title={The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter},
  author={Jaiswal, Ajay and Liu, Shiwei and Chen, Tianlong and Wang, Zhangyang},
  journal={arXiv},
  year={2023}
}
@article{jaiswal2023compressing,
  title={Compressing LLMs: The Truth is Rarely Pure and Never Simple},
  author={Jaiswal, Ajay and Gan, Zhe and Du, Xianzhi and Zhang, Bowen and Wang, Zhangyang and Yang, Yinfei},
  journal={arXiv},
  year={2023}
}

For any questions, please contact Junyuan Hong.
