# GENERator Fine-tuned on OpenGenome2 - HF JSON + DDP
This repository contains checkpoints from fine-tuning GENERator-v2-eukaryote-1.2b-base on the OpenGenome2 dataset.
## Training Details
- Base Model: GenerTeam/GENERator-v2-eukaryote-1.2b-base
- Dataset: OpenGenome2 (eukaryotic genic windows, 5kb)
- Training Configuration: HF JSON + DDP (DistributedDataParallel)
- Number of Checkpoints: 10
- Target Tokens: 20 billion
## Available Checkpoints
This model has 10 checkpoint revisions. Each checkpoint is saved at regular intervals during training.
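The checkpoint revisions can also be enumerated programmatically. The snippet below is a minimal sketch using `huggingface_hub`; it assumes each checkpoint is stored as a branch named `checkpoint-<step>`, matching the naming used in the loading example that follows:

```python
from huggingface_hub import list_repo_refs

# List all branches/tags of the model repository
refs = list_repo_refs("hf-carbon/generator2-opengenome2-eukaryote-20B-hfjson-ddp")

# Assumption: checkpoint revisions are branches named "checkpoint-<step>"
checkpoints = sorted(
    (ref.name for ref in refs.branches if ref.name.startswith("checkpoint-")),
    key=lambda name: int(name.split("-")[-1]),
)
print(checkpoints)
```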
To load a specific checkpoint:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a specific checkpoint (e.g., checkpoint-1000)
model = AutoModelForCausalLM.from_pretrained(
    "hf-carbon/generator2-opengenome2-eukaryote-20B-hfjson-ddp",
    revision="checkpoint-1000",  # Specify the checkpoint
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "hf-carbon/generator2-opengenome2-eukaryote-20B-hfjson-ddp",
    revision="checkpoint-1000",
    trust_remote_code=True,
)
```
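Once loaded, a checkpoint behaves like any Hugging Face causal language model. The example below is an illustrative sketch, assuming the tokenizer accepts raw nucleotide strings and that generic `generate` settings are acceptable; adjust both to your use case:

```python
import torch

# Illustrative DNA prompt (assumption: the tokenizer accepts raw nucleotide strings)
dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
inputs = tokenizer(dna, return_tensors="pt")

# Generate a continuation of the input sequence
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.8,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```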
## Training Configuration
- Learning rate: 1e-4
- Per-device batch size: 1
- Gradient accumulation steps: 16
- Number of GPUs: 8
- Sequence length: 16,384 tokens
- Tokens per step: ~2.1M
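The tokens-per-step figure follows directly from the settings above: 1 (per-device batch) × 16 (accumulation steps) × 8 (GPUs) × 16,384 (sequence length) = 2,097,152 ≈ 2.1M tokens per optimizer step, so the 20B-token target corresponds to roughly 9,500 optimizer steps.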
## Evaluation
Sequence recovery benchmark results show the model's performance across training. See the evaluation plots in the repository.
## Citation
If you use this model, please cite:
```bibtex
@misc{generator2-opengenome2,
  title={GENERator Fine-tuned on OpenGenome2},
  author={Arc Institute},
  year={2024},
  url={https://huggingface.co/hf-carbon/generator2-opengenome2-eukaryote-20B-hfjson-ddp}
}
```
## License
Apache 2.0