GENERator Fine-tuned on OpenGenome2 - HF JSON + DDP

This repository contains checkpoints from fine-tuning GENERator-v2-eukaryote-1.2b-base on the OpenGenome2 dataset.

Training Details

  • Base Model: GenerTeam/GENERator-v2-eukaryote-1.2b-base
  • Dataset: OpenGenome2 (eukaryotic genic windows, 5kb)
  • Training Configuration: HF JSON + DDP (DistributedDataParallel)
  • Number of Checkpoints: 10
  • Target Tokens: 20 billion

Available Checkpoints

This repository contains 10 checkpoint revisions, saved at regular intervals during training.
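
The checkpoint revisions can be listed programmatically with huggingface_hub (a minimal sketch, assuming each checkpoint is stored as a branch of the repository):

from huggingface_hub import list_repo_refs

# Each checkpoint revision is a branch of the model repository
refs = list_repo_refs("hf-carbon/generator2-opengenome2-eukaryote-20B-hfjson-ddp")
for branch in refs.branches:
    print(branch.name)  # e.g. "main", "checkpoint-1000", ...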

To load a specific checkpoint:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a specific checkpoint (e.g., checkpoint-1000)
model = AutoModelForCausalLM.from_pretrained(
    "hf-carbon/generator2-opengenome2-eukaryote-20B-hfjson-ddp",
    revision="checkpoint-1000",  # Specify the checkpoint
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(
    "hf-carbon/generator2-opengenome2-eukaryote-20B-hfjson-ddp",
    revision="checkpoint-1000",
    trust_remote_code=True
)
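
Once loaded, the model behaves like any causal language model over DNA. A minimal generation sketch (the prompt sequence and sampling settings below are placeholders, not taken from the dataset or the training setup):

# Tokenize a DNA prompt and sample a continuation
prompt = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"  # placeholder sequence
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))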

Training Configuration

  • Learning rate: 1e-4
  • Batch size: 1 (per device)
  • Gradient accumulation steps: 16
  • Number of GPUs: 8
  • Sequence length: 16,384 tokens
  • Tokens per step: ~2.1M
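
The tokens-per-step figure follows directly from the values above, and the 20 billion token target then implies roughly 9.5k optimizer steps:

# Effective tokens per optimizer step:
# per-device batch size * gradient accumulation steps * GPUs * sequence length
tokens_per_step = 1 * 16 * 8 * 16_384  # = 2,097,152 (~2.1M)

# Optimizer steps needed to reach the 20B-token target
total_steps = 20_000_000_000 // tokens_per_step  # ~9,536
print(tokens_per_step, total_steps)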

Evaluation

Sequence recovery benchmark results track the model's performance across training checkpoints; see the evaluation plots in the repository.
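
The exact benchmark protocol is not documented here. As an illustration only, one common way to measure sequence recovery for a causal DNA model is to condition on a prefix of a held-out window, greedily generate the remainder, and score per-base identity against the ground truth (the function below is an assumption, not the repository's evaluation code):

def sequence_recovery(model, tokenizer, sequence, prefix_frac=0.5):
    # Split a held-out DNA window into a conditioning prefix and a true suffix
    split = int(len(sequence) * prefix_frac)
    prefix, truth = sequence[:split], sequence[split:]

    # Greedily generate a continuation of (at least) the suffix length
    inputs = tokenizer(prefix, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=len(truth), do_sample=False)
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)[len(prefix):]

    # Per-base identity between the generated and true suffixes
    matches = sum(a == b for a, b in zip(generated, truth))
    return matches / len(truth)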

Citation

If you use this model, please cite:

@misc{generator2-opengenome2,
  title={GENERator Fine-tuned on OpenGenome2},
  author={Arc Institute},
  year={2024},
  url={https://huggingface.co/hf-carbon/generator2-opengenome2-eukaryote-20B-hfjson-ddp}
}

License

Apache 2.0
