GENERator Fine-tuned on OpenGenome2 - HF JSON + DDP

This repository contains checkpoints from fine-tuning GENERator-v2-eukaryote-1.2b-base on the OpenGenome2 dataset.

Training Details

  • Base Model: GenerTeam/GENERator-v2-eukaryote-1.2b-base
  • Dataset: OpenGenome2 (eukaryotic genic windows, 5kb)
  • Training Configuration: HF JSON + DDP (DistributedDataParallel)
  • Number of Checkpoints: 10
  • Target Tokens: 20 billion

Available Checkpoints

This repository contains 10 checkpoint revisions, saved at regular intervals during training.
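
The checkpoint revisions can be listed programmatically with huggingface_hub (a minimal sketch, assuming each checkpoint is stored as a branch of the repository):

from huggingface_hub import list_repo_refs

# Each checkpoint revision is a branch of the model repository
refs = list_repo_refs("hf-carbon/generator2-opengenome2-eukaryote-20B-hfjson-ddp")
for branch in refs.branches:
    print(branch.name)  # e.g. "main", "checkpoint-1000", ...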

To load a specific checkpoint:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a specific checkpoint (e.g., checkpoint-1000)
model = AutoModelForCausalLM.from_pretrained(
    "hf-carbon/generator2-opengenome2-eukaryote-20B-hfjson-ddp",
    revision="checkpoint-1000",  # Specify the checkpoint
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(
    "hf-carbon/generator2-opengenome2-eukaryote-20B-hfjson-ddp",
    revision="checkpoint-1000",
    trust_remote_code=True
)
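
Once loaded, the model behaves like any causal language model over DNA. A minimal generation sketch (the prompt sequence and sampling settings below are placeholders, not taken from the dataset or the training setup):

# Tokenize a DNA prompt and sample a continuation
prompt = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"  # placeholder sequence
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))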

Training Configuration

  • Learning rate: 1e-4
  • Batch size: 1 (per device)
  • Gradient accumulation steps: 16
  • Number of GPUs: 8
  • Sequence length: 16,384 tokens
  • Tokens per step: ~2.1M
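
The tokens-per-step figure follows directly from the values above, and the 20 billion token target then implies roughly 9.5k optimizer steps:

# Effective tokens per optimizer step:
# per-device batch size * gradient accumulation steps * GPUs * sequence length
tokens_per_step = 1 * 16 * 8 * 16_384  # = 2,097,152 (~2.1M)

# Optimizer steps needed to reach the 20B-token target
total_steps = 20_000_000_000 // tokens_per_step  # ~9,536
print(tokens_per_step, total_steps)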

Evaluation

Sequence recovery benchmark results track the model's performance across training checkpoints; see the evaluation plots in the repository.
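
The exact benchmark protocol is not documented here. As an illustration only, one common way to measure sequence recovery for a causal DNA model is to condition on a prefix of a held-out window, greedily generate the remainder, and score per-base identity against the ground truth (the function below is an assumption, not the repository's evaluation code):

def sequence_recovery(model, tokenizer, sequence, prefix_frac=0.5):
    # Split a held-out DNA window into a conditioning prefix and a true suffix
    split = int(len(sequence) * prefix_frac)
    prefix, truth = sequence[:split], sequence[split:]

    # Greedily generate a continuation of (at least) the suffix length
    inputs = tokenizer(prefix, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=len(truth), do_sample=False)
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)[len(prefix):]

    # Per-base identity between the generated and true suffixes
    matches = sum(a == b for a, b in zip(generated, truth))
    return matches / len(truth)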

Citation

If you use this model, please cite:

@misc{generator2-opengenome2,
  title={GENERator Fine-tuned on OpenGenome2},
  author={Arc Institute},
  year={2024},
  url={https://huggingface.co/hf-carbon/generator2-opengenome2-eukaryote-20B-hfjson-ddp}
}

License

Apache 2.0
