# LLaMA 1B - Fine-tuned Language Model
This is a ~1B-parameter LLaMA-architecture model trained on the FineWeb-Edu dataset (100BT sample) with a tuned learning rate of 4e-4 and a 4096-token sequence length.
## Model Details
- Model Name: llama_1B_lr_4e-4_100bt
- Architecture: LLaMA (Large Language Model Meta AI)
- Parameters: ~1B parameters
- Training Step: 340,000
- Sequence Length: 4096
- Vocabulary Size: 128256
## Architecture Details
### Model Configuration
- Hidden Dimension: 2048
- Number of Layers: 18
- Number of Heads: 16
- Head Dimension: None
- KV Heads: None
- Max Sequence Length: 4096
- RoPE Theta: 10000.0
- Norm Epsilon: 1e-05
- FFN Dimension Multiplier: None
- Weight Tying: False
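
For reference, the sketch below lays out the hyperparameters above as a plain Python configuration. The field names are assumptions (the uploaded `params.json` is authoritative), and the `None` entries fall back to the training framework's defaults.

```python
# Sketch of the architecture configuration above as a Python dict.
# Field names are assumptions; see the uploaded params.json for the exact schema.
model_args = {
    "dim": 2048,                 # hidden dimension
    "n_layers": 18,
    "n_heads": 16,
    "head_dim": None,            # framework default, typically dim // n_heads = 128
    "n_kv_heads": None,          # framework default, typically equal to n_heads (no GQA)
    "max_seqlen": 4096,
    "rope_theta": 10000.0,
    "norm_eps": 1e-05,
    "ffn_dim_multiplier": None,  # framework default FFN width
    "vocab_size": 128256,
    "weight_tying": False,
}
```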
## Training Details
### Data
- Dataset: fineweb_edu_100bt_shuffled
- Batch Size: 9
- Tokenizer: tiktoken
- Tokenizer Path: /fsx-pretraining/home/chunyyyy/blt/bytelatent/tokenizers/original/tokenizer.model
- Add BOS Token: True
- Add EOS Token: True
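
The card does not spell out how training sequences are built, but with BOS/EOS enabled and a 4096-token sequence length, document packing along the lines of the sketch below is a common setup; `tokenizer` here is a stand-in for the tiktoken-based tokenizer listed above, and the packing itself is illustrative rather than taken from the training code.

```python
# Illustrative sketch only: prepend BOS and append EOS to each document, then pack the
# token stream into fixed-length 4096-token training sequences (remainder dropped).
def pack_documents(documents, tokenizer, seq_len=4096):
    stream = []
    for text in documents:
        stream.append(tokenizer.bos_id)        # Add BOS Token: True
        stream.extend(tokenizer.encode(text))
        stream.append(tokenizer.eos_id)        # Add EOS Token: True
    return [stream[i:i + seq_len] for i in range(0, len(stream) - seq_len + 1, seq_len)]
```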
### Optimization
- Learning Rate: 0.0004
- Weight Decay: 0.1
- Scheduler: cosine
- Warmup Steps: 5000
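
The optimizer itself is not listed in this section, so the sketch below assumes AdamW; it shows how the learning rate, weight decay, warmup, and cosine schedule above combine, using the 340,000 training steps from Model Details as the decay horizon.

```python
# Sketch of the schedule above: linear warmup over 5,000 steps to the peak LR of 4e-4,
# then cosine decay over the remaining steps. AdamW and the decay-to-zero floor are assumptions.
import math
import torch

def warmup_cosine(step, warmup_steps=5_000, total_steps=340_000):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                       # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))  # cosine decay

model = torch.nn.Linear(2048, 2048)  # stand-in module; the real model is the 1B LLaMA
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)
# In the training loop, call optimizer.step() followed by scheduler.step() each step.
```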
### Distributed Training
- Data Parallel Replicas: 8
- Model Dtype: bf16
- FSDP Type: full_shard
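
A minimal sketch of how the `full_shard` / `bf16` settings above map onto PyTorch FSDP with 8 data-parallel ranks; the actual framework's wrapping policy and launcher are not described in this card, and `build_llama_1b()` is a hypothetical constructor.

```python
# Sketch: fully sharded data parallelism with bf16 mixed precision, as in the settings above.
# Launch with e.g. `torchrun --nproc_per_node=8 train.py` (8 data-parallel replicas).
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = build_llama_1b()  # hypothetical constructor for the ~1B-parameter model
model = FSDP(
    model.cuda(),
    sharding_strategy=ShardingStrategy.FULL_SHARD,   # FSDP Type: full_shard
    mixed_precision=MixedPrecision(                  # Model Dtype: bf16
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
)
```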
## Usage
This model uses the LLaMA architecture, and its weights are stored as PyTorch distributed checkpoint files (`*.distcp`). The checkpoint can be loaded with PyTorch and, after conversion, with the Hugging Face Transformers framework.
```python
# Example loading code
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

# Note: the checkpoint is saved in a distributed format (*.distcp) by the training
# framework; it must be consolidated and its parameter names mapped to Hugging Face
# conventions before from_pretrained() can load it. See the conversion sketch below.
```
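
The `*.distcp` files suggest PyTorch's `torch.distributed.checkpoint` (DCP) format; if that is the case, the checkpoint can be consolidated into a single `torch.save`-style file roughly as below. Paths are placeholders, and the resulting parameter names may still need remapping to the Hugging Face layout.

```python
# Sketch: consolidate a *.distcp distributed checkpoint into one file, assuming it uses
# torch.distributed.checkpoint (DCP). Paths are placeholders.
import torch
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

dcp_to_torch_save("path/to/checkpoint_dir", "consolidated.pt")

state = torch.load("consolidated.pt", map_location="cpu")
print(list(state.keys())[:10])   # inspect parameter names before mapping to the HF layout
```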
## Evaluation Tasks
The model was evaluated with the following configuration:
- Validation Steps: 1000
- Validation Source: /fsx-pretraining/home/sllokega/intern_workspace/data/fineweb_edu_10bt_val
- Generator Max Tokens: 4096
- Temperature: 1.0
- Top-p: 0.95
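
Once converted to a Hugging Face-compatible checkpoint (see Usage), the generation settings listed above map onto `generate()` roughly as follows; the model path and prompt are placeholders.

```python
# Sketch: sampling with the generation settings listed above (placeholder model path;
# assumes a checkpoint already converted to the Hugging Face format).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "path/to/converted_model"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16)

inputs = tokenizer("The history of public education", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=4096,   # Generator Max Tokens
    do_sample=True,
    temperature=1.0,       # Temperature
    top_p=0.95,            # Top-p
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```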
## Training Configuration
The complete training configuration is preserved in the uploaded files (see `config.yaml` below).
## Files Description
- `*.distcp`: Distributed checkpoint files containing model weights
- `params.json`: Model parameters and configuration
- `train_state_*.json`: Training state information, including optimizer states
- `config.yaml`: Complete training configuration
## Citation
If you use this model, please cite the LLaMA paper and the FineWeb-Edu dataset.