# RoBERTa-Base Model for Temporal Information Extraction

This repository hosts a fine-tuned version of RoBERTa for **temporal information extraction**: the model identifies and extracts time-related expressions (e.g., dates and durations) from text. The pipeline covers preprocessing, fine-tuning, and inference on labeled temporal datasets.

---

## Model Details

- **Model Name:** RoBERTa-Base
- **Model Architecture:** RoBERTa Token Classification
- **Task:** Temporal Entity Extraction
- **Dataset:** Custom JSON format with annotated temporal SPO triples
- **Fine-tuning Framework:** Hugging Face Transformers
- **Output Labels:** `B-TIMEX`, `I-TIMEX`, `O`

---

## Usage

### Installation

```bash
pip install transformers datasets evaluate
```

### Loading the Fine-Tuned Model

```python
from transformers import RobertaTokenizerFast, RobertaForTokenClassification
import torch

# Load the fine-tuned model and tokenizer from the local directory
model = RobertaForTokenClassification.from_pretrained("./temporal_model")
tokenizer = RobertaTokenizerFast.from_pretrained("./temporal_model", add_prefix_space=True)
model.eval()

# Map predicted class ids back to BIO labels using the model config
id2label = model.config.id2label


def extract_temporal_entities(text):
    """Return the temporal expressions (TIMEX spans) found in `text`."""
    tokens = text.split()
    inputs = tokenizer(tokens, return_tensors="pt", is_split_into_words=True)
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = outputs.logits.argmax(dim=-1).squeeze().tolist()
    word_ids = inputs.word_ids(batch_index=0)

    temporal_spans = []
    current = []
    prev_word_idx = None
    for idx, word_idx in enumerate(word_ids):
        # Skip special tokens, and skip subword pieces beyond the first
        # so each input word is counted exactly once
        if word_idx is None or word_idx == prev_word_idx:
            continue
        prev_word_idx = word_idx
        label = id2label[predictions[idx]]
        if label == "B-TIMEX":
            if current:
                temporal_spans.append(" ".join(current))
            current = [tokens[word_idx]]
        elif label == "I-TIMEX":
            current.append(tokens[word_idx])
        else:
            if current:
                temporal_spans.append(" ".join(current))
            current = []
    if current:
        temporal_spans.append(" ".join(current))
    return temporal_spans
```

A quick usage example for this function appears at the end of this README.

## Performance Metrics

- **Evaluation Accuracy:** ~0.76
- **F1 Score:** tracked with `seqeval` (BIO format)
- **Evaluation Strategy:** `epoch`

## Fine-Tuning Details

### Dataset

The dataset consists of manually or script-labeled SPO-style JSON entries with the following fields:

- `text`: the raw input string
- `spo_list`: a list of subject-predicate-object relations, each including:
  - subject and object spans
  - a type (e.g., Date, Location)

The text is tokenized, and BIO labels are aligned to the tokens for token classification. An illustrative entry is shown at the end of this README.

## Training Configuration

- **Epochs:** 3
- **Batch Size:** 16
- **Learning Rate:** 2e-5
- **Max Sequence Length:** 128 tokens
- **Tokenizer:** `roberta-base` with `add_prefix_space=True`

A minimal training sketch using these settings appears at the end of this README.

## Repository Structure

```
.
├── temporal_model/                        # Fine-tuned model and tokenizer
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── tokenizer_config.json
│   ├── vocab.json
│   └── special_tokens_map.json
├── temporal-information-extraction.ipynb
├── README.md
```

## Limitations

- The model is domain-specific; generalizing to other kinds of temporal expressions (e.g., in informal text) may require additional training.
- BIO tagging cannot represent overlapping or nested entities.

## Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request to improve model performance or add new datasets.
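
## Example: Extracting Temporal Entities

A minimal usage sketch for the `extract_temporal_entities` function defined in the Usage section. The input sentence and the expected spans are illustrative; actual output depends on the fine-tuned weights.

```python
# Illustrative call; the expected spans below assume the model tags both
# expressions correctly, which is not guaranteed.
text = "The meeting was moved from March 3rd to next Friday."
print(extract_temporal_entities(text))
# e.g. ['March 3rd', 'next Friday']
```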
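
## Example: Dataset Entry

An illustrative SPO-style entry in the format described under Fine-Tuning Details. Only `text` and `spo_list` are documented field names; the keys inside each relation (`subject`, `predicate`, `object`, `object_type`) are assumptions for illustration.

```json
{
  "text": "The company was founded on 12 May 1998 in Berlin.",
  "spo_list": [
    {
      "subject": "The company",
      "predicate": "founded on",
      "object": "12 May 1998",
      "object_type": "Date"
    },
    {
      "subject": "The company",
      "predicate": "founded in",
      "object": "Berlin",
      "object_type": "Location"
    }
  ]
}
```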
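
## Example: Training Sketch

A minimal fine-tuning sketch that reproduces the settings listed under Training Configuration with the Hugging Face `Trainer`. The label-to-id ordering and the toy training example are assumptions; in practice the dataset is built from the BIO-labeled SPO entries. Depending on your `transformers` version, `evaluation_strategy` may instead be named `eval_strategy`.

```python
from datasets import Dataset
from transformers import (DataCollatorForTokenClassification,
                          RobertaForTokenClassification, RobertaTokenizerFast,
                          Trainer, TrainingArguments)

label2id = {"O": 0, "B-TIMEX": 1, "I-TIMEX": 2}   # ordering is an assumption
id2label = {v: k for k, v in label2id.items()}

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base", add_prefix_space=True)
model = RobertaForTokenClassification.from_pretrained(
    "roberta-base", num_labels=len(label2id), id2label=id2label, label2id=label2id
)

# Toy stand-in for the real BIO-labeled corpus
examples = [
    {"words": ["Founded", "on", "12", "May", "1998", "."],
     "tags": ["O", "O", "B-TIMEX", "I-TIMEX", "I-TIMEX", "O"]},
]

def tokenize_and_align(example):
    # Tokenize pre-split words and propagate BIO labels to subword tokens
    enc = tokenizer(example["words"], truncation=True, max_length=128,
                    is_split_into_words=True)
    labels, prev = [], None
    for word_idx in enc.word_ids():
        if word_idx is None or word_idx == prev:
            labels.append(-100)        # special tokens and extra subword pieces
        else:                          # are ignored by the loss
            labels.append(label2id[example["tags"][word_idx]])
        prev = word_idx
    enc["labels"] = labels
    return enc

dataset = Dataset.from_list(examples).map(
    tokenize_and_align, remove_columns=["words", "tags"]
)

args = TrainingArguments(
    output_dir="./temporal_model",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    eval_dataset=dataset,              # reuse the toy set for illustration
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
trainer.save_model("./temporal_model")
tokenizer.save_pretrained("./temporal_model")
```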