TextUMC
This model is a fine-tuned version of roberta-large on a merged LIAR-RAW and RAWFC fact-checking dataset (described below). It achieves the following results on the evaluation set:
- Loss: 1.5534
- Accuracy: 0.7512
- Precision: 0.7564
- Recall: 0.7512
- F1: 0.7535
The code for training and data preprocessing used for this model is available at https://github.com/NoAtmosphere0/TextUMC-training
Model description
The repository uses the TextUMC model, which is designed for unsupervised clustering and evaluation of textual claims and their supporting evidence. The key concepts and workflow are as follows:
Core Concepts
Claims and Evidence:
- A claim consists of a statement, its label, an explanation, and a set of supporting evidence.
- Evidence consists of textual reports or documents related to a claim.
Intended uses & limitations
This is a neural network model (details in the model.py file) that generates embeddings for the provided textual evidence.
The model supports both unified (normal) training, where a single model is trained on all claims, and per-claim (demonic) training, where a separate model is trained for each claim.
Models like TextUMC, which cluster and embed textual claims and evidence, have several limitations:
- Their performance is highly dependent on the quality and completeness of the input data
- Noisy or biased data can lead to poor or misleading results
- The clusters they produce may not always be semantically meaningful or interpretable to humans, and the neural embeddings themselves function as black boxes
- There is no guarantee that discovered clusters will align with human judgment, and results can be sensitive to clustering metrics and hyperparameters
- Standard clustering metrics may not fully reflect the real-world utility or quality of clusters, and human evaluation is often needed
- These models can propagate biases found in the training data and, if used carelessly, can amplify misinformation or label errors
Training and evaluation data
Data Used in This Repository
This repository uses a merged dataset for claim verification and fact-checking tasks, combining two main sources: LIAR-RAW and RAWFC. Below are details about the datasets, preprocessing steps, structure, and statistics.
1. LIAR-RAW Dataset
   - Contents: Each sample contains a `claim`, its `label`, an `explanation`, and a list of supporting `reports` (evidence).
   - Preprocessing:
     - The `reports` field is cleaned to retain only `report_id` and `content` for each piece of evidence.
     - Label normalization: `"mostly-true"` → `"true"`; `"barely-true"` and `"pants-fire"` → `"false"`.
   - Splits: Provided as `train.json`, `val.json`, and `test.json`.
2. RAWFC Dataset
   - Contents: Fact-checking events, each with an `event_id`, `claim`, `label`, `explain`, and a list of `reports` (evidence).
   - Preprocessing:
     - All JSON files are loaded from the split directories (`train`, `val`, `test`).
     - For each record, reports retain only the `content` field.
     - The label `"half"` is renamed to `"half-true"`.
   - Splits: Merged into DataFrames for train, validation, and test.
3. Merging and Final Dataset
   - After preprocessing, the LIAR-RAW and RAWFC splits are concatenated for each of train, validation, and test.
   - The validation split is also merged into the training data to obtain a larger training set.
   - Final output: `train.json` and `test.json` (and optionally a validation file) in a unified format.
   - The merged dataset is also converted to HuggingFace `DatasetDict` format for use in machine learning pipelines (see the preprocessing sketch after the summary table below).
4. Data Structure
   Each data record (after preprocessing and merging) contains:
   - `claim`: the main statement to be verified.
   - `label`: the truthfulness label (`"true"`, `"false"`, `"half-true"`, etc.).
   - `explain`: the explanation for the claim's label.
   - `reports`: a list of evidence items, each as a dictionary with:
     - `report_id` (if available)
     - `content`: the evidence text.
Example:
```json
{
  "claim": "The sky is green.",
  "label": "false",
  "explain": "Scientific consensus says the sky appears blue due to Rayleigh scattering.",
  "reports": [
    {"report_id": "1234567", "content": "A NASA article explains why the sky is blue."},
    {"report_id": "2345678", "content": "Physics textbook reference on atmospheric optics."}
  ]
}
```
5. Statistics
- The notebook reports over 167,000 unique reports (evidence texts) in the training set, indicating a rich and diverse dataset.
6. Storage and Usage
- The preprocessed datasets are saved as JSON files and as disk-based HuggingFace datasets.
- The dataset is pushed to the HuggingFace Hub for sharing and reproducibility.
7. Summary Table
| Dataset | Files/Splits | Label Normalization | Evidence Structure | Merged Into |
|---|---|---|---|---|
| LIAR-RAW | train/val/test | mostly-true→true, barely-true/pants-fire→false | reports: id, content | train, test |
| RAWFC | train/val/test | half→half-true | reports: content only | train, test |
| Merged | All above | Consistent as above | Unified format | train.json, test.json |
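The sketch below illustrates the preprocessing and merging steps described above. It is not the repository's exact code: the file paths, the `LIAR_LABEL_MAP` mapping, and the `load_liar_raw`/`load_rawfc` helpers are assumptions reconstructed from this description.

```python
import json
import glob
import os
from datasets import Dataset, DatasetDict

# Hypothetical label mapping reconstructed from the description above.
LIAR_LABEL_MAP = {"mostly-true": "true", "barely-true": "false", "pants-fire": "false"}

def load_liar_raw(path):
    """Load one LIAR-RAW split and normalize it to the unified record format."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    return [{
        "claim": rec["claim"],
        "label": LIAR_LABEL_MAP.get(rec["label"], rec["label"]),
        "explain": rec.get("explain", ""),
        "reports": [{"report_id": r.get("report_id"), "content": r["content"]}
                    for r in rec.get("reports", [])],
    } for rec in records]

def load_rawfc(split_dir):
    """Load all RAWFC JSON files in a split directory and normalize them."""
    records = []
    for fp in glob.glob(os.path.join(split_dir, "*.json")):
        with open(fp, encoding="utf-8") as f:
            rec = json.load(f)
        records.append({
            "claim": rec["claim"],
            "label": "half-true" if rec["label"] == "half" else rec["label"],
            "explain": rec.get("explain", ""),
            "reports": [{"content": r["content"]} for r in rec.get("reports", [])],
        })
    return records

# Merge both sources per split and fold validation into training (paths are placeholders).
train = (load_liar_raw("LIAR-RAW/train.json") + load_rawfc("RAWFC/train")
         + load_liar_raw("LIAR-RAW/val.json") + load_rawfc("RAWFC/val"))
test = load_liar_raw("LIAR-RAW/test.json") + load_rawfc("RAWFC/test")

dataset = DatasetDict({
    "train": Dataset.from_list(train),
    "test": Dataset.from_list(test),
})
dataset.save_to_disk("merged_fc_dataset")   # or dataset.push_to_hub("<repo_id>")
```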
Training procedure
Data Loading:
- Claims and their evidence are loaded from JSON files. Claims with too few evidence items are skipped.
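A minimal sketch of this filtering step, assuming the unified JSON format shown above; the `MIN_EVIDENCE` threshold is an assumption, not the repository's actual cutoff.

```python
import json

MIN_EVIDENCE = 2  # assumed threshold; the real value is set in the training script

def load_claims(path, min_evidence=MIN_EVIDENCE):
    """Load claims from a preprocessed JSON file, skipping claims with too little evidence."""
    with open(path, encoding="utf-8") as f:
        claims = json.load(f)
    return [c for c in claims if len(c.get("reports", [])) >= min_evidence]

claims = load_claims("train.json")
```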
Training Modes:
- Normal: One model is trained on all evidences from all claims.
- Demonic: Separate models are trained for each claim, focusing on the evidence attached to that claim.
Clustering:
- Evidence embeddings are clustered using KMeans.
- The optimal number of clusters can be determined using metrics such as the silhouette score.
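A minimal sketch of this step, assuming scikit-learn's KMeans and silhouette-based selection of k; the candidate range of k is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_evidence(embeddings: np.ndarray, k_range=range(2, 8)):
    """Cluster evidence embeddings with KMeans, choosing k by silhouette score."""
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_range:
        if k >= len(embeddings):      # silhouette needs fewer clusters than samples
            break
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```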
Loss Functions:
- Unsupervised Contrastive Loss: Encourages similar pieces of evidence to have similar embeddings.
- Supervised Contrastive Loss: Uses pseudo-labels from KMeans clustering to refine embeddings.
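To illustrate the second loss, here is a minimal SupCon-style PyTorch sketch that treats KMeans cluster assignments as pseudo-labels. The function name, temperature, and exact formulation are assumptions, not the repository's code.

```python
import torch
import torch.nn.functional as F

def pseudo_label_contrastive_loss(embeddings: torch.Tensor,
                                  pseudo_labels: torch.Tensor,
                                  temperature: float = 0.1) -> torch.Tensor:
    """SupCon-style loss: evidence items sharing a pseudo-label (e.g. the same
    KMeans cluster) are pulled together; all other pairs are pushed apart."""
    z = F.normalize(embeddings, dim=1)                 # (N, d) unit vectors
    sim = z @ z.T / temperature                        # (N, N) scaled cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (pseudo_labels.unsqueeze(0) == pseudo_labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))    # never contrast a sample with itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                             # anchors with at least one positive
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    return -(pos_log_prob[valid] / pos_counts[valid]).mean()
```

Usage would look like `loss = pseudo_label_contrastive_loss(embeddings, torch.as_tensor(kmeans_labels))`.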
Evaluation
Clustering Evaluation:
- For each claim, the evidence embeddings are clustered, and metrics such as the silhouette score, Calinski-Harabasz index, and Davies-Bouldin index are computed.
- The best clustering configuration for each metric is selected, and evidences are grouped accordingly.
- Results are aggregated and saved for further analysis.
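A short sketch of the per-claim metric computation, assuming scikit-learn's implementations of the three indices:

```python
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

def clustering_metrics(embeddings, labels):
    """Internal clustering indices for one claim's evidence embeddings.
    Higher is better for silhouette and Calinski-Harabasz; lower for Davies-Bouldin."""
    return {
        "silhouette": silhouette_score(embeddings, labels),
        "calinski_harabasz": calinski_harabasz_score(embeddings, labels),
        "davies_bouldin": davies_bouldin_score(embeddings, labels),
    }
```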
Visualization:
- Optionally, the clusters can be visualized using PCA or t-SNE.
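A minimal t-SNE visualization sketch using scikit-learn and matplotlib; the plot styling and output path are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_clusters(embeddings, labels, out_path="clusters.png"):
    """Project embeddings to 2D with t-SNE and color points by cluster label."""
    perplexity = min(30, len(embeddings) - 1)   # perplexity must be < number of samples
    coords = TSNE(n_components=2, perplexity=perplexity, init="pca",
                  random_state=42).fit_transform(embeddings)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=15)
    plt.title("Evidence embedding clusters (t-SNE)")
    plt.savefig(out_path, dpi=150)
    plt.close()
```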
Additional Details
- The code is designed to run on GPU if available.
- Detailed logging, metrics, and results are saved during the training and evaluation process.
- Hyperparameters (like batch size, learning rate, number of clusters, etc.) are configurable via command-line arguments.
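The flag names below are illustrative assumptions only; they show the kind of command-line configuration described above, not the repository's actual interface.

```python
import argparse

parser = argparse.ArgumentParser(description="TextUMC training (illustrative flags only)")
parser.add_argument("--batch_size", type=int, default=8)
parser.add_argument("--learning_rate", type=float, default=1e-5)
parser.add_argument("--num_clusters", type=int, default=None,
                    help="fixed k; if omitted, k is chosen by silhouette score")
parser.add_argument("--mode", choices=["normal", "demonic"], default="normal")
args = parser.parse_args()
```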
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 4
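For reference, these settings roughly correspond to the following Hugging Face `TrainingArguments` (a reconstruction assuming the standard `Trainer` was used; `output_dir` is a placeholder):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="roberta-large-fc",   # placeholder
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=4,
)
```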
Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|
| No log | 0 | 0 | 1.1434 | 0.3304 | 0.1092 | 0.3304 | 0.1641 |
| 0.6141 | 1.0 | 1460 | 0.6297 | 0.7483 | 0.7535 | 0.7483 | 0.7451 |
| 0.6133 | 2.0 | 2920 | 0.6568 | 0.7619 | 0.7566 | 0.7619 | 0.7577 |
| 0.2901 | 3.0 | 4380 | 0.9584 | 0.7476 | 0.7646 | 0.7476 | 0.7533 |
| 0.1316 | 4.0 | 5840 | 1.3871 | 0.7639 | 0.7729 | 0.7639 | 0.7675 |
Framework versions
- Transformers 4.46.3
- Pytorch 2.5.1+cu121
- Datasets 3.1.0
- Tokenizers 0.20.3
Model tree for NoAtmosphere0/Roberta-large-fc
- Base model: FacebookAI/roberta-large