
TextUMC

This model is a fine-tuned version of roberta-large on a merged LIAR-RAW and RAWFC fact-checking dataset (see "Training and evaluation data" below). It achieves the following results on the evaluation set:

  • Loss: 1.5534
  • Accuracy: 0.7512
  • Precision: 0.7564
  • Recall: 0.7512
  • F1: 0.7535

The code for training and data preprocessing is available at https://github.com/NoAtmosphere0/TextUMC-training
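
For reference, a minimal usage sketch, assuming the Hub id NoAtmosphere0/Roberta-large-fc and a standard sequence-classification head whose label names come from the checkpoint config:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hub id taken from this model page; assumed to expose a standard
# sequence-classification head with labels stored in the config.
model_id = "NoAtmosphere0/Roberta-large-fc"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

claim = "The sky is green."
inputs = tokenizer(claim, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(dim=-1).item()])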

Model description

The repository uses the TextUMC model, which is designed for unsupervised clustering and evaluation of textual claims and their evidences. The key features and process are as follows:

Core Concepts

  • Claims and Evidences:

    • A Claim consists of a statement, its label, explanation, and a set of supporting evidences.
    • Evidences are textual reports or documents related to a claim.
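
A schematic sketch of these two objects; the field names mirror the data records described later in this card, while the actual classes in the training repository may differ:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Evidence:
    content: str                     # evidence text (a report or document)
    report_id: Optional[str] = None  # present in LIAR-RAW, absent in RAWFC

@dataclass
class Claim:
    claim: str                       # the statement to be verified
    label: str                       # e.g. "true", "false", "half-true"
    explain: str                     # explanation for the label
    reports: List[Evidence] = field(default_factory=list)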

Intended uses & limitations

  • This is a neural network model (details in the model.py file) that generates embeddings for the provided textual evidences.

  • The model supports both unified (normal) training, where a single model is trained on all claims, and per-claim (demonic) training, where a separate model is trained for each claim.

  • Models like TextUMC, which cluster and embed textual claims and evidence, have several limitations:

    • Their performance is highly dependent on the quality and completeness of the input data
    • Noisy or biased data can lead to poor or misleading results
    • The clusters they produce may not always be semantically meaningful or interpretable to humans, and the neural embeddings themselves function as black boxes
    • There is no guarantee that discovered clusters will align with human judgment, and results can be sensitive to clustering metrics and hyperparameters
    • Standard clustering metrics may not fully reflect the real-world utility or quality of clusters, and human evaluation is often needed
    • These models can propagate biases found in the training data and, if used carelessly, can amplify misinformation or label errors

Training and evaluation data

Data Used in This Repository

This repository uses a merged dataset for claim verification and fact-checking tasks, combining two main sources: LIAR-RAW and RAWFC. Below are details about the datasets, preprocessing steps, structure, and statistics.


  1. LIAR-RAW Dataset
  • Contents:
    Each sample contains a claim, its label, an explanation, and a list of supporting reports (evidence).
  • Preprocessing:
    • The reports field is cleaned to retain only report_id and content for each evidence.
    • Label normalization:
      • "mostly-true""true"
      • "barely-true" and "pants-fire""false"
  • Splits:
    Provided as train.json, val.json, and test.json.
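
A sketch of this cleaning and label-normalization step; the loader function and file paths are illustrative, not the repository's exact code:

import json

# Label normalization described above.
LABEL_MAP = {"mostly-true": "true", "barely-true": "false", "pants-fire": "false"}

def load_liar_raw(path):
    """Load one LIAR-RAW split and keep only report_id/content per evidence."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    cleaned = []
    for rec in records:
        cleaned.append({
            "claim": rec["claim"],
            "label": LABEL_MAP.get(rec["label"], rec["label"]),
            "explain": rec.get("explain", ""),
            "reports": [
                {"report_id": r.get("report_id"), "content": r["content"]}
                for r in rec.get("reports", [])
            ],
        })
    return cleaned

liar_train = load_liar_raw("LIAR-RAW/train.json")  # likewise for val.json, test.json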

  2. RAWFC Dataset
  • Contents:
    Fact-checking events, each with an event_id, claim, label, explain, and a list of reports (evidence).
  • Preprocessing:
    • Loads all JSON files from split directories (train, val, test).
    • For each record, reports retain only the content field.
    • The label "half" is renamed to "half-true".
  • Splits:
    Merged into DataFrames for train, validation, and test.
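
A sketch of the RAWFC loading step, assuming one fact-checking event per JSON file; paths and the helper name are illustrative:

import glob
import json
import os

import pandas as pd

def load_rawfc_split(split_dir):
    """Load every JSON file in a RAWFC split directory into one DataFrame."""
    rows = []
    for path in glob.glob(os.path.join(split_dir, "*.json")):
        with open(path, encoding="utf-8") as f:
            rec = json.load(f)  # one event per file (assumed)
        rows.append({
            "claim": rec["claim"],
            "label": "half-true" if rec["label"] == "half" else rec["label"],
            "explain": rec.get("explain", ""),
            "reports": [{"content": r["content"]} for r in rec.get("reports", [])],
        })
    return pd.DataFrame(rows)

rawfc_train = load_rawfc_split("RAWFC/train")  # likewise for val and test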

  3. Merging and Final Dataset
  • After preprocessing, the LIAR-RAW and RAWFC splits are concatenated for each of train, validation, and test.
  • The validation split is also merged into the training data for a larger training set.
  • Final Output:
    • train.json, test.json (and optionally validation) in a unified format.
    • The merged dataset is also converted to HuggingFace DatasetDict format for use in machine learning pipelines.
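
A sketch of the merge, building on the two loader sketches above; the variable names are assumptions, and the validation split is folded into training as described:

from datasets import Dataset, DatasetDict

# liar_train/liar_val/liar_test are record lists from the LIAR-RAW sketch;
# rawfc_train/rawfc_val/rawfc_test are DataFrames from the RAWFC sketch.
train_records = (liar_train + liar_val
                 + rawfc_train.to_dict("records")
                 + rawfc_val.to_dict("records"))
test_records = liar_test + rawfc_test.to_dict("records")

dataset = DatasetDict({
    "train": Dataset.from_list(train_records),
    "test": Dataset.from_list(test_records),
})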

  4. Data Structure

Each data record (after preprocessing and merging) contains:

  • claim: The main statement to be verified.
  • label: The truthfulness label ("true", "false", "half-true", etc.).
  • explain: Explanation for the claim's label.
  • reports: List of evidence, each as a dictionary:
    • report_id (if available)
    • content: Evidence text.

Example:

{
  "claim": "The sky is green.",
  "label": "false",
  "explain": "Scientific consensus says the sky appears blue due to Rayleigh scattering.",
  "reports": [
    {"report_id": "1234567", "content": "A NASA article explains why the sky is blue."},
    {"report_id": "2345678", "content": "Physics textbook reference on atmospheric optics."}
  ]
}

5. Statistics

  • The notebook reports over 167,000 unique reports (evidence texts) in the training set, indicating a rich and diverse dataset.

6. Storage and Usage

  • The preprocessed datasets are saved as JSON files and as disk-based HuggingFace datasets.
  • The dataset is pushed to the HuggingFace Hub for sharing and reproducibility.
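
A sketch of this storage step, assuming the dataset object from the merge sketch above; the Hub dataset id shown is a placeholder:

# `dataset` is the DatasetDict built in the merge sketch above.
dataset["train"].to_json("train.json")
dataset["test"].to_json("test.json")
dataset.save_to_disk("merged_fc_dataset")               # disk-based copy
dataset.push_to_hub("NoAtmosphere0/merged-fc-dataset")  # placeholder repo id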

7. Summary Table

| Dataset  | Files/Splits   | Label Normalization                                 | Evidence Structure    | Merged Into           |
|----------|----------------|-----------------------------------------------------|-----------------------|-----------------------|
| LIAR-RAW | train/val/test | mostly-true → true; barely-true, pants-fire → false | reports: id, content  | train, test           |
| RAWFC    | train/val/test | half → half-true                                    | reports: content only | train, test           |
| Merged   | all above      | consistent as above                                 | unified format        | train.json, test.json |

Training procedure

  • Data Loading:

    • Claims and their evidences are loaded from JSON files. Claims with too few evidences are skipped.
  • Training Modes:

    • Normal: One model is trained on all evidences from all claims.
    • Demonic: Separate models are trained for each claim, focusing on the evidences attached to that claim.
  • Clustering:

    • Evidence embeddings are clustered using KMeans.
    • The optimal number of clusters can be determined using metrics such as the silhouette score (see the sketch after this list).
  • Loss Functions:

    • Unsupervised Contrastive Loss: Encourages similar evidences to have similar embeddings.
    • Supervised Contrastive Loss: Uses pseudo-labels from KMeans clustering to refine embeddings.
  • Evaluation

    • Clustering Evaluation:

      • For each claim, the evidence embeddings are clustered, and metrics such as silhouette score, Calinski-Harabasz, and Davies-Bouldin index are computed.
      • The best clustering configuration for each metric is selected, and evidences are grouped accordingly.
      • Results are aggregated and saved for further analysis.
    • Visualization:

      • Optionally, the clusters can be visualized using PCA or t-SNE.
  • Additional Details

    • The code is designed to run on GPU if available.
    • Detailed logging, metrics, and results are saved during the training and evaluation process.
    • Hyperparameters (like batch size, learning rate, number of clusters, etc.) are configurable via command-line arguments.
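
A minimal sketch of the clustering step referenced above: KMeans over evidence embeddings with silhouette-based selection of k. This is illustrative, not the repository's exact implementation:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_evidences(embeddings: np.ndarray, k_range=range(2, 10)):
    """Cluster evidence embeddings and keep the k with the best silhouette score."""
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_range:
        if k >= len(embeddings):
            break
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    # The returned assignments can also serve as pseudo-labels for the
    # supervised contrastive loss mentioned above.
    return best_labels, best_k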

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 4
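
For orientation, a sketch of how these values map onto transformers TrainingArguments; the output directory and any omitted settings are placeholders, not taken from the model card:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="roberta-large-fc",  # placeholder
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=4,
)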

Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy | Precision | Recall | F1     |
|---------------|-------|------|-----------------|----------|-----------|--------|--------|
| No log        | 0     | 0    | 1.1434          | 0.3304   | 0.1092    | 0.3304 | 0.1641 |
| 0.6141        | 1.0   | 1460 | 0.6297          | 0.7483   | 0.7535    | 0.7483 | 0.7451 |
| 0.6133        | 2.0   | 2920 | 0.6568          | 0.7619   | 0.7566    | 0.7619 | 0.7577 |
| 0.2901        | 3.0   | 4380 | 0.9584          | 0.7476   | 0.7646    | 0.7476 | 0.7533 |
| 0.1316        | 4.0   | 5840 | 1.3871          | 0.7639   | 0.7729    | 0.7639 | 0.7675 |

Framework versions

  • Transformers 4.46.3
  • Pytorch 2.5.1+cu121
  • Datasets 3.1.0
  • Tokenizers 0.20.3