
TextUMC

This model is a fine-tuned version of roberta-large on a merged LIAR-RAW and RAWFC fact-checking dataset (see "Training and evaluation data" below). It achieves the following results on the evaluation set:

  • Loss: 1.5534
  • Accuracy: 0.7512
  • Precision: 0.7564
  • Recall: 0.7512
  • F1: 0.7535

The code for training and data preprocessing is available at https://github.com/NoAtmosphere0/TextUMC-training
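
For reference, a minimal usage sketch, assuming the Hub id NoAtmosphere0/Roberta-large-fc and a standard sequence-classification head whose label names come from the checkpoint config:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hub id taken from this model page; assumed to expose a standard
# sequence-classification head with labels stored in the config.
model_id = "NoAtmosphere0/Roberta-large-fc"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

claim = "The sky is green."
inputs = tokenizer(claim, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(dim=-1).item()])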

Model description

The repository uses the TextUMC model, which is designed for unsupervised clustering and evaluation of textual claims and their evidences. The key features and process are as follows:

Core Concepts

  • Claims and Evidences:

    • A Claim consists of a statement, its label, explanation, and a set of supporting evidences.
    • Evidences are textual reports or documents related to a claim.
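
A schematic sketch of these two objects; the field names mirror the data records described later in this card, while the actual classes in the training repository may differ:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Evidence:
    content: str                     # evidence text (a report or document)
    report_id: Optional[str] = None  # present in LIAR-RAW, absent in RAWFC

@dataclass
class Claim:
    claim: str                       # the statement to be verified
    label: str                       # e.g. "true", "false", "half-true"
    explain: str                     # explanation for the label
    reports: List[Evidence] = field(default_factory=list)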

Intended uses & limitations

  • This is a neural network model (details in the model.py file) that generates embeddings for the provided textual evidences.

  • The model supports both unified (normal) training, where a single model is trained on all claims, and per-claim (demonic) training, where a separate model is trained for each claim.

  • Models like TextUMC, which cluster and embed textual claims and evidence, have several limitations:

    • Their performance is highly dependent on the quality and completeness of the input data
    • Noisy or biased data can lead to poor or misleading results
    • The clusters they produce may not always be semantically meaningful or interpretable to humans, and the neural embeddings themselves function as black boxes
    • There is no guarantee that discovered clusters will align with human judgment, and results can be sensitive to clustering metrics and hyperparameters
    • Standard clustering metrics may not fully reflect the real-world utility or quality of clusters, and human evaluation is often needed
    • These models can propagate biases found in the training data and, if used carelessly, can amplify misinformation or label errors

Training and evaluation data

Data Used in This Repository

This repository uses a merged dataset for claim verification and fact-checking tasks, combining two main sources: LIAR-RAW and RAWFC. Below are details about the datasets, preprocessing steps, structure, and statistics.


  1. LIAR-RAW Dataset
  • Contents:
    Each sample contains a claim, its label, an explanation, and a list of supporting reports (evidence).
  • Preprocessing:
    • The reports field is cleaned to retain only report_id and content for each evidence.
    • Label normalization:
      • "mostly-true""true"
      • "barely-true" and "pants-fire""false"
  • Splits:
    Provided as train.json, val.json, and test.json.
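
A sketch of this cleaning and label-normalization step; the loader function and file paths are illustrative, not the repository's exact code:

import json

# Label normalization described above.
LABEL_MAP = {"mostly-true": "true", "barely-true": "false", "pants-fire": "false"}

def load_liar_raw(path):
    """Load one LIAR-RAW split and keep only report_id/content per evidence."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    cleaned = []
    for rec in records:
        cleaned.append({
            "claim": rec["claim"],
            "label": LABEL_MAP.get(rec["label"], rec["label"]),
            "explain": rec.get("explain", ""),
            "reports": [
                {"report_id": r.get("report_id"), "content": r["content"]}
                for r in rec.get("reports", [])
            ],
        })
    return cleaned

liar_train = load_liar_raw("LIAR-RAW/train.json")  # likewise for val.json, test.json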

  2. RAWFC Dataset
  • Contents:
    Fact-checking events, each with an event_id, claim, label, explain, and a list of reports (evidence).
  • Preprocessing:
    • Loads all JSON files from split directories (train, val, test).
    • For each record, reports retain only the content field.
    • The label "half" is renamed to "half-true".
  • Splits:
    Merged into DataFrames for train, validation, and test.
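
A sketch of the RAWFC loading step, assuming one fact-checking event per JSON file; paths and the helper name are illustrative:

import glob
import json
import os

import pandas as pd

def load_rawfc_split(split_dir):
    """Load every JSON file in a RAWFC split directory into one DataFrame."""
    rows = []
    for path in glob.glob(os.path.join(split_dir, "*.json")):
        with open(path, encoding="utf-8") as f:
            rec = json.load(f)  # one event per file (assumed)
        rows.append({
            "claim": rec["claim"],
            "label": "half-true" if rec["label"] == "half" else rec["label"],
            "explain": rec.get("explain", ""),
            "reports": [{"content": r["content"]} for r in rec.get("reports", [])],
        })
    return pd.DataFrame(rows)

rawfc_train = load_rawfc_split("RAWFC/train")  # likewise for val and test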

  3. Merging and Final Dataset
  • After preprocessing, the LIAR-RAW and RAWFC splits are concatenated for each of train, validation, and test.
  • The validation split is also merged into the training data for a larger training set.
  • Final Output:
    • train.json, test.json (and optionally validation) in a unified format.
    • The merged dataset is also converted to HuggingFace DatasetDict format for use in machine learning pipelines.
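
A sketch of the merge, building on the two loader sketches above; the variable names are assumptions, and the validation split is folded into training as described:

from datasets import Dataset, DatasetDict

# liar_train/liar_val/liar_test are record lists from the LIAR-RAW sketch;
# rawfc_train/rawfc_val/rawfc_test are DataFrames from the RAWFC sketch.
train_records = (liar_train + liar_val
                 + rawfc_train.to_dict("records")
                 + rawfc_val.to_dict("records"))
test_records = liar_test + rawfc_test.to_dict("records")

dataset = DatasetDict({
    "train": Dataset.from_list(train_records),
    "test": Dataset.from_list(test_records),
})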

  4. Data Structure

Each data record (after preprocessing and merging) contains:

  • claim: The main statement to be verified.
  • label: The truthfulness label ("true", "false", "half-true", etc.).
  • explain: Explanation for the claim's label.
  • reports: List of evidence, each as a dictionary:
    • report_id (if available)
    • content: Evidence text.

Example:

{
  "claim": "The sky is green.",
  "label": "false",
  "explain": "Scientific consensus says the sky appears blue due to Rayleigh scattering.",
  "reports": [
    {"report_id": "1234567", "content": "A NASA article explains why the sky is blue."},
    {"report_id": "2345678", "content": "Physics textbook reference on atmospheric optics."}
  ]
}

5. Statistics

  • The notebook reports over 167,000 unique reports (evidence texts) in the training set, indicating a rich and diverse dataset.

6. Storage and Usage

  • The preprocessed datasets are saved as JSON files and as disk-based HuggingFace datasets.
  • The dataset is pushed to the HuggingFace Hub for sharing and reproducibility.
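
A sketch of this storage step, assuming the dataset object from the merge sketch above; the Hub dataset id shown is a placeholder:

# `dataset` is the DatasetDict built in the merge sketch above.
dataset["train"].to_json("train.json")
dataset["test"].to_json("test.json")
dataset.save_to_disk("merged_fc_dataset")               # disk-based copy
dataset.push_to_hub("NoAtmosphere0/merged-fc-dataset")  # placeholder repo id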

7. Summary Table

| Dataset  | Files/Splits   | Label Normalization                                 | Evidence Structure    | Merged Into           |
|----------|----------------|-----------------------------------------------------|-----------------------|-----------------------|
| LIAR-RAW | train/val/test | mostly-true → true; barely-true, pants-fire → false | reports: id, content  | train, test           |
| RAWFC    | train/val/test | half → half-true                                    | reports: content only | train, test           |
| Merged   | all above      | consistent as above                                 | unified format        | train.json, test.json |

Training procedure

  • Data Loading:

    • Claims and their evidences are loaded from JSON files. Claims with too few evidences are skipped.
  • Training Modes:

    • Normal: One model is trained on all evidences from all claims.
    • Demonic: Separate models are trained for each claim, focusing on the evidences attached to that claim.
  • Clustering:

    • Evidence embeddings are clustered using KMeans.
    • The optimal number of clusters can be determined using metrics such as the silhouette score (see the sketch after this list).
  • Loss Functions:

    • Unsupervised Contrastive Loss: Encourages similar evidences to have similar embeddings.
    • Supervised Contrastive Loss: Uses pseudo-labels from KMeans clustering to refine embeddings.
  • Evaluation

    • Clustering Evaluation:

      • For each claim, the evidence embeddings are clustered, and metrics such as silhouette score, Calinski-Harabasz, and Davies-Bouldin index are computed.
      • The best clustering configuration for each metric is selected, and evidences are grouped accordingly.
      • Results are aggregated and saved for further analysis.
    • Visualization:

      • Optionally, the clusters can be visualized using PCA or t-SNE.
  • Additional Details

    • The code is designed to run on GPU if available.
    • Detailed logging, metrics, and results are saved during the training and evaluation process.
    • Hyperparameters (like batch size, learning rate, number of clusters, etc.) are configurable via command-line arguments.
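
A minimal sketch of the clustering step referenced above: KMeans over evidence embeddings with silhouette-based selection of k. This is illustrative, not the repository's exact implementation:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_evidences(embeddings: np.ndarray, k_range=range(2, 10)):
    """Cluster evidence embeddings and keep the k with the best silhouette score."""
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_range:
        if k >= len(embeddings):
            break
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    # The returned assignments can also serve as pseudo-labels for the
    # supervised contrastive loss mentioned above.
    return best_labels, best_k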

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 4
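
For orientation, a sketch of how these values map onto transformers TrainingArguments; the output directory and any omitted settings are placeholders, not taken from the model card:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="roberta-large-fc",  # placeholder
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=4,
)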

Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy | Precision | Recall | F1     |
|---------------|-------|------|-----------------|----------|-----------|--------|--------|
| No log        | 0     | 0    | 1.1434          | 0.3304   | 0.1092    | 0.3304 | 0.1641 |
| 0.6141        | 1.0   | 1460 | 0.6297          | 0.7483   | 0.7535    | 0.7483 | 0.7451 |
| 0.6133        | 2.0   | 2920 | 0.6568          | 0.7619   | 0.7566    | 0.7619 | 0.7577 |
| 0.2901        | 3.0   | 4380 | 0.9584          | 0.7476   | 0.7646    | 0.7476 | 0.7533 |
| 0.1316        | 4.0   | 5840 | 1.3871          | 0.7639   | 0.7729    | 0.7639 | 0.7675 |

Framework versions

  • Transformers 4.46.3
  • Pytorch 2.5.1+cu121
  • Datasets 3.1.0
  • Tokenizers 0.20.3