---
license: apache-2.0
language:
- en
tags:
- information-retrieval
- LLM
- Embedding
- text-retrieval
- disaster-management
task_categories:
- text-retrieval
library_name: transformers
dataset_tags:
- DMIR01/DMRetriever_MTT
---

This model was trained using the approach described in [DMRetriever: A Family of Models for Improved Text Retrieval in Disaster Management](https://www.arxiv.org/abs/2510.15087).
The associated GitHub repository is available [here](https://github.com/KaiYin97/DMRETRIEVER).

## Model Overview

**DMRetriever-335M** has the following features:

- Model Type: Text Embedding
- Supported Languages: English
- Number of Parameters: 335M
- Context Length: 512
- Embedding Dimension: 1024

For more details, including model training, benchmark evaluation, and inference performance, please refer to our [paper](https://www.arxiv.org/abs/2510.15087) and [GitHub repository](https://github.com/KaiYin97/DMRETRIEVER).

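As a quick sanity check of the specifications above, you can read them off the published configuration. This is a minimal sketch using the standard `transformers` loaders; the expected values in the comments are taken from the list above:

```python
from transformers import AutoConfig, AutoTokenizer

# Read the published configuration and compare against the specs listed above.
config = AutoConfig.from_pretrained("DMIR01/DMRetriever-335M")
tokenizer = AutoTokenizer.from_pretrained("DMIR01/DMRetriever-335M")

print(config.hidden_size)          # embedding dimension, expected 1024
print(config.num_hidden_layers)    # expected 24 (see the model list below)
print(tokenizer.model_max_length)  # context length, expected 512
```
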
## DMRetriever series model list

| **Model** | **Description** | **Backbone** | **Backbone Type** | **Hidden Size** | **#Layers** |
|:--|:--|:--|:--|:--:|:--:|
| [DMRetriever-33M](https://huggingface.co/DMIR01/DMRetriever-33M) | Base 33M variant | MiniLM | Encoder-only | 384 | 12 |
| [DMRetriever-33M-PT](https://huggingface.co/DMIR01/DMRetriever-33M-PT) | Pre-trained version of 33M | MiniLM | Encoder-only | 384 | 12 |
| [DMRetriever-109M](https://huggingface.co/DMIR01/DMRetriever-109M) | Base 109M variant | BERT-base-uncased | Encoder-only | 768 | 12 |
| [DMRetriever-109M-PT](https://huggingface.co/DMIR01/DMRetriever-109M-PT) | Pre-trained version of 109M | BERT-base-uncased | Encoder-only | 768 | 12 |
| [DMRetriever-335M](https://huggingface.co/DMIR01/DMRetriever-335M) | Base 335M variant | BERT-large-uncased-WWM | Encoder-only | 1024 | 24 |
| [DMRetriever-335M-PT](https://huggingface.co/DMIR01/DMRetriever-335M-PT) | Pre-trained version of 335M | BERT-large-uncased-WWM | Encoder-only | 1024 | 24 |
| [DMRetriever-596M](https://huggingface.co/DMIR01/DMRetriever-596M) | Base 596M variant | Qwen3-0.6B | Decoder-only | 1024 | 28 |
| [DMRetriever-596M-PT](https://huggingface.co/DMIR01/DMRetriever-596M-PT) | Pre-trained version of 596M | Qwen3-0.6B | Decoder-only | 1024 | 28 |
| [DMRetriever-4B](https://huggingface.co/DMIR01/DMRetriever-4B) | Base 4B variant | Qwen3-4B | Decoder-only | 2560 | 36 |
| [DMRetriever-4B-PT](https://huggingface.co/DMIR01/DMRetriever-4B-PT) | Pre-trained version of 4B | Qwen3-4B | Decoder-only | 2560 | 36 |
| [DMRetriever-7.6B](https://huggingface.co/DMIR01/DMRetriever-7.6B) | Base 7.6B variant | Qwen3-8B | Decoder-only | 4096 | 36 |
| [DMRetriever-7.6B-PT](https://huggingface.co/DMIR01/DMRetriever-7.6B-PT) | Pre-trained version of 7.6B | Qwen3-8B | Decoder-only | 4096 | 36 |

## Usage

Using HuggingFace Transformers:
```python
import numpy as np
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "DMIR01/DMRetriever-335M"

# Load model/tokenizer
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
# Some decoder-only models have no pad token; fall back to EOS if needed
if tokenizer.pad_token is None and tokenizer.eos_token is not None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModel.from_pretrained(MODEL_NAME, torch_dtype=dtype).to(device)
model.eval()

# Mean pooling over valid tokens (mask==1)
def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)  # [B, T, 1]
    summed = (last_hidden_state * mask).sum(dim=1)                  # [B, H]
    counts = mask.sum(dim=1).clamp(min=1e-9)                        # [B, 1]
    return summed / counts                                          # [B, H]

# Optional task prefixes (use for queries; keep corpus plain)
TASK2PREFIX = {
    "FactCheck": "Given the claim, retrieve most relevant document that supports or refutes the claim",
    "NLI": "Given the premise, retrieve most relevant hypothesis that is entailed by the premise",
    "QA": "Given the question, retrieve most relevant passage that best answers the question",
    "QAdoc": "Given the question, retrieve the most relevant document that answers the question",
    "STS": "Given the sentence, retrieve the sentence with the same meaning",
    "Twitter": "Given the user query, retrieve the most relevant Twitter text that meets the request",
}

def with_prefix(task: str, text: str) -> str:
    p = TASK2PREFIX.get(task, "")
    return f"{p}: {text}" if p else text

# Batch encode with L2 normalization (recommended for cosine/inner-product search)
@torch.inference_mode()
def encode_texts(texts, batch_size: int = 32, max_length: int = 512, normalize: bool = True):
    all_embs = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        toks = tokenizer(
            batch,
            padding=True,
            truncation=True,
            max_length=max_length,
            return_tensors="pt",
        )
        toks = {k: v.to(device) for k, v in toks.items()}
        out = model(**toks, return_dict=True)
        emb = mean_pool(out.last_hidden_state, toks["attention_mask"])
        if normalize:
            emb = F.normalize(emb, p=2, dim=1)
        all_embs.append(emb.cpu().numpy())
    return np.vstack(all_embs) if all_embs else np.empty((0, model.config.hidden_size), dtype=np.float32)

# ---- Example: plain sentences ----
sentences = [
    "A cat sits on the mat.",
    "The feline is resting on the rug.",
    "Quantum mechanics studies matter and light.",
]
embs = encode_texts(sentences)  # shape: [N, hidden_size]
print("Embeddings shape:", embs.shape)

# Cosine similarity (embeddings are L2-normalized)
sims = embs @ embs.T
print("Cosine similarity matrix:\n", np.round(sims, 3))

# ---- Example: query with task prefix (QA) ----
qa_queries = [
    with_prefix("QA", "Who wrote 'Pride and Prejudice'?"),
    with_prefix("QA", "What is the capital of Japan?"),
]
qa_embs = encode_texts(qa_queries)
print("QA Embeddings shape:", qa_embs.shape)
```
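
If you prefer the `sentence-transformers` interface, an equivalent pipeline can be assembled manually. This is a minimal sketch assuming mean pooling over the last hidden state (mirroring `mean_pool` above); the module setup below is an illustration rather than an official recipe shipped with the checkpoint:

```python
from sentence_transformers import SentenceTransformer, models

# Wrap the checkpoint with an explicit mean-pooling head (mirrors mean_pool above)
word = models.Transformer("DMIR01/DMRetriever-335M", max_seq_length=512)
pooling = models.Pooling(word.get_word_embedding_dimension(), pooling_mode="mean")
st_model = SentenceTransformer(modules=[word, pooling])

embs = st_model.encode(
    ["A cat sits on the mat.", "The feline is resting on the rug."],
    normalize_embeddings=True,  # L2-normalize, same as the Transformers example
)
print(embs.shape)  # (2, 1024)
```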

## Citation

If you find this repository helpful, please kindly consider citing the corresponding paper. Thanks!

```
@article{yin2025dmretriever,
  title={DMRetriever: A Family of Models for Improved Text Retrieval in Disaster Management},
  author={Yin, Kai and Dong, Xiangjue and Liu, Chengkai and Lin, Allen and Shi, Lingfeng and Mostafavi, Ali and Caverlee, James},
  journal={arXiv preprint arXiv:2510.15087},
  year={2025}
}
```