davanstrien (HF Staff) committed
Commit 2ea6e78 · verified · 1 parent: bf409ec

Upload folder using huggingface_hub

Files changed (4)
  1. Dockerfile +23 -0
  2. README.md +82 -5
  3. app.py +155 -0
  4. requirements.txt +4 -0
Dockerfile ADDED
@@ -0,0 +1,23 @@
+ FROM python:3.11-slim
+
+ WORKDIR /app
+
+ # Install uv for faster package installation
+ RUN pip install uv
+
+ # Copy requirements and install dependencies
+ COPY requirements.txt .
+ RUN uv pip install --system -r requirements.txt
+
+ # Copy the app
+ COPY app.py .
+
+ # Create non-root user for security
+ RUN useradd -m -u 1000 user
+ USER user
+
+ # Expose port 7860 (HuggingFace Spaces default)
+ EXPOSE 7860
+
+ # Run the marimo app
+ CMD ["marimo", "run", "app.py", "--host", "0.0.0.0", "--port", "7860"]
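For local testing outside Spaces, the image above can be built and run with standard Docker commands (a sketch; the image tag `doab-eval` is arbitrary):

```shell
# Build the image from the directory containing the Dockerfile
docker build -t doab-eval .

# Run it, mapping the Spaces default port to localhost
docker run -p 7860:7860 doab-eval
# The marimo app is then reachable at http://localhost:7860
```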
README.md CHANGED
@@ -1,10 +1,87 @@
  ---
- title: Doab Title Extraction Eval
- emoji: 😻
- colorFrom: purple
- colorTo: yellow
+ title: DOAB Title Extraction Evaluation
+ emoji: 📚
+ colorFrom: blue
+ colorTo: purple
  sdk: docker
  pinned: false
+ license: mit
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # VLM vs Text: Extracting Titles from Book Covers
+
+ **Can Vision-Language Models extract metadata from book covers better than text extraction?**
+
+ ## TL;DR
+
+ **Yes, significantly.** VLMs achieve ~97% accuracy vs ~70% for text extraction on the DOAB academic book cover dataset.
+
+ ## The Task
+
+ Extracting titles from digitized book covers is a common challenge in libraries, archives, and digital humanities projects (the GLAM sector). We compared two approaches:
+
+ 1. **VLM (Vision)**: Send the cover image directly to a Vision-Language Model
+ 2. **Text Extraction**: Extract text from the image first, then send it to an LLM
+
+ ## Results
+
+ | Approach | Average Accuracy |
+ |----------|-----------------|
+ | **VLM** | **97%** |
+ | Text | 70% |
+
+ VLMs outperform text extraction by ~27 percentage points.
+
+ ### Why VLMs Win
+
+ Book covers are **visually structured**:
+ - Titles appear in specific locations (usually top/center)
+ - Typography indicates importance (larger = more likely title)
+ - Layout provides context that pure text loses
+
+ Text extraction flattens this structure, losing valuable spatial information.
+
+ ## Models Evaluated
+
+ **VLM Models** (96-98% accuracy):
+ - Qwen3-VL-8B-Instruct
+ - Qwen3-VL-30B-A3B-Thinking
+ - GLM-4.6V-Flash
+
+ **Text Models** (68-70% accuracy):
+ - gpt-oss-20b
+ - Qwen3-4B-Instruct-2507
+ - Olmo-3-7B-Instruct
+
+ **Interesting finding**: Qwen3-VL-8B achieves 94% even when used as a text-only model, suggesting it's generally better at this task regardless of modality.
+
+ ## Technical Details
+
+ - **Dataset**: [DOAB Metadata Extraction](https://huggingface.co/datasets/biglam/doab-metadata-extraction) (50 samples)
+ - **Evaluation Framework**: [Inspect AI](https://inspect.aisi.org.uk/)
+ - **Scoring**: Flexible title matching (handles case, subtitles, punctuation)
+ - **Logs**: [davanstrien/doab-title-extraction-evals](https://huggingface.co/datasets/davanstrien/doab-title-extraction-evals)
+
+ ## Replicate This
+
+ The evaluation logs are stored on HuggingFace and can be loaded directly:
+
+ ```python
+ from inspect_ai.analysis import evals_df
+
+ df = evals_df("hf://datasets/davanstrien/doab-title-extraction-evals")
+ ```
+
+ ## Why This Matters for GLAM
+
+ Libraries and archives have millions of digitized documents where metadata is incomplete or missing. VLMs offer a promising approach for:
+
+ - **Catalog enhancement**: Fill gaps in existing records
+ - **Discovery**: Make collections more searchable
+ - **Quality assessment**: Validate existing metadata
+
+ This evaluation demonstrates that domain-specific benchmarks can help identify the best approaches for cultural heritage tasks.
+
+ ---
+
+ *Built with [Marimo](https://marimo.io) | Evaluation framework: [Inspect AI](https://inspect.aisi.org.uk/)*
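The "flexible title matching" scorer mentioned in Technical Details is not included in this commit; a minimal sketch of the idea (case-insensitive, punctuation-stripped, subtitle-tolerant comparison; the actual Inspect AI scorer may differ) could look like:

```python
import re
import string


def normalize_title(title: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    title = title.lower()
    title = title.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", title).strip()


def titles_match(predicted: str, target: str) -> bool:
    """Flexible match: exact after normalization, or the prediction
    matches the main title before a subtitle separator (colon)."""
    pred, tgt = normalize_title(predicted), normalize_title(target)
    if pred == tgt:
        return True
    # Accept the main title alone when the target carries a subtitle
    return pred == normalize_title(target.split(":")[0])
```

A scorer like this avoids penalizing a model for omitting a subtitle or differing in case and punctuation, which matters when cover typography separates title and subtitle visually.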
app.py ADDED
@@ -0,0 +1,155 @@
+ import marimo
+
+ __generated_with = "0.10.9"
+ app = marimo.App(width="medium")
+
+
+ @app.cell
+ def _():
+     import marimo as mo
+     return (mo,)
+
+
+ @app.cell
+ def _(mo):
+     mo.md(
+         """
+         # DOAB Title Extraction: VLM vs Text
+
+         **Can Vision-Language Models extract book titles from covers better than text extraction?**
+
+         This dashboard compares VLM (vision) and text-based approaches for extracting titles from academic book covers in the [DOAB dataset](https://huggingface.co/datasets/biglam/doab-metadata-extraction).
+
+         📊 **Evaluation logs**: [davanstrien/doab-title-extraction-evals](https://huggingface.co/datasets/davanstrien/doab-title-extraction-evals)
+         """
+     )
+     return
+
+
+ @app.cell
+ def _():
+     import pandas as pd
+     from inspect_ai.analysis import evals_df
+     return evals_df, pd
+
+
+ @app.cell
+ def _(evals_df):
+     # Load evaluation results from HuggingFace
+     df = evals_df("hf://datasets/davanstrien/doab-title-extraction-evals", quiet=True)
+
+     # Add approach column
+     df["approach"] = df["task_name"].apply(lambda x: "VLM" if "vlm" in x else "Text")
+
+     # Shorten model names
+     df["model_short"] = df["model"].apply(lambda x: x.split("/")[-1])
+
+     # Convert score to percentage
+     df["accuracy"] = df["score_headline_value"] * 100
+     return (df,)
+
+
+ @app.cell
+ def _(df, mo):
+     # Calculate summary stats
+     vlm_avg = df[df["approach"] == "VLM"]["accuracy"].mean()
+     text_avg = df[df["approach"] == "Text"]["accuracy"].mean()
+     diff = vlm_avg - text_avg
+
+     mo.md(
+         f"""
+         ## Key Results
+
+         | Approach | Average Accuracy |
+         |----------|-----------------|
+         | **VLM (Vision)** | **{vlm_avg:.0f}%** |
+         | Text Extraction | {text_avg:.0f}% |
+
+         **VLM advantage: +{diff:.0f} percentage points**
+
+         VLMs significantly outperform text extraction for book cover metadata.
+         This is because book covers are **visually structured** - titles appear in specific
+         locations with distinctive formatting that VLMs can recognize.
+         """
+     )
+     return diff, text_avg, vlm_avg
+
+
+ @app.cell
+ def _(mo):
+     mo.md("## Model Leaderboard")
+     return
+
+
+ @app.cell
+ def _(df, mo):
+     # Filter selector
+     approach_filter = mo.ui.dropdown(
+         options=["All", "VLM", "Text"],
+         value="All",
+         label="Filter by approach",
+     )
+     return (approach_filter,)
+
+
+ @app.cell
+ def _(approach_filter, df, mo, pd):
+     # Filter data based on selection
+     if approach_filter.value == "All":
+         filtered_df = df
+     else:
+         filtered_df = df[df["approach"] == approach_filter.value]
+
+     # Create leaderboard
+     leaderboard = (
+         filtered_df[["model_short", "approach", "accuracy"]]
+         .sort_values("accuracy", ascending=False)
+         .reset_index(drop=True)
+     )
+     leaderboard.columns = ["Model", "Approach", "Accuracy (%)"]
+     leaderboard["Accuracy (%)"] = leaderboard["Accuracy (%)"].round(1)
+
+     mo.vstack([
+         approach_filter,
+         mo.ui.table(leaderboard, selection=None),
+     ])
+     return filtered_df, leaderboard
+
+
+ @app.cell
+ def _(df, mo):
+     mo.md(
+         """
+         ## About This Evaluation
+
+         **Task**: Extract the title from academic book cover images
+
+         **Dataset**: [DOAB Metadata Extraction](https://huggingface.co/datasets/biglam/doab-metadata-extraction) - 50 samples
+
+         **Evaluation Framework**: [Inspect AI](https://inspect.aisi.org.uk/)
+
+         **Scoring**: Flexible title matching (case-insensitive, handles subtitles)
+
+         ### Models Evaluated
+
+         **VLM (Vision-Language Models)**:
+         - Qwen3-VL-8B-Instruct
+         - Qwen3-VL-30B-A3B-Thinking
+         - GLM-4.6V-Flash
+
+         **Text Extraction**:
+         - gpt-oss-20b
+         - Qwen3-4B-Instruct-2507
+         - Olmo-3-7B-Instruct
+         - Qwen3-VL-8B-Instruct (used as text-only LLM)
+
+         ---
+
+         *Built with [Marimo](https://marimo.io) | Evaluation logs on [HuggingFace](https://huggingface.co/datasets/davanstrien/doab-title-extraction-evals)*
+         """
+     )
+     return
+
+
+ if __name__ == "__main__":
+     app.run()
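The Key Results cell in app.py averages accuracy per approach with pandas; the same pattern on a small mock frame (the model names and scores below are illustrative, not the real results):

```python
import pandas as pd

# Mock results in the shape app.py derives from the evaluation logs
df = pd.DataFrame(
    {
        "model_short": ["vlm-model-a", "vlm-model-b", "text-model-a"],  # illustrative names
        "approach": ["VLM", "VLM", "Text"],
        "accuracy": [98.0, 96.0, 70.0],
    }
)

# Average accuracy per approach, as in the Key Results cell
summary = df.groupby("approach")["accuracy"].mean()
print(summary)
```

The real dashboard builds `approach` from the Inspect task name and `accuracy` from `score_headline_value`, then takes the same groupwise mean.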
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ marimo>=0.10.0
+ pandas>=2.0.0
+ inspect-ai>=0.3.0
+ huggingface-hub>=0.20.0