davanstrien (HF Staff) committed
Commit 2ea6e78 · verified · 1 parent: bf409ec

Upload folder using huggingface_hub

Files changed (4)
  1. Dockerfile +23 -0
  2. README.md +82 -5
  3. app.py +155 -0
  4. requirements.txt +4 -0
Dockerfile ADDED
@@ -0,0 +1,23 @@
+ FROM python:3.11-slim
+
+ WORKDIR /app
+
+ # Install uv for faster package installation
+ RUN pip install uv
+
+ # Copy requirements and install dependencies
+ COPY requirements.txt .
+ RUN uv pip install --system -r requirements.txt
+
+ # Copy the app
+ COPY app.py .
+
+ # Create non-root user for security
+ RUN useradd -m -u 1000 user
+ USER user
+
+ # Expose port 7860 (HuggingFace Spaces default)
+ EXPOSE 7860
+
+ # Run the marimo app
+ CMD ["marimo", "run", "app.py", "--host", "0.0.0.0", "--port", "7860"]
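For local testing outside Spaces, the image above can be built and run with standard Docker commands (a sketch; the image tag `doab-eval` is arbitrary):

```shell
# Build the image from the directory containing the Dockerfile
docker build -t doab-eval .

# Run it, mapping the Spaces default port to localhost
docker run -p 7860:7860 doab-eval
# The marimo app is then reachable at http://localhost:7860
```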
README.md CHANGED
@@ -1,10 +1,87 @@
  ---
- title: Doab Title Extraction Eval
- emoji: 😻
- colorFrom: purple
- colorTo: yellow
+ title: DOAB Title Extraction Evaluation
+ emoji: 📚
+ colorFrom: blue
+ colorTo: purple
  sdk: docker
  pinned: false
+ license: mit
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # VLM vs Text: Extracting Titles from Book Covers
+
+ **Can Vision-Language Models extract metadata from book covers better than text extraction?**
+
+ ## TL;DR
+
+ **Yes, significantly.** VLMs achieve ~97% accuracy vs ~70% for text extraction on the DOAB academic book cover dataset.
+
+ ## The Task
+
+ Extracting titles from digitized book covers is a common challenge in libraries, archives, and digital humanities projects (the GLAM sector). We compared two approaches:
+
+ 1. **VLM (Vision)**: Send the cover image directly to a Vision-Language Model
+ 2. **Text Extraction**: Extract text from the image first, then send it to an LLM
+
+ ## Results
+
+ | Approach | Average Accuracy |
+ |----------|-----------------|
+ | **VLM** | **97%** |
+ | Text | 70% |
+
+ VLMs outperform text extraction by ~27 percentage points.
+
+ ### Why VLMs Win
+
+ Book covers are **visually structured**:
+ - Titles appear in specific locations (usually top/center)
+ - Typography indicates importance (larger = more likely title)
+ - Layout provides context that pure text loses
+
+ Text extraction flattens this structure, losing valuable spatial information.
+
+ ## Models Evaluated
+
+ **VLM Models** (96-98% accuracy):
+ - Qwen3-VL-8B-Instruct
+ - Qwen3-VL-30B-A3B-Thinking
+ - GLM-4.6V-Flash
+
+ **Text Models** (68-70% accuracy):
+ - gpt-oss-20b
+ - Qwen3-4B-Instruct-2507
+ - Olmo-3-7B-Instruct
+
+ **Interesting finding**: Qwen3-VL-8B achieves 94% even when used as a text-only model, suggesting it's generally better at this task regardless of modality.
+
+ ## Technical Details
+
+ - **Dataset**: [DOAB Metadata Extraction](https://huggingface.co/datasets/biglam/doab-metadata-extraction) (50 samples)
+ - **Evaluation Framework**: [Inspect AI](https://inspect.aisi.org.uk/)
+ - **Scoring**: Flexible title matching (handles case, subtitles, punctuation)
+ - **Logs**: [davanstrien/doab-title-extraction-evals](https://huggingface.co/datasets/davanstrien/doab-title-extraction-evals)
+
+ ## Replicate This
+
+ The evaluation logs are stored on HuggingFace and can be loaded directly:
+
+ ```python
+ from inspect_ai.analysis import evals_df
+
+ df = evals_df("hf://datasets/davanstrien/doab-title-extraction-evals")
+ ```
+
+ ## Why This Matters for GLAM
+
+ Libraries and archives have millions of digitized documents where metadata is incomplete or missing. VLMs offer a promising approach for:
+
+ - **Catalog enhancement**: Fill gaps in existing records
+ - **Discovery**: Make collections more searchable
+ - **Quality assessment**: Validate existing metadata
+
+ This evaluation demonstrates that domain-specific benchmarks can help identify the best approaches for cultural heritage tasks.
+
+ ---
+
+ *Built with [Marimo](https://marimo.io) | Evaluation framework: [Inspect AI](https://inspect.aisi.org.uk/)*
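The "flexible title matching" scorer mentioned in Technical Details is not included in this commit; a minimal sketch of the idea (case-insensitive, punctuation-stripped, subtitle-tolerant comparison; the actual Inspect AI scorer may differ) could look like:

```python
import re
import string


def normalize_title(title: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    title = title.lower()
    title = title.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", title).strip()


def titles_match(predicted: str, target: str) -> bool:
    """Flexible match: exact after normalization, or the prediction
    matches the main title before a subtitle separator (colon)."""
    pred, tgt = normalize_title(predicted), normalize_title(target)
    if pred == tgt:
        return True
    # Accept the main title alone when the target carries a subtitle
    return pred == normalize_title(target.split(":")[0])
```

A scorer like this avoids penalizing a model for omitting a subtitle or differing in case and punctuation, which matters when cover typography separates title and subtitle visually.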
app.py ADDED
@@ -0,0 +1,155 @@
+ import marimo
+
+ __generated_with = "0.10.9"
+ app = marimo.App(width="medium")
+
+
+ @app.cell
+ def _():
+     import marimo as mo
+     return (mo,)
+
+
+ @app.cell
+ def _(mo):
+     mo.md(
+         """
+         # DOAB Title Extraction: VLM vs Text
+
+         **Can Vision-Language Models extract book titles from covers better than text extraction?**
+
+         This dashboard compares VLM (vision) and text-based approaches for extracting titles from academic book covers in the [DOAB dataset](https://huggingface.co/datasets/biglam/doab-metadata-extraction).
+
+         📊 **Evaluation logs**: [davanstrien/doab-title-extraction-evals](https://huggingface.co/datasets/davanstrien/doab-title-extraction-evals)
+         """
+     )
+     return
+
+
+ @app.cell
+ def _():
+     import pandas as pd
+     from inspect_ai.analysis import evals_df
+     return evals_df, pd
+
+
+ @app.cell
+ def _(evals_df):
+     # Load evaluation results from HuggingFace
+     df = evals_df("hf://datasets/davanstrien/doab-title-extraction-evals", quiet=True)
+
+     # Add approach column
+     df["approach"] = df["task_name"].apply(lambda x: "VLM" if "vlm" in x else "Text")
+
+     # Shorten model names
+     df["model_short"] = df["model"].apply(lambda x: x.split("/")[-1])
+
+     # Convert score to percentage
+     df["accuracy"] = df["score_headline_value"] * 100
+     return (df,)
+
+
+ @app.cell
+ def _(df, mo):
+     # Calculate summary stats
+     vlm_avg = df[df["approach"] == "VLM"]["accuracy"].mean()
+     text_avg = df[df["approach"] == "Text"]["accuracy"].mean()
+     diff = vlm_avg - text_avg
+
+     mo.md(
+         f"""
+         ## Key Results
+
+         | Approach | Average Accuracy |
+         |----------|-----------------|
+         | **VLM (Vision)** | **{vlm_avg:.0f}%** |
+         | Text Extraction | {text_avg:.0f}% |
+
+         **VLM advantage: +{diff:.0f} percentage points**
+
+         VLMs significantly outperform text extraction for book cover metadata.
+         This is because book covers are **visually structured** - titles appear in specific
+         locations with distinctive formatting that VLMs can recognize.
+         """
+     )
+     return diff, text_avg, vlm_avg
+
+
+ @app.cell
+ def _(mo):
+     mo.md("## Model Leaderboard")
+     return
+
+
+ @app.cell
+ def _(df, mo):
+     # Filter selector
+     approach_filter = mo.ui.dropdown(
+         options=["All", "VLM", "Text"],
+         value="All",
+         label="Filter by approach",
+     )
+     return (approach_filter,)
+
+
+ @app.cell
+ def _(approach_filter, df, mo, pd):
+     # Filter data based on selection
+     if approach_filter.value == "All":
+         filtered_df = df
+     else:
+         filtered_df = df[df["approach"] == approach_filter.value]
+
+     # Create leaderboard
+     leaderboard = (
+         filtered_df[["model_short", "approach", "accuracy"]]
+         .sort_values("accuracy", ascending=False)
+         .reset_index(drop=True)
+     )
+     leaderboard.columns = ["Model", "Approach", "Accuracy (%)"]
+     leaderboard["Accuracy (%)"] = leaderboard["Accuracy (%)"].round(1)
+
+     mo.vstack([
+         approach_filter,
+         mo.ui.table(leaderboard, selection=None),
+     ])
+     return filtered_df, leaderboard
+
+
+ @app.cell
+ def _(df, mo):
+     mo.md(
+         """
+         ## About This Evaluation
+
+         **Task**: Extract the title from academic book cover images
+
+         **Dataset**: [DOAB Metadata Extraction](https://huggingface.co/datasets/biglam/doab-metadata-extraction) - 50 samples
+
+         **Evaluation Framework**: [Inspect AI](https://inspect.aisi.org.uk/)
+
+         **Scoring**: Flexible title matching (case-insensitive, handles subtitles)
+
+         ### Models Evaluated
+
+         **VLM (Vision-Language Models)**:
+         - Qwen3-VL-8B-Instruct
+         - Qwen3-VL-30B-A3B-Thinking
+         - GLM-4.6V-Flash
+
+         **Text Extraction**:
+         - gpt-oss-20b
+         - Qwen3-4B-Instruct-2507
+         - Olmo-3-7B-Instruct
+         - Qwen3-VL-8B-Instruct (used as text-only LLM)
+
+         ---
+
+         *Built with [Marimo](https://marimo.io) | Evaluation logs on [HuggingFace](https://huggingface.co/datasets/davanstrien/doab-title-extraction-evals)*
+         """
+     )
+     return
+
+
+ if __name__ == "__main__":
+     app.run()
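The Key Results cell in app.py averages accuracy per approach with pandas; the same pattern on a small mock frame (the model names and scores below are illustrative, not the real results):

```python
import pandas as pd

# Mock results in the shape app.py derives from the evaluation logs
df = pd.DataFrame(
    {
        "model_short": ["vlm-model-a", "vlm-model-b", "text-model-a"],  # illustrative names
        "approach": ["VLM", "VLM", "Text"],
        "accuracy": [98.0, 96.0, 70.0],
    }
)

# Average accuracy per approach, as in the Key Results cell
summary = df.groupby("approach")["accuracy"].mean()
print(summary)
```

The real dashboard builds `approach` from the Inspect task name and `accuracy` from `score_headline_value`, then takes the same groupwise mean.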
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ marimo>=0.10.0
+ pandas>=2.0.0
+ inspect-ai>=0.3.0
+ huggingface-hub>=0.20.0