Methodology
How the Premodern Concordance extracts, matches, verifies, and enriches named entities across multilingual early modern texts.
Approach
A hybrid pipeline combining large language models for entity extraction and verification with fine-tuned multilingual embeddings for cross-lingual matching. Each step is designed to be incremental, auditable, and resumable.
Scale
Eight texts spanning 1563–1890 in five languages (English, Portuguese, Spanish, French, Italian), producing over 1,500 concordance clusters from thousands of extracted entities with over 50,000 total mentions.
Challenges
Early modern texts present unstable orthography, OCR artifacts, cross-lingual naming conventions (Galeno/Galen/Galien), lost long-s characters, and ambiguous referents that resist standard NER approaches.
Corpus
The concordance currently draws on eight texts from the early modern period through the nineteenth century, concerned with natural knowledge — materia medica, botany, pharmacy, natural history, exploration, evolutionary biology, and psychology. These were selected for their overlapping subject matter and linguistic diversity.
Colóquios dos Simples e Drogas da Índia
García de Orta, 1563 · Portuguese
A pioneering pharmacological dialogue cataloguing the drugs, spices, and medicinal plants of India based on Orta’s decades of firsthand observation in Goa.
Historia Medicinal de las cosas que se traen de nuestras Indias Occidentales
Nicolás Monardes, 1574 · Spanish
A catalogue of New World medicinal substances—tobacco, sassafras, guaiacum, bezoar stones—written by a Seville physician who never crossed the Atlantic.
Ricettario Fiorentino
Collegio Medico di Firenze, 1597 · Italian
The official pharmacopoeia of Florence, standardising compound drug recipes and ingredient terminology for the city’s apothecaries.
The English Physitian
Nicholas Culpeper, 1652 · English
A vernacular herbal linking plants to astrological governance and Galenic humoral medicine, written to make pharmaceutical knowledge accessible beyond Latin.
Polyanthea Medicinal
João Curvo Semedo, 1741 · Portuguese
An encyclopaedic medical compendium blending Galenic, chemical, and empirical approaches, reflecting the eclectic pharmacy of early eighteenth-century Portugal.
Relation historique du voyage, Tome III
Alexander von Humboldt, 1825 · French
The third volume of Humboldt’s narrative of his American expedition, rich with observations on geography, botany, indigenous knowledge, and natural phenomena.
On the Origin of Species
Charles Darwin, 1859 · English
Darwin’s foundational text on evolution by natural selection, dense with references to species, varieties, geological formations, and the naturalists whose observations informed his theory.
The Principles of Psychology, Volume 1
William James, 1890 · English
James’s comprehensive treatise on the science of mind, covering sensation, perception, memory, emotion, and consciousness through both philosophical argument and physiological evidence.
Pipeline Overview
The concordance is built through an eight-stage pipeline. Each stage is a standalone Python script that reads from and writes to JSON, making the process fully auditable and individually re-runnable.
1. LLM-based NER on chunked source text (Gemini)
2. Within-book entity merging via embeddings (BGE-M3)
3. Contrastive learning on historical name pairs (BGE-M3)
4. Pairwise cross-book entity matching (BGE-M3)
5. Connected-component concordance groups (BGE-M3)
6. LLM review of suspicious clusters (Gemini)
7. Ground truth identification + Wikidata (Gemini)
8. Semantic search embeddings for the web (OpenAI)
Entity Extraction
Each source text is chunked and passed through a large language model for structured named entity recognition. The LLM identifies entities and classifies them into a controlled taxonomy of ten categories, each with specific subcategories.
Chunking strategy
Source texts are split into 2,500-character chunks with 200-character overlap. Breaks are paragraph-aware: the chunker looks for paragraph boundaries near the target split point to avoid cutting mid-sentence. EEBO-specific artifacts (page markers, marginal references) and OCR noise are cleaned before chunking.
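A minimal sketch of the paragraph-aware splitting logic described above; function and variable names are illustrative, and the EEBO/OCR cleaning step is assumed to have already run:

```python
def chunk_text(text: str, chunk_size: int = 2500, overlap: int = 200) -> list[str]:
    """Split cleaned source text into overlapping chunks, preferring paragraph breaks."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer a paragraph boundary in the second half of the window
            boundary = text.rfind("\n\n", start + chunk_size // 2, end)
            if boundary != -1:
                end = boundary
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        start = max(end - overlap, start + 1)  # step back to create the 200-char overlap
    return chunks
```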
LLM extraction
Each chunk is sent to Gemini 2.5 Flash Lite with a structured prompt asking for entities in JSON format. The model operates at temperature 0.1 for near-deterministic output.
| Parameter | Value | Notes |
|---|---|---|
| Model | Gemini 2.5 Flash Lite | Fast, cost-effective structured extraction |
| Temperature | 0.1 | Near-deterministic for consistent NER |
| Chunk size | 2,500 chars | With 200-char overlap |
| Max output tokens | 4,000 | Per chunk response |
| Rate limit | 0.3s between calls | Avoids API throttling |
| Checkpoint interval | Every 25 chunks | Resumable processing |
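A hedged sketch of the per-chunk extraction call, assuming the google-generativeai SDK; the model id follows the table above, while the prompt wording and response schema are illustrative rather than the pipeline's exact prompt:

```python
import json
import time

import google.generativeai as genai

genai.configure(api_key="YOUR_KEY")  # assumes an API key is available
model = genai.GenerativeModel("gemini-2.5-flash-lite")

def extract_entities(chunk: str) -> list[dict]:
    # Illustrative prompt; the real prompt spells out the ten-category taxonomy
    prompt = (
        "Extract named entities from this historical text (1500-1900). "
        "Return a JSON list of objects with keys: name, category, subcategory, "
        "context, variants.\n\n" + chunk
    )
    response = model.generate_content(
        prompt,
        generation_config={
            "temperature": 0.1,
            "max_output_tokens": 4000,
            "response_mime_type": "application/json",
        },
    )
    time.sleep(0.3)  # simple rate limiting between calls
    return json.loads(response.text)
```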
Entity taxonomy
- Person: Authority, Scholar, Practitioner, Patron
- Plant: Herb, Tree, Root, Seed, Resin
- Animal: Mammal, Bird, Fish, Insect, Reptile, Product
- Anatomy: Organ, Brain Region, Tissue, Body Part, Fluid
- Substance: Mineral, Chemical, Preparation
- Place: Country, City, Region
- Disease: Acute, Chronic, Psychological, Symptom
- Concept: Theory, Practice, Quality, School of Thought, Mental Faculty
- Organization: Institution, Journal, Society
- Object: Instrument, Vessel, Tool
Each category also includes a catch-all “Other” subcategory for entities that don’t fit neatly into a specific subcategory. The original six-book corpus primarily produced the first eight categories; Anatomy and Organization emerged as distinct categories after adding nineteenth-century scientific texts (Darwin, James) with their richer anatomical and institutional vocabularies.
Per-entity output
For each entity the model returns: the surface form name, category, subcategory, a short context string (up to 10 words), and variant spellings found in the same chunk. After extraction, a separate script locates full-text excerpts (150 characters of surrounding context per mention) using regex-based string matching across the source text.
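A minimal sketch of the excerpt-location step, assuming each mention is matched case-insensitively as a whole word; function and parameter names are illustrative:

```python
import re

def find_excerpts(text: str, surface_form: str, window: int = 150) -> list[str]:
    """Return a context excerpt around each mention of a surface form."""
    pattern = re.compile(r"\b" + re.escape(surface_form) + r"\b", re.IGNORECASE)
    excerpts = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - window // 2)
        end = min(len(text), match.end() + window // 2)
        excerpts.append(text[start:end].replace("\n", " ").strip())
    return excerpts
```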
Within-Book Deduplication
LLM extraction produces many duplicate or near-duplicate entities within a single text (e.g. “Galeno,” “galeno,” “GALENO”). These are merged using embedding similarity and graph-based clustering.
Process
- Embed all entities within a book using the fine-tuned BGE-M3 model with category context appended to each name
- Compute pairwise cosine similarity between all entity embeddings
- Build a graph where edges connect entities above the merge threshold
- Find connected components via BFS — each component becomes a merged entity
- Validate each component: check that every member has sufficient similarity to the primary entity (highest-count member)
| Parameter | Value | Notes |
|---|---|---|
| Merge threshold | 0.88 | General entities |
| Person threshold | 0.85 | Lower for person name variants |
| String similarity boost | −0.05 | Merge threshold lowered by 0.05 when string similarity > 0.5 |
| Minimum string similarity | 0.3 | Hard floor to prevent false merges |
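A simplified sketch of the merge step described above, assuming the fine-tuned model loads through sentence-transformers; the checkpoint path is illustrative, and the string-similarity adjustment and per-component validation are omitted for brevity:

```python
from collections import defaultdict, deque

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("path/to/bge-m3-finetuned-v3")  # fine-tuned checkpoint (path illustrative)

def merge_within_book(entities: list[dict], threshold: float = 0.88) -> list[list[int]]:
    """Group near-duplicate entities via embedding similarity and BFS over the match graph."""
    # Append category context to each name before embedding, as described above
    texts = [f"{e['name']} ({e['category'].lower()})" for e in entities]
    emb = model.encode(texts, normalize_embeddings=True)
    sim = util.cos_sim(emb, emb)

    # Build an adjacency list with an edge wherever similarity clears the threshold
    adj = defaultdict(list)
    n = len(entities)
    for i in range(n):
        for j in range(i + 1, n):
            both_person = entities[i]["category"] == entities[j]["category"] == "PERSON"
            thr = 0.85 if both_person else threshold
            if sim[i][j] >= thr:
                adj[i].append(j)
                adj[j].append(i)

    # Connected components via BFS; each component becomes one merged entity
    seen, components = set(), []
    for i in range(n):
        if i in seen:
            continue
        queue, comp = deque([i]), []
        seen.add(i)
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(comp)
    return components
```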
Model Fine-Tuning
Off-the-shelf multilingual embeddings struggle with early modern naming conventions, archaic spellings, and the non-obvious conceptual linkages central to historical concordance-building. We fine-tune BAAI/bge-m3 on curated training pairs using contrastive learning, iterating through multiple rounds of data curation and evaluation.
Training data curation
The training dataset was built through a multi-stage process. An initial automated extraction from verified concordance clusters produced ~6,800 positive pairs, but quality review revealed a ~14% error rate and heavy skew toward trivial cognates. Strict filtering (Levenshtein distance scoring, per-cluster caps, blocked pair lists) reduced this to ~800 pairs, which were then manually reviewed and supplemented with expert-curated batches targeting specific domains.
The final training file uses an additive batch format — each batch is independently contributed and can be appended without modifying existing entries. The current dataset comprises five batches totalling 889 positive pairs and 154 hard negatives:
| Batch | Focus | Pos. | Neg. |
|---|---|---|---|
| 1 | Cross-lingual entity matches (PT, ES, IT, FR, LA ↔ EN) | 713 | 91 |
| 2 | Temporal-conceptual science shifts (EN ↔ EN across centuries) | 40 | 8 |
| 3 | Psychology and mind science (EN, DE, FR) | 43 | 10 |
| 4 | Helmholtz, Janet, Galton, Agassiz (anticipated texts) | 48 | 25 |
| 5 | Cavendish, E. Darwin, Hartley, Blumenbach, historical disease names | 45 | 20 |
| Total | | 889 | 154 |
Positive pairs teach the model that surface forms refer to the same entity, ranging from straightforward cross-lingual matches (canela ↔ cinnamon) to non-obvious conceptual concordances (Palo santo ↔ Guaiacum, Falling sickness ↔ epilepsy, vibratiuncles ↔ memory traces). Hard negatives teach the model to distinguish confusable terms: canfora ≠ canela, phrenology ≠ phenology, vis viva ≠ vis vitalis, hystérie ≠ hystérèse.
Training configuration
| Parameter | Value | Notes |
|---|---|---|
| Base model | BAAI/bge-m3 (568M params) | XLM-RoBERTa backbone; Latin coverage confirmed via CC-100 pretraining |
| Loss function | MultipleNegativesRankingLoss | Contrastive learning with in-batch negatives |
| Epochs | 3 | Convergence confirmed; 1 epoch may suffice per Michail et al. 2025 |
| Batch size | 16 | Each batch element serves as implicit negative for other pairs |
| Learning rate | 2e-5 | Standard for domain adaptation of pretrained transformers |
| Warmup steps | 100 | ~30% of total training steps |
| Training examples | 1,778 | 889 pairs × 2 (both directions) |
| Total steps | ~333 | 111 per epoch |
| Hardware | Google Colab T4 GPU | ~15 minutes total training time |
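A condensed sketch of the training run under the configuration above, using the sentence-transformers fit API; the example pairs and output path are illustrative, and the handling of the explicit hard negatives is not shown (MultipleNegativesRankingLoss relies on in-batch negatives):

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("BAAI/bge-m3")

# Positive pairs loaded from the additive batch file (two shown for illustration)
pairs = [("canela", "cinnamon"), ("Falling sickness", "epilepsy")]

# Train in both directions: 889 pairs -> 1,778 examples
examples = [InputExample(texts=[a, b]) for a, b in pairs]
examples += [InputExample(texts=[b, a]) for a, b in pairs]

loader = DataLoader(examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)  # other in-batch pairs act as negatives

model.fit(
    train_objectives=[(loader, loss)],
    epochs=3,
    warmup_steps=100,
    optimizer_params={"lr": 2e-5},
)
model.save("bge-m3-concordance-v3")  # output name illustrative
```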
Evaluation: A/B testing v2 vs. v3
To verify that the fine-tuned model generalizes beyond its training data and does not overfit, we ran a comprehensive A/B test comparing the previous fine-tune (v2, trained on ~500 pairs) against the new fine-tune (v3, trained on 889 pairs + 154 hard negatives). The test used six evaluation suites comprising ~330 pair comparisons, including held-out concordance pairs explicitly excluded from training, novel cross-lingual pairs, and randomly sampled same-category negative pairs.
| Test Suite | v2 | v3 | v3 wins | Result |
|---|---|---|---|---|
| Held-out positives (100) | .796 | .758 | 45/55 | Comparable |
| Same-category negatives (80) | .390 | .320 | 67/13 | v3 much better |
| Training data sample (60) | .534 | .759 | 55/5 | v3 learned well |
| Curated hard negatives (60) | .631 | .535 | 49/11 | v3 much better |
| Novel cross-lingual (20) | .844 | .873 | 12/8 | v3 better |
| Novel confusable neg. (10) | .601 | .534 | 8/2 | v3 better |
The critical metric is separation: the gap between average positive similarity and average negative similarity on held-out data. V3 achieved a separation of 0.438 compared to v2’s 0.406, confirming that the new model generalizes well and is not overfitting to training data.
Before and after: base model vs. fine-tuned
Comparing the unmodified BGE-M3 base model against the v3 fine-tune on the full evaluation set shows the scale of improvement:
- Base model separation: negative (confusable negatives rated higher than genuine positives)
- Fine-tuned separation: a clear gap between positive and negative distributions
- Total improvement: the full swing from negative to positive separation
The base BGE-M3 model had negative separation on our data — it rated confusable non-matches (canfora/canela, phrenology/phenology) higher than genuine cross-lingual equivalences (cão danado/Rabies, effluvium/radiation). After fine-tuning, the model correctly separates these distributions. Notable spot-check improvements include:
| Pair | Type | Base | v3 | Δ |
|---|---|---|---|---|
| effluvium ↔ radiation | Temporal concept | .192 | .762 | +.570 |
| doçura ↔ sweetness | PT→EN | .309 | .855 | +.546 |
| Dormideras ↔ Poppy | ES→EN | .246 | .749 | +.503 |
| cão danado ↔ Rabies | PT→EN | .277 | .707 | +.431 |
| Seelenlehre ↔ psychology | DE→EN | .493 | .720 | +.227 |
| mesmerism ↔ hypnosis | Temporal concept | .474 | .746 | +.272 |
| gengibre ≠ genebra | Hard negative | .634 | .273 | −.362 |
| canela ≠ canola | Hard negative | .905 | .658 | −.247 |
Cross-Book Matching
With entities deduplicated and the embedding model fine-tuned, every entity across all books is embedded and compared pairwise. This is the core operation that discovers cross-lingual correspondences.
Matching rules
- Same category only: PERSON matches PERSON, PLANT matches PLANT — no cross-category matches
- Subcategory compatibility: when both entities have valid subcategories, they must match
- One-to-one constraint: each entity can match at most one entity per book, preventing “attractor” entities
- String similarity integration: lexical similarity provides a ±0.03 bonus/penalty to the embedding score
| Parameter | Value | Notes |
|---|---|---|
| Auto-accept threshold | 0.85 | Direct match, no further checks |
| Candidate threshold | 0.65 | Requires string/category validation |
| Same-language orthographic | 0.90 | Higher bar for same-language pairs |
| Cross-language same referent | 0.80 | Lower bar for translations |
| Person minimum string sim | 0.35 | Prevents semantically similar but different people |
| OTHER_CONCEPT minimum | 0.92 | Abstract concepts require higher confidence |
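A hedged sketch of how a single cross-book candidate pair might be scored against the rules and thresholds above; the exact form of the string-similarity adjustment and the one-to-one assignment step are simplified, and field names are illustrative:

```python
import numpy as np
from rapidfuzz.distance import Levenshtein

def score_pair(a: dict, b: dict, emb_a: np.ndarray, emb_b: np.ndarray) -> float | None:
    """Return a match score for two entities from different books, or None if rejected."""
    if a["category"] != b["category"]:
        return None  # same category only
    if a.get("subcategory") and b.get("subcategory") and a["subcategory"] != b["subcategory"]:
        return None  # subcategory compatibility
    lex = Levenshtein.normalized_similarity(a["name"].lower(), b["name"].lower())
    if a["category"] == "PERSON" and lex < 0.35:
        return None  # person minimum string similarity
    score = float(np.dot(emb_a, emb_b))      # embeddings assumed L2-normalized
    score += 0.03 if lex >= 0.5 else -0.03   # lexical bonus/penalty (condition illustrative)
    if score >= 0.85:
        return score                         # auto-accept
    if score >= 0.65:
        return score                         # candidate: subject to string/category validation
    return None
```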
Link classification
Each match is classified into one of five types:
- Spelling differences within or across languages (eſtomago / estomago)
- Translation equivalents (canela / cinnamon)
- Related but not identical concepts
- Substance-to-preparation relationships
- Disputed or uncertain matches
Concordance Clustering
Pairwise matches are assembled into concordance clusters using connected component analysis. Each cluster represents a single real-world referent as it appears across multiple books and languages.
Cluster construction
- All cross-book matches form a graph; connected components become candidate clusters
- The primary entity (highest mention count) becomes the canonical name
- Every other member must have a direct edge (similarity ≥ 0.84) to the primary — this prevents chaining artifacts where A→B→C creates a false A→C link
- Substring matching serves as an additional confirmation signal (one name containing the other)
- Members failing validation are removed; clusters shrinking to a single book are dissolved
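A minimal sketch of the validation rule, assuming a precomputed entity-to-entity similarity lookup; how the substring signal combines with the similarity check is simplified here, and field names are illustrative:

```python
def validate_cluster(members: list[dict], sim: dict) -> list[dict]:
    """Keep members with a direct edge (similarity >= 0.84) to the primary entity."""
    primary = max(members, key=lambda m: m["mention_count"])
    kept = [primary]
    for m in members:
        if m is primary:
            continue
        direct = sim[(primary["id"], m["id"])]
        substring = (primary["name"].lower() in m["name"].lower()
                     or m["name"].lower() in primary["name"].lower())
        # Substring containment acts as extra confirmation; the 0.75 fallback is illustrative
        if direct >= 0.84 or (substring and direct >= 0.75):
            kept.append(m)
    # Dissolve clusters that shrink to a single book after validation
    return kept if len({m["book"] for m in kept}) > 1 else []
```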
Near-duplicate merging
A post-processing step merges clusters that were split due to subcategory noise or minor orthographic differences. This catches splits like “cheiro” (categorized as QUALITY) vs “cheyro” (categorized as OTHER_CONCEPT) that should be one cluster.
| Parameter | Value | Notes |
|---|---|---|
| Levenshtein threshold | ≥ 0.83 | Normalized similarity between canonical names |
| PLACE threshold | ≥ 0.85 | Higher bar for places (Africa/Arica problem) |
| Shared books | ≥ 1 required | Unless names are identical |
| Same category | Required | No cross-category merges |
LLM Verification
Embedding-based matching inevitably produces some false positives. A verification pass uses an LLM to review “suspicious” clusters and remove members that don’t belong.
Suspicion heuristics
Rather than reviewing all 1,500+ clusters, the system flags those that exhibit patterns correlated with false matches:
- ≥ 3 entities from one source text in a single cluster
- Mean Levenshtein similarity to the canonical name < 0.4
- Any member with string similarity < 0.15 to the canonical name
- ≥ 6 members (large clusters may contain unrelated entities)
- Unique context words > 4× the context count
- Short names (< 5 chars) with similarity < 0.55
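A sketch of the flagging logic for the heuristics above; the unique-context-words check is omitted, and field names are illustrative:

```python
from collections import Counter

from rapidfuzz.distance import Levenshtein

def is_suspicious(cluster: dict) -> bool:
    """Return True if a cluster matches any pattern correlated with false merges."""
    canon = cluster["canonical_name"].lower()
    names = [m["name"].lower() for m in cluster["members"]]
    sims = [Levenshtein.normalized_similarity(canon, n) for n in names]
    per_book = Counter(m["book"] for m in cluster["members"])
    return (
        max(per_book.values()) >= 3                      # many members from one source text
        or sum(sims) / len(sims) < 0.4                   # low mean similarity to canonical name
        or min(sims) < 0.15                              # any member far from the canonical name
        or len(names) >= 6                               # large clusters may hide unrelated entities
        or any(len(n) < 5 and s < 0.55 for n, s in zip(names, sims))  # short, dissimilar names
    )
```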
LLM review
Flagged clusters are sent to Gemini 2.5 Flash Lite, which receives the cluster’s canonical name, all members with their book of origin and context excerpts, and must return a verdict: which members to keep, which to remove, and a brief justification. When the LLM identifies sub-groups within a cluster, it can split the cluster rather than simply removing members.
Enrichment & Identification
Verified clusters are enriched with modern identifications. Currently, an LLM identification pass provides modern names, descriptions, and Linnaean binomials. Wikidata entity linking is planned for a future iteration.
LLM identification
Clusters are batched (8 per API call) and sent to Gemini with their canonical name, category, top members, and context excerpts. The model returns a structured identification including: modern name, Linnaean binomial (for biological entities), type classification, temporal data, geographic associations, and a confidence level.
Wikidata linking (planned)
A planned enrichment step will have the LLM suggest a Wikidata search term for each cluster. This term will be used to query the Wikidata API, with results scored by domain relevance:
Boosted terms: disease, medical, plant, species, physician, explorer, city, mineral, drug, herb
Penalized terms: album, song, film, football, rapper, video game, TV series
Semantic glosses (planned)
A planned enrichment pass will generate 2–3 sentence “semantic glosses” for each cluster — thematic descriptions grounded in how the entity appears in the source texts. Unlike encyclopedic descriptions, these glosses will capture historical context: e.g. “Venomous snakes considered extremely dangerous in early modern medicine. Associated with poison, antidotes, theriac preparations, and fear.”
Enrichment coverage
Coverage is tracked as the share of clusters with a ground-truth identification, a modern name, a description, and a Linnaean binomial.
Wikidata linking and Wikipedia URL resolution are planned but not yet integrated into the current dataset. Semantic glosses are similarly pending a future enrichment pass.
Search Index
The final stage generates a semantic search index for the web interface. Each cluster’s identity is compressed into a rich text representation and embedded using a fast, general-purpose model.
Embedding text composition
The embedding text is constructed from multiple fields to maximize search recall:
- Canonical name (highest weight)
- Category and subcategory
- Semantic gloss (thematic description)
- Variant names from all members (up to 20)
- Modern name from ground truth
- Linnaean binomial
- Wikidata description
- Botanical family
- Source text context excerpts
| Parameter | Value | Notes |
|---|---|---|
| Model | OpenAI text-embedding-3-small | Fast, affordable semantic search |
| Dimensions | 512 | Matryoshka truncation from 1,536 |
| Batch size | 100 | Per API call |
| Index size | ~16 MB | 1,549 entries |
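A sketch of the index-embedding call using the OpenAI Python client; the model, dimensions, and batch size follow the table above, while the function name and batching loop are illustrative:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed_index(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    """Embed composed cluster texts for the search index, 100 per API call."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=texts[i : i + batch_size],
            dimensions=512,  # Matryoshka truncation from 1,536
        )
        vectors.extend(item.embedding for item in response.data)
    return vectors
```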
Hybrid search
The web search combines semantic similarity (cosine distance between query and cluster embeddings) with lexical matching (Levenshtein distance, substring matching, fuzzy matching against canonical names, variant names, modern names, and semantic glosses). This ensures that both conceptual queries (“exotic spice”) and exact-name queries (“Galeno”) return good results.
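A hedged sketch of how the semantic and lexical signals might be blended per index entry; the 0.6/0.4 weighting and field names are illustrative, not the interface's actual scoring:

```python
import numpy as np
from rapidfuzz import fuzz

def hybrid_score(query: str, query_vec: np.ndarray, entry: dict) -> float:
    """Combine cosine similarity with fuzzy lexical matching against an index entry."""
    semantic = float(np.dot(query_vec, entry["embedding"]))  # vectors assumed normalized
    names = [entry["canonical_name"], entry.get("modern_name", ""), *entry.get("variants", [])]
    lexical = max(fuzz.partial_ratio(query.lower(), n.lower()) / 100 for n in names if n)
    return 0.6 * semantic + 0.4 * lexical
```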
Edge Cases & Known Challenges
Early modern texts present problems that rarely arise in modern NLP. Below are the most significant challenges and how the pipeline addresses them.
Long-s and OCR artifacts
Many digitized texts preserve the long-s character (‘ſ’), producing surface forms like “eſtomago” for “estomago” and “emplaſtro” for “emplastro.” OCR sometimes renders long-s as ‘f’, creating forms like “fangue” for “sangue” (blood).
Solution: The fine-tuned embedding model learns to map these variants close together. The near-duplicate merge (Levenshtein ≥ 0.83 + shared books) catches residual splits. The pipeline preserves original surface forms while linking them to normalized clusters.
Cross-lingual scholarly names
Pre-modern scholars were known by Latinized names that vary dramatically across languages: Avicenna / Avicena / Auicena / Ibn Sina; Galen / Galeno / Galien / Galenus. Standard NER and string matching fail entirely on these.
Solution: The PERSON category uses a lower matching threshold (0.80 vs 0.84) and the fine-tuned model specifically learns Latin-to-vernacular name transformations. The training set includes approximately 100 such scholarly name pairs.
Ambiguous referents
“Mercury” could mean the planet, the Roman god, or the alchemical element (quicksilver). “Sapa” could be the ancient lead-sweetened grape must or a South American place name. The same word can have genuinely different referents in different texts.
Solution: Category and subcategory constraints prevent cross-domain confusion (Mercury the SUBSTANCE won’t merge with Mercury the PLACE). The LLM extraction prompt is tuned for historical and scientific context (1500–1900), and Wikidata scoring heavily penalizes modern pop-culture matches. When identification is genuinely contested, the enrichment system preserves a “note” field explaining the ambiguity.
Subcategory-driven cluster splitting
The embedding model appends subcategory to each entity name before embedding. This means “cheiro (quality)” and “cheyro (other_concept)” produce different embeddings even though they refer to the same thing — smell/scent in Portuguese.
Solution: The post-processing near-duplicate merge detects these splits by comparing canonical names with normalized Levenshtein distance while requiring shared books as confirmation. This corrected 29 such splits in the current dataset.
Short similar place names
Place names like Africa/Arica or Goa/Gao are short, lexically similar, and may appear in the same texts — but refer to entirely different locations. Standard Levenshtein thresholds would merge them.
Solution: The PLACE category receives a +0.02 threshold bump (0.85 effective vs 0.83 for other categories) in the near-duplicate merge. Combined with the shared-books requirement, this prevents all known false place merges while still catching legitimate variants like Mozambique/Moçambique.
Attractor entities
Without constraints, very common entities (like “water” or “fever”) can attract dozens of only vaguely related entities from other books, creating bloated, incoherent clusters.
Solution: The one-to-one matching constraint ensures each entity can match at most one entity per book. Cluster validation then requires every member to have a direct similarity edge to the primary entity — transitive chains (A→B→C without A→C) are broken.
Models & APIs
BAAI/bge-m3 (fine-tuned v3)
Cross-lingual entity matching, deduplication, concordance building
Fine-tuned on 889 curated pairs + 154 hard negatives across 5 batches. Open-source, runs locally.
Gemini 2.5 Flash Lite
Entity extraction, cluster verification, ground truth identification, and (planned) semantic glosses
Used for all LLM tasks. Low cost, fast, good at structured output.
OpenAI text-embedding-3-small
Semantic search index for the web interface
512-dimensional Matryoshka embeddings. Used only for search, not matching.
Wikidata API (planned)
Entity linking, descriptions, identifiers
Free API with domain-relevance scoring to avoid modern pop-culture matches.
Wikipedia REST API (planned)
Thumbnail images, article links
To be used at display time for entity detail pages.