Methodology

How the Premodern Concordance extracts, matches, verifies, and enriches named entities across multilingual early modern texts.

Approach

A hybrid pipeline combining large language models for entity extraction and verification with fine-tuned multilingual embeddings for cross-lingual matching. Each step is designed to be incremental, auditable, and resumable.

Scale

Eight texts spanning 1563–1890 in five languages (Portuguese, Spanish, Italian, English, French), producing over 2,100 concordance clusters from thousands of extracted entities with over 50,000 total mentions.

Challenges

Early modern texts present unstable orthography, OCR artifacts, cross-lingual naming conventions (Galeno/Galen/Galien), long-s characters misread by OCR, and ambiguous referents that resist standard NER approaches.

Corpus

The concordance currently draws on eight texts from the early modern period through the nineteenth century, concerned with natural knowledge — materia medica, botany, pharmacy, natural history, exploration, evolutionary biology, and psychology. These were selected for their overlapping subject matter and linguistic diversity.

Colóquios dos Simples e Drogas da Índia

García de Orta, 1563 · Portuguese

A pioneering pharmacological dialogue cataloguing the drugs, spices, and medicinal plants of India based on Orta’s decades of firsthand observation in Goa.

Historia Medicinal de las cosas que se traen de nuestras Indias Occidentales

Nicolás Monardes, 1574 · Spanish

A catalogue of New World medicinal substances—tobacco, sassafras, guaiacum, bezoar stones—written by a Seville physician who never crossed the Atlantic.

Ricettario Fiorentino

Collegio Medico di Firenze, 1597 · Italian

The official pharmacopoeia of Florence, standardising compound drug recipes and ingredient terminology for the city’s apothecaries.

The English Physitian

Nicholas Culpeper, 1652 · English

A vernacular herbal linking plants to astrological governance and Galenic humoral medicine, written to make pharmaceutical knowledge accessible beyond Latin.

Polyanthea Medicinal

João Curvo Semedo, 1741 · Portuguese

An encyclopaedic medical compendium blending Galenic, chemical, and empirical approaches, reflecting the eclectic pharmacy of early eighteenth-century Portugal.

Relation historique du voyage, Tome III

Alexander von Humboldt, 1825 · French

The third volume of Humboldt’s narrative of his American expedition, rich with observations on geography, botany, indigenous knowledge, and natural phenomena.

On the Origin of Species

Charles Darwin, 1859 · English

Darwin’s foundational text on evolution by natural selection, dense with references to species, varieties, geological formations, and the naturalists whose observations informed his theory.

The Principles of Psychology, Volume 1

William James, 1890 · English

James’s comprehensive treatise on the science of mind, covering sensation, perception, memory, emotion, and consciousness through both philosophical argument and physiological evidence.

Pipeline Overview

The concordance is built through an eight-stage pipeline. Each stage is a standalone Python script that reads from and writes to JSON, making the process fully auditable and individually re-runnable.

Incremental design. Every script supports checkpointing and can skip already-processed items on re-run. This means a single failed API call doesn’t require restarting the entire pipeline — just re-run the script and it picks up where it left off.

Stage 1 · Entity Extraction

Each source text is chunked and passed through a large language model for structured named entity recognition. The LLM identifies entities and classifies them into a controlled taxonomy of ten categories, each with specific subcategories.

Chunking strategy

Source texts are split into 2,500-character chunks with 200-character overlap. Breaks are paragraph-aware: the chunker looks for paragraph boundaries near the target split point to avoid cutting mid-sentence. EEBO-specific artifacts (page markers, marginal references) and OCR noise are cleaned before chunking.
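
A minimal sketch of this chunking strategy; the function name and the exact paragraph-boundary search are illustrative rather than the project's actual code:

```python
def chunk_text(text: str, size: int = 2500, overlap: int = 200) -> list[str]:
    """Split text into ~2,500-char chunks with ~200-char overlap, preferring paragraph breaks."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            # look for a paragraph boundary in the back half of the chunk
            brk = text.rfind("\n\n", start + size // 2, end)
            if brk != -1:
                end = brk
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = max(end - overlap, start + 1)  # overlap the next chunk, but always advance
    return chunks
```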

LLM extraction

Each chunk is sent to Gemini 2.5 Flash Lite with a structured prompt asking for entities in JSON format. The model operates at temperature 0.1 for near-deterministic output.

Parameter | Value | Notes
Model | Gemini 2.5 Flash Lite | Fast, cost-effective structured extraction
Temperature | 0.1 | Near-deterministic for consistent NER
Chunk size | 2,500 chars | With 200-char overlap
Max output tokens | 4,000 | Per chunk response
Rate limit | 0.3s between calls | Avoids API throttling
Checkpoint interval | Every 25 chunks | Resumable processing
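
A hedged sketch of the extraction loop using these parameters, assuming the google-generativeai Python SDK; the model id string, prompt wording, file paths, and the bare-JSON response assumption are all illustrative, not the project's actual code:

```python
import json, time, pathlib
import google.generativeai as genai

genai.configure(api_key="...")                              # supply a real key
model = genai.GenerativeModel("gemini-2.5-flash-lite")      # model id assumed
EXTRACTION_PROMPT = ("Extract named entities from this historical text as a JSON list of "
                     "{name, category, subcategory, context, variants} objects:\n\n")

chunks = chunk_text(pathlib.Path("source.txt").read_text())  # chunker sketched above
checkpoint = pathlib.Path("entities_checkpoint.json")
results = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}

for i, chunk in enumerate(chunks):
    if str(i) in results:                                    # resumable: skip processed chunks
        continue
    resp = model.generate_content(
        EXTRACTION_PROMPT + chunk,
        generation_config={"temperature": 0.1, "max_output_tokens": 4000},
    )
    results[str(i)] = json.loads(resp.text)                  # assumes the model returns bare JSON
    if i % 25 == 0:                                          # checkpoint every 25 chunks
        checkpoint.write_text(json.dumps(results, ensure_ascii=False))
    time.sleep(0.3)                                          # rate limit between calls

checkpoint.write_text(json.dumps(results, ensure_ascii=False))
```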

Entity taxonomy

PERSON: Authority, Scholar, Practitioner, Patron
PLANT: Herb, Tree, Root, Seed, Resin
ANIMAL: Mammal, Bird, Fish, Insect, Reptile, Product
ANATOMY: Organ, Brain Region, Tissue, Body Part, Fluid
SUBSTANCE: Mineral, Chemical, Preparation
PLACE: Country, City, Region
DISEASE: Acute, Chronic, Psychological, Symptom
CONCEPT: Theory, Practice, Quality, School of Thought, Mental Faculty
ORGANIZATION: Institution, Journal, Society
OBJECT: Instrument, Vessel, Tool

Each category also includes a catch-all “Other” subcategory for entities that don’t fit neatly into a specific subcategory. The original six-book corpus primarily produced eight of these categories; Anatomy and Organization emerged as distinct categories after adding nineteenth-century scientific texts (Darwin, James) with their richer anatomical and institutional vocabularies.

Per-entity output

For each entity the model returns: the surface form name, category, subcategory, a short context string (up to 10 words), and variant spellings found in the same chunk. After extraction, a separate script locates full-text excerpts (150 characters of surrounding context per mention) using regex-based string matching across the source text.
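
The excerpt-location step can be sketched as a small regex scan; the function name and the exact window handling are illustrative:

```python
import re

def find_excerpts(text: str, surface_form: str, window: int = 150) -> list[str]:
    """Return every mention of a surface form with surrounding context."""
    pattern = re.compile(re.escape(surface_form), re.IGNORECASE)
    excerpts = []
    for m in pattern.finditer(text):
        start = max(0, m.start() - window)
        end = min(len(text), m.end() + window)
        excerpts.append(text[start:end].replace("\n", " ").strip())
    return excerpts

# e.g. find_excerpts(source_text, "Galeno") -> one context snippet per mention
```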

Stage 2 · Within-Book Deduplication

LLM extraction produces many duplicate or near-duplicate entities within a single text (e.g. “Galeno,” “galeno,” “GALENO”). These are merged using embedding similarity and graph-based clustering.

Process

  1. Embed all entities within a book using the fine-tuned BGE-M3 model with category context appended to each name
  2. Compute pairwise cosine similarity between all entity embeddings
  3. Build a graph where edges connect entities above the merge threshold
  4. Find connected components via BFS — each component becomes a merged entity
  5. Validate each component: check that every member has sufficient similarity to the primary entity (highest-count member)
Parameter | Value | Notes
Merge threshold | 0.88 | General entities
Person threshold | 0.85 | Lower for person name variants
String similarity boost | −0.05 | If string similarity > 0.5
Minimum string similarity | 0.3 | Hard floor to prevent false merges
Safety checks. Short words (June/July, Body/Bones) receive extra scrutiny via edit-distance guards. For PERSON entities, a surname compatibility check prevents merging “Duarte Barbosa” with “Duarte Pacheco” just because both share a first name.
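
A compressed sketch of steps 1–4 above; the entity dict shape is assumed, the threshold comes from the table, and the extra safety checks are omitted:

```python
from collections import deque
import numpy as np
from sentence_transformers import SentenceTransformer

def dedupe(entities: list[dict], model_path: str, threshold: float = 0.88) -> list[list[dict]]:
    model = SentenceTransformer(model_path)                       # fine-tuned BGE-M3
    texts = [f'{e["name"]} ({e["category"]})' for e in entities]  # category context appended
    emb = model.encode(texts, normalize_embeddings=True)
    sim = emb @ emb.T                                             # pairwise cosine similarity
    n = len(entities)
    adj = [[j for j in range(n) if j != i and sim[i, j] >= threshold] for i in range(n)]
    seen, components = set(), []
    for i in range(n):                                            # connected components via BFS
        if i in seen:
            continue
        queue, comp = deque([i]), []
        seen.add(i)
        while queue:
            u = queue.popleft()
            comp.append(entities[u])
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(comp)                                   # one component = one merged entity
    return components
```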

Stage 3 · Model Fine-Tuning

Off-the-shelf multilingual embeddings struggle with early modern naming conventions, archaic spellings, and the non-obvious conceptual linkages central to historical concordance-building. We fine-tune BAAI/bge-m3 on curated training pairs using contrastive learning, iterating through multiple rounds of data curation and evaluation.

Training data curation

The training dataset was built through a multi-stage process. An initial automated extraction from verified concordance clusters produced ~6,800 positive pairs, but quality review revealed a ~14% error rate and heavy skew toward trivial cognates. Strict filtering (Levenshtein distance scoring, per-cluster caps, blocked pair lists) reduced this to ~800 pairs, which were then manually reviewed and supplemented with expert-curated batches targeting specific domains.

The final training file uses an additive batch format — each batch is independently contributed and can be appended without modifying existing entries. The current dataset comprises five batches totalling 889 positive pairs and 154 hard negatives:

Batch | Focus | Pos. | Neg.
1 | Cross-lingual entity matches (PT, ES, IT, FR, LA ↔ EN) | 713 | 91
2 | Temporal-conceptual science shifts (EN ↔ EN across centuries) | 40 | 8
3 | Psychology and mind science (EN, DE, FR) | 43 | 10
4 | Helmholtz, Janet, Galton, Agassiz (anticipated texts) | 48 | 25
5 | Cavendish, E. Darwin, Hartley, Blumenbach, historical disease names | 45 | 20
Total | | 889 | 154

Positive pairs teach the model that surface forms refer to the same entity, ranging from straightforward cross-lingual matches (canela ↔ cinnamon) to non-obvious conceptual concordances (Palo santo ↔ Guaiacum, Falling sickness ↔ epilepsy, vibratiuncles ↔ memory traces). Hard negatives teach the model to distinguish confusable terms: canfora ≠ canela, phrenology ≠ phenology, vis viva ≠ vis vitalis, hystérie ≠ hystérèse.

Quality over quantity. The curated 889-pair dataset significantly outperforms a 6,800-pair auto-mined dataset that contained ~14% erroneous matches. Wrong pairs actively teach the model incorrect associations. Curation criteria include: minimum Levenshtein distance of 0.25 (filtering trivial cognates), cross-lingual bonus scoring, per-cluster cap of 5 pairs (preventing mega-cluster dominance), and manual review of all pairs. 97.5% of pairs have a normalized string distance > 0.5, meaning the model must learn semantic equivalence rather than surface-form similarity.
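
A rough illustration of the filtering criteria, using difflib in place of whatever string-distance library the project uses; the pair dict shape is assumed:

```python
from collections import Counter
from difflib import SequenceMatcher

def filter_pairs(pairs: list[dict], min_distance: float = 0.25, per_cluster_cap: int = 5) -> list[dict]:
    kept, per_cluster = [], Counter()
    for p in pairs:                                   # each pair: {"a", "b", "cluster_id"} (assumed)
        ratio = SequenceMatcher(None, p["a"].lower(), p["b"].lower()).ratio()
        if 1.0 - ratio < min_distance:                # trivial cognates teach the model nothing
            continue
        if per_cluster[p["cluster_id"]] >= per_cluster_cap:
            continue                                  # prevent mega-cluster dominance
        per_cluster[p["cluster_id"]] += 1
        kept.append(p)
    return kept
```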

Training configuration

Parameter | Value | Notes
Base model | BAAI/bge-m3 (568M params) | XLM-RoBERTa backbone; Latin coverage confirmed via CC-100 pretraining
Loss function | MultipleNegativesRankingLoss | Contrastive learning with in-batch negatives
Epochs | 3 | Convergence confirmed; 1 epoch may suffice per Michail et al. 2025
Batch size | 16 | Each batch element serves as implicit negative for other pairs
Learning rate | 2e-5 | Standard for domain adaptation of pretrained transformers
Warmup steps | 100 | ~30% of total training steps
Training examples | 1,778 | 889 pairs × 2 (both directions)
Total steps | ~333 | 111 per epoch
Hardware | Google Colab T4 GPU | ~15 minutes total training time
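
A hedged sketch of the training run with these parameters, using the classic sentence-transformers fit() API; the output path and the two-pair stand-in dataset are illustrative:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# stand-in for the 889 curated positive pairs loaded from the batch file
positive_pairs = [("canela", "cinnamon"), ("Falling sickness", "epilepsy")]

model = SentenceTransformer("BAAI/bge-m3")
examples = []
for a, b in positive_pairs:                        # each pair added in both directions
    examples.append(InputExample(texts=[a, b]))
    examples.append(InputExample(texts=[b, a]))

loader = DataLoader(examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)  # other pairs in the batch act as negatives

model.fit(
    train_objectives=[(loader, loss)],
    epochs=3,
    warmup_steps=100,
    optimizer_params={"lr": 2e-5},
)
model.save("bge-m3-premodern-v3")
```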

Evaluation: A/B testing v2 vs. v3

To verify the fine-tuned model generalizes beyond its training data and does not overfit, we ran a comprehensive A/B test comparing the previous fine-tune (v2, trained on ~500 pairs) against the new fine-tune (v3, trained on 889 pairs + 154 hard negatives). The test used six evaluation suites comprising ~430 pair comparisons, including held-out concordance pairs explicitly excluded from training, novel cross-lingual pairs, and randomly sampled same-category negative pairs.

Test Suite | v2 | v3 | v3 wins | Result
Held-out positives (100) | .796 | .758 | 45/55 | Comparable
Same-category negatives (80) | .390 | .320 | 67/13 | v3 much better
Training data sample (60) | .534 | .759 | 55/5 | v3 learned well
Curated hard negatives (60) | .631 | .535 | 49/11 | v3 much better
Novel cross-lingual (20) | .844 | .873 | 12/8 | v3 better
Novel confusable neg. (10) | .601 | .534 | 8/2 | v3 better

The critical metric is separation: the gap between average positive similarity and average negative similarity on held-out data. V3 achieved a separation of 0.438 compared to v2’s 0.406, confirming that the new model generalizes well and is not overfitting to training data.

Before and after: base model vs. fine-tuned

Comparing the unmodified BGE-M3 base model against the v3 fine-tune on the full evaluation set shows the scale of improvement:

Base model separation: −0.064 (negatives rated higher than positives)

Fine-tuned separation: +0.292 (clear positive/negative gap)

Total improvement: +0.356 (separation swing)

The base BGE-M3 model had negative separation on our data — it rated confusable non-matches (canfora/canela, phrenology/phenology) higher than genuine cross-lingual equivalences (cão danado/Rabies, effluvium/radiation). After fine-tuning, the model correctly separates these distributions. Notable spot-check improvements include:

Pair | Type | Base | v3 | Δ
effluvium ↔ radiation | Temporal concept | .192 | .762 | +.570
doçura ↔ sweetness | PT→EN | .309 | .855 | +.546
Dormideras ↔ Poppy | ES→EN | .246 | .749 | +.503
cão danado ↔ Rabies | PT→EN | .277 | .707 | +.431
Seelenlehre ↔ psychology | DE→EN | .493 | .720 | +.227
mesmerism ↔ hypnosis | Temporal concept | .474 | .746 | +.272
gengibre ≠ genebra | Hard negative | .634 | .273 | −.362
canela ≠ canola | Hard negative | .905 | .658 | −.247
Key insight. The fine-tuned model learns domain-specific patterns invisible to general-purpose embeddings: non-obvious conceptual concordances across centuries (mesmerism → hypnosis, Falling sickness → epilepsy), cross-lingual botanical and medical terminology in archaic forms (cão danado → Rabies, Dormideras → Poppy), and the critical ability to reject surface-similar confusables (gengibre ≠ genebra, canela ≠ canola, phrenology ≠ phenology). The approach aligns with recent findings by Michail et al. (2025) on adapting multilingual embeddings to historical Luxembourgish, which demonstrated that contrastive fine-tuning with MultipleNegativesRankingLoss can improve historical text retrieval accuracy by up to 43 percentage points — and that training data quality matters more than base model selection.

Stage 4 · Cross-Book Matching

With entities deduplicated and the embedding model fine-tuned, every entity across all books is embedded and compared pairwise. This is the core operation that discovers cross-lingual correspondences.

Matching rules

  • Same category only: PERSON matches PERSON, PLANT matches PLANT — no cross-category matches
  • Subcategory compatibility: when both entities have valid subcategories, they must match
  • One-to-one constraint: each entity can match at most one entity per book, preventing “attractor” entities
  • String similarity integration: lexical similarity provides a ±0.03 bonus/penalty to the embedding score
Parameter | Value | Notes
Auto-accept threshold | 0.85 | Direct match, no further checks
Candidate threshold | 0.65 | Requires string/category validation
Same-language orthographic | 0.90 | Higher bar for same-language pairs
Cross-language same referent | 0.80 | Lower bar for translations
Person minimum string sim | 0.35 | Prevents semantically-similar but different people
OTHER_CONCEPT minimum | 0.92 | Abstract concepts require higher confidence
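
A sketch of how one candidate pair might be screened under these rules; the ±0.03 bonus rule, the entity dict shape, and the three-way verdict are interpretations, and the one-to-one constraint and same-/cross-language bars are applied elsewhere in the pipeline:

```python
from difflib import SequenceMatcher

def classify_match(e1: dict, e2: dict, emb_sim: float) -> str:
    """Return 'accept', 'candidate', or 'reject' for one cross-book pair."""
    if e1["category"] != e2["category"]:
        return "reject"                                        # same category only
    if e1.get("subcategory") and e2.get("subcategory") and e1["subcategory"] != e2["subcategory"]:
        return "reject"                                        # subcategory compatibility
    string_sim = SequenceMatcher(None, e1["name"].lower(), e2["name"].lower()).ratio()
    if e1["category"] == "PERSON" and string_sim < 0.35:
        return "reject"                                        # person minimum string similarity
    score = emb_sim + (0.03 if string_sim > 0.5 else -0.03)    # lexical bonus/penalty (assumed rule)
    if score >= 0.85:
        return "accept"                                        # auto-accept, no further checks
    if score >= 0.65:
        return "candidate"                                     # needs string/category validation
    return "reject"
```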

Link classification

Each match is classified into one of five types:

orthographic_variant: spelling differences within or across languages (eſtomago / estomago)

same_referent: translation equivalents (canela / cinnamon)

conceptual_overlap: related but not identical concepts

derivation: substance-to-preparation relationships

contested_identity: disputed or uncertain matches

Stage 5 · Concordance Clustering

Pairwise matches are assembled into concordance clusters using connected component analysis. Each cluster represents a single real-world referent as it appears across multiple books and languages.

Cluster construction

  1. All cross-book matches form a graph; connected components become candidate clusters
  2. The primary entity (highest mention count) becomes the canonical name
  3. Every other member must have a direct edge (similarity ≥ 0.84) to the primary — this prevents chaining artifacts where A→B→C creates a false A→C link
  4. Substring matching serves as an additional confirmation signal (one name containing the other)
  5. Members failing validation are removed; clusters shrinking to a single book are dissolved
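
A sketch of steps 2–5 above; the member dict shape and the `sim` helper are assumed, and treating the substring check as an alternative acceptance route is an interpretation:

```python
def validate_cluster(members: list[dict], sim, min_edge: float = 0.84) -> list[dict]:
    """Keep only members with a direct edge to the primary; dissolve single-book clusters."""
    primary = max(members, key=lambda m: m["mention_count"])   # canonical entity
    kept = [primary]
    for m in members:
        if m is primary:
            continue
        substring = (m["name"].lower() in primary["name"].lower()
                     or primary["name"].lower() in m["name"].lower())
        if sim(m, primary) >= min_edge or substring:           # no chaining through intermediaries
            kept.append(m)
    books = {m["book"] for m in kept}
    return kept if len(books) > 1 else []                      # single-book clusters are dissolved
```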

Near-duplicate merging

A post-processing step merges clusters that were split due to subcategory noise or minor orthographic differences. This catches splits like “cheiro” (categorized as QUALITY) vs “cheyro” (categorized as OTHER_CONCEPT) that should be one cluster.

Parameter | Value | Notes
Levenshtein threshold | ≥ 0.83 | Normalized similarity between canonical names
PLACE threshold | ≥ 0.85 | Higher bar for places (Africa/Arica problem)
Shared books | ≥ 1 required | Unless names are identical
Same category | Required | No cross-category merges
Why shared books matter. Requiring at least one shared book is the primary safeguard against false merges. Two entities that appear in the same text and were not matched during cross-book matching are likely genuinely different. This single criterion eliminated false positives like cabras/cobras and Africa/Arica.
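
The merge test for two candidate clusters might look like the following sketch, with difflib standing in for normalized Levenshtein similarity and the cluster dict shape assumed:

```python
from difflib import SequenceMatcher

def should_merge(c1: dict, c2: dict) -> bool:
    if c1["category"] != c2["category"]:
        return False                                          # no cross-category merges
    name1, name2 = c1["name"].lower(), c2["name"].lower()
    if name1 == name2:
        return True                                           # identical names always merge
    if not set(c1["books"]) & set(c2["books"]):
        return False                                          # at least one shared book required
    threshold = 0.85 if c1["category"] == "PLACE" else 0.83   # higher bar for places
    return SequenceMatcher(None, name1, name2).ratio() >= threshold
```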

Stage 6 · LLM Verification

Embedding-based matching inevitably produces some false positives. A verification pass uses an LLM to review “suspicious” clusters and remove members that don’t belong.

Suspicion heuristics

Rather than reviewing all 1,500+ clusters, the system flags those that exhibit patterns correlated with false matches:

Multi-entry from same book: ≥ 3 entities from one source text in a single cluster

Low average string similarity: mean Levenshtein to canonical name < 0.4

Outlier member: any member with string similarity < 0.15 to canonical

Large cluster: ≥ 6 members may contain unrelated entities

Divergent contexts: unique context words > 4× context count

Short low-similarity names: names under 5 characters with similarity < 0.55
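
An illustrative version of the flagging logic; field names and the string-similarity helper are assumptions, and the divergent-contexts check is omitted:

```python
from difflib import SequenceMatcher

def is_suspicious(cluster: dict) -> bool:
    canonical = cluster["canonical_name"].lower()
    members = cluster["members"]
    sims = [SequenceMatcher(None, m["name"].lower(), canonical).ratio() for m in members]

    per_book = {}
    for m in members:
        per_book[m["book"]] = per_book.get(m["book"], 0) + 1

    return any([
        max(per_book.values()) >= 3,                          # multi-entry from same book
        sum(sims) / len(sims) < 0.4,                          # low average string similarity
        min(sims) < 0.15,                                     # outlier member
        len(members) >= 6,                                    # large cluster
        any(len(m["name"]) < 5 and s < 0.55                   # short low-similarity names
            for m, s in zip(members, sims)),
    ])
```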

LLM review

Flagged clusters are sent to Gemini 2.5 Flash Lite, which receives the cluster’s canonical name, all members with their book of origin and context excerpts, and must return a verdict: which members to keep, which to remove, and a brief justification. When the LLM identifies sub-groups within a cluster, it can split the cluster rather than simply removing members.

Stage 7 · Enrichment & Identification

Verified clusters are enriched with modern identifications. Currently, an LLM identification pass provides modern names, descriptions, and Linnaean binomials. Wikidata entity linking is planned for a future iteration.

LLM identification

Clusters are batched (8 per API call) and sent to Gemini with their canonical name, category, top members, and context excerpts. The model returns a structured identification including: modern name, Linnaean binomial (for biological entities), type classification, temporal data, geographic associations, and a confidence level.

Wikidata linking (planned)

A planned enrichment step will have the LLM suggest a Wikidata search term for each cluster. This term will be used to query the Wikidata API, with results scored by domain relevance:

Good signals (+1 each): disease, medical, plant, species, physician, explorer, city, mineral, drug, herb

Bad signals (−5 each): album, song, film, football, rapper, video game, TV series

Why domain scoring matters. “Mercury” should resolve to the alchemical element, not the rock band or the planet. “Sapa” should find the lead-based sweetener, not the modern Indonesian city. Heavy negative penalties for pop-culture descriptions keep the results anchored in early modern natural knowledge.
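
The planned scoring could be as simple as the following sketch; the signal lists mirror the examples above and are not exhaustive:

```python
GOOD_SIGNALS = ["disease", "medical", "plant", "species", "physician",
                "explorer", "city", "mineral", "drug", "herb"]
BAD_SIGNALS = ["album", "song", "film", "football", "rapper",
               "video game", "tv series"]

def domain_score(description: str) -> int:
    """Score a Wikidata result description by domain relevance."""
    text = description.lower()
    score = sum(1 for word in GOOD_SIGNALS if word in text)
    score -= sum(5 for word in BAD_SIGNALS if word in text)
    return score

# domain_score("species of flowering plant") == 2   ("plant" + "species")
# domain_score("2014 studio album")          == -5
```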

Semantic glosses (planned)

A planned enrichment pass will generate 2–3 sentence “semantic glosses” for each cluster — thematic descriptions grounded in how the entity appears in the source texts. Unlike encyclopedic descriptions, these glosses will capture historical context: e.g. “Venomous snakes considered extremely dangerous in early modern medicine. Associated with poison, antidotes, theriac preparations, and fear.”

Enrichment coverage

73% with ground truth · 61% with modern name · 61% with description · 8% with Linnaean name

Wikidata linking and Wikipedia URL resolution are planned but not yet integrated into the current dataset. Semantic glosses are similarly pending a future enrichment pass.

Stage 8 · Search Index

The final stage generates a semantic search index for the web interface. Each cluster’s identity is compressed into a rich text representation and embedded using a fast, general-purpose model.

Embedding text composition

The embedding text is constructed from multiple fields to maximize search recall:

  1. Canonical name (highest weight)
  2. Category and subcategory
  3. Semantic gloss (thematic description)
  4. Variant names from all members (up to 20)
  5. Modern name from ground truth
  6. Linnaean binomial
  7. Wikidata description
  8. Botanical family
  9. Source text context excerpts
Parameter | Value | Notes
Model | OpenAI text-embedding-3-small | Fast, affordable semantic search
Dimensions | 512 | Matryoshka truncation from 1,536
Batch size | 100 | Per API call
Index size | ~16 MB | 1,549 entries

Hybrid search

The web search combines semantic similarity (cosine distance between query and cluster embeddings) with lexical matching (Levenshtein distance, substring matching, fuzzy matching against canonical names, variant names, modern names, and semantic glosses). This ensures that both conceptual queries (“exotic spice”) and exact-name queries (“Galeno”) return good results.
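
A minimal sketch of the score combination at query time; the 60/40 weighting, the index entry shape, and the omission of the OpenAI call that embeds the query are all assumptions:

```python
from difflib import SequenceMatcher
import numpy as np

def hybrid_search(query: str, query_emb: np.ndarray, index: list[dict], top_k: int = 10) -> list[dict]:
    results = []
    for entry in index:                 # entry: {"name", "embedding", "variants", ...} (assumed)
        semantic = float(np.dot(query_emb, entry["embedding"]))        # embeddings pre-normalized
        names = [entry["name"], *entry.get("variants", [])]
        lexical = max(SequenceMatcher(None, query.lower(), n.lower()).ratio() for n in names)
        if any(query.lower() in n.lower() for n in names):
            lexical = max(lexical, 1.0)                                # substring match counts fully
        results.append((0.6 * semantic + 0.4 * lexical, entry))
    results.sort(key=lambda r: r[0], reverse=True)
    return [entry for _, entry in results[:top_k]]
```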

Edge Cases & Known Challenges

Early modern texts present problems that rarely arise in modern NLP. Below are the most significant challenges and how the pipeline addresses them.

Long-s and OCR artifacts

Many digitized texts preserve the long-s character (‘ſ’), producing surface forms like “eſtomago” for “estomago” and “emplaſtro” for “emplastro.” OCR sometimes renders long-s as ‘f’, creating forms like “fangue” for “sangue” (blood).

Solution: The fine-tuned embedding model learns to map these variants close together. The near-duplicate merge (Levenshtein ≥ 0.83 + shared books) catches residual splits. The pipeline preserves original surface forms while linking them to normalized clusters.

Cross-lingual scholarly names

Pre-modern scholars were known by Latinized names that vary dramatically across languages: Avicenna / Avicena / Auicena / Ibn Sina; Galen / Galeno / Galien / Galenus. Standard NER and string matching fail entirely on these.

Solution: The PERSON category uses a lower matching threshold (0.80 vs 0.84) and the fine-tuned model specifically learns Latin-to-vernacular name transformations. The training set includes approximately 100 such scholarly name pairs.

Ambiguous referents

“Mercury” could mean the planet, the Roman god, or the alchemical element (quicksilver). “Sapa” could be the ancient lead-sweetened grape must or a South American place name. The same word can have genuinely different referents in different texts.

Solution: Category and subcategory constraints prevent cross-domain confusion (Mercury the SUBSTANCE won’t merge with Mercury the PLACE). The LLM extraction prompt is tuned for historical and scientific context (1500–1900), and Wikidata scoring heavily penalizes modern pop-culture matches. When identification is genuinely contested, the enrichment system preserves a “note” field explaining the ambiguity.

Subcategory-driven cluster splitting

The embedding model appends subcategory to each entity name before embedding. This means “cheiro (quality)” and “cheyro (other_concept)” produce different embeddings even though they refer to the same thing — smell/scent in Portuguese.

Solution: The post-processing near-duplicate merge detects these splits by comparing canonical names with normalized Levenshtein distance while requiring shared books as confirmation. This corrected 29 such splits in the current dataset.

Short similar place names

Place names like Africa/Arica or Goa/Gao are short, lexically similar, and may appear in the same texts — but refer to entirely different locations. Standard Levenshtein thresholds would merge them.

Solution: The PLACE category receives a +0.02 threshold bump (0.85 effective vs 0.83 for other categories) in the near-duplicate merge. Combined with the shared-books requirement, this prevents all known false place merges while still catching legitimate variants like Mozambique/Moçambique.

Attractor entities

Without constraints, very common entities (like “water” or “fever”) can attract dozens of only vaguely related entities from other books, creating bloated, incoherent clusters.

Solution: The one-to-one matching constraint ensures each entity can match at most one entity per book. Cluster validation then requires every member to have a direct similarity edge to the primary entity — transitive chains (A→B→C without A→C) are broken.

Models & APIs

BAAI/bge-m3 (fine-tuned v3)

Cross-lingual entity matching, deduplication, concordance building

Fine-tuned on 889 curated pairs + 154 hard negatives across 5 batches. Open-source, runs locally.

Stages 2, 4, 5

Gemini 2.5 Flash Lite

Entity extraction, cluster verification, ground truth identification, semantic glosses

Used for all LLM tasks. Low cost, fast, good at structured output.

Stages 1, 6, 7

OpenAI text-embedding-3-small

Semantic search index for the web interface

512-dimensional Matryoshka embeddings. Used only for search, not matching.

Stage 8

Wikidata API

Entity linking, descriptions, identifiers

Free API with domain-relevance scoring to avoid modern pop-culture matches.

Stage 7

Wikipedia REST API

Thumbnail images, article links

Used at display time for entity detail pages.

Web UI