Julia Efremova
PhD candidate at Eindhoven University of Technology

Consider a person named Theodorus Werners, born in Tilburg on August 11th, 1861. He married Maria van der Hagen in 1888. Their child, Maria Eugenia Johanna Werners, was born in Tilburg in October 1894. Two years after the child’s birth, the family bought a house in Breda. Theodorus died in Breda on September 1st, 1926. Each of these pieces of information might have been recorded in a structured document such as a birth, marriage or death certificate, or in a free-text document such as a notarial act. However, due to changes in spelling conventions, misspellings, data conversion and data loss, linking the name references that belong to the same entity, a task known as Entity Resolution (ER), is a long-standing open challenge.


Introduction


The starting point of this research project is the large collection of historical documents maintained by the Brabant Historical Information Center. A document can be anything from a scan of a birth or death certificate, a memory of succession, or a tax declaration, to a social photograph or family picture. At present, the documents in this collection are tagged by source and subject. Based on these tags, researchers can use keyword search to find documents relevant to their research (either a scan or a pointer to a physical location). This database, however, is far from flawless: many names are duplicated, have several alternative spellings, or contain mistakes. Furthermore, important semantic links such as the parent-child relation are only implicitly available, which makes even simple tasks, such as finding out whether two given persons are related, very labor intensive.
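To make the problem of alternative spellings concrete, the following is a minimal sketch of how name references could be compared with a generic string-similarity score. It is an illustration only and not the method used in this project: the function names `normalize`, `name_similarity` and `candidate_matches`, the variant spellings in the test names, and the threshold of 0.8 are all assumptions chosen for the example, and the score comes from `difflib.SequenceMatcher` in the Python standard library.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase a name and collapse runs of whitespace (hypothetical helper)."""
    return " ".join(name.lower().split())


def name_similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] between two name references."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def candidate_matches(query: str, names: list[str], threshold: float = 0.8):
    """Return (name, score) pairs scoring at least `threshold`, best first.

    The threshold of 0.8 is an arbitrary value chosen for illustration.
    """
    scored = [(n, name_similarity(query, n)) for n in names]
    kept = [(n, s) for n, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Invented spelling variants, used only to exercise the functions.
    references = ["Theodorus Wernerts", "Theodoor Werners", "Maria van der Hagen"]
    for name, score in candidate_matches("Theodorus Werners", references):
        print(f"{name}: {score:.2f}")
```

A score of this kind only proposes candidate pairs. Deciding whether two references really denote the same person would also need the contextual evidence found in the documents, such as dates, places and family relations.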


Project overview


This project addresses the problem of deriving the identities of persons and social structures from large sets of genealogical data that are available as text and photographs and contain incomplete information. To do so, we investigate and deploy a combination of techniques from data mining, machine learning and human computation. The project goals are (a) a semantically enriched and cleaned version of the current database of the Brabant Historical Information Center (BHIC); (b) advanced search tools to support historical research; and (c) automatic tools to support large-scale prosopographical research.