Digging into Data: the research work plan

Gathering and gazetting results & building tools

1. The first stage of the project is to mine the EE correspondence dataset (letters together with associated documents and extensive annotation) for geographical data, i.e. place names at every level from country down to personal address. This information occurs in multiple locations: it may record where a letter was written, where it was sent and the route it took, and it also appears as geographical references within the texts of the letters and their annotations.

2. From the collated data EEP will build diachronic and multilingual thesauri of word-forms. At present, location names exist in EE in up to nine languages (English, Dutch, French, German, Greek, Italian, Latin, Russian and Spanish), and come from historical documents spanning three centuries (17th to 19th). The thesauri will thus include period-specific abbreviations and colloquialisms for locations across Europe, Asia, the Americas and Oceania in a significant range of European languages. The existing location metadata, currently structured to city level, will be used to analyze the existing data for occurrences of place-name elements, and the results will be used to build a table of geographical token words. Systems applying Soundex and its variants (stemming the "key" linguistic term for each location "digital object") will then be used to create additional fields of token variants and stems.
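
The Soundex step can be illustrated with a minimal Python sketch. It assumes the standard American Soundex rules and a simple accent-folding step; the function names `fold` and `soundex` are illustrative and are not part of any EE system. Soundex was designed for English surnames, so Greek or Russian forms would need transliteration (or a different variant) before a key could be generated.

```python
import unicodedata

# Standard American Soundex digit groups.
_CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for ch in letters:
        _CODES[ch] = digit


def fold(name):
    """Strip accents and keep only the letters A-Z, upper-cased (an assumption)."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed.upper() if "A" <= c <= "Z")


def soundex(name):
    """Return the four-character Soundex key, or "" if no Latin letters remain."""
    letters = fold(name)
    if not letters:
        return ""
    key = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate consonants with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated code is kept after it
    return (key + "000")[:4]


# Group spelling variants under a shared key, as a "token variants" field might.
variants = {}
for form in ["Leiden", "Leyden", "Geneva", "Genève"]:
    variants.setdefault(soundex(form), []).append(form)
print(variants)  # {'L350': ['Leiden', 'Leyden'], 'G510': ['Geneva', 'Genève']}
```

Keys of this kind are coarse, so unrelated places can share one; they are better treated as a way of proposing candidate variants than as proof that two forms name the same place.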

3. The results of this analysis will be used to build a "crawler" that recurses through the EE dataset and identifies instances matching our tokens. The project will need to implement a concordance structure (at sentence and paragraph level) that captures a fixed number of words before and after each match; this context can potentially be used to refine the match for accuracy and disambiguation.
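
A concordance of this kind could be sketched as follows. This is a minimal Python illustration, assuming plain-text input with paragraphs separated by blank lines and single-word tokens; the function name `concordance`, the `width` parameter and the sample sentence are invented for the example and do not come from the EE data.

```python
import re

WORD = re.compile(r"\w+")  # Unicode-aware word matcher for str input


def concordance(text, tokens, width=5):
    """Return each token match with up to `width` words of context on each side.

    Context windows never cross a paragraph boundary.
    """
    wanted = {t.casefold() for t in tokens}
    hits = []
    for p_index, paragraph in enumerate(re.split(r"\n\s*\n", text)):
        words = WORD.findall(paragraph)
        for i, word in enumerate(words):
            if word.casefold() in wanted:
                hits.append({
                    "paragraph": p_index,
                    "left": words[max(0, i - width):i],
                    "match": word,
                    "right": words[i + 1:i + 1 + width],
                })
    return hits


# Invented sample text, for illustration only.
sample = "I left Leyden on Tuesday and reached Paris.\n\nMy brother writes from Geneva."
for hit in concordance(sample, ["Leyden", "Paris", "Geneva"], width=2):
    print(hit["paragraph"], hit["left"], hit["match"], hit["right"])
```

Words such as "left" and "reached" in the left-hand context are the kind of evidence a later step could use to decide whether a match is really a place name. Multi-word names would need a phrase-level matcher, which this sketch omits.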

4. The finished EE token list will be mapped against standardized authority lists for geographical names, such as the Getty Thesaurus of Geographic Names (TGN), in order to provide a public gazetteer of locations. (EEP will submit any new or altered data to TGN through its contributions scheme, the Getty Vocabularies Program.) This will enable the project to build and test methods, gazetteers and tools that allow users to identify, define and link more data from EE and other datasets. The results will also be fed into Improvise and used for overlay mapping (static and dynamic systems).
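
The mapping step might look something like the Python sketch below. It uses a tiny in-memory stand-in for an authority list: the record layout and the identifiers `auth-001` and `auth-002` are placeholders invented for illustration, not real TGN records or IDs, and no TGN API is called. Tokens with no match are collected separately, as candidates for the new or altered data mentioned above.

```python
import unicodedata


def normalize(name):
    """Fold case and strip diacritics so that 'Genève' and 'GENEVE' compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold().strip()


# Hypothetical authority records; the IDs are placeholders, not TGN identifiers.
AUTHORITY = [
    {"id": "auth-001", "preferred": "Leiden",
     "variants": ["Leyden", "Lugdunum Batavorum"]},
    {"id": "auth-002", "preferred": "Geneva",
     "variants": ["Genève", "Genf"]},
]


def build_index(records):
    """Map every normalized name form to the IDs of the records that list it."""
    index = {}
    for record in records:
        for name in [record["preferred"], *record["variants"]]:
            index.setdefault(normalize(name), []).append(record["id"])
    return index


def map_tokens(tokens, index):
    """Split tokens into those found in the authority index and those not found."""
    matched, unmatched = {}, []
    for token in tokens:
        ids = index.get(normalize(token))
        if ids:
            matched[token] = ids
        else:
            unmatched.append(token)
    return matched, unmatched


matched, unmatched = map_tokens(["Leyden", "GENÈVE", "Utopia"], build_index(AUTHORITY))
print(matched)    # {'Leyden': ['auth-001'], 'GENÈVE': ['auth-002']}
print(unmatched)  # ['Utopia']
```

Each name maps to a list of IDs rather than a single ID, so that an ambiguous name shared by several places yields every candidate and leaves the choice to a later disambiguation step.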

5. The following elements are envisaged:

6. The outcomes listed above will be made publicly available. By their nature they are open to extension chronologically and linguistically, and can be used as templates for the analysis and mapping of a wide range of other data.

