Identifying intertextual relationships in large-scale digital text collections

Project Status Report

During Michaelmas term of 2011, I worked closely with Nicholas Cronk, Director of the Voltaire Foundation, on a small data set of Voltaire’s “signatures”, i.e., the numerous and quite varied ways in which Voltaire closed his letters over his considerable epistolary corpus (some 15,000 letters in all as author). Thanks to structured encoding in the EE database, we were able to analyse the different “Names and Games” that Voltaire used with his various interlocutors, identifying several of the most prevalent, sometimes playful, signatory patterns (the Bertrand/Raton series of letters to d’Alembert and later Condorcet, for example, inspired by La Fontaine’s fable Le Singe et le Chat). This work was presented at the “Naming, Re-naming, Un-naming in Early Modern and Enlightenment Culture” joint conference held at Oxford with Johns Hopkins University in late November 2011.

On the project side, having finally cleared all of the necessary (and some unnecessary) legal hoops and hurdles, I was given the full EE dataset at the beginning of Hilary term 2012. I spent most of the next few months converting the records in the tab-delimited table of EE letters into some 59,000 separate TEI files (one file per letter/entry), which I then gradually sorted (excluding empty letters and editorial stubs) and eventually loaded into PhiloLogic, the open-source full-text search and retrieval system developed by the ARTFL Project at the University of Chicago. Once the letters were loaded into a coherent PhiloLogic database (slightly modified to accommodate the now 55,575 documents), I could begin the sequence alignment process and analysis using the PhiloLogic data mining extension PAIR/PhiloLine (see our article on this approach, “Something Borrowed: Sequence Alignment and the Identification of Similar Passages in Large Text Collections,” Digital Studies / Le champ numérique 2.1, 2010).

Once the data was pre-processed (flattening accents and ignoring case, numbers, and overly frequent function words, for example), each document was converted into overlapping trigrams of content words (a trigram being an n-gram of three words). To take Rousseau’s incipit from the Social Contract as an example, “L’homme est né libre, et partout il est dans les fers” rendered as trigrams looks like: homme_libre_partout libre_partout_fers. These overlapping trigrams are then sorted globally by frequency and stored in a large table. Once this pre-processing was complete I could compare, relatively easily, the table of overlapping n-grams with itself (comparing all 55,575 letters to each other, in essence) and then with other similarly pre-processed databases I had built during the first months of the project. While the initial comparisons were in some respects “computationally expensive” (55,000 × 55,000 is just over 3 billion pairwise comparisons, for instance), the resulting aligned databases enabled me to identify four rudimentary types of shared passages:

Duplicate letters or passages: Given the makeup of the EE database — i.e., letters drawn from multiple critical editions — there are of course overlaps in the collection (letters to/from Voltaire and Rousseau, for instance, or between Locke and Boyle that exist in both the Rousseau/Locke correspondence and also that of Voltaire/Boyle). Using the alignment system I was thus able to identify many of these ‘duplicate’ letters, which will hopefully aid EE in expanding their versioning information for individual letters moving forward. A wonderful example of the editorial richness of these sorts of duplicates can be found in the famous letter from Voltaire to Rousseau of 30 August 1755, with several divergent (and wildly interesting) editorial commentaries by Ralph Leigh in his edition of Rousseau’s correspondence: Voltaire to Jean Jacques Rousseau, 30 August 1755 and Theodore Besterman’s version drawn from the Complete Works of Voltaire: Voltaire to Jean Jacques Rousseau, 30 August 1755.
Latin (and other) commonplaces: Another interesting (though not surprising) finding when comparing the mass of EE to itself was the preponderance of Latin and other classical literary commonplaces. While most of these passages are identified by their respective editors, with sequence alignment we can begin to think about a large-scale collection of classical commonplaces and then track their occurrence, use, and reuse throughout the database, tracing, as it were, the ways in which such passages were used and circulated in the Early Modern period. See, for example, instances of Horace’s quip “Nil desperandum Teucro duce et auspice Teucro” (Never despair, while Teucer is with you, your guide, your augur), the beginning of which (“Nil desperandum”) became a rallying cry of sorts for the 18th-century philosophes.
Literary text reuse: Moving out from the EE database itself, I then compared the letters with a selection of works (1,419 works from 1100 to 1825, a corpus of just over 73 million words) drawn from the ARTFL-Frantext French-language database at the University of Chicago. Here we can begin to trace the use and reuse of literary passages drawn from canonical works of French literature and then circulated for various reasons and by numerous authors around the 18th-century Republic of Letters. See, for instance, the use by Voltaire and his coterie of the Rabelaisian proverb “magis magnos clericos non sunt magis magnos sapientes” (the greatest clerks are never the wisest men) drawn from Gargantua:
- Voltaire to Jean Le Rond d’Alembert, 25 February 1758
- Voltaire to Jean Le Rond d’Alembert, 2 September 1758
- Voltaire to Elie Bertrand, 7 January 1760
- Voltaire to Claude Adrien Helvétius, 16 July 1760
- Voltaire to Claude Adrien Helvétius, 16 July 1760
- Voltaire to Etienne Noël Damilaville, 12 August 1763
- Jean Louis Wagnière to Etienne Noël Damilaville, 5 August 1767
- Voltaire to Jean Le Rond d’Alembert, 14 July 1773
- Voltaire to [unknown], 12 November 1773
Genetic textual variations: Finally, using the same comparison with EE and ARTFL-Frantext, I was able to identify instances where literary texts were circulated through the epistolary network in various forms while in the process of being edited: moments of genetic criticism, if you will. For example, in the month following the Lisbon earthquake of November 1755 Voltaire set about writing his now-famous Poème sur le désastre de Lisbonne, which first appears in the correspondence as early as 2 January 1756 (Beat Ludwig von May to Baron Albrecht von Haller - Friday, 2 January 1756). Here, the sequence alignment system has identified the final strophes of Voltaire’s text, but with significant variations in the ending: this was in fact a very early version of the text, one that Voltaire eventually found too pessimistic and changed during the spring of 1756. The variant ending (with shared sequences underlined) taken from the above letter:
Atome tourmenté sur cest amas de boue,
Que la mort engloutit et dont le sort se joue.
Mais Atome pensant, Atome dont les yeux,
Guidés par la pensée ont mesuré les Cieux:
Au sein de L’infini, Nous élançons Nostre Estre,
Sens pouvoir un moment Nous voir et Nous Conoitre.
Que faut yl O Mortel? Mortel, Il faut souffrir,
Se soumettre en silence, adorer et mourir.
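
The pre-processing and trigram steps described earlier can be illustrated with a minimal Python sketch. It is not the PAIR/PhiloLine implementation: the stopword list, the function names (flatten, content_words, trigrams, shared_trigram_pairs), and the use of a simple inverted index to find letters that share trigrams are all assumptions made for illustration, and the real system's tokenization and matching rules may well differ.

```python
import re
import unicodedata
from collections import defaultdict
from itertools import combinations

# Hypothetical, deliberately tiny stopword list (already accent-flattened).
# A real function-word list would be far longer.
STOPWORDS = {"l", "le", "la", "les", "un", "une", "de", "des", "et",
             "est", "ne", "il", "dans", "que"}

def flatten(text):
    """Lowercase the text and strip accents (e.g. 'né' becomes 'ne')."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def content_words(text):
    """Keep alphabetic tokens only (dropping numbers), minus function words."""
    tokens = re.findall(r"[a-z]+", flatten(text))
    return [t for t in tokens if t not in STOPWORDS]

def trigrams(text):
    """Return the overlapping trigrams of content words, joined by '_'."""
    words = content_words(text)
    return ["_".join(words[i:i + 3]) for i in range(len(words) - 2)]

def shared_trigram_pairs(docs):
    """Map each pair of document ids to the set of trigrams they share.

    Assumption: an inverted index (trigram -> document ids) stands in for
    the global trigram table, so only documents that actually share a
    trigram are ever paired.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    pairs = defaultdict(set)
    for gram, ids in index.items():
        for a, b in combinations(sorted(ids), 2):
            pairs[(a, b)].add(gram)
    return dict(pairs)

print(trigrams("L’homme est né libre, et partout il est dans les fers"))
# → ['homme_libre_partout', 'libre_partout_fers']
```

Note that flattening turns “né” into “ne”, which this sketch then treats as a function word; that is one plausible reading of why it is absent from the trigrams in the Rousseau example. Looking up shared trigrams in an index, rather than comparing every letter against every other letter directly, is one common way to keep a comparison on the order of billions of pairs tractable.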

For more on this process, and other text-mining approaches to this and other data sets, see my recent Cultures of Knowledge talk “Text Mining Electronic Enlightenment: Influence and Intertextuality in the 18th-Century Republic of Letters” (May 2012).

This coming year, I plan on generating and publishing some more substantive results concerning the above and various other experimental approaches to the EE database. In the meantime, however, I would like to conclude this brief report by underscoring the importance of digital collections such as Electronic Enlightenment and the need for the scholarly community to participate in and otherwise support their conception, realization, and continuation. What is striking when exploring the depths of such a large collection is not always what one finds, but also what isn’t there — Franklin, Jefferson, Condorcet, Turgot, the list goes on — lacunae that speak to the evolving and necessarily incomplete nature of many digital collections, growing steadily towards that impossible goal of exhaustiveness.

The danger in visualizations and other abstract models in dealing with resources such as EE (using so-called “distant reading” techniques) is that they tend to suggest a closed system or network wherein the data is represented definitively as the “Republic of Letters” (to take one famous example) or Voltaire’s “Epistolary Network,” in all their supposed finitude. The truth of the matter is that the data and metadata in EE are, and should remain for the foreseeable future, in flux, always in a state of improvement and expansion. And, while “macro-analytic” approaches tend to construct artificial (and on the surface, exhaustive) networks that minimize or mask the uncertainty in the data, it should be our goal to valorize this uncertainty: to keep building upon and adding to the varied networks and communication circuits we find in EE; to provide scholars with the means and access to add new letters and metadata to the system; and to cultivate a new Republic of Letters that takes its Early Modern predecessor as a point of departure, as a “network” considered not as an abstraction of metadata, but as an informational ideal (imperfect and impossible in many ways) that can nonetheless provide students and scholars of the Early Modern period with the tools and resources to explore and discover the textual and historical richness of such an evolving archive. In short, if pressed to answer the rather loaded question concerning Electronic Enlightenment, “What exactly do you do with 60,000 letters?” I would reply simply, “Keep adding more.”
