Prosopographies have in the past often taken decades or even centuries to produce. Even for a period with relatively few sources such as antiquity, hundreds of thousands of texts had to be collected and read, personal names had to be copied onto index cards, people had to be identified across sources, their relations had to be examined, and their lives had to be reconstructed.
Fortunately, in this digital age that enormous work can at least partially be automated. That is also a process SNAP is experimenting with: we aim not only to bring together prosopographies, just as the prosopographies themselves have brought together individuals in the sources; SNAP also wants to explore how to facilitate the creation of new prosopographies through Named Entity Recognition (NER). And this is where Leuven Ancient History and Trismegistos People come in.
In 2008 Trismegistos started a project to collect all personal names in Trismegistos Texts, at that time basically all published texts from Egypt between 800 BC and AD 800. This new Trismegistos People database could build on the Prosopographia Ptolemaica, a Leuven project which started in the late 1930s and which was already transformed into a database in the 1980s. Like its predecessor, Trismegistos People wanted to be multilingual, taking in not only Greek, but also Demotic and other Egyptian evidence.
As we foresaw, however, manual extraction of the personal names in the ‘old style’, now by typing information into database records rather than writing it on index cards, proved very time-consuming. Still, for a language like Demotic where no Open Access full text was (and is) available, it was the only way forward. As a result, not even half of all 15,000 texts has currently been done …
For Greek papyri, however, there was the Duke Database of Documentary Texts [DDbDP], which had been converted to Unicode just in 2008 and had been made available in the Papyrological Navigator [PN]. It was kindly put at our disposal, and it was to be our corpus for NER.
The Wikipedia article about NER states that ‘even state-of-the-art NER systems are brittle, meaning that NER systems developed for one domain do not typically perform well on other domains’. Well, the papyrological texts seemed like a completely new domain all right, with their diacritic marks, their sometimes fragmentary state, the case system of ancient Greek, and the for us aberrant onomastic system with father’s names instead of first names and family names. So in our innocent rashness we decided to develop something completely new ourselves.
This is not the place to go into details. Those who want to know more can read an article by Bart Van Beek and myself (Journal of Juristic Papyrology 39 (2009), p. 31-47, available here), where we describe the procedure we developed, which allowed us to deal with several hundred thousand attestations of Greek personal names.
What I want to focus on here is the new challenge in the form of Latin inscriptions. In the Europeana EAGLE project, Trismegistos is disambiguating the datasets of partners such as EDH, EDR, HispEpOl and EDB, and the full text of the inscriptions is going to be made available in Open Access. This again opens up exciting possibilities for SNAP, if through NER we can automate the collection of all attestations of personal names in this large corpus.
This time I could not call upon Jeroen Clarysse, who had cooperated with us to develop a NER tool in PHP. So I decided to go ahead myself in the system I know best: FilemakerPro. This may seem counterintuitive (some will no doubt use a different word), but if you want to express yourself, you just use the language you know best, and for me this is Filemaker. The challenges remain the same: identifying the named entities, in this case personal names, and extracting them in the best way possible with an eye to scholarly reuse.
The first problem for the Latin inscriptions was identifying personal names. No Open Access set of names was available, so we had to create that ourselves. Of course personal names are written with capitals, but so are place names, names of gods, and even the occasional book title. Not to mention Latin numbers, and in some datasets even unclear passages or the beginning of texts or sentences. Creating a set of Latin personal names on the basis of all capitalized words was thus the first task, and a quite time-consuming one.
For this, paradoxically, we were helped by a second problem, that of the Latin onomastic system. Latin, as all of you know, has an aberrant system in which people standardly have multiple names. According to the time period, they often use two or even three of the following: a praenomen such as Marcus, a nomen (gentilicium) such as Tullius, and a cognomen such as Cicero. On top of that, they often even add the name (mostly praenomen) of their father – or former master in the case of freedmen. And citizens can add their tribus, the voting district in which they were registered. This leads to identification clusters such as C(aius) Ofillius A(uli) f(ilius) Cor(nelia) Proculus or C(aio) Sextilio P(ubli) f(ilio) Vot(uria) Pollion[i. Once you get rid of the diacritics for abbreviations and restorations, these patterns actually help to identify words with a capital as personal names.
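To make that last step concrete, here is a minimal Python sketch of removing the editorial marks. It assumes the only marks are round brackets around resolved abbreviations and square brackets around restorations; real datasets use more conventions than that. The function name `strip_editorial_marks` is hypothetical, and this is an illustration, not the FileMaker routine actually used.

```python
import re

def strip_editorial_marks(text):
    """Drop the brackets ( ) [ ] but keep the letters they enclose."""
    return re.sub(r"[()\[\]]", "", text)

cluster = "C(aius) Ofillius A(uli) f(ilius) Cor(nelia) Proculus"
print(strip_editorial_marks(cluster))
# prints: Caius Ofillius Auli filius Cornelia Proculus
```

After this step the abbreviated praenomen, the filiation and the tribus all appear as plain words, which is what makes the pattern recognizable.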
The existence of long clusters of names to identify a single individual, including also non-capitalized words, implies that we had to focus on extracting these clusters for each text. This I did on the basic principle that each consecutive word which either has a capital or belongs to a set of core ‘identification-cluster’ words (including of course libertus, filius and the names of the voting districts) should be added to the cluster. This implies that you end up with clusters like C(aius) Ofillius A(uli) f(ilius) Cor(nelia) Proculus C(aio) Sextilio P(ubli) f(ilio) Vot(uria) Pollion[i.
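The basic principle can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the text is already stripped of editorial brackets, words are separated by whitespace, the `CORE_WORDS` set is a tiny hypothetical sample, and runs without any capitalized word are discarded. The sample sentence is made up for the occasion.

```python
# Hypothetical, very incomplete sample of core 'identification-cluster' words
CORE_WORDS = {"filius", "filio", "fili", "libertus", "liberto", "liberti"}

def extract_clusters(text, core_words=CORE_WORDS):
    """Group consecutive words that are capitalized or core words."""
    clusters, current = [], []
    for word in text.split():
        if word[:1].isupper() or word.lower() in core_words:
            current.append(word)
        else:
            if current:
                clusters.append(current)
            current = []
    if current:
        clusters.append(current)
    # Assumption: a run without a single capitalized word is not a name cluster
    return [c for c in clusters if any(w[:1].isupper() for w in c)]

sample = ("dis manibus Caius Ofillius Auli filius Cornelia Proculus "
          "Caio Sextilio Publi filio Voturia Pollioni fecit")
for cluster in extract_clusters(sample):
    print(" ".join(cluster))
```

As in the example above, the two individuals end up in one single cluster of twelve words, because nothing between them breaks the run.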
In a next step each of the constituent words needs to be analyzed: some are ‘linking words’ such as filius or the tribus-names Cornelia and Voturia, others are declined forms of a personal name, and yet others are ‘noise’ in the form of e.g. numbers. The case of the declined forms is essential for the interpretation of the cluster. In this example the cluster takes the form ‘nom nom gen filius tribus nom dat dat gen filius tribus dat’. This is then in a related database interpreted further as the identification of two individuals: one C(aius) Ofillius A(uli) f(ilius) Cor(nelia) Proculus (nom nom gen filius tribus nom) and one C(aio) Sextilio P(ubli) f(ilio) Vot(uria) Pollion[i (dat dat gen filius tribus dat).
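One way to picture this interpretation step is the following Python sketch. The heuristic is an assumption of mine, not a description of the FileMaker logic: a genitive directly before filius is taken as the father's name, linking words and noise never start a new person, and a new individual begins whenever the case of the person's own names changes. It will therefore not separate two consecutive individuals in the same case. The name `split_individuals` and the tag labels are hypothetical.

```python
LINKING_TAGS = {"filius", "tribus", "noise"}

def split_individuals(tagged):
    """Split a list of (word, tag) pairs into one list per individual."""
    individuals, current, current_case = [], [], None
    for i, (word, tag) in enumerate(tagged):
        next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else None
        is_father = tag == "gen" and next_tag == "filius"
        if tag in LINKING_TAGS or is_father:
            # Attach to the person being built; drop if no person has started
            if current:
                current.append((word, tag))
            continue
        if current_case is not None and tag != current_case:
            individuals.append(current)
            current = []
        current_case = tag
        current.append((word, tag))
    if current:
        individuals.append(current)
    return individuals

words = ("C(aius) Ofillius A(uli) f(ilius) Cor(nelia) Proculus "
         "C(aio) Sextilio P(ubli) f(ilio) Vot(uria) Pollion[i").split()
tags = "nom nom gen filius tribus nom dat dat gen filius tribus dat".split()
for person in split_individuals(list(zip(words, tags))):
    print(" ".join(word for word, _ in person))
```

For the example cluster this yields two groups of six words each, one in the nominative and one in the dative.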
In yet another database each individual identification is then split up and further standardized (e.g. by converting it to the nominative). In our example, the first identification is split up as identifying a person Caius Ofillius Proculus with a father Aulus (Ofillius) and belonging to the tribus Cornelia, and the second as a person Caius Sextilius Pollio with a father Publius (Sextilius) and registered in the tribus Voturia.
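As a rough illustration of this standardization, the Python sketch below turns one tagged identification into a small record. Everything in it is an assumption for the sake of the example: the `NOMINATIVE` lookup is a hypothetical five-entry table standing in for a full list of declined forms, the input is assumed to be already stripped of editorial brackets, and the record layout is invented rather than taken from the actual databases.

```python
# Hypothetical lookup from declined form to nominative; a real table is far larger
NOMINATIVE = {
    "Auli": "Aulus", "Publi": "Publius",
    "Caio": "Caius", "Sextilio": "Sextilius", "Pollioni": "Pollio",
}

def standardize(tagged):
    """Build a record with name, father and tribus from (word, tag) pairs."""
    name, father, tribus = [], None, None
    for i, (word, tag) in enumerate(tagged):
        next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else None
        if tag in ("filius", "noise"):
            continue
        if tag == "tribus":
            tribus = word
        elif tag == "gen" and next_tag == "filius":
            father = NOMINATIVE.get(word, word)
        else:
            # Forms missing from the lookup are kept as they are
            name.append(NOMINATIVE.get(word, word))
    return {"name": " ".join(name), "father": father, "tribus": tribus}

second = [("Caio", "dat"), ("Sextilio", "dat"), ("Publi", "gen"),
          ("filio", "filius"), ("Voturia", "tribus"), ("Pollioni", "dat")]
print(standardize(second))
# prints: {'name': 'Caius Sextilius Pollio', 'father': 'Publius', 'tribus': 'Voturia'}
```

The father's nomen (Sextilius) is not stated in the inscription, so the sketch leaves it out; supplying it is an inference that belongs to a later, scholarly step.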
At that stage, the information is ready to go into our database system, with the core database REF for all attestations of personal names, and separate databases for individuals (PER), and names, their variants, and their declined forms (NAM, NAMVAR, and NAMVARCASE).