Last week, Faith gave a great overview of some of the issues involved in describing the relationships between people. This week, I’m going to come at the problem from the other side, looking at what data we have, and how SNAP plans to represent them.
Our initial datasets include Trismegistos People (TM, described by Mark), the Lexicon of Greek Personal Names (LGPN, described by Sebastian), and a set of names (article headwords) from the Prosopographia Imperii Romani, 2nd edition (PIR²) put together by Tom Elliott. TM has web pages that document the references to names and people found in papyri, many of which are hosted at Papyri.info, as well as resources describing the names and persons; references, names, and persons all have unique identifiers. LGPN comes at the problem of modeling people from a different angle: it starts with persons and adds names and references; persons and names have unique identifiers. From PIR², we have only persons, each with a “principal” name and an identifying number (the article number) attached.
The process of identifying people from ancient evidence proceeds something like this: we can find references to persons in ancient documents, such as inscriptions and papyri. These references often name the persons they refer to but not always (for example, someone might be referred to only by their title). When these sorts of references are collected, it may be possible to detect patterns where the same person is named in multiple documents, and so to build up evidence about that person and/or about families. It may also be interesting and useful to look at patterns of names and ways of naming people across different places, times, languages, and types of document.
Because these names are in ancient languages, they may have different forms depending on their use in the sentence, and we want to record both those variances and the other kinds of variances you find in name spellings (think Jenny vs. Jennie vs. Jenni, for example). This means we want to be able to record both the name as it occurs in the text and also keep track of how that name varies and how it is applied to people. Of course, people can have more than one name, and different people can have the same name. And people are related to other people.
If we can circle back to texts for a moment, we need to note that these are not fixed and final texts, with a single reading and interpretation each. These are often difficult, damaged documents. They may be hard to read, and they are likely to be fragmentary. When these documents are published, the editor will often fill in missing pieces with their own restorations. They may be published multiple times by different editors with different interpretations and different restored text. Sometimes new fragments that go with existing published documents come to light, and the old publications have to be re-evaluated. A new editor might propose a different arrangement of fragments for a document. Other types of new evidence might cause a re-interpretation of existing texts. And, obviously, the names contained in these documents may be affected by any of these processes. So a reference to a name/person must be pegged to a particular edition of a document, and it’s quite possible that a new publication might contain a different reference, or might not contain it at all.
So we have editions of texts, which contain references to names, persons, and the relationships between persons. How are we going to model this? We’ve chosen to use RDF as the basis for merging data from participating SNAP projects, so as Faith explained, we’re dealing with a data structure which consists of linked triples, where each triple looks something like:
Resource => property => Resource or Literal
That is, a resource or entity (which is always identified by a URI) has a property (also always labeled with a URI) whose value is either another resource or a piece of data (called a literal). It’s a very simple but powerful structure. The subject of one triple can be the object of any number of other triples, and vice versa, so you can build up chains of related data using these triples. The structure has some limitations, though. In general RDF practice, properties are broadly shared, and resources are relatively unique. This means that if you want to attach extra information to any sort of relation, you will have to model that relation as a resource rather than a property. We can’t just say
[Person A] attestedBy [Text α]
[Person A] hasName 'Fred'
if we want to add any more detail to the reference. And we do: we need to be able to add information to the reference, such as who is responsible for it and what the actual text of the reference is. This means we have to model references more like this:
[Attestation 1] cites [Text α]
[Person A] hasAttestation [Attestation 1]
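To make the point concrete, here is a minimal Python sketch of an attestation treated as a resource in its own right. It is only an illustration: it uses plain tuples rather than a real RDF library, the example.org URIs are placeholders, and the responsibleEditor and textOfReference properties (and their values) are hypothetical stand-ins for the kind of extra detail mentioned above, not SNAP's actual vocabulary.

```python
from typing import List, NamedTuple, Union


class Resource(NamedTuple):
    """A node identified by a URI (placeholder URIs, not real SNAP identifiers)."""
    uri: str


class Triple(NamedTuple):
    subject: Resource
    predicate: str                # in real RDF the property is a URI as well
    obj: Union[Resource, str]     # another resource, or a literal (plain string)


person_a = Resource("http://example.org/person/A")
text_alpha = Resource("http://example.org/text/alpha")
attestation_1 = Resource("http://example.org/attestation/1")

graph: List[Triple] = [
    Triple(person_a, "hasAttestation", attestation_1),
    Triple(attestation_1, "cites", text_alpha),
    # Hypothetical extra properties: because the attestation is a resource,
    # it can be the subject of further triples.
    Triple(attestation_1, "responsibleEditor", "placeholder editor"),
    Triple(attestation_1, "textOfReference", "Fred"),
]


def properties_of(subject: Resource, triples: List[Triple]) -> dict:
    """Collect every property attached to one subject."""
    return {t.predicate: t.obj for t in triples if t.subject == subject}


def texts_attesting(person: Resource, triples: List[Triple]) -> list:
    """Follow the chain person -> attestation -> text."""
    attestations = [
        t.obj for t in triples
        if t.subject == person and t.predicate == "hasAttestation"
    ]
    return [
        t.obj for t in triples
        if t.subject in attestations and t.predicate == "cites"
    ]
```

Calling texts_attesting(person_a, graph) walks two triples to reach Text α, which is the cost of the extra node; properties_of(attestation_1, graph) shows what that node buys, namely a place to hang the editor and the text of the reference.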
Similarly, we can’t just say [Person A] hasName ‘Fred’, because we need to be able to treat names as complex entities, with variants, so we need to have
[Person A] hasName [Name א]
[Name א] primaryForm 'Frederick'
[Name א] variantForm 'Fred'
[Name א] variantForm 'Freddy'
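A short Python sketch shows what treating names as entities makes possible. Again, this is an assumption-laden illustration rather than SNAP's implementation: triples are plain (subject, predicate, object) tuples, the person: and name: identifiers are made up, and the second person and second name are added only to show shared and multiple names.

```python
# Triples as plain (subject, predicate, object) tuples.
# Strings with a "person:" or "name:" prefix stand in for resources (URIs);
# all other object strings are literals. All identifiers are placeholders.
triples = [
    ("person:A", "hasName", "name:1"),
    ("person:B", "hasName", "name:1"),   # different people can share a name
    ("person:A", "hasName", "name:2"),   # one person can have several names
    ("name:1", "primaryForm", "Frederick"),
    ("name:1", "variantForm", "Fred"),
    ("name:1", "variantForm", "Freddy"),
    ("name:2", "primaryForm", "Jenny"),
    ("name:2", "variantForm", "Jennie"),
]

FORM_PREDICATES = {"primaryForm", "variantForm"}


def names_with_form(form, triples):
    """Return the name entities that list `form` as a primary or variant form."""
    return {s for s, p, o in triples if p in FORM_PREDICATES and o == form}


def persons_named(form, triples):
    """Return the persons linked to any name entity that has the given form."""
    names = names_with_form(form, triples)
    return sorted(s for s, p, o in triples if p == "hasName" and o in names)
```

Because the variants hang off the name rather than the person, persons_named("Freddy", triples) and persons_named("Frederick", triples) find the same people, even though neither person is linked directly to either string.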
As Faith said last week, there is a real tension between keeping things simple and making them able to support the kinds of information we need to preserve. Another tension, which I’ve outlined above, is the one between the fractally complex humanistic intellectual processes which produce the identification of people in texts and the necessarily simple, machine-processable, data structures we have to use to represent them. Reconciling these kinds of tensions is one of the core problems of digital humanities.