Now that the SNAP project has started ingesting finalized data from the initial core datasets, it is time to think about how to bring in material from the other partners. For some, this will be easy, as they already make their data available in RDF form on the open web and simply need to follow the guidelines in the Cookbook. For others, quite a lot of work will be involved in getting SNAP-ready. This post describes some of the stages you may go through, and some of the problems that you may meet.
I have divided the work into six steps:
- Decide whether you have a set of names, a set of attestations, or a prosopography
- Data wrangling
- Identify your records
- Establish the identities online
- Transform the data
- Make the RDF available
The first step is the most critical – what does your data actually represent? If your starting point is a text, and you have extracted the personal names, you do not have prosopographical records yet, but rather a set of attestations of names in a text. This is in itself a good thing to do, but to get to the prosopography you have to decide how many actual individuals are referred to in the text – 10 occurrences of the name Marcus Aurelius may refer to between 1 and 10 persons (or one cat!).
If what you have is a set of attestations, what you probably want to do is contribute a slightly different form of data to the SNAP network, namely links from people to sources. This will mean deciding which person in SNAP your name (Aelius Florus natione Pann., for example) refers to, and then generating RDF which links the URL of the source text to the SNAP person.
The second task, data wrangling, can be quite laborious. It involves turning the list of names into a set of person records, and creating a single record for each person which (for example) lists all the different names they are attested under, and what the source of those attestations is. At this point you will also be assembling other information from your sources (place of birth, sex, profession etc). You will under some circumstances be comparing what you have with established authority lists of names or persons.
Now that we have a set of people, we come to the third task: identifying your records, that is, giving each person a unique identifier. It seems obvious, but if all you have been doing is analyzing the information in a spreadsheet, you may never have needed to.
The fourth minimum piece of work is to establish your identifiers online, i.e. available as URIs. Say you have a person called Soranus, to whom you have allocated the number 1001, and you are able to use the domain http://my.people.net/; you might decide that his public identifier is http://my.people.net/person/1001. At the least, make it so that this displays an HTML page about the person for anyone visiting that URL. The very simplest way to do this is to make an HTML file for each person, and arrange the web server so that a request for /person/1001 returns that web page (look at facilities for URL rewriting). This assumes, of course, that you have a web server you can use to put up files, with some reasonable prospect of that remaining in place for the coming decades or longer. Larger institutions may have a repository which can do this job for you, and even assign Digital Object Identifiers (DOIs) to your data.
If you can go further, and map your permanent urls to queries against a database (where, for example, http://my.people.net/person/1001 is turned behind the scenes into http://my.people.net/query.php?id=1001, and retrieves the data from a relational database), you will have a more powerful (but harder to sustain) resource. At this point you can consider having your web server do content negotiation, and return different formats of data in response to Accept headers (technical details at http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html) so that humans read HTML and computers read RDF.
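To make the negotiation step concrete, here is a minimal Python sketch. The MIME types listed are common RDF serialisations (an assumption on my part; check what your tools actually emit), and a production server should also honour q-values as described in the RFC linked above:

```python
def preferred_format(accept_header):
    """Crude content negotiation: return 'rdf' for clients whose Accept
    header requests an RDF serialisation, and 'html' for everyone else.
    Ignores q-values, which a real server should respect."""
    rdf_types = {"application/rdf+xml", "text/turtle", "application/ld+json"}
    # Strip any ";q=..." parameters and whitespace from each entry
    accepted = [part.split(";")[0].strip() for part in accept_header.split(",")]
    return "rdf" if any(mime in rdf_types for mime in accepted) else "html"
```

Wired into your server, this lets humans browsing to /person/1001 see HTML while a harvester sending Accept: text/turtle gets RDF from the same URI.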
The fifth stage is transforming your data to the RDF format which SNAP wants, as explained in the cookbook. This may involve an XSLT transform of XML data (RDF can be represented in XML), or a Python script reading a database, or an output template in your publication system. Incidentally, if you’re wondering how to get data out of an Excel spreadsheet, the OxGarage can turn the sheet into XML for you.
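As an illustration of what such a transform might look like, here is a hedged Python sketch that turns one spreadsheet row into Turtle. The predicates (lawd:Person, dc:publisher, foaf:name) follow the patterns used elsewhere on this blog, and the prefix declarations are assumed to be written once at the top of the output file; your own mapping will differ:

```python
def person_to_turtle(base, pid, name, publisher):
    """Emit a minimal lawd:Person description for one spreadsheet row.
    Assumes @prefix lines for lawd, dc and foaf already head the file."""
    uri = "<%sperson/%s#this>" % (base, pid)
    return (
        "%s a lawd:Person ;\n"
        "    dc:publisher <%s> ;\n"
        '    foaf:name "%s" .\n' % (uri, publisher, name)
    )

# Looping over your rows and concatenating the results gives a Turtle file
# ready for validation and submission.
print(person_to_turtle("http://my.people.net/", 1001, "Soranus",
                       "http://my.people.net"))
```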
If you’ve got this far and created some RDF suitable for submitting to SNAP, that’s great. Even better, you may want to move on to the sixth stage: making your RDF available on the web, so that SNAP can harvest it at intervals and keep up to date. This stage means looking at running your own RDF triple store. Setting this sort of software up yourself isn’t so easy, but you may wish to read up on open-source projects like OntoWiki, Sesame, and Fuseki. Maybe one of us will blog more about this in future.
In a change from our usual programming we will be bringing you a blog post directly from the Ontologies for Prosopography workshop at Digital Humanities 2014.
The workshop hopes to bring prosopographers and prosopographical datasets together (preferably through LOD).
Datasets in the following areas were represented at the workshop:
- Greek/Byzantine Manuscripts in Sweden
- Apprenticeships in Early Modern Venice
- Late Antique Immigration from Rome
- Roman worlds
- Imperial cult priests in North Africa and Syria
- European University history from the Middle Ages to the present
- C17/18 Tibetan monasteries in China and Tibet
- Fictional People
- Cuneiform tablets, from Hellenistic Babylonia
- Contemporary legal texts
- Estonian folklore
- Mayan hieroglyphics
- C20 Canadian literary publications and magazines
- History of Beer and Brewing
- Avant-garde periodicals
- Archival descriptions of records
- German National Biography
- C19/20 Church records looking at migration
- Lexicon of Greek Personal Names
- Canadian Gay Liberation 1960 – 1980
- Encyclopedia Virginia
- French Literature (including historical, mythic and fictional persons)
- Early (Pre-1900s) Caribbean
- Correspondence networks
- C19 Mexican authors
During the first SNAP:DRGN workshop a breakout group was convened to discuss the potential API for the project. Rather than come up with a specific API during that session, we instead focused on creating a “wish list” of applications and functions that we wanted to support. We were then able to abstract the functions that would be needed to support the list.
The most vital functions support querying the dataset to retrieve the original id of an entity and all the information about that entity. Since the reference and provider information can be extracted from the general data about an entity, individual functions for specific data are not strictly necessary; testing will be needed to see whether they help reduce the load on the server, since they are much more directed and thus require less overhead.
|getChildIds(id)||takes SNAP id, returns ids of entity/entities that the current entity has been derived from|
|getSnapIds(id)||takes partner id, returns list of ids of entity/entities that are derived from the current entity|
|getLiveSnapIds(id)||takes partner id, returns list of ids of non-deprecated entity/entities that are derived from the current entity|
|getPersonInfo(id, ?content_type)||takes id, returns full details of the entity with that id in the format specified (‘rdf’ or ‘json’). If no content_type is specified then the function will return RDF|
|getReferences(id)||takes id, returns a list of all the references associated with the given entity|
|getProviders(id)||takes id, returns the list of data provider(s)/project(s) associated with the given entity|
These functions will support the main internal functions which will be used to populate the main body of the website:
|getPersonPage(personId)||takes id, returns html page with details of the entity with that id|
|getProjectPage(projectId)||takes dataset/project identifier, returns html page with the information about that dataset/project|
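None of this API exists yet, of course, but to give a flavour of how a client might eventually call it, here is a sketch that builds a request URL for getPersonInfo. The base URL is entirely hypothetical:

```python
from urllib.parse import urlencode

# Hypothetical base URL -- the wish-list API above is not yet implemented.
BASE = "http://snapdrgn.net/api"

def person_info_url(snap_id, content_type=None):
    """Build the request URL for getPersonInfo. Per the table above,
    omitting content_type means the service would default to RDF."""
    params = {"id": snap_id}
    if content_type:
        params["content_type"] = content_type
    return "%s/getPersonInfo?%s" % (BASE, urlencode(params))

print(person_info_url(1236, "json"))
```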
The remaining API functions that were identified have been divided by priority. Higher priority has been given to functions that seem likely to be most useful to users, that will be used by other functions, or that will support immediate issues such as disambiguation and basic filtering. Priority has also been given to functions that rely on information currently supported within the ingested data, or which support the promotion of our partner projects.
|getBaseIds(id)||takes a SNAP id, returns earliest id of entity/entities that the current entity has been derived from i.e. it returns the external URIs that the entity was ultimately derived from no matter how long the chain of derivation|
|getDatasetInfo(dataset, content_type)||takes dataset identifier, returns full details of the dataset as per data-info in the format specified (‘rdf’ or ‘json’)|
|getDates(id)||takes id, returns the list of associated date value(s) for the given entity|
|getNames(id)||takes id, returns the list of name values for the given entity|
|getNames(id, lang)||takes id, returns the list of name values that match the specified language code for the given entity|
|getPlaces(id)||takes id, returns the list of associated place value(s) for the given entity|
|getEntitiesByRef(reference)||takes reference, returns a list of all the entities that point to the given reference|
|getEntityNetwork(id)||takes id, returns ids of any entities that are either derived from the given entity or that the given entity is derived from or are asserted to be the same entity|
|getRelatedEntities(id)||takes id, returns list of ids of related entities and how they are related|
|getEntitiesByName(name)||takes name value, returns list of entities with an associated name value that exactly matches the given value.|
|getEntitiesBySource(source)||takes source value, returns list of entities referenced within that source document|
|getRelation(id, relationship)||takes id, returns list of entities that have the given relationship to the given entity|
|getRelationships(id1, id2)||returns the direct relationship (if there is one) from id1 to id2|
|getEntitiesByPublisher(publisher)||takes publisher identifier, returns list of entities associated with the given provider|
|getEntitiesByDate(date)||takes date value, returns list of entities with an associated date value that matches or contains the given value|
|getEntitiesByPlace(place)||takes place value, returns list of entities with an associated place value that matches or contains the given value|
|getCountByPublisher(publisher)||takes publisher identifier, returns count of entities from that publisher|
|getCountByName(name)||takes name, returns count of entities with that exact name|
|getCountByPlace(place)||takes place, returns count of entities associated with that place|
|getCountByRef(reference)||takes reference, returns count of entities associated with the reference|
|getCountBySource(source)||takes source, returns count of entities associated with that source document|
|getTemporalRange(publisher)||takes publisher identifier, returns temporal range of dataset as given in dataset description|
|getGeoRange(publisher)||takes publisher identifier, returns geographical range of dataset as given in dataset description|
Future work, probably beyond the scope of the immediate pilot project, will expand the available functions, both in the public API and in the internal systems. In addition to the functions listed below, the existing functions that return lists of entities will be expanded with optional parameters allowing the returned result to be filtered by date range and (potentially) place.
|getRelationshipPath(id1, id2)||returns the sequence of relationship links (if there is one) that joins id1 to id2|
|getRelatedNames(name)||takes name, returns list of names that are specified as variants to the given name|
|getEntitiesByDate(tpq, taq)||takes start and end date values, returns list of entities with an associated date value that matches or falls within the given period|
|getEntitiesByApproxName(name)||takes name value, returns list of entities with an associated name value that matches or closely matches the given value|
|getEntitiesByAltName(name)||takes name value, returns list of entities with an associated name value that is an alternate of the given value|
|getAssertionAuthority(id)||takes id, returns the given authority (may be original publisher) that identified the entity as an entity|
|isDeprecated(id)||takes id, returns true if entity has been marked as deprecated|
|getCurrent(replaced-id)||takes id of replaced entity and returns the list of ids of entities that have replaced it|
|getAgreementMetric(id)||returns a value for the level of agreement on the existence of a given entity, as determined by conflicting assertions|
|getCertainty(id, authority)||returns the certainty value for a given authority for a given assertion (id)|
Beyond the basic querying functions, the highest priority will be to add functions to support making assertions about existing entities. These assertions will result in the creation of new entities, and as such each assertion will need to have an identifier attached for the person making the claim. Optional values will allow the asserter to specify their level of certainty and to give a reference to a resource which backs up their statement. The primary assertions (that the entities specified in the first argument do or do not represent co-references) will be implemented initially, but as the assertion system is finalised this list will be expanded and refined.
|coRefAssertion(entity-list, authority, ?certainty, ?reason)||make a co-reference assertion|
|notCoRefAssertion(entity-list, authority, ?certainty, ?reason)||make an assertion that two entities are not co-references|
|deprecateEntity(old-id, replacement-id)||annotates the entity identified by the old-id as deprecated and adds a pointer to the replacement entity as defined by replacement-id.|
Internally, restricted functions intended to support the addition of new datasets and allow more automation of the data ingestion system will be added. This will allow us to streamline the ingestion workflow and more easily add data from partner projects:
|addDataset(data-info-uri, data-uri)||ingests RDF from the source given by data-uri, and the RDF description of the dataset from the file indicated by data-info-uri|
|addDatasetDescription(data-info-uri)||adds description of the dataset to the triple store|
|addDatasets(data-info-uri, data-uri-list)||ingests RDF from each source in data-uri-list, and the RDF descriptions of the datasets from the file indicated by data-info-uri|
Finally, purely internal functions to query externally held linked data will allow us to connect with datasets that have not been directly ingested. These datasets may be outside the scope of SNAP, for example place rather than person data, or may represent prosopography datasets which have been published and hosted in a compatible format by other projects with just the minimum information (snap identifier and reference) held locally:
|lookupEntity(uri)||query external datasource for RDF relating to entity as specified by URI|
|lookupSource(uri)||query external datasource for information on/text of the source at the given URI|
|lookupPlace(uri)||query external datasource, e.g. Pelagios, for RDF information on place as specified by URI|
The “wish list” identified in the workshop is detailed below. The list is not presented in any particular order and was conceived as purely aspirational.
- Lightweight (popup) widgets for embedding in external websites
- Data to support visualisations
- Find mappings (that is, given one identifier, find all coreferencing identifiers)
- Correlate identifier to provider
- Filter identifiers by a variety of criteria
- date (created / modified)
- subjectOf assertions (depends on precise character of assertions)
- objectOf assertions (depends on precise character of assertions)
- Read information about entity
- Retrieve information from providers
- Retrieve information/text from source (where available online)
- Search on name – get result or disambiguation page
- Language independent as much as possible
- The Fuzzy Person (Entity and closely related entities)
- The Fuzzy Name (name variations)
- match hinting
- name variation (soundex, name variation…)
- Assertion creation
- must allow external systems to publish assertions
- public identification needed to link assertion to who did it
- Dataset ingestion
- single dataset
- bulk datasets
- Trust/Authority analysis on assertions
- certainty assertions
- agreement assertions
- Prosopography/Project Summaries
- Temporal span
- Geographic span
- EAC Suggested Archive Descriptions:
- Reference Code (Required)
- Name and Location of Repository (Required)
- Title (Required)
- Date (Required)
- Extent (Required)
- Name of Creator(s) (Required, If Known)
- Administrative/Biographical History (Optimum)
- Conditions Governing Access (Required)
- Physical Access (Added Value)
- Technical Access (Added Value)
- Conditions Governing Reproduction and Use (Added Value)
- Languages and Scripts of the Material (Required)
- Custodial History (Added Value)
- Immediate Source of Acquisition (Added Value)
- Appraisal, Destruction, and Scheduling Information (Added Value)
- Related Materials Elements
- Existence and Location of Originals (Added Value)
- Existence and Location of Copies (Added Value)
- Related Archival Materials (Added Value)
- Publication Note (Added Value)
- Notes Element
- Description Control Element
- List by Source/Reference, Project
- List by Project
- Language translation
- Latin transliteration if not provided
- Greek from Betacode?
- Other languages?
- Partner projects mark entities as deprecated/changed
- Web query – web response
- RDF dump
- SPARQL Endpoint
- Other formats by demand later
Prosopographies have in the past often taken decades or even centuries to produce. Even for a period with relatively few sources such as antiquity, hundreds of thousands of texts had to be collected and read, personal names had to be copied onto index cards, people had to be identified across sources, their relations then had to be examined, and their lives had to be reconstructed.
Fortunately, in this digital age that enormous work can at least partially be automated. That is also a process SNAP is experimenting with: we do not only aim to bring together prosopographies, just as the prosopographies themselves have brought together individuals in the sources; SNAP also wants to explore how to facilitate the creation of new prosopographies through Named Entity Recognition (NER). And this is where Leuven Ancient History and Trismegistos People come in.
In 2008 Trismegistos started a project to collect all personal names in Trismegistos Texts, at that time basically all published texts from Egypt between 800 BC and AD 800. This new Trismegistos People database could build on the Prosopographia Ptolemaica, a Leuven project which started in the late 1930s and which had already been transformed into a database in the 1980s. Like its predecessor, Trismegistos People wanted to be multilingual, taking in not only Greek, but also Demotic and other Egyptian evidence.
As we foresaw, however, manual extraction of the personal names in the ‘old style’, now by typing information into database records rather than writing it on index cards, proved very time-consuming. Still, for a language like Demotic, where no Open Access full text was (and is) available, it was the only way forward. As a result, not even half of all 15,000 texts have currently been processed …
For Greek papyri, however, there was the Duke Database of Documentary Texts [DDbDP], which had just in 2008 been converted to Unicode and had been made available in the Papyrological Navigator [PN]. This was kindly put at our disposal, and this was to be our corpus for NER.
The Wikipedia article about NER states that ‘even state-of-the-art NER systems are brittle, meaning that NER systems developed for one domain do not typically perform well on other domains’. Well, the papyrological texts seemed like a completely new domain all right, with their diacritic marks, their sometimes fragmentary state, the case system of ancient Greek, and the (for us) aberrant onomastic system with fathers’ names instead of first names and family names. So in our innocent rashness we decided to develop something completely new ourselves.
This is not the place to go into details. Those who want to know more can read an article by Bart Van Beek and myself (Journal of Juristic Papyrology 39 (2009), p. 31-47, available here), where we describe the procedure we developed, which allowed us to deal with several hundred thousand attestations of Greek personal names.
What I want to focus on here, is the new challenge in the form of Latin inscriptions. In the Europeana EAGLE project, Trismegistos is disambiguating the datasets of partners such as EDH, EDR, HispEpOl and EDB, and the full text of the inscriptions is going to be made available in Open Access. This again opens up exciting possibilities for SNAP, if through NER we can automate the collection of all attestations of personal names in this large corpus.
This time I could not call upon Jeroen Clarysse, who had cooperated with us to develop a NER tool in PHP. So I decided to go ahead myself in the system I know best: FileMaker Pro. This may seem counterintuitive (some will no doubt use a different word), but if you want to express yourself, you just use the language you know best, and for me this is FileMaker. The challenges remain the same: identifying the named entities, in this case personal names, and extracting them in the best way possible with an eye to scholarly reuse.
The first problem for the Latin inscriptions was identifying personal names. No Open Access set of names was available, so we had to create that ourselves. Of course personal names are written with capitals, but so are place names, names of gods, and even the occasional book title. Not to mention Roman numerals, and in some datasets even unclear passages or the beginnings of texts or sentences. Creating a set of Latin personal names on the basis of all capitalized words was thus the first task, and a quite time-consuming one.
For this, paradoxically, we were helped by a second problem, that of the Latin onomastic system. Latin, as all of you know, has an aberrant system in which people standardly have multiple names. According to the time period, they often use two or even three of the following: a praenomen such as Marcus, a nomen (gentilicium) such as Tullius, and a cognomen such as Cicero. On top of that, they often add the name (mostly the praenomen) of their father – or former master, in the case of a freedman. And citizens can add their tribus, the voting district in which they were registered. This leads to identification clusters such as C(aius) Ofillius A(uli) f(ilius) Cor(nelia) Proculus or C(aio) Sextilio P(ubli) f(ilio) Vot(uria) Pollion[i. Once you get rid of the diacritics for abbreviations and restorations, these patterns actually help to identify capitalized words as personal names.
The existence of long clusters of names identifying a single individual, including also non-capitalized words, implies that we had to focus on extracting these clusters for each text. This I did on the basic principle that each consecutive word which either has a capital or belongs to a set of core ‘identification-cluster’ words (including of course libertus, filius and the names of the voting districts) should be added to the cluster. This implies that you end up with clusters like C(aius) Ofillius A(uli) f(ilius) Cor(nelia) Proculus C(aio) Sextilio P(ubli) f(ilio) Vot(uria) Pollion[i.
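The basic principle can be sketched in a few lines of Python. The core word set below is a small assumed subset (the real list also includes the voting-district names and various declined and abbreviated forms), and the example assumes diacritics and brackets have already been stripped:

```python
# Small assumed subset of the core 'identification-cluster' word set;
# the full set also includes voting-district names and more forms.
CORE_WORDS = {"filius", "filio", "fili", "f", "libertus", "liberto", "l"}

def extract_clusters(tokens):
    """Group consecutive tokens that are capitalized or belong to the
    core identification-word set into name clusters."""
    clusters, current = [], []
    for tok in tokens:
        if tok[:1].isupper() or tok.lower() in CORE_WORDS:
            current.append(tok)
        else:
            # A word that is neither capitalized nor a core word
            # terminates the current cluster.
            if current:
                clusters.append(current)
            current = []
    if current:
        clusters.append(current)
    return clusters

text = "Caius Ofillius Auli filius Cornelia Proculus fecit sibi et suis"
print(extract_clusters(text.split()))
```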
In the next step each of the constituent words needs to be analyzed: some are ‘linking words’ such as filius or the tribus-names Cornelia and Voturia, others are declined forms of a personal name, and yet others are ‘noise’ in the form of e.g. numbers. The case of the declined forms is essential for the interpretation of the cluster. In this example the cluster takes the form ‘nom nom gen filius tribus nom dat dat gen filius tribus dat’. This is then in a related database interpreted further as the identification of two individuals: one C(aius) Ofillius A(uli) f(ilius) Cor(nelia) Proculus (nom nom gen filius tribus nom) and one C(aio) Sextilio P(ubli) f(ilio) Vot(uria) Pollion[i (dat dat gen filius tribus dat).
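The exact splitting rule is not spelled out above, but one plausible heuristic handles the example cluster: start a new individual whenever a name appears in a case that differs from the current head name's case, unless it is a genitive (a father's name) or a linking word. This rule is an assumption on my part, a sketch rather than the actual FileMaker logic:

```python
# Linking words carry no case of their own in this simplified model.
LINKERS = {"filius", "tribus"}

def split_identifications(pattern):
    """Split a cluster's case pattern into one pattern per individual.
    A new individual starts at a name whose case differs from the
    current head's case, except genitives (fathers' names) and linkers."""
    groups, current, head = [], [], None
    for tok in pattern:
        if tok in LINKERS or tok == "gen" or head is None or tok == head:
            current.append(tok)
            if head is None and tok not in LINKERS and tok != "gen":
                head = tok  # first real name fixes the head case
        else:
            groups.append(current)
            current, head = [tok], tok
    if current:
        groups.append(current)
    return groups

pattern = "nom nom gen filius tribus nom dat dat gen filius tribus dat".split()
print(split_identifications(pattern))
```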
In yet another database each individual identification is then split up and further standardized (e.g. by converting it to the nominative). The identifications above are split up as identifying a person Caius Ofillius Proculus, with a father Aulus (Ofillius) and belonging to the tribus Cornelia, and a person Caius Sextilius Pollio, with a father Publius (Sextilius) and registered in the tribus Voturia.
At that stage, the information is ready to go into our database system, with the core database REF for all attestations of personal names, and separate databases for individuals (PER), and names, their variants, and their declined forms (NAM, NAMVAR, and NAMVARCASE).
We’ve been discussing lately how to merge person records in SNAP, so that when we encounter partner projects that each have a record for the same person, SNAP can provide a useful service by combining those into single, merged records, and we can start to get an idea of the requirements for performing operations like merges on our data. This discussion has proved something of a rabbit hole.
In any digital project there is always a temptation to plan for and build things that you think you may need later, or that might be nice to have, or that might help address questions that you don’t want to answer now, but might in the future. This temptation is almost always to be fought against. This is hard. We love to think about how things might work, and what people might want to do, but it’s always better in my experience to push towards the ruthlessly practical side of things. It is vastly easier to write software and to build data models when you have real requirements rather than speculative ones. Moreover, those speculative requirements frequently turn out to be different when they turn into real requirements. They may disappear on closer examination, or be vastly more complex, or otherwise metamorphose. If you wrote code to address these pseudo-requirements, it would have been a waste of time. Avoiding this kind of trap is a principle in software engineering, called YAGNI (see, e.g. http://c2.com/cgi/wiki?YouArentGonnaNeedIt). The pressure is even more acute (and more of a risk) in many DH projects, which are both research-oriented and often constrained in terms of resources.
SNAP has at least one of these speculative requirements. We know that in the future, we’ll want to allow people to make a variety of assertions about SNAP datasets. For example, we’ll want to support asserting that two “people” from different databases are in fact the same person, or that what is represented as a single person in a data source is actually two people, or that what a partner database has interpreted as a person actually isn’t (maybe a subsequent edition of the source document has determined that what was thought to be a name isn’t).
So how should we model these cases? We shouldn’t. Not until we’ve had the time to properly sort out all of the requirements for having SNAP users make these kinds of assertions against our data. We have ideas about how this might work, but we don’t have enough information on the parameters yet, and this functionality isn’t in scope for the current SNAP grant. All we want to be able to do right now is try out merging a few person records where our partner datasets have overlaps. Therefore, we aren’t going to model assertions about SNAP entities at all, just one of their outcomes: what a merged person looks like. The requirements for this are pretty straightforward: we need to know where the new person resource comes from, who is responsible for it, and why the merge was performed.
So let’s start with two partner records (these are real):
<http://www.trismegistos.org/person/14218#this> a lawd:Person ;
    dc:publisher <http://www.trismegistos.org> ;
    lawd:hasName <http://www.trismegistos.org/name/6284#this> ;
    lawd:hasAttestation <http://www.trismegistos.org/ref/30996#person> .
<http://www.lgpn.ox.ac.uk/id/V2-60610> a lawd:Person ;
    dc:publisher <http://www.lgpn.ox.ac.uk> ;
    lawd:hasAttestation <http://www.lgpn.ox.ac.uk/id/V2-60610/personref/1>,
        <http://www.lgpn.ox.ac.uk/id/V2-60610/personref/2>,
        <http://www.lgpn.ox.ac.uk/id/V2-60610/personref/3> ;
    lawd:hasName <http://www.lgpn.ox.ac.uk/nym/nTi1marcos> ;
    foaf:name "Timarcos"@grc-Latn .
<http://data.snapdrgn.net/people/1234> a lawd:Person ;
    prov:wasDerivedFrom <http://www.trismegistos.org/person/14218#this> .

<http://data.snapdrgn.net/people/1235> a lawd:Person ;
    prov:wasDerivedFrom <http://www.lgpn.ox.ac.uk/id/V2-60610> .
To merge them, we’ll just create a new person:
<http://data.snapdrgn.net/people/1236> a lawd:Person, snap:MergedResource ;
    dc:publisher <http://snapdrgn.net> ;
    dc:replaces <http://data.snapdrgn.net/people/1234>,
        <http://data.snapdrgn.net/people/1235> ;
    snap:reason <http://data.snapdrgn.net/people/1236#reason1> .
<http://data.snapdrgn.net/people/1236#reason1> a cnt:ContentAsText ;
    cnt:chars "Merged because both replaced persons cite the same texts, IG II(2) 3455 and 3777." .
And with that, we have the who, what, and why, but we haven’t had to make any guesses about how SNAP might work in the future. We can merge person records without having had to plan out a whole new infrastructure.
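One practical consequence: given dc:replaces triples like the ones above, a client holding a possibly superseded SNAP URI can resolve it to the current record. A minimal sketch, where an in-memory dict stands in for a real triple-store query:

```python
def resolve_current(uri, replaces):
    """replaces maps a merged-record URI to the list of URIs it replaces
    (the dc:replaces triples). Follow the chain of replacements until we
    reach a record that has not itself been replaced."""
    replaced_by = {old: new for new, olds in replaces.items() for old in olds}
    while uri in replaced_by:
        uri = replaced_by[uri]
    return uri

replaces = {
    "http://data.snapdrgn.net/people/1236": [
        "http://data.snapdrgn.net/people/1234",
        "http://data.snapdrgn.net/people/1235",
    ],
}
print(resolve_current("http://data.snapdrgn.net/people/1234", replaces))
```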
One of the decisions that has to be made when creating an ontology is which concepts you encode as classes and which you encode as properties of those classes. One of the difficulties is that there is no overarching ‘right answer’ (although there are wrong ones) to how you should model your domain; it has to be decided on a case-by-case basis, according to what works best for the kind of world view you are trying to encapsulate within your model. This post is a request for feedback to help us decide which model works best for both the project and the wider community.
In the previous post we considered three patterns that we could use to describe relationships. Further discussion has led us to discard the third, event-driven, option, both in a drive towards simplicity and, more importantly, because it has the furthest conceptual distance from the information we want to represent. The source material is diverse in both type and style, but if we consider what is normally captured in prosopographical data, and why, we would expect something like:
Επιγόνη daughter of Επίγονος (from Thasos) http://www.lgpn.ox.ac.uk/id/V1-37074
There are a number of events that we can hypothesise from this type of statement, in this case that Επίγονος fathered a girl, Επιγόνη. This fits in with the logical rules that it is possible to create in structured data: when person A fathered/gave birth to a girl B, then B is the daughter of A. While epigraphs like that in the example are unlikely to go into further detail, other sources may have specific descriptions of some events, moving them from the realm of the assumed (we assume that Επίγονος fathered Επιγόνη and did not, for example, adopt her, or get cuckolded, resulting in the above statement) to the evidenced (the trustworthiness of that evidence is an issue for a different day/post). For those familiar with CIDOC CRM, this is basically the model it employs – and it is a good one, allowing a rich and detailed encoding of the biographical history of the person (or object). However, much of this information is well beyond the scope of what SNAP sets out to model. If it weren’t, then we could just use CIDOC CRM, a well-known and common standard, and all go home early for tea. One of the guiding principles behind SNAP is that we only encode the minimum information necessary to name/identify an individual entity. We need to know that Επιγόνη is the daughter of Επίγονος only in so much as that is part of her significant identity. So while we would encourage projects to encode this level of information in their own data, events are beyond the scope of SNAP, which leaves us with the two other possibilities.
Defining every possible relationship via properties is arguably the simplest way that we could encode the information we need:
[Επιγόνη] -- daughter-of --> [Επίγονος]
There are two potential downsides to this. Firstly, the number of properties expands pretty fast. Not only do we have the basic property tree,
but each of those needs to have versions for 'acknowledged', 'claimed', 'foster', 'adopted', 'step'. And then there is the extended family: even if we only go as far as the grandparent/grandchild relationship, along with the basic aunt-of/uncle-of (interestingly, there is no collective gender-neutral word for this relationship), cousin-of (and, conversely, no gendered term for this), and nephew-of/niece-of, we still have to add in maternal and paternal versions (although we can probably be forgiven for dropping the 'acknowledged', 'claimed' etc). Added to these we need the important non-"blood" relationships: formalised intimate relationships (i.e. recognised marriage), non-formalised intimate relationships (i.e. mistresses), slave-of, master-of, freedman-of, patron-of, client-of…
All in all that comes to approximately 90 relationships, plus a few more if we start including things like disciple-of and teacher-of.
This is not necessarily a problem in itself, although it does get a bit messy. It is at least nicely organised into a hierarchy, and there are plenty of opportunities for adding disjoint and inverse property restrictions. However, what we gain in the simplicity of the direct link we lose in sacrificing the possibility of relating additional information to the connection, such as provenance, reference or certainty. If we model the relationship as a concept (i.e. a Class) rather than as a property connecting two entities, then we immediately open up more possibilities.
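For illustration, property axioms of this kind might be sketched in Turtle as follows. The snap: namespace URI and the exact property names here are assumptions for the sake of the example, not the published ontology:

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix snap: <http://data.snapdrgn.net/ontology/snap#> .

# hypothetical property tree: specific properties specialise broader ones
snap:daughter-of rdfs:subPropertyOf snap:child-of .
snap:son-of      rdfs:subPropertyOf snap:child-of .

# inverse restriction: if A is child-of B, then B is parent-of A
snap:child-of owl:inverseOf snap:parent-of .
```

With axioms like these a reasoner can infer the broader and inverse statements from a single asserted triple, which is the main attraction of the all-properties approach.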
There are three obvious ways to do this:
[Entity1] --<generic-linking-property>--> [Relationship Class] --<relationship-specification>--> [Entity2]
e.g. [Επιγόνη] --has-relationship--> [AcknowledgedRelationship] --daughter-of--> [Επίγονος]
[Entity1] --<generic-linking-property>--> [Relationship] --<generic-linking-property>--> [Entity2] --<generic-type-linking-property>--> [RelationshipSpecification]
e.g. [Επιγόνη] --has-relationship--> [AcknowledgedRelationship] --relationship-with--> [Επίγονος] --relationship-type--> [Daughter]
[Entity1] --<generic-linking-property>--> [Relationship Classes] --<generic-linking-property>--> [Entity2]
e.g. [Επιγόνη] --has-relationship--> [AcknowledgedRelationship, Daughter] --relationship-with--> [Επίγονος]
Although the first two of these could just as easily be modelled the other way around, depending on where we preferred to put the emphasis:
[Επιγόνη] --has-relationship--> [Daughter] --acknowledged-with--> [Επίγονος]
[Επιγόνη] --has-relationship--> [Daughter] --relationship-with--> [Επίγονος] --relationship-type--> [AcknowledgedRelationship]
This is important because any additional information, such as provenance, reference or certainty, would be attached to the intermediary class, and the choice comes down to where we see the emphasis of the hierarchy lying.
We can cut out some of this discussion by dropping the additional property and dual-classing the instance, as shown in the third example. Expanding on that, our class hierarchy would look like:
- HereditaryFamily (if anyone can think of a better term I am open to suggestions)
- ExtendedFamily
- RelationshipQualifier (all disjoint with everything except HereditaryFamily classes)
- Half (disjoint with everything except Sibling classes)
- RelationshipAxis (all disjoint with everything except ExtendedFamily classes)
- Inlaw (disjoint with everything except HereditaryFamily and ExtendedFamily classes)
Disjoints would be defined for the gender-specific classes (Son/Daughter, Mother/Father, Aunt/Uncle, etc.) and for those that are impossible without the use of time travel (Child/Parent, Ancestor/Descendant, etc.), but given the period we are dealing with (Romans and Egyptians – I'm looking at you) it would be unwise to add any additional disjoints that we might otherwise consider between related people.
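As a sketch, such disjointness axioms could be expressed in OWL like this (again, the snap: namespace URI and class names are illustrative assumptions):

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix snap: <http://data.snapdrgn.net/ontology/snap#> .

# gender-specific classes can never apply to the same relationship instance
snap:Daughter owl:disjointWith snap:Son .
snap:Mother   owl:disjointWith snap:Father .

# pairs impossible without time travel
snap:Parent   owl:disjointWith snap:Child .
snap:Ancestor owl:disjointWith snap:Descendant .
```

A reasoner would then flag as inconsistent any relationship instance typed as, say, both snap:Daughter and snap:Son.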
Of the options that use classes instead of, or in addition to, properties, this is the simplest. It tends to be bad design when you end up making everything a Class, which is what we have ended up doing here. Equally, we can go too far in the opposite direction in search of "simplicity" and the desire to have as few classes as possible. The intermediary options offer a combination of properties and classes, but also raise some questions as to where we want the emphasis of the encoding to lie. These are questions that we feel it would be better to open up to discussion by the wider community rather than just making an executive decision.
Option 1: All properties
<http://clas-lgpn2.classics.ox.ac.uk/id/V1-37074> &snap;daughter-of <http://clas-lgpn2.classics.ox.ac.uk/id/V1-40436> .
Option 2a: Combination of Classes and Properties (classes define the relationship type, properties the specific relationship)
<http://clas-lgpn2.classics.ox.ac.uk/id/V1-37074> &snap;has-relationship [ a &snap;AcknowledgedRelationship; &snap;daughter-of <http://clas-lgpn2.classics.ox.ac.uk/id/V1-40436>] .
Option 2b: Combination of Classes and Properties (classes define the specific relationship, properties the relationship type)
<http://clas-lgpn2.classics.ox.ac.uk/id/V1-37074> &snap;has-relationship [ a &snap;Daughter; &snap;acknowledged-with <http://clas-lgpn2.classics.ox.ac.uk/id/V1-40436>] .
Option 3a: Combination of Classes and Properties (emphasis on classes, but with properties explicitly linking rather than dual classing; main class is the relationship type)
<http://clas-lgpn2.classics.ox.ac.uk/id/V1-37074> &snap;has-relationship [ a &snap;AcknowledgedRelationship; &snap;relationship-with <http://clas-lgpn2.classics.ox.ac.uk/id/V1-40436>; &snap;relationship-type &snap;Daughter] .
Option 3b: Combination of Classes and Properties (emphasis on classes, but with properties explicitly linking rather than dual classing; main class is the specific relationship)
<http://clas-lgpn2.classics.ox.ac.uk/id/V1-37074> &snap;has-relationship [ a &snap;Daughter; &snap;relationship-with <http://clas-lgpn2.classics.ox.ac.uk/id/V1-40436>; &snap;relationship-type &snap;Acknowledged] .
Option 4: All classes
<http://clas-lgpn2.classics.ox.ac.uk/id/V1-37074> &snap;has-relationship [ a &snap;AcknowledgedRelationship; a &snap;Daughter; &snap;relationship-with <http://clas-lgpn2.classics.ox.ac.uk/id/V1-40436>] .
I hope this post has clearly laid out the options as we see them and I’d like to invite your opinions and suggestions as to which way we go.
One question that came up during the workshop a couple of weeks ago was: if partner projects already assign their own URIs/ids to their person/name/etc. records, then why should SNAP assign its own identifiers? There are two answers to that, one very practical, and the other a bit more philosophical.
- SNAP IDs will be URIs themselves, and when dereferenced in a browser, or by an application, will return a result: either a web page listing what SNAP knows about the record in question, or RDF data about it. We can't do this in a practical way without assigning our own identifiers.
- On a more theoretical level, we think that updates made post-ingest shouldn't be applied directly to our partners' data. We believe, for example, that while SNAP might assert an identity between two person records coming from two partner datasets, it will be up to the partners whether they accept that identification.
When a person (or person-like entity) record in the SNAP triplestore is queried by URI via a web browser, we would expect this URI to dereference to an HTML page giving more information about the person recorded. The main information, the title of the page, would be the immediate source of the information: i.e. the contributing dataset, or datasets in the case of a merged or co-referenced person. Other information about the person – names, associated dates and places, primary text attestations, etc. – would also be listed, in a simple and standard layout, as would relationships to other persons, and other assertions about the person that SNAP knows about. An example (completely fictional) entry might therefore look something like:
TM 1234 = LGPN V5a-567
SNAP Person id: 10002
c. II cent
Father of TM 1233 (SNAP pid 10001), Diogenes/Διογένης
Attested in: PHI 256884; BGU.12.16024
In addition to this information, which may be different from, and in some cases supplemental to, the information in the contributing databases, we can imagine other information and services being added to this page. For example, a feed showing external projects that have linked to this person as annotations to names in their texts or archaeological objects; or a Social Network Analysis visualization of persons, places, texts, etc. within two steps of relationship to this person. All of these SNAP-specific services will only be possible if we have SNAP identifiers that dereference to pages containing this information.
When the SNAP system ingests data from a partner (Trismegistos, for example) we’ll get data from them that looks like:
<http://www.trismegistos.org/person/414#this> a lawd:Person ;
    dc:publisher <http://www.trismegistos.org> ;
    lawd:hasName <http://www.trismegistos.org/name/5663#this> ;
    lawd:hasAttestation <http://www.trismegistos.org/ref/1662#this> .

<http://www.trismegistos.org/name/5663#this> a lawd:PersonalName ;
    dc:publisher <http://www.trismegistos.org> ;
    lawd:primaryForm "Σαραπίων"@grc ;
    lawd:primaryForm "Sarapion"@en ;
    lawd:hasAttestation <http://www.trismegistos.org/ref/1662#this> .
and SNAP will assign a new person id, such as http://data.snapdrgn.net/person/1234, to http://www.trismegistos.org/person/414#this. The theoretical reason for this is that SNAP plans to add functionality for the identification of persons belonging to multiple datasets, and for the annotation of those persons. As we noted above, we think those sorts of updates shouldn't be applied directly to the Trismegistos resource by us. If you contribute data to SNAP, we feel strongly that we shouldn't change that data. You should be free, of course, to accept new facts or assertions that emerge in SNAP that are relevant to your data back into your dataset, but those shouldn't be forced on you, nor should it be made to look as if your project asserts something it doesn't. There are a couple of possible ways to achieve this, but one very simple one is to create a derived resource to which new facts and assertions may be added. SNAP ids allow us to preserve the integrity of contributed datasets while allowing us to build upon them.
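A hedged sketch of what such a derived resource might look like, with prefixes as in the Trismegistos example; owl:sameAs is shown purely for illustration, and the property SNAP actually uses to tie its id to the partner's URI may well differ:

```turtle
# the derived SNAP resource: new assertions attach here,
# never to the partner's own resource
<http://data.snapdrgn.net/person/1234> a lawd:Person ;
    owl:sameAs <http://www.trismegistos.org/person/414#this> .
```

Any subsequent SNAP-originated facts (co-references, annotations, corrections) would then be stated about the snapdrgn.net URI, leaving the Trismegistos triples exactly as contributed.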
The SNAP Project is proud to announce the Ontologies for Prosopography: Who's Who? or, Who was Who? one-day workshop, developed in conjunction with the People of the Founding Era project based at the University of Virginia. The workshop will give SNAP the opportunity to present our data model to a wider audience and engage with researchers working on similar problems in other periods and geographic areas.
The morning session will be devoted to presentations of the methodologies used by different projects, and to discussion of the needs of researchers working with historic person data and how they have been, and can be, addressed. Building on this, the afternoon will offer the option of smaller focused discussions or hands-on, practical sessions, in which attendees will have the opportunity to discuss with experts their own data and how they can publish it as structured linked data.
The short description of the workshop, as seen on the DH2013 website, is below. A more detailed description can be found at http://www.stoa.org/archives/1953
Historical data about people – their names, their attributes, and their relationships – is one of the most common types of data for projects to expose, and yet it is an area which is falling behind others in the move to digital data publication and exchange.
The morning session, 'Modelling the Person', will address the issues of modelling historical persons, with presentations and discussions on practices from a range of existing or emerging projects and models that attempt to capture information about historical persons using structured models compatible with semantic web thinking – models such as SNAP:DRGN, CIDOC-CRM/FRBRoo, the factoid model, SNAC, etc., plus any others that participants are already using to model their data. Building on these presentations, the workshop looks towards finding whether a cross-project consensus on standards and best practice is possible.
The workshop will continue in the afternoon with a session on 'Linking the Person'. Attendees will have the opportunity to continue the morning's theoretical discussion, or to break out into other areas, with a choice of smaller groups focusing on the technical and practical issues of linking person and name data from different projects together, including hands-on sessions on preparing and publishing prosopographical and onomastic datasets as structured data (attendees are encouraged to bring their own datasets if they choose to take part in one of the hands-on breakout groups).
This workshop will particularly appeal to prosopographers, biographers, genealogists, classicists, social historians, and those working with resources in which persons are mentioned, whether from the Greco-Roman and connected periods or from the foundation of America.
The workshop will also appeal to ontologists, technologists and developers with an interest in structured, open, linked data who are dealing with data related to historical people and names. The breakout groups in the afternoon will cater for all levels of technical ability.
Although some of the projects showcased in this workshop focus on specific periods, such as the Greco-Roman world and the foundation of America, the issues raised are applicable to all historical eras and loci, and participation by researchers from all periods and areas is encouraged and welcomed.
One of the conversations that it was really useful to hash out in person, with so many experts and interested parties present at the workshop a couple of weeks ago, concerned how the SNAP:DRGN Cookbook should recommend contributing person-datasets represent date information.
It has been our working assumption that the minimalist information SNAP is ingesting would optionally include a single, undifferentiated, very crudely recorded date associated with a person. (By the same token, any place information associated with a person would be given only in very blunt form, inasmuch as it serves almost as an extra name, epithet or identifier for a person. Further, more granular place association, à la Pelagios, might be included in the original prosopography, and/or in the exposed RDF serialization of said dataset, but SNAP will only expect and take advantage of associated place in the most abstract form.) The argument may be at its clearest with respect to dating, however, partly because there are so many strong arguments for including more granular and semantic date information in a prosopographic dataset.
There are many different classes of date information that one might, in theory, want to include in a prosopographical or biographical record of a person. Some dates are firmly known, from a variety of evidence, and might include a person's exact birth and death dates, or just a secure floruit, when they were known to associate with other persons, take part in historical events, leave behind artistic or written creations, etc. A person attested in an inscription might be dated (perhaps to the end of their life) if the inscription is datable by context, content, palaeography, style, or some other feature. Other inscriptions might mention a person (e.g. a historical ruler) whose date bears no relation to that of the text or support. Some people are datable only very broadly, by century or era; for example, a database of Imperial Roman elites with no specific date information for each entry might nonetheless enable us to ascribe a date range of 0001 – 0300 for all the persons therein, which is worth recording if it is all we have. Dating criteria should also be recorded, as should uncertainty, complexity, and editorial attribution for dating decisions (especially in the case, for example, of conflicting dates offered by different sources or editors). Dates might also be attached to events and changes of state within a person's life, with all of the complexities and varieties described above.
Despite this potential complexity, the value of having robust and reliable dates attached to person-entries in the SNAP graph is compelling. Dates can be used to help disambiguate persons and perform co-reference detection; the date is one of the first things that a human user would want to see in an at-a-glance summary of a person's information, alongside their name, source, location, and key relationships; and, especially in combination with place, dating information can contribute to the building of a social network of persons. Given the complexity of dating even within a single project, and the massive variety of approaches to dating that historical databases have taken, it is hard to imagine that there will be any consistency in this regard in the SNAP graph of ancient persons, except for that which the SNAP:DRGN project recommends, and even demands, from contributed data.
Having said this, we have always tried to follow as a sacred mantra the idea that SNAP will only ask for and surface the bare minimum of information needed to identify a person (at a minimum, name and URI in the contributing project), and in a simple format, so that there will be as much consistency across the entire network as possible. So a personal name contributed to the SNAP graph should be in plain text with a language tag attached ("Apollonius"@en, "Ἀπολλώνιος"@grc), and may also include a URI for the name in an associated onomasticon, but SNAP will not recognise or take account of, for example, TEI-encoded XML recording the condition and certainty of the name, abbreviations, emendations, lemmata, etc. All this information can and should be recorded somewhere, in the originating project, but SNAP isn't the place to find and use this complexity, which will no doubt be unique to individual projects anyway. Likewise, no amount of specific and qualified dates will be of any use if 90% of the graph is in a different format, has much less granular dating, and doesn't record the criteria or semantics of its dates. Even after agonising over the possibilities for a seemingly endless afternoon session, there was no consensus on what a more sophisticated dating mechanism that would be feasible across multiple huge and heterogeneous datasets might look like.
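For instance, a name contributed to SNAP might look like the following in Turtle, using the lawd: vocabulary seen in earlier examples (the subject URI here is fictional):

```turtle
# plain-text name forms with language tags; no embedded TEI markup
<http://example.org/person/1> lawd:hasName [
    a lawd:PersonalName ;
    lawd:primaryForm "Ἀπολλώνιος"@grc ;
    lawd:primaryForm "Apollonius"@en
] .
```

Anything richer than this – certainty markers, abbreviations, emendations – stays in the originating project's own data.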
The main problem, of course, is that not every date assigned to every person will be selected by a human editor who can individually record the source, certainty, precision, criteria, range and scope of the date. Some datasets will have relatively full dates for all persons, encoded in great detail in TEI or CIDOC-CRM or a bespoke relational database following the Factoid model. Others will have no dates at all attached to persons or names, but a date may be extrapolated from the date of the source document(s) in which each person is attested. These will be a very different quality of date: many of the documents will be epitaphs, which might or might not be closely dated; others may be texts clearly written while the person was alive (some might be dated only by the known dates of the persons mentioned within them); some may have no relationship between the date of the document and that of the person, mentioning, say, an emperor from several hundred years before the text was written. In the case of a dataset like this, should we:
- not indicate any dates in the person-data contributed to SNAP, because in some cases these dates will have no relation to the lives of the persons recorded?
- indicate the extrapolated dates, but with criteria saying that these are a different kind of date (across the whole database, because we have no way of differentiating between the entries)?
- indicate the extrapolated dates, allowing for the very occasional date that might be misleading, because all but a tiny minority of the tens of thousands of entries will still meet SNAP:DRGN standards?
Much to the disapproval of some purists, I think the correct solution for SNAP, which does not record full prosopographical data but only aggregates a summary of it, is number 3. We shall define our date property such that it may contain a very loose superset of all possible dates associated with a person, and tolerate a tiny amount of error in inappropriate dates being recorded with this property.
Even if the source dataset contains detailed dating information in sophisticated formats, please record in the RDF that you submit to the SNAP graph a single date range, expressed as an ISO 8601 time interval (e.g. "0101/0200" for "second century CE"), which, to the best of your project's ability to ensure it, probably at least overlaps with the lifetime of the person being recorded.
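In Turtle this might look like the following sketch; snap:associatedDate is a placeholder property name for illustration, and the ontology's final term may differ:

```turtle
# a single, deliberately loose date range:
# "probably overlaps the second century CE"
<http://data.snapdrgn.net/person/1234>
    snap:associatedDate "0101/0200" .
```

The point is that one crude interval per person, in one predictable property, is queryable across the whole graph in a way that a dozen project-specific date schemes would never be.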
This definition, and a similarly general definition of our associated place property ("place or placename generally associated with this person for any reason"), make it especially important that we use a property formally defined in the SNAP ontology as having this general definition, one that will not be used for more specific dates (born, died, floruit, married, reign, moved city, attested, etc.) or places (lived, ruled, visited, founded, attacked, ethnicity, etc.).