Minutes of first Advisory Board meeting

SNAP:DRGN Advisory Board

1st meeting Skype (voice only) 2014-05-09

Present: Øyvind Eide (chair), Fabian Koerner, Robert Parker, Laurie Pearce, Charlotte Roueché, Rainer Simon, Gabriel Bodard (principal investigator)

Not present: Sonia Ranade.

The meeting lasted around one hour.

Minutes written by Øyvind Eide based on notes from Laurie Pearce and Rainer Simon.

Entering the SNAPDRGN garden

Now that the SNAP project has started ingesting finalized data from the initial core datasets, it is time to think about how to bring in material from the other partners. For some, this will be easy, as they already know how to make their data available in RDF form on the open web and simply need to follow the guidelines in the Cookbook. For others, quite a lot of work will be involved in getting their data SNAP-ready. This post describes some of the stages you may go through, and some of the problems that you may meet.

I have divided the work into six steps:

  1. Decide whether you have a set of names, a set of attestations, or a prosopography
  2. Data wrangling
  3. Identify your records
  4. Establish the identifiers online
  5. Transform the data
  6. Make the RDF available

The first step is the most critical – what does your data actually represent? If your starting point is a text, and you have extracted the personal names, you do not have prosopographical records yet, but rather a set of attestations of names in a text. This is in itself a good thing to do, but to get to the prosopography you have to decide how many actual individuals are referred to in the text – 10 occurrences of the name Marcus Aurelius may refer to anywhere between 1 and 10 persons (or one cat!).

If what you have is a set of attestations, what you probably want to do is contribute a slightly different form of data to the SNAP network, namely links from people to sources. This will mean deciding which person in SNAP your name (Aelius Florus natione Pann., for example) refers to, and then generating RDF which links the URL of the source text to the SNAP person.

The second task, data wrangling, can be quite laborious. It involves turning the list of names into a set of person records, creating a single record for each person which (for example) lists all the different names they are attested under, and what the source of those attestations is. At this point you will also be assembling other information from your sources (place of birth, sex, profession etc.). In some circumstances you will also be comparing what you have with established authority lists of names or persons.
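As a rough illustration of the wrangling step (a minimal sketch; the CSV file and its column names are invented for the example, and it assumes you have already decided which attestations belong to which person):

import csv
from collections import defaultdict

# Fold a flat list of attestations into one record per person,
# collecting all attested name forms and their sources.
people = defaultdict(lambda: {'names': set(), 'sources': set()})
with open('attestations.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        record = people[row['person_id']]
        record['names'].add(row['name'])
        record['sources'].add(row['source'])

for person_id, record in sorted(people.items()):
    print(person_id, sorted(record['names']), sorted(record['sources']))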

Now that we have a set of people, the third task is identifying your records: giving each one a unique identifier of its own. It seems obvious, but if all you have been doing is analyzing the information in a spreadsheet, you may never have needed to.

The fourth, and minimal, piece of work is to establish your identifiers online, i.e. to make them available as URIs. If you have a person called Soranus, to whom you have allocated the number 1001, and you are able to use the domain http://my.people.net/, you might decide that his public identifier is http://my.people.net/person/1001. At the least, make it so that this displays an HTML page about the person for anyone visiting that URL. The very simplest way to do this is to make an HTML file for each person, and arrange the web server so that a request for /person/1001 returns that web page (look at facilities for URL rewriting). This assumes, of course, that you have a web server you can use to put up files, with some reasonable prospect of its remaining in place for the coming decades or longer. Larger institutions may have a repository which can do this job for you, and even assign Digital Object Identifiers (DOIs) to your data.

If you can go further, and map your permanent URLs to queries against a database (where, for example, http://my.people.net/person/1001 is turned behind the scenes into http://my.people.net/query.php?id=1001, and retrieves the data from a relational database), you will have a more powerful (but harder to sustain) resource. At this point you can consider having your web server do content negotiation, and return different formats of data in response to Accept headers (technical details at http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html) so that humans read HTML and computers read RDF.
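To make the two preceding paragraphs concrete, here is a minimal sketch of such a server in Python using Flask; the file layout is invented, and a real installation might use Apache URL rewriting or a database query instead:

from flask import Flask, request, send_file

app = Flask(__name__)

@app.route('/person/<person_id>')
def person(person_id):
    # Content negotiation: serve RDF to clients that ask for it
    # in the Accept header, and HTML to everyone else.
    accept = request.headers.get('Accept', '')
    if 'text/turtle' in accept or 'application/rdf+xml' in accept:
        return send_file(f'data/{person_id}.ttl', mimetype='text/turtle')
    return send_file(f'pages/{person_id}.html')

if __name__ == '__main__':
    app.run()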

The fifth stage is transforming your data to the RDF format which SNAP wants, as explained in the cookbook. This may involve an XSLT transform of XML data (RDF can be represented in XML), or a Python script reading a database, or an output template in your publication system. Incidentally, if you’re wondering how to get data out of an Excel spreadsheet, the OxGarage can turn the sheet into XML for you.
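As a sketch of what such a script might look like (using the Python rdflib library; the URIs reuse the imaginary my.people.net example above, and the exact predicates required should be checked against the Cookbook):

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, FOAF

LAWD = Namespace('http://lawd.info/ontology/')
DC = Namespace('http://purl.org/dc/terms/')

g = Graph()
g.bind('lawd', LAWD)
g.bind('dc', DC)
g.bind('foaf', FOAF)

# One person record, in roughly the shape SNAP ingests.
person = URIRef('http://my.people.net/person/1001')
g.add((person, RDF.type, LAWD.Person))
g.add((person, DC.publisher, URIRef('http://my.people.net')))
g.add((person, FOAF.name, Literal('Soranus', lang='la')))

print(g.serialize(format='turtle'))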

If you’ve got this far and created some RDF suitable for submitting to SNAP, that’s great. Even better, you may want to move on to the sixth stage, making your RDF available on the web, so that SNAP can harvest it at intervals and keep up to date. This stage means looking at running your own RDF triple store. Setting this sort of software up yourself isn’t so easy, but you may wish to read up on open-source projects like OntoWiki, Sesame, and Fuseki. Maybe one of us will blog more about this in future.
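Even without a triple store you can experiment locally: the sketch below runs a SPARQL query over a dump file with rdflib, and the same query could later be posed to a Fuseki or Sesame endpoint (the dump file name is invented):

from rdflib import Graph

g = Graph()
g.parse('snap-dump.ttl', format='turtle')

query = '''
PREFIX lawd: <http://lawd.info/ontology/>
SELECT ?person ?name
WHERE {
  ?person a lawd:Person ;
          lawd:hasName ?name .
}'''

for person, name in g.query(query):
    print(person, name)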

SNAP at DH2014

In a change from our usual programming, we will be bringing you a blog post directly from the Ontologies for Prosopography workshop at Digital Humanities 2014.

The workshop hopes to bring prosopographers and prosopographical datasets together (preferably through LOD).

Datasets in the following areas were represented at the workshop:

  • Greek/Byzantine Manuscripts in Sweden
  • Apprenticeships in Early Modern Venice
  • Late Antique Immigration from Rome
  • Roman worlds
  • Imperial cult priests in North Africa and Syria
  • European university history from the Middle Ages to the present
  • C17/18 Tibetan monasteries in China and Tibet
  • Fictional People
  • Cuneiform tablets from Hellenistic Babylonia
  • Contemporary legal texts
  • Estonian folklore
  • Mayan hieroglyphics
  • C20 Canadian literary publications and magazines
  • History of Beer and Brewing
  • Avant-garde periodicals
  • Archival descriptions of records
  • German National Biography
  • C19/20 Church records looking at migration
  • Lexicon of Greek Personal Names
  • Canadian Gay Liberation 1960 – 1980
  • Encyclopedia Virginia
  • French Literature (including historical, mythic and fictional persons)
  • Early (Pre-1900s) Caribbean
  • Correspondence networks
  • C19 Mexican authors

Looking Towards an API for SNAP:DRGN

During the first SNAP:DRGN workshop a breakout group was convened to discuss the potential API for the project. Rather than come up with a specific API during that session, we instead focused on creating a “wish list” of applications and functions that we wanted to support. We were then able to abstract the functions that would be needed to support the list.

The most vital functions support querying the dataset to retrieve the original id of an entity and all the information about that entity. The reference and provider information can be extracted from the general data about an entity, so creating individual functions for specific data is strictly unnecessary; however, since these functions are much more directed and thus require less overhead, testing will be needed to see whether they are helpful in reducing the load on the server.

getChildIds(id) takes SNAP id, returns ids of entity/entities that the current entity has been derived from
getSnapIds(id) takes partner id, returns list of ids of entity/entities that are derived from the current entity
getLiveSnapIds(id) takes partner id, returns list of ids of non-deprecated entity/entities that are derived from the current entity
getPersonInfo(id, ?content_type) takes id, returns full details of the entity with that id in the format specified (‘rdf’ or ‘json’). If no content_type is specified then the function will return RDF
getReferences(id) takes id, returns a list of all the references associated with the given entity
getProviders(id) takes id, returns the list of data provider(s)/project(s) associated with the given entity
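None of this API exists yet, but to make the shape of these functions concrete, here is a client-side sketch of how getPersonInfo might behave if the functions were exposed as REST calls (the base URL and path are placeholders, not a deployed service):

import requests

BASE = 'http://data.snapdrgn.net/api'  # placeholder; no such endpoint is deployed

def get_person_info(snap_id, content_type='rdf'):
    # getPersonInfo: full details of the entity, RDF by default,
    # JSON when explicitly requested.
    accept = 'application/json' if content_type == 'json' else 'application/rdf+xml'
    response = requests.get(f'{BASE}/person/{snap_id}',
                            headers={'Accept': accept})
    response.raise_for_status()
    return response.text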


These functions will support the main internal functions which will be used to populate the main body of the website:

getPersonPage(personId) takes id, returns html page with details of the entity with that id
getProjectPage(projectId) takes dataset/project identifier, returns html page with the information about that dataset/project

The remaining API functions that were identified have been divided by priority. Higher priority has been given to functions that seem likely to be most useful to users, that will be used by other functions, or that will address immediate issues such as disambiguation and basic filtering. Priority has also been given to functions that rely on information currently supported within the ingested data or which support the promotion of our partner projects.

High Priority:

getBaseIds(id) takes a SNAP id, returns earliest id of entity/entities that the current entity has been derived from i.e. it returns the external URIs that the entity was ultimately derived from no matter how long the chain of derivation
getDatasetInfo(dataset, content_type) takes dataset identifier, returns full details of the dataset as per data-info in the format specified (‘rdf’ or ‘json’)
getDates(id) takes id, returns the list of associated date value(s) for the given entity
getNames(id) takes id, returns the list of name values for the given entity
getNames(id, lang) takes id, returns the list of name values that match the specified language code for the given entity
getPlaces(id) takes id, returns the list of associated place value(s) for the given entity
getEntitiesByRef(reference) takes reference, returns a list of all the entities that point to the given reference


Medium Priority:

getEntityNetwork(id) takes id, returns ids of any entities that are either derived from the given entity or that the given entity is derived from or are asserted to be the same entity
getRelatedEntities(id) takes id, returns list of ids of related entities and how they are related
getEntitiesByName(name) takes name value, returns list of entities with an associated name value that exactly matches the given value.
getEntitiesBySource(source) takes source value, returns list of entities referenced within that source document


Low Priority:

getRelation(id, relationship) takes id, returns list of entities that have the given relationship to the given entity
getRelationships(id1, id2) returns the direct relationship (if there is one) from id1 to id2
getEntitiesByPublisher(publisher) takes publisher identifier, returns list of entities associated with the given provider
getEntitiesByDate(date) takes date value, returns list of entities with an associated date value that matches or contains the given value
getEntitiesByPlace(place) takes place value, returns list of entities with an associated place value that matches or contains the given value
getCountByPublisher(publisher) takes publisher identifier, returns count of entities from that publisher
getCountByName(name) takes name, returns count of entities with that exact name
getCountByPlace(place) takes place, returns count of entities associated with that place
getCountByRef(reference) takes reference, returns count of entities associated with the reference
getCountBySource(source) takes source, returns count of entities associated with that source document
getTemporalRange(publisher) takes publisher identifier, returns temporal range of dataset as given in dataset description
getGeoRange(publisher) takes publisher identifier, returns geographical range of dataset as given in dataset description


Future work, probably beyond the scope of the immediate pilot project, will expand the available functions, both in the public API and in the internal systems. In addition to the functions listed below, the existing functions that return lists of entities will be expanded with optional parameters which will allow the returned result to be filtered by date range and (potentially) place.

Aspirational functions:

getRelationshipPath(id1, id2) returns the sequence of relationship links (if there is one) that joins id1 to id2
getRelatedNames(name) takes name, returns list of names that are specified as variants to the given name
getEntitiesByDate(tpq, taq) takes start and end date values, returns list of entities with an associated date value that matches or falls within the given period
getEntitiesByApproxName(name) takes name value, returns list of entities with an associated name value that matches or closely matches the given value
getEntitiesByAltName(name) takes name value, returns list of entities with an associated name value that is an alternate form of the given value
getAssertionAuthority(id) takes id, returns the authority (which may be the original publisher) that identified the entity as an entity
isDeprecated(id) takes id, returns true if entity has been marked as deprecated
getCurrent(replaced-id) takes id of replaced entity and returns the list of ids of entities that have replaced it
getAgreementMetric(id) returns a value for the level of agreement on the existence of a given entity, as determined by any conflicting assertions
getCertainty(id, authority) returns the certainty value for a given authority for a given assertion (id)


Beyond the basic querying functions, the highest priority will be to add functions to support the making of assertions about existing entities. These assertions will result in the creation of new entities, and as such each assertion will need to have an identifier attached for the person making the claim. Optional values will allow the asserter to specify their level of certainty and to give a reference to a resource which backs up their statement. The primary assertions (that the entities specified in the first argument do or do not represent co-references) will be implemented initially, but as the assertion system is finalised this list will be expanded and refined.

coRefAssertion(entity-list, authority, ?certainty, ?reason) make a co-reference assertion
notCoRefAssertion(entity-list, authority, ?certainty, ?reason) make an assertion that two entities are not co-references
deprecateEntity(old-id, replacement-id) annotates the entity identified by the old-id as deprecated and adds a pointer to the replacement entity as defined by replacement-id.
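As a sketch of what a coRefAssertion call might generate behind the scenes (the class and property names are placeholders, pending the finalised assertion vocabulary):

def co_ref_assertion(entity_ids, authority, certainty=None, reason=None):
    # Serialise a co-reference claim as Turtle. Every assertion carries
    # the identity of the claimant; certainty and reason are optional.
    lines = ['[] a snap:CoReferenceAssertion ;',
             f'   snap:assertedBy <{authority}> ;']
    lines += [f'   snap:coReferent <{e}> ;' for e in entity_ids]
    if certainty is not None:
        lines.append(f'   snap:certainty {certainty} ;')
    if reason is not None:
        lines.append(f'   snap:reason "{reason}" ;')
    lines[-1] = lines[-1][:-2] + ' .'  # terminate the final statement
    return '\n'.join(lines)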


Internally, restricted functions intended to support the addition of new datasets and allow more automation of the data ingestion system will be added. This will allow us to streamline the ingestion workflow and more easily add data from partner projects:

addDataset(data-info-uri, data-uri) ingests RDF from the source indicated by data-uri, and the RDF description of the dataset from the file indicated by data-info-uri
addDatasetDescription(data-info-uri) adds the description of the dataset to the triple store
addDatasets(data-info-uri, data-uri-list) ingests RDF from each source listed in data-uri-list, and the RDF descriptions of the datasets from the file indicated by data-info-uri


Finally, purely internal functions to query externally held linked data will allow us to connect with datasets that have not been directly ingested. These datasets may be outside the scope of SNAP, for example place rather than person data, or may represent prosopography datasets which have been published and hosted in a compatible format by other projects, with just the minimum information (SNAP identifier and reference) held locally:

lookupEntity(uri) query external datasource for RDF relating to entity as specified by URI
lookupSource(uri) query external datasource for information on, or the text of, the source at the given URI
lookupPlace(uri) query external datasource, e.g. Pelagios, for RDF information on place as specified by URI


The “wish list” identified in the workshop is detailed below. The list is not presented in any particular order and was conceived as purely aspirational.

  • Lightweight (popup) widgets for embedding in external websites
  • Data to support visualisations
  • Find mappings (that is, given one identifier, find all coreferencing identifiers)
  • Correlate identifier to provider
  • Filter identifiers by a variety of criteria
    • provider
    • date (created / modified)
    • subjectOf assertions (depends on precise character of assertions)
    • objectOf assertions (depends on precise character of assertions)
  • Read information about entity
    • Retrieve information from providers
    • Retrieve information/text from source (where available online)
  • Search on name – get result or disambiguation page
    • Language independent as much as possible
    • Disambiguation
      • The Fuzzy Person (Entity and closely related entities)
      • The Fuzzy Name (name variations)
  • Autosuggest
    • match hinting
    • name variation (soundex, …)
  • Assertion creation
    • must allow external systems to publish assertions
    • public identification needed to link assertion to who did it
  • Dataset ingestion
    • single dataset
    • bulk datasets
  • Trust/Authority analysis on assertions
    • certainty assertions
    • agreement assertions
  • Prosopography/Project Summaries
    • Size
    • Temporal span
    • Geographic span
    • EAC Suggested Archive Descriptions:
        • Reference Code (Required)
        • Name and Location of Repository (Required)
        • Title (Required)
        • Date (Required)
        • Extent (Required)
        • Name of Creator(s) (Required, If Known)
        • Administrative/Biographical History (Optimum)
        • Conditions Governing Access (Required)
        • Physical Access (Added Value)
        • Technical Access (Added Value)
        • Conditions Governing Reproduction and Use (Added Value)
        • Languages and Scripts of the Material (Required)
        • Custodial History (Added Value)
        • Immediate Source of Acquisition (Added Value)
        • Appraisal, Destruction, and Scheduling Information (Added Value)
        • Accruals
        • Related Materials Elements
        • Existence and Location of Originals (Added Value)
        • Existence and Location of Copies (Added Value)
        • Related Archival Materials (Added Value)
        • Publication Note (Added Value)
        • Notes Element
        • Description Control Element
  • List by Source/Reference, Project
  • List by Project
  • Deprecation
  • Language translation
    • Latin transliteration if not provided
    • Greek from Betacode?
    • Other languages?
  • Re-ingestion
    • Partner projects mark entities as deprecated/changed
  • Output
    • Web query – web response
    • RDF dump
    • SPARQL Endpoint
    • Other formats by demand later

SNAP and NER for Latin inscriptions

Prosopographies have in the past often taken decades or even centuries to produce. Even for a period with relatively few sources, such as antiquity, hundreds of thousands of texts had to be collected and read, personal names had to be copied onto index cards, people had to be identified across sources, their relations then had to be examined, and their lives had to be reconstructed.

Fortunately, in this digital age that enormous work can be at least partially automated. That is also a process SNAP is experimenting with: we do not only aim to bring together prosopographies, just as the prosopographies themselves have brought together individuals in the sources; SNAP also wants to explore how to facilitate the creation of new prosopographies through Named Entity Recognition (NER). And this is where Leuven Ancient History and Trismegistos People come in.

In 2008 Trismegistos started a project to collect all personal names in Trismegistos Texts, at that time basically all published texts from Egypt between 800 BC and AD 800. This new Trismegistos People database could build on the Prosopographia Ptolemaica, a Leuven project which started in the late 1930s, and which was transformed into a database as early as the 1980s. Like its predecessor, Trismegistos People wanted to be multilingual, taking in not only Greek, but also Demotic and other Egyptian evidence.

As we foresaw, however, manual extraction of the personal names in the ‘old style’, now by typing information into database records rather than writing it on index cards, proved very time-consuming. Still, for a language like Demotic, where no Open Access full text was (and is) available, it was the only way forward. As a result, not even half of all 15,000 texts are currently done …

For Greek papyri, however, there was the Duke Database of Documentary Texts [DDbDP], which had just been converted to Unicode in 2008 and made available in the Papyrological Navigator [PN]. This was kindly put at our disposal, and it was to be our corpus for NER.

The Wikipedia article about NER states that ‘even state-of-the-art NER systems are brittle, meaning that NER systems developed for one domain do not typically perform well on other domains’. Well, the papyrological texts seemed like a completely new domain all right, with their diacritic marks, their sometimes fragmentary state, the case system of ancient Greek, and an onomastic system that to us was aberrant, with fathers’ names instead of first names and family names. So in our innocent rashness we decided to develop something completely new ourselves.

This is not the place to go into details. Those who want to know more can read an article by Bart Van Beek and myself (Journal of Juristic Papyrology 39 (2009), p. 31-47, available here), where we describe the procedure we developed, which allowed us to deal with several hundred thousand attestations of Greek personal names.

What I want to focus on here is the new challenge in the form of Latin inscriptions. In the Europeana EAGLE project, Trismegistos is disambiguating the datasets of partners such as EDH, EDR, HispEpOl and EDB, and the full text of the inscriptions is going to be made available in Open Access. This again opens up exciting possibilities for SNAP, if through NER we can automate the collection of all attestations of personal names in this large corpus.

This time I could not call upon Jeroen Clarysse, who had cooperated with us to develop a NER tool in PHP. So I decided to go ahead myself in the system I know best: FilemakerPro. This may seem counterintuitive (some will no doubt use a different word), but if you want to express yourself, you just use the language you know best, and for me this is Filemaker. The challenges remain the same: identifying the named entities, in this case personal names, and extracting them in the best way possible with an eye to scholarly reuse.

The first problem for the Latin inscriptions was identifying personal names. No Open Access set of names was available, so we had to create that ourselves. Of course personal names are written with capitals, but so are place names, names of gods, and even the occasional book title. Not to mention Roman numerals, and in some datasets even unclear passages or the beginnings of texts or sentences. Creating a set of Latin personal names on the basis of all capitalized words was thus the first task, and a quite time-consuming one.

For this, paradoxically, we were helped by a second problem, that of the Latin onomastic system. Latin, as all of you know, has an aberrant system in which people standardly have multiple names. According to the time period, they often use two or even three of the following: a praenomen such as Marcus, a nomen (gentilicium) such as Tullius, and a cognomen such as Cicero. On top of that, they often add the name (mostly the praenomen) of their father – or former master in the case of a freedman. And citizens can add their tribus, the voting district in which they were registered. This leads to identification clusters such as C(aius) Ofillius A(uli) f(ilius) Cor(nelia) Proculus or C(aio) Sextilio P(ubli) f(ilio) Vot(uria) Pollion[i. Once you get rid of the diacritics for abbreviations and restorations, these patterns actually help to identify words with a capital as personal names.

The existence of long clusters of names to identify a single individual, including also non-capitalized words, implies that we had to focus on extracting these clusters for each text. This I did on the basic principle that each consecutive word which either has a capital or belongs to a set of core ‘identification-cluster’ words (including of course libertus, filius and the names of the voting districts) should be added to the cluster. This implies that you end up with clusters like C(aius) Ofillius A(uli) f(ilius) Cor(nelia) Proculus C(aio) Sextilio P(ubli) f(ilio) Vot(uria) Pollion[i.
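In simplified Python terms, the principle looks something like this (a sketch, not the actual FileMaker implementation; the linker list is only a small sample):

import re

# Lowercase words that keep an identification cluster going; a real
# list would include declined and abbreviated forms as well.
LINKERS = {'filius', 'filio', 'f', 'libertus', 'liberto', 'l'}

def extract_clusters(text):
    # Strip editorial diacritics for abbreviation and restoration,
    # e.g. 'C(aius)' -> 'Caius', 'Pollion[i' -> 'Pollioni'.
    clean = re.sub(r'[()\[\]]', '', text)
    clusters, current = [], []
    for word in clean.split():
        if word[0].isupper() or word.lower() in LINKERS:
            current.append(word)
        elif current:
            clusters.append(' '.join(current))
            current = []
    if current:
        clusters.append(' '.join(current))
    return clusters

Run over ‘C(aius) Ofillius A(uli) f(ilius) Cor(nelia) Proculus hic situs est’, this returns the single cluster ‘Caius Ofillius Auli filius Cornelia Proculus’.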

In a next step each of the constituent words needs to be analyzed: some are ‘linking words’ such as filius or the tribus-names Cornelia and Voturia, others are declined forms of a personal name, and yet others are ‘noise’ in the form of e.g. numbers. The case of the declined forms is essential for the interpretation of the cluster. In this example the cluster takes the form ‘nom nom gen filius tribus nom dat dat gen filius tribus dat’. This is then interpreted further in a related database as the identification of two individuals: one C(aius) Ofillius A(uli) f(ilius) Cor(nelia) Proculus (nom nom gen filius tribus nom) and one C(aio) Sextilio P(ubli) f(ilio) Vot(uria) Pollion[i (dat dat gen filius tribus dat).
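A simplified sketch of that interpretation step (again in Python rather than FileMaker, and deliberately ignoring complications such as persons named wholly in the genitive):

# Split an analyzed cluster into individuals: a new person starts when a
# declined name form appears in a case different from the current
# person's, except for the genitive father's name directly before 'filius'.
CASES = {'nom', 'gen', 'dat', 'acc', 'abl'}

def split_individuals(tokens):
    # tokens: list of (word, tag) pairs, tag in CASES or 'filius'/'tribus'.
    persons, current, current_case = [], [], None
    for i, (word, tag) in enumerate(tokens):
        is_father = (tag == 'gen' and i + 1 < len(tokens)
                     and tokens[i + 1][1] == 'filius')
        if tag in CASES and not is_father:
            if current_case is None:
                current_case = tag
            elif tag != current_case:
                persons.append(current)
                current, current_case = [], tag
        current.append((word, tag))
    if current:
        persons.append(current)
    return persons

Applied to the ‘nom nom gen filius tribus nom dat dat gen filius tribus dat’ pattern above, the switch from nominative to dative forms marks the boundary between the two individuals.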

In yet another database each individual identification is then split up and further standardized (e.g. by converting it to the nominative). The first identification yields a person Caius Ofillius Proculus, with a father Aulus (Ofillius), belonging to the tribus Cornelia; the second a person Caius Sextilius Pollio, with a father Publius (Sextilius), registered in the tribus Voturia.

At that stage, the information is ready to go into our database system, with the core database REF for all attestations of personal names, and separate databases for individuals (PER), and names, their variants, and their declined forms (NAM, NAMVAR, and NAMVARCASE).

[Diagram: the REF attestation database and its related databases (PER, NAM, NAMVAR, NAMVARCASE)]
The NER system has currently progressed so far that I think we could get a complete database with all attestations of personal names in Latin inscriptions pretty soon, given the funding needed to fine-tune the system and check its results. Even the onomastic analysis is a relatively easy task, with the cooperation of a few specialists. This way we could focus on the prosopographical aspects through SNAP and perhaps develop digitally assisted identifications of the people referred to in this enlarged Trismegistos People dataset.

You Aren’t Gonna Need It

We’ve been discussing lately how to merge person records in SNAP, so that when we encounter partner projects that each have a record for the same person, SNAP can provide a useful service by combining those into single, merged records, and we can start to get an idea of the requirements for performing operations like merges on our data. This discussion has proved something of a rabbit hole.

In any digital project there is always a temptation to plan for and build things that you think you may need later, or that might be nice to have, or that might help address questions that you don’t want to answer now, but might in the future. This temptation is almost always to be fought against. This is hard. We love to think about how things might work, and what people might want to do, but it’s always better in my experience to push towards the ruthlessly practical side of things. It is vastly easier to write software and to build data models when you have real requirements rather than speculative ones. Moreover, those speculative requirements frequently turn out to be different when they turn into real requirements. They may disappear on closer examination, or be vastly more complex, or otherwise metamorphose. If you wrote code to address these pseudo-requirements, it would have been a waste of time. Avoiding this kind of trap is a principle in software engineering, called YAGNI (see, e.g. http://c2.com/cgi/wiki?YouArentGonnaNeedIt). The pressure is even more acute (and more of a risk) in many DH projects, which are both research-oriented and often constrained in terms of resources.

SNAP has at least one of these speculative requirements. We know that in the future, we’ll want to allow people to make a variety of assertions about SNAP datasets. For example, we’ll want to support asserting that two “people” from different databases are in fact the same person, or that what is represented as a single person in a data source is actually two people, or that what a partner database has interpreted as a person actually isn’t (maybe a subsequent edition of the source document has determined that what was thought to be a name isn’t).

So how should we model these cases? We shouldn’t. Not until we’ve had the time to properly sort out all of the requirements for having SNAP users make these kinds of assertions against our data. We have ideas about how this might work, but we don’t have enough information on the parameters yet, and this functionality isn’t in scope for the current SNAP grant. All we want to be able to do right now is try out merging a few person records where our partner datasets have overlaps. Therefore, we aren’t going to model assertions about SNAP entities at all, just one of their outcomes: what a merged person looks like. The requirements for this are pretty straightforward: we need to know where the new person resource comes from, who is responsible for it, and why the merge was performed.

So let’s start with two partner records (these are real):

<http://www.trismegistos.org/person/14218#this>
  a lawd:Person ;
  dc:publisher <http://www.trismegistos.org> ;
  lawd:hasName <http://www.trismegistos.org/name/6284#this> ;
  lawd:hasAttestation 
        <http://www.trismegistos.org/ref/30996#person> .
<http://www.lgpn.ox.ac.uk/id/V2-60610>
  a lawd:Person ;
  dc:publisher <http://www.lgpn.ox.ac.uk> ;
  lawd:hasAttestation 
    <http://www.lgpn.ox.ac.uk/id/V2-60610/personref/1>,
    <http://www.lgpn.ox.ac.uk/id/V2-60610/personref/2>,
    <http://www.lgpn.ox.ac.uk/id/V2-60610/personref/3> ;
  lawd:hasName <http://www.lgpn.ox.ac.uk/nym/nTi1marcos> ;
  foaf:name "Timarcos"@grc-Latn .

We can tell these are the same person, because they both cite IG II² 3455 and 3777. When these are ingested into SNAP, they’ll get SNAP IDs (these are imaginary):

<http://data.snapdrgn.net/people/1234>
  a lawd:Person ;
  prov:wasDerivedFrom 
    <http://www.trismegistos.org/person/14218#this> .

and

<http://data.snapdrgn.net/people/1235>
  a lawd:Person ;
  prov:wasDerivedFrom <http://www.lgpn.ox.ac.uk/id/V2-60610> .

To merge them, we’ll just create a new person:

<http://data.snapdrgn.net/people/1236>
  a lawd:Person, snap:MergedResource ;
  dc:publisher <http://snapdrgn.net> ;
  dc:replaces 
    <http://data.snapdrgn.net/people/1234>,
    <http://data.snapdrgn.net/people/1235> ;
  snap:reason <http://data.snapdrgn.net/people/1236#reason1> .
<http://data.snapdrgn.net/people/1236#reason1>
  a cnt:ContentAsText ;
  cnt:chars "Merged because both replaced persons cite the same texts, IG II(2) 3455 and 3777." .

And with that, we have the who, what, and why, but we haven’t had to make any guesses about how SNAP might work in the future. We can merge person records without having had to plan out a whole new infrastructure.


The Old Classes vs Properties Debate (or Relationships Are Hard, Part 2)

One of the decisions that has to be made when creating an ontology is which concepts you encode as classes and which you encode as properties of those classes. One of the difficulties is that there is no overarching ‘right answer’ (although there are wrong ones) to how you should model your domain; it has to be decided on a case-by-case basis according to what works best for the type of world view that you are trying to encapsulate within your model. This post is a request for feedback to help us decide which model works best for both the project and the wider community.

In the previous post we considered three patterns that we could use to describe relationships. Further discussion has led us to discard the third, event-driven, option, both in a drive towards simplicity and, more importantly, because it has the furthest conceptual distance from the information we want to represent. The source material is diverse in both type and style, but if we consider what is normally captured in prosopographical data, and why, we would expect something like:

Επιγόνη daughter of Επίγονος
(from Thasos)

http://www.lgpn.ox.ac.uk/id/V1-37074

There are a number of events that we can hypothesise from these types of statements, in this case that Επίγονος fathered a girl, Επιγόνη. This fits in with the logical rules that it is possible to create in structured data: when person A fathered/gave birth to a girl B, then B is the daughter of A. While epigraphs, like that in the example, are unlikely to go into further detail, other sources may have specific descriptions of some events, moving them from the realm of the assumed (we assume that Επίγονος fathered Επιγόνη and did not, for example, adopt her, nor was he cuckolded, either of which could also result in the above statement) to the evidenced (the trustworthiness of that evidence is an issue for a different day/post).

For those people familiar with CIDOC CRM, this is basically the model that it employs – and it is a good one, allowing a rich and detailed encoding of the biographical history of the person (or object). However, much of this information is well beyond the scope of what SNAP sets out to model. If it wasn’t, then we could just use CIDOC CRM, a well-known and common standard, and all go home early for tea. One of the guiding principles behind SNAP is that we are only encoding the minimum information necessary to name/identify an individual entity. We need to know that Επιγόνη is the daughter of Επίγονος only in so much as that is part of her significant identity. So while we would encourage projects to encode this level of information in their own data, events are beyond the scope of SNAP, which leaves us with two other possibilities.

Defining every possible relationship via properties is arguably the simplest way that we could encode the information we need:

[Επιγόνη] -- daughter-of --> [Επίγονος]

There are two potential downsides to this. Firstly, the number of properties expands pretty fast. Not only do we have the basic property tree with

  • parent-of
    • father-of
    • mother-of
  • sibling-of
    • brother-of
    • sister-of
  • child-of
    • daughter-of
    • son-of

but each of those needs to have versions for ‘acknowledged’, ‘claimed’, ‘foster’, ‘adopted’ and ‘step’. And then there is the extended family: even if we only go as far as the grandparent/grandchild relationship, along with the basic aunt-of/uncle-of (interestingly, there is no collective gender-neutral word for this relationship), cousin-of (and, conversely, no gendered term for this) and nephew-of/niece-of, we still have to add maternal and paternal versions (although we can probably be forgiven for dropping the ‘acknowledged’, ‘claimed’ etc.). Added to these we need the important non-”blood” relationships: formalised intimate relationships (i.e. recognised marriage), non-formalised intimate relationships (e.g. mistresses), slave-of, master-of, freedman-of, patron-of, client-of…

All in all that gets to approximately 90 relationships, plus a few more if we start including things like disciple-of and teacher-of.
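That figure of approximately 90 is easy to check with a back-of-the-envelope enumeration (the property names follow the pattern sketched above; the exact list of non-family relationships is illustrative):

from itertools import product

nuclear = ['parent-of', 'father-of', 'mother-of',
           'sibling-of', 'brother-of', 'sister-of',
           'child-of', 'daughter-of', 'son-of']
qualifiers = ['', 'acknowledged-', 'claimed-', 'foster-', 'adopted-', 'step-']
extended = ['aunt-of', 'uncle-of', 'nephew-of', 'niece-of', 'cousin-of',
            'grandparent-of', 'grandfather-of', 'grandmother-of',
            'grandchild-of', 'grandson-of', 'granddaughter-of']
axes = ['', 'maternal-', 'paternal-']
social = ['spouse-of', 'partner-of', 'slave-of', 'master-of',
          'freedman-of', 'patron-of', 'client-of']

properties = ([q + r for q, r in product(qualifiers, nuclear)]
              + [a + r for a, r in product(axes, extended)]
              + social)
print(len(properties))  # 54 + 33 + 7 = 94: in the region of 90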

This is not necessarily a problem in itself, although it does get a bit messy. It is at least nicely organised into a hierarchy, and there are plenty of opportunities for adding disjoint and inverse property restrictions. However, what we gain in the simplicity of the direct link we lose by sacrificing the possibility of relating additional information to the connection, such as provenance, reference or certainty. If we model the relationship as a concept (i.e. a Class) rather than as a property connecting two entities, then we immediately open up more possibilities.

There are three obvious ways to do this:
1.

[Entity1] --<generic-linking-property>--> [Relationship Class]  --<relationship-specification>--> [Entity2]

e.g. [Επιγόνη] --has-relationship--> [AcknowledgedRelationship] --daughter-of--> [Επίγονος]

2.

[Entity1] --<generic-linking-property>--> [Relationship]
        --<generic-linking-property>--> [Entity2]
        --<generic-type-linking-property>--> [RelationshipSpecification]

e.g. [Επιγόνη] --has-relationship--> [AcknowledgedRelationship]
        --relationship-with--> [Επίγονος]
        --relationship-type--> [Daughter]

3.

[Entity1] --<generic-linking-property>--> [Relationship Classes] --<generic-linking-property>--> [Entity2]

e.g. [Επιγόνη] --has-relationship--> [AcknowledgedRelationship, Daughter] --relationship-with--> [Επίγονος]

The first two of these could just as easily be modelled the other way around, depending on where we preferred to put the emphasis:

[Επιγόνη] --has-relationship--> [Daughter] 
        --acknowledged-with--> [Επίγονος]
[Επιγόνη] --has-relationship--> [Daughter] 
        --relationship-with--> [Επίγονος]
        --relationship-type--> [AcknowledgedRelationship]

This is important because any additional information, such as provenance, reference or certainty, would be attached to the intermediary class, and it comes down to whether we see the hierarchy as being:

  • FamilyRelationship
    • AcknowledgedRelationship
    • FosteredRelationship
    • AdoptedRelationship
    • ClaimedRelationship
    • StepRelationship
  • RelationshipType
    • Parent
      • Father
      • Mother
    • Sibling
      • Brother
      • Sister
    • Child
      • Daughter
      • Son

or

  • FamilyRelationship
    • Parent
      • Father
      • Mother
    • Sibling
      • Brother
      • Sister
    • Child
      • Daughter
      • Son
  • RelationshipType
    • AcknowledgedRelationship
    • FosteredRelationship
    • AdoptedRelationship
    • ClaimedRelationship
    • StepRelationship

We can cut out some of this discussion by dropping the additional property and dual-classing the instance, as shown in the third example. Expanding on that, our class hierarchy would look like:

      • SocialContract
        • ExtendedHousehold
          • Household
            • FamilyRelationship
              • HereditaryFamily (If anyone can think of a better term I am open to suggestions)
                • Parent
                  • Father
                  • Mother
                • Sibling
                  • Brother
                  • Sister
                • Child
                  • Daughter
                  • Son
              • ExtendedFamily
                • Aunt
                • Uncle
                • Nephew
                • Niece
                • Cousin
                • Ancestor
                  • Grandparent
                    • Grandfather
                    • Grandmother
                  • GreatGrandparent
                    • GreatGrandfather
                    • GreatGrandmother
                • Descendent
                  • Grandchild
                    • Grandson
                    • Granddaughter
              • [SeriousIntimateRelationship]
                • [LegallyRecognisedRelationship]
            • [HouseSlave]
        • Slave
          • HouseSlave
        • FreedSlave
          • Freedman
          • Freedwoman
        • IntimateRelationship
          • SeriousIntimateRelationship
            • LegallyRecognisedRelationship
          • CasualIntimateRelationship
      • RelationshipQualifier (all disjoint with everything except HereditaryFamily classes)
        • Acknowledged
        • Adopted
        • Fostered
        • Claimed
        • Step
        • Half (disjoint with everything except Sibling classes)
      • RelationshipAxis (all disjoint with everything except ExtendedFamily classes)
        • Maternal
        • Paternal
        • Inlaw (disjoint with everything except HereditaryFamily and ExtendedFamily classes)

Disjoints would be defined for the gender-specific classes (Son/Daughter, Mother/Father, Aunt/Uncle etc.) and for those that are impossible without the use of time travel (Child/Parent, Ancestor/Descendent etc.), but given the period we are dealing with (Romans and Egyptians – I’m looking at you) it would be unwise to add any additional disjoints that we might otherwise consider between related people.

Of the options that use classes instead of, or in addition to, properties, this is the simplest. It tends to be bad design when you end up making everything a Class, which is what we have ended up doing here. Equally, we can go too far in the opposite direction in search of “simplicity” and the desire to have as few classes as possible. The intermediary options offer a combination of properties and classes, but also raise some questions as to where we want the emphasis of the encoding to lie. These are questions that we feel would be better opened up to discussion by the wider community rather than settled by executive decision.

To review:

Option 1: All properties

<http://clas-lgpn2.classics.ox.ac.uk/id/V1-37074> &snap;daughter-of <http://clas-lgpn2.classics.ox.ac.uk/id/V1-40436> .

Option 2a: Combination of Classes and Properties (classes define the relationship, properties the specific relationship)

<http://clas-lgpn2.classics.ox.ac.uk/id/V1-37074> &snap;has-relationship [
        a &snap;AcknowledgedRelationship;
        &snap;daughter-of <http://clas-lgpn2.classics.ox.ac.uk/id/V1-40436>] .

Option 2b: Combination of Classes and Properties (classes define the specific relationship, properties the relationship type)

<http://clas-lgpn2.classics.ox.ac.uk/id/V1-37074> &snap;has-relationship [
        a &snap;Daughter;
        &snap;acknowledged-with <http://clas-lgpn2.classics.ox.ac.uk/id/V1-40436>] .

Option 3a: Combination of Classes and Properties (emphasis on classes but with properties explicitly linking rather than dual-classing; main class is the relationship)

<http://clas-lgpn2.classics.ox.ac.uk/id/V1-37074> &snap;has-relationship [
        a &snap;AcknowledgedRelationship;
        &snap;relationship-with <http://clas-lgpn2.classics.ox.ac.uk/id/V1-40436> ;
        &snap;relationship-type &snap;Daughter] .

Option 3b: Combination of Classes and Properties (emphasis on classes but with properties explicitly linking rather than dual-classing; main class is the specific relationship)

<http://clas-lgpn2.classics.ox.ac.uk/id/V1-37074> &snap;has-relationship [
        a &snap;Daughter;
        &snap;relationship-with <http://clas-lgpn2.classics.ox.ac.uk/id/V1-40436> ;
        &snap;relationship-type &snap;Acknowledged] .

Option 4: All classes

<http://clas-lgpn2.classics.ox.ac.uk/id/V1-37074> &snap;has-relationship [
        a &snap;AcknowledgedRelationship;
        a &snap;Daughter;
        &snap;relationship-with <http://clas-lgpn2.classics.ox.ac.uk/id/V1-40436>] .

I hope this post has clearly laid out the options as we see them, and I’d like to invite your opinions and suggestions as to which way we go.

Why SNAP IDs?

One question that came up during the workshop a couple of weeks ago was: if partner projects already assign their own URIs/ids to their person/name/etc. records, then why should SNAP assign its own identifiers? There are two answers to that, one very practical, and the other a bit more philosophical.

  1. SNAP IDs will be URIs themselves, and when dereferenced in a browser, or by an application, will return a result: either a web page listing what SNAP knows about the record in question, or RDF data about it. We can’t do this in a practical way without assigning our own identifiers.
  2. On a more theoretical level, we think that any updates made to data post-ingest shouldn’t be made directly on our partners’ data. We believe, for example, that while SNAP might assert an identity between two person records coming from two partner datasets, it will be up to the partners whether they accept that identification.

The practical

When a person (or person-like entity) record in the SNAP triplestore is queried by URI via a web browser, we would expect this URI to dereference to an HTML page giving more information about the person recorded. The main information, the title of the page, would be the immediate source of the information: i.e. the contributing dataset, or datasets in the case of a merged or co-referenced person. Other information about the person (names, associated dates and places, primary text attestations, etc.) would also be listed, in a simple and standard layout, as would relationships to other persons, and other assertions about the person that SNAP knows about. An example (completely fictional) entry might therefore look something like:

TM 1234 = LGPN V5a-567
SNAP Person id: 10002
Apollonius/Apollonios/Ἀπολλώνιος
c. II cent
Aphrodisias
Father of TM 1233 (SNAP pid 10001), Diogenes/Διογένης
Attested in: PHI 256884; BGU.12.16024

In addition to this information, which may be different from, and in some cases supplemental to, the information in the contributing databases, we can imagine other information and services being added to this page. For example, a feed showing external projects that have linked to this person as annotations to names in their texts or archaeological objects; or a Social Network Analysis visualization of persons, places, texts, etc. within two steps of relationship to this person. All of these SNAP-specific services will only be possible if we have SNAP identifiers to dereference to pages containing this information.

The theoretical

When the SNAP system ingests data from a partner (Trismegistos, for example), we’ll get data from them that looks like:

<http://www.trismegistos.org/person/414#this>
   a lawd:Person ;
   dc:publisher <http://www.trismegistos.org> ;
   lawd:hasName <http://www.trismegistos.org/name/5663#this> ;
   lawd:hasAttestation <http://www.trismegistos.org/ref/1662#this> .
    
<http://www.trismegistos.org/name/5663#this>
   a lawd:PersonalName ;
   dc:publisher <http://www.trismegistos.org> ;
   lawd:primaryForm "Σαραπίων"@grc;
   lawd:primaryForm "Sarapion"@en ;
   lawd:hasAttestation <http://www.trismegistos.org/ref/1662#this> .

and SNAP will assign a new person id, like http://data.snapdrgn.net/person/1234 to the lawd:Person http://www.trismegistos.org/person/414#this. The theoretical reason for this is that SNAP plans to add functionality for the identification of persons belonging to multiple datasets and the annotation of those persons. As we noted above, we think those sorts of updates shouldn’t be applied directly to the Trismegistos resource by us. If you contribute data to SNAP, we feel strongly that we shouldn’t change that data. You should be free, of course, to accept new facts or assertions that emerge in SNAP that are relevant to your data back into your dataset, but those shouldn’t be forced on you, nor should it be made to look as if your project asserts something it doesn’t. There are a couple of possible ways to achieve this, but one very simple one is to create a derived resource to which new facts and assertions may be added. SNAP ids allow us to preserve the integrity of contributed datasets while allowing us to build upon them.
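As a sketch of the ingest step that mints these derived resources (using rdflib; the file names and the id counter are invented, and the real pipeline may differ):

import itertools

from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

LAWD = Namespace('http://lawd.info/ontology/')
PROV = Namespace('http://www.w3.org/ns/prov#')

# Mint a SNAP URI for each ingested lawd:Person and record its
# derivation, leaving the partner's own triples untouched.
partner = Graph()
partner.parse('trismegistos.ttl', format='turtle')

snap = Graph()
snap.bind('prov', PROV)
ids = itertools.count(1234)
for person in partner.subjects(RDF.type, LAWD.Person):
    snap_uri = URIRef(f'http://data.snapdrgn.net/person/{next(ids)}')
    snap.add((snap_uri, RDF.type, LAWD.Person))
    snap.add((snap_uri, PROV.wasDerivedFrom, person))

snap.serialize(destination='snap-derived.ttl', format='turtle')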

SNAP at Digital Humanities 2014

The SNAP Project is proud to announce ‘Ontologies for Prosopography: Who’s Who? or, Who was Who?’, a one-day workshop developed in conjunction with the People of the Founding Era project based at the University of Virginia. The workshop will give SNAP the opportunity to present our data model to a wider audience and to engage with researchers working on similar problems in other periods and geographic areas.

The morning session will be devoted to presentations of the methodologies used by different projects and discussion of the needs of researchers working with historic person data, and of how those needs have been, and can be, addressed. Building on this, the afternoon will offer the option of smaller focused discussions or hands-on, practical sessions in which attendees will have the opportunity to discuss with experts their own data and how they can publish it as structured linked data.

The short description of the workshop, as seen on the DH2014 website, is below. A more detailed description can be found at http://www.stoa.org/archives/1953

Summary:

Historical data about people, their names, their attributes, and their relationships is one of the most common types of data for projects to expose, and yet it is an area which is falling behind others in the move to digital data publication and exchange.

The morning session, ‘Modelling the Person’, will address the issues of modelling historical persons, with presentations and discussions on practices from a range of existing or emerging projects and models that attempt to capture information about historical persons using structured models compatible with semantic web thinking — models such as SNAP:DRGN, CIDOC-CRM/FRBRoo, the factoid model, SNAC, etc., plus any others that participants are already using to model their data. Building on these presentations, the workshop looks towards finding out whether a cross-project consensus on standards and best practice is possible.
The workshop will continue in the afternoon with a session on ‘Linking the Person’. Attendees will have the opportunity to continue the morning’s theoretical discussion or to break out into other areas, with a choice of smaller groups focusing on the technical and practical issues of linking person and name data from different projects together, including hands-on sessions on preparing and publishing prosopographical and onomastic datasets as structured data (attendees are encouraged to bring their own datasets if they choose to take part in one of the hands-on breakout groups).

Audience:

This workshop will particularly appeal to prosopographers, biographers, genealogists, classicists, social historians, and those working with resources in which persons are mentioned, whether from the Greco-Roman and connected periods or from the foundation of America.

The workshop will also appeal to ontologists, technologists and developers with an interest in structured, open, linked data who are dealing with data related to historical people and names. The breakout groups in the afternoon will cater for all levels of technical ability.

Although some of the projects showcased in this workshop focus on specific periods such as the Greco-Roman world and the foundation of America, the issues raised are applicable to all historical eras and loci, and participation by researchers from all periods and areas is encouraged and welcomed.