(SNA)P

Being a conversation between Gabriel Bodard, Yanne Broux and Silke Vanbeselaere about the SNAP:DRGN project and Social Network Analysis

Cross-posted to Data Ninjas: http://spaghetti-os.blogspot.be/

Gabriel Bodard: So, tell me what is Social Network Analysis, and how is it useful for prosopography projects?

Silke Vanbeselaere: Social Network Analysis (SNA) is basically the study of relationships between people through network theory. First used in sociology, it’s now become popular in many other disciplines, with a budding group of enthusiasts in (ancient) history.
What it does is focus on relations (of whatever kind) instead of on the actors individually. Through visualisation of the network graph and through network statistics, information can be obtained about the structure of the network and the roles of the individuals in it.

The visualisation of these network graphs can be especially interesting for prosopography projects, as it can help disambiguate people. Individuals are represented by nodes and their relationships are represented by ties or links between those nodes. Instead of dealing with one source at a time, the network allows you to see the whole of the relationships.

GB: Can you perhaps illustrate that with an example and how it could help us?

Yanne Broux: Off the top of my head: one of the things the extremely nifty disambiguation methods we developed could help you out with is the identification of high-ranked Roman officials across the different datasets. Consuls were often mentioned in dating formulas, and procurators, proconsuls, legati and the like were pretty mobile, so chances are they appear in texts across the empire. They’ll light up like Christmas trees once we shape them into a network.

What is the SNAP:DRGN project about, Gabby, and what sort of prosopographical data (especially relationships) does it contain?

GB: I’m glad you asked me that. SNAP:DRGN stands for “Standards for Networking Ancient Prosopographies: Data and Relations in Greco-Roman Names” (not an artificial backronym at all!). Very briefly, the aim of the project is to bring together person-records from as many online prosopographies of the ancient world as possible, using linked data to record only the most basic information (person identifiers, names, citations, date, place and hopefully relationships with other persons). We only plan to store this very summary data, along with links back to the richer records in the contributing data source, and enable annotation on top of that.

SV: What will people be able to do with this limited data, then?

GB: In particular, scholars will be able to (1) join together records originally from different databases that clearly refer to the same person; (2) point out relationships between persons, e.g. person XYZ from this database is the daughter of person ABC from that one; and (3) annotate their own texts (archaeological or library records, etc.) to disambiguate a personal name using SNAP as an authority list.

At the moment there are relatively few co-references between the prosopographical datasets in SNAP, i.e. people who appear in more than one database (although there will be plenty between the library catalogues), and the only explicitly encoded personal relationships are the ones imported from the Trismegistos database, but we’re working to improve both of these things. How does that sound to you?

SV: Basically, what we need for SNA is a link between the people and the texts in which they appear. Now, I have no idea how sophisticated these other datasets are, but to avoid confusion/mistakes/whatever kind of apocalyptic disaster, you need unique numerical identifiers, both for your individuals and for your texts. Trismegistos Texts is now slowly expanding beyond the Egyptian borders, so perhaps we already have some of the texts incorporated in the other datasets, and then it should be pretty easy to link them. But I suspect that for most of the data, new identifiers will have to be created.

GB: Unique identifiers for all persons we have, of course. SNAP mints URIs for all persons we have data for, whether they had dereferenceable URIs in the source datasets or not. In some cases we have identifiers for texts too (TM uses Papyri.info URIs, as you know); in other cases, we’ve had to hope that parsing text strings will be sufficiently unambiguous to be useful. (We’ve identified a few hundred co-references between LGPN and TM using text strings.) We also have a lot of persons from library catalogues (VIAF, the British Museum, Zenon and Trismegistos Authors) among whom co-references ought to be plentiful.

So this seems to be a little circular at the moment, doesn’t it? One of the things SNA might help with is identifying co-references, which in turn will help us build a graph of relationships. But you’re telling me that SNA isn’t really feasible on our data until we have a much better graph of co-references, relationships, and text co-occurrences. Is there anything useful we could do together in the meantime?

YB: Since we are enriching Trismegistos, by adding new texts from around the Mediterranean, by identifying individuals in the Egyptian texts, and by adding extra information such as titles, ethnics and status designations, and at the same time you are enriching SNAP, we are actually feeding into each other symbiotically, like meat ants and leafhoppers that find each other over sugary sap in the Australian outback.

And hey, Silke, what about that “Structural Equivalence” hoodoo you’ve been learning in London, could that be of any help?

SV: Well, it is a very interesting concept that explores the social environment of a person, but that implies that all your data need to be extremely accurate. That means that you first need to identify all the people mentioned in your network. Because without that information, there’s no way you could rightly use your data to explore the structural equivalence of two or more people. As such, I don’t see how it would yet be suitable to use on the data that we would be presented with in SNAP. In the future however…

GB: Are there any improvements you can suggest for the Trismegistos database?

YB: We’re kind of stuck when it comes to titles. You see, we hardly have any. Asking the computer to retrieve them, like we did for the names, proved to be next to impossible, and it’s a hell of a lot of work if you have to go through some 500,000 attestations manually. I’ve already gone through more than 10,000 of them while working on my double names and municipal officials, so I’ve done my share, methinks. Also, it’s not exactly easy to standardize titles, what with all the different languages in Egypt and all. But I guess that if one of the other datasets has a list or something we could look into, that might help us out a bit…

GB: So, in a hopefully not too distant future, when all these relationships are implemented through SNAP:DRGN, how can the participating projects in turn be of service to you and other researchers who would like to use SNA? When SNAP is ready for SNA to be performed on it, what questions will you ask of it?

SV: Well, Gabby, I’m glad you ask. Prosopographies are the ideal datasets for SNA research, as those datasets of people have been formed or selected because of some common features (mentioned in the same source, part of an ethnic or social group, contemporaries…). Once the technical infrastructure is in place, it will be relatively straightforward to convert the virtual two-mode networks linking texts and the people appearing in them into the one-mode networks (person – person) needed for actual SNA.
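To make that conversion concrete, here is a minimal sketch using the Python networkx library; the texts, names and edges are invented for illustration, not project data:

import networkx as nx
from networkx.algorithms import bipartite

# Two-mode (bipartite) network: one node set for texts, one for persons.
B = nx.Graph()
texts = ["P.Oxy. 1", "P.Oxy. 2"]
persons = ["Apollonios", "Zenon", "Herakleides"]
B.add_nodes_from(texts, bipartite=0)
B.add_nodes_from(persons, bipartite=1)
# An edge means "this person is attested in this text".
B.add_edges_from([("P.Oxy. 1", "Apollonios"), ("P.Oxy. 1", "Zenon"),
                  ("P.Oxy. 2", "Zenon"), ("P.Oxy. 2", "Herakleides")])

# Project onto the person mode: two persons are linked if they co-occur
# in at least one text. This is the person-person network used for SNA.
G = bipartite.projected_graph(B, persons)
print(G.edges())  # e.g. [('Apollonios', 'Zenon'), ('Zenon', 'Herakleides')]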

GB: This all sounds very promising! Thank you so much for sharing these ideas. I look forward to being in a position to do a bit more with all this some day.

Are you a prosopography?

At the SNAP:DRGN project meeting in Edinburgh a few weeks ago, we decided on a couple of definitions that will impact on the ways in which partner datasets interact with the project. Our current thinking is that we need to distinguish between two kinds of data:

(1) The first kind, which we’ll loosely call a “prosopography”, is a curated database of person records, with some ambition to be able to be used as an authority list. Prosopographies such as PIR, Broughton, PBW, etc. would be obvious examples of this category, as would the controlled vocabulary of persons in a library catalog like VIAF, Zenon, British Museum persons, Trismegistos Authors, the Perseus Catalog, etc. Even if the task of co-referencing persons is incomplete (as with Trismegistos, say), the intention to disambiguate qualifies the dataset as a “prosopography”.

(2) The second, which we call a “list of attestations” is not comprehensively curated or disambiguated in this way, and has no ambition of being an authority list. Examples of this kind of dataset (as I understand them) would include: the EDH person table; the raw list of name references Mark has extracted from Latin inscriptions; the tagged and indexed “names and titles” in the texts of the Inscriptions of Aphrodisias or Inscriptions of Roman Tripolitania.

In the SNAP:DRGN workflow, we hope that all “prosopographies” of type 1 will be contributed into the SNAP graph. We shall assign SNAP URIs to all persons in the datasets, and in time work to co-reference and merge persons sourced from other projects as well as possible. These will form the authority file to which other datasets will refer, and we would recommend that lists of “attestations” of type 2 use Pelagios-style OAC annotations (*) to point to the SNAP identifiers as a way of disambiguating their person-references. The process of disambiguating and/or co-referencing persons in this way might eventually lead some lists of annotations to become disambiguated prosopographies in our schema, at which point we would potentially want to include them in the SNAP graph as first-class entities.

(*) We hope to have the SNAP:DRGN guidelines for these Pelagios-like annotations (“Scenario 5” in our Cookbook) available very shortly.
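Ahead of those guidelines, here is a rough sketch of the shape such an annotation might take, built with Python’s rdflib. The Open Annotation terms (oa:Annotation, oa:hasBody, oa:hasTarget) are the standard vocabulary; all the URIs are invented placeholders, and the real recipe will be the Cookbook’s, not this one.

from rdflib import Graph, Namespace, URIRef, RDF

OA = Namespace("http://www.w3.org/ns/oa#")

g = Graph()
g.bind("oa", OA)

anno = URIRef("http://example.org/annotations/1")  # placeholder URI
g.add((anno, RDF.type, OA.Annotation))
# Body: the SNAP person this name-reference is taken to refer to (placeholder id).
g.add((anno, OA.hasBody, URIRef("http://data.snapdrgn.net/person/99999")))
# Target: the passage in the source text where the person is attested (placeholder).
g.add((anno, OA.hasTarget, URIRef("http://example.org/inscriptions/12#name-3")))

print(g.serialize(format="turtle"))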

Some example RDF fragments

In the process of working with a few of our partner projects, we have produced some sample RDF fragments, which we thought might be useful as an illustration of SNAP RDF format for other projects currently planning to expose a version of their data via our graph. We hope to include at least some examples of this kind in a later version of the SNAP:DRGN Cookbook.

First off, the simplest and most minimalist example possible (even sparser than the PIR data, which contains little more than headwords). The Zenon database is the library catalog of the German Archaeological Institute (DAI), which has an authority list of some 360 ancient authors. The RDF of this authority list (encoded natively in SKOS) will contain very little information except for URI and preferred name (sample in GDoc):

<http://zenon.dainst.org/000003901_1333e04fe2d7b09b43b088eb2ff1413f#this>
	rdf:type lawd:Person ;
	dc:publisher <http://www.dainst.org/> ;
	foaf:name "Platon"@de .

(Translated to English: “<this URI> is a person, according to the DAI, called Platon in German.”)

The next example is from the Prosopography of the Byzantine World, a project published by King’s College London. This is a full prosopography, on the “factoid model,” that contains much more richness of information and biographical data than SNAP:DRGN has any aspiration to include. We took one example of a fairly complex person (Leon 103) to show just what a SNAP version of his data might look like. In this case, SNAP will capture the URI; names (both English and Greek); associated date; associated place(s), in as much as these can be extracted from the database; attestation (in PBW) and citation(s); and relationships with other persons (Leon’s cousin Kale is also in PBW).  See the full RDF of Leon 103 in a GDoc here.
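For a flavour of what that might look like, here is a speculative mini-version in rdflib (see the GDoc for the real thing). lawd:Person and foaf:name are the terms used in the Zenon example above; the snap: namespace and the associatedDate and cousinOf properties are our guesses for illustration, not confirmed vocabulary, and all URIs and values are placeholders.

from rdflib import Graph, Namespace, URIRef, Literal, RDF

LAWD = Namespace("http://lawd.info/ontology/")      # assumed namespace URI
FOAF = Namespace("http://xmlns.com/foaf/0.1/")
SNAP = Namespace("http://onto.snapdrgn.net/snap#")  # assumed namespace URI

g = Graph()
leon = URIRef("http://data.snapdrgn.net/person/leon-103")  # placeholder URI
kale = URIRef("http://data.snapdrgn.net/person/kale-102")  # placeholder URI

g.add((leon, RDF.type, LAWD.Person))
g.add((leon, FOAF.name, Literal("Leon", lang="en")))
g.add((leon, FOAF.name, Literal("Λέων", lang="grc")))
g.add((leon, SNAP.associatedDate, Literal("placeholder date")))  # assumed property, invented value
g.add((leon, SNAP.cousinOf, kale))                               # assumed property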

Finally, we mocked up the example of a person from the Smith Dictionary of Greco-Roman Biography and Mythology, which is being encoded and NER’ed by Stella Dee in Leipzig. As an example, we took Brutus 18 (the less famous D. Junius Brutus). From this entry, we hope to be able to include in SNAP his name (English only); associated date and associated place (both depending on NER); attestation and citations; and relationships (4 relationships are recognised in the text of Smith’s entry, one of which is to another person in the Dictionary). See full RDF of Brutus 18 in a GDoc here.

We’ll try to add more examples of this kind as we come up with them. Let us know if you find this sort of thing useful.

SNAP and VIAF

We’ve had a couple of meetings with Karen Smith-Yoshimura and Thomas Hickey, of the Scholars Contributions to VIAF group, to discuss possible collaborations, exchange of information, and mutual benefits of sharing standards between the SNAP:DRGN project and VIAF (the Virtual International Authority File, a federated authority list of persons from library catalogs, mostly from author or subject fields).

We considered two main questions:

  1. What can SNAP:DRGN gain from VIAF data or formats? Most concretely, what subset of VIAF person-records, and what fields in them, should we consider ingesting into the SNAP graph?
  2. How can VIAF benefit from SNAP:DRGN’s work in this area? In what ways can SNAP provide data and information that might be passed back to VIAF for inclusion in the authority file?

Preliminary answers and thoughts below.

1. What can SNAP get from VIAF?

Looking at the VIAF data model, we decided that in most cases the only categories of information we would get from them would be (a) a URI and (b) a name. There was some discussion as to whether we could sometimes extract from the data (c) some alternative name forms (e.g. in Greek, by searching for Unicode codepoint range); (d) a date, which is present in some name strings; (e) an associated place, which is present in some name strings. VIAF records that come in via Wikipedia or Wikidata would also give us (f) an alternative ID, in the form of a Wikipedia/DBpedia URI. We didn’t think that the LAWD “attestation/citation” categories were appropriate for modelling the information about books with these persons as authors or subjects, although that is the most useful information that one would get by going back into the VIAF data from the SNAP graph.
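As a concrete illustration of (c), a check like the following would flag name strings containing Greek characters. The ranges are the standard Unicode Greek and Greek Extended blocks; the example names are invented.

def looks_greek(name: str) -> bool:
    """True if any character falls in the Greek or Greek Extended blocks."""
    return any('\u0370' <= ch <= '\u03ff' or '\u1f00' <= ch <= '\u1fff'
               for ch in name)

print(looks_greek("Platon"))   # False
print(looks_greek("Πλάτων"))   # True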

We discussed what subset of the VIAF dataset would be of interest to model in SNAP, and after a few experiments with filtering by date (which is not always given), language (not always given), and contributing collection, we settled on a preliminary export of persons who matched:

  • birth OR death date present, and before 1000 C.E.
  • AND any one or more of
    • language: Latin
    • OR language: Ancient Greek
    • OR collection: Perseus Catalog.

Which gives us a small corpus of 1,781 ancient authors to experiment with.
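Expressed as code, that filter might look something like the sketch below; the record structure and field names are invented for illustration, not VIAF’s actual schema.

def in_snap_subset(rec: dict) -> bool:
    # birth OR death date present, and before 1000 C.E.
    dates = [d for d in (rec.get("birth_year"), rec.get("death_year"))
             if d is not None]
    ancient = any(d < 1000 for d in dates)
    # AND language Latin, OR language Ancient Greek, OR collection Perseus Catalog
    relevant = (rec.get("language") in {"lat", "grc"}
                or "Perseus Catalog" in rec.get("collections", []))
    return ancient and relevant

print(in_snap_subset({"death_year": 347, "language": "grc", "collections": []}))  # True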

2. What can SNAP give to VIAF?

VIAF is an authority list of authors, artists, other creators, and people important enough to have a book (or at least a chapter) written about them, so they won’t be interested in hundreds of thousands of names of Greeks and Romans about whom all we know is their gravestones, contracts they signed, or graffiti they left on a theater wall. In order to flag a subset of persons whom VIAF might be interested in importing from the SNAP graph, we are proposing to add a new property to the SNAP ontology: associatedRole. This would allow us to flag poets, historians, authors, potters, sculptors, actors, performers etc., whom VIAF would include in their authority file, even if no works by these people survive. We’ll consider doing so in a later revision of the Cookbook, since version 1.0 is now locked down.

Another way in which the SNAP dataset may be of value to VIAF is through the connections that we make between databases by coreferencing and disambiguating unique individuals. If we have a VIAF record for a person, but that person is also in the British Museum person thesaurus, the Trismegistos author table, LGPN and/or PBW, then variant names, dates, citations, alternate identifiers and other information from these databases might enrich the VIAF data on these records, and could be automatically ingested via the linked data we produce.

Many of the issues discussed above will also come up when we speak to other potential data partners about linking up SNAP records with their data, so it was great to have this preliminary conversation.

Minutes of first Advisory Board meeting

SNAP:DRGN Advisory Board

1st meeting Skype (voice only) 2014-05-09

Present: Øyvind Eide (chair), Fabian Koerner, Robert Parker, Laurie Pearce, Charlotte Roueché, Rainer Simon, Gabriel Bodard (principal investigator)

Not present: Sonia Ranade.

The meeting lasted around one hour.

Minutes written by Øyvind Eide based on notes from Laurie Pearce and Rainer Simon.

Entering the SNAPDRGN garden

Now that the SNAP project has started to ingest finalized data from the initial core datasets, it is time to think about how to bring in material from the other partners. For some, this will be easy, as they already know how to make their data available in RDF form on the open web, and simply need to follow the guidelines in the Cookbook. For others, quite a lot of work will be involved in getting their data SNAP-ready. This post describes some of the stages you may go through, and some of the problems that you may meet.

I have divided the work into six steps:

  1. Decide whether you have a set of names, a set of attestations, or a prosopography
  2. Data wrangling
  3. Identify your records
  4. Establish the identities online
  5. Transform the data
  6. Make the RDF available

The first step is the most critical – what does your data actually represent? If your starting point is a text, and you have extracted the personal names, you do not have prosopographical records yet, but rather a set of attestations of names in a text. This is in itself a good thing to do, but to get to the prosopography you have to decide how many actual individuals are referred to in the text – 10 occurrences of the name Marcus Aurelius may refer to anywhere between 1 and 10 persons (or one cat!).

If what you have is a set of attestations, what you probably want to do is contribute a slightly different form of data to the SNAP network, namely links from people to sources. This will mean deciding which person in SNAP your name (Aelius Florus natione Pann., for example) refers to, and then generating RDF which links the URL of the source text to the SNAP person.

The second task, data wrangling, can be quite laborious. It involves turning the list of names into a set of person records, and creating a single record for each person which (for example) lists all the different names they are attested under, and what the source of those attestations is. At this point you will also be assembling other information from your sources (place of birth, sex, profession etc). You will under some circumstances be comparing what you have with established authority lists of names or persons.
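As a toy illustration of this wrangling step, the sketch below collapses a hand-made list of attestations into one record per person, listing the name forms and sources under which each is attested. The person assignments and source references are invented and given in advance here; deciding them is exactly the disambiguation work described in step one.

from collections import defaultdict

attestations = [
    {"person": 1001, "name": "Soranus", "source": "CIL VI 1234"},   # invented refs
    {"person": 1001, "name": "Soranos", "source": "IG II2 5678"},
    {"person": 1002, "name": "Soranus", "source": "CIL VI 9999"},
]

# One record per person, accumulating name variants and sources.
persons = defaultdict(lambda: {"names": set(), "sources": set()})
for att in attestations:
    persons[att["person"]]["names"].add(att["name"])
    persons[att["person"]]["sources"].add(att["source"])

for pid, rec in sorted(persons.items()):
    print(pid, sorted(rec["names"]), sorted(rec["sources"]))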

Now that we have a set of people, the third task is identifying your records, i.e. assigning each person a unique identifier. It seems obvious, but if all you are doing is analyzing the information in a spreadsheet, you may never have needed to.

The fourth task, and the minimum piece of work, is to establish your identifiers online, i.e. make them available as URIs. If you have a person called Soranus, you have allocated number 1001 to him, and you are able to use the domain http://my.people.net/, you might decide that his public identifier is http://my.people.net/person/1001. At the least, make it so that this displays an HTML page about the person for anyone visiting that URL. The very simplest way to do this is to make an HTML file for each person, and arrange the web server so that a request for /person/1001 returns that web page (look at facilities for URL rewriting). This assumes, of course, that you have a web server you can use to put up files, with some reasonable prospect of that remaining in place for the coming decades or longer. Larger institutions may have a repository which can do this job for you, and even assign Digital Object Identifiers (DOIs) to your data.

If you can go further, and map your permanent URLs to queries against a database (where, for example, http://my.people.net/person/1001 is turned behind the scenes into http://my.people.net/query.php?id=1001, and retrieves the data from a relational database), you will have a more powerful (but harder to sustain) resource. At this point you can consider having your web server do content negotiation, and return different formats of data in response to Accept headers (technical details at http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html) so that humans read HTML and computers read RDF.
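As a sketch of what such a setup could look like (assuming Python and the Flask framework, with an invented file layout), the following serves HTML to browsers and Turtle to clients that ask for it:

from flask import Flask, request, send_file

app = Flask(__name__)

@app.route("/person/<int:pid>")
def person(pid):
    # Content negotiation: inspect the Accept header and pick the best match.
    best = request.accept_mimetypes.best_match(
        ["text/html", "text/turtle"], default="text/html")
    if best == "text/turtle":
        return send_file(f"data/person_{pid}.ttl", mimetype="text/turtle")
    return send_file(f"pages/person_{pid}.html", mimetype="text/html")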

The fifth stage is transforming your data to the RDF format which SNAP wants, as explained in the cookbook. This may involve an XSLT transform of XML data (RDF can be represented in XML), or a Python script reading a database, or an output template in your publication system. Incidentally, if you’re wondering how to get data out of an Excel spreadsheet, the OxGarage can turn the sheet into XML for you.
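For example, a transformation script along these lines (a sketch under stated assumptions: an invented CSV with id and name columns, invented URIs, and the rdflib library) would produce RDF resembling the Zenon fragment shown in an earlier post:

import csv
from rdflib import Graph, Namespace, URIRef, Literal, RDF

LAWD = Namespace("http://lawd.info/ontology/")  # assumed namespace URI
FOAF = Namespace("http://xmlns.com/foaf/0.1/")

g = Graph()
g.bind("lawd", LAWD)
g.bind("foaf", FOAF)

with open("people.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):  # assumed columns: id, name
        person = URIRef(f"http://my.people.net/person/{row['id']}")
        g.add((person, RDF.type, LAWD.Person))
        g.add((person, FOAF.name, Literal(row["name"])))

g.serialize(destination="people.ttl", format="turtle")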

If you’ve got this far and created some RDF suitable for submitting to SNAP, that’s great. Even better, you may also want to move on to the sixth stage, making your RDF available on the web, so that SNAP can harvest it at intervals and keep up to date. This stage means looking at running your own RDF triple store. Setting this sort of software up yourself isn’t so easy, but you may wish to read up on open source projects like OntoWiki, Sesame, and Fuseki. Maybe one of us will blog more about this in future.

SNAP at DH2014

In a change from our usual programming we will be bringing you a blog post directly from the Ontologies for Prosopography workshop at Digital Humanities 2014.

The workshop hopes to bring prosopographers and prosopographical datasets together (preferably through LOD).

Datasets in the following areas were represented at the workshop:

  • Greek/Byzantine Manuscripts in Sweden
  • Apprenticeships in Early Modern Venice
  • Late Antique Immigration from Rome
  • Roman worlds
  • Imperial cult priests in North Africa and Syria
  • European University history from the Middle Ages to the present
  • C17/18 Tibetan monasteries in China and Tibet
  • Fictional People
  • Cuneiform tablets, from Hellenistic Babylonia
  • Contemporary legal texts
  • Estonian folklore
  • Mayan hieroglyphics
  • C20 Canadian literary publications and magazines
  • History of Beer and Brewing
  • Avant-garde periodicals
  • Archival descriptions of records
  • German National Biography
  • C19/20 Church records looking at migration
  • Lexicon of Greek Personal Names
  • Canadian Gay Liberation 1960 – 1980
  • Encyclopedia Virginia
  • French Literature (including historical, mythic and fictional persons)
  • Early (Pre-1900s) Caribbean
  • Correspondence networks
  • C19 Mexican authors

Looking Towards an API for SNAP:DRGN

During the first SNAP:DRGN workshop a breakout group was convened to discuss the potential API for the project. Rather than come up with a specific API during that session, we instead focused on creating a “wish list” of applications and functions that we wanted to support. We were then able to abstract the functions that would be needed to support the list.

The most vital functions support querying the dataset to retrieve the original id of an entity and all the information about it. Since the reference and provider information can be extracted from the general data about an entity, individual functions for those specific data are not strictly necessary; testing will be needed to see whether they help reduce the load on the server, since they are much more directed and thus require less overhead. (A speculative sketch of calling one of these functions follows the list below.)

getChildIds(id) takes SNAP id, returns ids of entity/entities that the current entity has been derived from
getSnapIds(id) takes partner id, returns list of ids of entity/entities that are derived from the current entity
getLiveSnapIds(id) takes partner id, returns list of ids of non-deprecated entity/entities that are derived from the current entity
getPersonInfo(id, ?content_type) takes id, returns full details of the entity with that id in the format specified (‘rdf’ or ‘json’). If no content_type is specified then the function will return RDF
getReferences(id) takes id, returns a list of all the references associated with the given entity
getProviders(id) takes id, returns the list of data provider(s)/project(s) associated with the given entity
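As a thought experiment (none of this is implemented yet), a client might call one of these functions over HTTP like so, assuming a REST-style mapping and an invented base URL:

import requests

BASE = "http://snapdrgn.example.org/api"  # placeholder, not a real endpoint

def get_person_info(snap_id: str, content_type: str = "rdf") -> str:
    # Mirrors getPersonInfo(id, ?content_type) from the list above.
    resp = requests.get(f"{BASE}/getPersonInfo",
                        params={"id": snap_id, "content_type": content_type})
    resp.raise_for_status()
    return resp.text  # RDF by default, JSON if requested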

 

These functions will support the main internal functions which will be used to populate the main body of the website:

getPersonPage(personId) takes id, returns html page with details of the entity with that id
getProjectPage(projectId) takes dataset/project identifier, returns html page with the information about that dataset/project

The remaining API functions that were identified have been divided by priority. Higher priority has been given to functions that seem likely to be most useful to users, will be used by other functions, or which will support immediate issues such as disambiguation and basic filtering. Priority has also been given to functions that rely on information that is currently supported within the ingested data or which support the promotion of our partner projects.

High Priority:

getBaseIds(id) takes a SNAP id, returns earliest id of entity/entities that the current entity has been derived from i.e. it returns the external URIs that the entity was ultimately derived from no matter how long the chain of derivation
getDatasetInfo(dataset, content_type) takes dataset identifier, returns full details of the dataset as per data-info in the format specified (‘rdf’ or ‘json’)
getDates(id) takes id, returns the list of associated date value for the given entity
getNames(id) takes id, returns the list of name values for the given entity
getNames(id, lang) takes id, returns the list of name values that match the specified language code for the given entity
getPlaces(id) takes id, returns the list of associated place value(s) for the given entity
getEntitiesByRef(reference) takes reference, returns a list of all the entities that point to the given reference

 

Medium Priority:

getEntityNetwork(id) takes id, returns ids of any entities that are either derived from the given entity or that the given entity is derived from or are asserted to be the same entity
getRelatedEntities(id) takes id, returns list of ids of related entities and how they are related
getEntitiesByName(name) takes name value, returns list of entities with an associated name value that exactly matches the given value.
getEntitiesBySource(source) takes source value, returns list of entities referenced within that source document

 

Low Priority:

getRelation(id, relationship) takes id, returns list of entities that have the given relationship to the given entity
getRelationships(id1, id2) returns the direct relationship (if there is one) from id1 to id2
getEntitiesByPublisher(publisher) takes publisher identifier, returns list of entities associated with the given provider
getEntitiesByDate(date) takes date value, returns list of entities with an associated date value that matches or contains the given value
getEntitiesByPlace(place) takes place value, returns list of entities with an associated place value that matches or contains the given value
getCountByPublisher(publisher) takes publisher identifier, returns count of entities from that publisher
getCountByName(name) takes name, returns count of entities with that exact name
getCountByPlace(place) takes place, returns count of entities associated with that place
getCountByRef(reference) takes reference, returns count of entities associated with the reference
getCountBySource(source) takes source, returns count of entities associated with that source document
getTemporalRange(publisher) takes publisher identifier, returns temporal range of dataset as given in dataset description
getGeoRange(publisher) takes publisher identifier, returns geographical range of dataset as given in dataset description

 

Future work, probably beyond the scope of the immediate pilot project will expand the available functions, both in the public API and in the internal systems. In addition to those functions listed below, the existing functions that return lists of entities will be expanded to have optional parameters which will allow the returned result to be filtered by date range and (potentially) place.

Aspirational functions:

getRelationshipPath(id1, id2) returns the sequence of relationship links (if there is one) that joins id1 to id2
getRelatedNames(name) takes name, returns list of names that are specified as variants to the given name
getEntitiesByDate(tpq, taq) takes start and end date values, returns list of entities with an associated date value that matches or falls within the given period
getEntitiesByApproxName(name) takes name value, returns list of entities with an associated name value that matches or closely matches the given value
getEntitiesByAltName(name) takes name value, returns list of entities with an associated name value that is an alternate of the given value
getAssertionAuthority(id) takes id, returns the authority (possibly the original publisher) that identified the entity as an entity
isDeprecated(id) takes id, returns true if entity has been marked as deprecated
getCurrent(replaced-id) takes id of replaced entity and returns the list of ids of entities that have replaced it
getAgreementMetric(id) returns a value for the level of agreement for the existence of a given entity as determined by the conflicting assertions
getCertainty(id, authority) returns the certainty value for a given authority for a given assertion (id)

 

Beyond the basic querying functions, the highest priority will be to add functions to support making assertions about existing entities. These assertions will result in the creation of new entities, and as such each assertion will need to have an identifier attached for the person making the claim. Optional values will allow the asserter to specify their level of certainty and to give a reference to a resource which backs up their statement. The primary assertions (that the entities specified in the first argument do or do not represent co-references) will be implemented initially, but as the assertion system is finalised this list will be expanded and refined. (A speculative sketch of such a call follows the list below.)

coRefAssertion(entity-list, authority, ?certainty, ?reason) make a co-reference assertion
notCoRefAssertion(entity-list, authority, ?certainty, ?reason) make an assertion that two entities are not co-references
deprecateEntity(old-id, replacement-id) annotates the entity identified by the old-id as deprecated and adds a pointer to the replacement entity as defined by replacement-id.
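Purely as illustration, an external system might submit such an assertion like this; the endpoint, field names and all values are invented:

import requests

assertion = {
    "entities": ["http://data.snapdrgn.net/person/123",      # placeholder ids
                 "http://data.snapdrgn.net/person/456"],
    "authority": "http://example.org/researchers/jdoe",      # who makes the claim
    "certainty": 0.9,                                        # optional
    "reason": "http://example.org/article-discussing-both",  # optional supporting resource
}
resp = requests.post("http://snapdrgn.example.org/api/coRefAssertion",
                     json=assertion)
resp.raise_for_status()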

 

Internally, restricted functions intended to support the addition of new datasets and allow more automation of the data ingestion system will be added. This will allow us to streamline the ingestion workflow and more easily add data from partner projects:

addDataset(data-info-uri, data-uri) ingests rdf from source given in file defined by data-info, and rdf description of the dataset from file indicated by data-info-uri
addDatasetDescription(data-info-uri) adds description of the dataset to the triple store
addDatasets(data-info-uri, data-uri-list) ingests rdf from each source listed in the given file defined by data-info, and rdf description of the datasets from file indicated by data-info-uri

 

Finally, purely internal functions to query externally held linked data will allow us to connect with datasets that have not been directly ingested. These datasets may be outside the scope of SNAP, for example place rather than person data, or may represent prosopography datasets which have been published and hosted in a compatible format by other projects with just the minimum information (snap identifier and reference) held locally:

lookupEntity(uri) query external data source for RDF relating to the entity specified by the URI
lookupSource(uri) query external data source for information on/text of the source at the given URI
lookupPlace(uri) query external data source, e.g. Pelagios, for RDF information on the place specified by the URI

 

The “wish list” identified in the workshop is detailed below. The list is not presented in any particular order and was conceived as purely aspirational.

  • Lightweight (popup) widgets for embedding in external websites
  • Data to support visualisations
  • Find mappings (that is, given one identifier, find all coreferencing identifiers)
  • Correlate identifier to provider
  • Filter identifiers by a variety of criteria
    • provider
    • date (created / modified)
    • subjectOf assertions (depends on precise character of assertions)
    • objectOf assertions (depends on precise character of assertions)
  • Read information about entity
    • Retrieve information from providers
    • Retrieve information/text from source (where available online)
  • Search on name – get result or disambiguation page
    • Language independent as much as possible
    • Disambiguation
      • The Fuzzy Person (Entity and closely related entities)
      • The Fuzzy Name (name variations)
  • Autosuggest
    • match hinting
    • name variation (soundex, spelling variants…)
  • Assertion creation
    • must allow external systems to publish assertions
    • public identification needed to link assertion to who did it
  • Dataset ingestion
    • single dataset
    • bulk datasets
  • Trust/Authority analysis on assertions
    • certainty assertions
    • agreement assertions
  • Prosopography/Project Summaries
    • Size
    • Temporal span
    • Geographic span
    • EAC Suggested Archive Descriptions:
        • Reference Code (Required)
        • Name and Location of Repository (Required)
        • Title (Required)
        • Date (Required)
        • Extent (Required)
        • Name of Creator(s) (Required, If Known)
        • Administrative/Biographical History (Optimum)
        • Conditions Governing Access (Required)
        • Physical Access (Added Value)
        • Technical Access (Added Value)
        • Conditions Governing Reproduction and Use (Added Value)
        • Languages and Scripts of the Material (Required)
        • Custodial History (Added Value)
        • Immediate Source of Acquisition (Added Value)
        • Appraisal, Destruction, and Scheduling Information (Added Value)
        • Accruals
        • Related Materials Elements
        • Existence and Location of Originals (Added Value)
        • Existence and Location of Copies (Added Value)
        • Related Archival Materials (Added Value)
        • Publication Note (Added Value)
        • Notes Element
        • Description Control Element
  • List by Source/Reference , Project
  • List by Project
  • Deprecation
  • Language translation
    • Latin transliteration if not provided
    • Greek from Betacode?
    • Other languages?
  • Re-ingestion
    • Partner projects mark entities as deprecated/changed
  • Output
    • Web query – web response
    • RDF dump
    • SPARQL Endpoint
    • Other formats by demand later

SNAP and NER for Latin inscriptions

Prosopographies have in the past often taken decades or even centuries to produce. Even for a period with relatively few sources, such as antiquity, hundreds of thousands of texts had to be collected and read, personal names had to be copied onto index cards, people had to be identified across sources, their relations then had to be examined and their lives had to be reconstructed.

Fortunately, in this digital age that enormous work can be at least partially automated. That is also a process SNAP is experimenting with: we not only aim to bring together prosopographies, just as the prosopographies themselves have brought together individuals in the sources; SNAP also wants to explore how to facilitate the creation of new prosopographies through Named Entity Recognition (NER). And this is where Leuven Ancient History and Trismegistos People come in.

In 2008 Trismegistos started a project to collect all personal names in Trismegistos Texts, at that time basically all published texts from Egypt between 800 BC and AD 800. This new Trismegistos People database could build on the Prosopographia Ptolemaica, a Leuven project which started in the late 1930s and was already transformed into a database in the 1980s. Like its predecessor, Trismegistos People aimed to be multilingual, taking in not only Greek, but also Demotic and other Egyptian evidence.

As we foresaw, however, manual extraction of the personal names in the ‘old style’, now by typing information into database records rather than writing it on index cards, proved very time-consuming. Still, for a language like Demotic, where no Open Access full text was (and is) available, it was the only way forward. As a result, not even half of all 15,000 texts are currently done …

For Greek papyri, however, there was the Duke Database of Documentary Texts [DDbDP], which had just been converted to Unicode in 2008 and made available in the Papyrological Navigator [PN]. This was kindly put at our disposal, and was to be our corpus for NER.

The Wikipedia article about NER states that ‘even state-of-the-art NER systems are brittle, meaning that NER systems developed for one domain do not typically perform well on other domains’. Well, the papyrological texts certainly seemed like a completely new domain, with their diacritic marks, their sometimes fragmentary state, the case system of ancient Greek, and an onomastic system aberrant to us, with fathers’ names instead of first names and family names. So in our innocent rashness we decided to develop something completely new ourselves.

This is not the place to go into details. Those who want to know more can read an article by Bart Van Beek and myself (Journal of Juristic Papyrology 39 (2009), p. 31-47, available here), where we describe the procedure we developed, which allowed us to deal with several hundred thousand attestations of Greek personal names.

What I want to focus on here, is the new challenge in the form of Latin inscriptions. In the Europeana EAGLE project, Trismegistos is disambiguating the datasets of partners such as EDH, EDR, HispEpOl and EDB, and the full text of the inscriptions is going to be made available in Open Access. This again opens up exciting possibilities for SNAP, if through NER we can automate the collection of all attestations of personal names in this large corpus.

This time I could not call upon Jeroen Clarysse, who had cooperated with us to develop a NER tool in PHP. So I decided to go ahead myself in the system I know best: FilemakerPro. This may seem counterintuitive (some will no doubt use a different word), but if you want to express yourself, you just use the language you know best, and for me this is Filemaker. The challenges remain the same: identifying the named entities, in this case personal names, and extracting them in the best way possible with an eye to scholarly reuse.

The first problem for the Latin inscriptions was identifying personal names. No Open Access set of names was available, so we had to create that ourselves. Of course personal names are written with capitals, but so are place names, names of gods, and even the occasional book title. Not to mention Roman numerals, and in some datasets even unclear passages or the beginnings of texts or sentences. Creating a set of Latin personal names on the basis of all capitalized words was thus the first task, and quite a time-consuming one.

For this, paradoxically, we were helped by a second problem, that of the Latin onomastic system. Latin, as all of you know, has an aberrant system in which people standardly have multiple names. Depending on the time period, they often use two or even three of the following: a praenomen such as Marcus, a nomen (gentilicium) such as Tullius, and a cognomen such as Cicero. On top of that, they often add the name (mostly the praenomen) of their father – or former master in the case of freedmen. And citizens can add their tribus, the voting district in which they were registered. This leads to identification clusters such as C(aius) Ofillius A(uli) f(ilius) Cor(nelia) Proculus or C(aio) Sextilio P(ubli) f(ilio) Vot(uria) Pollion[i. Once you get rid of the diacritics for abbreviations and restorations, these patterns actually help to identify capitalized words as personal names.

The existence of long clusters of names to identify a single individual, including also non-capitalized words, implies that we had to focus on extracting these clusters for each text. This I did on the basic principle that each consecutive word which either has a capital or belongs to a set of core ‘identification-cluster’ words (including of course libertus, filius and the names of the voting districts) should be added to the cluster. This implies that you end up with clusters like C(aius) Ofillius A(uli) f(ilius) Cor(nelia) Proculus C(aio) Sextilio P(ubli) f(ilio) Vot(uria) Pollion[i.
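A toy re-implementation of that clustering principle in Python (the real tool is in FilemakerPro, and the word set here is abbreviated and invented for the example) might look like this:

# Core identification-cluster words, lowercased and stripped of brackets.
LINKING = {"f", "filius", "filio", "l", "libertus", "liberto"}  # partial set, for illustration

def clusters(words):
    """Group consecutive capitalized or linking words into name clusters."""
    out, current = [], []
    for w in words:
        if w[0].isupper() or w.lower().strip("().[]") in LINKING:
            current.append(w)
        elif current:
            out.append(current)
            current = []
    if current:
        out.append(current)
    return out

text = "monumentum fecit C Ofillius A f Cor Proculus ex testamento".split()
print(clusters(text))  # [['C', 'Ofillius', 'A', 'f', 'Cor', 'Proculus']]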

In a next step each of the constituent words needs to be analyzed: some are ‘linking words’ such as filius or the tribus-names Cornelia and Voturia, others are declined forms of a personal name, and yet others are ‘noise’ in the form of e.g. numbers. The case of the declined forms is essential for the interpretation of the cluster. In this example the cluster takes the form ‘nom nom gen filius tribus nom dat dat gen filius tribus dat’. This is then interpreted further in a related database as the identification of two individuals: one C(aius) Ofillius A(uli) f(ilius) Cor(nelia) Proculus (nom nom gen filius tribus nom) and one C(aio) Sextilio P(ubli) f(ilio) Vot(uria) Pollion[i (dat dat gen filius tribus dat).
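In the same toy spirit, the splitting rule can be sketched as follows: genitives (fathers’ names), filius-words and tribus names attach to the current individual, and a name in a new case starts a new one. This is a simplification of the actual analysis, written for illustration only.

def split_individuals(tags):
    """Split a sequence of grammatical tags into one tag-list per individual."""
    people, current, case = [], [], None
    for tag in tags:
        if tag in ("gen", "filius", "tribus"):
            current.append(tag)        # belongs to the current person
        elif case is None or tag == case:
            current.append(tag)
            case = tag                 # main case of this person
        else:                          # case change: a new individual begins
            people.append(current)
            current, case = [tag], tag
    if current:
        people.append(current)
    return people

tags = ["nom", "nom", "gen", "filius", "tribus", "nom",
        "dat", "dat", "gen", "filius", "tribus", "dat"]
print(split_individuals(tags))
# [['nom', 'nom', 'gen', 'filius', 'tribus', 'nom'],
#  ['dat', 'dat', 'gen', 'filius', 'tribus', 'dat']]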

In yet another database each individual identification is then split up and further standardized (e.g. by converting it to the nominative). The two identifications above thus yield a person Caius Ofillius Proculus, with a father Aulus (Ofillius) and belonging to the tribus Cornelia, and a person Caius Sextilius Pollio, with a father Publius (Sextilius) and registered in the tribus Voturia.

At that stage, the information is ready to go into our database system, with the core database REF for all attestations of personal names, and separate databases for individuals (PER), and names, their variants, and their declined forms (NAM, NAMVAR, and NAMVARCASE).

(Diagram: the Trismegistos REF database of attestations and its related databases for individuals (PER) and for names and their variants (NAM, NAMVAR, NAMVARCASE).)
The NER system has currently progressed far enough that I think we could get a complete database of all attestations of personal names in Latin inscriptions pretty soon, given the funding needed to fine-tune the system and check its results. Even the onomastic analysis is a relatively easy task, with the cooperation of a few specialists. This way we could focus on the prosopographical aspects through SNAP, and perhaps develop digitally assisted identification of the people referred to in this enlarged Trismegistos People dataset.