State of the Snap-Nation

With the end of the pilot project scarily in sight it is time to review where we are and where we hope to be by the end of December.

The big news is that (hopefully) the first set of SNAP identifiers are now frozen!

What this means is that the first 5 datasets have now been ingested, SNAP identifiers have been linked to each of the persons, and those identifiers are fixed. There may still be a few tweaks to the RDF descriptive data coming in from the projects but the identifiers will remain the same.

We had been experimenting with starting the identifiers at 100001 to add a little bit of consistency with number length but after doing that for a while during the testing stage we decided that it wasn’t worth doing and we would just keep things simple and start at 1.

Currently the following datasets have been ingested:

Project                       SNAP Identifier Range
PIR                           1 – 10924
TM                            10925 – 367917
LGPN                          367918 – 671019
British Museum (Selection)    671020 – 671972
VIAF (Selection)              671973 – 673753
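
Because the identifiers are frozen, the ranges listed above are enough to tell which dataset a given SNAP number came from. The following is only a minimal sketch of such a lookup; the function name dataset_for is made up for illustration and is not part of any SNAP tooling.

```python
from bisect import bisect_right

# (first identifier, last identifier, dataset), copied from the ranges above.
SNAP_RANGES = [
    (1, 10924, "PIR"),
    (10925, 367917, "TM"),
    (367918, 671019, "LGPN"),
    (671020, 671972, "British Museum (Selection)"),
    (671973, 673753, "VIAF (Selection)"),
]
_STARTS = [first for first, _last, _name in SNAP_RANGES]


def dataset_for(snap_id):
    """Return the source dataset for a SNAP identifier, or None if unknown.

    dataset_for is a hypothetical helper name used only in this sketch.
    """
    # Find the last range whose first identifier is <= snap_id.
    index = bisect_right(_STARTS, snap_id) - 1
    if index < 0:
        return None
    first, last, name = SNAP_RANGES[index]
    return name if first <= snap_id <= last else None
```

For example, dataset_for(10925) falls in the TM range, while any number above 673753 returns None until further datasets are ingested.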

 

One big change over the summer was the move from Sesame to 4Store. We started off using Sesame because that was the standard triplestore for projects at DDH in King’s and therefore the easiest to set up and support on our servers. Whether we did something wrong in setup or it was simply the tens of millions of triples more than we usually deal with, Sesame didn’t prove to be robust enough to deal with even the founding datasets in a timely enough fashion. After some wrangling (next time we get the VM upgraded to the version that has a package ready built for it) we installed 4Store and so far it has stood up to everything that we have thrown at it without any noticeable loss in performance. While not wanting to go as far as saying it bodes well, it definitely doesn’t fill me with a nameless dread at the thought of more people than me accessing the data.

Moving on from the data itself to displaying it in an easy-to-read form: the person website itself is solidifying slowly (even if it isn’t the fanciest cupcake on the shelf). In this we must thank Davide Bellini who is interning with us. Although he was supposed to be working on another project, we lured him away with promises (or possibly threats) of learning Python, Django and the opportunity of looking into the abyss which is SPARQL. Having successfully made his first mark on the Person Profile pages he is now working on the script to somewhat automate the record merging procedure and continuing to upgrade the profile page displays. Between the two of us we hope to have the Person profile page filled out and the first merged records ingested by Christmas (I will leave the question of which Christmas to the reader’s imagination).

Minutes of second advisory board meeting

SNAP:DRGN Advisory Board (AB)

2nd meeting Skype (voice only) 2014-08-27

Present: Øyvind Eide (ØE, chair), Fabian Koerner (FK), Laurie Pearce (LP), Charlotte Roueché (CR), Rainer Simon (RS), Gabriel Bodard (GB, principal investigator)

Apologies: Sonia Ranade, Robert Parker.

The meeting lasted one hour.

Minutes written by Øyvind Eide based on notes from Laurie Pearce.

1. Call to order (15.00)

ØE welcomes. Call for other business: none.

2. Updates from PI (GB) (15.05)

Main development is the release of Cookbook 1.0. It required testing, getting base data sets into RDF. Most of the work was directed at showing/recommending to others how to put data into SNAP format. There was a major worksprint in Edinburgh where priorities for future funding of the project were also discussed, as the project is currently only funded for this calendar year.

Based on the important distinction of two kinds of data sets, the project has to make decisions about the next stages and priorities in the development. Two types of sets of data are distinguished:

  1. Prosopography: this is information about persons, intended to disambiguate them (even if disambiguation is not always successful). This is the kind of data SNAP would import.
  2. List of attestations: SNAP will not import and assign URIs to such data. The data owners are invited to annotate such datasets with SNAP ids. However, might test second integration to incorporate the data and annotate at a later stage.

SNAP is in discussions with VIAF about useful association between the two. There is a subset of about 2000 person references from VIAF, with dates before 1000 AND wrote in either Latin or Greek. Those w/o dates or languages are omitted from this set. This small subset has been imported to SNAP.

How SNAP can help VIAF: VIAF is not interested in all references, just those who are authors according to the library catalogues. If SNAP had a field for role/occupation, contributors who have data about persons being creators/actors/painters/poets/theologians etc. can be asked to provide it in order to flag relevant persons for VIAF. VIAF would then assign identifiers to these persons, even if no real information beyond the fact that the tombstone says “painter” is available.

Additional datasets: SNAP has received data from the British Museum and VIAF, and is in advanced conversation with others, including the Hellenistic Babylonia, PBW, Smith Dictionary, RIB and the Zenon catalogue.

Working on the triples to show functionality: this is the slowest part to get ready. RDF requires much work. Triples store that had been recommended was not capable of handling the data, and had to start with a more robust triple store. As a result, many elements of the API that were specified haven’t been built yet. SNAP has been in touch with contributors, have made mock up RDFs which are being tested, but no further production imports yet.

3. Discussion of update (15.15)

CR asked about the VIAF relationship. For example, for Julius Caesar, the VIAF record might put in one role only, and that role might not be author(ship). Have to consider the specific relationship as creator of work. There is RDF relationship between individuals and things they create.

GB replied that there is nothing to preclude assigning more roles, but are building subset that is minimalistic to work with other projects; not building a prosopography.

ØE: On a more general level, more databases will have specific things each is interested in. Based on the simplicity of the SNAP project, these things will not apply to top layer. So callbacks to local databases may be necessary, but this is not simple. In order to get to a situation where one can access more detailed information from the local databases, one would have to map into something more complex than SNAP. One needs a more advanced ontology to be able to connect into more complex prosopographies.

GB replied that there are not so many fields left that are not accounted for in SNAP that are reducible, and none of the providers does that level of reduction anyway. So this is currently not a relevant problem.

ØE asked about the discussion in Edinburgh that might have focused on future funding. He asked GB to share his notes/impressions with SNAP, even if the notes are brief. GB agreed to do so.

  • person-search as a research tool
  • graph-search as a crosswalk channel
  • speccing full annotation, certainty & disagreement
    • *and getting uptake*
  • Pelagios-style harvester for oac annotations pointing in to snap
  • infrastructure and optimization

GB: Items that remain for future processing: to integrate into a SNAP graph: a new scholarly statement identifying a name instance as a specific person, and to indicate the authority of who is making that identification and who disagrees. Will not yet have many references pointing in to SNAP by the end of the year. However, it would be useful to have the Pelagios harvester with “here are all the persons” and “these are the datasets that have been annotated with references to these persons.”

Getting infrastructure working and triples store working: Sesame is not powerful, but still needs much more memory, say the equivalent of 10 usual projects in a campus institution. (Migrating to FourStore led to much improved performance, but this is still a relatively small dataset.) Have to consider how to optimize and get more computing power.

ØE: Might look to supercomputing as a possible source of funding – most work in the humanities does not need this, but there is funding available if one can document a need.

CRMinf is an extension to CRM covering argumentation and inference making. Link to documentation that is currently under development: CRMinf: the Argumentation Model. An Extension of CIDOC-CRM to support argumentation. http://www.cidoc-crm.org/technical_papers.html

4. Discussion of Cookbook (15.30)

The PDF version does not have much supporting prose. GB asks for pointers that could be added to the paragraphs to clarify what is intended and/or necessary to help users who are not able to read and understand RDF easily. He notes that the cookbook does not include a soup-to-nuts example, one that takes a user from start to finish. Should the cookbook include full markup of persons as examples? The meeting agreed this was a good idea, either in the cookbook or linked to from it.

LP asked about the preferred means for providing comments/feedback.

GB: Email is good for straightforward corrections, but please use the ancient-people email list for discussion. ØE noted that the raising of specific points on the ancient-people list is a good way to flag issues.

CR found the cookbook to be very clear and helpful. However, as she went through the list of items she lost the overall picture. It would be good to have examples of minimum structure needed. Some potential contributors may not be certain of whether they have prosopographical data in the SNAP sense.

GB: Can show minimal sets with only date and name. He wants to include a description of what constitutes a prosopography in the SNAP sense.

ØE suggested that illustrations might be useful in order to understand the contents better.

RS found the cookbook clear. He brought up the topic of name properties. What makes a name important enough to get richer encoding? GB would like minimal encoding, such as birth name. Whether one would use additional properties depends on whether the contributing database has controlled vocabularies of names, as Trismegistos and LGPN have. One could and should contribute variant names to SNAP, but SNAP prefers the primary name.

RS brought up the annotating of documents with SNAP URIs as in the Pelagios use case. What is the boundary between a name and attestations to it? When does it become attestation and use RDF in cookbook, contrary to annotating images on inscriptions?

GB: This comes down to whether data is truly prosopographical in the SNAP sense, as discussed above. Only some contributors have URNs for names. If you have, contribute them; otherwise, SNAP still wants the names. Attestations are links from SNAP to other data sets. Annotations are links from other data sets to SNAP.

ØE: Two-way links could allow for ingesting lists also from non-prosopographies.

GB: It is a question if your data is prosopographical or not, but this is not thought fully through. The intent of your annotations/attestation is central.

ØE: The issue of date (discussed in the first meeting) is now well-defined, and is not complex. Date is understood as a time period/point that overlaps with the life of person. It would be good to have an equally simple and clear definition for place.

GB: This should be linked to importance. He will ponder on a formulation.

FK: This should be left to the provider. We must keep in mind the choice of place will be difficult.

GB: One can include more than one place, if that’s the case.

FK: Would it be good?

GB: It will probably not hurt. More than one place means that all of them are significant.

ØE: We should stop now and continue the discussion at the ancient-people list.

5. Any other business (15.50)

None.

6. Summing up (15.55)

ØE asked about the SNAP ontology: should this be discussed at Skype or another format?

GB: It would be good to discuss it in more detail, but it is dependent on the participation of Faith Lawrence and Hugh Cayless.

ØE: It could either be the topic of the third AB meeting or an additional ad hoc meeting on the topic. The AB will agree on how to proceed via email.

ØE thanked the participants and closed the meeting.

SNAP Persons

SNAP:DRGN uses the LAWD ontology to define persons and other person-like entities in our contributing datasets. A LAWD Person is a CIDOC-CRM E21_Person.

LAWD defines a top-level class “Agent” (lawd:Agent), and four sub-classes:

  • lawd:Person
  • lawd:Deity
  • lawd:MythologicalCreature
  • lawd:Group

When contributing your prosopographical dataset in the SNAP RDF format, you should use whichever of these classes best defines the people in your database. Most of these will presumably be people; you may also have groups, families and corporate bodies distinguished, or you may have deities so defined. If you are unable reliably to distinguish between these types in your data, or if you have types that do not fit under any of these four headings, you should use the super-class lawd:Agent to encode such persons.

(Or get in touch with me and suggest new terms for the LAWD ontology, if you think they are universal enough to be of general use to the community.)
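
As a rough illustration of the rule above (use the most specific class you can, and fall back to lawd:Agent otherwise), here is a minimal sketch. The local type labels on the left are invented for the example; your own database will have its own categories, and the mapping is yours to decide.

```python
# Hypothetical local type labels mapped to the four LAWD sub-classes.
LAWD_CLASSES = {
    "person": "lawd:Person",
    "deity": "lawd:Deity",
    "mythological creature": "lawd:MythologicalCreature",
    "group": "lawd:Group",
}


def lawd_class(local_type):
    """Pick a LAWD class for a record, defaulting to the super-class.

    Unknown, missing, or unreliable types are encoded as lawd:Agent.
    """
    if local_type is None:
        return "lawd:Agent"
    return LAWD_CLASSES.get(local_type.strip().lower(), "lawd:Agent")
```

So a record typed "Deity" in the source becomes lawd:Deity, while a record with no usable type (or one such as "animal") is encoded as lawd:Agent.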

(SNA)P

Being a conversation between Gabriel Bodard, Yanne Broux and Silke Vanbeselaere about the SNAP:DRGN project and Social Network Analysis

Cross-posted to Data Ninjas: http://spaghetti-os.blogspot.be/

Gabriel Bodard: So, tell me what is Social Network Analysis, and how is it useful for prosopography projects?

Silke Vanbeselaere: Social Network Analysis (SNA) is basically the study of relationships between people through network theory. First used in sociology, it’s now become popular in many other disciplines, with a budding group of enthusiasts in (ancient) history.
What it does, is focus on relations (of whatever kind) instead of on the actors individually. Through visualisation of the network graph and the network statistics, information can be obtained about the structure of the network and the roles of the individuals in it.

The visualization of these network graphs can be especially interesting for prosopography projects as it can help disambiguate people. Individuals are represented by nodes and their relationships are represented by ties or links between those nodes. Instead of dealing with one source at a time, the network allows you to see the whole of the relationships.

GB: Can you perhaps illustrate that with an example and how it could help us?

Yanne Broux: Off the top of my head: one of the things the extremely nifty disambiguation methods we developed could help you out with is the identification of high-ranked Roman officials across the different datasets. Consuls were often mentioned in dating formulas, and procurators, proconsuls, legati and the like were pretty mobile, so chances are they appear in texts across the empire. They’ll light up like Christmas trees once we shape them into a network.

What is the SNAP:DRGN project about, Gabby, and what sort of prosopographical data (especially relationships) does it contain?

GB: I’m glad you asked me that. SNAP:DRGN stands for “Standards for Networking Ancient Prosopographies: Data and Relations in Greco-Roman Names” (not an artificial backronym at all!). Very briefly, the aim of the project is to bring together person-records from as many online prosopographies of the ancient world as possible, using linked data to record only the most basic information (person identifiers, names, citations, date, place and hopefully relationships with other persons). We only plan to store this very summary data, along with links back to the richer records in the contributing data source, and enable annotation on top of that.

SV: What will people be able to do with this limited data, then?

GB: In particular, scholars will be able to (1) join together records originally from different databases that clearly refer to the same person; (2) point out relationships between persons, e.g. person XYZ from this database is the daughter of person ABC from that one; and (3) annotate their own texts (archaeological or library records, etc.) to disambiguate a personal name using SNAP as an authority list.

At the moment there are relatively few co-references between the prosopographical datasets, i.e. people who appear in more than one database, in SNAP (although there will be plenty between the library catalogues), and the only explicitly encoded personal relationships are the ones imported from the Trismegistos database, but we’re working to improve both of these things. How does that sound to you?

SV: Basically, what we need for SNA is a link between the people and the texts in which they appear. Now, I have no idea how sophisticated these other datasets are, but to avoid confusion/ mistakes/ whatever kind of apocalyptic disaster, you need unique numerical identifiers, both for your individuals, and for your texts. Trismegistos Texts is now slowly expanding beyond the Egyptian borders, so perhaps we already have some of the texts incorporated in the other datasets, and then it should be pretty easy to link them. But I suspect that for most of the data, new identifiers will have to be created.

GB: Unique identifiers for all persons we have, of course. SNAP mints URIs for all persons we have data for, whether they had dereferenceable URIs in the source datasets or not. In some cases we have identifiers for texts too (TM uses Papyri.info URIs, as you know); in other cases, we’ve had to hope that parsing text strings will be sufficiently unambiguous to be useful. (We’ve identified a few hundred co-references between LGPN and TM using text strings.) We also have a lot of persons from library catalogues (VIAF, the British Museum, Zenon and Trismegistos Authors) among whom co-references ought to be plentiful.

So this seems to be a little circular at the moment, doesn’t it? One of the things SNA might help with is identifying co-references, which in turn will help us build a graph of relationships. But you’re telling me that SNA isn’t really feasible on our data until we have a much better graph of co-references, relationships, and text co-occurrences. Is there anything useful we could do together in the meantime?

YB: Since we are enriching Trismegistos, by adding new texts from around the Mediterranean, by identifying individuals in the Egyptian texts, and by adding extra information such as titles, ethnics and status designations, and at the same time you are enriching SNAP, we are actually feeding into each other symbiotically, like meat ants and leafhoppers that find each other over sugary sap in the Australian outback.

And hey, Silke, what about that “Structural Equivalence” hoodoo you’ve been learning in London, could that be of any help?

SV: Well, it is a very interesting concept that explores the social environment of a person, but that implies that all your data need to be extremely accurate. That means that you first need to identify all the people mentioned in your network. Because without that information, there’s no way you could rightly use your data to explore the structural equivalence of two or more people. As such, I don’t see how it would yet be suitable to use on the data that we would be presented with in SNAP. In the future however…

GB: Are there any improvements you can suggest for the Trismegistos database?

YB: We’re kind of stuck when it comes to titles. You see, we hardly have any. Asking the computer to retrieve them, like we did for the names, proved to be next to impossible, and it’s a hell of a lot of work if you have to go through some 500,000 attestations manually. I’ve already gone through more than 10,000 of them while working on my double names and municipal officials, so I’ve done my share, methinks. Also, it’s not exactly easy to standardize titles, what with all the different languages in Egypt and all. But I guess that if one of the other datasets has a list or something we could look into, that might help us out a bit…

GB: So, in a hopefully not too distant future, when all these relationships are implemented through SNAP:DRGN, how can the participating projects in turn be of service to you and other researchers who would like to use SNA? When SNAP is ready for SNA to be performed on it, what questions will you ask of it?

SV: Well, Gabby, I’m glad you ask. Prosopographies are the ideal datasets for SNA research as those datasets of people have been formed or selected because of some common features (mentioned in the same source, part of an ethnical/social group, time fellows…). Once the technical infrastructure is in place, it will be relatively straightforward to convert the virtual two-mode networks linking texts and the people appearing in them into the one-mode networks (person – person) needed for actual SNA.

GB: This all sounds very promising! Thank you so much for sharing these ideas. I look forward to being in a position to do a bit more with all this some day.
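
The conversion SV mentions, from a two-mode network (texts linked to the people appearing in them) to a one-mode person – person network, can be sketched in a few lines. This is only an illustration under simple assumptions: the input is a plain mapping from text identifiers to person identifiers, the function name is made up, and the weight of a tie is simply the number of texts two people share.

```python
from collections import defaultdict
from itertools import combinations


def one_mode_projection(text_to_persons):
    """Project a text -> persons mapping onto weighted person-person ties.

    Returns a dict keyed by (person_a, person_b), with person_a < person_b,
    whose value is the number of texts in which both persons appear.
    """
    weights = defaultdict(int)
    for persons in text_to_persons.values():
        # Each unordered pair of distinct persons in a text shares one tie.
        for a, b in combinations(sorted(set(persons)), 2):
            weights[(a, b)] += 1
    return dict(weights)
```

Note that this only works once both texts and persons carry unique identifiers, which is exactly the prerequisite discussed in the conversation.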

Are you a prosopography?

At the SNAP:DRGN project meeting in Edinburgh a few weeks ago, we decided on a couple of definitions that will impact on the ways in which partner datasets interact with the project. Our current thinking is that we need to distinguish between two kinds of data:

(1) The first kind, which we’ll loosely call a “prosopography”, is a curated database of person records, with some ambition to be able to be used as an authority list. Prosopographies such as PIR, Broughton, PBW, etc. would be obvious examples of this category, as would the controlled vocabulary of persons in a library catalog like VIAF, Zenon, British Museum persons, Trismegistos Authors, the Perseus Catalog, etc. Even if the task of co-referencing persons is incomplete (as with Trismegistos, say), the intention to disambiguate qualifies the dataset as a “prosopography”.

(2) The second, which we call a “list of attestations” is not comprehensively curated or disambiguated in this way, and has no ambition of being an authority list. Examples of this kind of dataset (as I understand them) would include: the EDH person table; the raw list of name references Mark has extracted from Latin inscriptions; the tagged and indexed “names and titles” in the texts of the Inscriptions of Aphrodisias or Inscriptions of Roman Tripolitania.

In the SNAP:DRGN workflow, we hope that all “prosopographies” of type 1 will be contributed into the SNAP graph. We shall assign SNAP URIs to all persons in the datasets, and in time work to co-reference and merge with persons sourced from other projects as well as possible. These will form the authority file to which other datasets will refer, and we would recommend that lists of “attestations” of type 2 use Pelagios-style OAC annotations (*) to point to the SNAP identifiers as a way of disambiguating their person-references. The process of disambiguating and/or co-referencing persons in this way might eventually lead some lists of annotations to become disambiguated prosopographies in our schema, at which point we would potentially want to include them in the SNAP graph as first class entities.

(*) We hope to have the SNAP:DRGN guidelines for these Pelagios-like annotations (“Scenario 5” in our Cookbook) available very shortly.

Some example RDF fragments

In the process of working with a few of our partner projects, we have produced some sample RDF fragments, which we thought might be useful as an illustration of SNAP RDF format for other projects currently planning to expose a version of their data via our graph. We hope to include at least some examples of this kind in a later version of the SNAP:DRGN Cookbook.

First off, the most simple and minimalist example possible (even more sparse than the PIR data which contains little more than headwords). The Zenon database is the library catalog of the German Archaeological Institute (DAI), which has an authority list of some 360 ancient authors. The RDF of this authority list (encoded in SKOS natively), will contain very little information except for URI and preferred name (sample in GDoc):

<http://zenon.dainst.org/000003901_1333e04fe2d7b09b43b088eb2ff1413f#this>
	rdf:type lawd:Person ;
	dc:publisher <http://www.dainst.org/> ;
	foaf:name "Platon"@de .

(Translated to English: “<this URI> is a person, according to the DAI, called Platon in German.”)
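
If your data is as minimal as this, producing the fragment can be little more than string formatting. The sketch below reproduces the shape of the Zenon example above; the function names are invented for illustration, and the output assumes that the rdf:, lawd:, dc: and foaf: prefixes are declared elsewhere in the file as the Cookbook specifies.

```python
def escape_literal(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def minimal_person_turtle(uri, publisher, name, lang):
    """Return a Turtle fragment with a type, a publisher and one name.

    Prefix declarations are not emitted here; they are assumed to be
    written once at the top of the output file.
    """
    return (
        f"<{uri}>\n"
        f"\trdf:type lawd:Person ;\n"
        f"\tdc:publisher <{publisher}> ;\n"
        f'\tfoaf:name "{escape_literal(name)}"@{lang} .\n'
    )
```

Calling minimal_person_turtle with the Zenon URI, the DAI URL, "Platon" and "de" yields the four lines shown in the example.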

The next example is from the Prosopography of the Byzantine World, a project published by King’s College London. This is a full prosopography, on the “factoid model,” that contains much more richness of information and biographical data than SNAP:DRGN has any aspiration to include. We took one example of a fairly complex person (Leon 103) to show just what a SNAP version of his data might look like. In this case, SNAP will capture the URI; names (both English and Greek); associated date; associated place(s), inasmuch as these can be extracted from the database; attestation (in PBW) and citation(s); and relationships with other persons (Leon’s cousin Kale is also in PBW). See the full RDF of Leon 103 in a GDoc here.

Finally, we mocked up the example of a person from the Smith Dictionary of Greco-Roman Biography and Mythology, which is being encoded and NER’ed by Stella Dee in Leipzig. As an example, we took Brutus 18 (the less famous D. Junius Brutus). From this entry, we hope to be able to include in SNAP his name (English only); associated date and associated place (both depending on NER); attestation and citations; relationships (4 relationships are recognised in the text of Smith’s entry, one of which is to another person in the Dictionary). See full RDF of Brutus 18 in a GDoc here.

We’ll try to add more examples of this kind as we come up with them. Let us know if you find this sort of thing useful.

SNAP and VIAF

We’ve had a couple of meetings with Karen Smith-Yoshimura and Thomas Hickey, of the Scholars Contributions to VIAF group, to discuss possible collaborations, exchange of information, and mutual benefits of sharing standards between the SNAP:DRGN project and VIAF (the Virtual International Authority File, a federated authority list of persons from library catalogs, mostly from author or subject fields).

We considered two main questions:

  1. What can SNAP:DRGN gain from VIAF data or formats? Most concretely, what subset of VIAF person-records, and what fields in them, should we consider ingesting into the SNAP graph?
  2. How can VIAF benefit from SNAP:DRGN’s work in this area? In what ways can SNAP provide data and information that might be passed back to VIAF for inclusion in the authority file?

Preliminary answers and thoughts below.

1. What can SNAP get from VIAF?

Looking at the VIAF data model, we decided that in most cases the only categories of information we would get from them would be (a) a URI and (b) a name. There was some discussion as to whether we could sometimes extract from the data (c) some alternative name forms (e.g. in Greek, by searching for Unicode codepoint range); (d) a date, which is present in some name strings; (e) an associated place, which is present in some name strings. VIAF records that come in via Wikipedia or Wikidata would also give us (f) an alternative ID, in the form of a Wikipedia/DBpedia URI. We didn’t think that the LAWD “attestation/citation” categories were appropriate for modelling the information about books with these persons as authors or subjects, although that is the most useful information that one would get by going back into the VIAF data from the SNAP graph.

We discussed what subset of the VIAF dataset would be of interest to model in SNAP, and after a few experiments with filtering by date (which is not always given), language (not always given), and contributing collection, we settled on a preliminary export of persons who matched:

  • birth OR death date present, and before 1000 C.E.
  • AND any one or more of
    • language: Latin
    • OR language: Ancient Greek
    • OR collection: Perseus Catalog.

Which gives us a small corpus of 1,781 ancient authors to experiment with.
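
The filter above is easy to state as a predicate. The sketch below is not the export script we actually ran; it assumes a simplified record format invented for the example, in which each record is a dict with optional "birth_year" and "death_year" (signed integers, negative for B.C.E.), plus lists of "languages" and "collections".

```python
def matches_viaf_subset(record):
    """Return True if a simplified VIAF record meets the export criteria.

    The field names are hypothetical; the logic follows the list above:
    (birth OR death date before 1000 C.E.) AND
    (language Latin OR language Ancient Greek OR collection Perseus Catalog).
    """
    dates = [record.get("birth_year"), record.get("death_year")]
    # A missing date is None and can never satisfy the date condition.
    has_early_date = any(year is not None and year < 1000 for year in dates)

    languages = set(record.get("languages", []))
    collections = set(record.get("collections", []))
    in_scope = (
        bool(languages & {"Latin", "Ancient Greek"})
        or "Perseus Catalog" in collections
    )
    return has_early_date and in_scope
```

Records with no date at all are excluded even when the language matches, which is why the resulting corpus is so much smaller than the full set of ancient authors in VIAF.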

2. What can SNAP give to VIAF?

VIAF is an authority list of authors, artists, other creators, and people important enough to have a book (or at least a chapter) written about them, so they won’t be interested in hundreds of thousands of names of Greeks and Romans about whom all we know is their gravestones, contracts they signed, or graffiti they left on a theater wall. In order to flag a subset of persons whom VIAF might be interested in importing from the SNAP graph, we are proposing to add a new property to the SNAP ontology: associatedRole. This would allow us to flag poets, historians, authors, potters, sculptors, actors, performers etc., whom VIAF would include in their authority file, even if no works by these people survive. We’ll consider doing so in a later revision of the Cookbook, since version 1.0 is now locked down.

Another way in which the SNAP dataset may be of value to VIAF is through the connections that we make between databases by coreferencing and disambiguating unique individuals. If we have a VIAF record for a person, but that person is also in the British Museum person thesaurus, the Trismegistos author table, LGPN and/or PBW, then variant names, dates, citations, alternate identifiers and other information from these databases might enrich the VIAF data on these records, and could be automatically ingested via the linked data we produce.

Many of the issues discussed above will also come up when we speak to other potential data partners about linking up SNAP records with their data, so it was great to have this preliminary conversation.

Minutes of first Advisory Board meeting

SNAP:DRGN Advisory Board

1st meeting Skype (voice only) 2014-05-09

Present: Øyvind Eide (chair), Fabian Koerner, Robert Parker, Laurie Pearce, Charlotte Roueché, Rainer Simon, Gabriel Bodard (principal investigator)

Not present: Sonia Ranade.

The meeting lasted around one hour.

Minutes written by Øyvind Eide based on notes from Laurie Pearce and Rainer Simon.

Entering the SNAPDRGN garden

Now that the SNAP project has started to ingest finalized data from the initial core datasets, it is time to think about how to bring in material from the other partners. For some, this will be easy, as they already know how to make their data available in RDF form on the open web and simply need to follow the guidelines in the Cookbook. For others quite a lot of work will be involved in getting SNAP ready. This post describes some of the stages you may go through, and some of the problems that you may meet.

I have divided the work into six steps:

  1. Decide whether you have a set of names, a set of attestations, or a prosopography
  2. Data wrangling
  3. Identify your records
  4. Establish the identifiers online
  5. Transform the data
  6. Make the RDF available

The first step is the most critical – what does your data actually represent? If your starting point is a text, and you have extracted the personal names, you do not have prosopographical records yet, but rather a set of attestations of names in a text. This is in itself a good thing to do, but to get to the prosopography you have to decide how many actual individuals are referred to in the text – 10 occurrences of the name Marcus Aurelius may refer to between 1 and 10 persons (or one cat!).

If what you have is a set of attestations, what you probably want to do is contribute a slightly different form of data to the SNAP network, namely links from people to sources. This will mean deciding which person in SNAP your name (Aelius Florus natione Pann., for example) refers to, and then generating RDF which links the URL of the source text to the SNAP person.
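
To give an idea of what such a link might look like, here is a sketch that writes a Pelagios-style annotation using the Open Annotation terms oa:Annotation, oa:hasBody and oa:hasTarget. Treat it as an assumption rather than the SNAP format: the project's own guidelines for these annotations may differ, the function name is invented, and all three URIs in the usage note are placeholders. The oa: prefix is assumed to be declared elsewhere in the file.

```python
def attestation_annotation(annotation_uri, source_url, snap_person_uri):
    """Return a Turtle fragment linking a source text to a SNAP person.

    Following the Pelagios pattern, the body is the authority URI (the
    person) and the target is the document in which the name occurs.
    """
    return (
        f"<{annotation_uri}>\n"
        f"\ta oa:Annotation ;\n"
        f"\toa:hasBody <{snap_person_uri}> ;\n"
        f"\toa:hasTarget <{source_url}> .\n"
    )
```

For instance, attestation_annotation("http://example.org/annotation/1", "http://example.org/text/42", "http://example.org/snap/person/12345") states that the text at the second URL mentions the person identified by the third.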

The second task, data wrangling, can be quite laborious. It involves turning the list of names into a set of person records, and creating a single record for each person which (for example) lists all the different names they are attested under, and what the source of those attestations is. At this point you will also be assembling other information from your sources (place of birth, sex, profession etc). You will under some circumstances be comparing what you have with established authority lists of names or persons.
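
The mechanical part of this wrangling (not the scholarly part, which is deciding who is who) can be sketched as follows. The example assumes you have already recorded, for each attestation, a "person_key" of your own choosing that says which individual you believe is meant; that field name and the others are invented for the illustration.

```python
def build_person_records(attestations):
    """Group attestations into one record per person.

    Each attestation is a dict with hypothetical keys "person_key",
    "name" and "source". The result maps each person_key to the distinct
    names the person is attested under and the sources attesting them,
    both in order of first appearance.
    """
    records = {}
    for attestation in attestations:
        record = records.setdefault(
            attestation["person_key"], {"names": [], "sources": []}
        )
        if attestation["name"] not in record["names"]:
            record["names"].append(attestation["name"])
        if attestation["source"] not in record["sources"]:
            record["sources"].append(attestation["source"])
    return records
```

Other information gathered from the sources (place of birth, sex, profession and so on) could be accumulated on the same record in the same way.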

Now that we have a set of people, the third task is identifying your records, i.e. giving each person a unique identifier. It seems obvious, but if all you are doing is analyzing the information in a spreadsheet, you may never have needed to.

The fourth minimum piece of work is to establish your identifiers online, i.e. to make them available as URIs. If you have a person called Soranus, you have allocated number 1001 to him, and you are able to use the domain http://my.people.net/, you might decide that his public identifier is http://my.people.net/person/1001. At the least, make it so that this displays an HTML page about the person for anyone visiting that URL. The very simplest way to do this is to make an HTML file for each person, and arrange the web server so that the request for /person/1001 returns that web page (look at facilities for URL rewriting). This assumes, of course, that you have a web server you can use to put up files, with some reasonable prospect of that remaining in place for the coming decades or longer. Larger institutions may have a repository which can do this job for you, and even assign Digital Object Identifiers (DOIs) to your data.
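
In practice the rewriting is usually done in the web server's configuration, but the logic is simple enough to show in a few lines. The sketch below uses the my.people.net example from above; the directory name "people" and the function names are assumptions made for the illustration.

```python
import re
from pathlib import PurePosixPath

BASE = "http://my.people.net"
# Only purely numeric identifiers are accepted, so a request can never
# be rewritten to a file outside the HTML directory.
PERSON_PATH = re.compile(r"/person/([0-9]+)")


def person_uri(number):
    """Return the public identifier for a locally numbered person."""
    return f"{BASE}/person/{number}"


def file_for_request(path, html_root="people"):
    """Map a request path such as /person/1001 to a static HTML file.

    Returns None when the path is not a person identifier.
    """
    match = PERSON_PATH.fullmatch(path)
    if match is None:
        return None
    return str(PurePosixPath(html_root) / f"{match.group(1)}.html")
```

Here person_uri(1001) gives Soranus's identifier, and a request for /person/1001 is answered with the file people/1001.html.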

If you can go further, and map your permanent URLs to queries against a database (where, for example, http://my.people.net/person/1001 is turned behind the scenes into http://my.people.net/query.php?id=1001, and retrieves the data from a relational database), you will have a more powerful (but harder to sustain) resource. At this point you can consider having your web server do content negotiation, and return different formats of data in response to Accept headers (technical details at http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html) so that humans read HTML and computers read RDF.
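
Most web servers and frameworks can do content negotiation for you, but the idea can be sketched by hand. The following is a deliberately simplified reading of the Accept header: it honours q values and the */* wildcard, but ignores partial wildcards such as text/* and the finer precedence rules in the specification. The list of available formats is an assumption for the example.

```python
# Formats this hypothetical server can return, in order of preference.
AVAILABLE = ["text/html", "text/turtle", "application/rdf+xml"]


def parse_accept(header):
    """Return (media_type, q) pairs sorted by descending q value."""
    choices = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            if param.lower().startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        choices.append((media_type, q))
    # The sort is stable, so equal q values keep their header order.
    return sorted(choices, key=lambda choice: choice[1], reverse=True)


def choose_format(accept_header, available=AVAILABLE):
    """Pick the media type to return, or None if nothing is acceptable."""
    for media_type, q in parse_accept(accept_header or "*/*"):
        if q <= 0:
            continue
        if media_type in available:
            return media_type
        if media_type == "*/*":
            return available[0]
    return None
```

A browser sending "text/html,application/xhtml+xml,*/*;q=0.8" gets HTML, while a harvester sending "text/turtle" gets RDF from the same URL.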

The fifth stage is transforming your data to the RDF format which SNAP wants, as explained in the cookbook. This may involve an XSLT transform of XML data (RDF can be represented in XML), or a Python script reading a database, or an output template in your publication system. Incidentally, if you’re wondering how to get data out of an Excel spreadsheet, the OxGarage can turn the sheet into XML for you.
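
As an illustration of the "Python script reading a database" option, here is a minimal sketch using SQLite. The table layout (a person table with id, name and lang columns) is invented for the example, and the properties written are only the minimal ones from the Zenon fragment shown in an earlier post; a real export would follow the Cookbook and include its prefix declarations.

```python
import sqlite3


def person_rows_to_turtle(connection, base_uri, publisher):
    """Read a hypothetical person table and return Turtle fragments.

    One fragment is produced per row, with a type, a publisher and a
    single language-tagged name. Prefix declarations are not emitted.
    """
    fragments = []
    query = "SELECT id, name, lang FROM person ORDER BY id"
    for person_id, name, lang in connection.execute(query):
        # Escape backslashes and quotes for a Turtle string literal.
        escaped = name.replace("\\", "\\\\").replace('"', '\\"')
        fragments.append(
            f"<{base_uri}{person_id}>\n"
            f"\trdf:type lawd:Person ;\n"
            f"\tdc:publisher <{publisher}> ;\n"
            f'\tfoaf:name "{escaped}"@{lang} .\n'
        )
    return "\n".join(fragments)


# Usage with an in-memory database standing in for your own.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, lang TEXT)"
)
connection.execute("INSERT INTO person VALUES (1001, 'Soranus', 'en')")
print(person_rows_to_turtle(
    connection, "http://my.people.net/person/", "http://my.people.net/"
))
```

The same loop would work against any database for which Python has a driver; only the connection and the query need to change.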

If you’ve got this far, and created some RDF suitable for submitting to SNAP, that’s great. If you can also move on to the sixth stage, making your RDF available on the web so that SNAP can harvest it at intervals and keep up to date, that’s even better. This stage means looking at your own RDF triple store. Setting this sort of software up yourself isn’t so easy, but you may wish to read up on open source projects like OntoWiki, Sesame, and Fuseki. Maybe one of us will blog more about this in future.