Procedures to identify co-references in contributing datasets

During the SNAP workshop of March 30 and April 1 we had a breakout session on how to identify overlaps between the contributing datasets, one of SNAP's challenges. Given how diverse the information provided by each dataset will be, this is by no means straightforward. What follows is my idiosyncratic survey of what was discussed at the session, with thanks to the participants of course.

Perhaps it’s good to remark first that there are different types of co-reference between entries in the datasets. Some of these are not controversial at all, and are just a consequence of important people appearing in several prosopographies, in many regions of the ancient world, often even spread across long periods. Alexander the Great is an obvious example. In other cases the same individual may be found in more than one database even though they are attested only a few times, perhaps only in a single source.

For the former, the important people, their name or names in the datasets may suffice, together with the number of attestations for example. For the latter, it may be profitable to look at the attestations themselves. If all of a person's references in one database are identical with those in another, the identification will be completely uncontroversial and can almost be automated. But how to proceed if all but one are the same, or half, or only a single one?
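To make the "all, all but one, half, or only one" question concrete, here is a minimal sketch of how the overlap between two sets of attestations could be measured. It assumes the references have already been normalised to shared identifiers; the `text:` identifiers and the function name `attestation_overlap` are hypothetical, and the Jaccard ratio is simply one possible measure, not something agreed at the session.

```python
def attestation_overlap(refs_a, refs_b):
    """Compare the attestation identifiers of two person records.

    Assumes both lists use the same (already normalised) identifiers.
    """
    a, b = set(refs_a), set(refs_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),        # attestations found in both records
        "only_a": len(a - b),         # attestations only in the first record
        "only_b": len(b - a),         # attestations only in the second record
        # Share of all attestations that both records have in common.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Hypothetical identifiers: the second record lacks one attestation.
record_a = ["text:101", "text:102", "text:103", "text:104"]
record_b = ["text:101", "text:102", "text:103"]
print(attestation_overlap(record_a, record_b))
```

A ratio of 1.0 corresponds to the uncontroversial case; where to draw the line below that is exactly the open question.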

Of course establishing that two attestations are identical is itself not a simple matter. Conventions for identifying sources still vary widely, with different abbreviations being used in different disciplines and even within a single discipline. Perhaps Trismegistos’ texids, which are now present in many papyrological and epigraphic databases, can supplement the use of CTS identifiers for authors.

As to the procedure involved, perhaps the databases in which a potential overlap is detected on the basis of names or attestations can be alerted by email or in some other way. If both of them agree, the two records can be merged in SNAP. If one agrees but the other does not, a note should be made. If the verdict is negative on either side, this should also be recorded.
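The decision rules in this paragraph are simple enough to write down. The following is only a sketch under my own assumptions: the names `Verdict` and `resolve`, the outcome labels, and the extra "pending" state for a database that has not yet answered are all hypothetical.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    ACCEPT = "accept"
    REJECT = "reject"


def resolve(verdict_a: Optional[Verdict], verdict_b: Optional[Verdict]) -> str:
    """Combine the answers of the two contributing databases.

    None means that a database has not responded yet (an assumption;
    the discussion did not cover this case).
    """
    if verdict_a is None or verdict_b is None:
        return "pending"
    if verdict_a is Verdict.ACCEPT and verdict_b is Verdict.ACCEPT:
        return "merge"       # both agree: the records can be merged
    if verdict_a is Verdict.REJECT and verdict_b is Verdict.REJECT:
        return "rejected"    # negative on both sides: record the rejection
    return "disputed"        # one agrees, the other does not: make a note


print(resolve(Verdict.ACCEPT, Verdict.REJECT))
```

Whatever the outcome, the point is that every verdict is stored rather than only the successful merges.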

Of course this should not be the only way in which identifications are made. Third-party users, for instance, should also be able to suggest identifications, on whatever basis. This can include similar genealogical connections of the people involved, combined with similar dates and locations. These criteria are currently also used in the offline version of Trismegistos to suggest possible doubles in the people database.
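As an illustration of how such criteria might be combined, here is a small sketch that lists which of them two records share. It is not the routine used in Trismegistos: the `Person` fields, the function `shared_criteria`, the exact-match comparison and the sample people are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Person:
    name: str
    father: Optional[str]   # one genealogical connection, as an example
    place: Optional[str]
    earliest: int           # year; negative numbers for BC
    latest: int


def shared_criteria(p: Person, q: Person) -> List[str]:
    """Return the criteria on which two person records agree."""
    matches = []
    if p.father is not None and p.father == q.father:
        matches.append("father")
    if p.place is not None and p.place == q.place:
        matches.append("place")
    # Two date ranges are "similar" here if they overlap at all.
    if p.earliest <= q.latest and q.earliest <= p.latest:
        matches.append("dates")
    return matches


# Invented records, purely for illustration.
a = Person("Ptolemaios", "Apollonios", "Oxyrhynchos", -150, -120)
b = Person("Ptolemaios", "Apollonios", "Oxyrhynchos", -130, -100)
print(shared_criteria(a, b))
```

The more criteria agree, the stronger the case for flagging the pair as a possible double; the final judgement would still lie with a person.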

Many of these identifications are bound to be controversial. One of the most important tasks of SNAP will therefore be to cater for different degrees of uncertainty. Leif Isaksen suggested a system of green (accept & identify) – orange (mark as ‘possibly the same as’ but without merging) – red (keep separate with a rejection of the merge). Øyvind Eide remarked that even this three-level system was not self-evident, as research on modelling uncertainty has shown that much depends on the personality of the person judging the merits of the suggested identification.
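The green, orange and red scheme could be recorded along the following lines. This is a sketch only: the class and field names and the identifiers in the example are hypothetical, and I have added an `assessor` field so that it stays visible who made each judgement, which seems relevant given the remark about personality.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    GREEN = "accept and identify"
    ORANGE = "possibly the same as"
    RED = "keep separate"


@dataclass(frozen=True)
class Identification:
    person_a: str    # identifier of the first record (hypothetical format)
    person_b: str    # identifier of the second record
    status: Status
    assessor: str    # who judged the suggestion


def group_by_status(identifications):
    """Group the judged pairs of records by their traffic-light status."""
    groups = {status: [] for status in Status}
    for ident in identifications:
        groups[ident.status].append((ident.person_a, ident.person_b))
    return groups


judgements = [
    Identification("dbA:12", "dbB:7", Status.GREEN, "editor 1"),
    Identification("dbA:15", "dbB:9", Status.ORANGE, "editor 2"),
    Identification("dbA:20", "dbB:3", Status.RED, "editor 1"),
]
for status, pairs in group_by_status(judgements).items():
    print(status.name, pairs)
```

Only the green pairs would be merged; orange pairs remain separate records with a link, and red pairs keep the rejection on file so the same suggestion is not raised again.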

John raised the point that it remained very uncertain in all this who was going to evaluate suggested merges. Since the SNAP-id refers to the original database, ideally someone from that database should be involved in the process. All of this implies work, however, and just how this work will be distributed and where the responsibilities lie remains the subject of discussion. It also remains unclear whether and when SNAP will reach a turning point at which new material is no longer integrated automatically, and matching with existing partners becomes the responsibility of the new contributor. This would prevent unnecessary proliferation of SNAP-ids for people. Test cases with the material available in SNAP will no doubt clarify all of these problems further.

Finally, we might also want to explore whether visualisation of the entries subject to a potential merge would help, either by working with a representation of the networks of the people, or with a ‘visual alias’ based on properties available in the datasets.
