Named Entity Recognition and SNAP

One of the other breakout sessions of the SNAP workshop dealt with Named Entity Recognition. One may wonder whether setting up a Named Entity Recognition procedure from scratch is worth the effort for what is, after all, a limited and finite set of full-text documents. The experience of Trismegistos People has shown that the answer is definitely YES.

Trismegistos started its work on the 40-50.000 full-text files in XML Unicode, kindly provided to us by the Duke Database and the Papyrological Navigator, in late 2008. Although we did have the onomastic database of the Prosopographia Ptolemaica, this clearly needed to be expanded to include all Roman period names, and also to cater for the declined forms found in the full text. Based on the roughly 90.000 unique capitalized entries in the full text, we created the necessary authority lists of declined name variants, which were linked to a database of name variants (in the nominative) and later to a database of names. This was then used to match the words starting with a capital (fortunately only proper names and not the words at the beginning of a sentence!) in the full text. Ambiguous forms, such as words that could be accusative or genitive depending on the context, were disambiguated manually, although in retrospect we should perhaps have developed an automated procedure for that. Afterwards we created a parsing tool that matched a cluster of capitalized name forms and linking words such as ‘son’ and ‘mother’ against a set of 164 rules. Depending on the type of name (those annoying Latin style names and all the complications they caused!) and the cases, the computer then suggested an interpretation of the cluster which was manually reviewed. After human approval, the information was exported to a MySQL database, which was then manipulated to fit Trismegistos People’s data structure. For those who would like more detail, there is our article in the Journal of Juristic Papyrology of 2009 (downloadable at
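To make the workflow a little more concrete, here is a minimal Python sketch of the two steps described above: looking up capitalized forms in an authority list of declined variants, and matching the resulting cluster against a rule. It is not the Trismegistos tool itself. The transliterated names, the tiny `AUTHORITY` table, the single linking word and the two entries in `RULES` are all invented for illustration; the real system worked on Greek text with far larger lists and 164 rules.

```python
import re

# Hypothetical authority list: declined form -> possible (nominative, case) readings.
AUTHORITY = {
    "Ptolemaios": [("Ptolemaios", "nom")],
    "Ptolemaiou": [("Ptolemaios", "gen")],
    "Dionysios": [("Dionysios", "nom")],
    "Dionysiou": [("Dionysios", "gen")],
    "Thais": [("Thais", "nom")],
    "Thaidos": [("Thais", "gen")],
}

# Hypothetical linking words (transliterated) and the relation they signal.
LINK_WORDS = {"metros": "mother"}

# Hypothetical rules: a sequence of cases and link words -> suggested reading.
RULES = {
    ("nom", "gen"): "{0}, child of {1}",
    ("nom", "gen", "mother", "gen"): "{0}, child of {1}, mother {2}",
}


def extract_clusters(text):
    """Group consecutive known name forms and linking words into clusters."""
    clusters, current = [], []
    for token in re.findall(r"\w+", text):
        # A linking word only counts when it follows a name.
        if token in AUTHORITY or (token in LINK_WORDS and current):
            current.append(token)
        elif current:
            clusters.append(current)
            current = []
    if current:
        clusters.append(current)
    # Drop linking words left dangling at the end of a cluster.
    for cluster in clusters:
        while cluster and cluster[-1] in LINK_WORDS:
            cluster.pop()
    return clusters


def interpret(cluster):
    """Suggest a reading for a cluster, or None if it needs manual review."""
    signature, names = [], []
    for token in cluster:
        if token in LINK_WORDS:
            signature.append(LINK_WORDS[token])
            continue
        readings = AUTHORITY[token]
        if len(readings) != 1:
            return None  # ambiguous form: leave it to a human
        nominative, case = readings[0]
        signature.append(case)
        names.append(nominative)
    template = RULES.get(tuple(signature))
    return template.format(*names) if template else None
```

Calling `extract_clusters` on a transliterated phrase such as "homologei Ptolemaios Dionysiou metros Thaidos tois para" yields one cluster of four tokens, which `interpret` turns into a suggestion for a reviewer to approve or reject. Clusters with no matching rule, or with an ambiguous form, come back as `None` rather than as a guess.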

For the new SNAP project, Trismegistos will adapt its NER work to filter out names as they appear in Latin inscriptions. This will involve a new set of rules, as Latin poses new challenges compared with Greek: many of the cases are less obviously recognizable on the basis of their endings, certainly in the absence (again!) of an authoritative onomastic dataset we can use. Again we will have to combine work on creating this tool with the creation of new rules to analyze the identification clusters.

I have just started to play around with some data in Filemaker. The extraction of capitalized clusters is already done, but much work remains to be done for the recognition of words ending in some specific groups (e.g. –i and –o, which can be genitive, dative or even nominative plural, and nominative, dative or ablative respectively). The automated analysis of identification clusters can only be preliminary at this stage, but nevertheless substantial progress has been made.
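The ambiguity of these endings can be illustrated with a small Python sketch. This is not the FileMaker work itself, only an illustration: the two endings and their possible cases come from the paragraph above, while the function names, the sample inscription text and the assumption that names are capitalized and everything else is lower case are hypothetical.

```python
import re

# Endings mentioned above and the cases they may represent.
ENDING_CASES = {
    "i": ["genitive", "dative", "nominative plural"],
    "o": ["nominative", "dative", "ablative"],
}


def capitalized_clusters(text):
    """Return runs of consecutive capitalized words as lists of tokens."""
    pattern = r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*"
    return [match.group(0).split() for match in re.finditer(pattern, text)]


def candidate_cases(form):
    """List the cases a form might be in, judging by its ending alone."""
    for ending, cases in ENDING_CASES.items():
        if form.lower().endswith(ending):
            return list(cases)
    return []  # ending not covered by this sketch


def analyse(text):
    """Pair every capitalized form in each cluster with its candidate cases."""
    return [
        [(form, candidate_cases(form)) for form in cluster]
        for cluster in capitalized_clusters(text)
    ]
```

For a made-up phrase like "filio Marco Iulio Maximo fecit", `analyse` finds one cluster of three forms and attaches the same three candidate cases to each of them, which is exactly why the ending alone cannot settle the matter: the surrounding words, or a rule about which cases may occur together in a cluster, have to narrow it down.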

So far nothing has been done with the genealogical and other relational information in the inscriptions. I had been planning to distill the information from the naming clusters, but not from the rest of the inscription. Gernot Höflechner suggested to me that his Latin grammatical parser might be worth exploring for this. It does seem a very attractive suggestion! Who knows, one day we may be able to distill almost all relevant information on age, office etc. from the often stereotyped inscriptions.

I am quite looking forward to going to Edinburgh and talking to the people of the Geoparser, to see how my exploratory efforts can be refined or thrown overboard by their expertise!
