ChartEx: data technologies and charters

[Image: York, corner of Stonegate and Petergate – taken from one of the ChartEx presentations]

I will gradually be talking about the sessions I went to at this year’s International Medieval Congress, but I’ve had a special request to report on the session organised by the ChartEx project, because of its possible relevance to many of the other current charter database projects. Most of the presentations that the ChartEx team gave are now up on their project site, so that’s the first place to look. This post instead gives my personal views on what the wider significance of the project might be, judged on the basis of what were inevitably fairly brief presentations.

I’ll start by making three points that the team themselves made: this is a proof-of-concept project (i.e. the emphasis is on a relatively short intense project to see if the technology can work effectively), they’re working with existing digitised resources, and their aim is to provide tools for expert historians rather than end-results accessible to non-specialists. So any assessment of what they’ve achieved has to acknowledge the limits of what’s possible in the time, the sources they had to start from and who they’re designing things for.

There are three main areas on which they were focusing: Natural Language Processing (NLP), data mining and a virtual workbench. First of all, the NLP is attempting to create a system which will automatically mark up charter texts or transcriptions, e.g. tagging people, places, occupations, relationships, charter features etc. So the obvious questions I was interested in were 1) can such automatic marking-up be done and 2) is it useful if you do succeed in doing it? To which the answers seemed to me to be 1) “yes, but” and 2) more useful when combined with data mining than I’d previously appreciated.

From what we heard of the methods and successes of the NLP part of the project, there are certain limits on what it can effectively do:

a) You need a large training set to start with: they were talking about 200 charters that had to be marked up by hand, which means it’s probably only a process worth doing if you have at least a thousand charters you want marked up.

b) It works better on marking up names (of people or places) than on relationships beyond the most immediately adjacent in the text: it can cope with finding the father-son relationship in “Thomas son of Josce”, but not necessarily both of the relationships in “Thomas son of Josce goldsmith and citizen of York to his younger son Jeremy”.

c) One of the reasons it works more effectively on names is that it relies on existing editorial conventions, e.g. capitalisation of proper nouns. That means that if you get an editor who’s decided not to use this convention (as with the Farfa charters), you have problems.

d) It also sounded as if it would work reasonably well where you had a list of likely terms you could give it to look for, e.g. occupation names/titles.

e) Overall, it’s likely to work best on texts that are relatively standardised: the demonstrations we had were using modern English translations or summaries of charters from late medieval York. One of the team suggested that if you used the original Latin texts instead, you might capture some extra relationships more clearly because of grammatical case (e.g. you could distinguish the recipient from the donor in a sentence). However, that relies crucially on the writer of the Latin texts observing some consistent rules of grammar, which early medieval scribes frankly don’t.

f) There’s also what I now think of as the “Judas Iscariot problem”, after an example in my IMC 2013 paper. In other words, the names of people and places that you don’t want (e.g. Biblical figures in sanctions clauses, or those mentioned in pro anima clauses in this example) also get marked up.

Taken together, all these factors mean that NLP is only likely to be of substantial use where you’ve got big and fairly homogeneous corpora of charters: the only early medieval dataset ChartEx were considering working with was the online edition of the Cluny charters.
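To make points (c) and (d) above concrete, here is a minimal sketch of what convention-dependent tagging looks like. The occupation gazetteer and the tag names are my own invented illustrations, not ChartEx’s actual system, which was trained statistically on hand-marked charters; but the sketch shows both why a term list helps and why the capitalisation convention matters.

```python
# Hypothetical gazetteer of occupation terms (point d): a simple list
# lookup, standing in for ChartEx's trained statistical tagger.
OCCUPATIONS = {"goldsmith", "tanner", "mercer", "draper"}

def tag_text(text):
    """Wrap known occupations as <occ> and capitalised words as <name>."""
    tagged = []
    for tok in text.split():
        word = tok.strip(",.;")
        if word.lower() in OCCUPATIONS:
            tagged.append(tok.replace(word, f"<occ>{word}</occ>"))
        elif word[:1].isupper():
            # Relies on the editorial convention of capitalising proper
            # nouns -- exactly the dependency noted in point (c). An
            # edition that lowercases names would defeat this rule.
            tagged.append(tok.replace(word, f"<name>{word}</name>"))
        else:
            tagged.append(tok)
    return " ".join(tagged)

print(tag_text("Thomas son of Josce goldsmith and citizen of York"))
# → <name>Thomas</name> son of <name>Josce</name> <occ>goldsmith</occ>
#   and citizen of <name>York</name>
```

Note how brittle this is: the Judas Iscariot problem (point f) follows immediately, since any capitalised word gets tagged as a name whether you want it or not.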

The part of the project that I found most interesting (and potentially more relevant to early medievalists) was the discussion of data mining. This was using statistical methods on marked-up text to suggest possible identifications, both person-to-person and (more complicated) site-to-site. More specifically, the aim was to match people/families in charters from late medieval York with one another and combine this with boundary information to try and identify a series of charters all dealing with the same urban plot.

This is the kind of matching that Sarah Rees Jones and scholars like her have tried to do for urban landscapes by manual methods for many years. What is so useful about computer techniques is that they can combine multiple factors and compare different charters very rapidly. If you look at the demonstration of this (slides 10–12), you can see how a phrase such as “Thomas, son of Josce, goldsmith” can be broken down into a set of statements with probabilities, and the likelihood of a match between two people with similar descriptors in two different charters can then be quantified. (For the mathematically inclined among us, the speaker admitted that the probabilities for names and profession weren’t necessarily entirely independent, but he didn’t think that distorted the results too much.)
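The quantification works roughly like this sketch, which assumes (as the speaker did) that attribute matches are approximately independent, so their evidence can simply be multiplied. The base rates and the mismatch penalty below are invented for illustration; real values would be estimated from the corpus.

```python
def match_likelihood(attrs_a, attrs_b, base_rates):
    """Likelihood ratio that two charter entries describe the same person.

    Each shared attribute value multiplies the ratio by 1/base_rate:
    sharing a rare name like "Josce" is much stronger evidence than
    sharing a common one. A mismatched attribute multiplies by a small
    penalty (0.1 here, purely illustrative).
    """
    ratio = 1.0
    for key in attrs_a.keys() & attrs_b.keys():
        if attrs_a[key] == attrs_b[key]:
            ratio *= 1.0 / base_rates.get(attrs_a[key], 0.5)
        else:
            ratio *= 0.1
    return ratio

# "Thomas, son of Josce, goldsmith" appearing in two different charters.
a = {"name": "Thomas", "father": "Josce", "occupation": "goldsmith"}
b = {"name": "Thomas", "father": "Josce", "occupation": "goldsmith"}
# Invented frequencies: how often each value occurs in the corpus.
rates = {"Thomas": 0.05, "Josce": 0.01, "goldsmith": 0.02}
print(match_likelihood(a, b, rates))  # roughly 20 * 100 * 50 = 100,000
```

The independence caveat the speaker mentioned shows up directly here: if goldsmiths’ sons tend to share family names, multiplying the factors overstates the evidence, though usually not by enough to change which matches rank highest.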

The speakers also demonstrated how it was possible to do transaction/transaction clustering, i.e. to spot the charters which were most like each other in terms of the boundaries of the property transferred and the people involved. That kind of large-scale matching (they were carrying out complete cross-matching of sets of 100 items or more) is extremely difficult for human brains, which find it hard to take multiple factors into account simultaneously.
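One simple way to picture that kind of cross-matching is set overlap between the marked-up features of each charter (parties plus boundary descriptors). The charters and the Jaccard measure below are my own illustrative stand-in, not ChartEx’s actual algorithm, but they show why exhaustive pairwise comparison is trivial for a machine and hopeless by hand at scale.

```python
def jaccard(a, b):
    """Overlap between two feature sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

# Invented charters, each reduced to a set of marked-up features.
charters = {
    "c1": {"Thomas", "Josce", "Stonegate", "land of St Mary"},
    "c2": {"Thomas", "Josce", "Stonegate", "king's highway"},
    "c3": {"Agnes", "Walter", "Petergate", "king's highway"},
}

# Complete cross-matching: every pair compared and ranked -- the kind of
# systematic 100+ item comparison human readers find so difficult.
pairs = sorted(
    ((jaccard(charters[x], charters[y]), x, y)
     for x in charters for y in charters if x < y),
    reverse=True,
)
for score, x, y in pairs:
    print(f"{x}-{y}: {score:.2f}")
# → c1-c2: 0.60
#   c2-c3: 0.14
#   c1-c3: 0.00
```

With real data the feature sets would be larger and weighted (a shared street name means more than a shared common forename), but the ranking principle is the same.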

It’s that combination of mark-up (automated or not) and data mining that struck me as the most useful general application of the project. The mapping of plots is only likely to be relevant for collections where we have lots of data on the same small areas, which means urban areas with large numbers of charters. The person-to-person identification techniques work well if you’ve got people identified in some detail in relatively formalised ways. My immediate thought is that it would have been a useful tool for the project team on Profile of a Doomed Elite. But the matching process can only be as effective as the quality of data you’ve got, and I don’t think most early medieval charter collections give you enough identifying details. I’d be very interested to hear the team’s results on matching Cluny data, or what you’d get from e.g. twelfth-century Scottish charters.

But in theory, you could apply the same matching techniques to any data in the charter that had been marked up, either by hand or via NLP. I’ve previously been sceptical about what you can do with a list of curses from Anglo-Saxon charters, but this kind of data mining probably could do some very interesting clustering of them, especially using some of the methods for matching texts that DEEDS has expertise in. And in particular, that means that it might be possible at last to do something systematic with early medieval formulae (for those of us who aren’t Wendy Davies).

Particular types of formulae, such as appurtenance clauses, are at once so standardised that they must derive from one another (or from shared earlier models) and at the same time so subtly different from one another that tracing their connections is extremely complicated. If you have the text of enough early medieval charters online, it wouldn’t be that time-consuming to mark up just the relevant few sections in each charter (either manually or possibly via NLP) and then turn such data-mining techniques on them. I suspect you would get some genuinely interesting suggested clusters as a result. And the whole point of this project is that it’s not intended to replace scholars, but to give them short-cuts to looking at data in a way that’s otherwise excessively time-consuming.
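As a sketch of what clustering formulae might look like, here is a crude single-pass grouping of appurtenance-style clauses by textual similarity, using Python’s standard difflib. The Latin clauses and the 0.8 threshold are invented for illustration; DEEDS-style text matching is considerably more sophisticated, but the principle of grouping near-identical clauses is the same.

```python
import difflib

# Invented appurtenance-style clauses: the first two differ by a single
# word ("aquarumve" vs "aquarumque"), the third is a different formula.
clauses = [
    "cum terris pratis pascuis silvis aquis aquarumve decursibus",
    "cum terris pratis pascuis silvis aquis aquarumque decursibus",
    "cum omnibus appendiciis suis",
]

def similarity(a, b):
    """Character-level similarity ratio between two clauses (0 to 1)."""
    return difflib.SequenceMatcher(None, a, b).ratio()

# Crude single-pass clustering: assign each clause to the first cluster
# whose exemplar it resembles closely, else start a new cluster.
clusters = []
for clause in clauses:
    for cluster in clusters:
        if similarity(clause, cluster[0]) > 0.8:
            cluster.append(clause)
            break
    else:
        clusters.append([clause])

print(len(clusters))  # → 2 (the near-identical clauses group together)
```

The interesting scholarly output would then be the clusters themselves: families of clauses close enough to suggest shared models, handed back to the historian for interpretation rather than presented as conclusions.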

And it’s at this point that I want to go on to the final aim of the ChartEx project, which is to produce a virtual workbench for historians working with charters. The main novelty here seemed to be the involvement of specialists in human-computer interaction, but at this stage in the project we were told more about the methodology they were using for designing the interface than about what was actually in it. So it’s a bit hard to know how different it will be from the kind of interface that KCL’s Department of Digital Humanities is now designing, e.g. the mapping and statistics possible with Domesday Book data. It’ll be interesting to see how this develops, but the project as a whole already seems to have some methods that those of us interested in charters from other periods might well find worth investigating and adapting.

