Fifty years of historical database angst

The Making of Charlemagne’s Europe project website has now gone live, and includes a post by me on interconnecting charter databases. In it I mention a recent argument we had when trying to decide which of several different categories of transaction a particular document fell into. Just to show that such problems of coding documents are not new, here are some quotes from a recent article on Charles Tilly, a historical sociologist and a pioneer of using databases for historical research.

The Codebook for Intensive Sample of Disturbances guides more than 60 researchers in the minutiae of a herculean coding project of violent civil conflicts in French historical documents and periodicals between 1830–1860 and 1930–1960…The Codebook contains information about violent civic conflict events and charts the action and interaction sequences of various actors (called there formations) over time….we find fine-grained detail and frequent provision made for textual commentary on the thousands of computer punch cards involved.

(John Krinsky and Ann Mische, “Formations and Formalisms: Charles Tilly and the Paradox of the Actor”, Annual Review of Sociology, 39 (2013), p. 3)

The article then goes on to quote the Codebook on the issue of subformations (when political groups split up):

In the FORMATION SEQUENCE codes, treat the subformation as a formation for the period of its collective activity—but place 01 (“formation does not exist as such at this time”) in the intervals before and after. If two or more subformations comprise the entire membership of the formation from which they emerge, place 01 in that formation’s code for the intervals during which they are acting. But if a small fragment breaks off from a larger formation, continue to record the activities of the main formation as well as the new subformation.

If a formation breaks up, reforms and then breaks up in a different way, assign new subformation numbers the second time.

If fragments of different formations merge into new formations, hop around the room on one foot, shouting ILLEGITIMIS NON CARBORUNDUM.

(Krinsky and Mische, p 4, citing Charles Tilly, Codebook for intensive sample of disturbances. Res.DataCollect. ICPSR 0051, Inter-Univ. Consort. Polit. Soc. Res., Ann Arbor, Mich. (1966), p. 95)

In nearly fifty years, we’ve gone from punch-cards to open source web application frameworks, but we still haven’t solved the problem of historical data (and the people behind it) not fitting neatly into the framework we create, however flexible we try and be.

ChartEx: data technologies and charters

York, corner of Stonegate and Petergate – image taken from one of the ChartEx presentations

I will gradually be talking about the sessions I went to at this year’s International Medieval Congress, but I’ve had a special request to report on the session organised by the ChartEx project, because of its possible relevance to many of the other current charter database projects. Most of the presentations that the ChartEx team gave are now up on their project site, so that’s the first place to look. This post gives more of my personal view of what the wider significance of the project might be, judging from what were inevitably fairly brief presentations.

I’ll start by making three points that the team themselves made: this is a proof-of-concept project (i.e. the emphasis is on a relatively short intense project to see if the technology can work effectively), they’re working with existing digitised resources, and their aim is to provide tools for expert historians rather than end-results accessible to non-specialists. So any assessment of what they’ve achieved has to acknowledge the limits of what’s possible in the time, the sources they had to start from and who they’re designing things for.

There are three main areas on which they were focusing: Natural Language Processing (NLP), data mining and a virtual workbench. First of all, the NLP is attempting to create a system which will automatically mark-up charter texts or transcriptions, e.g. tagging people, places, occupations, relationships, charter features etc. So the obvious questions I was interested in were 1) can such automatic marking-up be done and 2) is it useful if you do succeed in doing it? To which the answers seemed to me to be 1) “yes, but” and 2) more useful when combined with data-mining than I’d previously appreciated.

From what we heard of the methods and successes of the NLP part of the project, there are certain limits on what it can effectively do:

a) You need a large training set to start with: they were talking about 200 charters that had to be marked-up by hand, which means it’s probably only a process worth doing if you have at least a thousand charters you want marked-up.

b) It works better on marking-up names (of people or places) than on relationships, beyond the most immediately adjacent in the text, e.g. it can cope with finding the father-son relationship in “Thomas son of Josce”, but not necessarily both of the relationships in “Thomas son of Josce goldsmith and citizen of York to his younger son Jeremy”.

c) One of the reasons it works more effectively on names is because it’s using existing editorial conventions, e.g. capitalisation of proper nouns. That means that if you get an editor who’s decided they’re not going to use this convention (e.g. as with the Farfa charters), you have problems.

d) It also sounded as if it would work reasonably well where you had a list of likely terms you could give it to look for, e.g. occupation names/titles.

e) Overall, it’s likely to work best on texts that are relatively standardised: the demonstrations we had were using modern English translations or summaries of charters from late medieval York. One of the team suggested that if you used the original Latin texts instead, you might get some extra relationships clearer because of grammatical case (e.g. you could distinguish the recipient from the donor in a sentence). However, that relies crucially on the writer of the Latin texts observing some consistent rules of grammar, which early medieval scribes frankly don’t.

f) There’s also what I now think of as the “Judas Iscariot problem”, after an example in my IMC 2013 paper. In other words, the names of people and places that you don’t want (e.g. Biblical figures in sanctions clauses, or those mentioned in pro anima clauses in this example), also get marked-up.

I think all these factors combined mean that NLP is only likely to be of substantial use where you’ve got big and fairly homogeneous corpora of charters: the only early medieval dataset ChartEx were considering working with was the online edition of the Cluny charters.

The part of the project that I found most interesting (and potentially more relevant to early medievalists) was the discussion of data mining. This was using statistical methods on marked-up text to suggest possible identifications, both person-to-person, and also (more complicated), site-to-site. More specifically the aim was to match people/families in charters from late medieval York with one another and combine this with boundary information to try and identify a series of charters all dealing with the same urban plot.

This is the kind of matching that Sarah Rees Jones and scholars like her have tried to do for urban landscapes by manual methods for many years. What is so useful about computer techniques is that they can combine multiple factors and compare different charters very rapidly. If you look at the demonstration of this (slides 10-12), you can see how a phrase such as “Thomas, son of Josce, goldsmith” can be broken down into a set of statements with probabilities, and the likelihood of a match between two people with similar descriptors in two different charters can then be quantified. (For the mathematically inclined among us, the speaker admitted that the probabilities for names and profession weren’t necessarily entirely independent, but he didn’t think that distorted the results too much.)
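
To make that idea concrete for myself, here is a minimal sketch in Python of how descriptor probabilities might be combined. It is not the ChartEx team's actual method, which I only saw on slides: the frequencies, the prior and the function names are all invented for illustration, and it makes exactly the independence assumption that the speaker admitted was questionable.

```python
from math import prod

# Invented figures: the share of people in the corpus carrying each descriptor.
FREQUENCY = {
    ("forename", "Thomas"): 0.08,
    ("father", "Josce"): 0.01,
    ("occupation", "goldsmith"): 0.02,
}


def chance_agreement(shared):
    """Probability that two different people agree on every shared descriptor,
    treating the descriptors as independent (a simplification)."""
    return prod(FREQUENCY[descriptor] for descriptor in shared)


def match_probability(shared, prior=0.001):
    """Bayes' rule: probability that two mentions are the same person, given
    that they agree on the shared descriptors. Assumes the same person is
    always described in the same way, which real charters will not honour."""
    p_chance = chance_agreement(shared)
    return prior / (prior + (1 - prior) * p_chance)


name_only = [("forename", "Thomas")]
full_description = list(FREQUENCY)

print(round(match_probability(name_only), 3))
print(round(match_probability(full_description), 3))
```

With these made-up numbers a shared forename on its own leaves a match very unlikely, while agreement on forename, father and occupation together makes it very likely. If names and professions run in families, the simple product overstates how rare the combination is.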

The speakers also demonstrated how it was possible to do transaction/transaction clustering, i.e. to spot the charters which were most like each other in terms of the boundaries of the property transferred and the people involved. That kind of large-scale matching (they were carrying out complete cross-matching of sets of 100 items or more) is extremely difficult for human brains, which find it hard to take multiple factors into account simultaneously.
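
A rough sketch of what such cross-matching involves, again in Python and again my own illustration rather than anything the speakers showed: each charter is reduced to a set of features (people and boundary descriptions), and every pair is scored by how much the sets overlap. The Jaccard overlap measure and the feature labels are my assumptions. A set of 100 charters already means 100 × 99 / 2 = 4,950 pairs to compare, which is why this is a job for a machine.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of features two charters have in common (0 = none, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(charters, top=3):
    """Score every pair of charters and return the best-matching pairs."""
    scored = [
        (jaccard(charters[x], charters[y]), x, y)
        for x, y in combinations(sorted(charters), 2)
    ]
    return sorted(scored, reverse=True)[:top]


# Invented feature sets for three hypothetical charters.
charters = {
    "charter_1": {"person:Thomas son of Josce", "abuts:Stonegate", "abuts:Petergate"},
    "charter_2": {"person:Thomas son of Josce", "abuts:Stonegate", "abuts:land of X"},
    "charter_3": {"person:Y", "abuts:Z"},
}

print(most_similar(charters, top=1))
```

Here the first two charters share two of their four distinct features, so they come out as the closest pair; a real system would presumably weight a shared boundary more heavily than a shared common forename.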

It’s that combination of mark-up (automated or not) and data-mining that struck me as the most useful general application of the project. The mapping of plots is only likely to be relevant for collections where we have lots of data on the same small areas, which means urban areas with large numbers of charters. The person-to-person identification techniques work well if you’ve got people identified in some detail in relatively formalised ways. My immediate thought is that it would have been a useful tool to have had for the project team on Profile of a Doomed Elite. But the matching process can only be as effective as the quality of data you’ve got, and I don’t think most early medieval charter collections do give you enough identifying details. I’d be very interested to hear the team’s results on matching Cluny data, or what you’d get from e.g. twelfth-century Scottish charters.

But in theory, you could apply the same matching techniques to any data in the charter that had been marked up, either by hand or via NLP. I’ve previously been sceptical about what you can do with a list of curses from Anglo-Saxon charters, but this kind of data mining probably could do some very interesting clustering of them, especially using some of the methods for matching texts that DEEDS has expertise in. And in particular, that means that it might be possible at last to do something systematic with early medieval formulae (for those of us who aren’t Wendy Davies).

Particular types of formulae, such as appurtenance clauses, are at once so standardised that they must be being derived from one another (or from shared earlier models) and at the same time so subtly different from one another that tracing their connections is extremely complicated. If you have the text of enough early medieval charters online it wouldn’t be that time-consuming to mark-up just the relevant few sections in each charter (either manually or possibly via NLP) and then turn such data-mining techniques on them. I suspect you would get some genuinely interesting suggested clusters as a result. And the whole point of this project is that it’s not intended to replace scholars, but to give them short-cuts to looking at data in a way that’s otherwise excessively time-consuming.

And it’s at this point that I want to go on to the final aim of the ChartEx project, which is to produce a virtual workbench for historians working with charters. The main novelty here seemed to be the involvement of specialists in human-computer interaction, but at this stage in the project we were told more about the methodology they were using for designing the interface than what was actually in it. So it’s a bit hard to know how different it will be from the kind of interface that KCL’s Department of Digital Humanities is now designing, e.g. the mapping and statistics possible with Domesday Book data. It’ll be interesting to see how this develops, but the project as a whole already seems to have some methods that those of us interested in charters from other periods might well find worth investigating and adapting.

Medieval social networks 2: charters and connections

As a follow-up to my first post on social network analysis, I’m now gradually reading some of the many books and articles on historians’ use of network analysis that readers of my blog suggested. And having read a couple of chapters of Giovanni Ruffini, Social Networks in Byzantine Egypt, I’m coming to realise that one of the most difficult issues for those of us working with documentary sources is deciding what counts as a connection between two people and what links should therefore be included in the network.

The majority of the late antique/medieval network analysis studies that I’ve looked at work by hand-crafting links. Someone sits down, works their way through their sources and picks out by eye every link between two people (or two places). Often, they also categorise the link. For example, Elizabeth Clark, when studying conflicts between Jerome and Rufinus, divided links into seven different types: “marriage/kinship; religious mentorship; hospitality; travelling companionship; financial patronage, money, and gifts; literature written to, for, or against members of the network; and carriers of literature and information correspondence.”

(Elizabeth A. Clark, “Elite networks and heresy accusations: towards a social description of the Origenist controversy”, Semeia (56) 1991, 79-117 at p. 95).

Similarly, Judith Bennett did the same thing when looking at connections of families recorded in the Brigstock manorial court records:

The content of these transactions has been divided into six qualitative categories that collectively encompass all possible transactions. These categories are based upon whether the network subject interacted with another person by (1) receiving assistance, (2) giving assistance, (3) acting jointly, (4) receiving land, (5) giving land, or (6) engaging in a dispute.

(Judith M. Bennett, “The tie that binds: peasant marriages and families in late medieval England”, Journal of Interdisciplinary History 15 (1984), 111-129, at p. 115).

And for networks of places, Johannes Preiser-Kapeller, “Networks of border zones: multiplex relations of power, religion and economy in South-Eastern Europe, 1250-1453 AD”, in Revive the past: proceedings of the 39th conference on computer applications and quantitative methods in archaeology, Beijing, 12-16 April 2011, edited by Mingquan Zhou, Iza Romanowska, Zhongke Wu, Pengfei Xu and Philip Verhagen (Amsterdam, Pallas Publications, 2012), 381-393, combined existing geographical datasets on late antique land and sea routes with details of church and state administrative networks he’s compiled from documentary sources.

Such approaches create very reliable networks, but they’re hard to scale up. Clark looks at 26 people; Judith Bennett has 31 people and 1,965 appearances in extant records from 1287-1348. Preiser-Kapeller has around 270 nodes and 680 links in total. Rosé’s study of Odo of Cluny, which I discussed in the previous post, had 860 links. For charters, such hand-crafted networks would probably only allow the exploration of small archives or individual villages.

What is more, researchers often want to carry out social network analysis as an offshoot of more general prosopographical work, such as creating a charter database. But it’s hard to analyse links until you’ve first created a prosopography, because it’s only when you’ve been through all the charters that you have a decent idea of whether two people of the same name are actually the same person. (There’s a further issue here about whether you may end up with circular reasoning between prosopography and network analysis, but I’ll leave that for now). So in theory, you’d need to go through all the charters first to identify people and then have to go back to assess whether or not they are linked in a meaningful way, doubling your work.

As a result, some researchers have started trying to see if there are ways of automatically creating networks from existing databases or files, developing methods for analysing charters that (in theory) can be scaled up relatively easily. In the rest of the post I want to look at the relatively few projects I’m aware of attempting to do this and outline how we might approach the problem with the Making of Charlemagne’s Europe dataset.

The three projects I’m looking at are by Giovanni Ruffini, working on the village of Aphrodito in Egypt (see reference above), Joan Vilaseca, who’s been experimenting on creating graphs from the early medieval sources he’s collected at Cathalaunia.org and a controversial article by Romain Boulet, Bertrand Jouve, Fabrice Rossi, and Nathalie Villa, “Batch kernel SOM and related Laplacian methods for social network analysis”, Neurocomputing 71 (2008), 1257-1273.

Ruffini is explicit about how he’s creating his networks and the problems that may result from this (pp. 29-31). He’s taking documents and creating “affiliation networks”: all those who appear in the same document are regarded as connected to one another. As he points out, the immediate problem is that this method can introduce distortions if you have one or two documents with very large numbers of names. For example, one of the texts in his corpus is part of the Aphrodito fiscal register and has 455 names in it, while the average text names only eleven (p. 203). If such a disproportionately large text is included, analysis of connectivity is badly distorted, with all the people appearing in the fiscal register appearing at the top of connectivity lists.
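
The arithmetic behind that distortion is simple: a document naming n people generates n × (n − 1) / 2 links, so a text with eleven names yields 55 links while one with 455 names yields 103,285. The following Python sketch builds an affiliation network in the way I understand Ruffini to describe it; the function names and the toy corpus are my own invention, not his data or code.

```python
from collections import Counter
from itertools import combinations


def affiliation_edges(documents):
    """Link every pair of people named in the same document.
    Returns a Counter mapping each (person, person) pair to the number
    of documents the two share."""
    edges = Counter()
    for names in documents.values():
        for pair in combinations(sorted(set(names)), 2):
            edges[pair] += 1
    return edges


def degrees(edges):
    """Number of distinct people each person is linked to."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


# Invented corpus: two ordinary documents and one register with 40 names.
documents = {
    "lease": ["person_a", "person_b", "person_c"],
    "sale": ["person_a", "person_d"],
    "register": [f"taxpayer_{i}" for i in range(40)],
}

edges = affiliation_edges(documents)
degree = degrees(edges)
print(degree["person_a"], degree["taxpayer_0"])
```

In this toy corpus person_a, who appears in two separate transactions, ends up far less "connected" than any name that happens to sit in the register, which is exactly the effect Ruffini warns about.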

The same effect can be seen in Joan Vilaseca’s graphs. If you look at his first attempts at graphing documents from Catalonia between 898-914, they’re dominated by the famous judgement of Valfogona in 913.

But Joan’s graphs also show an additional problem. His first graphs also give great prominence to Charles the Simple and Louis the Stammerer, because they appear so often in dating clauses. When he starts looking for measures of centrality in his next post he initially finds the most connected people to be St Peter, the Virgin Mary and Judas Iscariot (who appear frequently in sanction clauses).

This brings us to the key question: what does it mean to be in the same charter as another person? The problem is that people are named in charters for so many different reasons: they may be saints, donors, witnesses, relatives to be commemorated, scribes or even the count whose pagus you are in. People may also appear as the objects of transactions: some of our early decisions on the Charlemagne project were deciding how we would treat the unfree (and possibly the free) who were being transferred between one party and another. Such unfree have an obvious connection to the donor and the recipient. But do they have any meaningful relationship to the witnesses or the scribe? At least with witnesses, there’s a reasonable chance in most cases that they all physically met at some point, but I don’t know of any evidence that the unfree would necessarily have been present when their ownership was transferred by a charter.

So simple affiliation networks, even when you eliminate disproportionately large documents and people mentioned only in dating or sanction clauses, can still be inaccurate representations of actual relationships. One possible response to this problem is to include as links only types of relationships that are themselves spelled out in the charters. Joan has some graphs showing only family and neighbourhood relationships, for example. Ruffini (p. 21) suggests the possibility of using data-sets where a link is defined as existing only when there is a clear connection between two parties in a document e.g. between a lessor and a lessee. But as he points out, we would then have much smaller data-sets. And for early medieval charters, in particular, focusing on the main parties to a transaction only would simply demonstrate that most transactions were about people donating or selling land to churches and monasteries, which is not exactly new information.

Are there any other ways to cut out “irrelevant” connections while keeping those we think are likely to show meaning? Another approach that Joan tries uses affiliation networks, but then removes links where two people occur together in only one document. For his interest in identifying key members of Catalan society, focusing on the most important links may well make sense. But it potentially distorts the evidence on one question of wider interest: how significant are weak ties in charter-derived networks? Weak ties, where two people interact only occasionally, may paradoxically be more important for spreading information or practices. Given we have only a small subset of interactions preserved via charter data, significant weak ties may be lost if we start removing data from affiliation networks in this way.
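
A small Python sketch shows what is at stake. It is my own illustration of the pruning idea, not Joan's code, and the link counts are invented: links attested in only one document are dropped, and we then count how many separate pieces the network falls into.

```python
def drop_single_cooccurrences(edge_weights, minimum=2):
    """Keep only links attested in at least `minimum` documents."""
    return {pair: w for pair, w in edge_weights.items() if w >= minimum}


def components(edges):
    """Count connected groups among the people who still have a link."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, count = set(), 0
    for start in neighbours:
        if start in seen:
            continue
        count += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(neighbours[node] - seen)
    return count


# Invented co-occurrence counts: two tight groups joined by one weak tie (C-D).
edge_weights = {("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 1, ("D", "E"): 4}

pruned = drop_single_cooccurrences(edge_weights)
print(components(edge_weights), components(pruned))
```

Before pruning everyone is in one network; afterwards there are two unconnected groups, because the single document linking C and D was the only bridge between them.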

Implicitly, at least, an alternative method for selecting links within what’s broadly an affiliation network is given by Boulet, Jouve, Rossi and Villa. As they explain in their study of thirteenth- and fourteenth-century notarial acts, they constructed a graph in the following manner (pp. 1264-1265):

First, nobles and notaries are removed from the analyzed graph because they are named in almost every contracts: they are obvious central individuals in the social relationships and could mask other important tendencies in the organization of the peasant society. Then, two persons are linked together if:

- they appear in a same contract,
- they appear in two different contracts which differ from less than 15 years and on which they are related to the same lord or to the same notary.

The three main lords of the area (Calstelnau Ratier II, III and Aymeric de Gourdon) are not taken into account for this last rule because almost all the peasants are related to one of these lords. The links are weighted by the number of contracts satisfying one of the specified conditions.
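
Here is how I read those rules, as a Python sketch. This is my interpretation of the quoted passage, not the authors' code: I have assumed that each contract records one lord and one notary, that the people listed are the peasants left once nobles and notaries have been removed, and that the weight counts one for each contract (first rule) or pair of contracts (second rule) satisfying a condition. The lord names are placeholders.

```python
from collections import Counter
from itertools import combinations

# Placeholders for the three main lords excluded from the second rule.
EXCLUDED_LORDS = {"main_lord_1", "main_lord_2", "main_lord_3"}


def linked_pairs(contracts, max_gap=15):
    """Weighted links between peasants, following the two quoted rules."""
    weights = Counter()
    # Rule 1: the two people appear in the same contract.
    for contract in contracts:
        for pair in combinations(sorted(contract["people"]), 2):
            weights[pair] += 1
    # Rule 2: two different contracts less than `max_gap` years apart that
    # share a lord (other than the excluded ones) or a notary.
    for first, second in combinations(contracts, 2):
        if abs(first["year"] - second["year"]) >= max_gap:
            continue
        same_lord = (first["lord"] == second["lord"]
                     and first["lord"] not in EXCLUDED_LORDS)
        same_notary = first["notary"] == second["notary"]
        if not (same_lord or same_notary):
            continue
        for p in first["people"]:
            for q in second["people"]:
                if p != q:
                    weights[tuple(sorted((p, q)))] += 1
    return weights


# Invented contracts for illustration.
contracts = [
    {"year": 1300, "people": {"peasant_a", "peasant_b"}, "lord": "lord_x", "notary": "notary_1"},
    {"year": 1310, "people": {"peasant_c"}, "lord": "lord_x", "notary": "notary_2"},
    {"year": 1330, "people": {"peasant_d"}, "lord": "lord_x", "notary": "notary_2"},
    {"year": 1305, "people": {"peasant_e"}, "lord": "main_lord_1", "notary": "notary_3"},
    {"year": 1306, "people": {"peasant_f"}, "lord": "main_lord_1", "notary": "notary_4"},
]

weights = linked_pairs(contracts)
print(dict(weights))
```

In this example peasant_c is linked to peasant_a and peasant_b through their shared lord ten years apart, peasant_d is too late to be linked to anyone, and peasant_e and peasant_f remain unlinked because their only common factor is one of the excluded main lords.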

Though it’s not clear why people are regarded as linked if they use the same notary, the other criteria seem to be ways of trying to filter out distortions that potentially arise from notarial practices. If men are routinely described in terms of their affiliation to a lord e.g. “A the man of B”, then an affiliation network will derive from a sale between “A the man of B” and “C the man of D” not only the justified links A to B, C to D and A to C, but also links that in practice are unlikely to exist or at least are not proven to do so, i.e. A to D, C to B and B to D.

So how might we balance distortions from applying the affiliation network model to charter data against loss of data or an unfeasibly high workload if we don’t use this method? The model for the Making of Charlemagne’s Europe database allows inputting of relationship factoids, which will catch explicit references to people as the relatives or neighbours of others. Graphs using such data will be relatively easy to construct.

We are also, however, recording “agent roles”, used to identify what role a person or an institution plays within an individual charter or transaction (e.g. witness, scribe, object of transaction, granter). At the minimum, any social network analysis application added to the system should probably allow a user to choose which of these roles they want included within the graphs to be created. There should also be some threshold (either chosen by us or user-defined) for excluding documents that contain “too many” different agents. We’re still not going to get the precision graphs that hand-crafting links will give, but we can hopefully still get something that will tell us something useful about how people interact.
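
As a thought experiment, the two controls just described might look something like this in Python. It is only a sketch: the data layout, the function name, the threshold of 30 and every role label other than witness, scribe, object of transaction and granter are my assumptions, and nothing here reflects how the project's database is actually built.

```python
from collections import Counter
from itertools import combinations


def role_filtered_edges(charters, roles, max_agents=30):
    """Affiliation links between agents holding one of the chosen roles,
    skipping any charter that names more than `max_agents` distinct agents."""
    edges = Counter()
    for agents in charters.values():
        if len({name for name, _ in agents}) > max_agents:
            continue  # a document with "too many" agents would swamp the graph
        selected = sorted({name for name, role in agents if role in roles})
        for pair in combinations(selected, 2):
            edges[pair] += 1
    return edges


# Invented charters: each agent is recorded as (name, role).
charters = {
    "charter_1": [
        ("agent_a", "granter"),
        ("agent_b", "recipient"),
        ("agent_c", "witness"),
        ("agent_d", "object of transaction"),
    ],
    "charter_2": [
        ("agent_a", "witness"),
        ("agent_c", "witness"),
        ("agent_e", "scribe"),
    ],
    "long_list": [(f"tenant_{i}", "object of transaction") for i in range(50)],
}

edges = role_filtered_edges(charters, roles={"granter", "witness"})
print(dict(edges))
```

Choosing only granters and witnesses leaves a single link, between agent_a and agent_c, attested twice; the unfree agent_d and the fifty tenants in the long list never enter the graph.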

Medieval social networks 1: concepts, intellectual networks and tools

Data visualization of Facebook relationships by Kencf0618

Network analysis is one of those areas which keeps on cropping up as a possibility for medieval researchers. (There have been some interesting discussions and examples previously at A Corner of Tenth Century Europe and Cathalaunia, which I’ll discuss more in a later post).

Since one of the hopes of the Making of Charlemagne’s Europe project I’m working for is that the data collected can be used for exploring social networks, I thought it would be useful to find out a bit more about what has been done already. So this is my first attempt to get a feeling for what’s been done with medieval data and what it might be possible to do.

I should note at this point that I’m drawing very heavily on the work of Johannes Preiser-Kapeller, especially his paper: “Visualising Communities: Möglichkeiten der Netzwerkanalyse und der relationalen Soziologie für die Erfassung und Analyse mittelalterlicher Gemeinschaften”. I found out about many of the projects I discuss from this paper, so I am grateful to him for providing such a primer. My focus is slightly different to his, however, as what I’m particularly interested in is the type of research questions that social network analysis might be used to answer, more than the details of particular projects.

Defining networks
One immediate problem in knowing where to look comes because the key mathematical tools and visualization techniques can be applied to very different kinds of data. The underlying concepts come mainly from graph theory. Wikipedia defines that as: “the study of graphs, which are mathematical structures used to model pairwise relations between objects from a certain collection. A “graph” in this context is a collection of “vertices” or “nodes” and a collection of edges that connect pairs of vertices. A graph may be undirected, meaning that there is no distinction between the two vertices associated with each edge, or its edges may be directed from one vertex to another.”

What that means is that you can use the same basic techniques to study anything from a road network via the structure of novels, to how infections spread through a population. But it also means that the type of network and how you can analyse it depends crucially on several factors. These include how you define a node and edge, whether all edges are the same (or whether you’re counting the connections between some pairs as somehow different/more important than others) and whether it’s a directed or undirected graph.
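
For anyone who finds code easier to read than definitions, the difference between a directed and an undirected graph can be shown in a few lines of Python. The node names and the "writes to" reading of the edges are invented for the example.

```python
def adjacency(edges, directed=False):
    """Map each node to the set of nodes it is connected to."""
    neighbours = {}
    for source, target in edges:
        neighbours.setdefault(source, set()).add(target)
        neighbours.setdefault(target, set())
        if not directed:
            neighbours[target].add(source)
    return neighbours


# Two edges: read as "A writes to B" and "B writes to C" when directed.
letters = [("A", "B"), ("B", "C")]

undirected = adjacency(letters)
directed = adjacency(letters, directed=True)
print(undirected["B"], directed["B"])
```

Treated as undirected, B is connected to both A and C; treated as directed, B only has a link going out to C, and C has no outgoing links at all.
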
The size of the network is also crucial, and that differs vastly between disciplines: it’s when you see a physicist commenting that “At best power law forms for small networks (and small to me means under a million nodes in this context) give a reasonable description or summary of fat tailed distributions” that you know that not all networks are the same kind of thing. One of the things that interests me when looking at projects is the extent to which data visualization is important in itself or whether the emphasis is on mathematical analysis of the underlying data.

Data quality

There are, inevitably, particular issues with data quality for medieval networks. The obvious one is whether the information you have is typical or whether the reasons for its survival bias our evidence excessively from the start. (The answer is almost certainly yes, but medievalists wouldn’t know how to cope if they had properly representative sources, so let’s move on rapidly).

Another big issue is identifying individual nodes. You can in theory have anything as nodes: an individual, a “family”, a manuscript, a place, a type of archaeological artefact, a gene, a unit of language. (I’m not going to look at either linguistic or genetic network analysis in what follows, but there are projects doing both of those). The problem with medieval data is that there’s almost always some uncertainty about identification: are two people the same or not? What do you do about unidentifiable places? How do you decide whether two people belong to the same family?

Then there’s the question of how you define a connection between two nodes. What makes two people connected to one another? The data you extract from the sources obviously depends on decisions made about this, but for a lot of medieval networks there’s the added complication that not all connections are made at the same time. If you have a modern social network where A connects with B and (simultaneously) B connects with C you can make certain deductions about the network from data about whether or not A and C are connected. If you have limited medieval data where A connects with B and 20 years later B connects with C, can you model that as one network, or do you have to take time-slices across the network (which may often reduce your available data set from small to pathetic)?

Varieties of projects
Of the medieval history projects I’ve come across so far (I suspect there’s a whole slew of others in fields such as archaeology), most seem to fall into three categories. There are studies on networks of traders, such as by Mike Burkhardt on the Hanse. There are probably other similar examples: I’ve not yet had a chance to investigate whether the important work by Avner Greif on traders in the Maghreb also uses network analysis or not. But these kinds of studies are unlikely to be relevant to any early medieval project, because they will almost certainly rely on relatively large-scale sets of data from a short chronological range (account-books, registers of traders etc). Such data sets simply don’t exist for the periods I’m interested in.

The other two types of medieval network studies I’ve noticed are ones which are looking at intellectual networks or the spread of ideas (with some possible overlap with spread of objects more generally) and ones using network analysis to study how a society operates (social network analysis in its most specific sense). For both of these, I’m aware of some early medieval studies and others that are potentially applicable to early medieval-style data. I’ll cover intellectual networks in this post (including a discussion of a recent IHR seminar) and then move onto social history uses of network analysis in the next post.

Intellectual networks/spread of ideas: example projects

1) Ego-networks
There are several forms that network analysis of intellectual networks can take. One obvious one is as a more quantitative version of what’s been done for many years (if not centuries): the study of “ego-networks”, the intellectual contacts that a particular individual has.

This is the basis for the study by Isabelle Rosé of Odo of Cluny (Rosé, Isabelle. “Reconstitution, représentation graphique et analyse des réseaux de pouvoir au haut Moyen Âge: Approche des pratiques sociales de l’aristocratie à partir de l’exemple d’Odon de Cluny († 942)”, Redes. Revista hispana para el análisis de redes sociales 21, no. 1 (2011)).

Rosé’s study isn’t strictly of just an ego-network, since she also tries to analyse the connections that Odo’s contacts had with each other in which Odo wasn’t involved, but the centre is clearly Odo. Rosé uses a mix of different sources (narrative and charters) to construct snapshots of Odo’s connections over time: she ends up with a PowerPoint slideshow showing the network for every year (available from here). She wanted to include a spatial dimension to the networks (showing where connections were formed), but couldn’t find a way of doing that.

Rosé’s account includes some useful detail about her methodology. The data she collected in Excel consisted of 2 people’s names, a type of connection and a direction for it, a source and start dates and end dates for the connection. She also codes individual nodes based on the person’s social function (monk, layman, king etc) and the aristocratic group they belong to (Bosonids etc); this is reflected in their colour and shape on her network diagrams.
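
To see how little structure that actually requires, here is a Python sketch of a record with the fields Rosé describes, plus a function that picks out the ties in force in a given year (the basis of her year-by-year slides). The field names and the rows are placeholders of mine, not her spreadsheet columns or her data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tie:
    person_a: str
    person_b: str
    kind: str       # type of connection, e.g. "kinship" or "fidelity"
    directed: bool  # True if the tie runs from person_a to person_b
    source: str     # where the tie is attested
    start: int      # first year of the connection
    end: int        # last year of the connection


def snapshot(ties, year):
    """Ties in force during the given year."""
    return [tie for tie in ties if tie.start <= year <= tie.end]


# Placeholder rows, not taken from Rosé's data.
ties = [
    Tie("person_1", "person_2", "kinship", False, "charter_x", 900, 930),
    Tie("person_1", "person_3", "fidelity", True, "narrative_y", 910, 915),
]

for year in (905, 912, 940):
    print(year, len(snapshot(ties, year)))
```

The social function and aristocratic group she records for each person would sit in a second table keyed on the person's name, which I have left out here.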

There are a lot of questions raised immediately about how such decisions are made (period of time allocated to a particular connection, how she decides on who counts as in one of the groups); all the kind of nitty-gritty that has to be sorted out for any particular project.

What does Rosé’s use of network analysis allow that a conventional analysis of how Odo’s social networks helped him couldn’t do? First, the data collection method encourages a systematic searching for all connections that an unstructured reading of the sources might miss. Secondly, the visualization of networks (especially as they change over time) gives an easy way of spotting patterns, allowing periodization of Odo’s career, for example. Thirdly, it’s possible to compare different sorts of tie, e.g. she shows that the kinship networks (whether actual or the fictive kinship of godparenthood) consist of a number of unconnected segments. But when you include ties of kinship and ties of fidelity, you do get a single network. Finally, Rosé uses a few formal network metrics to rank people by their centrality to the network (their importance to it) and their role as cut-points (people whose removal from the network would mean that there were disconnected segments of it).
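
Both metrics are easy to sketch in Python. I don't know which centrality measure or software Rosé used, so what follows is an assumption-laden illustration: centrality is simply the number of links a person has, cut-points are found by brute force (remove each person in turn and check whether the rest still hang together), the network is assumed to be connected to begin with, and the five-person network is invented.

```python
def build(edges):
    """Undirected network as a mapping from each person to their contacts."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def is_connected(graph, excluded=None):
    """True if everyone except `excluded` can still reach everyone else."""
    nodes = [n for n in graph if n != excluded]
    if not nodes:
        return True
    seen, stack = set(), [nodes[0]]
    while stack:
        node = stack.pop()
        if node in seen or node == excluded:
            continue
        seen.add(node)
        stack.extend(graph[node] - seen)
    return len(seen) == len(nodes)


def degree_centrality(graph):
    """Simplest centrality measure: how many contacts each person has."""
    return {node: len(links) for node, links in graph.items()}


def cut_points(graph):
    """People whose removal splits an initially connected network."""
    return sorted(n for n in graph if not is_connected(graph, excluded=n))


# Invented network of five people.
graph = build([("A", "B"), ("B", "C"), ("C", "D"), ("B", "D"), ("D", "E")])
print(degree_centrality(graph))
print(cut_points(graph))
```

In this toy network B and D are both the best connected and the only cut-points, but the two measures need not coincide: a person with only two links can be a cut-point if those links are the sole bridge between two groups.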

Apart from this restricted use of metrics, Rosé is mostly doing visualization and I suspect that many of her conclusions are confirmations of things that a conventional analysis of Odo’s social network without such complex data collection would have come up with anyhow: who Odo’s key connections were, the importance of the fact that right from the start Odo had connections to the Robertines and also the Guilhemides. But one of her most interesting comments was that analysis showed a move away from kings as central to social networks, which she connected to a move to “feudalism”. If we could find comparable data sets (and there are obvious problems in doing so), it’d be interesting to see whether kings outside France become non-central to reforming abbots in the same way.

2) Scale-free networks
There are a couple of articles I want to highlight which talk about scale-free medieval networks and which I want to discuss more for some of the difficulties they raise than the answers they’re coming up with. One is work that hasn’t yet been published, but has been publicised: analysis of the spread of heresy by Andrew Roach of Glasgow and Paul Ormerod. The other is Sindbæk, S.M. 2007. ‘The Small World of the Vikings. Networks in Early Medieval Communication and Exchange’, Norwegian Archaeological Review 40, 59-74, online.

But first, a very rough explanation of scale-free networks, which means introducing one or two basic mathematical/statistical ideas. The first is the degree of a node, the number of connections it has. The second is the distribution of these degrees, i.e. what percentage of nodes have 1 degree, 2 degrees, etc. Scale-free networks are ones where the degree distribution follows a power law: roughly speaking, you have a few very well-connected nodes and then a long tail of a lot of poorly-connected nodes.
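For anyone who finds code easier than prose, the two ideas can be sketched in a few lines of Python. The network here is invented (one hub with six contacts, two of whom also know each other), and a network this small can’t demonstrate a power law; it only shows how degrees and their distribution are counted.

```python
from collections import Counter


def degree_distribution(edges):
    """Map each degree to the share of nodes that have that degree."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    counts = Counter(degree.values())
    total = len(degree)
    return {k: counts[k] / total for k in sorted(counts)}


# Invented toy network: a single hub plus a few poorly-connected nodes.
edges = [("hub", f"n{i}") for i in range(1, 7)] + [("n1", "n2")]

for k, share in degree_distribution(edges).items():
    print(f"degree {k}: {share:.0%} of nodes")
```

Most of the nodes have one or two connections and a single node has six: the “few well-connected nodes and a long tail” shape in miniature.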

The crunch here is “roughly speaking”: there are all kinds of issues about whether any particular example really does represent the power law distributions that supposedly lie behind it. It’s a reminder that if we as historians do start doing more of this kind of work, we’re probably going to need some good mathematicians/statisticians behind us pointing out possible issues.

Without seeing the data, it’s impossible to tell whether Roach and Ormerod are accurate about medieval heresy spreading through such types of networks. But Søren Sindbæk’s paper on Viking trade suggests that the interest here isn’t strictly whether we’re talking about scale-free distributions or not. It’s a more general question about how the very localized societies within which the vast majority of medieval people lived could nevertheless allow the relatively rapid long-range spread of everything from unusual theological ideas to silver dirhams.

Søren’s main point is that there are two possible ways that such small-world networks can evolve: either you can have a few random links between two otherwise largely separate networks (weak-ties model) or you can have a few very well-connected nodes amid the otherwise very localised societies (“scale-free”). Which of these two ideal types of network you have considerably affects the robustness of the network. In a scale-free network, if one or two crucial hubs get destroyed by attackers, the whole network falls apart, but random attacks aren’t likely to have much effect. The weak-ties model is more vulnerable to a random attack (if a random link that ties two networks together happens to get severed). Søren tries to see which type of network best fits two very limited sets of data (one based on the Vita Anskari and one on archaeological data). The answer, not surprisingly, is “scale-free” networks.

I say the answer isn’t surprising because the medieval world is full of hierarchies of people and places, and some of the defining characteristics of those at the top of such hierarchies are that they move around more or they have connections to a lot more places. I found Søren’s paper mainly revealing in giving a feel of the numerical bounds for where simple visualization is a useful tool: a plot of 116 edges (see Fig 3) is already getting complex to visualise; one with 491 edges (see fig 4) almost impossible to take in by eye.

As for Roach and Ormerod, the fact that heresy was mainly spread through a small number of widespread travellers isn’t exactly news. We’ll have to wait and see whether they can provide something that gives a new dimension of analysis.

3) Six degrees of not-Alcuin
Finally for this post, I want to discuss an IHR seminar I heard back in May: Clare Woods from Duke University talking about “Ninth century networks: books, gifts, scholarly exchange”. Clare’s coming to intellectual history from a slightly different angle from Isabelle Rosé: she has been editing a collection of sermons by Hrabanus Maurus for Archbishop Haistulf of Mainz, and thinking about how to represent the relationship between manuscript witnesses visually (rather than just rely on verbal descriptions or stemma diagrams).

The point here is that manuscript stemmata can be thought of as directional networks between manuscripts, whose place of production can be located (more or less accurately). (There are also projects endeavouring to generate manuscript stemmata automatically, but I’m not discussing those at the moment). Clare is also using data from book dedications, known manuscript movements, and the evidence of medieval library catalogues.

Also in contrast to Rosé, Clare was interested in the possibility of getting beyond the spider’s web idea of intellectual history. i.e. that Hrabanus (or Odo) sits at the centre and everyone else revolves around him. This is a particular issue for Carolingian intellectual history because of Alcuin. We have by far more letters of Alcuin preserved than of any other Carolingian author (Hincmar probably comes second, but his letters still haven’t been edited properly), so if you use Rosé’s techniques you’re liable to end up overrating Alcuin’s significance vastly.

Clare’s main focus was on simple tools for visualizing this information, ideally in both its spatial and temporal dimensions. As I said above, Rosé was using Excel, PowerPoint and NetDraw and was finding problems in showing locations. Clare was using Google Maps for the spatial element, but thought she’d need Javascript (which she doesn’t know) to show changes over time. I have seen projects which use Google Maps and a timeline, such as the MGH Constitutiones timemap (click on Karte to follow how Charles IV, the fourteenth century Holy Roman Emperor, moved around his kingdom). I don’t know how that is made to work.

I’d be interested to know from more informed readers of the blog if there are such tools available that non-experts can use to produce geo-coded networks of this kind. Gephi seems to be popular free software for network analysis, and I’ve seen a reference to a plug-in for this which allows entering geo-coded data. The Guardian datablog recommends Google Fusion Tables.

But whatever software you have, there are the normal issues of data quality. There’s a particular problem with data coming from a very long timescale: in questions David Ganz wondered whether the evidence was getting contaminated by C12 copies (I wasn’t quite sure whether that’s just because there are so many manuscripts of all sorts from later). How do we know whether manuscript movements do reflect actual intellectual contacts, rather than just random accidents of them getting moved/displaced etc? Clare also discussed the problems of how you mapped a manuscript which came from “northern Italy”. Her response was to choose an arbitrary point in the region and use that – at the level of approximation and small number of data points she’s using, it’s not a major distortion.

The data sets for early medieval texts are always going to be tiny: having more than 100 manuscripts of one text from the whole of the Middle Ages is exceptional. (The largest transmission I know of is for Alcuin’s De virtutibus et vitiis of which we have around 140 copies). But Clare’s project does potentially offer the possibility of combining her data with other geo-referenced social networks to get an alternative and wider picture of intellectual connections in the Carolingian world. Combining data-sets is likely to lead to even more quality issues, but it does offer the possibility of building up new concepts of the Carolingian world module by module.

Who cares about history (unique identifiers edition)?

Over at A Corner of Tenth-Century Europe, a discussion that started off being about the trajectory of women’s history has mutated into one about why someone isn’t creating a system of unique identifiers for medieval texts. And while I’ve spent the last decade or so thinking about gender history, I’ve spent half my life thinking about databases and identifying references uniquely, because that is one of the things librarians do all day. So I wanted to start from Joan Vilaseca’s plea for “A public and standarized corpus of classical/ancient texts with external references to editions, versions, comments, articles, etc,etc.etc”, sketch out what I’m aware of as existing and explore why history seemingly can’t get its act together in the way that chemistry or taxonomy has.

There are actually some databases that do a fair amount of what Joan would want. As examples, there are:

1) Perseus Digital Library. This is a big and sophisticated free collection of classical texts, including some very neat tools, such as Greek and Latin word study tools (which I freely admit to using when I’m stumped on working out the root verb from a conjugated form). This doesn’t have identifying references, however.

2) Library of Latin texts. This commercial database includes the full text of the whole corpus of Latin literature up to the second century AD (essentially taken from the Teubner editions), plus a lot of patristic and medieval Latin (largely, but not entirely taken from the Corpus Christianorum series). Associated with this is the Clavis Patrum Latinorum which provides a numbered list of all Christian Latin texts from Tertullian to Bede. (There are similar indexes which cover Greek patristic texts, apocrypha, and early medieval French authors).

3) Thesaurus Linguae Graecae. This database includes most literary texts in Greek from Homer to the fall of the Byzantine Empire. It’s a subscription service, but it includes a free online canon database that provides unique identifying numbers for works and parts of works.

4) Bibliotheca Hagiographica Latina (BHL). This is a catalogue of ancient and medieval Latin hagiographical materials, produced by the Bollandists, which provides unique identifying numbers for different texts. There’s also a free online version.

5) Leuven Database of Ancient Books. This free database includes basic information on all literary texts preserved in manuscript from the fourth century BC to AD 800; the texts are assigned a unique number. (It’s a subset of the Trismegistos project which focuses on documents from Graeco-Roman Egypt, both literary and non-literary, and also provides identifying numbers).

What this very brief overview reflects is one basic fact: to produce a database and/or identifier system of any size takes time and money. As a result, there have to be enough people wanting the result to make it worthwhile making that investment. There are several different models for financing such projects: you can sell the resultant database (either for profit or at a break-even price), or you can persuade funding bodies to support you, or rely on charitable donations, but you need someone willing to pay.

It’s worth looking here to see why identifier projects in other fields have succeeded. A lot of large-scale identifier projects, for example, have come out of library science and publishing, both because these are huge and connected networks and because there’s the commercial driver of being able to identify something in your inventory quickly and accurately. So the Standard Book Number, developed for WH Smith in the 1960s, became the ISBN of today, followed in the 1970s by the ISSN for serials, etc. It’s noticeable that it took more than twenty years after unique identifiers for serials appeared for unique identifiers for individual articles within those serials to develop (the CrossRef project using DOIs). This wasn’t because no user ever wanted an individual article to read before then; it was because it was only with electronic journals that it became feasible to try and sell individual articles to people.

Most of the other really large-scale nomenclature/identifier projects have been in the sciences, for the simple reason that the same phenomena are being studied all over the world. We’re (mostly) looking at the same sky, hence the International Astronomical Union was formed in 1919. The International Union of Pure and Applied Chemistry, responsible for chemical nomenclature also dates from a similar period. (One of the other main systems of chemical nomenclature, the CAS Registry number is an offshoot of the subscription index/database Chemical Abstracts). Again, people are trying to do the same chemical reactions from Bombay to Los Angeles, so there’s a big demand for such systems. Biological classification has a very long history, dating back to Linnaeus (although unique identifiers are only just being developed), reacting to thousands of years of attempts to show how all species are related.

The classical/medieval database projects that I’ve mentioned above have essentially been possible because they have a sufficiently tightly-defined group of potential users who are all interested in the same sort of thing: classical literature or papyrology or hagiography. It’s therefore worth creating something for them to use. The problem with extending such a system to broader historical areas is that no-one cares about history.

That sounds ridiculous, but it’s a problem I’ve mentioned before: it’s not really clear that we’re doing the same thing as historians when we study vastly different periods and use completely different sorts of sources. Or to put it a different way, the Old Bailey database is a remarkable resource, but not of any professional use to me. I don’t care about all history, everywhere; I care specifically about early medieval European history. Historical sources, even just medieval sources, aren’t one thing, but a patchwork of different islands and most researchers spend most of their time perched securely on a few of these, rarely venturing off them. I’ve had years of being an early medievalist and never needed to cite Sawyer numbers, for example, because I don’t research or teach Anglo-Saxon history; I’d be almost equally baffled if I came across Corpus Iuris Canonici footnotes without the help of Edward Peters. The patchwork systems of identifying medieval documents remain because of the lack of overlap between the groups of researchers using them, and I can’t see any driving force that is going to change that. Crowd-sourcing has produced some remarkable things, but creating unique identifiers is a peculiarly ill-suited task for crowd-sourcing. Unless more people start caring about the history of everywhere at all times, Joan isn’t going to get the wide-ranging system he’d like.

Digital diplomatics 1: projects and possibilities

I am currently trying to get up to speed on some of the many projects involving charters online, drawing heavily on accounts from the Digital Diplomatics conferences (and also Jon Jarrett’s useful reports on the 2011 conference). I don’t claim to be an expert on charters, but I have been using (and sometimes developing) databases for 25 years, so some of the issues seem quite familiar from my experience as a librarian. What I want to do in this first post is give a sample of the types of project out there and also note what I consider to be some particularly interesting features.

It’s useful to start with a sketch of the origins of diplomatics (the study of charters) because that explains a lot about how digital developments have been shaped. The starting point was the attempts by early modernists to work out which charters of a particular religious institution were false and which were genuine. For this, the key ability was being able to compare charters with good evidence for being authentic (e.g. held as originals) to other more dubious versions. As a result, charter studies have often been organised either around particular collections/archives (e.g. editions of cartularies, charters of St Gall) or around rulers (e.g. the diplomas of Charles the Bald), because it’s easier to spot the dodgy stuff in a reasonably homogenous corpus.

Charters have also long been a key source for regional history, so eighteenth and nineteenth century scholars produced a lot of editions of regional collections of documents including charters, such as the Histoire générale de Languedoc. Where the corpus is small enough, these have then been extended to national collections or overviews, some of which I mention below.

From the purely print age, we have now, however, begun moving into digital diplomatics and there have been a variety of approaches.

1) Simple retro-digitisation
Because there’s been scholarly interest in diplomatics for several centuries, a lot of early editions are now out of copyright. Simple retro-digitisation of old editions doesn’t often get mentioned in discussions of digital diplomatics (though Georg Vogeler, “Digitale Urkundenbücher. Eine Bestandsaufnahme”, Archiv für Diplomatik, 56 (2010), pp. 363-392 has a useful discussion of them), but there are a lot of old charter editions being put online by projects such as Google, Internet Archive, Gallica etc. This data, however, is pretty hard for charter scholars to make use of unless they’re looking for a specific charter (or at most a specific edition). Is there any way in which this material could be dealt with more effectively?

Doing something with such data doesn’t strike me as a project that’s likely to be possible to fund (it’s not new and exciting enough). The most plausible way of organising it seems to me to be crowd-sourcing of OCR work on charter scans (or checking already OCR’d documents) along with adding some basic XML markup and then sticking them in a repository. Monasterium seems the obvious one to use. Whether there would be enough researchers interested in charters from more than one foundation to make the effort of doing this worthwhile, however, I’m not sure.

2) Databases based on the printed edition model
Printed editions of charters are normally either arranged chronologically or include a chronological index. (There are a few cartulary editions which don’t have this, and I have winced at having to look through hundreds of pages to spot if there are any Carolingian charters). The vast majority of printed editions also have indexes to personal names and place names. In contrast, content analysis of the charter is often fairly limited, in the form of headnotes plus a narrative introduction.

The indexes to printed charters, if they’re done properly, work pretty well for the needs of many people working with these sources. Or, to see it from a different angle, historians studying charters arrange their research into these kinds of categories. As a result, where such indexes don’t exist in the original edition, you’ll often find that someone creates them later (like Julius Schmincke doing an index to Dronke’s edition of the charters from Fulda).

A lot of charter databases are still essentially arranged around these traditional print access methods, with digitisation essentially adding (often fairly basic) full text search and remote access. Many of the online charter projects that have got furthest have been digitisations of relatively small and coherent existing charter collections, which have already been published in a single print series. There are several based on national collections, such as Sean Miller’s database of Anglo-Saxon charters, Diplomatarium Norvegicum and Diplomatarium Fennicum. There are also some regional charter databases of the same type (such as the Württembergische Urkundenbuch), and the early twentieth-century edition of the Cluny charters has also been put in a database. And then, of course, there’s the charters section of the digital Monumenta Germaniae Historica.

3) Aggregator databases
There are also a few charter database projects which are based on aggregating multiple printed editions: the two most important are Monasterium and Chartae Burgundiae Medii Aevi.

4) Born digital/hybrid editions
In contrast to the substantial projects of digitising existing editions, most of the born digital (or moved to digital) charter databases seem to be fairly small scale. The one exception I’ve found so far is Codice diplomatico della Lombardia Medievale which has now put over 5,000 Lombard charters from the eighth to twelfth century online.

5) Databases of originals
There is also a slightly separate strand of digital diplomatics research, which has focused on charters which are preserved in the originals (rather than as cartulary copies, etc). Some of these databases just include the text, others focus on images of charters. Projects include ARTEM and the (basic) database now attached to the Chartae Latinae Antiquiores publishing project. I’m also aware of several more image-focused projects, such as the Marburg Lichtbildarchiv, and Pergamo Online, which contains images of parchments preserved in Pergamo.

I’m not going to discuss the image databases in any detail, because they’re a very different kettle of fish to the textual databases I’m used to working with, but it is worth noting how decisions made on how much detail is recorded for original documents can be fairly arbitrary. As Georg Vogeler points out, there’s an odd division for the St Gall charters between the early stuff that gets put in horrendously expensive printed ChLA editions and the material from the eleventh century onwards that is available free via Monasterium.

6) Linguistic projects
I also won’t say much about charter database projects that focus on linguistic analysis of texts, such as Corpus der altdeutschen Originalurkunden bis zum Jahr 1300, Langscape and the work being done by people like Rosanna Sornicola and Timo Korkiakangas. While this is interesting work, it seems to me of less immediate relevance to most historians.

7) Factoid model
As Patrick Sahle put it in a recent paper (“Vorüberlegungen zur Portalbildung in der Urkundenforschung”, Digitale Diplomatik: Neue Technologien in der historischen Arbeit mit Urkunden. Archiv für Diplomatik, Schriftgeschichte, Siegel- und Wappenkunde, Beiheft 12, edited by Georg Vogeler (Cologne, Böhlau Verlag, 2009), 325-341 at p. 338), the object of diplomatic research is the individual charter. Most database projects are structured in a way that reflects this focus on the charter as a unit.

A contrast is given by the factoid model adopted by a number of KCL projects, such as the Prosopography of Anglo-Saxon England and what will shortly become the People of Medieval Scotland project. Here, the key unit is the factoid, a statement of the form: “Source S claims Agents X1, X2, X3 etc carried out Action A1 connected with Possessions/Places P1, P2 at date D1.” A charter (or another source) can thus be broken down into a number of factoids, allowing finer-grained access to the content of charters. Although this may not seem an obvious approach to considering charters (and there are a number of practical problems), it does match surprisingly well to the “Who, What, Where, When, How do we know” model that I’ve mentioned before as one approach to working with charters.
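A factoid of that form maps quite naturally onto a simple record. The Python sketch below is only my own illustration of the idea: the field names are hypothetical, the charter and its contents are invented, and it shouldn’t be taken as the schema the KCL projects actually use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Factoid:
    """'Source S claims Agents X1, X2... carried out Action A1 connected with P1, P2 at date D1.'"""
    source: str     # how do we know
    agents: tuple   # who
    action: str     # what
    places: tuple   # where (possessions/places)
    date: str       # when


# One invented charter broken down into two factoids.
factoids = [
    Factoid("Charter 1 (hypothetical)", ("Donor X", "Abbey Y"), "donation", ("Place P1",), "801"),
    Factoid("Charter 1 (hypothetical)", ("Witness Z",), "witnessing", ("Place P1",), "801"),
]


def factoids_for_agent(records, agent):
    """All the claims, from any source, in which a given agent appears."""
    return [f for f in records if agent in f.agents]


def factoids_for_source(records, source):
    """Reassemble the claims made by a single charter."""
    return [f for f in records if f.source == source]


print([f.action for f in factoids_for_agent(factoids, "Witness Z")])
print(len(factoids_for_source(factoids, "Charter 1 (hypothetical)")))
```

The contrast with a charter-as-unit database is that the person or the action, rather than the document, becomes the natural thing to search on; the charter can still be reassembled from its factoids via the source field.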

What works
As my overview suggests, there are already too many charter databases out there to make it easy to discuss them all in any more depth than “here’s another one that does X, Y and Z”. But there are some projects that seem to me to illuminate particularly important aspects of digital diplomatics:

1) DEEDS: full text done right
I’ve discussed before the problems of searching full-text databases of charters, but most projects don’t seem to respond to such problems. Instead they have very basic full-text facilities, and certainly nothing like the ability to use regular expressions that Jon Jarrett longs for.

The problem with regular expressions, of course, is that they still require an expert user. And as several generations of designers of library catalogues and other kinds of databases know, most users aren’t experts, and they don’t want to have to become so to be able to use your database. Even if you learn the right syntax, how do you know what spelling variations to try searching for before you’ve seen what might be lurking in the database? For example, if you know that the MGH edition of one of Charlemagne’s charters (DK 169) refers to a particular county as Drungaoe or Trungaoe, how on earth would it occur to you that the same charter in Monasterium would name the place as “Traungaev”?
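To show the difficulty in practice, here is the sort of pattern a reasonably careful user might write in Python after seeing the two MGH spellings. The pattern is my own guess at what such a user would try, not something taken from any of the databases.

```python
import re

# Built from the two spellings in the MGH edition: D or T, then "rungaoe".
pattern = re.compile(r"[DT]rungaoe")

spellings = ["Drungaoe", "Trungaoe", "Traungaev"]
for spelling in spellings:
    print(spelling, bool(pattern.fullmatch(spelling)))
```

The pattern finds both of the spellings it was built from and misses the Monasterium form entirely. You can only write the regular expression that catches “Traungaev” once you already know that “Traungaev” exists.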

DEEDS is the only project I’ve seen so far that has really sophisticated analytical tools for full-text. Its method of shingles, for example, is currently being applied to dating documents, but it strikes me as something that might also very usefully be applied to identifying particular formularies used by someone drawing up a charter. By breaking a document down in this way, you can analyse multiple factors suggesting that a document is “nearer” to one model than another in a way that’s simply not practical with manual methods.
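I haven’t seen the details of how DEEDS implements this, so what follows is only a generic Python sketch of the shingling idea: break each text into overlapping runs of words and measure how many runs two texts share. The shingle size of three words, the Jaccard measure of overlap and the second sample text are all my own assumptions.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given length."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Shared shingles divided by all distinct shingles across the two texts."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two openings using the same formula; the second name is invented.
doc_a = "Sciant presentes et futuri quod ego Iohannes"
doc_b = "Sciant presentes et futuri quod ego Willelmus"

similarity = jaccard(shingles(doc_a), shingles(doc_b))
print(f"{similarity:.2f}")
```

Here the two openings share four of their six distinct three-word shingles. Scaled up across a whole corpus, the same comparison would let you ask which model or dated group a given charter’s wording sits “nearest” to.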

Even more useful, potentially, is DEEDS’ use of normalisation. Their alternative spelling option makes their search engine cope with a lot of the more common issues in searching Latin. But the really interesting part to me was their discussion of using normalisation to produce phonetic proxies. This takes a phrase such as “Sciant presentes et futuri quod ego Iohannes de Halliwelle” and reduces it to “scnt prsnt cj futr cj eg iohns pr hall”, the bare sounds of the key terms. A full-text search facility with phonetic proxies as an option strikes me as one of the few ways that you might be able to produce something that could find you the multiple possible Latin spellings of the Traungau, without you needing to sit down for a week to work them out…
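As a very crude Python sketch of the general idea, here is a key that reduces a word to a rough consonant skeleton. It is emphatically not the DEEDS algorithm, which appears to do considerably more (the example above also swaps words like “et” and “de” for short codes); the particular letter merges are just my own guesses at what matters for Latin spelling.

```python
import re

VOWELS = set("aeiouy")
# Assumed equivalences: u/v and i/j are the same letters in Latin; d/t often swap.
MERGE = str.maketrans({"v": "u", "j": "i", "d": "t"})


def sound_key(word):
    """Reduce a word to a rough consonant skeleton for fuzzy matching."""
    word = word.lower().translate(MERGE)
    consonants = "".join(ch for ch in word if ch.isalpha() and ch not in VOWELS)
    # Collapse doubled consonants, so "ll" and "l" give the same key.
    return re.sub(r"(.)\1+", r"\1", consonants)


for spelling in ("Drungaoe", "Trungaoe", "Traungaev"):
    print(spelling, sound_key(spelling))
```

All three spellings of the Traungau come out with the same key, so a search on the key would find them together. The obvious cost is false matches: a key this blunt will also lump together words that have nothing to do with each other.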

2) ARTEM: bringing in the users
ARTEM, the database of French original charters before 1121, is far from being the biggest or the most sophisticated charter database around. Where the project has succeeded, however, is in getting researchers actually to use the database. There have been several conference publications based on its work, e.g. Marie-José Gasse-Grandjean and Benoît-Michel Tock, eds. Les actes comme expression du pouvoir au haut Moyen âge: actes de la table ronde de Nancy, 26-27 novembre 1999. Atelier de recherches sur les textes médiévaux, 5. (Turnhout, Brepols, 2003).

What I’m not yet sure of is why ARTEM have been more successful than comparable projects in getting other scholars involved. Is it because they’ve been going longer, that they’re more pro-active in arranging roundtables, or is it because France has a weird early medieval charter distribution, with a large number of relatively small collections of charters, and thus researchers desperately need a multi-archive database?

3) Monasterium: Charters 2.0
Monasterium.net describes itself as a “collaborative archive” and it’s the only project I’m so far aware of that takes the idea of user participation seriously. As well as providing tools for working with and annotating individual charters (which I haven’t yet had the chance to try out), it’s also intended to provide a distributed infrastructure into which individual archives from across Europe can add their material. As a means for getting later medieval charters available online, especially for smaller archives, it looks ideal. In terms of data quantity and quality, however, it’s liable to the patchiness inherent to large-scale collaborative projects: some areas get very well-covered, some don’t get referred to at all.

4) CBMA: blending old and new
Chartae Burgundiae Medii Aevi isn’t unusual in its scope: it’s aiming to put online the 15,000 charters from the region of Burgundy. What’s more unusual is its methods: it’s putting online both old editions and previously unedited cartularies. There are obvious issues here about whether they can get data consistency, but potentially it seems more practical to start with existing editions (however imperfect) and “grow” a database using them, than to wait for funding to re-edit everything from scratch.

5) Cathalaunia.org: DIY databases
All the databases I’ve discussed so far have been major research projects. However, Cathalaunia.org, created by Joan Vilaseca, shows the possibility for a dedicated individual to produce their own web-based charter database, using easily available tools.
Joan uses a wiki format, which for the relatively small number of documents he has provides a neat way of showing links between people and places. The unstructured nature of the data may make it harder to search, but it also means that different genres of documents (not just charters, but hagiography etc) can be incorporated easily. It’s a useful reminder that charter information doesn’t have to be stored in relational databases. (For another example of this minimalist approach, see Project FAST, which is putting a Florentine archive online).

Cathalaunia.org also raises an interesting point about audiences and the accessibility of charter databases. The site is in Catalan, which makes it far more suitable for what I presume is Joan’s main audience, people interested in the history of their own region. But for those of us who aren’t Catalans (and don’t specialise in its history) the use of a relatively uncommon language is a disadvantage.

Preliminary conclusions
The databases I’ve so far read about or seen prove that there are lots of interesting projects going on, but I do slightly wonder if there’s too much variety. Different audiences and different aims can explain some of the variants, but I think maybe we need to start adapting more systematically from previous projects. I can see the components of really effective databases in some projects, but so far they’re not being pulled together into something that properly builds on the pioneering work. So, I finish with a question for the more experienced users: what do you like from particular charter database sites? What should the Charlemagne project be stealing from other projects?

By my own free uill I have zold and zell this to gou: on the full text of charters

This post is inspired by three things: a recent IHR paper given by Rosanna Sornicola, a paper given at the International Medieval Congress in 2011 by Peter Stokes of KCL and some of the comments on a previous post of mine about charters. It aims to ask a deceptively simple question: what do we mean by the full text of a charter?

To start with Rosanna’s paper, it was entitled “What the legal documents of the early middle ages can tell us about language: the case of 9th- and 10th-century charters from Southern Italy”, and was pretty much as it said in the title. She’s a professor of linguistics interested in the development of the Romance vernaculars out of Latin. It’s a question that’s been debated for more than a century, but the answers that are being suggested now are far more complicated than a simple change between two languages. Most of the models now are of the coexistence of Latin and the vernaculars, diglossia, with the locus of change not the language per se but the social groups who used a particular register of language. There was no unitary route between Latin and the vernaculars, but many different routes.

Rosanna was exploring one particular context for such change: southern Italy in the ninth and tenth century. Less attention has been paid to linguistic change there than in France or Spain, but it presents an interesting contrast. Unlike other areas, there isn’t the same cultural break as with the Merovingians in France or the Lombards in northern Italy, with the arrival of an essentially illiterate ruling class. Naples and Amalfi, in particular, had a rich and relatively autonomous cultural life. They were also much less influenced by Carolingian cultural reforms, which have sometimes been claimed to be key to developments elsewhere.

Instead, Rosanna was arguing for the persistence of late antique forms of Latin in the south, but this is a late antique Latin that is already substantially changed from ideas of “classical Latin”. The proliferation of the accusative in prepositional phrases, for example, such as “una cum alias terras meas”, is already visible in Pompeii graffiti and Ravenna papyri, as are plurals such as “campora” (fields).

Rosanna went on to discuss various other syntactic forms visible in the charter corpus: I think many of the examples may have been more striking to those whose Latin is better than mine to start with. But there was one particular quotation in her handout I want to give. It’s from a charter from Gaeta in 918 (CodCajet 1, XXIV, 43), where someone states:
“mea boluntatem bendidisse et bendidit bobis” (By my own free uill I have zold and zell this to gou).

My translation isn’t accurate, of course, but that’s the whole point. How do you translate something that’s lurking uneasily between Latin and something else like that? And what on earth can you do with free text spelt like that? For Rosanna’s purposes it’s ideal. For anyone who’s trying to track down all documents about sales, it’s a massive problem.

Which is where we backtrack six months to Peter Stokes talking about Anglo-Saxon Cluster and the problem of integrating different ideas of what a charter is. There’s already been a slightly bad-tempered post about this paper from Jon Jarrett, who I think for once got distracted from the key point. Which is that a lot of the difficulty of integrating four projects all talking about the same documents is that the charters can be conceptualised in very different ways.

What is a charter in terms of these projects’ focus?

1) In ESawyer it’s a document, with the main point being creating an index to help locate it and discussions of it.

2) In ASChart it’s a text (a string of words) with a date. (It’s worth noting here that this is specifically said to be a pilot project and to focus on marking up texts with XML, so it was not intended to be a replacement/equivalent of Sean Miller’s useful database).

3) In PASE a charter is a source, a set of factoids (X did Y). In fact it’s the old game of gutting sources for snippets of information.

4) Finally in Langscape a charter is a unique document (every manuscript is a different version, there’s no critical edition).

All this is reflected in very different attitudes to what form any “full text” included in the project takes. ESawyer includes for many records (but not all) the text of charters, taken from several different editions. ASChart, as already mentioned, includes (non-searchable) full text with certain sections (such as dispositive words) marked up. PASE doesn’t include the full text of charters, but does, in theory, include all the main data points from them. Finally, Langscape includes three different versions of each text: semi-diplomatic, edited (i.e. broken up into lexical units for analysis) and glossed (provided with a headword and translation).

So when we talk about a database including the full text of a charter, we’re potentially thinking about very different things, with varying amounts of editorial intervention. First of all, there’s the question of whether you’re editing the material from scratch (which is very time-consuming), or relying on existing editions, which may not be consistent (especially with large corpuses). Secondly there’s the possibility of using XML mark-up to highlight particular sections. Finally there’s the possibility of full-text search.

What Rosanna’s paper strongly suggested to me is that full-text search is something of a red herring in most cases. Short of the kind of extreme editing that Langscape includes, I can’t see how you can often find things reliably in texts where the spelling is so erratic. This is going way beyond problems of Latin stemming (which have been researched for at least 25 years). Full text search is only really likely to work effectively where you’ve got fairly standardised Latin AND consistent editorial practices. Or possibly for individual words/phrases which are sufficiently distinctive and not spelled in too many alternative ways: you might be able to find most examples of “friskingas” (suckling pigs) in a database of charters, for example, if you sit down and check half a dozen similar words. But I don’t see that you’re going to get very far trying to pick out sales, for example. And I was recently staring at a transcription of a St Gall charter for some while in bemusement before I worked out that “drado” meant someone was going to hand over (trado) some property.
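To make the scale of the problem concrete, here is a minimal sketch in Python of the kind of crude letter-folding a search tool might try. The substitution table (b for v, d for t and so on) is my own illustrative assumption based on the examples above, not an established normalisation scheme, and the function names are hypothetical.

```python
# Hypothetical folding table: each entry collapses two spellings that the
# charter examples above seem to treat as interchangeable.
FOLD = str.maketrans({"b": "v", "u": "v", "d": "t", "j": "i"})


def fold(word: str) -> str:
    """Reduce a word to a rough search key by folding confusable letters."""
    return word.lower().translate(FOLD)


def matches(query: str, text: str) -> list[str]:
    """Return the words in text whose folded form starts with the folded query."""
    key = fold(query)
    return [w for w in text.split() if fold(w).startswith(key)]


# "drado" and "trado" now share a key, and a search on "vend" picks up
# both verb forms in the Gaeta quotation.
print(fold("drado") == fold("trado"))
print(matches("vend", "mea boluntatem bendidisse et bendidit bobis"))
```

The catch is that folding this aggressively also merges words that really are different (every “d” becomes a “t”, so “do” and “to” collide), which is exactly why I doubt it gets you reliably to all the sales.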

Similarly, ASChart is, to my mind, an interesting exercise in showing that XML mark-up of a charter in terms of its diplomatic doesn’t really get you a whole heap further in its study (which may be the reason it didn’t get beyond the pilot project stage). It’s possible to use it to pull out a list of invocations, for example, but you get something that isn’t easily scalable to large collections, because so many invocations are marginally distinctive. There’s not a substantial difference, for example, between starting a charter “In nomine Domini nostri Iesu Christi mundi saluatoris” and “In nomine Domini nostri Iesu Christi saluatoris mundi”, but I can’t see how you can easily find an algorithm that would automatically conflate phrases that are “similar” in this way.
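For what it’s worth, the pure word-order case can be handled mechanically. The sketch below (Python, with function names of my own invention) keys each invocation by its set of words, so the two phrases above come out as identical. It is only a sketch of one narrow kind of “similarity”: it does nothing for spelling variants, abbreviations or an extra adjective, which is where the real difficulty lies.

```python
def word_key(phrase: str) -> frozenset[str]:
    """Key a phrase by its set of lowercased words, ignoring order and repeats."""
    return frozenset(phrase.lower().split())


def jaccard(a: str, b: str) -> float:
    """Share of distinct words two phrases have in common (1.0 means same words)."""
    wa, wb = word_key(a), word_key(b)
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0


first = "In nomine Domini nostri Iesu Christi mundi saluatoris"
second = "In nomine Domini nostri Iesu Christi saluatoris mundi"

print(word_key(first) == word_key(second))
print(jaccard(first, "In nomine Dei"))
```

A threshold on the second function would let you group “nearly the same” invocations, but choosing that threshold is itself an editorial judgement.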

What, in theory, might be more helpful is using XML mark-up combined with full-text search, so that you search only in the dispositive words, say, for “vendo” or variants thereof. But I’m not yet convinced that with the kind of variability you have in early medieval charters, you would really end up saving enough of the users’ time to justify all the work of tagging this data in the first place. I’d be interested to hear from people who work more on diplomatic on this point – what do you think XML might do for you?
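As a rough illustration of what that combination would look like, here is a minimal Python sketch using the standard library’s XML parser. The element names (`charter`, `dispositive`, `witnesses`) are hypothetical and are not taken from ASChart or any other project’s schema, and the sample document is an invented toy, not a transcription.

```python
import xml.etree.ElementTree as ET

# Invented toy markup: only the dispositive line echoes the Gaeta quotation.
SAMPLE = """<charter id="example">
  <invocation>In nomine Domini</invocation>
  <dispositive>mea boluntatem bendidisse et bendidit bobis</dispositive>
  <witnesses>signum manus venditoris</witnesses>
</charter>"""


def search_section(xml_text: str, tag: str, variants: set[str]) -> list[str]:
    """Return words inside <tag> elements that start with any listed variant."""
    root = ET.fromstring(xml_text)
    hits = []
    for element in root.iter(tag):
        words = "".join(element.itertext()).lower().split()
        hits.extend(w for w in words if any(w.startswith(v) for v in variants))
    return hits


# Restricting the search to the dispositive section skips the witness list.
print(search_section(SAMPLE, "dispositive", {"vend", "bend"}))
```

Note that the list of variants still has to be supplied by hand, so the tagging narrows where you look without solving the spelling problem.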

I said in discussing the Making of Charlemagne’s Europe project I’m now working on that we’re not providing the full text of the charters. It’s more accurate to say that we won’t be systematically providing the full text of them – we’ll link to the full text online, where it’s freely available, and provide references to printed sources otherwise (much as PASE does). The hope is that this gives users most of what they need, without the additional expense of either licensing full text from previous editions (it’s interesting to note that some publishers are now republishing nineteenth-century cartularies) or having to spend large amounts of time scanning/OCRing material. But it’s fair to say that I’m starting to realise how much more there is to the “full text” of a charter than at first meets the eye.

Making charters useful

I finished at the Fitzwilliam Museum at the end of December and started a new job last week: as Postdoctoral Research Associate on the new King’s College London project The Making of Charlemagne’s Europe: 768-814. Officially the project is intended to create a database of the surviving documentary evidence from Charlemagne’s reign. Unofficially, I see it as a project to make charters useful.

There are a lot of people, of course, who already find early medieval charters very useful. If you’re doing regional studies (of e.g. Catalonia or Alsace or Brittany), charters are essential evidence. But if you’re doing a study that isn’t regionally focused in this way, then frankly charters are less ideal, because there are just too damn many of them. There are around 4,500 documents for Charlemagne’s reign alone. How do you find the ones that actually provide relevant information for your purposes?

This is why, potentially, our database will come in very handy, especially since it’s being designed by people who have considerable experience of previous similar database projects, such as Prosopography of Anglo-Saxon England (PASE) and Paradox of Medieval Scotland (POMS). The prosopographical side is thus very well-covered. However, the plan is to have more: both mapping facilities and statistical analysis. We’re not providing the full text of charters, but we will be providing structured data of various kinds. So one of the questions we need to ask right at the start is what information do researchers actually want to get out of the corpus of charters that they can’t get currently? Asking this question among the readers of this blog seems as good a place as any to start. I know you’re not all Carolingianists, but a lot of you will have worked with charters or bulk data of some kind. What research questions interest you for which such a database might be a help?

What follows is my first very rough list of possible research areas. All comments welcome; if you know of work that’s already been done, or if I’ve missed out something, please add it in. I’m still at the brainstorming stage at this point, and this post reflects this.

1) Studies on literacy
Graham Barrett is another researcher on the project, so this angle may be fairly well-covered anyhow. He’s already done studies with later Spanish charters, looking, for example, at affiliations of scribes and the number of documents that particular scribes wrote. This immediately ties into research questions about the professionalism of scribes, and the extent of lay literacy.

I also wonder whether we should make a special note of charters that include references to books as property, so we can get a picture of where they are mentioned.

2) Family/women
Most of the detailed studies of families will obviously be done on a regional basis. But the prosopographical side of the database will enable us to create biographies of individuals/families who have a transregional activity. What I’m not yet sure is what kind of data it would be useful to produce on such people. Given the strong spatial emphasis in the database, would it be useful to be able to map the activities of not just an individual, but a group of them?

One of the things we are definitely going to do is give the sex of every individual mentioned, which immediately makes possible a lot of the analysis about women’s land-holding etc. (It takes under a minute to dig out the 48 female witnesses from the POMS database, for example).

I think we need to have some kind of record of relatives being prayed for, though I’m not yet sure in how much detail. But this ties in usefully with discussions about which relatives “counted” in which situations.

It’d be nice to use charters for getting demographic data about families, as well, but that may be unrealistic. Has anyone seen this sort of thing done successfully?

3) Ethnicity
Despite all the problems with questions of ethnicity, it’s still interesting to see how the charters reflect this. We will probably be drawing on the work of Nomen et gens as far as ethnicity of personal names is concerned; what might also be useful to note is if specific ethnic terminology is used in charters to refer to people.

4) Legal practice
This is an area I know less about, so if anyone knows who’s doing interesting work on this, it’d be a help to know. My immediate thoughts for things it would be useful to record are number of witnesses to a document (so you could, say, pull out documents with fewer than the six witnesses Alemannic law said you were supposed to have) and references to law/laws within the charter (whether specific or general).

5) Monasticism
One useful piece of information would be to know how the collective membership of particular religious communities is described – are they “monachi” or “deo sacrata” or what? It’d be particularly interesting to learn more about references to canons/canonesses.

It’ll be possible to break down charters by date and region, so we can potentially get comparative data on the well-known idea of “waves of pious giving” – how long do people keep on making large donations to churches/monasteries after they’ve been founded?

I don’t know if early Carolingian charters have enough boundary clauses to make this work, but Barbara Rosenwein’s classic study of Cluny was collecting data on the extent to which a donated piece of land was adjacent to Cluny’s property already, which allowed seeing monastic land-acquiring strategies and how literally “being the neighbour of St Peter” was meant.

Looking at statistics for proportions of donations versus precaria for different monasteries/regions also contributes to the whole debate about pragmatic versus spiritual rewards for donors (which I always associate with Rosenwein on Cluny versus John Nightingale on Gorze). I also wonder whether there is any way of flagging up people who make donations to more than one foundation, given these may form particularly interesting test cases for studying how patronage decisions were made.

6) Military history
One of the questions we’re trying to work out at the moment is how much detail we go into about renders. Possibly we will just have a general term for animal renders, given the trade-off between precision in recording and time taken. But I do wonder if we should treat references to renders in horses separately, given their military importance. Any thoughts?

7) Price information
This is again an issue of how much detail we can put in without the project over-running, but how useful would it be to note if there are references to values in coinage? Wendy Davies did some promising studies on this for Spain.

8) Political history
One of the most useful possibilities that the mapping side of the project potentially allows us to explore is the nature of the Carolingian county. The arguments about “flat counties” versus “scattered counties” have been going on for decades: if we input the data right, we can explore in detail the geographical relationships that the sources themselves choose to mention.

It will also be useful to be able to map and contrast royal interventions between regions; while the data from royal charters is probably limited enough that this could be done manually, this project will potentially also allow us a transregional view of royal missi and vassi.

9) Social structure
Chris Wickham, in particular, has used charters from a number of regions for the comparative study of social structures, but of necessity, such work has normally drawn on syntheses of studies of a few locations. Potentially, this database allows wider comparisons, though both potential approaches to categorising social levels have their problems. The first possibility is using explicit references to office and social status within the charters: although there are problems in comparing these across the regions, they are potentially soluble. Perhaps even more intriguing is whether a social classification could be developed based on activity-derived status. In other words, could we find a way to mark all those who made more than a dozen donations, or witnessed over a geographical range of more than 10 miles, etc? This might show to what extent influential people exist who don’t obviously hold office or get called “nobilis” etc.

10) Rural and landscape history
Again, this is an area where bulk comparative data is potentially useful, but we have to work out how much detail we can go into, especially for landscape features in charters. Should these be regarded as purely conventional and excluded or are some of them worth listing specifically? I’m inclined to think it’s worth mentioning mills, but not huts, for example.

Those, for now, are my ideas of what we might do with our data, given the limitation I’ve already mentioned, that we’re not going to have the full text of charters. Any obvious suggestions that I’ve overlooked will be gratefully received.
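To give a sense of what “structured data” queries of this kind might look like, here is a minimal Python sketch covering two of the ideas above: pulling out documents with fewer than six witnesses, and flagging people who made more than a dozen donations. The record layout and field names are purely hypothetical and say nothing about how the project’s database will actually be designed.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Charter:
    ref: str          # hypothetical identifier for the document
    transaction: str  # e.g. "donation" or "sale"
    actor: str        # the principal donor or seller
    witnesses: list[str] = field(default_factory=list)


def under_witnessed(charters: list[Charter], minimum: int = 6) -> list[str]:
    """References of charters with fewer witnesses than the given minimum."""
    return [c.ref for c in charters if len(c.witnesses) < minimum]


def frequent_donors(charters: list[Charter], threshold: int = 12) -> list[str]:
    """Names of people recorded as making more than `threshold` donations."""
    counts = Counter(c.actor for c in charters if c.transaction == "donation")
    return sorted(name for name, n in counts.items() if n > threshold)
```

Neither query needs the full text, but both depend entirely on how consistently the witnesses and transaction types were coded in the first place.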

Things 14 and 15 of 23: LibraryThing or LibraryUserThing?

The most recent task for the Cambridge 23 Thingers has been to look at LibraryThing. I’ve been vaguely aware of this site for some time, and it does include one of the neatest tricks for data mining usage patterns I’ve seen, the UnSuggester. This works best if you enter fairly well-known novels, and goes beyond simple high-culture low-culture divides, to tell you that if, like me, you like Dorothy L Sayers’s ‘Gaudy Night’ you should not just skip the work of Chuck Palahniuk (which may be obvious), but also Paulo Coelho and Jodi Picoult (which is less obvious). But it was only 23 Things that actually prompted me to get stuck in and try other aspects of LibraryThing.

So my personal library, or at least a very small part of it, is now up there. I decided to give LibraryThing a little test by including a range of my academic books, including some with no ISBNs and one foreign text (Regine Le Jan’s Famille et Pouvoir). Overall, it didn’t do badly. I found everything on there already (including Regine’s book), and though most of my books were fairly unusual, a surprising number of people have Kennedy’s Revised Latin Primer. The biggest problem is finding multiple variants when you haven’t got an ISBN, such as with my 1955 Penguin Classics version of Bede. The fact that it’s possible to edit details of any of the books you’ve found is handy, but also sends shivers up my cataloguer’s spine (because it affects records globally).

What happens when you do have your (personal) collection of books on LibraryThing? From a quick look, the site is a lot shorter on reviews than I had expected. For example, there are no user reviews for Rosamond McKitterick’s Charlemagne: the formation of a European identity, while there are (completely different) ones on both Amazon US and Amazon UK.

LibraryThing’s tags are not much use, but the recommendations aren’t at all bad. Take, for example, J W C Wand, A History of the early church to AD 500. The tags are “a4, christianity, church history, churches, early church, early church history, ecclesiology, history, roman empire”, which tell you little that isn’t already in the title. But the recommendations include the Penguin translation of Eusebius and another volume of translated sources from the early church. (I’m not convinced, however, that the recommendations are better than Amazon’s ‘Customers who bought this item also bought’). It’s also noticeable that even with only 13 books added, the ‘members with your books’ feature has already highlighted as similar both Another Damned Medievalist and Curt Emanuel, both of whom I already knew from their medievalist blogs.

There are also some nifty book-centred tools on the site. The links from each title to Google Books are potentially very handy. The site is also strong on search links to both booksellers and libraries (slightly ironically, it may thus be most useful for books that aren’t actually yours).

So would I find it useful personally? Not for my own historical work. I need reference management software (I currently use EndNote) which can deal with journal articles as well as books, which is under my complete bibliographical control, and which allows me to format the output in multiple ways. Even the Google Books link feature (which does look useful) could potentially be replicated via the ability to set up personalised libraries on Google Books.

For about 90% of the time, meanwhile, I don’t need to have records of the other books I own – I can remember them or find them on the shelves. The one time it would come in useful would be in a bookshop. If I had a mobile phone I could surf the web on (I currently don’t), I could check exactly which works of Margery Allingham or Terry Pratchett I had before purchasing another (and so could any other relatives who want to give me books).

In contrast, LibraryThing does make a lot of sense for many amateurs (and I mean that term in a positive sense, for people who love books and reading). It’s far simpler to use than setting up your own database, and it also provides suggestions and the chance to connect with like-minded people. If you’re a public librarian working in reader development I can see it being an important tool. And it would also be handy for small libraries that can’t afford library software. (I once did a library catalogue for a small firm of solicitors in Excel, because that was what they had).

For larger libraries, I’m a lot less convinced that it’s useful. I had a look at the catalogue of the San Francisco State University Library which includes tag clouds generated from LibraryThing. Here are a couple of screen shots pulled from a search on Chris Wickham’s books:

[Screenshot: Thing 14 no 1]

The tag cloud for ‘Early medieval Italy’ had 13 terms, and the vast majority of them are either superfluous (‘history’, ‘early medieval’, ‘Italy’) or too general to be useful (‘power’, ‘sociology’). Only two (‘barbarians’ and ‘late antiquity’) actually add useful information.

[Screenshot: Thing 14 no 2]

‘The mountains and the city: the Tuscan Apennines in the early Middle Ages’ has one tag: ‘early middle ages’.

If you’re reserving a place on the catalogue page for tags and thus adding to both processing time and onscreen ‘clutter’, you need to be getting something useful from this. I’m really not convinced that for academic non-fiction you do.

Some libraries have also put their new books up on LibraryThing, for example, Nuffield College, Oxford. My question is whether the extra work involved is worth it? Given that Nuffield already have a new books list on their own site, who is their target user on LibraryThing? Who wants to know about the new books they get, but is not sufficiently aware of them as a library already to visit their site? I presume that much of the process of adding new acquisitions to LibraryThing can be automated, with a string of ISBNs passed to the site, but I suspect not every new item in many academic libraries would have an ISBN, and it would still be quite a lot of work setting up the feed initially. For some kinds of library and library user, I can see the big attraction of LibraryThing. For academics and academic libraries, I’m much less convinced so far.

IMC 3: things to do with charters before you’re dead

Blogging the International Medieval Congress is itself increasingly historical in one sense: nowadays, you can get a range of reports on several of the key sessions, all written by historians with their own biases and agendas, and the attentive reader can try and reconstruct the event from multiple perspectives. In this spirit, I will rashly give you my thoughts on a couple of sessions that a fellow blogger organised on ‘Problems and possibilities of early medieval diplomatic’. Jon will doubtless give us a more informed take in time, but he is coming from the viewpoint that charters are intrinsically interesting, while I…am not.

I think I got turned off charters doing my MPhil at Cambridge, when I realised that there were volumes and volumes of Carolingian royal charters, none of which had been translated. Given that every project involving charters suggested to me seemed to involve reading dozens of them, I decided instead to focus my shaky translating ability on things that gave more immediate results. (OK, I realise now that charters can be read fairly quickly once you’ve got used to them, but I didn’t know that then).

Ever since, I have been gradually forced to admit that, actually, charters are very useful for studying all kinds of phenomena, and Jon’s IMC sessions this year gave a very good spread of both the kind of things you can study with charters, and even more interestingly, the scale you can work on.

At the most local scale, there was Jon’s own paper on St Pere de Casserres, a monastery in Catalonia. He was focusing on the oldest original charter, which was showing fictive sales to the monastery. How do we know they were fictive sales? Because some of the properties had already been transferred earlier to the founder of the monastery, Viscountess Ermentrude (Ermetruit) of Osona/Ausona. What case studies like this can give us is some feel for the texture of local power: for example, how new ‘histories’ are created (all the numerous people whose names appeared in this first charter were complicit in its fiction) and how power relationships worked (the early importance of viscountesses in Catalonia is very interesting).

Also on a local scale and focusing on Spain, but looking at a very different aspect of charters, was Wendy Davies’ paper on ‘Local priests in Northern Spain in the tenth century’, which was using charters to look at priests’ education. In fact she was focusing on one formula within a charter (nullius cogentis imperio/nullius quoque gentis imperio) and its multiple variants. It says much about Wendy’s near hypnotic scholarly force that not only was twenty minutes on one formula fascinating, but her wish to write a whole book on such language analysis seemed entirely reasonable (though I seem to remember she admitted it probably wouldn’t be publishable). Looking at these formula variations, Wendy saw different preferences between micro-regions, areas around particular cities that preferred one version of the formula, as well as individual preferences of some priests. Her analysis of charter-writing also showed different kinds of priest-scribes – some who were following aristocrats around, some who worked only in one location, writing for people who were probably peasants, because of the small scale of these exchanges. From this incredibly detailed study of charters, she can thus build up a picture of the background of these members of a purely local elite, far below the social level that other early medieval sources normally deal with.

At a larger scale, Julie Hofmann from Shenandoah University was looking at women’s participation in patronage at Fulda (and hoping to expand this to other Carolingian monasteries east of the Rhine). A fair chunk of the paper was showing how hard it is to spot distinctive trends in women’s activities, when charters mentioning women are a relatively small proportion of a charter corpus that itself is changing over time. For example, how significant is a drop off in women’s charters, when there’s also a general decline in charters after the reign of Charlemagne? And could the overall figures be distorted by a few untypical families, such as one prominent Mainz magnate family which had no surviving sons?

One difference Julie thought she could detect was that women were less likely than men to witness their own donations. (I think I remember this correctly, but my notes are a bit sketchy at this point). The problem is determining the significance of this, which means trying to look at when men do or don’t witness their own donations, and there aren’t any clear answers yet. Work on women in early medieval charters has been very much neglected since the early attempts at statistical analysis by David Herlihy, Suzanne Wemple and the like, so this kind of charter analysis potentially offers an important new avenue to looking at Carolingian women’s history. Whether we are going to see consistent gendered patterns in the diplomatic, I’m still not sure, but after all, gender analysis is about similarities as well as differences.

On a national scale, we had Erik Niblaeus on how the Cistercians brought charters to Sweden, which had somehow managed to survive without them until the 1160s. Charters offer a useful approach to looking at the ‘Europeanization’ of northern and central Europe, and certainly provide evidence for it at a textual level. As Michael Clanchy commented, looking at one of Erik’s images, you wouldn’t be able to tell it in style from a charter from almost anywhere else in Western Europe. Despite Erik’s title referring to the ‘Import of a Political Culture’, however, he wasn’t sure that charters could be connected to political institutions, because there was so little other evidence for them from the period. Instead, charters add ‘reassuring mystery and complication’ to our knowledge. (Erik thus shows himself firmly in the John Gillingham tradition of applauding the increase of uncertainty in historical scholarship).

Lastly (though actually the first paper of all) we had the global vision of Georg Vogeler, one of the people working on the Monasterium project, talking about this and other projects to get charter corpora on the web. The possibilities are substantial. Rather than the handful of images that traditional charter editions have included, you can in theory have images of all the charters. You can have access from anywhere in the world and there are new possibilities for rapid textual analysis. Georg gave an example about looking at vernacular dating clauses in German charters and being able to explore regional differences over time. Diplomatic differences that previously might only be spotted by an expert after half a lifetime can be explored within a week or two.

Of course, the full effect is going to take a long time coming, and the other papers showed how different researchers want different things. Jon’s work, and to a certain extent Erik’s, involved careful analysis of the specific physical form of charters, which needs high-resolution images. Wendy’s work requires full text (with non-normalized spellings). For the kind of larger-scale statistical analysis which Julie was interested in, in contrast, she didn’t really want the text of charters, so much as standardized data from them (she’d constructed her own database to store such data). In theory, people could code full-text to mark such key sections (as the Charters Encoding Initiative is thinking about), but it would still be an enormous amount of work. Georg said there are projects working on issues like automatic tagging of names, which might reduce some of these problems.

If we could get something like this working on a large scale, I think all kinds of new research areas would be opened up. For example, it strikes me as having great potential for socio-economic history. If you can relatively easily pull out charters referring to slaves or vineyards or mills etc, you can build up a selection of sources that you’d never have time to explore otherwise. Similarly, I once found a mention in a Freising charter about a woman serving at the royal court – it might be possible to find more evidence about that. All in all, after the two sessions, I’m starting to feel that I probably ought to be more enthusiastic about charters than I’ve previously been. Maybe, as a friend once commented, ‘charters are the new black’.