Fifty years of historical database angst

The Making of Charlemagne’s Europe project website has now gone live, and includes a post by me on interconnecting charter databases. In it I mention a recent argument we had when trying to decide which of several different categories of transaction a particular document fell into. Just to show that such problems of coding documents are not new, here are some quotes from a recent article on Charles Tilly, a historical sociologist and a pioneer of using databases for historical research.

The Codebook for Intensive Sample of Disturbances guides more than 60 researchers in the minutiae of a herculean coding project of violent civil conflicts in French historical documents and periodicals between 1830–1860 and 1930–1960…The Codebook contains information about violent civic conflict events and charts the action and interaction sequences of various actors (called there formations) over time….we find fine-grained detail and frequent provision made for textual commentary on the thousands of computer punch cards involved.

(John Krinsky and Ann Mische, “Formations and Formalisms: Charles Tilly and the Paradox of the Actor”, Annual Review of Sociology, 39 (2013), p. 3)

The article then goes on to quote the Codebook on the issue of subformations (when political groups split up):

In the FORMATION SEQUENCE codes, treat the subformation as a formation for the period of its collective activity—but place 01 (“formation does not exist as such at this time”) in the intervals before and after. If two or more subformations comprise the entire membership of the formation from which they emerge, place 01 in that formation’s code for the intervals during which they are acting. But if a small fragment breaks off from a larger formation, continue to record the activities of the main formation as well as the new subformation.

If a formation breaks up, reforms and then breaks up in a different way, assign new subformation numbers the second time.

If fragments of different formations merge into new formations, hop around the room on one foot, shouting ILLEGITIMIS NON CARBORUNDUM.

(Krinsky and Mische, p. 4, citing Charles Tilly, Codebook for intensive sample of disturbances. Res. Data Collect. ICPSR 0051, Inter-Univ. Consort. Polit. Soc. Res., Ann Arbor, Mich. (1966), p. 95)

In nearly fifty years, we’ve gone from punch cards to open-source web application frameworks, but we still haven’t solved the problem of historical data (and the people behind it) not fitting neatly into the frameworks we create, however flexible we try to be.

ChartEx: data technologies and charters

York, corner of Stonegate and Petergate – image taken from one of the ChartEx presentations

I will gradually be talking about the sessions I went to at this year’s International Medieval Congress, but I’ve had a special request to report on the session organised by the ChartEx project, because of its possible relevance to many of the other current charter database projects. Most of the presentations that the ChartEx team gave are now up on their project site, so that’s the first place to look. This post instead gives my personal views of what the wider significance of the project might be, judged on the basis of what were inevitably fairly brief presentations.

I’ll start by making three points that the team themselves made: this is a proof-of-concept project (i.e. the emphasis is on a relatively short intense project to see if the technology can work effectively), they’re working with existing digitised resources, and their aim is to provide tools for expert historians rather than end-results accessible to non-specialists. So any assessment of what they’ve achieved has to acknowledge the limits of what’s possible in the time, the sources they had to start from and who they’re designing things for.

There are three main areas on which they were focusing: Natural Language Processing (NLP), data mining and a virtual workbench. First of all, the NLP work is attempting to create a system which will automatically mark up charter texts or transcriptions, e.g. tagging people, places, occupations, relationships, charter features, etc. So the obvious questions I was interested in were 1) can such automatic marking-up be done, and 2) is it useful if you do succeed in doing it? To which the answers seemed to me to be 1) “yes, but” and 2) more useful when combined with data mining than I’d previously appreciated.

From what we heard of the methods and successes of the NLP part of the project, there are certain limits on what it can effectively do:

a) You need a large training set to start with: they were talking about 200 charters that had to be marked up by hand, which means it’s probably only a process worth doing if you have at least a thousand charters you want marked up.

b) It works better on marking up names (of people or places) than on relationships beyond the most immediately adjacent in the text: e.g. it can cope with finding the father–son relationship in “Thomas son of Josce”, but not necessarily both of the relationships in “Thomas son of Josce goldsmith and citizen of York to his younger son Jeremy”.

c) One of the reasons it works more effectively on names is because it’s using existing editorial conventions, e.g. capitalisation of proper nouns. That means that if you get an editor who’s decided they’re not going to use this convention (e.g. as with the Farfa charters), you have problems.

d) It also sounded as if it would work reasonably well where you had a list of likely terms you could give it to look for, e.g. occupation names/titles.

e) Overall, it’s likely to work best on texts that are relatively standardised: the demonstrations we had were using modern English translations or summaries of charters from late medieval York. One of the team suggested that if you used the original Latin texts instead, you might get some extra relationships clearer because of grammatical case (e.g. you could distinguish the recipient from the donor in a sentence). However, that relies crucially on the writer of the Latin texts observing some consistent rules of grammar, which early medieval scribes frankly don’t.

f) There’s also what I now think of as the “Judas Iscariot problem”, after an example in my IMC 2013 paper. In other words, the names of people and places that you don’t want (e.g. Biblical figures in sanctions clauses, or those mentioned in pro anima clauses in this example), also get marked-up.

I think all these factors combined mean that NLP is only likely to be of substantial use where you’ve got big and fairly homogeneous corpora of charters: the only early medieval dataset ChartEx were considering working with was the online edition of the Cluny charters.

The part of the project that I found most interesting (and potentially more relevant to early medievalists) was the discussion of data mining. This was using statistical methods on marked-up text to suggest possible identifications, both person-to-person, and also (more complicated), site-to-site. More specifically the aim was to match people/families in charters from late medieval York with one another and combine this with boundary information to try and identify a series of charters all dealing with the same urban plot.

This is the kind of matching that Sarah Rees Jones and scholars like her have tried to do for urban landscapes by manual methods for many years. What is so useful about computer techniques is that they can combine multiple factors and compare different charters very rapidly. If you look at the demonstration of this (slides 10-12), you can see how a phrase such as “Thomas, son of Josce, goldsmith” can be broken down into a set of statements with probabilities, and the likelihood of a match between two people with similar descriptors in two different charters can then be quantified. (For the mathematically inclined among us, the speaker admitted that the probabilities for names and profession weren’t necessarily entirely independent, but he didn’t think that distorted the results too much.)
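The arithmetic behind that kind of quantification can be sketched in a few lines. This is not ChartEx’s actual model, just an illustration of combining per-attribute match probabilities under an independence assumption (the very caveat the speaker mentioned); all the probability values are invented.

```python
# Sketch of probabilistic record linkage between two charter descriptors.
# All probabilities are invented for illustration.

def match_likelihood(attributes):
    """Combine per-attribute match evidence, assuming independence.

    attributes: list of (p_same, p_diff) pairs, i.e. how likely two
    records agree on that attribute if they describe the same person
    vs. two different people.  Returns the overall likelihood ratio.
    """
    ratio = 1.0  # likelihood ratio: same-person vs different-people
    for p_same, p_diff in attributes:
        ratio *= p_same / p_diff
    return ratio

# "Thomas, son of Josce, goldsmith" vs a similar descriptor in another
# charter: agreement on forename, patronymic and occupation.
evidence = [
    (0.95, 0.10),  # forename "Thomas" matches
    (0.90, 0.02),  # father's name "Josce" matches
    (0.80, 0.05),  # occupation "goldsmith" matches
]
lr = match_likelihood(evidence)
print(f"likelihood ratio same-person : different-people = {lr:.0f} : 1")
```

Each attribute that matches multiplies the odds that the two descriptors refer to the same person; rarer attributes (a distinctive patronymic) contribute more than common ones.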

The speakers also demonstrated how it was possible to do transaction/transaction clustering, i.e. to spot the charters which were most like each other in terms of the boundaries of the property transferred and the people involved. That kind of large-scale matching (they were carrying out complete cross-matching of sets of 100 items or more) is extremely difficult for human brains, which find it hard to take multiple factors into account simultaneously.

It’s that combination of mark-up (automated or not) and data-mining that struck me as the most useful general application of the project. The mapping of plots is only likely to be relevant for collections where we have lots of data on the same small areas, which means urban areas with large numbers of charters. The person-to-person identification techniques work well if you’ve got people identified in some detail in relatively formalised ways. My immediate thought is that it would have been a useful tool for the project team on Profile of a Doomed Elite. But the matching process can only be as effective as the quality of data you’ve got, and I don’t think most early medieval charter collections give you enough identifying details. I’d be very interested to hear the team’s results on matching the Cluny data, or what you’d get from, say, twelfth-century Scottish charters.

But in theory, you could apply the same matching techniques to any data in the charter that had been marked up, either by hand or via NLP. I’ve previously been sceptical about what you can do with a list of curses from Anglo-Saxon charters, but this kind of data mining probably could do some very interesting clustering of them, especially using some of the methods for matching texts that DEEDS has expertise in. And in particular, that means that it might be possible at last to do something systematic with early medieval formulae (for those of us who aren’t Wendy Davies).

Particular types of formulae, such as appurtenance clauses, are at once so standardised that they must derive from one another (or from shared earlier models) and at the same time so subtly different from one another that tracing their connections is extremely complicated. If you have the text of enough early medieval charters online, it wouldn’t be that time-consuming to mark up just the relevant few sections in each charter (either manually or possibly via NLP) and then turn such data-mining techniques on them. I suspect you would get some genuinely interesting suggested clusters as a result. And the whole point of this project is that it’s not intended to replace scholars, but to give them short-cuts to looking at data in a way that’s otherwise excessively time-consuming.

And it’s at this point that I want to go onto the final aim of the ChartEx project, which is to produce a virtual workbench for historians working with charters. The main novelty here seemed to be the involvement of specialists in human-computer interaction, but at this stage in the project we were told more about the methodology they were using for designing the interface than what was actually in it. So it’s a bit hard to know how different it will be from the kind of interface that KCL’s Department of Digital Humanities is now designing, e.g. the mapping and statistics possible with Domesday Book data. It’ll be interesting to see how this develops, but the project as a whole already seems to have some methods that those of us interested in charters from other periods might well find worth investigating and adapting.

What’s human about the digital humanities?

The first seminar I attended in the academic year 2012-2013 wasn’t a medieval one but one organised by KCL’s Department of Digital Humanities, which featured Alan Liu from UC Santa Barbara, a veteran of the digital humanities. (He started his Voice of the Shuttle catalogue of websites in 1994). Alan was talking on “The Meaning of Digital Humanities”, and arguing that issues about the meaning of the digital humanities are really about the wider question of the meaning of the humanities themselves, and about how you get from numbers to meaning.

The talk was part of his putting together of an introductory essay on the digital humanities for the PMLA journal. Alan’s background is in English literature, so one of the interesting things for me was hearing about literary attitudes to digital methods. He contrasted history, which is relatively used to working with big data, and literary studies which aren’t. For history he was mentioning GIS projects, such as the Stanford Spatial History Project, but also pointing out that there was a much longer cliometric tradition, especially in economic history. Historians don’t think that counting things is necessarily diminishing the humanities.

Alan also touched on the different techniques that digital projects could use: one aspect is quantification, with its inevitable problems of losing context. But he also pointed out the possible use of digital models and visualisation, as a way of reducing dimensions to see patterns (such as generating social network diagrams that aren’t incomprehensible blurs).

As an example of the use of digital methods in the study of literature, both in its methodology and its problems, he mentioned a project from the Stanford Literary Lab: Ryan Heuser and Long Le-Khac, A Quantitative Literary History of 2,958 Nineteenth-Century British Novels: The Semantic Cohort Method. This focuses on finding clusters of words with the same usage trends, and Alan was discussing the possibility of the hypothesis-free initiation of analysis. That is, it’s possible to use algorithms to “play games” and find patterns within your data, and only bring in the human interpreters and close reading at a later point, when you’ve got material that looks statistically significant.

The problem is that you may have to set up the parameters in a way that’s already potentially rigged the game. For example, what this project did was start from “seed words”, such as “land” or “country”, and generate sets of terms that had similar usage trends to these over time. In practice, they had an oscillating dialogue between the empirical data of words that behaved similarly and words that humans thought were semantically linked. Alan suggested this hybridity of methods may be necessary, and quoted Stephen Ramsay: “the best digital humanities is the hermeneutics of screwing around”. He also pointed out that one of the big problems with such projects is that they’re often insufficiently documented: researchers need to provide more details of both how the data corpus is formed and how it is cleaned. (Cleaning up data at the inputting stage is a major issue for most projects).

A project like that of Heuser and Le-Khac suggests the possibilities of digital humanities as one tool for scholars. But there’s still the question remaining when you’ve generated this data: what is the meaning of changes in word-use frequency? And this links into the problem of the meaningfulness of the humanities as a whole – where is the residual space for humans in a world of scientific golems? Do we need Raymond Williams on culture if we’ve got Google Ngram Viewer?

Alan concluded by saying that those working in humanities have to come up with wider answers about the significance of the humanities and also about new forms of digital pedagogy (especially with the rise of MOOCs). In the discussion afterwards, he referred to the possibility of humanists as being the repositories of the meaning that can’t be extracted from the texts. It’s a line that has interesting implications for the current Making of Charlemagne’s Europe project and the tension between charter as data-source and charter as one-off textual and material object. Digital humanities sometimes oversells itself as a new paradigm for all humanities research, but it is making us think about how and what we study in some very interesting ways.

Medieval social networks 2: charters and connections

As a follow-up to my first post on social network analysis, I’m now gradually reading some of the many books and articles on historians’ use of network analysis that readers of my blog suggested. And having read a couple of chapters of Giovanni Ruffini, Social Networks in Byzantine Egypt, I’m coming to realise that one of the most difficult issues for those of us working with documentary sources is deciding what counts as a connection between two people and what links should therefore be included in the network.

The majority of the late antique/medieval network analysis studies that I’ve looked at work by hand-crafting links. Someone sits down, works their way through their sources and picks out by eye every link between two people (or two places). Often, they also categorise the link. For example, Elizabeth Clark, when studying conflicts between Jerome and Rufinus, divided links into seven different types: “marriage/kinship; religious mentorship; hospitality; travelling companionship; financial patronage, money, and gifts; literature written to, for, or against members of the network; and carriers of literature and information correspondence.”

(Elizabeth A. Clark, “Elite networks and heresy accusations: towards a social description of the Origenist controversy”, Semeia 56 (1991), 79-117, at p. 95).

Similarly, Judith Bennett did the same thing when looking at connections of families recorded in the Brigstock manorial court records:

The content of these transactions has been divided into six qualitative categories that collectively encompass all possible transactions. These categories are based upon whether the network subject interacted with another person by (1) receiving assistance, (2) giving assistance, (3) acting jointly, (4) receiving land, (5) giving land, or (6) engaging in a dispute.

(Judith M. Bennett, “The tie that binds: peasant marriages and families in late medieval England”, Journal of Interdisciplinary History 15 (1984), 111-129, at p. 115).

And for networks of places, Johannes Preiser-Kapeller, “Networks of border zones: multiplex relations of power, religion and economy in South-Eastern Europe, 1250-1453 AD”, in Revive the Past: Proceedings of the 39th Conference on Computer Applications and Quantitative Methods in Archaeology, Beijing, 12-16 April 2011, ed. Mingquan Zhou, Iza Romanowska, Zhongke Wu, Pengfei Xu and Philip Verhagen (Amsterdam: Pallas Publications, 2012), 381-393, combined existing geographical datasets on late antique land and sea routes with details of church and state administrative networks he’s compiled from documentary sources.

Such approaches create very reliable networks, but they’re hard to scale up. Clark looks at 26 people; Judith Bennett has 31 people and 1,965 appearances in extant records from 1287 to 1348; Preiser-Kapeller has around 270 nodes and 680 links in total. Rosé’s study of Odo of Cluny, which I discussed in the previous post, had 860 links. For charters, such hand-crafted networks would probably only allow the exploration of small archives or individual villages.

What is more, researchers often want to carry out social network analysis as an offshoot of more general prosopographical work, such as creating a charter database. But it’s hard to analyse links until you’ve first created a prosopography, because it’s only when you’ve been through all the charters that you have a decent idea of whether two people of the same name are actually the same person. (There’s a further issue here about whether you may end up with circular reasoning between prosopography and network analysis, but I’ll leave that for now). So in theory, you’d need to go through all the charters first to identify people and then have to go back to assess whether or not they are linked in a meaningful way, doubling your work.

As a result, some researchers have started trying to see if there are ways of automatically creating networks from existing databases or files, developing methods for analysing charters that (in theory) can be scaled up relatively easily. In the rest of the post I want to look at the relatively few projects I’m aware of attempting to do this and outline how we might approach the problem with the Making of Charlemagne’s Europe dataset.

The three projects I’m looking at are by Giovanni Ruffini, working on the village of Aphrodito in Egypt (see reference above); Joan Vilaseca, who’s been experimenting with creating graphs from the early medieval sources he’s collected; and a controversial article by Romain Boulet, Bertrand Jouve, Fabrice Rossi and Nathalie Villa, “Batch kernel SOM and related Laplacian methods for social network analysis”, Neurocomputing 71 (2008), 1257-1273.

Ruffini is explicit about how he’s creating his networks and the problems that may result from this (pp. 29-31). He’s taking documents and creating “affiliation networks”: all those who appear in the same document are regarded as connected to one another. As he points out, the immediate problem is that this method can introduce distortions if you have one or two documents with very large numbers of names. For example, one of the texts in his corpus is part of the Aphrodito fiscal register and has 455 names in it, while the average text names only eleven (p. 203). If such a disproportionately large text is included, analysis of connectivity is badly distorted, with all the people appearing in the fiscal register appearing at the top of connectivity lists.
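The affiliation-network construction, and the distortion Ruffini describes, can be sketched very simply; the documents and names below are invented for illustration, not drawn from his Aphrodito corpus:

```python
from itertools import combinations
from collections import Counter

# Toy charters: each document lists everyone named in it.  An affiliation
# network links every pair of people who co-occur in a document.
documents = {
    "sale_1":    ["Ato", "Bella", "Carles"],
    "sale_2":    ["Ato", "Dolça"],
    "judgement": ["Ato", "Bella", "Carles", "Dolça", "Ermemir",
                  "Fruià", "Guifré"],   # one outsized document
}

def affiliation_edges(docs):
    """Count, for each pair of people, how many documents they share."""
    edges = Counter()
    for people in docs.values():
        for a, b in combinations(sorted(set(people)), 2):
            edges[(a, b)] += 1
    return edges

edges = affiliation_edges(documents)

# Degree = number of distinct people each person is linked to.
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

# Everyone named in the big judgement is linked to everyone else in it,
# so the connectivity ranking is dominated by that single document.
print(degree.most_common(3))
```

Even in this toy example, the seven people in the one large document all end up maximally connected, which is exactly the distortion that an Aphrodito fiscal register with 455 names produces at scale.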

The same effect can be seen in Joan Vilaseca’s graphs. If you look at his first attempts at graphing documents from Catalonia between 898 and 914, they’re dominated by the famous judgement of Valfogona in 913.

But Joan’s graphs also show an additional problem: his first graphs give great prominence to Charles the Simple and Louis the Stammerer, because they appear so often in dating clauses. When he starts looking for measures of centrality in his next post, he initially finds the most connected people to be St Peter, the Virgin Mary and Judas Iscariot (who appear frequently in sanction clauses).

This brings us to the key question: what does it mean to be in the same charter as another person? The problem is that people are named in charters for so many different reasons: they may be saints, donors, witnesses, relatives to be commemorated, scribes or even the count whose pagus you are in. People may also appear as the objects of transactions: some of our early decisions on the Charlemagne project were deciding how we would treat the unfree (and possibly the free) who were being transferred between one party and another. Such unfree have an obvious connection to the donor and the recipient. But do they have any meaningful relationship to the witnesses or the scribe? At least with witnesses, there’s a reasonable chance in most cases that they all physically met at some point, but I don’t know of any evidence that the unfree would necessarily have been present when their ownership was transferred by a charter.

So simple affiliation networks, even when you eliminate disproportionately large documents and people mentioned only in dating or sanction clauses, can still be inaccurate representations of actual relationships. One possible response to this problem is to include as links only types of relationships that are themselves spelled out in the charters. Joan has some graphs showing only family and neighbourhood relationships, for example. Ruffini (p. 21) suggests the possibility of using data-sets where a link is defined as existing only when there is a clear connection between two parties in a document, e.g. between a lessor and a lessee. But as he points out, we would then have much smaller data-sets. And for early medieval charters in particular, focusing only on the main parties to a transaction would simply demonstrate that most transactions were about people donating or selling land to churches and monasteries, which is not exactly new information.

Are there any other ways to cut out “irrelevant” connections while keeping those we think are likely to show meaning? Another approach that Joan tries uses affiliation networks, but then removes links where two people occur together in only one document. For his interest in identifying key members of Catalan society, focusing on the most important links may well make sense. But they potentially distort the evidence on one question of wider interest: how significant are weak ties in charter-derived networks? Weak ties, where two people interact only occasionally, may paradoxically be more important for spreading information or practices. Given we have only a small subset of interactions preserved via charter data, significant weak ties may be lost if we start removing data from affiliation networks in this way.

Implicitly, at least, an alternative method for selecting links within what’s broadly an affiliation network is given by Boulet, Jouve, Rossi and Villa. As they explain in their study of thirteenth- and fourteenth-century notarial acts, they constructed a graph in the following manner (pp. 1264-1265):

First, nobles and notaries are removed from the analyzed graph because they are named in almost every contracts: they are obvious central individuals in the social relationships and could mask other important tendencies in the organization of the peasant society. Then, two persons are linked together if:

– they appear in a same contract,
– they appear in two different contracts which differ from less than 15 years and on which they are related to the same lord or to the same notary.

The three main lords of the area (Calstelnau Ratier II, III and Aymeric de Gourdon) are not taken into account for this last rule because almost all the peasants are related to one of these lords. The links are weighted by the number of contracts satisfying one of the specified conditions.

Though it’s not clear why people are regarded as linked if they use the same notary, the other criteria seem to be ways of trying to filter out distortions that potentially arise from notarial practices. If men are routinely described in terms of their affiliation to a lord, e.g. “A the man of B”, then an affiliation network will derive from a sale between “A the man of B” and “C the man of D” not only the justified links A to B, C to D and A to C, but also links that in practice are unlikely to exist, or at least are not proven to do so, i.e. A to D, C to B and B to D.
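To make the two linking rules concrete, here is a much-simplified sketch: the acts, names, dates and the act-level treatment of lords and notaries are all invented, and the exclusions Boulet et al. apply (nobles, notaries, the three main lords) are omitted.

```python
from itertools import combinations

# Toy notarial acts: (act_id, year, notary, lord, parties).
acts = [
    ("a1", 1300, "notary_N", "lord_L", ["A", "B"]),
    ("a2", 1310, "notary_N", "lord_M", ["C"]),
    ("a3", 1330, "notary_N", "lord_L", ["D"]),
]

links = set()

# Rule 1: two persons appear in the same act.
for _, _, _, _, parties in acts:
    links |= {tuple(sorted(p)) for p in combinations(parties, 2)}

# Rule 2: two acts at most 15 years apart sharing a notary or a lord
# link their parties to each other.
for (i1, y1, n1, l1, p1), (i2, y2, n2, l2, p2) in combinations(acts, 2):
    if abs(y1 - y2) <= 15 and (n1 == n2 or l1 == l2):
        links |= {tuple(sorted((a, b))) for a in p1 for b in p2}

print(sorted(links))
```

Here A and B are linked by rule 1; C gets linked to both via rule 2 (a1 and a2 share a notary within 15 years); but D remains isolated, because a3 is 30 and 20 years away from the other acts.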

So how might we balance distortions from applying the affiliation network model to charter data against loss of data or an unfeasibly high workload if we don’t use this method? The model for the Making of Charlemagne’s Europe database allows inputting of relationship factoids, which will catch explicit references to people as the relatives or neighbours of others. Graphs using such data will be relatively easy to construct.

We are also, however, recording “agent roles”, used to identify what role a person or an institution plays within an individual charter or transaction (e.g. witness, scribe, object of transaction, granter). At the minimum, any social network analysis application added to the system should probably allow a user to choose which of these roles they want included within the graphs to be created. There should also be some threshold (either chosen by us or user-defined) for excluding documents that contain “too many” different agents. We’re still not going to get the precision graphs that hand-crafting links will give, but we can hopefully still get something that will tell us something useful about how people interact.
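As a sketch of the kind of role-based filtering suggested above: the data model, role names and the threshold of 20 agents are all invented for illustration, and are not the actual Making of Charlemagne’s Europe schema.

```python
from itertools import combinations

# Invented toy charters: each agent is a (name, role) pair.
charters = [
    {"id": "c1",
     "agents": [("Hildegard", "granter"), ("St-Gall", "recipient"),
                ("Wolfhard", "witness"), ("Bero", "scribe")]},
    {"id": "c2",   # an outsized witness list
     "agents": [("Hildegard", "granter"), ("St-Gall", "recipient")]
               + [(f"witness_{i}", "witness") for i in range(60)]},
]

def links(charters, roles, max_agents=20):
    """Co-occurrence links restricted to chosen roles, skipping
    charters with too many agents."""
    out = set()
    for ch in charters:
        if len(ch["agents"]) > max_agents:   # drop outsized documents
            continue
        kept = [name for name, role in ch["agents"] if role in roles]
        out |= {tuple(sorted(p)) for p in combinations(kept, 2)}
    return out

# Including witnesses links Wolfhard to the parties, but the scribe is
# excluded, and c2 is skipped entirely because of its 62 agents.
result = links(charters, {"granter", "recipient", "witness"})
print(sorted(result))
```

Letting the user pick the role set (and the threshold) means the same underlying data can yield either a tight parties-only graph or a broader witnessing network.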

Medieval social networks 1: concepts, intellectual networks and tools


Data visualization of Facebook relationships by Kencf0618

Network analysis is one of those areas which keeps on cropping up as a possibility for medieval researchers. (There have been some interesting discussions and examples previously at A Corner of Tenth Century Europe and Cathalaunia, which I’ll discuss more in a later post).

Since one of the hopes of the Making of Charlemagne’s Europe project I’m working for is that the data collected can be used for exploring social networks, I thought it would be useful to find out a bit more about what has been done already. So this is my first attempt to get a feeling for what’s been done with medieval data and what it might be possible to do.

I should note at this point that I’m drawing very heavily on the work of Johannes Preiser-Kapeller, especially his paper “Visualising Communities: Möglichkeiten der Netzwerkanalyse und der relationalen Soziologie für die Erfassung und Analyse mittelalterlicher Gemeinschaften”. I found out about many of the projects I discuss from this paper, so I am grateful to him for providing such a primer. My focus is slightly different from his, however, as what I’m particularly interested in is the type of research questions that social network analysis might be used to answer, more than the details of particular projects.

Defining networks
One immediate problem in knowing where to look comes because the key mathematical tools and visualization techniques can be applied to very different kinds of data. The underlying concepts come mainly from graph theory. Wikipedia defines that as: “the study of graphs, which are mathematical structures used to model pairwise relations between objects from a certain collection. A “graph” in this context is a collection of “vertices” or “nodes” and a collection of edges that connect pairs of vertices. A graph may be undirected, meaning that there is no distinction between the two vertices associated with each edge, or its edges may be directed from one vertex to another.”

What that means is that you can use the same basic techniques to study anything from a road network via the structure of novels, to how infections spread through a population. But it also means that the type of network and how you can analyse it depends crucially on several factors. These include how you define a node and edge, whether all edges are the same (or whether you’re counting the connections between some pairs as somehow different/more important than others) and whether it’s a directed or undirected graph.
The size of the network is also crucial, and that differs vastly between disciplines: it’s when you see a physicist commenting that “At best power law forms for small networks (and small to me means under a million nodes in this context) give a reasonable description or summary of fat tailed distributions” that you know that not all networks are the same kind of thing. One of the things that interests me when looking at projects is the extent to which data visualization is important in itself or whether the emphasis is on mathematical analysis of the underlying data.

Data quality

There are, inevitably, particular issues with data quality for medieval networks. The obvious one is whether the information you have is typical or whether the reasons for its survival bias our evidence excessively from the start. (The answer is almost certainly yes, but medievalists wouldn’t know how to cope if they had properly representative sources, so let’s move on rapidly).

Another big issue is identifying individual nodes. You can in theory have anything as nodes: an individual, a “family”, a manuscript, a place, a type of archaeological artefact, a gene, a unit of language. (I’m not going to look at either linguistic or genetic network analysis in what follows, but there are projects doing both of those). The problem with medieval data is that there’s almost always some uncertainty about identification: are two people the same or not? What do you do about unidentifiable places? How do you decide whether two people belong to the same family?

Then there’s question of how you define a connection between two nodes. What makes two people connected to one another? The data you extract from the sources obviously depends on decisions made about this, but for a lot of medieval networks there’s the added complication that not all connections are made at the same time. If you have a modern social network where A connects with B and (simultaneously) B connects with C you can make certain deductions about the network from data about whether or not A and C are connected. If you have limited medieval data where A connects with B and 20 years later B connects with C, can you model that as one network, or do you have to take time-slices across the network (which may often reduce your available data set from small to pathetic)?

Varieties of projects
Of the medieval history projects I’ve come across so far (I suspect there’s a whole slew of others in fields such as archaeology), most seem to fall into three categories. There are studies on networks of traders, such as Mike Burkhardt’s work on the Hanse. There are probably other similar examples: I’ve not yet had a chance to investigate whether the important work by Avner Greif on traders in the Maghreb also uses network analysis or not. But these kinds of studies are unlikely to be relevant to any early medieval project, because they will almost certainly rely on relatively large-scale sets of data from a short chronological range (account-books, registers of traders etc). Such data sets simply don’t exist for the periods I’m interested in.

The other two types of medieval network studies I’ve noticed are ones looking at intellectual networks or the spread of ideas (with some possible overlap with the spread of objects more generally) and ones using network analysis to study how a society operates (social network analysis in its most specific sense). For both of these, I’m aware of some early medieval studies and others that are potentially applicable to early medieval-style data. I’ll cover intellectual networks in this post (including a discussion of a recent IHR seminar) and then move onto social history uses of network analysis in the next post.

Intellectual networks/spread of ideas: example projects

1) Ego-networks
There are several forms that network analysis of intellectual networks can take. One obvious one is as a more quantitative version of what’s been done for many years (if not centuries): the study of “ego-networks”, the intellectual contacts that a particular individual has.

This is the basis for the study by Isabelle Rosé of Odo of Cluny (Rosé, Isabelle. “Reconstitution, représentation graphique et analyse des réseaux de pouvoir au haut Moyen Âge: Approche des pratiques sociales de l’aristocratie à partir de l’exemple d’Odon de Cluny († 942)”, Redes. Revista hispana para el análisis de redes sociales 21, no. 1 (2011)).

Rosé’s study isn’t strictly of just an ego-network, since she also tries to analyse the connections that Odo’s contacts had with each other in which Odo wasn’t involved, but the centre is clearly Odo. Rosé uses a mix of different sources (narrative and charters) to construct snapshots of Odo’s connections over time: she ends up with a PowerPoint slideshow showing the network for every year (available from here). She wanted to include a spatial dimension to the networks (showing where connections were formed), but couldn’t find a way of doing that.

Rosé’s account includes some useful detail about her methodology. The data she collected in Excel consisted of 2 people’s names, a type of connection and a direction for it, a source and start dates and end dates for the connection. She also codes individual nodes based on the person’s social function (monk, layman, king etc) and the aristocratic group they belong to (Bosonids etc); this is reflected in their colour and shape on her network diagrams.

There are a lot of questions raised immediately about how such decisions are made (the period of time allocated to a particular connection, how she decides who counts as in one of the groups); all the kinds of nitty-gritty that have to be sorted out for any particular project.

What does Rosé’s use of network analysis allow that a conventional analysis of how Odo’s social networks helped him couldn’t do? One is that the data collection method encourages a systematic searching for all connections that an unstructured reading of the sources might miss. Secondly, the visualization of networks (especially as they change over time) gives an easy way of spotting patterns, allowing periodization of Odo’s career, for example. Thirdly, it’s possible to compare different sorts of tie, e.g. she shows that the kinship networks (whether actual or the fictive kinship of godparenthood) consists of a number of unconnected segments. But when you include ties of kinship and ties of fidelity, you do get a single network. Finally, Rosé uses a few formal network metrics to rank people by their centrality to the network (their importance to it) and their role as cut-points (people whose removal from the network would mean that there were disconnected segments of it).

Apart from this restricted use of metrics, Rosé is mostly doing visualization and I suspect that many of her conclusions are confirmations of things that a conventional analysis of Odo’s social network without such complex data collection would have come up with anyhow: who Odo’s key connections were, the importance of the fact that right from the start Odo had connections to the Robertines and also the Guilhemides. But one of her most interesting comments was that analysis showed a move away from kings as central to social networks, which she connected to a move to “feudalism”. If we could find comparable data sets (and there are obvious problems in doing so), it’d be interesting to see whether kings outside France become non-central to reforming abbots in the same way.

2) Scale-free networks
There are a couple of articles I want to highlight which talk about scale-free medieval networks and which I want to discuss more for some of the difficulties they raise than the answers they’re coming up with. One is work that hasn’t yet been published, but has been publicised: analysis of the spread of heresy by Andrew Roach of Glasgow and Paul Ormerod. The other is Sindbæk, S.M. 2007. ‘The Small World of the Vikings. Networks in Early Medieval Communication and Exchange’, Norwegian Archaeological Review 40, 59-74, online.

But first, a very rough explanation of scale-free networks, which means introducing one or two basic mathematical/statistical ideas. The first is the degree of a node, the number of connections it has. The second is the distribution of these degrees, i.e. what percentage of nodes have 1 degree, 2 degrees, etc. Scale-free networks are ones where the degree distribution follows a power law: roughly speaking, you have a few very well-connected nodes and then a long tail of a lot of poorly-connected nodes.
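In code, a degree distribution is just a frequency count over node degrees; the (entirely invented) numbers below show the shape being described, a few hubs and a long tail of poorly-connected nodes:

```python
from collections import Counter

# Hypothetical list of node degrees: many weakly-connected nodes, one big hub.
degrees = [1] * 60 + [2] * 25 + [3] * 10 + [10] * 4 + [40] * 1

dist = Counter(degrees)
total = len(degrees)
for k in sorted(dist):
    print(f"degree {k:>2}: {dist[k] / total:.2f} of nodes")
# A scale-free network is one where these fractions fall off roughly as
# k**-gamma, i.e. a straight line on a log-log plot.
```

Whether a real, finite data set genuinely follows a power law (rather than some other fat-tailed distribution) is exactly the statistical question flagged below.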

The crunch here is “roughly speaking”: there are all kinds of issues about whether any particular example really does follow the power-law distribution that supposedly lies behind it. It’s a reminder that if we as historians do start doing more of this kind of work, we’re probably going to need some good mathematicians/statisticians behind us pointing out possible issues.

Without seeing the data, it’s impossible to tell whether Roach and Ormerod are accurate about medieval heresy spreading through such types of networks. But Søren Sindbæk’s paper on Viking trade suggests that the interest here isn’t strictly whether we’re talking about scale-free distributions or not. It’s a more general question about how the very localized societies within which the vast majority of medieval people lived could nevertheless allow the relatively rapid long-range spread of everything from unusual theological ideas to silver dirhams.

Søren’s main point is that there are two possible ways that such small-world networks can evolve: either you can have a few random links between two otherwise largely separate networks (weak-ties model) or you can have a few very well-connected nodes amid the otherwise very localised societies (“scale-free”). Which of these two ideal type of networks you have affects considerably the robustness of the network: i.e. if you have one or two crucial hubs that get destroyed by attackers, the whole network falls apart, but random attacks aren’t likely to have much effect, while the weak-ties model is more vulnerable to a random attack (if a random link that ties two networks together happens to get severed). Søren tries to see which type of network best fits two very limited sets of data (one based on the Vita Anskari) and one on archaeological data. The answer, not surprisingly, is “scale-free” networks.

I say the answer isn’t surprising because the medieval world is full of hierarchies of people and places, and some of the defining characteristics of those at the top of such hierarchies are that they move around more or they have connections to a lot more places. I found Søren’s paper mainly revealing in giving a feel of the numerical bounds for where simple visualization is a useful tool: a plot of 116 edges (see Fig 3) is already getting complex to visualise; one with 491 edges (see fig 4) almost impossible to take in by eye.

As for Roach and Ormerod, the fact that heresy was mainly spread through a small number of widespread travellers isn’t exactly news. We’ll have to wait and see whether they can provide something that gives a new dimension of analysis.

3) Six degrees of not-Alcuin
Finally for this post, I want to discuss an IHR seminar I heard back in May: Clare Woods from Duke University talking about “Ninth century networks: books, gifts, scholarly exchange”. Clare’s coming to intellectual history from a slightly different angle from Isabelle Rosé: she has been editing a collection of sermons by Hrabanus Maurus for Archbishop Haistulf of Mainz, and thinking about how to represent the relationship between manuscript witnesses visually (rather than just relying on verbal descriptions or stemma diagrams).

The point here is that manuscript stemmata can be thought of as directed networks between manuscripts, whose places of production can be located (more or less accurately). (There are also projects endeavouring to generate manuscript stemmata automatically, but I’m not discussing those at the moment). Clare is also using data from book dedications, known manuscript movements, and the evidence of medieval library catalogues.

Also in contrast to Rosé, Clare was interested in the possibility of getting beyond the spider’s-web idea of intellectual history, i.e. that Hrabanus (or Odo) sits at the centre and everyone else revolves around him. This is a particular issue for Carolingian intellectual history because of Alcuin. We have far more letters of Alcuin preserved than of any other Carolingian author (Hincmar probably comes second, but his letters still haven’t been edited properly), so if you use Rosé’s techniques you’re liable to end up vastly overrating Alcuin’s significance.

Clare’s main focus was on simple tools for visualizing this information, ideally in both its spatial and temporal dimensions. As I said above, Rosé was using Excel, Powerpoint and NetDrawand was finding problems in showing locations. Clare was using Google Maps for the spatial element, but thought she’d need Javascript (which she doesn’t know) to show changes over time. I have seen projects which use GoogleMaps and a timeline, such as the MGH Constitutiones timemap (click on Karte to follow how Charles IV, the fourteenth century Holy Roman Emperor moved around his kingdom). I don’t know how that is made to work.

I’d be interested to know from more informed readers of the blog if there are such tools available that non-experts can use to produce geo-coded networks of this kind. Gephi seems to be popular free software for network analysis, and I’ve seen a reference to a plug-in for this which allows entering geo-coded data. The Guardian datablog recommends Google Fusion Tables.

But whatever software you have, there are the normal issues of data quality. There’s a particular problem with data coming from a very long timescale: in the questions, David Ganz wondered whether the evidence was getting contaminated by C12 copies (I wasn’t quite sure whether that’s just because there are so many manuscripts of all sorts from later). How do we know whether manuscript movements reflect actual intellectual contacts, rather than just random accidents of them getting moved/displaced etc? Clare also discussed the problem of how you map a manuscript which came from “northern Italy”. Her response was to choose an arbitrary point in the region and use that – at the level of approximation and small number of data points she’s using, it’s not a major distortion.

The data sets for early medieval texts are always going to be tiny: having more than 100 manuscripts of one text from the whole of the Middle Ages is exceptional. (The largest transmission I know of is for Alcuin’s De virtutibus et vitiis of which we have around 140 copies). But Clare’s project does potentially offer the possibility of combining her data with other geo-referenced social networks to get an alternative and wider picture of intellectual connections in the Carolingian world. Combining data-sets is likely to lead to even more quality issues, but it does offer the possibility of building up new concepts of the Carolingian world module by module.

Who cares about history (unique identifiers edition)?

Over at A Corner of Tenth-Century Europe, a discussion that started off being about the trajectory of women’s history has mutated into one about why someone isn’t creating a system of unique identifiers for medieval texts. And while I’ve spent the last decade or so thinking about gender history, I’ve spent half my life thinking about databases and identifying references uniquely, because that is one of the things librarians do all day. So I wanted to start from Joan Vilaseca’s plea for “A public and standarized corpus of classical/ancient texts with external references to editions, versions, comments, articles, etc,etc.etc”, sketch out what I’m aware of as existing and explore why history seemingly can’t get its act together in the way that chemistry or taxonomy has.

There are actually some databases that do a fair amount of what Joan would want. As examples, there are:

1) Perseus Digital Library. This is a big and sophisticated free collection of classical texts, including some very neat tools, such as Greek and Latin word study tools (which I freely admit to using when I’m stumped on working out the root verb from a conjugated form). This doesn’t have identifying references, however.

2) Library of Latin Texts. This commercial database includes the full text of the whole corpus of Latin literature up to the second century AD (essentially taken from the Teubner editions), plus a lot of patristic and medieval Latin (largely, but not entirely, taken from the Corpus Christianorum series). Associated with this is the Clavis Patrum Latinorum, which provides a numbered list of all Christian Latin texts from Tertullian to Bede. (There are similar indexes covering Greek patristic texts, apocrypha, and early medieval French authors.)

3) Thesaurus Linguae Graecae. This database includes most literary texts in Greek from Homer to the fall of the Byzantine Empire. It’s a subscription service, but it includes a free online canon database that provides unique identifying numbers for works and parts of works.

4) Bibliotheca Hagiographica Latina (BHL). This is a catalogue of ancient and medieval Latin hagiographical materials, produced by the Bollandists, which provides unique identifying numbers for different texts. There’s also a free online version.

5) Leuven Database of Ancient Books. This free database includes basic information on all literary texts preserved in manuscript from the fourth century BC to AD 800; the texts are assigned a unique number. (It’s a subset of the Trismegistos project, which focuses on documents from Graeco-Roman Egypt, both literary and non-literary, and also provides identifying numbers.)

What this very brief overview reflects is one basic fact: producing a database and/or identifier system of any size takes time and money. As a result, there have to be enough people wanting the result to make that investment worthwhile. There are several different models for financing such projects: you can sell the resultant database (either for profit or at a break-even price), you can persuade funding bodies to support you, or you can rely on charitable donations; but you need someone willing to pay.

It’s worth looking here to see why identifier projects in other fields have succeeded. A lot of large-scale identifier projects, for example, have come out of library science and publishing, both because these are huge and connected networks and because there’s the commercial driver of being able to identify something in your inventory quickly and accurately). So the Standard Book Number, developed for WH Smith in the 1960s became the ISBN of today, followed in the 1970s by the ISSN for serials, etc. It’s noticeable that it took more than twenty years after unique identifiers for serials to develop for unique identifiers for individual articles within those serials to develop (the CrossRef project using DOIs). This wasn’t because no user ever wanted an individual article to read before then; it was because it was only with electronic journals that it became feasible to try and sell individual articles to people.

Most of the other really large-scale nomenclature/identifier projects have been in the sciences, for the simple reason that the same phenomena are being studied all over the world. We’re (mostly) looking at the same sky, hence the International Astronomical Union was formed in 1919. The International Union of Pure and Applied Chemistry, responsible for chemical nomenclature, also dates from a similar period. (One of the other main systems of chemical nomenclature, the CAS Registry Number, is an offshoot of the subscription index/database Chemical Abstracts). Again, people are trying to do the same chemical reactions from Bombay to Los Angeles, so there’s a big demand for such systems. Biological classification has a very long history, dating back to Linnaeus (although unique identifiers are only just being developed), building on thousands of years of attempts to show how all species are related.

The classical/medieval database projects that I’ve mentioned above have essentially been possible because they have a sufficiently tightly-defined group of potential users who are all interested in the same sort of thing: classical literature or papyrology or hagiography. It’s therefore worth creating something for them to use. The problem with extending such a system to broader historical areas is that no-one cares about history.

That sounds ridiculous, but it’s a problem I’ve mentioned before: it’s not really clear that we’re doing the same thing as historians when we study vastly different periods and use completely different sorts of sources. Or to put it a different way, the Old Bailey database is a remarkable resource, but not of any professional use to me. I don’t care about all history, everywhere; I care specifically about early medieval European history. Historical sources, even just medieval sources, aren’t one thing, but a patchwork of different islands and most researchers spend most of their time perched securely on a few of these, rarely venturing off them. I’ve had years of being an early medievalist and never needed to cite Sawyer numbers, for example, because I don’t research or teach Anglo-Saxon history; I’d be almost equally baffled if I came across Corpus Iuris Canonici footnotes without the help of Edward Peters. The patchwork systems of identifying medieval documents remain because of the lack of overlap between the groups of researchers using them, and I can’t see any driving force that is going to change that. Crowd-sourcing has produced some remarkable things, but creating unique identifiers is a peculiarly ill-suited task for crowd-sourcing. Unless more people start caring about the history of everywhere at all times, Joan isn’t going to get the wide-ranging system he’d like.

Digital diplomatics 1: projects and possibilities

I am currently trying to get up to speed on some of the many projects involving charters online, drawing heavily on accounts from the Digital Diplomatics conferences (and also Jon Jarrett’s useful reports on the 2011 conference). I don’t claim to be an expert on charters, but I have been using (and sometimes developing) databases for 25 years, so some of the issues seem quite familiar from my experience as a librarian. What I want to do in this first post is give a sample of the types of project out there and also note what I consider to be some particularly interesting features.

It’s useful to start with a sketch of the origins of diplomatics (the study of charters) because that explains a lot about how digital developments have been shaped. The starting point was the attempts by early modernists to work out which charters of a particular religious institution were false and which were genuine. For this, the key ability was being able to compare charters with good evidence for being authentic (e.g. held as originals) to other more dubious versions. As a result, charter studies have often been organised either around particular collections/archives (e.g. editions of cartularies, charters of St Gall) or around rulers (e.g. the diplomas of Charles the Bald), because it’s easier to spot the dodgy stuff in a reasonably homogenous corpus.

Charters have also long been a key source for regional history, so eighteenth and nineteenth century scholars produced a lot of editions of regional collections of documents including charters, such as the Histoire générale de Languedoc. Where the corpus is small enough, these have then been extended to national collections or overviews, some of which I mention below.

We have now, however, begun moving from the purely print age into digital diplomatics, and there have been a variety of approaches.

1) Simple retro-digitisation
Because there’s been scholarly interest in diplomatics for several centuries, a lot of early editions are now out of copyright. Simple retro-digitisation of old editions doesn’t often get mentioned in discussions of digital diplomatics (though Georg Vogeler, “Digitale Urkundenbücher. Eine Bestandsaufnahme”, Archiv für Diplomatik, 56 (2010), p. 363?392 has a useful discussion of them), but there are a lot of old charter editions being put online by projects such as Google, Internet Archive, Gallica etc. This data, however, is pretty hard for charter scholars to make use of unless they’re looking for a specific charter (or at most a specific edition). Is there any way in which this material could be deal with more effectively?

Doing something with such data doesn’t strike me as a project that’s likely to be possible to fund (it’s not new and exciting enough). The most plausible way of organising it seems to me to be crowd-sourcing of OCR work on charter scans (or checking already-OCR’d documents), along with adding some basic XML markup and then sticking them in a repository. Monasterium seems the obvious one to use. Whether there would be enough researchers interested in charters from more than one foundation to make the effort of doing this worthwhile, however, I’m not sure.

2) Databases based on the printed edition model
Printed editions of charters are normally either arranged chronologically or include a chronological index. (There are a few cartulary editions which don’t have this, and I have winced at having to look through hundreds of pages to spot if there are any Carolingian charters). The vast majority of printed editions also have indexes to personal names and place names. In contrast, content analysis of the charter is often fairly limited, in the form of headnotes plus a narrative introduction.

The indexes to printed charters, if they’re done properly, work pretty well for the needs of many people working with these sources. Or, to see it from a different angle, historians studying charters arrange their research into these kind of categories. As a result, where such indexes don’t exist in the original edition, you’ll often find that someone creates them later (like Julius Schmincke doing an index to Dronke’s edition of the charters from Fulda).

A lot of charter databases are still essentially arranged around these traditional print access methods, with digitisation essentially adding (often fairly basic) full-text search and remote access. Many of the online charter projects that have got furthest have been digitisations of relatively small and coherent existing charter collections, which have already been published in a single print series. There are several based on national collections, such as Sean Miller’s database of Anglo-Saxon charters, Diplomatarium Norvegicum and Diplomatarium Fennicum. There are also some regional charter databases of the same type (such as the Württembergische Urkundenbuch), and the early twentieth-century edition of the Cluny charters has also been put in a database. And then, of course, there’s the charters section of the digital Monumenta Germaniae Historica.

3) Aggregator databases
There are also a few charter database projects which are based on aggregating multiple printed editions: the two most important are Monasterium and Chartae Burgundiae Medii Aevi.

4) Born digital/hybrid editions
In contrast to the substantial projects of digitising existing editions, most of the born digital (or moved to digital) charter databases seem to be fairly small scale. The one exception I’ve found so far is Codice diplomatico della Lombardia Medievale which has now put over 5,000 Lombard charters from the eighth to twelfth century online.

5) Databases of originals
There is also a slightly separate strand of digital diplomatics research, which has focused on charters which are preserved in the originals (rather than as cartulary copies, etc). Some of these databases just include the text, others focus on images of charters. Projects include ARTEM and the (basic) database now attached to the Chartae Latinae Antiquiores publishing project. I’m also aware of several more image-focused projects, such as the Marburg Lichtbildarchiv, and Pergamo Online, which contains images of parchments preserved in Pergamo.

I’m not going to discuss the image databases in any detail, because they’re a very different kettle of fish to the textual databases I’m used to working with, but it is worth noting how decisions made on how much detail is recorded for original documents can be fairly arbitrary. As George Vogeler points out, there’s an odd division for the St Gall charters between the early stuff that gets put in horrendously expensive printed ChLA editions and the material from the eleventh century onwards that is available free via Monasterium.

6) Linguistic projects
I also won’t say much about charter database projects that focus on linguistic analysis of texts, such as Corpus der altdeutschen Originalurkunden bis zum Jahr 1300, Langscape and the work being done by people like Rosanna Sornicola and Timo Korkiangas. While this is interesting work, it seems to me of less immediate relevance to most historians.

7) Factoid model
As Patrick Sahle put it in a recent paper (“Vorüberlegungen zur Portalbildung in der Urkundenforschung”, Digitale Diplomatik: Neue Technologien in der historischen Arbeit mit Urkunden, Archiv für Diplomatik, Schriftgeschichte, Siegel- und Wappenkunde, Beiheft 12, edited by Georg Vogeler (Cologne, Böhlau Verlag, 2009), 325-341 at p. 338), the object of diplomatic research is the individual charter. Most database projects are structured in a way that reflects this focus on the charter as a unit.

A contrast is given by the factoid model adopted by a number of KCL projects, such as the Prosopography of Anglo-Saxon England and what will shortly become the People of Medieval Scotland project. Here, the key unit is the factoid, a statement of the form: “Source S claims Agents X1, X2, X3 etc carried out Action A1 connected with Possessions/Places P1, P2 at date D1.” A charter (or another source) can thus be broken down into a number of factoids, allowing finer-grained access to the content of charters. Although this may not seem an obvious approach to considering charters (and there are a number of practical problems), it does match surprisingly well to the “Who, What, Where, When, How do we know” model that I’ve mentioned before as one approach to working with charters.
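As a sketch of how such a model might look as a data structure (the field names and the example charter are my own inventions, not the actual PASE or PoMS schema):

```python
from dataclasses import dataclass, field

@dataclass
class Factoid:
    """One source-backed claim: source S says agents X did action A at P, D."""
    source: str
    agents: list
    action: str
    places: list = field(default_factory=list)
    date: str = ""

# One (invented) charter decomposed into two factoids.
charter_42 = [
    Factoid("Charter 42", ["Count Adalhard"], "grant", ["Villa X"], "823"),
    Factoid("Charter 42", ["Count Adalhard", "Abbot B"], "witnessing", date="823"),
]

# Finer-grained access: every factoid mentioning a given agent,
# regardless of which charter it came from.
hits = [f for f in charter_42 if "Abbot B" in f.agents]
print(len(hits))  # 1
```

The gain over a charter-as-unit database is visible even at this toy scale: you can query claims, agents and actions directly rather than re-reading whole documents.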

What works
As my overview suggests, there are already too many charter databases out there to make it easy to discuss them all in any more depth than “here’s another one that does X, Y and Z”. But there are some projects that seem to me to illuminate particularly important aspects of digital diplomatics:

1) DEEDS: full text done right
I’ve discussed before the problems of searching full-text databases of charters, but most projects don’t seem to respond to such problems. Instead they have very basic full-text facilities, and certainly nothing like the ability to use regular expressions that Jon Jarrett longs for.

The problem with regular expressions, of course, is that they still require an expert user. And as several generations of designers of library catalogues and other kinds of databases know, most users aren’t experts, and they don’t want to have to become experts to be able to use your database. Even if you learn the right syntax, how do you know what spelling variations to try searching for before you’ve seen what might be lurking in the database? For example, if you know that the MGH edition of one of Charlemagne’s charters (DK 169) refers to a particular county as Drungaoe or Trungaoe, how on earth would it occur to you that the same charter in Monasterium would name the place as “Traungaev”?

DEEDS is the only project I’ve seen so far that has really sophisticated analytical tools for full text. Its method of shingles, for example, is currently being applied to dating documents, but it strikes me as something that might also very usefully be applied to identifying the particular formularies used by someone drawing up a charter. By breaking a document down in this way, you can analyse multiple factors suggesting that a document is “nearer” to one model than another in a way that’s simply not practical with manual methods.
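As I understand it, shingling means breaking a text into overlapping word n-grams and comparing the resulting sets. A minimal version (the Latin phrases are invented, and DEEDS' actual matching is far more sophisticated) might look like this:

```python
def shingles(text, k=3):
    """The set of overlapping word k-grams ('shingles') of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard overlap of two shingle sets: a crude 'nearness' score."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

# An invented formulary model and a charter loosely following it.
model = "in dei nomine ego N dono trado atque transfundo"
charter = "in dei nomine ego Arnaldus dono trado atque transfundo ad ecclesiam"

print(round(similarity(model, charter), 2))  # 0.33
```

Because the score is computed over many small overlapping fragments, a charter can match a model closely even when names and odd words differ, which is exactly what makes the approach promising for formulary identification.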

Even more useful, potentially, is DEEDS’ use of normalisation. Their alternative-spelling option makes their search engine cope with a lot of the more common issues in searching Latin. But the really interesting part to me was their discussion of using normalisation to produce phonetic proxies. This takes a phrase such as “Sciant presentes et futuri quod ego Iohannes de Halliwelle” and reduces it to “scnt prsnt cj futr cj eg iohns pr hall”, the bare sounds of the key terms. A full-text search facility with a phonetic proxy option strikes me as one of the few ways you might be able to produce something that could find the multiple possible Latin spellings of the Traungau, without you needing to sit down for a week to work them out…
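I don't know the actual reduction rules DEEDS uses, but even a toy proxy along the same lines (drop vowels after the first letter, truncate) already collapses the Monasterium spelling and one of the MGH spellings of that county into the same search key:

```python
import re

def proxy(word):
    """Crude phonetic proxy: keep the first letter, drop later vowels, truncate."""
    head, rest = word[0], re.sub(r"[aeiou]", "", word[1:])
    return (head + rest)[:4]

def normalise(phrase):
    return " ".join(proxy(w) for w in phrase.lower().split())

print(normalise("Drungaoe"))   # drng
print(normalise("Trungaoe"))   # trng - still differs on d/t; a fuller
#                                scheme would merge those consonants too
print(normalise("Traungaev") == normalise("Trungaoe"))  # True
```

The point is not this particular rule set but the principle: searching over reduced keys rather than raw spellings is what lets one query catch many scribal variants.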

2) ARTEM: bringing in the users
ARTEM, the database of French original charters before 1121, is far from being the biggest or the most sophisticated charter database around. Where the project has succeeded, however, is in getting researchers actually to use the database. There have been several conference publications based on its work, e.g. Marie-José Gasse-Grandjean and Benoît-Michel Tock, eds. Les actes comme expression du pouvoir au haut Moyen âge: actes de la table ronde de Nancy, 26-27 novembre 1999. Atelier de recherches sur les textes médiévaux, 5. (Turnhout, Brepols, 2003).

What I’m not yet sure of is why ARTEM has been more successful than comparable projects in getting other scholars involved. Is it because they’ve been going longer, because they’re more pro-active in arranging roundtables, or because France has a weird early medieval charter distribution, with a large number of relatively small collections of charters, so that researchers desperately need a multi-archive database?

3) Monasterium: charters 2.0
Monasterium describes itself as a “collaborative archive” and it’s the only project I’m so far aware of that takes the idea of user participation seriously. As well as providing tools for working with and annotating individual charters (which I haven’t yet had the chance to try out), it’s also intended to provide a distributed infrastructure into which individual archives from across Europe can add their material. As a means of getting later medieval charters available online, especially for smaller archives, it looks ideal. In terms of data quantity and quality, however, it’s liable to the patchiness inherent in large-scale collaborative projects: some areas get very well covered, some don’t get referred to at all.

4) CBMA: blending old and new
Chartae Burgundiae Medii Aevi isn’t unusual in its scope: it’s aiming to put online the 15,000 charters from the region of Burgundy. What’s more unusual is its methods: it’s putting online both old editions and previously unedited cartularies. There are obvious issues here about whether they can get data consistency, but potentially it seems more practical to start with existing editions (however imperfect) and “grow” a database using them, than to wait for funding to re-edit everything from scratch.

5) DIY databases
All the databases I’ve discussed so far have been major research projects. However, the site created by Joan Vilaseca shows that it’s possible for a dedicated individual to produce their own web-based charter database, using easily available tools.
Joan uses a wiki format, which, for the relatively small number of documents he has, provides a neat way of showing links between people and places. The unstructured nature of the data may make it harder to search, but it also means that different genres of documents (not just charters, but hagiography etc.) can be incorporated easily. It’s a useful reminder that charter information doesn’t have to be stored in relational databases. (For another example of this minimalist approach, see Project FAST, which is putting a Florentine archive online.)

The site also raises an interesting point about audiences and the accessibility of charter databases. It’s in Catalan, which makes it far more suitable for what I presume is Joan’s main audience, people interested in the history of their own region. But for those of us who aren’t Catalans (and don’t specialise in Catalonia’s history), the use of a relatively uncommon language is a disadvantage.

Preliminary conclusions
The databases I’ve so far read about or seen prove that there are lots of interesting projects going on, but I do slightly wonder if there’s too much variety. Different audiences and different aims can explain some of the variation, but I think we need to start adapting more systematically from previous projects. I can see the components of really effective databases in some projects, but so far they’re not being pulled together into something that properly builds on the pioneering work. So I finish with a question for the more experienced users: what do you like from particular charter database sites? What should the Charlemagne project be stealing from other projects?

Making charters useful

I finished at the Fitzwilliam Museum at the end of December and started a new job last week: as Postdoctoral Research Associate on the new King’s College London project The Making of Charlemagne’s Europe: 768-814. Officially the project is intended to create a database of the surviving documentary evidence from Charlemagne’s reign. Unofficially, I see it as a project to make charters useful.

There are a lot of people, of course, who already find early medieval charters very useful. If you’re doing regional studies (of e.g. Catalonia or Alsace or Brittany), charters are essential evidence. But if you’re doing a study that isn’t regionally focused in this way, then frankly charters are less than ideal, because there are just too damn many of them. There are around 4,500 documents for Charlemagne’s reign alone. How do you find the ones that actually provide relevant information for your purposes?

This is why, potentially, our database will come in very handy, especially since it’s being designed by people who have considerable experience of previous similar database projects, such as the Prosopography of Anglo-Saxon England (PASE) and the Paradox of Medieval Scotland (POMS). The prosopographical side is thus very well covered. However, the plan is to have more: both mapping facilities and statistical analysis. We’re not providing the full text of charters, but we will be providing structured data of various kinds. So one of the questions we need to ask right at the start is: what information do researchers actually want to get out of the corpus of charters that they can’t get currently? Asking this question among the readers of this blog seems as good a place as any to start. I know you’re not all Carolingianists, but a lot of you will have worked with charters or bulk data of some kind. What research questions interest you for which such a database might be a help?

What follows is my first very rough list of possible research areas. All comments welcome; if you know of work that’s already been done, or if I’ve missed something out, please add it in. I’m still at the brainstorming stage, and this post reflects that.

1) Studies on literacy
Graham Barrett is another researcher on the project, so this angle may be fairly well-covered anyhow. He’s already done studies with later Spanish charters, looking, for example, at affiliations of scribes and the number of documents that particular scribes wrote. This immediately ties into research questions about the professionalism of scribes, and the extent of lay literacy.

I also wonder whether we should make a special note of charters that include references to books as property, so we can get a picture of where they are mentioned.

2) Family/women
Most of the detailed studies of families will obviously be done on a regional basis. But the prosopographical side of the database will enable us to create biographies of individuals/families who have a transregional activity. What I’m not yet sure of is what kind of data it would be useful to produce on such people. Given the strong spatial emphasis in the database, would it be useful to be able to map the activities of not just an individual, but a group of them?

One of the things we are definitely going to do is give the sex of every individual mentioned, which immediately makes possible a lot of the analysis about women’s land-holding etc. (It takes under a minute to dig out the 48 female witnesses from the POMS database, for example).

I think we need to have some kind of record of relatives being prayed for, though I’m not yet sure in how much detail. But this ties in usefully with discussions about which relatives “counted” in which situations.

It’d be nice to use charters for getting demographic data about families, as well, but that may be unrealistic. Has anyone seen this sort of thing done successfully?

3) Ethnicity
Despite all the problems with questions of ethnicity, it’s still interesting to see how the charters reflect this. We will probably be drawing on the work of Nomen et gens as far as ethnicity of personal names is concerned; what might also be useful to note is if specific ethnic terminology is used in charters to refer to people.

4) Legal practice
This is an area I know less about, so if anyone knows who’s doing interesting work on this, it’d be a help to know. My immediate thoughts for things it would be useful to record are the number of witnesses to a document (so you could, say, pull out documents with fewer than the six witnesses Alemannic law said you were supposed to have) and references to law/laws within the charter (whether specific or general).
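As a sketch of the kind of query this would allow, assuming a hypothetical mini-schema in which each charter record carries its witness list (the field names and data here are invented, not the project’s actual design), pulling out under-witnessed documents becomes trivial:

```python
# Toy records with invented identifiers and witness names --
# not the project's actual schema or data.
charters = [
    {"id": "charter A", "witnesses": ["Adalhard", "Bernhar", "Cunzo"]},
    {"id": "charter B", "witnesses": ["Otakar", "Perahtolt", "Hitto",
                                      "Reginbert", "Waldo", "Isanhart"]},
]

# Documents with fewer than the six witnesses Alemannic law expected:
under_witnessed = [c["id"] for c in charters if len(c["witnesses"]) < 6]
print(under_witnessed)  # -> ['charter A']
```

The point is that once witness lists are stored as structured data rather than buried in Latin prose, this sort of filtering takes seconds rather than weeks of reading.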

5) Monasticism
One useful piece of information would be to know how the collective membership of particular religious communities are described – are they ‘monachi’ or “deo sacrata” or what? It’d be particularly interesting to learn more about references to canons/canonesses.

It’ll be possible to break down charters by date and region, so we can potentially get comparative data on the well-known idea of “waves of pious giving” – how long do people keep on making large donations to churches/monasteries after they’ve been founded?

I don’t know if early Carolingian charters have enough boundary clauses to make this work, but Barbara Rosenwein’s classic study of Cluny collected data on the extent to which a donated piece of land was adjacent to property Cluny already held, which made it possible to see monastic land-acquisition strategies and how literally “being the neighbour of St Peter” was meant.

Looking at statistics for proportions of donations versus precaria for different monasteries/regions also contributes to the whole debate about pragmatic versus spiritual rewards for donors (which I always associate with Rosenwein on Cluny versus John Nightingale on Gorze). I also wonder whether there is any way of flagging up people who make donations to more than one foundation, given these may form particularly interesting test cases for studying how patronage decisions were made.

6) Military history
One of the questions we’re trying to work out at the moment is how much detail we go into about renders. Possibly we will just have a general term for animal renders, given the trade-off between precision in recording and time taken. But I do wonder if we should treat references to renders in horses separately, given their military importance. Any thoughts?

7) Price information
This is again an issue of how much detail we can put in without the project over-running, but how useful would it be to note if there are references to values in coinage? Wendy Davies did some promising studies on this for Spain.

8) Political history
One of the most useful possibilities that the mapping side of the project potentially allows us to explore is the nature of the Carolingian county. The arguments about “flat counties” versus “scattered counties” have been going on for decades: if we input the data right, we can explore in detail the geographical relationships that the sources themselves choose to mention.

It will also be useful to be able to map and contrast royal interventions between regions; while the data from royal charters is probably limited enough that this could be done manually, this project will potentially also allow us a transregional view of royal missi and vassi.

9) Social structure
Chris Wickham, in particular, has used charters from a number of regions for the comparative study of social structures, but of necessity, such work has normally drawn on syntheses of studies of a few locations. Potentially, this database allows wider comparisons, though both potential approaches to categorising social levels have their problems. The first possibility is using explicit references to office and social status within the charters: although there are problems in comparing these across the regions, they are potentially soluble. Perhaps even more intriguing is whether a social classification could be developed based on activity-derived status. In other words, could we find a way to mark all those who made more than a dozen donations, or witnessed over a geographical range of more than 10 miles, etc? This might show to what extent influential people exist who don’t obviously hold office or get called “nobilis” etc.
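As a sketch of how activity-derived status might be computed, suppose each charter appearance were reduced to a (person, role) pair; then flagging people above an activity threshold is a simple aggregation. The data and threshold here are invented (and tiny, standing in for the “more than a dozen donations” of the real proposal):

```python
from collections import Counter

# Toy data: (person, role) pairs extracted from hypothetical charters.
appearances = [
    ("Adalbert", "donor"), ("Adalbert", "donor"), ("Adalbert", "witness"),
    ("Bernhar", "witness"), ("Adalbert", "donor"), ("Bernhar", "donor"),
]

# Count donations per person:
donations = Counter(p for p, role in appearances if role == "donor")

# Flag anyone above an activity threshold (2 here; a real study might
# use a dozen donations, or a witnessing range of over 10 miles):
influential = {p for p, n in donations.items() if n > 2}
print(influential)  # -> {'Adalbert'}
```

The same pattern would extend to geographical range (aggregate the places where each person witnesses, measure the spread), which is what would let influential people surface even when no charter ever calls them “nobilis”.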

10) Rural and landscape history
Again, this is an area where bulk comparative data is potentially useful, but we have to work out how much detail we can go into, especially for landscape features in charters. Should these be regarded as purely conventional and excluded or are some of them worth listing specifically? I’m inclined to think it’s worth mentioning mills, but not huts, for example.

Those, for now, are my ideas of what we might do with our data, given the limitation I’ve already mentioned, that we’re not going to have the full text of charters. Any obvious suggestions that I’ve overlooked will be gratefully received.

What can the vulgus do? Crowd-sourcing for medievalists

In 802, Alcuin blamed a riot at Tours on the ‘untaught crowd [vulgus indoctum], who are always accustomed to do unsuitable things without counsel’. Recently, however, there’s been an increasing interest in the ‘wisdom of crowds’, and this year I’ve kept on coming across mentions of crowd-sourcing projects for historical purposes. Dan Cohen, in his Arcadia Lecture at Cambridge in April, mentioned several such projects. The closing plenary session at a recent conference on digital humanities brought another (Digital Bentham). This summer, Oxford University launched Project Woruldhord, a follow-up to a project on making a community collection of material on World War I. Oxford presumably think that crowd-sourcing has potential for work on the Middle Ages, but how else might medievalists be able to use such techniques?

One problem with any discussion of crowd-sourcing is that it covers so many different things. So, rather than getting into the details of projects, which I don’t really know enough about to do, I want to try and identify some broad themes. At the small-scale level, there’s getting quick answers to your problems via your Twitter followers: what might be called comitatus-sourcing. (As an aside for medievalists, studies of Twitter have suggested that it’s more hierarchical and less reciprocal than other social media, which has interesting implications if we’re using it as a model for opinion forming).

A lot of historical crowd-sourcing projects are predominantly concerned with creating mass archives. Such projects existed even before the development of the internet, as seen in the Mass Observation project and BBC Domesday Project. But new social media technologies have made the process of creating and maintaining such archives much easier, with less chance of digital obsolescence. Dan Cohen’s Arcadia lecture talked about how rapidly he’d been able to set up a digital archive about the 9/11 attacks, and also about how the archive was now being used for purposes he’d never imagined at the time, such as for linguistic studies on teenspeak at the start of the twenty-first century.

It’s noticeable, however, that almost all these attempts at mass archiving have dealt with topics that can be seen as ‘people’s history’: oral history, local history or family history. That doesn’t mean to say that they’re only of interest to historians in these fields: an exercise such as the crowd-sourcing element of the BBC History of the World in 100 objects project can produce some material culture of interest to medievalists. But I’m still not clear who will respond to Project Woruldhord. It seems to me to be pitched rather uneasily between academics (send us your lecture notes) and the public (send us your living history/images).

A different approach to crowd-sourcing comes from attempts in various academic fields to make use of mass volunteers. Some of these projects have been very successful: Dan Cohen referred to Galaxy Zoo, a project for classifying images of galaxies. A number of projects are trying such crowd-sourcing techniques within museums and historical projects. To name just a few, there’s the Victoria and Albert Museum asking visitors to choose the best images of objects, the Digital Bentham project for transcriptions of Jeremy Bentham’s writing, and lots of museum tagging projects.

As I’ve discussed before, I’m unconvinced about the ability of tagging to produce good results. But my mind was changed somewhat by discovering Freebase, a project that aims at crowd-sourcing what are essentially authority files. It has been successful enough that Google has bought it.

That’s when it dawned on me: what you need for mass volunteer projects isn’t actually crowd-sourcing, but nerd-sourcing. You need to find, among the vast number of vaguely interested, not very analytical people who look at web sites, the small number of tidy-minded obsessives who care deeply about the ethnic origins of Freddie Mercury or want to analyse statistical data for fun and no profit. And then you need to persuade these people to do as much work for you as you can.

The success of mass volunteering, therefore, is going to depend heavily on the number of well-informed enthusiasts ‘out there’. Dan Cohen mentioned a crowd-sourcing transcription project he was involved in: the papers of the early US War Department. He thought that the number of amateur historians interested in early US history meant that they would be able to get enough volunteers to do this effectively. The Library of Congress has also had a lot of success with its picture identification requests on Flickr.

In contrast, whether there are really enough Bentham enthusiasts to do transcriptions for the Digital Bentham project seems to me far more dubious. And transcribing medieval texts or identifying medieval images is something that only the most hardcore amateurs are going to be able to help with, though the Your Archives project by the UK National Archives offers one example.

Where does this leave crowd-sourcing for medievalists? There are a few possibilities I can see. One would be crowd-sourcing images of medieval buildings: there are already images on Flickr of extremely obscure medieval churches. Roger Pearse has also made the controversial suggestion that manuscript digitisation should be crowd-sourced. But all crowd-sourcing projects have costs, in terms of the time and money required to set up the project infrastructure, to monitor the input, motivate the crowd, and archive the results. I suspect that for most medieval topics, the vulgus is just too indoctum to make the effort worthwhile.

IMC 3: things to do with charters before you’re dead

Blogging the International Medieval Congress is itself increasingly historical in one sense: nowadays, you can get a range of reports on several of the key sessions, all written by historians with their own biases and agendas, and the attentive reader can try and reconstruct the event from multiple perspectives. In this spirit, I will rashly give you my thoughts on a couple of sessions that a fellow blogger organised on ‘Problems and possibilities of early medieval diplomatic’. Jon will doubtless give us a more informed take in time, but he is coming from the viewpoint that charters are intrinsically interesting, while I…am not.

I think I got turned off charters doing my MPhil at Cambridge, when I realised that there were volumes and volumes of Carolingian royal charters, none of which had been translated. Given that every project involving charters suggested to me seemed to involve reading dozens of them, I decided instead to focus my shaky translating ability on things that gave more immediate results. (OK, I realise now that charters can be read fairly quickly once you’ve got used to them, but I didn’t know that then).

Ever since, I have been gradually forced to admit that, actually, charters are very useful for studying all kinds of phenomena, and Jon’s IMC sessions this year gave a very good spread of both the kind of things you can study with charters, and even more interestingly, the scale you can work on.

At the most local scale, there was Jon’s own paper on St Pere de Casserres, a monastery in Catalonia. He was focusing on the oldest original charter, which records fictive sales to the monastery. How do we know they were fictive sales? Because some of the properties had already been transferred earlier to the founder of the monastery, Viscountess Ermentrude (Ermetruit) of Osona/Ausona. What case studies like this can give us is some feel for the texture of local power: for example, how new ‘histories’ are created (all the numerous people whose names appeared in this first charter were complicit in its fiction) and how power relationships worked (the early importance of viscountesses in Catalonia is very interesting).

Also on a local scale and focusing on Spain, but looking at a very different aspect of charters, was Wendy Davies’ paper on ‘Local priests in Northern Spain in the tenth century’, which was using charters to look at priests’ education. In fact she was focusing on one formula within a charter (nullius cogentis imperio/nullius quoque gentis imperio) and its multiple variants. It says much about Wendy’s near hypnotic scholarly force that not only was twenty minutes on one formula fascinating, but her wish to write a whole book on such language analysis seemed entirely reasonable (though I seem to remember she admitted it probably wouldn’t be publishable). Looking at these formula variations, Wendy saw different preferences between micro-regions, areas around particular cities that preferred one version of the formula, as well as individual preferences of some priests. Her analysis of charter-writing also showed different kinds of priest-scribes – some who were following aristocrats around, some who worked only in one location, writing for people who were probably peasants, to judge by the small scale of these exchanges. From this incredibly detailed study of charters, she can thus build up a picture of the background of these members of a purely local elite, far below the social level that other early medieval sources normally deal with.

At a larger scale, Julie Hofmann from Shenandoah University was looking at women’s participation in patronage at Fulda (and hoping to expand this to other Carolingian monasteries east of the Rhine). A fair chunk of the paper was showing how hard it is to spot distinctive trends in women’s activities, when charters mentioning women are a relatively small proportion of a charter corpus that itself is changing over time. For example, how significant is a drop off in women’s charters, when there’s also a general decline in charters after the reign of Charlemagne? And could the overall figures be distorted by a few untypical families, such as one prominent Mainz magnate family which had no surviving sons?

One difference Julie thought she could detect was that women were less likely than men to witness their own donations. (I think I remember this correctly, but my notes are a bit sketchy at this point). The problem is determining the significance of this, which means trying to look at when men do or don’t witness their own donations, and there aren’t any clear answers yet. Work on women in early medieval charters has been very much neglected since the early attempts at statistical analysis by David Herlihy, Suzanne Wemple and the like, so this kind of charter analysis potentially offers an important new avenue for looking at Carolingian women’s history. Whether we are going to see consistent gendered patterns in the diplomatic, I’m still not sure, but after all, gender analysis is about similarities as well as differences.

On a national scale, we had Erik Niblaeus on how the Cistercians brought charters to Sweden, which had somehow managed to survive without them until the 1160s. Charters offer a useful approach to looking at the ‘Europeanization’ of northern and central Europe, and certainly provide evidence for it at a textual level. As Michael Clanchy commented, looking at one of Erik’s images, you wouldn’t be able to tell it in style from a charter from almost anywhere else in Western Europe. Despite Erik’s title referring to the ‘Import of a Political Culture’, however, he wasn’t sure that charters could be connected to political institutions, because there was so little other evidence for them from the period. Instead, charters add ‘reassuring mystery and complication’ to our knowledge. (Erik thus shows himself firmly in the John Gillingham tradition of applauding the increase of uncertainty in historical scholarship).

Lastly (though actually the first paper of all) we had the global vision of Georg Vogeler, one of the people working on the Monasterium project, talking about this and other projects to get charter corpora on the web. The possibilities are substantial. Rather than the handful of images that traditional charter editions have included, you can in theory have images of all the charters. You can have access from anywhere in the world and there are new possibilities for rapid textual analysis. Georg gave an example about looking at vernacular dating clauses in German charters and being able to explore regional differences over time. Diplomatic differences that previously might only be spotted by an expert after half a lifetime can be explored within a week or two.

Of course, the full effect is going to take a long time coming, and the other papers showed how different researchers want different things. Jon’s work, and to a certain extent Erik’s, involved careful analysis of the specific physical form of charters, which needs high-resolution images. Wendy’s work requires full text (with non-normalized spellings). For the kind of larger-scale statistical analysis which Julie was interested in, in contrast, she didn’t really want the text of charters so much as standardized data from them (she’d constructed her own database to store such data). In theory, people could code full text to mark such key sections (as the Charters Encoding Initiative is thinking about), but it would still be an enormous amount of work. Georg said there are projects working on issues like automatic tagging of names, which might reduce some of these problems.

If we could get something like this working on a large scale, I think there are all kinds of new research areas that are opened up. For example, it strikes me as having great potential for socio-economic history. If you can relatively easily pull out charters referring to slaves or vineyards or mills etc, you can build up a selection of sources that you’d never have time to explore otherwise. Similarly, I once found a mention in a Freising charter about a woman serving at the royal court – it might be possible to find more evidence about that. All in all, after the two sessions, I’m starting to feel that I probably ought to be more enthusiastic about charters than I’ve previously been. Maybe, as a friend once commented, ‘charters are the new black’.