Fifty years of historical database angst

The Making of Charlemagne’s Europe project website has now gone live, and includes a post by me on interconnecting charter databases. In that post I mention a recent argument we had when trying to decide which of several different categories of transaction a particular document fell into. Just to show that such problems of coding documents are not new, here are some quotes from a recent article on Charles Tilly, a historical sociologist and a pioneer in the use of databases for historical research.

The Codebook for Intensive Sample of Disturbances guides more than 60 researchers in the minutiae of a herculean coding project of violent civil conflicts in French historical documents and periodicals between 1830–1860 and 1930–1960…The Codebook contains information about violent civic conflict events and charts the action and interaction sequences of various actors (called there formations) over time….we find fine-grained detail and frequent provision made for textual commentary on the thousands of computer punch cards involved.

(John Krinsky and Ann Mische, “Formations and Formalisms: Charles Tilly and the Paradox of the Actor”, Annual Review of Sociology, 39 (2013), p. 3)

The article then goes on to quote the Codebook on the issue of subformations (when political groups split up):

In the FORMATION SEQUENCE codes, treat the subformation as a formation for the period of its collective activity—but place 01 (“formation does not exist as such at this time”) in the intervals before and after. If two or more subformations comprise the entire membership of the formation from which they emerge, place 01 in that formation’s code for the intervals during which they are acting. But if a small fragment breaks off from a larger formation, continue to record the activities of the main formation as well as the new subformation.

If a formation breaks up, reforms and then breaks up in a different way, assign new subformation numbers the second time.

If fragments of different formations merge into new formations, hop around the room on one foot, shouting ILLEGITIMIS NON CARBORUNDUM.

(Krinsky and Mische, p. 4, citing Charles Tilly, Codebook for intensive sample of disturbances. Res. Data Collect. ICPSR 0051, Inter-Univ. Consort. Polit. Soc. Res., Ann Arbor, Mich. (1966), p. 95)

In nearly fifty years, we’ve gone from punch cards to open-source web application frameworks, but we still haven’t solved the problem of historical data (and the people behind it) not fitting neatly into the frameworks we create, however flexible we try to be.


10 thoughts on “Fifty years of historical database angst”

  1. Great to see The Making of Charlemagne’s Europe project going online! Just some anecdotal comments: the blog section does not have date information, nor comments enabled (yet?).

    On interoperability of electronic historical data: I think you are right, the difficult part is to harmonize people, not tools. In my experience, the most challenging part is getting the parties involved into a KISS mindset.


    • Joan – we’re trying to get the comments enabled: I’ve raised this with the Department of Digital Humanities who are developing the database and website for us. I’ll also add a request about dating of blog entries. But given the amount the developer has to do on the user interface, these may not be priorities. You’re always welcome to comment here in the meanwhile.

      The problem with interoperability is how you get simple systems of historical data that allow people to do the things they actually want to. Interoperability isn’t a goal in itself, it’s a means to an end. I’d therefore recommend looking at SNAP:DRGN, because it’s looking at how you can add a layer of interoperability over existing systems.


      • Thanks for the link to SNAP:DRGN; it looks quite interesting, as they are making their specifications public, and that’s surely the way to go.
        On interoperability: well, somehow, that’s usually the problem. People have specific projects to do and targets to meet, and that’s fine, as long as getting these immediate projects done does not prevent them from planning and building more essential, basic ones. For example, it pains me that history is done without standardizing or universalizing basic data such as documents or sources. When I think of interoperability, I am thinking about those basic building blocks, not just how to get project X to exchange data with project Y. Without basic interoperability standards, projects are forced to find specific methods to reconcile their own idiosyncratic datasets, which is nothing new.
        For example, on early mediaeval prosopography: The Making of Charlemagne’s Europe project will generate a dataset – if my memory serves me well – based on ‘factoids’ as ‘interactions’ between ‘actors’. It would be interesting to be able to express the structure of the information in this dataset in an XML schema, to allow import/export of data. The KISS part of this would be to keep the essential parts as simple as possible, to avoid forcing potential users to deal with unneeded complexity. A further step could be to try to reach a consensus on this data specification among other major projects and, once the method is known to be useful, to present it as an XML-TEI standard proposal for a wider audience.
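As a rough illustration of the kind of simple import/export format suggested here, the sketch below writes one factoid to XML and reads it back using Python’s standard library. The element and attribute names (`factoid`, `source`, `actor`, `ref`, `role`) and the sample identifiers are assumptions invented for this example; they are not the project’s actual schema and not TEI.

```python
import xml.etree.ElementTree as ET


def factoid_to_xml(factoid_id, factoid_type, source, actors):
    """Serialise one factoid as a small XML element (hypothetical layout)."""
    root = ET.Element("factoid", {"id": str(factoid_id), "type": factoid_type})
    ET.SubElement(root, "source").text = source
    for agent_id, role in actors:
        # Each actor points at an agent record and names its role in the factoid.
        ET.SubElement(root, "actor", {"ref": str(agent_id), "role": role})
    return ET.tostring(root, encoding="unicode")


def factoid_from_xml(xml_text):
    """Parse the XML produced above back into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "type": root.get("type"),
        "source": root.findtext("source"),
        "actors": [(a.get("ref"), a.get("role")) for a in root.findall("actor")],
    }


# Round trip with made-up identifiers.
xml_text = factoid_to_xml(
    1, "transaction", "charter-42", [(101, "donor"), (102, "recipient")]
)
record = factoid_from_xml(xml_text)
```

Keeping the format this flat is the KISS part: anything project-specific would have to be layered on top rather than baked into the core elements.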


  2. How to protect yourself from feudal violence, and other links

    Today there is only time for a links post, I’m sorry about that. But happily I had most of one ready in the backlog drawer, and they’re all of reasonable moment. The refuge site at Bléré Val-de-Cher, seen from above during exc…


  4. A short comment on “Building a charter database 1: the factoid model and its discontents”

    Some basic questions:
    1) How do you manage the identification of people, and how does it relate to authority lists?
    2) Are the witnesses subscribing to charters also being recorded?
    3) Do you have some kind of ‘generic’ factoid type, or can factoids only exist in a predefined set of types/semantics?

    A plea:

    Without access to charter texts (as stated in a previous post by Edward Roberts), usability will be severely impaired, imo.


    • Dear Joan,

      1) I’ll be talking about people in the next post – which I need to sit down and write now!

      2) Yes, the witnesses to charters are being included in the charter main model. What I was talking about in the post was that we haven’t split off information on the charter as a source from the charter writing/signing as an event.

      3) The miscellaneous factoid type is effectively a generic type – you can put varied combinations of agents, objects and places in these and then create new descriptors to explain how they connect together. The transaction factoid is one particular specialised way of connecting together agents, places and objects which allows you to specify (and search for) the way in which possessions are flowing between the parties involved.

      As for the lack of full text, we’re providing links to the full text online where it’s available. But in a fair number of cases the text of the most recent edition is under copyright, so we’re limited as to what we can provide, and including the full text of very old editions (especially if taken from OCRd texts) would probably be actively misleading.
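To make the distinction in point 3 concrete, here is a minimal sketch of a generic factoid and a specialised transaction factoid. All class and field names (`Participant`, `Factoid`, `TransactionFactoid`, `giver`, `receiver`, `possession`) are assumptions made up for illustration; the project’s real data model is not described in enough detail here to reproduce it.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    kind: str        # "agent", "object" or "place"
    ref: str         # identifier of the agent/object/place record
    descriptor: str  # free-text label explaining its part in the factoid


@dataclass
class Factoid:
    """Generic ('miscellaneous') factoid: any mix of described participants."""
    charter: str
    participants: list = field(default_factory=list)


@dataclass
class TransactionFactoid(Factoid):
    """Specialised factoid recording which way a possession flows."""
    giver: str = ""
    receiver: str = ""
    possession: str = ""


def possessions_received_by(factoids, agent_ref):
    """Search only the transaction factoids for things an agent received."""
    return [
        f.possession
        for f in factoids
        if isinstance(f, TransactionFactoid) and f.receiver == agent_ref
    ]


# Made-up records: one generic factoid and one transaction.
oath = Factoid(
    "charter-1",
    [Participant("agent", "A1", "oath-taker"), Participant("place", "P1", "where sworn")],
)
sale = TransactionFactoid("charter-2", giver="A1", receiver="A2", possession="O7")
received = possessions_received_by([oath, sale], "A2")
```

The generic type can hold anything, but only the specialised type has fixed fields to search on, which is the trade-off described above.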


  5. A short comment on your ‘Building a charter database 2: agents and their characteristics’ post on ‘The Making of Charlemagne’s Europe project’ (the comment system seems to be still under construction).

    It is nice to see that the project decided to use the same conventions as the Fons Cathalaunia project when dealing with person identification, namely: 1) when in doubt, generate a new identity/agent record, and 2) use a notation to describe identities that are probably compatible and could be amalgamated. Have you read my ‘Detecció de grups d’homònims en documents de l’Alta Edat Mitjana’?

    One basic question about internal data representation, and a word of caution. I see that every agent has a numeric agentID; if the project has to express, e.g., that in document D, X is ‘father of’ Y, how is this relation expressed in the database? As a link between agentID=X and agentID=Y (using numeric agentIDs)?

    In the Cathalaunia project, I first used this kind of ‘direct’ inter-agent addressing method, until I realized that manual changes in identification also forced me to transfer the old relations manually…
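The comment does not say what replaced the ‘direct’ method. One possible alternative, sketched here purely as an assumption, is to attach relations to document-level mentions and keep the mention-to-agent assignment in a separate table, so that re-identifying a mention carries its relations along automatically. The identifiers (`D1:m1`, `agent-10`, and so on) are invented for the example.

```python
# Hypothetical layout: a relation links two *mentions* (a name as it occurs
# in one document), never two agent records directly.
relations = [("D1:m1", "father of", "D1:m2")]

# Separate table assigning each mention to an agent identity.
mention_agent = {"D1:m1": "agent-10", "D1:m2": "agent-20"}


def resolve(relations, mention_agent):
    """Translate mention-level relations into agent-level ones on demand."""
    return [(mention_agent[a], rel, mention_agent[b]) for a, rel, b in relations]


before = resolve(relations, mention_agent)

# Decide that the first mention is really a different person: only the
# assignment table changes, the relation itself is left untouched.
mention_agent["D1:m1"] = "agent-11"
after = resolve(relations, mention_agent)
```

With direct agentID-to-agentID links, the same correction would mean finding and editing every relation that mentioned the old agent.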

    It’s not specified, but I suppose that data in the project will be addressable with stable URLs (something like: fixed_prefix/agentID/X).

    I can subscribe to and sympathize with most of your comments about the difficulty of building such a database: errors, contradictory statements, local variations, etc. Lots of decisions to take, lots of work to try to get it right!

    A final note. I have to insist on the convenience of online access to the sources, for at least two reasons. 1) I can imagine lots of research topics where getting the basic data from the database (e.g. documents, agentIDs, whatever) is just a first step; the ‘real’ research starts with examining the data obtained, and that will probably mean reading the sources, so without online access usability will be severely limited. 2) As you say in your post, agent identification is not an easy or obvious matter; to be able to suggest the reunification of some agents you also need to check it against the sources. The same can be said of being able to spot human errors (any massive manual data entry task has a non-zero error rate). Producing a ‘local’ text when no free online version is known is just a fraction of the work done to build up the database…


    • Dear Joan,

      A few belated responses:

      1) I haven’t yet read your paper on detection of homonyms, but I probably ought to do so, so I’ll make a start on it. But I’m afraid my Catalan is non-existent, so it takes me a long time to work through anything in that language.

      2) I’ll talk a bit more in a future post about how we record statements like “in charter C X is father of Y” (in so-called agent/relationship factoids). But yes, we would have to make manual changes if we subsequently decided that the agent X1 in charter C was in fact X2. On average though, we’ve got less than 1 attribute for every agent recorded, and you can change the details within an attribute factoid in a couple of minutes, so such changes are feasible unless we’ve got wholesale reassignment.

      3) The current test version has fixed URLs for individual charters, agents, places and factoids. I presume the public version will have those as well.

      4) For the charters you’re currently putting online, what are you doing about copyright clearance? One of the issues for us is that a number of the editions we’re using are still in copyright: although the original document may have been produced hundreds of years ago, the text of the edited version is copyrighted when it’s produced (because editors have made an intellectual contribution to the work). So, for example, Karl Glöckner’s edition of the Wissembourg charters is in copyright until at least 70 years after his death in 1962. (I think it’s actually longer than that, because it was posthumously published). That’s 130 charters for our time period immediately where we’d either have to find and OCR an old edition or have long negotiations, which is why we didn’t think it was feasible.


      • 1: I really wish my English was good enough to make a translation… (maybe Catalan/English Google translation can help?) 😦

        4: It’s explained on the website. What I do is:
        1) rewrite the texts (use free versions if they exist; whenever more than one version is available, mix versions).
        2) change / normalize punctuation, conjunctions, numbers, case of names and locations, text layout, etc.
        3) do not incorporate any ‘modern’ textual integrations.
        4) on a non-visible level, include markers to segment the text.

        On a global level:
        1) Make explicit that the version presented is not supposed to be an accurate rendition of the actual texts, just a version to help contextualize the information presented.
        2) Always incorporate external references to the sources used.

        It’s not a panacea, as you have to rewrite the texts, but in my experience that’s not more than 5-10% of the total work when processing a document: 5-10% producing the text, 10-15% extracting data, and the rest, 75-85%, identification.

        Basically, what I am producing is a low-quality new version with links and references to the ‘good’ ones. Which is in line with the prosopographical philosophy of delegating authority to external sources (encyclopedias, bibliographic references, etc.) wherever possible.

