Over at A Corner of Tenth-Century Europe, a discussion that started off being about the trajectory of women’s history has mutated into one about why someone isn’t creating a system of unique identifiers for medieval texts. And while I’ve spent the last decade or so thinking about gender history, I’ve spent half my life thinking about databases and identifying references uniquely, because that is one of the things librarians do all day. So I wanted to start from Joan Vilaseca’s plea for “A public and standarized corpus of classical/ancient texts with external references to editions, versions, comments, articles, etc,etc.etc”, sketch out what I’m aware of as existing and explore why history seemingly can’t get its act together in the way that chemistry or taxonomy has.

There are actually some databases that do a fair amount of what Joan would want. As examples, there are:

1) Perseus Digital Library. This is a big and sophisticated free collection of classical texts, including some very neat tools, such as Greek and Latin word study tools (which I freely admit to using when I’m stumped on working out the root verb from a conjugated form). This doesn’t have identifying references, however.

2) Library of Latin Texts. This commercial database includes the full text of the whole corpus of Latin literature up to the second century AD (essentially taken from the Teubner editions), plus a lot of patristic and medieval Latin (largely, but not entirely, taken from the Corpus Christianorum series). Associated with this is the Clavis Patrum Latinorum, which provides a numbered list of all Christian Latin texts from Tertullian to Bede. (There are similar indexes which cover Greek patristic texts, apocrypha, and early medieval French authors.)

3) Thesaurus Linguae Graecae. This database includes most literary texts in Greek from Homer to the fall of the Byzantine Empire. It’s a subscription service, but it includes a free online canon database that provides unique identifying numbers for works and parts of works.

4) Bibliotheca Hagiographica Latina (BHL). This is a catalogue of ancient and medieval Latin hagiographical materials, produced by the Bollandists, which provides unique identifying numbers for different texts. There’s also a free online version.

5) Leuven Database of Ancient Books. This free database includes basic information on all literary texts preserved in manuscript from the fourth century BC to AD 800; the texts are assigned a unique number. (It’s a subset of the Trismegistos project, which focuses on documents from Graeco-Roman Egypt, both literary and non-literary, and also provides identifying numbers.)

What this very brief overview reflects is one basic fact: to produce a database and/or identifying system of any size takes time and money. As a result, there have to be enough people wanting the result to make it worthwhile making that investment. There are several different models for financing such projects: you can sell the resultant database (either for profit or at a break-even price), or you can persuade funding bodies to support you, or rely on charitable donations, but you need someone willing to pay.

It’s worth looking here to see why identifier projects in other fields have succeeded. A lot of large-scale identifier projects, for example, have come out of library science and publishing, both because these are huge and connected networks and because there’s the commercial driver of being able to identify something in your inventory quickly and accurately. So the Standard Book Number, developed for WH Smith in the 1960s, became the ISBN of today, followed in the 1970s by the ISSN for serials, etc. It’s noticeable that, once unique identifiers for serials existed, it took more than twenty years for unique identifiers for individual articles within those serials to develop (the CrossRef project using DOIs). This wasn’t because no user ever wanted an individual article to read before then; it was because it was only with electronic journals that it became feasible to try and sell individual articles to people.

Most of the other really large-scale nomenclature/identifier projects have been in the sciences, for the simple reason that the same phenomena are being studied all over the world. We’re (mostly) looking at the same sky, hence the International Astronomical Union was formed in 1919. The International Union of Pure and Applied Chemistry, responsible for chemical nomenclature, also dates from a similar period. (One of the other main systems of chemical nomenclature, the CAS Registry Number, is an offshoot of the subscription index/database Chemical Abstracts.) Again, people are trying to do the same chemical reactions from Bombay to Los Angeles, so there’s a big demand for such systems. Biological classification has a very long history, dating back to Linnaeus (although unique identifiers are only just being developed), reacting to thousands of years of attempts to show how all species are related.

The classical/medieval database projects that I’ve mentioned above have essentially been possible because they have a sufficiently tightly-defined group of potential users who are all interested in the same sort of thing: classical literature or papyrology or hagiography. It’s therefore worth creating something for them to use. The problem with extending such a system to broader historical areas is that no-one cares about history.

That sounds ridiculous, but it’s a problem I’ve mentioned before: it’s not really clear that we’re doing the same thing as historians when we study vastly different periods and use completely different sorts of sources. Or to put it a different way, the Old Bailey database is a remarkable resource, but not of any professional use to me. I don’t care about all history, everywhere; I care specifically about early medieval European history. Historical sources, even just medieval sources, aren’t one thing, but a patchwork of different islands and most researchers spend most of their time perched securely on a few of these, rarely venturing off them. I’ve had years of being an early medievalist and never needed to cite Sawyer numbers, for example, because I don’t research or teach Anglo-Saxon history; I’d be almost equally baffled if I came across Corpus Iuris Canonici footnotes without the help of Edward Peters. The patchwork systems of identifying medieval documents remain because of the lack of overlap between the groups of researchers using them, and I can’t see any driving force that is going to change that. Crowd-sourcing has produced some remarkable things, but creating unique identifiers is a peculiarly ill-suited task for crowd-sourcing. Unless more people start caring about the history of everywhere at all times, Joan isn’t going to get the wide-ranging system he’d like.


  1. Thanks for such a well documented post (I do learn a lot reading you)!
    I think you rightly identified the main problems for such a global project as a unified corpus of ancient texts.
    1) Historian’s work is quite polyhedral.
    2) ‘No-one cares about history’. That’s something I can witness myself when talking to professional historians. In fact, it’s so common that it is not unusual here to gift or exchange books/articles between the very few interested ones. Sadly, it seems the same lack of interest exists at higher levels…
    Jonathan Jarrett also raised a third and very realistic difficulty, stating there’s already an established common practice for source citation, so there’s no pressing need for such a global system.
    I can understand that’s how things are now, as heirs of a solid book-based tradition, but as you already said, computers are changing our possibilities, and I am quite confident that new developments are enriching, and will enrich, historiographical methodologies; although making a global corpus of ancient texts with unique IDs will probably not be one of those changes. 🙂


  2. I think there is also the issue that it’s not necessarily clear what the base unit of any such classification is. With a library catalogue (and with citation of print works), there’s a basic distinction that can be made between something that exists as a stand-alone and something that is part of a larger work–book versus article, essentially. Many medieval texts complicate this, either by being preserved in manuscripts substantially given over to other things, or by being shared between several different works (I used the example of the Spanish ‘Prophetic Chronicle’ in the discussion at mine) and even if one takes a ‘text’ as a basic unit, getting round whether it needs to be a whole work or not, things like glossed manuscripts mess that up too. We consider, and near-contemporaries considered, for example, Peter Lombard’s Sentences a work in its own right, but he thought of it as a commentary on the Bible and it makes little sense without the Bible to refer to. Many lesser glossators are more inarguably part of a version of the Bible, not works in their own right. But whatever system you’re using, the Vulgate Bible or its books are surely at the base unit level. What then is the commentary that differentiates the manuscripts? And that opens up a bigger problem, that almost no two manuscripts of any text are completely the same, so the threshold of difference at which a recension becomes a separate work (the various books of the Mabinogion might be an example here) is going to be important, and endlessly disputable. It’s slightly amazing to me that we can achieve rough consensus as to what sources we mean at all, but it doesn’t take long to mess it up once you get into the details.


  3. One of the little-known by-products (or, to put it better, cornerstones) of the Dictionary of Old English project at the University of Toronto (on which I was a research assistant many moons ago, working on F- and G-) is the comprehensive List of Texts cited, available online at http://www.doe.utoronto.ca/st/index.html.

    It’s quite a good case in point for what you’re talking about.

    It amounts to a complete catalogue of the Old English corpus, with essential bibliographic information. Each text is assigned a title (and standard abbreviation), and significant manuscript variants of each text are distinguished. An earlier catalogue by Angus Cameron is fully absorbed.

    It’s an impressive and under-recognised resource: despite the fact that it serves the interests of one of these ‘tightly-defined group[s] of potential users who are all interested in the same sort of thing’. It’s worth pointing out that this is a comparatively small corpus of texts, and the work that has gone into its production is staggering. The DOE had the singular advantage that they have images of the entire corpus of Old English texts available on site. It would no doubt take two minutes to find all kinds of flaws in the system, but if it’s robust enough for the DOE, it’s robust enough for me. It might provide a model for similar projects: but I simply can’t imagine anything quite so ambitious being produced for many other corpora, even for dictionary projects (which tend to be more selective).


  4. Sorry for delay in replying – as usual I’m trying to juggle numerous different projects and time slips by.

    Joan and jpg’s comments tie neatly together in one way: it may be becoming technically easier to ensure the supply of a database/list of unique identifiers, but that doesn’t mean much unless there’s also a demand-side interest. Things that may be technically wonderful aren’t necessarily going to get used by scholars. It was particularly interesting to hear about the Dictionary of Old English list of sources, because this sounds as if it has lots of overlap with source lists produced by several other projects, such as the Fontes Anglo-Saxonici and Prosopography of Anglo-Saxon England. But the problem is that no source list matches exactly what everyone wants, so new versions keep coming along.

    On Jon’s comments on basic units, this is a problem that librarians (or rather cataloguers, the true obsessives in the information world) have spent a lot of time thinking about. The best starting point for anyone designing a system now would probably be the FRBR (Functional Requirements for Bibliographic Records) model, which distinguishes between a work (the Platonic ideal of a creation), an expression (the specific form a work takes), a manifestation (the physical embodiment of that expression) and an item (the single thing you hold in your hand). For a discussion of this with examples and spooky Dracula music, see this video by librarygeek.

    There’s also a version of the model called FRBRoo which is intended to be used by libraries and museums and so might be more attuned to unique rather than mass-produced objects. This undoubtedly wouldn’t solve all the arguments, but again, we don’t need to reinvent the wheel completely on this.
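To make the four FRBR levels mentioned above a little more concrete, here is a minimal sketch of how they might nest as records. It is only an illustration, not the official FRBR schema: the class and field names, the identifier "EXAMPLE-0001" and all the sample data are invented for the purpose.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """The single thing you hold in your hand (one manuscript, one copy)."""
    location: str  # invented shelfmark, for illustration only


@dataclass
class Manifestation:
    """The physical embodiment of an expression (e.g. a given edition)."""
    description: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    """The specific form a work takes (e.g. a recension or translation)."""
    label: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    """The abstract creation; this is the level a unique identifier would name."""
    identifier: str  # hypothetical identifier scheme
    title: str
    expressions: List[Expression] = field(default_factory=list)

    def all_items(self) -> List[Item]:
        """Walk down all four levels and collect every physical item."""
        return [
            item
            for expression in self.expressions
            for manifestation in expression.manifestations
            for item in manifestation.items
        ]


# Invented sample data: one work, two recensions, three physical items.
work = Work(
    identifier="EXAMPLE-0001",
    title="Example Chronicle",
    expressions=[
        Expression(
            label="Recension A",
            manifestations=[
                Manifestation("Manuscript copy", [Item("Library X, MS 1")]),
                Manifestation("Printed edition", [Item("Library Y, copy 1")]),
            ],
        ),
        Expression(
            label="Recension B",
            manifestations=[
                Manifestation("Manuscript copy", [Item("Library Z, MS 7")]),
            ],
        ),
    ],
)

for item in work.all_items():
    print(work.identifier, item.location)
```

The point of the nesting is that the argument over whether a recension counts as a separate work becomes a choice between adding a new Expression under an existing Work and creating a new Work; the model does not settle that argument, it only gives it a place to live.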


  5. I’ve posted a short draft of what I had in mind about how a unified corpus of ancient texts could work. It does not substantially change what has already been said; it just puts it in black and white. (I also have to ask in advance for some indulgence on the quality of the included English translation.)


