Information technology has transformed our lives in many ways: DNA sequencing, Google searches, MRI scans and cash machines all rely on the ability of computers to perform rapid and accurate calculations on large volumes of data. Pattern recognition, however, is one task at which the human brain still easily excels, and developing an artificial intelligence which can perform anywhere near as well, even in a limited context, poses a significant challenge.
Nevertheless, machine reading has developed rapidly over the last few years. Whenever you run a free text search across a digital image of a typed or printed text, whether on a website or in an electronic document, that search is made possible by software using Optical Character Recognition (OCR). OCR algorithms parse structured texts and match the shapes of the letters, with varying degrees of sophistication. OCR has revolutionised our ability to search documents, but because it relies so heavily on structured and standardised text, its use has largely been limited to printed material. Handwritten documents present a whole new set of problems: even a trained scribe will not reproduce every letter identically each time they write. Because the human brain is very good at second-guessing and ‘filling in the gaps’ to make sense of what is around us, these are problems which, with a little training, we can surmount. Getting an artificial intelligence to do the same is not so easy.
When it comes to reading written texts, humans with a competent level of literacy do not process text letter by letter: we generally scan for recognised combinations which we expect to occur in a given context. So it is not surprising that, in the last few years, IT developers have been trying to emulate this approach with software that reads handwriting: Handwritten Text Recognition (HTR) software. This is done using ‘training data’: handwritten texts are fed into the software along with their transcriptions, allowing the program to learn correlations between the handwritten forms and the transcribed text. The more training data is available, the more accurate the results become, as false correlations are gradually eliminated. It is a sophisticated example of machine learning, and one that is being improved all the time.
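For readers curious about what ‘learning from training data’ means in practice, the idea can be illustrated with a deliberately tiny sketch: a toy nearest-neighbour matcher, not how any real HTR engine works. Each labelled training example is a small pixel grid standing in for a scanned letter, and an unknown shape is assigned the label of the training glyph it most closely resembles. All of the glyphs and labels here are invented purely for illustration.

```python
# Toy illustration of the training-data idea behind HTR (not a real HTR system):
# each handwritten letter is reduced to a small binary grid, and an unknown
# shape is matched to the labelled training example it most closely resembles.
import numpy as np

# Hypothetical 3x3 "glyphs" standing in for scanned letter images.
training_data = {
    "l": np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    "o": np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]]),
    "c": np.array([[1, 1, 1], [1, 0, 0], [1, 1, 1]]),
}

def classify(glyph):
    """Return the label whose training glyph differs in the fewest pixels."""
    return min(training_data, key=lambda label: np.sum(training_data[label] != glyph))

# A slightly malformed 'o' (one pixel wrong) is still matched correctly,
# because it remains closer to the 'o' template than to any other.
smudged_o = np.array([[1, 1, 1], [1, 0, 1], [1, 1, 0]])
print(classify(smudged_o))  # -> o
```

With only three training glyphs, a single stray pixel can be enough to tip a shape into the wrong class; real HTR systems train on many thousands of labelled lines precisely so that such false matches are outvoted, which is why the quantity of training data matters so much.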
We’re delighted to inform you that Adam Matthew Digital are now implementing HTR on some of the material in ‘Literary Print Culture’, their resource presenting the digitised documents from our archive. HTR-enhanced texts have a pencil symbol next to their titles. AMD have posted a news story on their website which you are encouraged to share: https://www.amdigital.co.uk/about/news/item/htr-technology-added-to-literary-print-culture
We hope that the new HTR facility will help researchers to make the most of our digital resources. Please be aware, though, that no electronic search system is flawless, so if you don’t find what you’re looking for, please feel free to contact the Archivist for further information. In particular, earlier handwriting is not always easy to parse. For instance, a search on Liber C for the word ‘goose’ produced some odd results, including the following:
The highlighted and misidentified texts read ‘[preac]her of’ and ‘peace’ respectively: clearly the software has been looking for the downstroke of the ‘g’ followed by a sequence of curved letters without ascending or descending strokes, and I think most of us would agree that the actual writing in the texts is far from obvious! So do use a combination of search techniques, including emailing the Archive; in the case of the early entry books of copy, the transcriptions of Arber and Eyre are also available to search.
If you’re interested in learning more about HTR technology, there is an excellent blog post by Richard Dunley on HTR at work in the National Archives at https://blog.nationalarchives.gov.uk/blog/machines-reading-the-archive-handwritten-text-recognition-software/, and regular, highly informative blog posts on UCL’s Bentham Project at http://blogs.ucl.ac.uk/transcribe-bentham/.