Punched Cards and a Concordance program
The 56-volume Index Thomisticus is a complete lemmatization of the works of Saint Thomas Aquinas and of a few related authors.
The aim of this project was to catalogue all the words appearing in Aquinas’ works, with cards mentioning the location of the word in the text, along with a quotation of the sentence containing the word.
One revelation was that the context of a word made a difference to its cataloguing. For example, both praesen and praesenti mean presence, but their significance varied. In Latin, too, one word could have many different meanings.
The cultural objective was to understand the author’s mind. Why Shakespeare used, or made up, the words he did is still hotly debated. Naturally, understanding a scholar’s mind is a fundamental tenet of the humanities. Robert Busa was now looking for mechanical aid, as with everything tech, to speed up the process. He had completed only 10,000 words, and there was a lot more to be done.
In an interesting mythical anecdote, Father Busa recollects that Thomas J. Watson (the CEO of IBM and, yes, the Watson that IBM’s Watson is named after) had a report deeming the concordance program impossible to bring to fruition. Father Busa then pointed to an old IBM poster with the slogan
“The difficult we do right away; the impossible takes a little longer”.
Watson stood by this, on the condition that IBM would remain International Business Machines and not “International Busa Machines”. I love these kinds of stories simply because of their good marketing value: an instant human touch and personal experience in a pretty momentous conversation in the history of Humanities Computing.
The first impediment was IBM machinery. Back then, in the burgeoning days of punched cards, the eighty characters recorded on a card could fit only one line of Aquinas’ hendecasyllabic poetry, that is, a line of eleven syllables. Eleven syllables per card was not nearly enough. Coupled with the processing time and the kind of quality Busa was expecting, this gave rise to a side project: the Dead Sea Scrolls project. Now, instead of punch cards, the progression to magnetic tapes was made.
The idea in the old IBM 705 system was to first sync up the text phrase by phrase: locate the phrase, break down the sentence into words, locate each word, and note the last letter of the preceding word and the first letter of the subsequent word, the number denoting the location, followed by a special character. The duplicates then had to be eliminated. Note that the duplicates could be the same words with different meanings. Father Busa elaborates in Inquisitiones Lexicologicae that each card had to be understood or interpreted by the machine. In the case of the Dead Sea Scrolls project, there were many rewriting attempts by the data-processing machine, especially when white space or missing words were encountered.
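The steps described above can be sketched in a few lines of modern code. This is only an illustration under stated assumptions, not a reconstruction of the IBM 705 procedure: the function name `build_concordance`, the punctuation used to split phrases, and the record layout are all hypothetical choices made for this example.

```python
import re
from collections import defaultdict


def build_concordance(text):
    """Return {word: [(phrase_no, word_no, prev_last, next_first, phrase), ...]}.

    Assumption: phrases end at '.', ';', '!' or '?'. The original system
    had its own rules for marking phrase boundaries.
    """
    phrases = [p.strip() for p in re.split(r"[.;!?]+", text) if p.strip()]
    concordance = defaultdict(list)
    for phrase_no, phrase in enumerate(phrases, start=1):
        # Break the phrase into lower-cased words (letters only).
        words = re.findall(r"[^\W\d_]+", phrase.lower())
        for word_no, word in enumerate(words, start=1):
            # Last letter of the preceding word, first letter of the next one.
            prev_last = words[word_no - 2][-1] if word_no > 1 else ""
            next_first = words[word_no][0] if word_no < len(words) else ""
            # Each record keeps the location and the quoted phrase.
            concordance[word].append(
                (phrase_no, word_no, prev_last, next_first, phrase)
            )
    # Keying by word merges duplicate spellings under a single heading.
    return dict(concordance)


if __name__ == "__main__":
    sample = "In principio erat Verbum. Et Verbum erat apud Deum."
    for headword, entries in sorted(build_concordance(sample).items()):
        print(headword, entries)
```

Grouping by spelling alone merges identical words that carry different meanings, which is exactly the case that still needed a human reader to separate.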
Paul Tasman from IBM helped Father Robert Busa with linguistic automation. Together they formed the “Centro Automazione Analisi Linguistica” (CAAL), the “Comitato Promotore” and the “Collegio d’Iniziativa” to monitor the outcome of lemmatization as they put together the works of Thomas Aquinas.
If you are wondering what lemmatization exactly entails in linguistics, it is the process of reducing a word to its lemma, or dictionary form. It is not to be confused with stemming, which trims a word without caring for its context. For example, if you enter the word “art” in the Google search bar, relating it to “art-ist”, which is built on it, respects the word’s meaning, whereas “articulate” is unrelated even though it contains “art”. A context-blind stemmer cannot tell the two cases apart. It is on this kind of complex NLP that Busa and Tasman’s team based their attempt to come up with at least a semi-automatic way to categorize words by their dictionary heading. No wonder the Index Thomisticus took the better part of 4 decades to publish.
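To make the contrast concrete, here is a toy sketch. It assumes a deliberately crude stemmer that just truncates words, and a tiny hand-written lemma table; both `naive_stem` and the `LEMMAS` dictionary are hypothetical and exist only for this illustration. Real lemmatizers rely on large dictionaries and grammatical context.

```python
# Hypothetical lemma table; a real one would be compiled by lexicographers.
LEMMAS = {
    "art": "art",
    "arts": "art",
    "artist": "artist",
    "artists": "artist",
    "articulate": "articulate",
    "articulated": "articulate",
}


def naive_stem(word, prefix_len=3):
    """Crude truncation stemmer: keep only the first few letters."""
    return word.lower()[:prefix_len]


def lemmatize(word):
    """Look up the dictionary form; fall back to the lower-cased word."""
    return LEMMAS.get(word.lower(), word.lower())


if __name__ == "__main__":
    for w in ["arts", "artists", "articulated"]:
        print(w, "stem:", naive_stem(w), "lemma:", lemmatize(w))
```

The truncating stemmer files all three words under "art", while the lookup keeps "articulate" under its own heading. Building and checking such a table by hand for a whole corpus is what made the work so slow.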
Slowly but surely, this indexing and coding technique, taken from a literature-searching engine, proved faster than hand-written cards. The new era of language engineering was here. Father Busa expected these techniques to be improved for sophisticated use in libraries and analysis. He is called IBM’s pivot point for providing the trigger to make the impossible happen.
Essentially, throughout the history of Humanities Computing, scholars looked at which technologies were affecting the industrial fields and applied them in a humanities context.
The Packard Humanities Institute, founded by David W. Packard, son of the Hewlett-Packard co-founder, also concentrated its efforts similarly.
In fact, if one were to point out the major breakthroughs in computing, they would eerily coincide with those in Humanities Computing. The mini timeline of this particular project can be traced back from the World Wide Web.
|1989||World Wide Web||Tim Berners-Lee/ CERN|
|1980||Ibycus mini mainframe||David Packard|
|1974||Index Thomisticus||R. Busa, CAAL|
|1972||TLG Planning Committee|
|1968||David Packard’s program-produced Concordance to Livy|
|1967||Thomas Aquinas text card-punching completed.||Robert Busa|
|1957||(Published) Dead Sea Scrolls machine-readable.||Robert Busa|
|1957||Magnetic-tape assisted Bible Concordance.||J. Ellison/ Remington Rand|
|1957||FORTRAN made public|
|1953||Computers made public (IBM ships IBM 701)|
|1951||Machine-generated concordance||R. Busa/ IBM|
|1946||ENIAC (electronic tube computer)|
|1943||MARK I, electromechanical relay computer.||IBM/ Harvard|
|1890||U.S. Census recorded on punch-cards||Hollerith founds IBM predecessor|
|1884||Punch-cards||Herman Hollerith patent|
By the 1960s, other researchers began to index their texts of interest.
|Early Middle High German texts||Roy Wisbey|
|Matthew Arnold and W. B. Yeats poems||Stephen Parrish|
At around the same time, computing facilities became mainstream, not only in educational institutions but also in research centers around the world, principally in Europe.
Examples include the Trésor de la Langue Française (Gorcy 1983), which was established in Nancy to build up an archive of French literary material, and the Institute of Dutch Lexicology in Leiden (De Tollenaere 1973). Consolidation was the main buzzword for all Humanities Computing work well into the mid-1980s.
The next step came with knowledge representation and semantic evaluation in visual form.
Benefits of Apple’s Mac in the 1990s for Humanities Computing:
- Simple programming tool
With Apple’s Macintosh systems, the graphical user interface proved to be of paramount importance because it could display special characters from different languages. HyperCard enabled linking between cards. Soon, the collection of these consolidated documents came to be called ‘archives’.
Winter, Thomas Nelson, “Roberto Busa, S.J., and the Invention of the Machine-Generated Concordance” (1999). Faculty Publications, Classics and Religious Studies Department. 70.