In this series of articles Kevin D. Franklin and Karen Rodriguez’G examine computational tools and approaches at the interface of humanities, arts and social science.
18thConnect: Digitizing the Canon
For the humanities scholar who may have only recently mastered library and archival finding aids beyond the archaic card catalog, the possibility of retrieving source materials at the flash of a keystroke (well, maybe a few…) is very heady stuff. Very. But even as scholars rub their hands together and salivate at the possibilities that advanced computer technologies bring to the archival table, questions of open access and issues of intellectual ownership and copyright infringement have emerged as fast as the world’s knowledge repositories (and Google) are digitizing texts. Accessibility is particularly important to historians, for example, where research in primary sources can often only be accomplished with an expensive plane ticket, extended sabbatical leave, and a pocketful of increasingly dwindling research monies. University humanities, arts and social science departments often suffer from second-string status when it comes to federal funding, alumni gifting and corporate grants, compared to those received by the “hard” science community. The global financial crisis will of course only make matters worse. The ability, then, to tap into the world’s archives from your desktop becomes not only very appealing but even — dare we say it — necessary.
For Laura Mandell and Robert Markley, professors of English at Miami University-Ohio (MU) and University of Illinois at Urbana-Champaign (UIUC), respectively, the possibilities of internet-enabled research are tremendous. Mandell and Markley are the lead organizers of 18thConnect, a collaboration between MU, UIUC, and the National Center for Supercomputing Applications (NCSA), which will provide the first comprehensive means of digitally organizing materials produced before 1800. Like its sister site, Networked Infrastructure for Nineteenth-Century Electronic Scholarship (NINES), for which Mandell is also Associate Director, 18thConnect will bring together in one forum separate digital collections and texts as well as allow interdisciplinary collaborations by publishers, libraries and scholars. Markley and Mandell both bring considerable scholarly interest in and experience with the digital humanities to this visionary project, and here tell us about 18thConnect’s inception and future prospects:
How did you get involved in this area of research?
Mandell: I first began thinking about how literary and cultural texts can be transformed into digital data when Jerry McGann, professor of English at the University of Virginia and founder and director of its Applied Research in ‘Patacriticism, gathered together a group of people in order to create NINES. McGann is, first of all, one of the most generous scholars ever, since he used the $1.5 million that he received from the Mellon Foundation as a “lifetime achievement award” to start NINES. None of us knew at the outset what NINES would be. Jerry just kept saying that the archive is going digital, and scholars have to be at the table, helping to shape it. All we knew is that NINES had to be a scholarly body that would make digitizing worth it professionally for younger scholars so that they would be able to participate in this important work. NINES would peer-review digital scholarship according to the highest standards to which printed texts are held. But then, what else would it be or do? We thought that it could be a kind of digital anthology published by a university press, but copyright issues proved insurmountable. We finally realized that NINES would aggregate rather than publish data: it would be an online finding aid, leaving all the participating digital archives exactly where they live. NINES would be a place for scholars to come to do research into digital archives, and to interact with each other in the process. To work, NINES had to be a comprehensive research environment, the first place you would come. To make that possible, we decided to bring together commercial, library, and open access digital records, texts, and images. All NINES records are freely available, the free-culture items immediately accessible, the others only if the scholar’s library subscribes. We built a tag cloud for note taking, and an exhibit builder is on the way, coming in December. 
As a finding aid and a venue for social scholarly interaction, NINES is the place to be: it means something to have a digital archive peer-reviewed and accepted by NINES.
Markley: In 1997, I started working on a large-scale digital project with colleagues at Washington State University-Vancouver. This was the first of a series of scholarly DVD-ROMs, published by the University of Pennsylvania Press, Red Planet: Scientific and Cultural Encounters with Mars, which appeared in the digital dark ages of 2001. Over the course of this project, we quite literally had to adapt to changing hardware and authoring software, doing in incredibly laborious fashion the sorts of video capture and editing that are now routine. My colleagues and I ended up co-authoring an article on this process. While Red Planet and the other titles in the series were all content driven, we had to grapple constantly with a host of deeply embedded disciplinary assumptions: humanists write the content, software designers provide authoring tools, and retrained MFAs design web sites. Much of my time in authoring and serving as a series editor for these DVD titles has been spent in exploring the mutually constitutive relations among content, digital form, and evolving technologies.
In an important sense, 18thConnect represents, for me, a continuation of an incomplete revolution within the digital humanities that must deal, in a variety of ways, with entrenched beliefs among my colleagues in the humanities. There’s a fundamental assumption that the content of the humanities, the canonical texts we have always taught, stay “the same” but now can be delivered through different media. For some scholars, digital media means that downloading pdfs is simply an alternative to xeroxing articles from journals. One of the ironies in the digital humanities is that these kinds of assumptions allow many scholars to persist in the belief that digital technologies reinforce the boundaries between disciplines rather than causing us to rethink them. In this respect, it sometimes seems that literary studies is more conservative, more wedded to simplistic understandings of technology, than it was a decade ago. As the late historian and anthropologist Greg Dening put it: “Surrenders to conventionality are what disciplines are.” So, in my mind, the ongoing challenge for digital humanities remains to fashion dialogic means of cross-disciplinary collaboration.
What is the origin of this project?
Mandell: The whole time that I was participating in NINES, I deeply regretted that it did not include 18th-century materials. Romanticists and nineteenth-century scholars are incredibly active in producing digital materials. There are of course 18th-century projects, and Jack Lynch of Rutgers University is a tireless tracker of those resources. But there aren’t as many, and it is for one simple reason: people believe that “Gale Group” has taken care of it. Gale produced ECCO, Eighteenth-Century Collections Online, a dataset that seemingly single-handedly solves the problem of transferring 18th-century texts to digital media. There had been a microfilm project for capturing all the eighteenth-century texts listed in a renowned bibliography in the field — all 400,000 of them. It had begun in the late 1970s, and libraries all over the Anglo-American world had participated, beginning with the British Library. Gale had taken over that microfilm collection and created digital image files out of 138,000 of the 200,000 that had been filmed. But image files aren’t data. Having images online makes them easier to look at but not fundamentally different from microfilm. Leaving those texts as image files is almost as good as burying them in the backyard.
Well, I had started a Digital Humanities Program at Miami and had brought in a series of the best speakers in the field: Rita Raley, Matt Kirschenbaum, Julia Flanders. I brought in Robert Markley to talk about his Red Planet project, and then we kept in touch. We both realized that letting Gale “own” the eighteenth century was a bad idea. What we didn’t know at the time was that we would be able to collaborate with Gale, that they want scholarly involvement in directing how this archive is structured as much as we want to be involved. At the end of August, we had a landmark meeting with Gale. Gale has gone far to create an Optical Character Recognition system for transforming the text images into type, but the process is still faulty enough for the University of Michigan Libraries to have undertaken the process of keying the texts — basically, typing them from scratch. Gale would like to increase the accuracy of its OCR, and the technologists at Gale are generously sharing with us all the information they have about their OCR, which we can then improve. Like NINES with which it is directly connected, 18thConnect brings together scholars, libraries, and publishers, as well as independent scholars producing archives, aggregating these separate digital records and collections, as well as trying to solve some of the major problems confronting that archive and threatening us with its illegibility in the future. The biggest problem is that we cannot type it. We need machines to read it and transform images into text. Printing wasn’t modernized until about 1820, so creating software smart enough to read well is a challenge, and we’ll be working with Peter Bajcsy of the NCSA to develop the algorithms adequate to the task.
Markley: In adding to what Laura has said, I’d emphasize that 18thConnect is intended to make available eighteenth-century texts so that they can be aggregated, searched, tagged with metadata, and manipulated in ways that simply don’t exist now. Texts printed before 1820 or so threaten to be shunted aside because they cannot be read with any degree of accuracy by off-the-shelf OCR software. In this sense, students and scholars are faced with a new form of the digital divide that threatens to render texts published before the mid-nineteenth century opaque to cutting-edge tools in the digital humanities. Unless collaborative ventures such as 18thConnect and NINES deal with the issues that Laura has identified, the humanities threatens to become skewed toward a kind of modernist myopia that, over time, will marginalize foundational, pre-1820 texts in a variety of disciplines and produce a distorted view of our cultural heritage.
How is this innovative in computing?
Mandell: You know, it’s almost innovative by backing up. There are all these great computer scientists out there creating all kinds of incredible tools for analyzing data — I just got back from a workshop held by IBM about their World Community High Performance Computing Grid, and they were telling us how much they wanted to work with humanities data. Everyone is excited by the prospect — NEH offers grants in Humanities Supercomputing; it’s all the rage.
But at this workshop, I sort of stood up and said, “Data? Did someone ask for data?” Because we don’t HAVE data yet. We cannot simply wait to use high performance computing in the Humanities AFTER we’ve got the data; we have to use it to help us transform the print archive into machine-readable information.
When you say to Humanists that you want to create smart data, to key and code texts so that software, data-mining software for example, can be used to manipulate them, well, from the horrified response you get, you might as well have said that you were inventing Artificial Intelligence capable of creating and staffing distance learning courses.
But machines will beautifully structure and create textual data out of images, and then read it in numerous kinds of ways, not just by keyword. These software applications and visualization tools will help us figure out where to pay close attention to the archive. Frank Kermode has written a book of literary criticism called Forms of Attention. Well, you can only pay deep attention to a limited data set, and the canon at which Kermode gazes was formed for at least one practical reason: limiting an unwieldy archive so that it could be approached by close readers. Ten or fifteen years ago, when I first worked on the problem of how anthologies shaped the discipline of English, and of course in particular what they cut out of view, I contacted anthology-editor David Perkins of Harvard to ask him what he cut and why. He wrote to me, “I wanted to save you all from having to read every worm that ever wrote.” As an information filter, the canon has its biases; your worms are not my worms. But still there MUST be filters — no one can read through 400,000 texts and still have time to take a shower or eat a meal. Machine reading will provide us with alternate and infinitely malleable filters. Your software or visualization tool can be a feminist one day and a legal theorist the next. Machine filters do not have to be pernicious. For one thing, human close-reading follows and is beautifully supported by mechanized “distant reading,” Franco Moretti’s term for data crunching.
But machine reading does more than simply allow scholars to focus attention on particular texts. It can help us figure out what’s human in the cultural record. Where the machine breaks down is incredibly interesting. For instance, visualization theorists always say that the most salient anomaly in visualized data reflects errors in the way that data was conceived or ordered. “Error”: that’s one name for rhetoric, for fiction, for the kind of wandering amongst, and deviations from, convention constituting human creativity. We want to know where a highly competent machine-reading program breaks, where it doesn’t work anymore. If something cannot be automated, it’s incredibly interesting.
Markley: Another way of looking at the problem of innovation is to recognize that most scholars in the humanities lack the means to collaborate effectively in database design, in creating sophisticated filters, in communicating effectively to people working in computer science and digital media what it is that they want or need. As Laura suggests, the concept of what counts as data in the humanities is more or less up for grabs, and 18thConnect is designed to foster multiple modes of collaboration and interdisciplinary scholarship. The humanities and digital media already are evolving in complex feedback loops, and in some very real ways the modes of analysis that developed in the twentieth century are now being changed in open-ended and more or less a-predictable ways.
How does this project broaden/challenge/alter our understandings of Humanities, Arts, and Social Science Research or Education?
Mandell: As Johanna Drucker has said in a recent book review, the holy grail of digital textual studies is uncovering the process of human thought, and there are at least two ways that digitizing the archive will do that. First, it will help us understand how brains move from seeing analog images to crunching them into thought-morsels of binary data, how we get from pictures of black marks on a page, graphemes, to the signifier and signified of thinking. What a mysterious process: an Optical Character Recognition process smacks of Descartes’s pineal gland, taking us from body to soul. Second, digitizing the archive and then analyzing it with all the tools that this media ecology has to offer will help us re-envision the printed book not just as material but as thinking matter. Nancy Armstrong has recently published a book with the most wonderful title: How Novels Think. Books are smart data, too, but smart in their own particular way. It seems likely to me that the terminology used in the field of literary studies hangs on the printed book. Just as Moretti did away with the term “period,” replacing it with a notion of generic cycles, digitizing the archive will retire specific disciplinary structures, giving us new ones. Nothing could be more exciting than watching the transfer from dominant to residual, and emergent to dominant: the lucky people who get to live at those historical moments — this is McLuhan’s idea; he says that it’s true of Tocqueville — those lucky people get a glimpse at the inner workings of ideology.
Markley: At a fundamental level, 18thConnect will move beyond what I think of as a tepid interdisciplinarity: a historian sits down with a copy of Alexander Pope’s Windsor Forest and pokes through it looking for Tory responses to the ongoing war with France; in another building, a literary critic reads passages from an eighteenth-century history of landscape architecture in order to make general claims about Pope’s view of the ideal country estate. Neither specialist ever reads the other’s article. ECCO, the Eighteenth-Century Collections Online, is a step in the right direction of breaking down disciplinary boundaries because its 400,000 texts offer relatively easy ways for scholars to browse through resources, to do down-and-dirty keyword searches within texts, and to get some sense of the publishing history of individual titles. But we need tools that will improve OCR to the point where the data from sermons, mathematical treatises, histories, travel narratives, and anonymous poems that have never been reprinted can be tagged and manipulated using software like Collex so that we finally can do what Jerry McGann accurately calls “real scholarship” — taking advantage of and helping to design digital media that offers opportunities for interdisciplinary research and collaboration. Then and only then will scholars get to the point where they can use digital tools to read critically texts from multiple disciplines and look closely at visual media from the period (paintings, maps, title pages, tables, and so on). Above all else, the eighteenth century was a period in which the disciplinary divisions that we’re now familiar with — separate departments or colleges of medicine, literature, astronomy, chemistry, agriculture, and religion — did not yet exist. So a robust means to be able to tag, filter, display, and discuss data from different fields is pretty much essential to a new, robust interdisciplinary scholarship.
What does this project offer the humanistic/scientific/technological/corporate world?
Mandell: One thing that NINES is in the process of doing and that 18thConnect must do at the very start is to bring together humanists and businesses. Ever since creating his Palinurus web site, Alan Liu has been arguing against the humanist’s tendency to demonize business. Liu contests Bill Readings’s The University in Ruins, which blames the profit motive for having transformed universities from learning institutions into certifying machines. I myself have been perfectly willing to demonize the role of business in preserving the eighteenth-century cultural record, as you can see in my videos describing the endeavor. But more recently, as I have been encoding books — which is to say simulating their powers of representation on computers — I have realized how powerful the book is as an information architecture. It has that power not despite the fact that it is a commodity but because it is one. As the teams of scholars constituting NINES and 18thConnect work together with business, we have to map intellectual onto profit motives. Seeing that confluence is startling and revealing; our interests are more business-like than we like to think.
The real payoff for humanists, though, will come from taking a hand in producing texts rather than imagining such an enterprise as a purely mechanical prerequisite to the scholar’s work. Manipulating form and content, and the parts of both that are inextricable from each other, will show us how much thought is conveyed in what Garrett Stewart calls “The Look of Reading.” A new analysis of the meaning-shaped materiality of cultural artifacts is made possible by digitizing the archive. Digitizing is transforming our notion of materiality, as is beautifully expressed in Kate Hayles’s Writing Machines. As an example, let me translate one of Paul de Man’s ideas according to notions made possible by digital textual studies. Each page of a book resembles a human face, the lines making up alphabetical characters like the lines on a face, each one specifying a thought, a feeling. Our new Optical Character Recognition program running on the NCSA Supercomputer will read faces, ideally as well as children do. Like children, the machines will make surprising mistakes. Their solecisms, I hope, will startle us into thought.
Markley: There’s another way of looking at collaboration among different partners in education, business, and non-profits. Digital projects are extremely expensive in terms of money, time, and labor by any standards — and prohibitively expensive by traditional means of allocating grant money in the humanities. The titles in the Mariner10 DVD series all received internal or external grant funding, but these grants, although generous in the humanities (mid five figures), covered only a fraction of the costs; much of the slack was taken up by what I’d call scholarly sweat equity. If the scholars in literature, history, art history, music, theater, and so on don’t learn to take the initiative in collaborating with a variety of potential partners, they will have the digital archive constructed for them, in ways that they may not like or may not use. There are some terrific projects out there that have been the result of such collaborations: one of my favorites that we haven’t mentioned yet is the South Seas project, authored by Paul Turnbull and Chris Blackall, that is hosted by the National Library of Australia.
From a corporate point of view, investing in digital projects in the humanities is analogous to investing in alternative energy generation. The point is not short-term profits but producing a culture of collaboration that is in itself productive of innovative ideas and tools. We envision large-scale undertakings like 18thConnect as a means to generate collaborative ways of designing and disseminating a suite of tools (as NINES did with Juxta and Collex) for scholars in the humanities. Developing a robust, open-source OCR for texts printed before 1820 opens up a host of possibilities. Imagine, for example, popular texts such as Milton’s Paradise Lost or Shakespeare’s Macbeth available in different printed versions that can be customized from a vast database of copytexts and editions. For example, an annotated, modernized spelling paperback for secondary school students, with explanations of unfamiliar words or phrases; a university text that includes variants but retains the modernized spelling; and more specialized scholarly texts that include textual variants from various editions and retain the original spelling and punctuation.
About the authors
Kevin D. Franklin is the Executive Director of the Institute for Computing in Humanities, Arts and Social Science (ICHASS), Senior Research Scientist at the National Center for Supercomputing Applications (NCSA), Research Professor – Educational Policy Studies at the University of Illinois at Urbana-Champaign (UIUC), and Adjunct Associate Professor – African American Studies (UIUC). Karen Rodriguez’G is Public Relations Liaison for ICHASS and a doctoral candidate in the Department of History at UIUC. Founded in 2004 at UIUC, ICHASS charts new ground in high performance computing and the humanities, arts, and social sciences by creating both learning environments and spaces for digital discovery. ICHASS presents path-breaking research, computational resources, collaborative tools, and educational programming to showcase the future of the humanities, arts, and social sciences by engaging visionary scholars from across the globe to demonstrate approaches that interface advanced interdisciplinary research with high-performance computing.