Mellon grant to fund project to develop data-mining software for libraries


CHAMPAIGN, Ill. -- Using cutting-edge "tools of discovery" and a diamond-sharp new process called data-mining, information scientists at the University of Illinois at Urbana-Champaign are beginning work that eventually will help scholars carve out new literary knowledge in the works of writers across languages, cultures and time.

The Andrew W. Mellon Foundation is funding the two-year, nearly $600,000 multi-institutional project, which John Unsworth, dean of Illinois' Graduate School of Library and Information Science (GSLIS), will lead. In his winning project, titled "Web-based Text-Mining and Visualization for Humanities Digital Libraries," Unsworth expects to produce software "for discovering, visualizing and exploring significant patterns across large collections of full-text humanities resources in digital libraries and collections." The collections he's focusing on are at Illinois, Indiana University, the University of Michigan, the University of North Carolina, Tufts University, the University of Virginia and other universities.

In traditional "search-and-retrieval" projects, scholars bring specific queries to collections of text and get back more or less useful answers to those queries, Unsworth said.

"By contrast, the goal of data-mining, including text-mining, is to produce new knowledge by exposing unanticipated similarities or differences, clustering or dispersal, co-occurrence and trends."

During the last decade, he said, many millions of dollars have been invested in creating digital library collections. Thus, today, terabytes of full-text humanities resources are publicly available on the Web. One terabyte, Unsworth said, equals 1,000 gigabytes, or enough storage for 300 feature-length films in digital form.

Those collections, dispersed across many institutions, "are large enough and rich enough to provide an excellent opportunity for text-mining. By creating the Web-based software tools, we aim to make those collections significantly more useful, more informative and more rewarding for research and teaching."

With its roots in statistics, artificial intelligence and machine learning, data-mining has been around since the 1990s. And statistical analysis of humanities texts is "one of the older activities in humanities computing," Unsworth said. "People have been doing it in authorship-attribution studies, for example, for most of the last half of the 20th century.

"But data mining per se, discovering patterns in large textual data sets, is not something that's been done much in the humanities. Our project may not be a 'first,' but it is an early entry into the field, certainly." Unsworth said he intends to build on data-mining expertise at GSLIS and on "several years of software development work" that has been done at the U. of I.'s National Center for Supercomputing Applications, in particular, work developing the D2K (Data 2 Knowledge) software in Michael Welge's Automated Learning Group.

"This project relies on Michael's D2K, and could not happen without it," Unsworth said. "We're grateful for his participation." Nor is this the first GSLIS project to build on D2K, Unsworth said. The National Science Foundation and the Mellon Foundation have funded Stephen Downie, a young scholar in GSLIS, to use D2K in a music information retrieval project.

With data-mining tools, Unsworth said, you first select a body of material that you think is important in some way, next select features of those materials that you similarly think are important, and then "map the occurrence of those features in the selected materials to see whether patterns emerge. If patterns do emerge, you analyze them and from that analysis emerges, if you are lucky, new insights into the materials."
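The three steps described above (select materials, select features, map the features across the materials) can be illustrated with a minimal Python sketch. This is not the D2K software or the project's method: the texts, the word-list features and the use of cosine similarity are all assumptions chosen only to make the idea concrete.

```python
from collections import Counter
import math

# Step 1: select a body of material.
# These are tiny invented placeholder texts, not real literary data.
corpus = {
    "text_a": "the sea the ship the storm",
    "text_b": "the ship the sea the sailor",
    "text_c": "a garden a rose a garden wall",
}

# Step 2: select features thought to be important.
# A hypothetical word list stands in for richer structural features.
features = ["sea", "ship", "garden", "rose"]


def feature_vector(text, features):
    """Count how often each selected feature occurs in one text."""
    counts = Counter(text.lower().split())
    return [counts[f] for f in features]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(name, vectors):
    """Return the other text whose feature vector is closest to the named one."""
    others = [other for other in vectors if other != name]
    return max(others, key=lambda o: cosine_similarity(vectors[name], vectors[o]))


# Step 3: map the occurrence of the features across the selected material.
vectors = {name: feature_vector(text, features) for name, text in corpus.items()}

if __name__ == "__main__":
    for name in vectors:
        print(name, vectors[name], "closest:", most_similar(name, vectors))
```

Any grouping such a sketch produces is only a starting point; as Unsworth notes, the analysis and interpretation of a pattern remain the scholar's job.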

For example, in the planning grant for this project, members of his research team, using the full set of Shakespeare's plays, selected five "circulation-of-characters" features (scenes, nodes, singles, loops, switches) as independent variables, and "genre" as the dependent variable; they then "attempted to order the plays by feature similarities and see how that corresponded, or didn't, to genre," he said.

"There was one very interesting result, which was that Othello fell squarely in with the comedies. If I were to analyze this result, I'd ask a number of questions about the methods used to produce the results, but once satisfied that I was not looking at an artifact of the procedure itself, I would ask what it means that Othello has the structural features of comedy, and from there, an interesting journal article might emerge."

Unsworth said that this example isn't strictly representative of what he's proposing to do in his project in terms of the scope of the data set.

"In the project we plan to explore thousands of works by hundreds of authors," he said. "Part of the experimentation will be to determine what features are meaningful at what level of generality, what subsets present the richest veins for data-mining and what methods expose the most interesting patterns at what scope."

Unsworth said that he and his team of researchers know literary scholars are interested in the works that make up the data set he proposes to use (British and American literary texts of many types, mostly from the 19th century), and he knows that the features they'll be identifying are "features of interest," especially structurally.

"What we don't know, because this is an experiment with a tool of discovery, is what interesting patterns we will find as we map these features across this body of works. It is, therefore, a bit of a leap of faith to accept the assertion that interesting patterns will emerge, but I do make that assertion and I am comfortable doing so.

"To date, we haven't had a tool that exposes patterns in literary texts at the level of granularity and the scope that we propose in this project, but we know that the D2K tools work at that scope and granularity with other kinds of data, and we know that literature and language itself exhibits some meaningful patterns at every level we can observe, so it seems reasonable to hypothesize that new levels of observation across larger scopes of literary text at higher resolution, with respect to textual features, will expose meaningful patterns that haven't been visible before.

"From there, it will be up to literary scholars to analyze, interpret and explain those patterns, and in a very general way, that activity is the advance in literary scholarship that we assume will emerge from this project."

Additional project partners in humanities research computing are Stephen Ramsay, English department at the University of Georgia; Matthew Kirschenbaum, English department at the University of Maryland, and fellow at the Maryland Institute for Technology in the Humanities; and Tom Horton, computer science department at the University of Virginia.

The Mellon Foundation provided an earlier $56,000 planning grant for this project in 2003.

The new Mellon grant is the second major grant Unsworth has won this fall. With co-project investigator Beth Sandore, associate university librarian for information technology planning and policy at Illinois, he won nearly $3 million over three years from the Library of Congress to take part in a massive project to save at-risk digital materials nationwide. Through the grant, the U. of I. Library and the U. of I. Graduate School of Library and Information Science will take a leadership role in the National Digital Information Infrastructure and Preservation Program.

Source: Eurekalert & others

Last reviewed: By John M. Grohol, Psy.D. on 21 Feb 2009