Tufts University's Perseus Digital Library to offer new online search tool

Resource allows users to find specific people, places and dates

Tufts University's Perseus Digital Library has created a resource that will allow users to find specific people, places and dates in its updated 19th Century American collection, which now contains more than 55 million words.

Amateurs and academics alike will find this resource beneficial. Historians, genealogists, Civil War buffs and users of digital libraries frequently need to search for a specific individual or an exact place, but quickly locating such entities in massive digital collections can be challenging.

"Free text searching for proper names can be quite frustrating since many individuals share the same name such as John Smith and there are many different Springfields across the United States," explains Gregory Crane, editor-in-chief of the Perseus website and classics professor at Tufts University. "The problem of ambiguity further complicates this problem, for Washington can be either a person or a place depending on the context."

To solve this problem, the team at Tufts has implemented an automated system that makes use of natural language processing and historical knowledge sources, such as 19th Century gazetteers and encyclopedias, to extract different types of entities.

"With this project, Perseus seeks to demonstrate the application of text mining and information extraction to historical documents," says Crane. "In this collection of Civil War and local history documents, all place names, personal names and dates have been automatically extracted."

In text mining, a computer rapidly scans documents to identify patterns and other connections in the text. Researchers can then retrieve the information through information extraction, which automatically recognizes specific vocabulary in the text documents.
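A minimal sketch of the idea, not the Perseus implementation: the code below looks up names from a small, hypothetical gazetteer and name list, uses the word just before a match to decide whether an ambiguous name such as Washington is a person or a place, and picks out dates with a simple pattern. The lists `PLACES`, `PEOPLE`, `PERSON_CUES` and `PLACE_CUES`, and the function `extract_entities`, are assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical mini-gazetteer and name list (illustration only); the real
# system draws on 19th-century gazetteers and encyclopedias.
PLACES = {"Springfield", "Washington", "Cambridge"}
PEOPLE = {"John Smith", "Washington"}

# Illustrative context cues, not taken from the Perseus system.
PERSON_CUES = {"General", "President", "Mr."}
PLACE_CUES = {"in", "at", "near", "to", "from"}

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_RE = re.compile(r"\b(?:" + MONTHS + r") \d{1,2}, \d{4}\b")

# Try longer names first so "John Smith" is matched as a whole.
_NAMES = sorted(PLACES | PEOPLE, key=len, reverse=True)
NAME_RE = re.compile(r"\b(" + "|".join(re.escape(n) for n in _NAMES) + r")\b")


def extract_entities(text):
    """Return a list of (kind, surface form) pairs found in text."""
    entities = [("date", m.group()) for m in DATE_RE.finditer(text)]
    for m in NAME_RE.finditer(text):
        name = m.group()
        words_before = text[:m.start()].split()
        prev = words_before[-1] if words_before else ""
        if name in PEOPLE and name in PLACES:
            # Ambiguous name: fall back on the preceding word.
            if prev in PERSON_CUES:
                kind = "person"
            elif prev in PLACE_CUES:
                kind = "place"
            else:
                kind = "ambiguous"
        elif name in PEOPLE:
            kind = "person"
        else:
            kind = "place"
        entities.append((kind, name))
    return entities


sample = ("General Washington marched to Springfield on July 4, 1863, "
          "and John Smith stayed in Washington.")
# Entity frequencies for the sample, keyed by (kind, name).
frequencies = Counter(extract_entities(sample))
```

A production system would need far richer knowledge sources and context models than a one-word cue, but the sketch shows how the same string can be filed under different entity types.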

This process differs from that of other Web search engines: rather than presenting information in an unorganized list, text mining and information extraction categorize the information, providing more relevant results. Researchers have applied named entity analysis to collections for years, but this is the first time that such analysis has been applied to a production digital library of historical documents, says Crane.

Support from the National Endowment for the Humanities and the National Science Foundation laid the foundations for this work, and support from the Institute of Museum and Library Services made the current named entity analysis work possible.

"This tool allows users to search for specific individuals, places or dates within the collection, and to browse lists of entities and their frequencies both within individual documents and the collection as a whole," explains Alison Jones, a research coordinator at Perseus. "This allows a user to quickly locate all references to Robert Lee within a specific biography or across the entire collection."

The tool will also enable users to search for a full name, a forename or a surname to support the most robust searching possible. The ability to browse by place name allows users to search specifically for Cambridge, Mass., rather than Cambridge, England. To view the website or for a more complete description of this system, visit http://www.perseus.tufts.edu/hopper/nebrowser.jsp.
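As a rough illustration of fielded searching (a sketch under stated assumptions, not the Perseus code), extracted entities can be stored as structured records rather than plain strings, so that a query can target a forename, a surname, a full name, or a place together with its region. The record types `PersonName` and `Place`, the sample data, and the functions `search_names` and `search_places` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonName:
    forename: str
    surname: str

    @property
    def full(self):
        return f"{self.forename} {self.surname}"


@dataclass(frozen=True)
class Place:
    name: str
    region: str


def search_names(index, query, field="full"):
    """Match query against one field: 'full', 'forename' or 'surname'."""
    if field not in ("full", "forename", "surname"):
        raise ValueError(f"unknown field: {field}")
    q = query.casefold()
    return [n for n in index if getattr(n, field).casefold() == q]


def search_places(index, name, region=None):
    """Match a place name, optionally narrowed to one region."""
    hits = [p for p in index if p.name.casefold() == name.casefold()]
    if region is not None:
        hits = [p for p in hits if p.region.casefold() == region.casefold()]
    return hits


# Hypothetical sample records for illustration.
people = [
    PersonName("Robert", "Lee"),
    PersonName("John", "Smith"),
    PersonName("Robert", "Anderson"),
]
places = [Place("Cambridge", "Mass."), Place("Cambridge", "England")]

roberts = search_names(people, "robert", field="forename")
cambridge_ma = search_places(places, "Cambridge", region="Mass.")
```

Because each place record carries its region, a search for Cambridge can be narrowed to Mass. without also returning the English city.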

"This is a big step forward, especially considering the big Internet search engines do not search in this way," says Crane. "We have a test collection of about 55 million words to demonstrate the concept and will extend that service to a much larger corpus over the next six months."

With more documents added to digital libraries every day, what is needed now is an effective method for searching the materials. Eventually, individuals will be able to add their own documents to search, as part of an open community like that of Wikipedia, the free, open-content online encyclopedia.

The system will give users better ways of accessing the available information, and Crane hopes it will encourage larger search engines such as Google and Yahoo to offer a similar type of tool.

"The services that we are providing are the things we hope to see the big search engines pick up and provide eventually," says Crane. "And we think they will."


Tufts University, located on three Massachusetts campuses in Boston, Medford/Somerville, and Grafton, and in Talloires, France, is recognized among the premier research universities in the United States. Tufts enjoys a global reputation for academic excellence and for the preparation of students as leaders in a wide range of professions. A growing number of innovative teaching and research initiatives span all Tufts campuses, and collaboration among the faculty and students in the undergraduate, graduate and professional programs across the university's eight schools is widely encouraged.

Last reviewed: By John M. Grohol, Psy.D. on 30 Apr 2016
    Published on PsychCentral.com. All rights reserved.