About

Do you wish you could do large-scale text analysis on the languages you study? Is the lack of good linguistic data and tools a barrier to your research?

Learn how to create the data and language models you need for digital humanities analysis at “New Languages for NLP: Building Linguistic Diversity in the Digital Humanities,” a National Endowment for the Humanities Institute for Advanced Topics in the Digital Humanities.

Held at the Center for Digital Humanities at Princeton, this Institute is a collaboration with Haverford College, the Library of Congress Labs, and DARIAH, the European Digital Research Infrastructure for the Arts and Humanities.

Participants will work over the course of a year, between June 2021 and May 2022, meeting for three intensive workshops where they will learn to annotate linguistic data and train statistical language models using cutting-edge natural language processing (NLP) tools. They will learn best practices in project and research data management, join discussions with leaders in the fields of multilingual NLP and DH, and advance their own research projects by creating, employing, and interrogating text-analysis tools and methods, while increasing much-needed linguistic diversity in the field of NLP.

More details are provided in our CFP.

Please feel free to contact the project directors with questions:
Natalia Ermolaev (nataliae@princeton.edu)
Andrew Janco (ajanco@haverford.edu)

Here are some frequently asked questions.

NLP has revolutionized our ability to analyze texts at scale. However, of the world's more than 7,500 languages, major NLP resources support only eighty-five. While large linguistic datasets exist for high-resource languages such as English or German, text mining, topic modeling, and other methods of computational text analysis are unavailable for the vast majority of languages, especially those that are minority, regional, or endangered.

We are acutely aware of the risks to research—and to culture more broadly—if language technologies continue to lack diversity. The proliferation of data and tools in several dominant languages will perpetuate and deepen the existing structural inequalities on both local and global scales.

For the purposes of this workshop, a “new language” may be one with few existing resources, such as Mauritian Creole, Plains Cree, Gaelic, and Guadeloupean Creole, or a domain-specific language that currently lacks models, such as early modern Portuguese or the literary Russian diction of Leo Tolstoy.

We especially welcome applications that use materials from the Library of Congress Digital Collections.

Check out our Language page for help exploring the Library of Congress collection and for browsing and searching other important online resources that collect texts in a wide range of world languages.

Workshop participants should bring a corpus of machine-readable text (~20,000 or more words/tokens) in their language. More detail can be found in the CFP.
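For a rough sense of whether a corpus meets this threshold, a simple whitespace-based word count is usually a sufficient first approximation. Here is a minimal sketch, assuming UTF-8 plain-text files in a hypothetical corpus/ directory:

```python
# Approximate the token count of a plain-text corpus.
# Whitespace splitting only approximates true tokenization,
# but it is close enough to check the ~20,000-token threshold.
from pathlib import Path

total = 0
for path in Path("corpus").glob("*.txt"):  # hypothetical corpus directory
    total += len(path.read_text(encoding="utf-8").split())

print(f"Approximate token count: {total:,}")
```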

Scholars from any field of the humanities and humanistic social sciences with language or domain expertise are eligible to apply. Applicants may be researchers in any professional role (e.g. faculty, graduate student, independent research scholar, librarian, curator, information professional). We especially welcome proposals from researchers from less-resourced institutions or from those in contingent or non-affiliated roles.

Applicants may apply as individuals or in pairs. While two language or domain experts may apply together, we especially welcome teams of scholars from different disciplines or realms of expertise (e.g. data science, computational linguistics, machine learning, digital humanities). Team members may be from different institutions.

Non-US individuals and those based at non-US institutions are eligible to apply.

The workshop curriculum is designed to support participants with a range of technical expertise. Scholars with little or no experience with programming and natural language processing will be taught the required skills. Scholars with DH or NLP experience, and/or with more technical proficiency, will have an opportunity to advance their research.

The Institute’s technical team will provide training and support in linguistic data standards and annotation techniques, as well as in basic Python programming, NLP tools, and workflows.

Participants will be introduced to the spaCy NLP library and trained in several key tools and programs, including GitHub, Jupyter Notebook, Prodigy, and INCEpTION.

spaCy is an increasingly popular NLP library designed for quick experimentation and applied NLP tasks that can be performed on a standard laptop.
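For readers new to the library, the basic workflow is that a pipeline object converts raw text into a structured document of tokens. A minimal sketch using spaCy's language-agnostic blank pipeline (the "xx" code), a plausible starting point for a language spaCy does not yet support; the sample sentence is illustrative:

```python
import spacy

# A blank pipeline contains only a tokenizer: no trained components yet.
# "xx" is spaCy's multi-language fallback; a specific language code
# (e.g. "ru" for Russian) can be used where spaCy provides one.
nlp = spacy.blank("xx")

doc = nlp("Ki manyer? Mo byen, mersi.")  # illustrative sample text
print([token.text for token in doc])
```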

The NLP tools most familiar to humanists, the Natural Language Toolkit (NLTK) and StanfordNLP, are highly regarded and useful libraries, and the linguistic data created during the Institute can be used to train models with these and other libraries. However, we have chosen to work with spaCy for this workshop given its simplicity and focus on practical tasks.
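To make that workflow concrete, here is a hedged sketch of how an annotated example might be packaged in spaCy's binary training format; the text, entity label, character offsets, and file name are invented for illustration:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("xx")  # language-agnostic tokenizer only
db = DocBin()            # container for serialized Doc objects

text = "Lev Tolstoy finished Anna Karenina in 1877."
doc = nlp(text)

# Map character offsets to a token span and mark it as an entity.
span = doc.char_span(0, 11, label="PERSON")  # "Lev Tolstoy"
doc.ents = [span]
db.add(doc)

# Write the .spacy file that `spacy train` consumes as training data.
db.to_disk("train.spacy")
```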

Workshop 1 (scheduled for June 2021) will be remote. We hope that the other two workshops (scheduled for January 2022 and May 2022) will take place in Princeton, but they may be remote depending on pandemic-related circumstances.

Participants will present their research outcomes at our public conference during Workshop 3. They will publish their annotated linguistic datasets in an open repository and contribute the statistical models for their languages to an open-source community.

The Institute instructional team will revise the training materials and publish them online for a general audience on DARIAH-CAMPUS.

The Institute instructors are leading experts in multilingual DH, and have substantial experience teaching NLP to digital humanists, using existing training materials and tools for corpus annotation, and working with TEI-encoded texts.

Our Institute is funded by the NEH Office of Digital Humanities and affiliated with DARIAH-EU, the pan-European infrastructure for arts and humanities scholars.

The first workshop will feature a keynote lecture by David Bamman, developer of the BookNLP project. Ted Underwood, author of Distant Horizons: Digital Evidence and Literary Change (Chicago, 2019), will be the keynote speaker for Workshop 2. We have invited Matthew Honnibal and Ines Montani, the lead developers of spaCy, to be the featured keynote speakers at Workshop 3.