Linguistics: Corpora

What is a corpus?

The Oxford English Dictionary defines a corpus as "The body of written or spoken material upon which a linguistic analysis is based." For linguistic purposes, a corpus may be made up of text, audio, video, or other types of files, and it can be analyzed with various software tools.

How do I work with a corpus?

Students work with corpora in Linguistics department courses such as LING 4100/5800: Statistical Analysis and LING 5200: Introduction to Computational Corpus Linguistics. The Libraries' Center for Research Data and Digital Scholarship (CRDDS) offers workshops that support many of the skills needed to work with corpora, such as finding corpora, cleaning and processing data, adding metadata where needed, and using programming languages and tools such as Python, UNIX, or R to analyze the data. Some "out of the box" software is also available for analyzing corpora without much technical knowledge. CRDDS also offers consultations, as well as Interdisciplinary Consult Hours on Tuesdays, 12-1pm in Norlin E206, which provide drop-in consultations and a communal working space for digital scholarship research, from coding to writing to exchanging ideas.
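
To give a sense of what analyzing a text corpus with Python can look like, here is a minimal sketch that cleans a tiny corpus and counts word frequencies. It is only an illustration, not material from any course or CRDDS workshop: the function names `tokenize` and `word_frequencies` and the two sample sentences are made up for this example, and a real project would likely read files from disk and use a more careful tokenizer.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(documents):
    """Count how often each token appears across all documents in the corpus."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


# A hypothetical two-sentence "corpus" used purely for illustration.
sample_corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]

freqs = word_frequencies(sample_corpus)
print(freqs.most_common(2))  # [('the', 4), ('cat', 2)]
```

The same pattern scales to larger collections by replacing `sample_corpus` with text loaded from files; the cleaning step inside `tokenize` is where decisions about punctuation, case, and non-English characters would be made.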