
Linguistics

This section of the guide provides digital resources for computational linguistics and for navigating corpus data.

English-Corpora.org & Computational Linguistics

What is a corpus?

The Oxford English Dictionary defines a corpus as "the body of written or spoken material upon which a linguistic analysis is based." For linguistic purposes, a corpus may be made up of text, audio, video, or other types of files and can be analyzed with various software tools.

How do I work with a corpus?

Students work with corpora in Linguistics department courses like LING 4100/5800: Statistical Analysis and LING 5200: Introduction to Computational Corpus Linguistics. The Libraries' Center for Research Data and Digital Scholarship (CRDDS) offers workshops that support many of the skills needed to work with corpora: finding them, cleaning and processing data, adding metadata where needed, and analyzing the data with programming tools such as Python, UNIX, or R. Some "out of the box" software is available to analyze corpora without much technical knowledge. CRDDS also offers consultations, including Interdisciplinary Consult Hours on Tuesdays, 12-1pm in Norlin E206: drop-in consultations and a communal working space for digital scholarship research, from coding to writing to exchanging ideas.
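As a small illustration of the kind of processing these workshops cover, here is a minimal Python sketch (standard library only; the sample sentence and function name are invented for this example) that tokenizes a text and counts word frequencies:

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase the text, tokenize on letters/apostrophes, and count tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

sample = "The corpus is small, but the corpus is ours."
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

Real corpus work adds steps like handling punctuation, encoding, and part-of-speech tags, but frequency counting of this sort underlies most of the analyses described below.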

What does English-Corpora.org do?

English-Corpora.org, a clearinghouse of monitor corpora from BYU, provides a platform for running a variety of queries on its established corpora. Any logged-in user can search for specific words or phrases in each corpus, observe frequency over regular periods of time (months/years/decades), measure collocation, and perform other quantitative tasks for tracing language in use over time. More advanced users can also download a corpus for local use.

Understanding the corpus landing page

Upon choosing a corpus, you'll arrive on a landing page. There are two main boxes in the body of the site: on the left-hand side, a search apparatus and some display options; on the right-hand side, a description of the corpus. The top bar offers some additional search options.

 

Each of these features is labeled with a corresponding number and described below.

1. Search box - Type the character string you want to search for in the corpus. The search apparatus supports alphanumeric characters and special characters (such as * or ?), and spaces count as characters. Since we are looking at linguistic data, you can also add part-of-speech markers to identify specific grammatical features (nouns, verbs, prepositions, etc.).

The default is to list matches for the string. You can also clear the search by using the 'reset' button.

2. Other search options - The search capacity allows users to perform other kinds of searches; these are listed as links above the search box. Selecting one of these options changes the search apparatus' output to reflect the kind of search you are performing. (These operate somewhat like 'advanced' search features.)

  • CHART will produce a heatmap of your search term divided up by period of time.
  • COLLOCATES lets you identify terms that are more likely to appear around the initial search term than by simple chance.
  • COMPARE lets you compare two words in the corpus.
  • KWIC (keyword in context) searching shows you how your search term is used in context in aggregate with short snippets of text.

3. Description of corpus / Help box - This section provides information on the corpus selected, what it can be used for, and other information that might be relevant for the user (including the option to create a virtual sub-corpus from material within the corpus). The links provided here give examples of the kinds of searches you might want to perform and how to conduct them. It is possible to download the whole corpus through this informational box, but the web interface is more than adequate for running most searches. Clicking on the different search features (e.g. CHART) provides a description of what each approach does.

4. Header of different activities - Once you perform a search, your results will appear under these tabs in the web interface. This allows users to move across different features of the corpus interface without having to run new searches each time.

Understanding the various search options with the sample search of "Wuhan":

With the List option, the corpus will give you all the results for your search (helpful if you are looking for multiple words or forms with a * in the search query) and let you click through to the Keyword in Context View (KWIC).

With the Keyword in Context view (KWIC), your search term is highlighted in light green and shown with some surrounding context. Click on an example to get even more context. This is helpful for observing patterns of use in aggregate.
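Under the hood, a KWIC display is just a windowed slice of text around each match. Here is a rough Python sketch (with an invented sentence; the function name is ours, not the site's) of how such a concordance is built:

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) triples for each match."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

tokens = "the quick brown fox jumps over the lazy dog".split()
snippets = kwic(tokens, "fox", window=2)
print(snippets)  # [('quick brown', 'fox', 'jumps over')]
```

The web interface does this across millions of texts at once, which is why KWIC views are so useful for spotting patterns of use in aggregate.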

With the Chart option, the corpus will produce a bar graph of how often "Wuhan" is mentioned in news articles per year, by both raw and normalized (per million words) frequency. "Wuhan" is not used very often in most news discourse until 2020. (You can also opt to see frequency by country and jump to a KWIC search for all articles from Hong Kong, for example; Hong Kong was more likely than other countries to report on news from Wuhan prior to the Covid-19 pandemic.)
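The "normalized (per million words)" figure in the chart is simply the raw count scaled by corpus size, which makes years (or corpora) of different sizes comparable. A quick sketch with invented numbers:

```python
def per_million(raw_count, corpus_size):
    """Convert a raw hit count into a rate per million words of corpus."""
    return raw_count / corpus_size * 1_000_000

# Invented example: 500 hits in a 250-million-word slice of the corpus.
rate = per_million(500, 250_000_000)
print(rate)  # 2.0 hits per million words
```

When comparing frequencies across periods or corpora, always use the normalized rate rather than the raw count, since the amount of text per year varies.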

What's going on with the search query language?

Building out different kinds of searches (including thinking about prefixes, suffixes, and plurals) allows the user to search increasingly complex phenomena. Considering grammatical parts of speech helps differentiate words that look the same but have different meanings; the search apparatus has a drop-down menu next to the general box if you click on [POS] (part of speech). As an example, this POS feature allows users to specify "will" as the noun and not the verb with the search (will [POS > noun.all]), removing all irrelevant verbs. Being intentional about developing searches means excluding irrelevant examples, saving the user unnecessary cleanup at the analysis stage. High-frequency words will require narrowing your search somewhat (e.g. by year) to conduct analyses.

Downloading Corpora from English-Corpora.org

English-Corpora.org provides free, complete access to its data through a robust web-based platform. However, this doesn't work for everyone's needs, and you might want more localized access to their corpora for more advanced text and data mining tasks. English-Corpora.org provides access to its data under two available licenses, Academic and Non-Academic; the chart below displays the cost of access to one or more corpora. You cannot buy all of their corpora, but you can buy eleven of their biggest ones (see https://www.corpusdata.org/corpora.asp for details).

Penn State researchers seeking to download any corpus would fall under the license category "ACAD" (academic). 

Cost to get access to one or more corpora

License | Explanation | One corpus | Two corpora | 3+ corpora
ACAD | For use by university or college personnel (professors, teachers, students). | $375 | $595 | $200 each additional corpus
NON-ACAD | Any other use*, including commercial. | $795 | $1,395 | $400 each additional corpus

 

You can purchase their data, to be used any way you like, from English-Corpora.org in three different formats: relational databases, word/lemma/PoS, and words (paragraph format). Purchasing the data means you also purchase the rights to all of these formats. Read more about these formats and their affordances at https://www.corpusdata.org/formats.asp
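To give a sense of what working with the word/lemma/PoS format involves, here is a minimal Python sketch. The tab-separated, one-token-per-line layout shown here is an assumption for illustration; check the formats page above for the actual column layout before relying on it.

```python
def parse_wlp(lines):
    """Parse word/lemma/PoS rows into tuples, skipping malformed lines.

    Assumes one tab-separated (word, lemma, PoS) token per line; the real
    distribution format may differ from this sketch.
    """
    rows = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            rows.append(tuple(parts))
    return rows

# Invented sample rows: inflected word, its lemma, and a PoS tag.
sample = ["ran\trun\tvvd", "dogs\tdog\tnn2", "malformed line"]
rows = parse_wlp(sample)
```

Having the lemma and PoS tag alongside each word is what makes the downloaded data suitable for the more advanced text and data mining tasks mentioned above.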

For legal and copyright reasons, they cannot distribute 100% of a corpus to users, even paid users. When you purchase the data, you are purchasing 95% of the full-text data; the remaining 5% is removed for reasons of copyright. Read more about these limitations: https://www.corpusdata.org/limitations.asp. This is a common practice to conform with US copyright law and will not affect the validity of any results or output.

There are a couple of restrictions you must agree to before purchasing the data.

(These are all copied from their Restrictions page, which you must agree to in order to initiate purchase: https://www.corpusdata.org/restrictions.asp)

1. In no case can substantial amounts of the full-text data (typically, a total of 50,000 words or more) be distributed outside the organization listed on the license agreement. For example, you cannot create a large word list or set of n-grams, and then distribute this to others, and you could not copy 70,000 words from different texts and then place this on a website where users from outside your organization would have access to the data.

2. The data cannot be placed on a network (including the Internet), unless access to the data is limited (via restricted login, password, etc) just to those from the organization listed on the license agreement.

3. In addition to the full-text data itself, #2 also applies to derived frequency, collocates, n-grams, concordance and similar data that is based on the corpus.

4. If portions of the derived data are made available to others, it cannot include substantial portions of the raw frequency of words (e.g. the word occurs 3,403 times in the corpus) or the rank order (e.g. it is the 304th most common word). (Note: it is acceptable to use the frequency data to place words and phrases in "frequency bands", e.g. words 1-1000, 1001-3000, 3001-10,000, etc. However, there should not be more than about 20 frequency bands in your application.)

5. Academic licenses: these are only valid for one campus. So if you are part of a research group, for example, with members at universities X, Y, and Z, they all need to purchase the data separately.

6. Academic licenses: you cannot use the data to create software or products that will be sold to others.

7. Academic licenses: students in your undergraduate classes cannot have access to substantial portions of the data (e.g. 50,000 words or more). Graduate students can have access to the data for work on theses and dissertations. The data is primarily intended for use in research, not teaching. If you need corpus data for undergraduate classes, please use the standard web interface for the corpora.

8. Academic and Commercial licenses: supervisors will make best efforts to ensure that other employees or students who have access to the data are aware of these restrictions.

9. Commercial license: large companies with employees at several different sites (especially different countries) may need to contact us for a special license.

10. Any publications or products that are based on the data should contain a reference to the source of the data: https://www.corpusdata.org.
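The "frequency bands" idea in restriction #4 can be sketched as a simple binning function. The band edges below are taken from the example in the license text; the function name is ours, for illustration only:

```python
def frequency_band(rank, edges=(1000, 3000, 10000)):
    """Map a word's frequency rank to a coarse band index.

    Band 0 covers ranks 1-1000, band 1 covers 1001-3000, band 2 covers
    3001-10,000, and band 3 is everything rarer, following the example
    bands given in the license text.
    """
    for band, edge in enumerate(edges):
        if rank <= edge:
            return band
    return len(edges)

# A word ranked 304th lands in the first band: you may share the band,
# but not the exact rank or raw frequency.
print(frequency_band(304))  # 0
```

In other words, derived resources may disclose that a word is "very common" or "rare", but not its precise count or rank in the corpus.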

There are a lot of search options available.

How do I know which one is right for me?

The overall goal of English-Corpora.org's search interface is to help researchers observe language in use, and it presents several different ways to get at that. The List-to-Keyword-in-Context pipeline is the most intuitive, but it might not be the most helpful way to approach the corpus. This page discusses how to best harness the different search options, with some examples.

List

If your search is designed to produce a range of outputs and you want to search a variety of them, you might want to start with the LIST feature. This is helpful if you want to find:

  • Singular and plural examples of the same word (woman, women; football, footballs)
  • Words with the same prefix or suffix (all words ending in -ING, all words beginning with intra-)
  • All forms of a specific verb (have and had)
  • Alternative spellings for the same word (color vs colour) - this one is especially useful if you are looking at multiple English dialects!
  • A specific string match - is it even in the corpus?

All these outcomes list frequencies for each hit - you might be surprised by some words that are used more frequently than others! Clicking on overall frequency will take you straight to the Keyword in Context view. Importantly, all of these search queries will work for the additional searches described below -- the presentation of the outcome is what will change.

Chart 

Using the data from the LIST feature to create a heatmap shows when, in a specific period of time (usually by year, but possibly by month depending on your corpus), your word or words were especially frequent. The darker the box containing the number, the more the term is used. This visualization is helpful for identifying periods of time to focus your keyword-in-context searching.

Collocates

What words are more likely than chance to appear within a specific window around your target search term? This feature identifies those collocates for you and leads you to a keyword-in-context view.
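"More likely than chance" is typically quantified with an association measure. Here is a minimal sketch using pointwise mutual information (PMI), with invented counts; the interface's actual scoring method may differ:

```python
import math

def pmi(pair_count, word_count, colloc_count, total_words):
    """Pointwise mutual information: log2 of the observed co-occurrence
    rate over the rate expected if the two words were independent.
    Positive values mean the pair co-occurs more often than chance."""
    p_pair = pair_count / total_words
    p_word = word_count / total_words
    p_colloc = colloc_count / total_words
    return math.log2(p_pair / (p_word * p_colloc))

# Invented counts: the pair co-occurs 8 times; the words occur 10 and
# 20 times respectively, in a 1,000-word toy corpus.
score = pmi(8, 10, 20, 1000)
```

A high score like this one flags a genuine collocation; words that merely happen to be frequent everywhere score near zero.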

Compare

If you want to look at two wordforms simultaneously (compare their use against each other), you'd want to use this feature; it provides the option to explore collocates for each search term too.

Keyword in Context (KWIC)

This option is best if you know exactly what you want to look for and want to go straight to a set of results. Otherwise, you could use the LIST function to confirm the word form you are searching for is available in the corpus and click once more to go straight to KWIC results.  

Search Syntax

I know which search outcome I want to use but I want to make my search query really good!

Because this is a grammatically-driven interface, the programmers want to help you surface particular parts of speech (verbs, adverbs, etc) and other linguistic features. Here is a short list of kinds of searches English-Corpora can support. 

Single word:     mysterious, skew
Phrase:          make up, on the other hand
Any word:        more * than, * bit
Wildcard:        *icity, *break*, b?t?er
Lemma (forms):   DECIDE, CURVE_n
Part of speech:  rough NOUN, VERB money
Alternants:      fast|slow, fast|slow rate
NOT:             pretty -NOUN (compare pretty NOUN)
Synonyms:        =beautiful, =strong ARGUMENT

English-Corpora.org is a great resource for teaching language variation and change in action. This tab offers some suggestions about how you might use it in a pedagogical setting.

What can we learn from news discourse?

Choose a major topic or issue that was popular in the past five years - such as climate change or migration - develop some search terms, and observe in close detail how these topics are discussed, using one aggregated search across many different news sources. What rhetorical strategies do different sources employ? What does that look like? How do different countries report on the same issue?

How do we use words and do they change their usage or meaning over time?

Using the Corpus of Historical American English or the Corpus of Contemporary American English, choose some keywords to start with that are interesting to you. Do they rise and fall over time, or do they stay pretty consistent? The Corpus of Contemporary American English is especially fun for tracking slang usage since 1990. Watch the rise and fall of "on fleek" or "woke". The Corpus of Historical American English is especially useful for thinking about how a particular word has been used over the past century. And, you can always check to see if something has entered the cultural zeitgeist enough to make it to TV or movie scripts (or soap operas).

I want to check something weird I saw somewhere online. I think it's wrong!

Resources like the Global Web-Based English corpus are really helpful for observing newer forms of words you haven't seen before -- where else is this construction used? Or is this really just a one-off weird thing? (hint: it's probably not.)
