The Oxford English Dictionary defines a corpus as, "The body of written or spoken material upon which a linguistic analysis is based." For linguistics purposes, a corpus may be made up of text, audio, video, or other types of files and can be analyzed with various software tools.
Each of these features is labelled with a corresponding number and described below.
1. Search box - Type in the character string you want to search for in the corpus. The search box accepts alphanumeric characters and special characters (such as * or ?); spaces also count as characters. Since we are looking at linguistic data, we can also add part-of-speech markers to identify specific grammatical features (nouns, verbs, prepositions, etc.).
By default, the search lists results for the string you entered. You can also clear the search by using the 'reset' button.
2. Other search options - The search capacity allows users to perform other kinds of searches; these are listed as links above the search box. Selecting one of these options will change the search apparatus' output to reflect the kind of search you are performing. (These operate somewhat like 'advanced' search features.)
3. Description of corpus / Help box - This section provides information on the corpus selected, what it can be used for, and other information that might be relevant for the user (including the option to create a virtual sub-corpus from material within the corpus). The links provided here give examples of the kinds of searches you might want to perform and how to conduct them. It is possible to download the whole corpus through this informational box, but the web interface is more than adequate for running most searches. Clicking on different search features (e.g. CHART) provides a description of what this approach does.
4. Header of different activities - Once you perform a search, your results will appear under these tabs using the web interface. This allows users to move across different features of the corpus interface without having to run new searches each time.
License | Explanation | One corpus | Two corpora | 3+ corpora (see example) |
---|---|---|---|---|
ACAD | For use by university or college personnel (professors, teachers, students). | $375 | $595 | $200 each additional corpus |
NON-ACAD | Any other use*, including commercial. | $795 | $1,395 | $400 each additional corpus |
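The pricing table can be turned into a quick cost estimate. The sketch below is only an illustration: the function name `license_cost` and the dictionary layout are hypothetical, and it assumes that the "3+ corpora" column means the two-corpus price plus the listed amount for each corpus beyond the second. Confirm the total against the vendor's own example before budgeting.

```python
# Prices copied from the table above (US dollars).
PRICES = {
    "ACAD": {"one": 375, "two": 595, "additional": 200},
    "NON-ACAD": {"one": 795, "two": 1395, "additional": 400},
}


def license_cost(license_type, n_corpora):
    """Estimate the total cost for a license type and number of corpora.

    Assumption: for three or more corpora, the total is the two-corpus
    price plus the "each additional corpus" price for every corpus
    beyond the second.
    """
    if n_corpora < 1:
        raise ValueError("n_corpora must be 1 or greater")
    price = PRICES[license_type]
    if n_corpora == 1:
        return price["one"]
    return price["two"] + (n_corpora - 2) * price["additional"]


print(license_cost("ACAD", 4))      # 595 + 2 * 200
print(license_cost("NON-ACAD", 3))  # 1395 + 1 * 400
```

Under that assumption, an academic license for four corpora comes to $995.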
You can purchase the data, to be used any way you like, from English-Corpora.org in three different formats: relational databases, word/lemma/PoS, and words (paragraph format). Purchasing the data means you also purchase the rights to any and all of these formats. Read more about these formats and their affordances at https://www.corpusdata.org/formats.asp.
For legal and copyright reasons, they cannot distribute 100% of the corpus to users, even paying users. When you purchase the data, you are purchasing 95% of the full-text data. The remaining 5% is removed for reasons of copyright. Read more about these limitations: https://www.corpusdata.org/limitations.asp. This is a common practice to conform with US copyright law and will not affect the validity of any results or output.
(These are all copied from their Restrictions page, which you must complete to initiate purchase: https://www.corpusdata.org/restrictions.asp)
1. In no case can substantial amounts of the full-text data (typically, a total of 50,000 words or more) be distributed outside the organization listed on the license agreement. For example, you cannot create a large word list or set of n-grams, and then distribute this to others, and you could not copy 70,000 words from different texts and then place this on a website where users from outside your organization would have access to the data.
2. The data cannot be placed on a network (including the Internet), unless access to the data is limited (via restricted login, password, etc) just to those from the organization listed on the license agreement.
3. In addition to the full-text data itself, #2 also applies to derived frequency, collocates, n-grams, concordance and similar data that is based on the corpus.
4. If portions of the derived data are made available to others, they cannot include substantial portions of the raw frequency of words (e.g. the word occurs 3,403 times in the corpus) or the rank order (e.g. it is the 304th most common word). (Note: it is acceptable to use the frequency data to place words and phrases in "frequency bands", e.g. words 1-1000, 1001-3000, 3001-10,000, etc. However, there should not be more than about 20 frequency bands in your application.)
5. Academic licenses: are only valid for one campus. So if you are part of a research group, for example, with members at universities X, Y, and Z, they all need to purchase the data separately.
6. Academic licenses: you can not use the data to create software or products that will be sold to others.
7. Academic licenses: students in your undergraduate classes cannot have access to substantial portions of the data (e.g. 50,000 words or more). Graduate students can have access to the data for work on theses and dissertations. The data is primarily intended for use in research, not teaching. If you need corpus data for undergraduate classes, please use the standard web interface for the corpora.
8. Academic and Commercial licenses: supervisors will make best efforts to ensure that other employees or students who have access to the data are aware of these restrictions.
9. Commercial license: large companies with employees at several different sites (especially different countries) may need to contact us for a special license.
10. Any publications or products that are based on the data should contain a reference to the source of the data: https://www.corpusdata.org.
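Restriction 4 permits sharing "frequency bands" in place of exact ranks. As a minimal sketch of what that could look like in practice, the snippet below maps a word's rank to a band label. The band boundaries are the three given in the restriction's example; the function name `frequency_band` and the open-ended final band are assumptions made for illustration, not part of the license text.

```python
from bisect import bisect_left

# Upper rank bound of each band, taken from the example in restriction 4.
# Keep the total number of bands at about 20 or fewer.
BAND_UPPER_BOUNDS = [1000, 3000, 10000]


def frequency_band(rank, bounds=BAND_UPPER_BOUNDS):
    """Return a band label such as '1001-3000' for a 1-based frequency rank."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    i = bisect_left(bounds, rank)  # index of the first bound >= rank
    if i == len(bounds):
        # Hypothetical catch-all band for anything past the last bound.
        return f"{bounds[-1] + 1}+"
    lower = 1 if i == 0 else bounds[i - 1] + 1
    return f"{lower}-{bounds[i]}"


print(frequency_band(304))    # falls in the first band
print(frequency_band(1001))   # first rank of the second band
print(frequency_band(25000))  # beyond the last listed bound
```

Publishing the band label rather than the exact rank or raw count keeps shared word lists within the limits described above.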
Single word: mysterious, skew
Phrase: make up, on the other hand
Any word: more * than, * bit
Wildcard: *icity, *break*, b?t?er
Lemma (forms): DECIDE, CURVE_n
Part of speech: rough NOUN, VERB money
Alternants: fast|slow, fast|slow rate
NOT: pretty -NOUN (compare pretty NOUN)
Synonyms: =beautiful, =strong ARGUMENT
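To get a feel for the wildcard patterns before running them on the site, you can try them against a small word list. The sketch below uses Python's standard `fnmatch` module, where `*` stands for any run of characters and `?` for exactly one character. It is only an approximation of how the corpus interface is assumed to behave, and the word list and the function name `wildcard_matches` are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical mini word list; real searches run against the corpus itself.
WORDS = [
    "electricity", "city", "breakfast", "unbreakable",
    "better", "bitter", "bit", "outbreak",
]


def wildcard_matches(pattern, words=WORDS):
    """Return the words that match a shell-style wildcard pattern."""
    return [w for w in words if fnmatchcase(w, pattern)]


for pattern in ["*icity", "*break*", "b?t?er"]:
    print(pattern, wildcard_matches(pattern))
```

Here `*icity` picks out "electricity" but not "city", `*break*` finds "break" anywhere inside a word, and `b?t?er` matches "better" and "bitter" but not "bit".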
English-Corpora.org is a great resource for teaching language, variation and change in action. This tab offers some suggestions about how you might use it in a pedagogical setting.