
Linguistics

This section of the guide provides digital resources for computational linguistics and for navigating corpus data.

English-Corpora.org & Computational Linguistics

What is a corpus?

The Oxford English Dictionary defines a corpus as "the body of written or spoken material upon which a linguistic analysis is based." For linguistic purposes, a corpus may be made up of text, audio, video, or other types of files and can be analyzed with various software tools.

How do I work with a corpus?

Students work with corpora in Linguistics department courses like LING 4100/5800: Statistical Analysis and LING 5200: Introduction to Computational Corpus Linguistics. The Libraries' Center for Research Data and Digital Scholarship (CRDDS) offers workshops that support many of the skills needed to work with corpora: finding them, cleaning and processing data, adding metadata where needed, and analyzing the data with programming tools such as Python, UNIX, or R. Some "out of the box" software is available to analyze corpora without much technical knowledge. CRDDS also offers consultations, including Interdisciplinary Consult Hours on Tuesdays, 12-1pm in Norlin E206: drop-in consultations and a communal working space for digital scholarship research, from coding to writing to exchanging ideas.
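As a small illustration of the kind of processing these workshops cover, here is a minimal Python sketch (standard library only; the sample sentence and function name are invented for this example) that tokenizes a text and counts word frequencies:

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase the text, tokenize on letters/apostrophes, and count tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

sample = "The corpus is small, but the corpus is ours."
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

Real corpus work adds steps like handling punctuation, encoding, and part-of-speech tags, but frequency counting of this sort underlies most of the analyses described below.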

What does English-Corpora.org do?

English-Corpora.org, a clearinghouse of monitor corpora from BYU, provides a platform for running a variety of queries on its established corpora. Any logged-in user can search for specific words or phrases in each corpus, observe frequency over regular periods of time (months/years/decades), measure collocation, and perform other quantitative tasks for tracing language in use over time. More advanced users can also download a corpus for local use.

Understanding the corpus landing page

Upon choosing a corpus, you'll arrive on a landing page. There are two main boxes in the body of the site: on the left-hand side, a search apparatus and some display options; on the right-hand side, a description of the corpus. The top bar offers some additional search options.

 

Each of these features is labeled with a corresponding number and described below.

1. Search box - Type the character string you want to search for in the corpus. The search apparatus supports alphanumeric characters and special characters (such as * or ?), and spaces count as characters. Since we are looking at linguistic data, you can also add part-of-speech markers to identify specific grammatical features (nouns, verbs, prepositions, etc.).

The default is to list matches for the string. You can also clear the search by using the 'reset' button.

2. Other search options - The search capacity allows users to perform other kinds of searches; these are listed as links above the search box. Selecting one of these options changes the search apparatus' output to reflect the kind of search you are performing. (These operate somewhat like 'advanced' search features.)

  • CHART will produce a heatmap of your search term divided up by period of time.
  • COLLOCATES lets you identify terms that are more likely to appear around the initial search term than by simple chance.
  • COMPARE lets you compare two words in the corpus.
  • KWIC (keyword in context) searching shows you how your search term is used in context in aggregate with short snippets of text.

3. Description of corpus / Help box - This section provides information on the corpus selected, what it can be used for, and other information that might be relevant for the user (including the option to create a virtual sub-corpus from material within the corpus). The links provided here give examples of the kinds of searches you might want to perform and how to conduct them. It is possible to download the whole corpus through this informational box, but the web interface is more than adequate for running most searches. Clicking on the different search features (e.g. CHART) provides a description of what each approach does.

4. Header of different activities - Once you perform a search, your results will appear under these tabs in the web interface. This allows users to move across different features of the corpus interface without having to run new searches each time.

Understanding the various search options with the sample search of "Wuhan":

With the List option, the corpus will give you all the results for your search (helpful if you are looking for multiple words or forms with a * in the search query) and let you click through to the Keyword in Context View (KWIC).

With the Keyword in Context view (KWIC), your search term is highlighted in light green and shown with some surrounding context. Click on an example to get even more context. This is helpful for observing patterns of use in aggregate.
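Under the hood, a KWIC display is just a windowed slice of text around each match. Here is a rough Python sketch (with an invented sentence; the function name is ours, not the site's) of how such a concordance is built:

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) triples for each match."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

tokens = "the quick brown fox jumps over the lazy dog".split()
snippets = kwic(tokens, "fox", window=2)
print(snippets)  # [('quick brown', 'fox', 'jumps over')]
```

The web interface does this across millions of texts at once, which is why KWIC views are so useful for spotting patterns of use in aggregate.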

With the Chart option, the corpus will produce a bar graph of how often "Wuhan" is mentioned in news articles per year, by both raw and normalized (per million words) frequency. "Wuhan" is not used very often in most news discourse until 2020. (You can also opt to see frequency by country and jump to a KWIC search for all articles from Hong Kong, for example; Hong Kong was more likely than other countries to report on news from Wuhan prior to the Covid-19 pandemic.)
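The "normalized (per million words)" figure in the chart is simply the raw count scaled by corpus size, which makes years (or corpora) of different sizes comparable. A quick sketch with invented numbers:

```python
def per_million(raw_count, corpus_size):
    """Convert a raw hit count into a rate per million words of corpus."""
    return raw_count / corpus_size * 1_000_000

# Invented example: 500 hits in a 250-million-word slice of the corpus.
rate = per_million(500, 250_000_000)
print(rate)  # 2.0 hits per million words
```

When comparing frequencies across periods or corpora, always use the normalized rate rather than the raw count, since the amount of text per year varies.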

What's going on with the search query language?

Building out different kinds of searches (including thinking about prefixes, suffixes, and plurals) allows the user to search increasingly complex phenomena. Considering grammatical parts of speech helps differentiate words that look the same but have different meanings; the search apparatus has a drop-down menu next to the general box if you click on [POS] (part of speech). As an example, this POS feature allows users to specify "will" as the noun and not the verb with the search (will [POS > noun.all]), removing all irrelevant verbs. Being intentional about developing searches means excluding irrelevant examples, saving the user unnecessary cleanup at the analysis stage. High-frequency words will require narrowing your search somewhat (e.g. by year) to conduct analyses.

Downloading Corpora from English-Corpora.org

English-Corpora.org provides free, complete access to its data through a robust web-based platform. However, this doesn't work for everyone's needs, and you might want more localized access to their corpora for more advanced text and data mining tasks. English-Corpora.org provides access to its data under two available licenses, Academic and Non-Academic; the chart below displays the cost of access to one or more corpora. You cannot buy all of their corpora, but you can buy eleven of their biggest ones (see https://www.corpusdata.org/corpora.asp for details).

Penn State researchers seeking to download any corpus would fall under the license category "ACAD" (academic). 

Cost to get access to one or more corpora

License | Explanation | One corpus | Two corpora | 3+ corpora
ACAD | For use by university or college personnel (professors, teachers, students). | $375 | $595 | $200 each additional corpus
NON-ACAD | Any other use*, including commercial. | $795 | $1,395 | $400 each additional corpus

 

You can purchase their data, to be used any way you like, from English-Corpora.org in three different formats: relational databases, word/lemma/PoS, and words (paragraph format). Purchasing the data means you also purchase the rights to all of these formats. Read more about these formats and their affordances at https://www.corpusdata.org/formats.asp
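To give a sense of what working with the word/lemma/PoS format involves, here is a minimal Python sketch. The tab-separated, one-token-per-line layout shown here is an assumption for illustration; check the formats page above for the actual column layout before relying on it.

```python
def parse_wlp(lines):
    """Parse word/lemma/PoS rows into tuples, skipping malformed lines.

    Assumes one tab-separated (word, lemma, PoS) token per line; the real
    distribution format may differ from this sketch.
    """
    rows = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            rows.append(tuple(parts))
    return rows

# Invented sample rows: inflected word, its lemma, and a PoS tag.
sample = ["ran\trun\tvvd", "dogs\tdog\tnn2", "malformed line"]
rows = parse_wlp(sample)
```

Having the lemma and PoS tag alongside each word is what makes the downloaded data suitable for the more advanced text and data mining tasks mentioned above.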

For legal and copyright reasons, they cannot distribute 100% of a corpus to users, even paid users. When you purchase the data, you are purchasing 95% of the full-text data; the remaining 5% is removed for reasons of copyright. Read more about these limitations: https://www.corpusdata.org/limitations.asp. This is a common practice to conform with US copyright law and will not affect the validity of any results or output.

There are a couple of restrictions you must agree to before purchasing the data.

(These are all copied from their Restrictions page, which you must agree to in order to initiate purchase: https://www.corpusdata.org/restrictions.asp)

1. In no case can substantial amounts of the full-text data (typically, a total of 50,000 words or more) be distributed outside the organization listed on the license agreement. For example, you cannot create a large word list or set of n-grams, and then distribute this to others, and you could not copy 70,000 words from different texts and then place this on a website where users from outside your organization would have access to the data.

2. The data cannot be placed on a network (including the Internet), unless access to the data is limited (via restricted login, password, etc) just to those from the organization listed on the license agreement.

3. In addition to the full-text data itself, #2 also applies to derived frequency, collocates, n-grams, concordance and similar data that is based on the corpus.

4. If portions of the derived data are made available to others, it cannot include substantial portions of the raw frequency of words (e.g. the word occurs 3,403 times in the corpus) or the rank order (e.g. it is the 304th most common word). (Note: it is acceptable to use the frequency data to place words and phrases in "frequency bands", e.g. words 1-1000, 1001-3000, 3001-10,000, etc. However, there should not be more than about 20 frequency bands in your application.)

5. Academic licenses: these are only valid for one campus. So if you are part of a research group, for example, with members at universities X, Y, and Z, they all need to purchase the data separately.

6. Academic licenses: you cannot use the data to create software or products that will be sold to others.

7. Academic licenses: students in your undergraduate classes cannot have access to substantial portions of the data (e.g. 50,000 words or more). Graduate students can have access to the data for work on theses and dissertations. The data is primarily intended for use in research, not teaching. If you need corpus data for undergraduate classes, please use the standard web interface for the corpora.

8. Academic and Commercial licenses: supervisors will make best efforts to ensure that other employees or students who have access to the data are aware of these restrictions.

9. Commercial license: large companies with employees at several different sites (especially different countries) may need to contact us for a special license.

10. Any publications or products that are based on the data should contain a reference to the source of the data: https://www.corpusdata.org.
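The "frequency bands" idea in restriction #4 can be sketched as a simple binning function. The band edges below are taken from the example in the license text; the function name is ours, for illustration only:

```python
def frequency_band(rank, edges=(1000, 3000, 10000)):
    """Map a word's frequency rank to a coarse band index.

    Band 0 covers ranks 1-1000, band 1 covers 1001-3000, band 2 covers
    3001-10,000, and band 3 is everything rarer, following the example
    bands given in the license text.
    """
    for band, edge in enumerate(edges):
        if rank <= edge:
            return band
    return len(edges)

# A word ranked 304th lands in the first band: you may share the band,
# but not the exact rank or raw frequency.
print(frequency_band(304))  # 0
```

In other words, derived resources may disclose that a word is "very common" or "rare", but not its precise count or rank in the corpus.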

There are a lot of search options available.

How do I know which one is right for me?

The overall goal of English-Corpora.org's search interface is to help researchers observe language in use, and it presents several different ways to get at that. The List-to-Keyword-in-Context pipeline is the most intuitive, but it might not be the most helpful way to approach the corpus. This page discusses how to best harness the different search options, with some examples.

List

If your search is designed to produce a range of outputs and you want to search a variety of them, you might want to start with the LIST feature. This is helpful if you want to find:

  • Singular and plural examples of the same word (woman, women; football, footballs)
  • Words with the same prefix or suffix (all words ending in -ING, all words beginning with intra-)
  • All forms of a specific verb (have and had)
  • Alternative spellings for the same word (color vs colour) - this one is especially useful if you are looking at multiple English dialects!
  • A specific string match - is it even in the corpus?

All these outcomes list frequencies for each hit - you might be surprised by some words that are used more frequently than others! Clicking on overall frequency will take you straight to the Keyword in Context view. Importantly, all of these search queries will work for the additional searches described below -- the presentation of the outcome is what will change.

Chart 

Using the data from the LIST feature to create a heatmap shows when, in a specific period of time (usually by year, but possibly by month depending on your corpus), your word or words were especially frequent. The darker the box containing the number, the more the term is used. This visualization is helpful for identifying periods of time to focus your keyword-in-context searching.

Collocates

What words are more likely than chance to appear within a specific window around your target search term? This feature identifies those collocates for you and leads you to a keyword-in-context view.
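"More likely than chance" is typically quantified with an association measure. Here is a minimal sketch using pointwise mutual information (PMI), with invented counts; the interface's actual scoring method may differ:

```python
import math

def pmi(pair_count, word_count, colloc_count, total_words):
    """Pointwise mutual information: log2 of the observed co-occurrence
    rate over the rate expected if the two words were independent.
    Positive values mean the pair co-occurs more often than chance."""
    p_pair = pair_count / total_words
    p_word = word_count / total_words
    p_colloc = colloc_count / total_words
    return math.log2(p_pair / (p_word * p_colloc))

# Invented counts: the pair co-occurs 8 times; the words occur 10 and
# 20 times respectively, in a 1,000-word toy corpus.
score = pmi(8, 10, 20, 1000)
```

A high score like this one flags a genuine collocation; words that merely happen to be frequent everywhere score near zero.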

Compare

If you want to look at two wordforms simultaneously (compare their use against each other), you'd want to use this feature; it provides the option to explore collocates for each search term too.

Keyword in Context (KWIC)

This option is best if you know exactly what you want to look for and want to go straight to a set of results. Otherwise, you could use the LIST function to confirm the word form you are searching for is available in the corpus and click once more to go straight to KWIC results.  

Search Syntax

I know which search outcome I want to use but I want to make my search query really good!

Because this is a grammatically-driven interface, the programmers want to help you surface particular parts of speech (verbs, adverbs, etc) and other linguistic features. Here is a short list of kinds of searches English-Corpora can support. 

Single word:     mysterious, skew
Phrase:          make up, on the other hand
Any word:        more * than, * bit
Wildcard:        *icity, *break*, b?t?er
Lemma (forms):   DECIDE, CURVE_n
Part of speech:  rough NOUN, VERB money
Alternants:      fast|slow, fast|slow rate
NOT:             pretty -NOUN (compare pretty NOUN)
Synonyms:        =beautiful, =strong ARGUMENT

English-Corpora.org is a great resource for teaching language variation and change in action. This tab offers some suggestions about how you might use it in a pedagogical setting.

What can we learn from news discourse?

Choose a major topic or issue that was popular in the past five years - such as climate change or migration - develop some search terms, and observe in close detail how these topics are discussed, using one aggregated search across many different news sources. What rhetorical strategies do different sources employ? What does that look like? How do different countries report on the same issue?

How do we use words and do they change their usage or meaning over time?

Using the Corpus of Historical American English or the Corpus of Contemporary American English, choose some keywords to start with that are interesting to you. Do they rise and fall over time, or do they stay pretty consistent? The Corpus of Contemporary American English is especially fun for tracking slang usage since 1990. Watch the rise and fall of "on fleek" or "woke". The Corpus of Historical American English is especially useful for thinking about how a particular word has been used over the past century. And, you can always check to see if something has entered the cultural zeitgeist enough to make it to TV or movie scripts (or soap operas).

I want to check something weird I saw somewhere online. I think it's wrong!

Resources like the Global Web-Based English corpus are really helpful for observing newer forms of words you haven't seen before -- where else is this construction used? Or is this really just a one-off weird thing? (hint: it's probably not.)
