The following table (to be updated) lists some useful corpus resources for Present-day English. Most are available either on a fileserver at Manchester or on the web, in some cases with their own search engine. There are notes below the table on some of them. For other resources, including using the web as corpus, see the items at the foot of the page.
|Corpus||Date||Variety||Size (words)||Tagged/Parsed||Availability|
|Brown Corpus of American English||1961||American||1m||Y/N||On server. Part of ICAME collection|
|LOB: Lancaster-Oslo-Bergen Corpus of British English||1961||British||1m||Y/N||On server. Part of ICAME collection|
|FROWN: Freiburg-Brown Corpus of American English||1992||American||1m||Y/N||On server. Part of ICAME collection|
|F-LOB: Freiburg-LOB Corpus of British English||1992||British||1m||Y/N||On server. Part of ICAME collection|
|Lancaster Parsed Corpus||1961||British||133k||Y/Y||On server. Part of ICAME collection|
|Polytechnic of Wales Corpus||1978-1984||child||65k||Y/Y||On server. Part of ICAME collection|
|Wellington Corpus of NZ English||1986-1990||New Zealand||1m||N/N||On server. Part of ICAME collection|
|ICE Corpus of East African English||1991-1996||East African||1.4m||N/N||On server. Part of ICAME collection|
|Kolhapur Corpus of Indian English||1978||Indian||1m||N/N||On server. Part of ICAME collection|
|London-Lund Corpus||1960-1990||British, spoken||1m||N/N||On server. Part of ICAME collection|
|Lancaster/IBM Spoken English Corpus||1954-1987||British, spoken||53k||Y/N||On server. Part of ICAME collection|
|Corpus of London Teenage Language||1993||teenage, spoken||500k||N/N||On server. Part of ICAME collection|
|BNC: British National Corpus||1960-1993||British||100m||Y/N||Web (protected). Access here, details here|
|ANC2: American National Corpus||1990-||American||22m||Y/Y*||On server. Details here|
|COCA: Corpus of Contemporary American English||1990-2012||American||450m||Y/N||Web (need to register). Access here|
|GloWbE: Corpus of Global Web-Based English||?||20 world Englishes||1.9bn||Y/N||Web (need to register). Access here|
|Google Books American English||?||American||155bn||Y/N||Web (need to register). Access here|
|Collins WordBanks Online (Trial)||?||–||57m||N/N||Web (protected). One-month trial. Access here|
|English Dialects Online||1950-1999||English, dialects||?||N/N||Web (open). Details here|
|ICE-GB corpus (2nd edn)||1990s||British||1m||Y/Y||On server. Part of ICE|
|Diachronic Corpus of Present-Day Spoken English||1960-1999||British||800k||Y/Y||On server. Part of ICE|
|FRED: Freiburg Corpus of English Dialects sampler||1970-2000||British, dialects||1m||Y/N||On server. Details here|
|NECTE: Newcastle Electronic Corpus of Tyneside English||1960-1995||Tyneside||?||N/N||On server. Details here|
|Scottish Corpus of Texts & Speech||1945-2007||Scottish||4m||N/N||Web (open). Access here|
|ICE-Ireland corpus||?||Irish English||1m||N/N||On server. Part of ICE|
|LDC-Online||?||–||?||?/?||Web (protected). Details here|
|British Academic Written English Corpus||2004-2007||academic, written||6.5m||N/N||Web (protected). Details here|
|British Academic Spoken English Corpus||2004-2007||academic, spoken||1.6m||N/N||Web (protected). Details here|
|Singapore SMS Corpus||2004-2013?||Singapore, text messages||?||N/N||Web (open). Still growing? Details here|
|Enron Email Dataset||?||?||?||N/N||Web (open). Details here|
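For anyone who wants to sort or filter the table by size, the rows can be read programmatically. What follows is only a sketch under stated assumptions: the rows are taken to be plain text in the `|cell||cell|` layout shown above, and the size abbreviations are taken to mean k = thousand, m = million and bn = billion words. The helper names `parse_row` and `parse_size` are hypothetical and not part of any corpus tool.

```python
from typing import List, Optional

# Assumed meaning of the size abbreviations in the table:
# k = thousand, m = million, bn = billion words.
MULTIPLIERS = {"bn": 1_000_000_000, "m": 1_000_000, "k": 1_000}


def parse_size(text: str) -> Optional[int]:
    """Convert a size such as '133k' or '1.9bn' to a word count; None if unknown."""
    text = text.strip().lower()
    for suffix, factor in MULTIPLIERS.items():
        if text.endswith(suffix):
            try:
                return round(float(text[: -len(suffix)]) * factor)
            except ValueError:
                return None
    return None


def parse_row(line: str) -> List[str]:
    """Split one '|cell||cell|...' table row into its cells."""
    return [cell.strip() for cell in line.strip().strip("|").split("||")]


row = parse_row("|Lancaster Parsed Corpus||1961||British||133k||Y/Y||On server. Part of ICAME collection|")
print(row[0], parse_size(row[3]))
```

Entries whose size is given as `?` come back as `None`, so they can be skipped or listed separately rather than being treated as zero.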
Most texts held at Manchester are stored together on one fileserver, accessible on campus or via the VPN.
- In Windows you can navigate using Windows Explorer or My Computer or within MonoConc Pro to
- On a Mac, in the Finder, from the Go menu, choose Connect to Server. Enter the server address
smb://vdm02-g1.ds.man.ac.uk/fs_shared_01$/HUM1/ALC/LEL_corpora and then click Connect.
- Authenticate with your university username and password.
The subfolder PDE contains a large number of corpora.
Notes on specific corpora
- ICAME Corpus Collection on CD-ROM
Mostly copied to fileserver April 2003. This contains the following PDE items, samples of which can also be accessed on-line, together with various concordance programs. (See also the historical English page.)
- The Brown Corpus of American English
- The Lancaster-Oslo-Bergen Corpus of British English (LOB), tagged and untagged
- The Freiburg-Brown Corpus (FROWN)
- The Freiburg-LOB Corpus (F-LOB).
- In addition to Brown and LOB (1961) and FROWN and F-LOB (1991-2), there is now a later corpus, BE06 (2006), and some earlier ones: B-Brown and BLOB-1931 (1931±3), and BLOB-1901 (1901±3).
- Lancaster Parsed Corpus
- Polytechnic of Wales Corpus (parsed)
- Wellington Corpus of NZ English
- ICE Corpus of East African English
- Kolhapur Corpus of Indian English
- London-Lund Corpus
- Lancaster/IBM Spoken English Corpus (SEC)
- Corpus of London Teenage Language (COLT)
- BNC: British National Corpus with BNCweb (CQP edition) interface
100 million words, written and spoken, mostly 1975-93 (some 1960-74), in the new XML edition. We have also bought BNCweb from Zurich, which adds extra functionality and a very easy user interface. Members of the University can access the software on or off campus using an ordinary web browser; the normal userid and password are needed. On cluster machines there is a shortcut on the Linguistic Resources menu. Click for information on BNC or on BNCweb. BNCweb offers brief help on its search methods as a pdf file.
For more elaborate research projects, a template for the FileMaker Pro database program (version 7 and up) is provided. It allows you to import a dataset downloaded from BNCweb, display tagged or untagged versions of hits in FileMaker Pro, jump back to BNCweb for the wider context, and show all the textual and speaker information for a given example; ask DD for a copy of the template.
There are other interfaces to BNC available. The freeware BNC Indexer apparently made it easier to select subsets by text classification criteria. An alternative interface by Mark Davies (BYU) is available on the same site as COCA and other corpora.
- ANC: American National Corpus
ANC Second Release purchased January 2006. Two versions of ANC2 usable with MonoConc, converted so that part-of-speech annotation is inline, are stored in the PDE folder of English_corpora. Each is divided into one spoken and two written subcorpora. Once loaded into MonoConc (probably in stages to avoid freezing the program!), text and concordance hits can be displayed with or without the tags. The original files are supplied on DVD with different versions of the annotation in ‘standoff’ form (i.e. in completely separate files from the text), available from David Denison. There are different PoS annotation schemes and also some rudimentary parsing, viz. chunking of NPs and VPs.
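To illustrate what displaying text "with or without the tags" involves, here is a minimal sketch. It assumes, purely for illustration, an inline format in which each token is written `word_TAG`; the actual inline format of the converted ANC2 files is not described on this page and may well differ. The function name `split_tags` is hypothetical.

```python
from typing import List, Tuple


def split_tags(line: str, sep: str = "_") -> List[Tuple[str, str]]:
    """Split each whitespace-separated token into (word, tag) at the last sep."""
    pairs = []
    for token in line.split():
        word, found, tag = token.rpartition(sep)
        # Tokens without a separator are kept as untagged words.
        pairs.append((word, tag) if found else (token, ""))
    return pairs


# Assumed example line in word_TAG form (not taken from ANC2 itself).
tagged = "The_DT corpus_NN was_VBD tagged_VBN"
pairs = split_tags(tagged)
untagged = " ".join(word for word, _ in pairs)
print(untagged)  # The corpus was tagged
```

Standoff annotation, by contrast, keeps the tags in separate files that point into the text, so no stripping of this kind is needed to get at the plain text.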
- English dialects online
From LINGUIST 15.576: ‘The British Library’s “Collect Britain” project has just put 131 recordings of Northern English dialects on-line … The recordings are from the Survey of English Dialects and the Millennium Memory Bank.’ The collection has clearly been widened since then and seems to cover the whole country (288 sites), with samples both from the SED (collected 1950-61 or -1974) and matched samples from the Millennium Memory Bank (1998-99). The collection has now moved to the British Library’s Archival Sound Recordings site.
- ICE corpora
- ICE-GB corpus (2nd edition): A tagged, parsed and checked corpus of British English from the 1990s with its own elaborate search engine. (Aligned sound files may be ordered too if there is demand for them.)
- DCPSE (The Diachronic Corpus of Present-Day Spoken English): ‘DCPSE … contains 400,000 words from ICE-GB (collected in the early 1990s) and 400,000 words from the London-Lund Corpus (late 1960s-early 1980s)…. [There is] sociolinguistic information on texts, speakers and authors.’ This corpus uses the same search engine as ICE-GB2 and is available in the same way.
NB. ICE-GB and DCPSE are not currently available under Windows 7.
- ICE-Ireland corpus: 1 million words from the Republic of Ireland and Northern Ireland, in plain text form (not yet tagged and parsed like the other ICE materials). The corpus is stored in the folder English_corpora\PDE.
- LDC Online
A large corpus at the Linguistic Data Consortium, available via Library databases under L.
A special username or password is needed for this resource; ask DD or in the Library. Here is what it offers in English:
- ‘Search […] English telephone conversations from Fisher and Switchboard.’
- ‘Search and listen to audio files for more than 50,000 of the most common words in English.’
- Coventry Academic Corpora
- BAWE Corpus (British Academic Written English, Coventry University): An online resource ‘free of charge to researchers who agree to the conditions of use and who register with the Oxford Text Archive’, though some functions appear to be available for casual online use.
- BASE Corpus (British Academic Spoken English, Coventry University): The lecture part of the corpus can be searched online here, or the full XML version obtained from the OTA.
- BBC Voices page – A nationwide project run in 2005 on contemporary British dialects (that is, all of the UK), with word maps, digitised speech samples, etc.
- Accents of English from around the world (Edinburgh)
INTUTE describes it as ‘a webpage that allows the user to compare the pronunciation of words between different dialects and varieties of English and some other Germanic languages. Equipped with a sound plug-in the user may listen to words in the many different forms available. Hovering over the IPA transcription of the word (or clicking it) returns the sound of the word in that particular variety. The site can be browsed by region, or by word, thus allowing different kinds of comparisons.’
- WebCorp – Very useful site from the Research and Development Unit for English Studies, Birmingham City University, which will search the whole internet with (e.g.) Google for a word or phrase or pattern and produce a concorded listing of the results.
- The Linguist List – Definitive Resource of Organisations, Programmes and Centres in Linguistics.
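A concordanced (keyword-in-context) listing of the kind WebCorp is described above as producing can be sketched in a few lines. This is an illustration only and not WebCorp's own method: it works on a local string rather than web search results, it matches whole words case-insensitively, and the function name `kwic` and its `width` parameter are hypothetical.

```python
import re
from typing import List


def kwic(text: str, keyword: str, width: int = 30) -> List[str]:
    """Return one aligned line per whole-word, case-insensitive match."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-justify the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines


sample = "A corpus is a body of text. Each corpus here is searchable."
for line in kwic(sample, "corpus", width=15):
    print(line)
```

Each hit is printed on its own line with the keyword in brackets and a fixed amount of context on either side, which is the usual layout of a concordance.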
This page last updated 9th January 2017.