PDE corpora

The following table (to be updated) lists some useful corpus resources for Present-day English (PDE). Most are available either on a fileserver at Manchester or on the web, in some cases with their own search engine. Notes on some of them follow the table. For other resources, including using the web as a corpus, see the items at the foot of the page.

Name | Period | Variety | Words | Tag/Parse | More information
Brown Corpus of American English | 1961 | American | 1m | Y/N | On server. Part of ICAME collection
LOB: Lancaster-Oslo-Bergen Corpus of British English | 1961 | British | 1m | Y/N | On server. Part of ICAME collection
FROWN: Freiburg-Brown Corpus of American English | 1992 | American | 1m | Y/N | On server. Part of ICAME collection
F-LOB: Freiburg-LOB Corpus of British English | 1992 | British | 1m | Y/N | On server. Part of ICAME collection
Lancaster Parsed Corpus | 1961 | British | 133k | Y/Y | On server. Part of ICAME collection
Polytechnic of Wales Corpus | 1978-1984 | child | 65k | Y/Y | On server. Part of ICAME collection
Wellington Corpus of NZ English | 1986-1990 | New Zealand | 1m | N/N | On server. Part of ICAME collection
ICE Corpus of East African English | 1991-1996 | East African | 1.4m | N/N | On server. Part of ICAME collection
Kolhapur Corpus of Indian English | 1978 | Indian | 1m | N/N | On server. Part of ICAME collection
London-Lund Corpus | 1960-1990 | British, spoken | 1m | N/N | On server. Part of ICAME collection
Lancaster/IBM Spoken English Corpus | 1954-1987 | British, spoken | 53k | Y/N | On server. Part of ICAME collection
Corpus of London Teenage Language | 1993 | teenage, spoken | 500k | N/N | On server. Part of ICAME collection
BNC: British National Corpus | 1960-1993 | British | 100m | Y/N | Web (protected). Access here, details here
ANC2: American National Corpus | 1990- | American | 22m | Y/Y* | On server. Details here
COCA: Corpus of Contemporary American English | 1990-2012 | American | 450m | Y/N | Web (need to register). Access here
GloWbE: Corpus of Global Web-Based English | ? | 20 world Englishes | 1.9bn | Y/N | Web (need to register). Access here
Google Books American English | ? | American | 155bn | Y/N | Web (need to register). Access here
Collins WordBanks Online (Trial) | ? |  | 57m | N/N | Web (protected). One-month trial. Access here
English Dialects Online | 1950-1999 | English, dialects | ? | N/N | Web (open). Details here
ICE-GB corpus (2nd edn) | 1990s | British | 1m | Y/Y | On server. Part of ICE
Diachronic Corpus of Present-Day Spoken English | 1960-1999 | British | 800k | Y/Y | On server. Part of ICE
FRED: Freiburg Corpus of English Dialects sampler | 1970-2000 | British, dialects | 1m | Y/N | On server. Details here
NECTE: Newcastle Electronic Corpus of Tyneside English | 1960-1995 | Tyneside | ? | N/N | On server. Details here
Scottish Corpus of Texts & Speech | 1945-2007 | Scottish | 4m | N/N | Web (open). Access here
ICE-Ireland corpus | ? | Irish English | 1m | N/N | On server. Part of ICE
LDC-Online | ? | ? |  | ?/? | Web (protected). Details here
British Academic Written English Corpus | 2004-2007 | academic, written | 6.5m | N/N | Web (protected). Details here
British Academic Spoken English Corpus | 2004-2007 | academic, spoken | 1.6m | N/N | Web (protected). Details here
Singapore SMS Corpus | 2004-2013? | Singapore, text messages | ? | N/N | Web (open). Still growing? Details here
Enron Email Dataset | ? | e-mail | ? | N/N | Web (open). Details here

Note

Most texts held at Manchester are stored together on one fileserver, accessible on campus or via the VPN.

  • In Windows you can navigate using Windows Explorer or My Computer or within MonoConc Pro to
    \\nask.man.ac.uk\share$\fs_shared_01\Hum1\ALC\LEL_corpora.
  • On a Mac, in the Finder, from the Go menu, choose Connect to Server. Enter the server address
    smb://vdm02-g1.ds.man.ac.uk/fs_shared_01$/HUM1/ALC/LEL_corpora and then click Connect.
  • Authenticate with your university username and password.

The subfolder PDE contains a large number of corpora.
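
For scripted work, the contents of the PDE subfolder can also be listed from code once the share is connected. The following is only a minimal sketch: the helper name list_corpora is hypothetical, the Windows path is the one quoted above, and it is an assumption that PDE sits directly beneath that folder and that each corpus has its own subfolder there.

```python
from pathlib import Path

# UNC path as quoted in the Windows instructions above.
WINDOWS_SHARE = r"\\nask.man.ac.uk\share$\fs_shared_01\Hum1\ALC\LEL_corpora"


def list_corpora(share_root, subfolder="PDE"):
    """Return the sorted names of the directories inside share_root/subfolder.

    Assumption: each corpus is stored in its own directory under the
    subfolder; loose files at that level are ignored.
    """
    folder = Path(share_root) / subfolder
    if not folder.is_dir():
        raise FileNotFoundError(
            f"{folder} not found; is the share connected (on campus or via the VPN)?"
        )
    return sorted(p.name for p in folder.iterdir() if p.is_dir())


# Example call on a Windows machine with the share available:
#     for name in list_corpora(WINDOWS_SHARE):
#         print(name)
```

On a Mac the share would first be connected through the Finder as described above, and the mounted folder (typically somewhere under /Volumes) passed as share_root instead.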

Notes on specific corpora

  • ICAME Corpus Collection on CD-ROM
    Mostly copied to fileserver April 2003. This contains the following PDE items, samples of which can also be accessed on-line, together with various concordance programs. (See also the historical English page.)
    • The Brown Corpus of American English
    • The Lancaster-Oslo-Bergen Corpus of British English (LOB), tagged and untagged
    • The Freiburg-Brown Corpus (FROWN)
    • The Freiburg-LOB Corpus (F-LOB).
      • In addition to Brown and LOB (1961) and Frown and F-LOB (1991-2), there is now a later corpus, BE06 (2006), and some earlier ones, B-Brown and BLOB-1931 (1931±3), BLOB-1901 (1901±3).
    • Lancaster Parsed Corpus
    • Polytechnic of Wales Corpus (parsed)
    • Wellington Corpus of NZ English
    • ICE Corpus of East African English
    • Kolhapur Corpus of Indian English
    • London-Lund Corpus
    • Lancaster/IBM Spoken English Corpus (SEC)
    • Corpus of London Teenage Language (COLT)
  • BNC: British National Corpus with BNCweb (CQP edition) interface
    100 million words, written and spoken, mostly 1975-93 (some 1960-74), in the new XML edition. We have also bought BNCweb from Zurich to add extra functionality and a very easy user interface. Members of the University can access the software on or off campus using an ordinary web browser; the normal userid and password are needed. On cluster machines there is a shortcut on the Linguistic Resources menu. Click for information on BNC or on BNCweb. BNCweb offers brief help on its search methods as a PDF file. For more elaborate research projects, a template for the FileMaker Pro database program (version 7 and up) is provided. It supports importing a dataset downloaded from BNCweb, displaying tagged or untagged versions of hits in FMP, jumping back to BNCweb for the wider context, and showing all the textual and speaker information for a given example; ask DD for a copy of the template.
    There are other interfaces to BNC available. The freeware BNC Indexer apparently made it easier to select subsets by text classification criteria. An alternative interface by Mark Davies (BYU) is available on the same site as COCA and other corpora.
  • ANC: American National Corpus
    ANC Second Release purchased January 2006. Two versions of ANC2 usable with MonoConc, converted so that part-of-speech annotation is inline, are stored in the PDE folder of English_corpora. Each is divided into one spoken and two written subcorpora. Once loaded into MonoConc (probably in stages to avoid freezing the program!), text and concordance hits can be displayed with or without the tags. The original files are supplied on DVD with different versions of the annotation in ‘standoff’ form (i.e. in completely separate files from the text), available from David Denison. There are different PoS annotation schemes and also some rudimentary parsing, viz. chunking of NPs and VPs.
  • English Dialects Online
    From LINGUIST 15.576: ‘The British Library’s “Collect Britain” project has just put 131 recordings of Northern English dialects on-line … The recordings are from the Survey of English Dialects and the Millennium Memory Bank.’ The collection has clearly been widened since then and seems to cover the whole country (288 sites), with samples both from the SED (collected 1950-61 or -1974) and matched samples from the Millennium Memory Bank (1998-99). The collection has now moved to the British Library’s Archival Sound Recordings site.
  • ICE corpora
    • ICE-GB corpus (2nd edition): A tagged, parsed and checked corpus of British English from the 1990s with its own elaborate search engine. (Aligned sound files may be ordered too if there is demand for them.)
    • DCPSE The Diachronic Corpus of Present-Day Spoken English: ‘DCPSE … contains 400,000 words from ICE-GB (collected in the early 1990s) and 400,000 words from the London-Lund Corpus (late 1960s-early 1980s)…. [There is] sociolinguistic information on texts, speakers and authors.’ This corpus uses the same search engine as ICE-GB2 and is available in the same way.
      NB. ICE-GB and DCPSE are not currently available under Windows 7.
    • ICE-Ireland corpus: 1 million words from the Republic of Ireland and Northern Ireland, in plain text form (not yet tagged and parsed like the other ICE materials). The corpus is stored in the folder English_corpora\PDE.
  • LDC Online
    A large corpus at the Linguistic Data Consortium, available via Library databases under L.
    A special username or password is needed for this resource; ask DD or in the Library. Here is what it offers in English:
    • ‘Search […] English telephone conversations from Fisher and Switchboard.’
    • ‘Search and listen to audio files for more than 50,000 of the most common words in English.’
  • Coventry Academic Corpora
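
The ANC note above mentions that, with inline part-of-speech annotation, text can be displayed with or without the tags. Outside MonoConc the same idea can be sketched in a few lines of Python. This is an illustration only: the word_TAG token format with an underscore separator is an assumption (the note does not say how the converted ANC2 files mark their tags), and the function names are hypothetical.

```python
def parse_tagged(line, sep="_"):
    """Split a line of inline-tagged text into (word, tag) pairs.

    Assumption: tokens are whitespace-separated and look like word<sep>TAG.
    A token without the separator is treated as untagged (tag is None).
    """
    pairs = []
    for token in line.split():
        word, found, tag = token.rpartition(sep)
        if not found:
            # No separator in this token, so the whole token is the word.
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


def strip_tags(line, sep="_"):
    """Return the line with the inline tags removed."""
    return " ".join(word for word, _ in parse_tagged(line, sep))


# Example:
#     parse_tagged("The_DT cat_NN sat_VBD")
#     strip_tags("The_DT cat_NN sat_VBD")
```

Splitting on the last separator rather than the first means that a word which itself contains the separator still keeps its tag intact. The real files may well use a different convention, in which case sep, or the parsing itself, would need adjusting.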

Other resources

  • BBC Voices page – A nationwide project run in 2005 on contemporary British dialects (that is, all of the UK), with word maps, digitised speech samples, etc.
  • Accents of English from around the world (Edinburgh)
    INTUTE describes it as ‘a webpage that allows the user to compare the pronunciation of words between different dialects and varieties of English and some other Germanic languages. Equipped with a sound plug-in the user may listen to words in the many different forms available. Hovering over the IPA transcription of the word (or clicking it) returns the sound of the word in that particular variety. The site can be browsed by region, or by word, thus allowing different kinds of comparisons.’

Web search

  • WebCorp – Very useful site from the Research and Development Unit for English Studies, Birmingham City University, which will search the whole internet with (e.g.) Google for a word or phrase or pattern and produce a concorded listing of the results.
  • The Linguist List – Definitive Resource of Organisations, Programmes and Centres in Linguistics.
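
For readers unfamiliar with the term, the "concorded listing" mentioned under WebCorp is a keyword-in-context (KWIC) display: each hit is shown on one line with a fixed amount of text to its left and right. The sketch below illustrates the idea on a local string. It is not WebCorp's own method, and the function name kwic and its width parameter are hypothetical.

```python
import re


def kwic(text, pattern, width=30):
    """Return one keyword-in-context line per regex match of pattern in text.

    Matching is case-insensitive. Each line shows up to `width` characters of
    left context (right-aligned so that hits line up), the hit in square
    brackets, and up to `width` characters of right context.
    """
    flat = " ".join(text.split())  # collapse whitespace so each context stays on one line
    lines = []
    for m in re.finditer(pattern, flat, flags=re.IGNORECASE):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines


# Example:
#     for line in kwic("The cat sat on the mat. The dog saw the cat.", r"\bcat\b", width=10):
#         print(line)
```

Because the pattern is a regular expression, the same function covers a single word, a phrase, or a looser pattern such as r"\bcolou?r\w*".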

This page last updated 9th January 2017.
