Other corpora

This page is rather thinly and randomly populated. (The site was previously largely confined to English-language resources.) Suggestions for additions and updates are welcome. 

Name | Period | Language | Words | Tag/Parse | More information
KorpusDK | 1990-2000 | Danish | 56m | Y/N | Web (open) at ordnet.dk
Schweizer Textkorpus | 1900-1999 | Swiss German | 20m | N/N | Web (open). Access here
Romani Morphosyntax Database | ? | Romani (various) | ? | N/N | Web (open). Details here
Göteborg Spoken Language Corpus | ? | Swedish | 1.4m | Y/N | Web (protected). Details here
Parole corpus | 1976-1997 | Swedish | 19m | Y/N | Web (open). Details here
Stockholm Umeå Corpus Version 2.0 | ? | Swedish | 1m | Y/N | Web (open). Details here
LDC-Online | ? | Arabic, Chinese … | ? | ?/? | Web (protected). Details here
MCVF Corpus | pre-1837 | French | ? | Y/Y | Web (protected). Details here
Tycho Brahe Parsed Corpus of Historical Portuguese | 1350-1899 | Portuguese | ? | Y/Y | On server. Details here
IcePaHC: Icelandic Parsed Historical Corpus | 1150-2008 | Icelandic | 1m | Y/Y | On server. Details here
HeliPaD Corpus | C9th | Old Saxon | ? | Y/Y | On server. Details here
Sheffield Corpus of Chinese | pre-1911 | Chinese | 433k | Y/N | Web (protected). Register & search here
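
The Words column uses abbreviated sizes (56m, 1.4m, 433k) and "?" where the size is unknown. If you want to compare corpus sizes programmatically, a minimal sketch along the following lines may help. The function name parse_word_count is purely illustrative and not part of any corpus tool; it assumes that "k" means thousand and "m" means million.

```python
def parse_word_count(text):
    """Convert a size such as '56m', '1.4m' or '433k' to an integer.

    Returns None for '?' or an empty string (size unknown).
    Assumption: 'k' = thousand, 'm' = million.
    """
    multipliers = {"k": 1_000, "m": 1_000_000}
    value = text.strip().lower()
    if value in ("", "?"):
        return None
    suffix = value[-1]
    if suffix in multipliers:
        # round() guards against floating-point noise in values like 1.4
        return round(float(value[:-1]) * multipliers[suffix])
    return int(value)


# Sizes taken from the table above
sizes = {
    "KorpusDK": parse_word_count("56m"),
    "Göteborg Spoken Language Corpus": parse_word_count("1.4m"),
    "Sheffield Corpus of Chinese": parse_word_count("433k"),
    "MCVF Corpus": parse_word_count("?"),
}
```

Corpora of unknown size come back as None, so they can be filtered out before sorting or comparing.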


Most texts held at Manchester are stored together on one fileserver, accessible on campus or via the VPN.

  • In Windows you can navigate using Windows Explorer or My Computer or within MonoConc Pro to
  • On a Mac, in the Finder, from the Go menu, choose Connect to Server. Enter the server address
    smb://vdm02-g1.ds.man.ac.uk/fs_shared_01$/HUM1/ALC/LEL_corpora and then click Connect.
  • Authenticate with your university username and password.
  • Subfolders are transparently named by language.
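
Once the share is connected, the language subfolders can also be listed from a script. The following is a minimal sketch only: the function name list_language_folders is illustrative, and the path you pass to it depends on where your system mounts the share, which is not specified here.

```python
from pathlib import Path


def list_language_folders(corpus_root):
    """Return the names of the immediate subfolders of corpus_root, sorted.

    corpus_root is wherever the corpus share is mounted on your machine
    (an assumption: the exact mount point varies by system).
    """
    root = Path(corpus_root)
    if not root.is_dir():
        # Most likely the share is not connected, or you are off campus
        # without the VPN.
        raise FileNotFoundError(f"Cannot find corpus folder: {root}")
    # Keep directories only; loose files at the top level are ignored.
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

For example, if the share happened to be mounted at the hypothetical location /Volumes/LEL_corpora, calling list_language_folders("/Volumes/LEL_corpora") would return the language folder names in alphabetical order. A FileNotFoundError usually means the share is not currently connected.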

Notes on specific corpora

  • Romani Morphosyntax Database
    The database is interactive and has a mapping function. The Romani Project is based here in the Department. Here is its home page.
  • Swedish corpora
    • Corpora at the Department of Linguistics, Göteborg University
      This page provides links to a number of corpora (mainly Swedish) and information about others. Information about transcription methods, coding and software connected with the corpora is also provided here.
    • Göteborg Spoken Language Corpus
      The GSLC is a corpus of transcribed spoken Swedish taken from a variety of social activities. From the description on the corpus’s homepage: ‘Based on the fact that spoken language varies considerably in different social activities with regard to pronunciation, vocabulary and grammar, the goal of the corpus is to include spoken language from as many social activities as possible.’ Registration is necessary.
    • Språkbanken (The Language Bank)
      This site provides access to various written corpora of Swedish (contemporary and historical) as well as a number of dictionaries and databases. All can be found via the central Språkbanken site; a generic search interface for all the corpora can be accessed here. Two of the most useful corpora are given below, and three dictionaries in the next section:
  • LDC Online
    A large corpus at the Linguistic Data Consortium, including an archive of news text in Arabic and Chinese, available via the Library databases under L.
    A special username or password is needed for this resource; ask DD or in the Library.
  • MCVF: Modéliser le changement : les voies du français
    A large parsed historical corpus of French. Searchable via an online interface (after registering), though without full-text access. Alternatively, to search it using CorpusSearch, you can order the CD free of charge by filling in a request form, and it will be posted to you from Ottawa. GW has a copy.
  • Tycho Brahe Parsed Corpus of Historical Portuguese
    A large historical corpus of Portuguese, of which substantial chunks are tagged and parsed. Searchable via an online interface (after registering); it’s also on the server and can be searched using CorpusSearch.
  • Icelandic Parsed Historical Corpus (IcePaHC)
    A large parsed historical corpus of Icelandic, version 0.9. Searchable on the server using CorpusSearch, or online here.
  • Heliand Parsed Database (HeliPaD)
    A small parsed historical corpus of Old Saxon, version 0.9. Searchable on the server using CorpusSearch, or online here.





  • OLAC Language Resources Catalogue. ‘This catalogue, developed by the Open Language Archives Community (OLAC), provides access to a wealth of information about thousands of languages, including details of text collections, audio recordings, dictionaries, and software, sourced from dozens of digital and traditional archives.’
  • GerManC
    The project, located in the School, aims ‘to compile a representative historical corpus of written German for the years 1650-1800’.
  • Discover Irish website, Raymond Hickey

This page last updated 9th January 2017.
