This page is rather thinly and randomly populated. (The site was previously largely confined to English-language resources.) Suggestions for additions and updates are welcome.
Name | Period | Language | Words | Tag/Parse | More information |
---|---|---|---|---|---|
KorpusDK | 1990-2000 | Danish | 56m | Y/N | Web (open) at ordnet.dk |
Schweizer Textkorpus | 1900-1999 | Swiss German | 20m | N/N | Web (open). Access here |
Romani Morphosyntax Database | ? | Romani (various) | ? | N/N | Web (open). Details here |
Göteborg Spoken Language Corpus | ? | Swedish | 1.4m | Y/N | Web (protected). Details here |
Parole corpus | 1976-1997 | Swedish | 19m | Y/N | Web (open). Details here |
Stockholm Umeå Corpus Version 2.0 | ? | Swedish | 1m | Y/N | Web (open). Details here |
LDC-Online | ? | Arabic, Chinese … | ? | ?/? | Web (protected). Details here |
MCVF Corpus | pre-1837 | French | ? | Y/Y | Web (protected). Details here |
Tycho Brahe Parsed Corpus of Historical Portuguese | 1350-1899 | Portuguese | ? | Y/Y | On server. Details here |
IcePaHC: Icelandic Parsed Historical Corpus | 1150-2008 | Icelandic | 1m | Y/Y | On server. Details here |
HeliPaD Corpus | C9th | Old Saxon | ? | Y/Y | On server. Details here |
Sheffield Corpus of Chinese | pre-1911 | Chinese | 433k | Y/N | Web (protected). Register & search here |
Note
Most texts held at Manchester are stored together on one fileserver, accessible on campus or via the VPN.
- In Windows you can navigate using Windows Explorer or My Computer or within MonoConc Pro to
\\nask.man.ac.uk\share$\fs_shared_01\Hum1\ALC\LEL_corpora.
- On a Mac, in the Finder, from the Go menu, choose Connect to Server. Enter the server address
smb://vdm02-g1.ds.man.ac.uk/fs_shared_01$/HUM1/ALC/LEL_corpora and then click Connect.
- Authenticate with your university username and password.
- Subfolders are transparently named by language.
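Once the share is connected, the language subfolders can also be listed from a script. The following is only a minimal sketch: the function name `list_language_folders` is hypothetical, and the path passed to it is an assumption that depends on how the share is mounted on your machine (the UNC path above on Windows, or the mount point chosen by the Finder on a Mac).

```python
from pathlib import Path
import sys


def list_language_folders(corpora_root):
    """Return the sorted names of the immediate subfolders of corpora_root.

    corpora_root is assumed to be the locally accessible path to the
    LEL_corpora folder; the function does not connect to the server itself.
    """
    root = Path(corpora_root)
    if not root.is_dir():
        # Typically means the share is not connected or the path is mistyped.
        raise FileNotFoundError(f"Not a reachable folder: {root}")
    # Keep directories only; loose files at the top level are ignored.
    return sorted(entry.name for entry in root.iterdir() if entry.is_dir())


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python list_folders.py <path-to-LEL_corpora>
    for name in list_language_folders(sys.argv[1]):
        print(name)
```

The path is taken as a command-line argument rather than hard-coded, since the mount location differs between Windows and Mac.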
Notes on specific corpora
- Romani Morphosyntax Database
The database is interactive and has a mapping function. The Romani Project is based here in the Department. Here is its home page.
- Swedish corpora
  - Corpora at the Department of Linguistics, Göteborg University
  This page provides links to a number of corpora (mainly Swedish) and information about others. Information about transcription methods, coding and software connected with the corpora is also provided here.
  - Göteborg Spoken Language Corpus
  The GSLC is a corpus of transcribed spoken Swedish taken from a variety of social activities. From the description on the corpus’s homepage: ‘Based on the fact that spoken language varies considerably in different social activities with regard to pronunciation, vocabulary and grammar, the goal of the corpus is to include spoken language from as many social activities as possible.’ Registration is necessary.
  - Språkbanken (The Language Bank)
  This site provides access to various written corpora of Swedish (contemporary and historical) as well as a number of dictionaries and databases. All can be found via the central Språkbanken site; a generic search interface for all the corpora can be accessed here. Two of the most useful corpora are given below, and three dictionaries in the next section:
    - the Parole corpus
    Ca. 19 million tokens of written Swedish tagged for word class.
    - the Stockholm Umeå Corpus Version 2.0
    SUC 2.0; ca. 1 million words of written Swedish.
- LDC Online
A large corpus at the Linguistic Data Consortium, including an archive of news text in Arabic and Chinese, available via Library databases under L.
A special username or password is needed for this resource; ask DD or in the Library.
- MCVF: Modéliser le changement : les voies du français
A large parsed historical corpus of French. Searchable via an online interface (after registering), though without full-text access; alternatively, to search it using CorpusSearch, you can order the CD free of charge by filling in a request form, and they will post it from Ottawa. GW has a copy.
- Tycho Brahe Parsed Corpus of Historical Portuguese
A large historical corpus of Portuguese, of which substantial chunks are tagged and parsed. Searchable via an online interface (after registering); it’s also on the server and can be searched using CorpusSearch.
- Icelandic Parsed Historical Corpus (IcePaHC)
A large parsed historical corpus of Icelandic, version 0.9. Searchable on the server using CorpusSearch, or online here.
- Heliand Parsed Database (HeliPaD)
A small parsed historical corpus of Old Saxon, version 0.9. Searchable on the server using CorpusSearch, or online here.
Dictionaries
Danish
- Den Danske Ordbog
A corpus-based dictionary of modern Danish (to appear during 2009).
- Ordbog over det danske Sprog
A dictionary of Danish from the period 1700-1950.
Swedish
- Svenska Akademiens Ordbok
- Söderwall & Schlyter dictionary of medieval Swedish
- Dahlin's dictionary of 19th-century Swedish
Links
- OLAC Language Resources Catalogue. ‘This catalogue, developed by the Open Language Archives Community (OLAC), provides access to a wealth of information about thousands of languages, including details of text collections, audio recordings, dictionaries, and software, sourced from dozens of digital and traditional archives.’
- GerManC
The project, located in the School, aims “to compile a representative historical corpus of written German for the years 1650-1800”.
- Discover Irish website, Raymond Hickey
This page last updated 9th January 2017.