Our corpus server (overview) currently has corpora available for the following languages.
- Danish text and annotated corpora (DFK, Korpus90/2000, Europarl)
- Danish corpora, reg.ex. interface (DFK and Korpus90/2000)
- Danish treebank (Arboretum)
- English text and annotated corpora (Europarl, BNC)
- English corpora, reg.ex. interface (BNC)
- Esperanto text and annotated corpora (TTT, Monato, Eventoj, literaturo)
- Estonian treebank (Arborest)
- French text and annotated corpora (ECI, Europarl)
- French treebank (L'Arboratoire)
- German text and annotated corpora (MAK, BZK, Europarl)
- German corpora, reg.ex. interface (MAK and BZK)
- Portuguese text and annotated corpora (Europarl)
- Portuguese text and annotated corpora, reg.ex. interface (Newspaper, historical and spoken language data)
- Portuguese treebank (Floresta Sintá(c)tica), also at the Linguateca site, with documentation and the Águia-interface
- Spanish text and annotated corpora (ECI, Europarl, Camtie)
- Spanish corpora, reg.ex. interface (Camtie)
The Danish corpora, most of the Portuguese and Esperanto corpora, and the Europarl corpora for all languages can be accessed without a password. Access to the other corpora is currently restricted by password to people and projects affiliated with the Institute of Language and Communication at SDU - Odense University.
The VISL project leader, Eckhard Bick, has developed search engines for these corpora that recognize regular expressions and return search results as concordances, with search hits highlighted in boldface. For those unfamiliar with regular expressions or VISL's grammatical annotation system, the Corpus Search pages offer brief on-site user manuals, while in-depth definitions and examples of grammatical categories and tags are provided in the info folders of the relevant language section at the main VISL site. Further information on regular expressions can be found in A Gentle Introduction to Regular Expressions (PDF format) by VISL project members John Dienhart and Henrik Kasch.
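For readers new to the idea, the following is a minimal Python sketch of how a regular-expression search can produce concordance lines with the hit marked in bold. It is an illustration only and not the VISL search engine: the function name `concordance`, the `width` parameter, the use of `<b>` tags for boldface, and the sample sentences are all assumptions made for this example.

```python
import re


def concordance(lines, pattern, width=30):
    """Return one concordance line per regex hit, with the hit wrapped in <b>...</b>.

    `width` is the maximum number of context characters kept on each side.
    (Hypothetical helper for illustration; not part of any VISL tool.)
    """
    regex = re.compile(pattern)
    results = []
    for line in lines:
        for match in regex.finditer(line):
            left = line[max(0, match.start() - width):match.start()]
            right = line[match.end():match.end() + width]
            results.append(f"{left}<b>{match.group(0)}</b>{right}")
    return results


# Made-up sample text; a real corpus would be read from files.
sample = [
    "The cat sat on the mat.",
    "A dog chased the cats away.",
]

# \bcats?\b matches "cat" or "cats" as whole words.
for row in concordance(sample, r"\bcats?\b"):
    print(row)
```

Run on the two sample sentences, the sketch prints one line per hit, with "cat" and "cats" wrapped in bold tags and up to 30 characters of context on either side.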
On the corpus overview page, rectangular flag links indicate the older interfaces based on regular expressions (reg.ex.), while round flag buttons indicate the newer menu-based cqp-interfaces, which have been developed with "non-computational" users in mind. Tree flags indicate treebank corpora, which allow structured constituent searches.
Information about a wide range of additional corpora and on-line search engines can be found by visiting the corpus index developed by Jens Ahlmann Hansen.