
Corpus search interface
VISL's grammatical and NLP research are both largely corpus-based. On the one hand, VISL develops taggers, parsers and computational lexica from corpus data; on the other hand, these tools, once functional, are used for the grammatical annotation of large running-text corpora. The main methodological approach for automatic corpus annotation is Constraint Grammar (CG), a word-based annotation method.
Hybrid systems, which combine function-based phrase structure grammar with dependency grammar, are used to create syntactic trees from CG output. VISL is involved in many aspects of corpus linguistics: corpus compilation, automatic corpus annotation, manual linguistic revision of corpora, internet access for corpus searches, and language-specific, corpus-based linguistic research.
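To give a rough idea of what "word-based" annotation means in a Constraint Grammar setting, the following is a minimal Python sketch, not VISL's actual grammar or software. It assumes each word carries a set of candidate readings and that a contextual rule may discard a reading, but never the last one. The class `Cohort`, the tags `DET`, `N`, `V`, and the rule itself are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    """One word form with its set of candidate readings (toy tags, assumed)."""
    word: str
    readings: set


def remove_verb_after_determiner(cohorts):
    """Toy contextual rule: discard a V reading directly after an unambiguous DET.

    The last remaining reading of a word is never removed.
    """
    for prev, cur in zip(cohorts, cohorts[1:]):
        if prev.readings == {"DET"} and "V" in cur.readings and len(cur.readings) > 1:
            cur.readings.discard("V")
    return cohorts


# "walk" is ambiguous between noun and verb before the rule applies.
sentence = [
    Cohort("the", {"DET"}),
    Cohort("walk", {"N", "V"}),
    Cohort("ended", {"V"}),
]
remove_verb_after_determiner(sentence)
print([(c.word, sorted(c.readings)) for c in sentence])
```

Real CG grammars contain thousands of such context-dependent rules and work with much richer tag sets; the sketch only shows the principle of disambiguating word by word.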
The following is an overview of corpus annotation projects in VISL's various research languages:
Language | Corpus | Type | Size (words) | Grammatical annotation | Manual revision | Partners/Projects
Danish | Corpus90/2000 | News text, prose | 2 x 26 Million | PoS, morphology, syntax, CG-dep. | 200.000 words | DSL
Danish | Arboretum | News text, prose | 10 Million | Treebank (TIGER-compatible) | 200.000 words | Nordic Treebank Network
Danish | dfk-Skalk | Journal of Archeology | 600.000 | PoS, morphology, syntax, CG-dep. | - | Skalk
Danish | dfk-folketing | Parliamentary debates | 7 Million | PoS, morphology, syntax, CG-dep. | - | Source: Folketing
Portuguese | Floresta Sintá(c)tica | Newspaper | 1 Million | Treebank (TIGER-compatible) | 65.000+ words | Linguateca
Portuguese | CETEMPúblico | Portuguese newspaper | 192 Million | PoS, morphology, syntax, CG-dep. | cp. Floresta Sintá(c)tica | AC/DC project, Linguateca, Ref.: Público
Portuguese | CETENFolha | Brazilian newspaper | 24 Million | PoS, morphology, syntax, CG-dep. | cp. Floresta Sintá(c)tica | AC/DC project, Linguateca, Ref.: Folha de São Paulo
Portuguese | Europarl-pt | Parliamentary debates | 29 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: P. Koehn
Portuguese | Cartas-LR | Historical letters to/by the editor | 200.000 | PoS, morphology, syntax, treebank | 10.000 words | Ref.: Projeto para a História do Português Brasileiro
Portuguese | Various | Dialectal speech data, Historical Portuguese | 70.000 | PoS, morphology, syntax, CG-dep. | - | (1) The CORDIAL-SIN project (2) The Tycho Brahe Project
French | Arboratoire/Freebank | News text, prose | 130.000 | PoS, morphology, syntax, CG-dep. | 30.000 words | ATILF
French | ECI-FR1 | Newspaper | 4.4 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: Le Monde, ECI/EACL
French | Europarl-fr | Parliamentary debates | 29 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: P. Koehn
German | ECI-DE1 | Newspaper (Frankfurter Rundschau) | 34 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: Frankfurter Rundschau, ECI/EACL
German | BZK-tag | Newspaper | 4 Million | PoS, morphology, syntax, CG-dep. | - | Bonner Zeitungskorpus
German | MAK-tag | Newspaper | 3 Million | PoS, morphology, syntax, CG-dep. | - | Mannheimer Korpus
German | Europarl-de | Parliamentary debates | 29 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: P. Koehn
English | BNC-tag | News text, prose | 35 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: British National Corpus
English | Europarl-en | Parliamentary debates | 29 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: P. Koehn
English | KEMPE | Early modern play texts | 8.9 Million | PoS, morphology, syntax, CG-dep. | - | University of Bristol
Spanish | ECI-ES2 | Newspaper | 1.4 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: El Diario Sur, ECI/EACL
Spanish | Europarl-es | Parliamentary debates | 29 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: P. Koehn
Esperanto | Monato | News magazine | 2 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: Monato
Esperanto | Eventoj | Electronic newsletter | 1.6 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: Eventoj
Esperanto | Elibrejo | Literature | 7 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: eLibrejo
Esperanto | Zamenhof | Esperanto Classics | 1.5 Million | PoS, morphology, syntax, CG-dep. | - | -
Esperanto | TTT | Internet | 3.6 Million | PoS, morphology, syntax, CG-dep. | - | -
Estonian | Arborest | News text, prose | 3.500 | Treebank (TIGER-compatible) | at CG-level | Nordic Treebank Network, Ref.: CG Annotated corpus of Estonian