Corpus search interface
VISL's grammatical and NLP research is largely corpus-based. On the one hand, VISL develops taggers, parsers and computational lexica from corpus data; on the other hand, these tools, once functional, are used for the grammatical annotation of large running-text corpora, often with or for external partners (project list 1999-2009). The main methodological approach for automatic corpus annotation is Constraint Grammar (CG), a word-based annotation method. Hybrid systems, making use of both function-based phrase structure and dependency grammar, are used to create syntactic treebanks from CG output. VISL is involved in many aspects of corpus linguistics:
- Corpus compilation
- Automatic corpus annotation
- Manual linguistic corpus revision
- Providing internet access for searching corpora (CorpusEye)
- Language-specific corpus-based linguistic research
The following is an overview of ongoing and concluded corpus annotation projects in VISL's research languages, with overall corpus size given in millions of words: Danish (160M), English (334M), Esperanto (19M), Estonian (<1M), French (71M), German (99M), Italian (19M), Norwegian (31M), Portuguese (257M), Romanian (21M), Spanish (53M), Swedish (85M). Below the tables, a number of relevant publications are listed and linked for download.
| Language | Corpus | Type | Size (words) | Grammatical annotation | Manual revision | Partners/Projects |
|---|---|---|---|---|---|---|
| Danish | Corpus90/2000 | News text, prose | 2 x 26 Million | PoS, morphology, syntax, CG-dep. | 400,000 words | DSL |
| Danish | Arboretum | News text, prose | 10 Million | Treebank (dep. & psg, TIGER-compatible) | 400,000 words | Nordic Treebank Network |
| Danish | Information | News text | 80 Million | PoS, morphology, syntax, CG-dep. | - | Dagbladet Information |
| Danish | Europarl-da | Parliamentary debates | 21 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: P. Koehn |
| Danish | Wikipedia-da | Encyclopedia | 3.7 Million | PoS, morphology, syntax, CG-dep. | - | Source: Wikipedia, The Free Encyclopedia (v. 12/2005) |
| Danish | dfk-Skalk | Journal of Archeology | 600,000 | PoS, morphology, syntax, CG-dep. | - | Skalk |
| Danish | dfk-folketing | Parliamentary debates | 7 Million | PoS, morphology, syntax, CG-dep. | - | Source: Folketing |
| Portuguese | Floresta Sintá(c)tica | Newspaper | 1 Million | Treebank (TIGER-compatible) | 185,000+ words | Linguateca |
| Portuguese | CETEMPúblico | Portuguese newspaper | 192 Million | PoS, morphology, syntax, CG-dep. | cp. Floresta sintá(c)tica | AC/DC project, Linguateca, Ref.: Público |
| Portuguese | CETENFolha | Brazilian newspaper | 24 Million | PoS, morphology, syntax, CG-dep. | cp. Floresta sintá(c)tica | AC/DC project, Linguateca, Ref.: Folha de São Paulo |
| Portuguese | Europarl-pt | Parliamentary debates | 29 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: P. Koehn |
| Portuguese | Wikipedia-pt | Encyclopedia | 11.3 Million | PoS, morphology, syntax, CG-dep. | - | Source: Wikipedia, The Free Encyclopedia (v. 12/2005) |
| Portuguese | Cartas-LR | Historical letters to/by the editor | 200,000 words | PoS, morphology, syntax, treebank | 10,000 words | Ref.: Projeto para a História do Português Brasileiro |
| Portuguese | Various | Dialectal speech data, Historical Portuguese | 70,000 | PoS, morphology, syntax, CG-dep. | - | (1) The CORDIAL-SIN project (2) The Tycho Brahe Project |

| Language | Corpus | Type | Size (words) | Grammatical annotation | Manual revision | Partners/Projects |
|---|---|---|---|---|---|---|
| French | Arboratoire/Freebank | News text, prose | 130,000 | PoS, morphology, syntax, CG-dep. | 30,000 words | ATILF |
| French | ECI-FR1 | Newspaper | 4.4 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: Le Monde, ECI/EACL |
| French | Europarl-fr | Parliamentary debates | 29 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: P. Koehn |
| French | Wikipedia-fr | Encyclopedia | 37.8 Million | PoS, morphology, syntax, CG-dep. | - | Source: Wikipedia, The Free Encyclopedia (v. 12/2005) |
| German | ECI-DE1 | Newspaper (Frankfurter Rundschau) | 34 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: Frankfurter Rundschau, ECI/EACL |
| German | BZK-tag | Newspaper | 4 Million | PoS, morphology, syntax, CG-dep. | - | Bonner Zeitungskorpus |
| German | MAK-tag | Newspaper | 3 Million | PoS, morphology, syntax, CG-dep. | - | Mannheimer Korpus |
| German | Europarl-de | Parliamentary debates | 29 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: P. Koehn |
| German | Wikipedia-de | Encyclopedia | 28.7 Million | PoS, morphology, syntax, CG-dep. | - | Source: Wikipedia, The Free Encyclopedia (v. 12/2005) |
| English | BNC-tag | News text, prose | 35 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: British National Corpus |
| English | Europarl-en | Parliamentary debates | 29 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: P. Koehn |
| English | Wikipedia-en | Encyclopedia | 115.1 Million | PoS, morphology, syntax, CG-dep. | - | Source: Wikipedia, The Free Encyclopedia (v. 12/2005) |
| English | KEMPE | Early modern play texts | 8.9 Million | PoS, morphology, syntax, CG-dep. | - | Lene Petersen, University of the West of England |
| English | Chat corpus | Chat logs 2002-2004 | 23.5 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: Project JJ |
| English | Enron corpus | E-mails | 83 Million | PoS, morphology, syntax, CG-dep. | - | History & credits |

| Language | Corpus | Type | Size (words) | Grammatical annotation | Manual revision | Partners/Projects |
|---|---|---|---|---|---|---|
| Swedish | Göteborgsposten | Newspaper (1992-2003) | 1.4 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: Göteborgsposten |
| Swedish | Europarl-sv | Parliamentary debates | 29 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: P. Koehn |
| Swedish | Leipzig-sv | Internet Corpus | 2.0 Million | PoS, morphology, syntax, CG-dep. | - | Source: Leipzig Corpus Collection |
| Norwegian | Wikipedia-no | Wikipedia | 26 Million | PoS, morphology, syntax, CG-dep. | - | Source: Wikipedia, The Free Encyclopedia (v. 12/2005) |
| Norwegian | Leipzig-no | Internet Corpus | 4.65 Million | PoS, morphology, syntax, CG-dep. | - | Source: Leipzig Corpus Collection |
| Spanish | ECI-ES2 | Newspaper | 1.4 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: El Diario Sur, ECI/EACL |
| Spanish | Europarl-es | Parliamentary debates | 29 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: P. Koehn |
| Spanish | Wikipedia-es | Encyclopedia | 22.3 Million | PoS, morphology, syntax, CG-dep. | - | Source: Wikipedia, The Free Encyclopedia (v. 12/2005) |
| Esperanto | Monato | News magazine | 2 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: Monato |
| Esperanto | Eventoj | Electronic newsletter | 1.6 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: Eventoj |
| Esperanto | Wikipedia-eo | Encyclopedia | 3.2 Million | PoS, morphology, syntax, CG-dep. | - | Source: Wikipedia, The Free Encyclopedia (v. 12/2005) |
| Esperanto | Elibrejo | Literature | 7 Million | PoS, morphology, syntax, CG-dep. | - | Ref.: eLibrejo |
| Esperanto | Zamenhof | Esperanto Classics | 1.5 Million | PoS, morphology, syntax, CG-dep. | - | - |
| Esperanto | TTT | Internet | 3.6 Million | PoS, morphology, syntax, CG-dep. | - | - |
| Italian | Wikipedia-it | Encyclopedia | 18.9 Million | PoS, morphology, syntax, CG-dep. | - | Source: Wikipedia, The Free Encyclopedia (v. 12/2005) |
| Romanian | Adevarul | Business news (1998-2005) | 18.9 Million | PoS, morphology, syntax | - | Source: Adevarul Economic |
| Estonian | Arborest | News text, prose | 3,500 | Treebank (TIGER-compatible) | at CG-level | Nordic Treebank Network, Ref.: CG Annotated corpus of Estonian |

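
As a quick cross-check of the per-language totals quoted in the overview paragraph, the figures can be tallied with a few lines of Python. This is only an illustrative sketch: the dictionary and function names are made up for this example, the numbers are copied from the overview sentence, and Estonian is left out of the sum because it is given only as "<1M".

```python
# Overall corpus sizes in million words, copied from the overview paragraph.
# Estonian is listed only as "<1M", so it is deliberately excluded here.
SIZES_MILLION_WORDS = {
    "Danish": 160,
    "English": 334,
    "Esperanto": 19,
    "French": 71,
    "German": 99,
    "Italian": 19,
    "Norwegian": 31,
    "Portuguese": 257,
    "Romanian": 21,
    "Spanish": 53,
    "Swedish": 85,
}


def total_million_words(sizes):
    """Return the sum of all per-language sizes (in million words)."""
    return sum(sizes.values())


def ranked(sizes):
    """Return (language, size) pairs sorted from largest to smallest."""
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for language, size in ranked(SIZES_MILLION_WORDS):
        print(f"{language:<11}{size:>5}M")
    print(f"{'Total':<11}{total_million_words(SIZES_MILLION_WORDS):>5}M")
```

With the figures as quoted, the annotated material adds up to roughly 1.15 billion words, with English and Portuguese as the two largest languages.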
Some relevant publications on the VISL corpora, the Constraint Grammar and Treebank annotation schemes and parsers, the CorpusEye search interface etc.:
- Bick, Eckhard (2005), Gramática Constritiva na Análise Automática de Sintaxe Portuguesa. In: Berber Sardinha, Tony (ed.), A Língua Portuguesa no Computador [The Portuguese Language on the Computer]. Campinas: Mercado de Letras, São Paulo: FAPESP. ISBN: 85-7591-044-2
- Bick, Eckhard (2005), CorpusEye: Et brugervenligt web-interface for grammatisk opmærkede korpora, In: Peter Widell & Mette Kunøe (eds.), 10. Møde om Udforskningen af Dansk Sprog 7.-8.okt.2004, Proceedings. pp. 46-57, Århus University
- Bick, Eckhard (2005), Live use of Corpus data and Corpus annotation tools in CALL: Some new developments in VISL, In: Henrik Holmboe (ed.), Nordic Language Technology, Årbog for Nordisk Sprogteknologisk Forskningsprogram 2000-2004 (Yearbook 2004). pp. 171-186. Copenhagen: Museum Tusculanum
- Bick, Eckhard & Uibo, Heli & Müürisep, Kaili (2005), Arborest - a VISL-Style Treebank Derived from an Estonian Constraint Grammar Corpus, In: Kübler, Sandra et al. (eds.), Proceedings of TLT 2004 (3rd Workshop on Treebanks and Linguistic Theory, Tübingen, December 10th - 11th, 2004), pp. 1-14
- Bick, Eckhard & Uibo, Heli & Müürisep, Kaili (2005), Arborest - a Growing Treebank of Estonian, In: Henrik Holmboe (ed.), Nordic Language Technology, Årbog for Nordisk Sprogteknologisk Forskningsprogram 2000-2004 (Yearbook 2004). pp. 125-142. Copenhagen: Museum Tusculanum
- Santos, Diana et al. (2004), Linguateca: um centro de recursos distribuído para o processamento computacional da língua portuguesa. IX Congreso Iberoamericano de Inteligencia Artificial (IBERAMIA 2004), Tonantzintla, México, Nov. 23rd 2004.
- Bick, Eckhard (2004), Parsing and evaluating the French Europarl corpus, In: Patrick Paroubek, Isabelle Robba & Anne Vilnat (eds.): Méthodes et outils pour l'évaluation des analyseurs syntaxiques (Journée ATALA, May 15, 2004). pp. 4-9. Paris: ATALA.
- Salmon-Alt, Susanne & Bick, Eckhard & Romary, Laurent & Pierrel, Jean-Marie (2004), La FReeBank: Vers une base libre de corpus annotés, In: Bernard Bel & Isabelle Marlien (eds), 11th Conference on Natural Language Processing (TALN 2004, April 19-22, 2004). pp. 419-428. Fès: ATALA.
- Bick, Eckhard (2003), Arboretum, a Hybrid Treebank for Danish, in: Joakim Nivre & Erhard Hinrichs (eds.), Proceedings of TLT 2003 (2nd Workshop on Treebanks and Linguistic Theory, Växjö, November 14-15, 2003), pp. 9-20. Växjö University Press
- Bick, Eckhard (2003), Morfosyntaktisk opmærkede corpora for Dansk: Korpus90/2000 og Arboretum, in Peter Widell & Mette Kunøe (eds.): 9. Møde om Udforskningen af Dansk Sprog 10.-11.okt.2002, Proceedings, pp.43-54, Århus University
- Bick, Eckhard (2003), A CG & PSG Hybrid Approach to Automatic Corpus Annotation, in Kiril Simov & Petya Osenova (eds.): Proceedings of SProLaC2003 (at Corpus Linguistics 2003, Lancaster), pp. 1-12
- Afonso, Susana, Eckhard Bick, Renato Haber & Diana Santos (2002), Floresta sintá(c)tica: a treebank for Portuguese, Proceedings of LREC'2002.
- Afonso, Susana, Eckhard Bick, Renato Haber & Diana Santos (2002), Floresta sintá(c)tica: um treebank para o português, Anabela Gonçalves & Clara Nunes Correia (orgs.), Actas do XVII Encontro da Associação Portuguesa de Linguística (Lisboa, 2-4 de Outubro de 2001), Lisboa: APL, 2002, pp.533-45
- Bick, Eckhard (2001), En Constraint Grammar Parser for Dansk, in Peter Widell & Mette Kunøe (eds.) 8. Møde om Udforskningen af Dansk Sprog, 12.-13. oktober 2000, pp. 40-50, Århus University
- Santos, Diana & Bick, Eckhard (2000), Providing Internet access to Portuguese corpora: the AC/DC project, In Gavrilidou et al. (eds.): Proc. 2nd International Conf. on Language Resources and Evaluation, LREC2000 (Athens, 2000), pp. 205-10.
- Bick, Eckhard (2000), The Parsing System Palavras - Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework, Aarhus: Aarhus University Press (preprint version), dr.phil. thesis (cf. the Disputatio for an introduction)
- Bick, Eckhard (1998), Tagging Speech Data - Constraint Grammar Analysis of Spoken Portuguese, in: Proceedings of the 17th Scandinavian Conference of Linguistics, (Odense 1998)