Parsing Nordic Languages (PaNoLa)
PaNoLa will be devoted to Internet development and applications involving the automatic analysis of Danish, Finnish, Norwegian, and Swedish based on the Constraint Grammar (CG) formalism, with a view to contributing to distance education regarding the nature, structure, and use of these four Nordic languages.
Though the present application is geared to a time span of only two years, this will be sufficient, we believe, to put the project on a very solid footing and at the same time produce important and highly visible results making new information about Nordic languages and language structure freely available, via the Internet, to the international community. This optimistic view is due to the fact that the project will build upon existing systems in two important respects:
VISL, which stands for "Visual Interactive Syntax Learning", has received financial support from various Danish government institutions since 1996. During that time, VISL has developed a wide range of teaching, learning, and research tools which are freely available to the world community over the Internet (URL: edu.visl.dk).
The goal of PaNoLa is to enhance the Nordic element within the VISL system. Currently, Danish is the only Nordic language among the fifteen VISL languages, which are Arabic, Bosnian, Danish, Dutch, English, Esperanto, French, German, Greek, Italian, Japanese, Latin, Portuguese, Russian, and Spanish.
To this end it is important to link up with research communities in Norway, Sweden, and Finland which also work within the Constraint Grammar paradigm. This can be done by combining the efforts of the four scholars named in this application: Eckhard Bick (University of Southern Denmark), Janne Bondi Johannessen (University of Oslo), Fred Karlsson (University of Helsinki), and Torbjörn Lager (Uppsala University).
Through the joint efforts of these four scholars, existing CG-systems (some portions of which are available from the Finnish firm Lingsoft) can be enhanced and co-ordinated to yield a powerful unified electronic education and research network for these four Nordic languages.
The importance of developing computational tools for language learning and language processing for Nordic languages is mirrored in a joint decision by the Council of Europe, the European Union and UNESCO to declare 2001 the European Year of Languages. The Norwegian coordinator Arne Aarseth explains the goal this way: "The purpose of the Year of Languages is to raise awareness of, and increase insight into, Europe's linguistic diversity. Through a range of measures, the aim is to motivate all European inhabitants to learn languages, preferably with an emphasis on the so-called least used ones. One of these means is greater emphasis on lifelong language learning."
About the participants and the participating institutions
As can be seen from the descriptions below, all four scholars participating in PaNoLa are well-versed in the CG formalism and have already contributed to the development of CG-systems for their respective languages.
The project leader, Dr. Eckhard Bick, is Senior Researcher in the Institute of Language and Communication at the University of Southern Denmark. He has a cand.med. degree from the University of Bonn (1984), a cand.mag. degree in Nordic languages/literature and Portuguese from Aarhus University, Denmark (1993) and a dr.phil. in Linguistics from the same university (2000). His dissertation is entitled The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Since 1996 he has been head of the VISL project at the University of Southern Denmark. Over the past five years he has worked on enhancing his Portuguese CG, written preliminary Spanish and French CGs, upgraded the English CG which the VISL project purchased from Lingsoft, and for the past year has been engaged in writing a CG-system for Danish.
The chairman of the Institute of Language and Communication has pledged his support to PaNoLa. In particular, he has agreed to make available the necessary working space and computer facilities for the project. In addition, the Dean of Humanities has agreed to contribute DKK 25,000 to the upgrading of the server park to help support PaNoLa activities.
Dr. Torbjörn Lager is a member of the Department of Linguistics at Uppsala University. His research in tagging and parsing of Swedish is related to his interest in rule-based taggers in general, an interest and expertise which subsumes Constraint Grammar tagging. He has considerable experience in the design and implementation of software tools for learning and revising rule-based taggers based on input from tagged corpora. This has resulted in a system for generalized Transformation-Based Learning, the MUTBL system, in which rule-based taggers of different kinds can be learned, among them CG-style taggers. A description of the system can be found at the following address: http://stp.ling.uu.se/~torbjorn/mutbl.html. His goals within PaNoLa will be to further refine the MUTBL system, to build an interactive development environment for Constraint Grammars on top of it, and to use this environment to support the development of a sizable Constraint Grammar for Swedish, parts of which already exist.
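To make the idea of a CG-style rule-based tagger concrete, the following minimal Python sketch may help. It is only an illustration and not the MUTBL system or any of the project's grammars: the Token class, the tag names, the single hand-written rule, and the two-word Danish example are all assumptions made for this sketch. It shows the general principle that each word starts with several candidate readings and that contextual rules discard readings, never removing the last one.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Token:
    form: str
    readings: Set[str]  # candidate tags still considered possible


# A rule looks at position i in the sentence and returns the reading to
# discard there, or None if its context condition does not hold.
Rule = Callable[[List[Token], int], Optional[str]]


def remove_verb_after_determiner(sentence: List[Token], i: int) -> Optional[str]:
    """Hypothetical rule: drop a VERB reading right after an unambiguous DET."""
    if i > 0 and sentence[i - 1].readings == {"DET"} and "VERB" in sentence[i].readings:
        return "VERB"
    return None


def disambiguate(sentence: List[Token], rules: List[Rule]) -> List[Token]:
    """Apply the rules until nothing changes; a token always keeps one reading."""
    changed = True
    while changed:
        changed = False
        for i, token in enumerate(sentence):
            for rule in rules:
                tag = rule(sentence, i)
                if tag in token.readings and len(token.readings) > 1:
                    token.readings.discard(tag)
                    changed = True
    return sentence


# Toy input: "løber" can be a noun ("runner") or a verb ("runs").
sentence = [Token("en", {"DET"}), Token("løber", {"NOUN", "VERB"})]
disambiguate(sentence, [remove_verb_after_determiner])
print([(t.form, sorted(t.readings)) for t in sentence])
# prints [('en', ['DET']), ('løber', ['NOUN'])]
```

In a learning setting of the kind described above, rules like this one would be induced from a tagged corpus rather than written by hand; the sketch only covers how such rules are applied.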
Professor Janne Bondi Johannessen is manager of the Text Laboratory at the Department of Linguistics, University of Oslo. She has been in charge of a project for developing a Constraint Grammar tagger for Norwegian (Bokmål and Nynorsk) at the Text Lab. This tagger has achieved very good results, and has been used to tag the Oslo Corpus, which has become very popular among scholars and students of the Norwegian language around the world. The Bokmål corpus contains roughly 18,500,000 words and the Nynorsk corpus about 3,800,000 words (see http://www.tekstlab.uio.no/norsk/bokmaal). The staff at the Text Lab, not least Kristin Hagen, who has been instrumental in developing the Norwegian Constraint Grammar tagger, have solid experience and expertise in CG, and welcome the opportunity PaNoLa will provide for improving the tagger and modifying it in ways which will considerably increase both its range of uses and number of users.
The Linguistics Department at the University of Oslo is willing to support PaNoLa by providing working space and computer facilities. In addition, the Text Laboratory can provide an assistant to the project, worth NOK 30,000. It should be mentioned that the goals of PaNoLa align closely with the goals of a newly started project in the Linguistics Department. This project aims at making the introductory linguistics course (which is obligatory for all language students) available for distance learning and teaching. Indeed, the tools that will be developed by PaNoLa will be directly usable by the other project.
Fred Karlsson is Professor of General Linguistics at the University of Helsinki and Dean of the Faculty of Humanities. He is the inventor of the Constraint Grammar Formalism, which he defined in 1990. He programmed the first full-scale Constraint Grammar engine. Fred Karlsson has designed a morphological analyzer for Swedish (SWETWOL) and a Constraint Grammar for Finnish. He was Director of the Research Unit for Computational Linguistics (1985-1994) and the Research Unit for Multilingual Language Technology (1995-1999), both at the University of Helsinki.
The Department of General Linguistics at the University of Helsinki and Lingsoft, Inc. will support the project with a sum amounting to FIM 20,000.
Distance learning tools
From the point of view of application programs, PaNoLa will tap into and enhance existing VISL technology, which already provides a wide range of tools and services for language learning, language teaching and language research. These tools and services are all freely available to the world community via the Internet. The basis for these services is an ongoing implementation and enhancement of a wide range of language modules at the morphological, syntactic and semantic levels supported by the development and maintenance of the necessary lexicographic databases for each language. The interaction of these modules has already resulted in a number of concrete applications, primary among which are:
Current educational uses
The educational materials and software currently available at the VISL website for fifteen languages are already in worldwide use. Until recently, the users were primarily university teachers and students, and members of adult education classes. However, the tools are increasingly being modified to appeal to younger users, with the result that both secondary and primary school teachers have begun to take an interest in the system and to introduce the tools in their classrooms.
For example, an IT-subcommittee under the administrative board for the island of Funen in Denmark contracted last autumn for members of the VISL group to provide a 30-hour course for secondary school teachers of English to upgrade their competence in English sentence analysis using the VISL tools. On an even larger scale, representatives from the Danish Ministry of Education arranged with the VISL group to hold a day-long seminar in Odense on January 11, 2001, at which 21 secondary school teachers from all over Denmark were introduced to VISL's educational software. The group of teachers represented eight different languages currently being taught in Danish secondary schools: Danish, English, French, German, Greek, Latin, Russian, and Spanish. As a result of this seminar, interest in introducing these tools into Danish classrooms at the secondary school level is growing rapidly.
The most recent link between VISL and Danish educational institutions was established on July 1, 2001, when a subcommittee under the Danish Ministry of Education provided financial support over a two-year period to help introduce the Danish and English VISL systems to students and staff in Denmark's 54 business schools at the secondary school level (HHX). If PaNoLa is implemented, a strong Nordic language component, integrated into the VISL system, will be made available to users at all education levels both inside and outside Scandinavia.
The following timetable provides an estimate of the individual milestones and deadlines that make up the project as a whole:
It is important to stress that PaNoLa is not a project which will start from scratch, nor is it a project that will end when NorFA funding ends. Since CG-taggers already exist, in varying degrees of readiness, for each of the four Nordic languages, and since the VISL educational infrastructure is already in place on the Internet, the integration of CG-taggers for Nordic languages within the VISL framework will be a vital and ongoing process. NorFA funding will initiate the production of new educational software for Danish, Finnish, Norwegian and Swedish, making it possible to have a rich and varied selection of new Nordic language materials freely available via the Internet by January 2003, not only to Scandinavian users of all ages but to the world community at large.
Budget (January 1, 2002 to December 31, 2003)
Information about the applicants
Eckhard Bick (lead applicant):
Institute of Language and Communication
University of Southern Denmark Odense Campus
5230 Odense M
phone: (45) 86 28 35 24
fax: 86 28 13 97
Fred Karlsson
Department of General Linguistics
University of Helsinki
P.O. Box 4
phone: 358 9 19 12 35 12
Janne Bondi Johannessen
The Text Laboratory
Department of Linguistics
University of Oslo
P.O. Box 1102 Blindern
phone: (47) 22 85 68 14
Torbjörn Lager
Department of Linguistics
Uppsala University
S-751 20 Uppsala
phone: (46) 18 471 7860
Some relevant publications and presentations
Bick, Eckhard. 1996. "Automatic parsing of Portuguese". Proceedings of the Second Workshop on Computational Processing of Written Portuguese. Curitiba, Brazil.
Bick, Eckhard. 1997a. "Dependensstrukturer i Constraint Grammar syntaks for Portugisisk". In: Brøndsted, Tom and Inger Lytje (eds.), Sprog og multimedier. Aalborg Universitetsforlag, Denmark, pp. 39-57.
Bick, Eckhard. 1997b. "Automatisk analyse af portugisisk skriftsprog". In: Jensen, Per Anker, Stig W. Jørgensen and Annette Hørning (eds.), Danske ph.d.-projekter i datalingvistik, formel lingvistik og sprogteknologi. Kolding, Denmark, pp. 22-30.
Bick, Eckhard. 1997c. "Internet-based grammar teaching". In: Christoffersen, Ellen and Bradley Music (eds.), Datalingvistisk Forenings Årsmøde 1997 i Kolding. Proceedings, pp. 86-106.
Bick, Eckhard. 1998. "Structural lexical heuristics in the automatic analysis of Portuguese". In: Maegaard, Bente (ed.), Proceedings of the 11th Nordic Conference on Computational Linguistics (NODALIDA-98). Copenhagen, January 28-29, pp. 44-56.
Bick, Eckhard and Diana Santos. 2000. "Providing Internet access to Portuguese corpora: the AC/DC project". In: Maria Gavrilidou et al. (eds.), Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000). Athens, 31 May - 2 June 2000, pp. 205-210.
Bick, Eckhard. 2000. The parsing system "Palavras": automatic grammatical analysis of Portuguese in a constraint grammar framework. Aarhus University Press, Denmark.
Johannessen, Janne Bondi and Kristin Hagen. 1998. "Disambiguering uten syntaks". In: Faarlund, J. T., B. Mæhlum and T. Nordgård (eds.), MONS 7, pp. 68-79. Novus forlag, Oslo.
Johannessen, Janne Bondi. 1998. "Tagging and the case of pronouns". Computers and the Humanities 32, pp. 1-38.
Johannessen, Janne Bondi and Helge Hauglin. 1998. "An automatic analysis of compounds". In: T. Haukioja (ed.), Papers from the 16th Scandinavian Conference of Linguistics, Turku, Finland (1996), pp. 209-220.
Johannessen, Janne Bondi. 1998. Coordination. Oxford University Press, New York, Oxford.
Johannessen, Janne Bondi, Kristin Hagen and Anders Nøklestad. 2000. "A Constraint-based tagger for Norwegian". In: Lindberg, Carl-Erik and Steffen Nordahl Lund (eds.), 17th Scandinavian Conference of Linguistics, Odense Working Papers in Language and Communication 19, University of Southern Denmark, Odense, Denmark (1998), Vol. 1, pp. 31-47.
Johannessen, Janne Bondi, Kristin Hagen and Anders Nøklestad. 2000. "A web-based advanced and user friendly system: the Oslo corpus of tagged Norwegian texts". In: Gavrilidou, M., G. Carayannis, S. Markantonatou, S. Piperidis and G. Stainhaouer (eds.), Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), Athens, pp. 1725-1729.
Johannessen, Janne Bondi. 2001. "Sammensatte ord". Norsk Lingvistisk Tidsskrift, pp. 59-92.
Karlsson, Fred. 1994. "Robust parsing of unconstrained text". In: P. de Haan and N. Oostdijk (eds.), Corpus-based research into language. In honour of Jan Aarts. Rodopi, Amsterdam and Atlanta, pp. 121-142.
Karlsson, Fred, Atro Voutilainen, Juha Heikkilä, and Arto Anttila (eds.). 1995a. Constraint Grammar: a language-independent system for parsing unrestricted text. Mouton de Gruyter, Berlin and New York.
Karlsson, Fred. 1995b. "Designing a parser for unrestricted text". In Karlsson et al. (eds.) 1995a, pp. 1-40.
Karlsson, Fred. 1995c. "The formalism and environment of Constraint Grammar Parsing". In Karlsson et al. (eds.) 1995a, pp. 41-88.
Karlsson, Fred and Lauri Karttunen. 1997. "Sub-sentential processing". In: G. B. Varile and A. Zampolli (eds.), Survey of the state of the art in human language technology. Cambridge University Press.
Karlsson, Fred. 2000. Finnish: an essential grammar. 2nd ed. (1st ed. 1999). Routledge, London and New York.
Karlsson, Fred, Even Hovdhaugen, Carol Henriksen and Bengt Sigurd. 2000. The history of linguistics in the Nordic countries. Societas Scientiarum Fennica. Helsinki.
Lager, Torbjörn. 1995. A logical approach to computational corpus linguistics. Doctoral dissertation, University of Göteborg: Department of Linguistics.
Lager, Torbjörn. 1998. "Logic for part of speech tagging and shallow parsing". In: Proceedings of the 11th Nordic Conference on Computational Linguistics (NODALIDA-98), Copenhagen, January 28-29, 1998.
Lager, Torbjörn. 1999. "The µ-TBL system: logic programming tools for Transformation-Based Learning". In: Proceedings of the Third International Workshop on Computational Natural Language Learning (CoNLL-99), Bergen, June 12, 1999.
Lager, Torbjörn. 1999. "µ-TBL lite: a small, extensible Transformation-Based Learner". In: Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL-99), Bergen, June 8-12, 1999.
Lager, Torbjörn and Natalia Zinovjeva. 1999. "Training a dialogue act tagger with the µ-TBL system". In: Proceedings of the Third Swedish Symposium on Multimodal Communication, Linköping University Natural Language Processing Laboratory (NLPLAB), Linköping, October 16-17, 1999.
Lager, Torbjörn. 2000. "A logic programming approach to word expert engineering". In: Proceedings of ACIDCA 2000: Workshop on Corpora and Natural Language Processing. Monastir, Tunisia, March 22-24, 2000, pp. 182-189.
Lager, Torbjörn and Joakim Nivre. 2001. "Part of speech tagging from a logical point of view". In: P. de Groote, G. Morrill and C. Retoré (eds.), Logical Aspects of Computational Linguistics. Lecture Notes in Artificial Intelligence. Springer-Verlag, Berlin-Heidelberg-New York.