Semantic Prototype Project Description
Please note: The development and implementation of the semantic prototypes are work
in progress.
Project Researchers
Eckhard Bick
Lone Hegelund
Project Period
Feb. 1 - Dec. 31, 2001.
Short Description
The goal of the Semantic Prototype Project was to develop and implement a
semantics- and valency-based systematics for Danish nouns. The semantic tags for
Danish nouns are used within the overall Constraint Grammar (CG) to further
disambiguate the automatic syntactic analyses of running Danish text (open
corpora). The present systematics has been developed and tested on a corpus that
originally comprised 80,000 Danish nouns. Of these 80,000 nouns, approx. 75 %
have one or several tags which can be used within CG. Of the remaining 25 %, a
large number have tentative tags. Several prototype categories have been
checked and further defined.
Semantics and Prototypes
Semantics in general is about defining the meaning of a word: What does a
word mean? However, within the Constraint Grammar (CG) approach applied in the
VISL project, semantics is used to distinguish between meanings and thus
potential readings of a noun in a syntactic context. The semantic tags applied
thus do not uniquely define a given noun. Instead, the tags designate one or
several prototypes for each noun.
A prototype is an (idealized) best instance of a given class of entities.
Prototypes are thus often what one calls "class hyperonyms", e.g. "animal",
"human being" and "plant". In other words, a class hyperonym is the most
general term that can serve as a common reference for the members of its class:
"human being", for instance, covers individuals who can each be characterized as
an architect or a child or an Italian or that annoying guy hanging over the
counter or...
The prototypes used in the Semantic Prototype Project are based on earlier work
done by Eckhard Bick for Portuguese (Bick 2000). At present the systematics
includes approx. 150 prototypes and is under continuous revision. Ultimately the
systematics should be developed to such an extent that it will enhance the
machine translation of Danish by improving the ability of the CG system to
handle semantic disambiguation. This development is already well under way for
Portuguese (cf. this site and Bick 2000).
Constraint Grammar and Prototypes
CG is a grammar of constraint rules which are applied to a text that has been
automatically tagged with all possible morphological and syntactic readings. Out
of this multitude of possible readings, the CG formalism eliminates all
impossible ones. Thus, only one possible reading is left or, in the case of a
truly ambiguous text, several (for more information, see Constraint Grammar).
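The elimination step described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the VISL rule formalism or its actual rules: the tag names (DET, N, V), the English toy sentence and the helper names remove_if, previous_is and disambiguate are all assumptions made for this example. The sketch follows the usual CG convention that a rule never removes the last remaining reading of a word.

```python
def remove_if(tag, condition):
    """Build a rule that removes readings containing `tag` when `condition` holds."""
    def rule(tokens, i):
        if not condition(tokens, i):
            return False
        readings = tokens[i]["readings"]
        kept = [r for r in readings if tag not in r]
        # Never remove the last remaining reading of a token.
        if kept and len(kept) < len(readings):
            tokens[i]["readings"] = kept
            return True
        return False
    return rule


def previous_is(tag):
    """Context condition: the preceding token is unambiguous and carries `tag`."""
    def condition(tokens, i):
        if i == 0:
            return False
        previous = tokens[i - 1]["readings"]
        return len(previous) == 1 and tag in previous[0]
    return condition


def disambiguate(tokens, rules):
    """Apply all rules to all tokens until no rule removes anything more."""
    changed = True
    while changed:
        changed = False
        for i in range(len(tokens)):
            for rule in rules:
                if rule(tokens, i):
                    changed = True
    return tokens


# Toy input (hypothetical): every token lists all readings proposed for it.
tokens = [
    {"form": "the", "readings": [{"DET"}]},
    {"form": "run", "readings": [{"N"}, {"V"}]},
]
# Hypothetical rule: a verb reading is impossible directly after a determiner.
rules = [remove_if("V", previous_is("DET"))]
disambiguate(tokens, rules)
print(tokens[1]["readings"])  # [{'N'}]
```

The rules only ever discard readings, which is why a truly ambiguous word simply keeps more than one reading at the end.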
In earlier work, Bick has shown that it is possible to extend CG to a semantic
level through semantic tagging (Bick 2000). Semantic tags can work as secondary
tags for resolving syntactic ambiguity. However, if the long-term goal of
machine translation is to be achieved, it is important to distinguish between
different readings of polysemic words in order to translate correctly. Thus the
semantic tags become primary tags and are themselves subject to disambiguation
through the use of atomic semantic features. Again, neither the prototypes nor
their atomic semantic features are meant to uniquely define a noun. The CG
formalism allows for a highly context-dependent and thus detailed
disambiguation of semantic prototypes, both as secondary and primary tags.
Valency and Valency Tags
Valency refers to the number and types of bonds which syntactic elements may
form with each other. Valency is typically connected with the verb and its
dependent elements, referred to as e.g. arguments, complements or valents.
Different verbs require a different number of arguments and are thus accordingly
considered monovalent, bivalent or trivalent. "To disappear" is a
monovalent verb, as it requires only a subject in order to form a well-formed
sentence such as "He disappeared". "To buy", on the other
hand, is bivalent, as the verb requires a subject and an object: "She bought
a new computer".
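As a minimal sketch of the verb valency idea above, a lexicon can list the arguments each verb requires and derive the valency class from their number. The dictionary layout, the role labels and the function names are assumptions made for this illustration only; they do not reflect how valency is encoded in the project's lexicon.

```python
# Hypothetical mini-lexicon: each verb lists the arguments it requires.
REQUIRED_ARGUMENTS = {
    "disappear": ("subject",),
    "buy": ("subject", "object"),
}

VALENCY_NAMES = {1: "monovalent", 2: "bivalent", 3: "trivalent"}


def valency_class(verb):
    """Name the valency class from the number of required arguments."""
    return VALENCY_NAMES[len(REQUIRED_ARGUMENTS[verb])]


def missing_arguments(verb, supplied):
    """Return the required roles that the sentence does not supply."""
    return [role for role in REQUIRED_ARGUMENTS[verb] if role not in supplied]


print(valency_class("disappear"))                   # monovalent
print(valency_class("buy"))                         # bivalent
print(missing_arguments("buy", {"subject": "she"}))  # ['object']
```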
However, in this project valency refers to nouns and their ability to form
different kinds of syntactic relations. An example is a noun that can serve as
the dependency attachment point of a non-finite or finite subclause, i.e. the
subclause is postnominal:
1. Non-finite subclause: Vi har forskellige behov, forskellige måder at udtrykke
vores følelser og drifter på. ("We have different needs, different ways of
expressing our feelings and urges.")
2. Finite subclause: Han tog den beslutning, at han ville holde op med at ryge.
("He made the decision that he would stop smoking.")
At present, only a few valency tags have been applied in the corpus. However,
nouns have been tagged for "prepositional valency": prepositions that
typically co-occur with the noun. This tagging is not, however, valency
marking in a strict (or even less strict) sense, but more of a
probability/statistics hint.
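One way to picture such a statistical hint is to count which prepositions follow a noun in a corpus and keep those that account for a large share of the occurrences. The sketch below is an assumption about how such hints could be derived, not a description of the project's procedure: the function name preposition_hints, the min_share threshold and the toy noun-preposition pairs are all invented for the example.

```python
from collections import Counter, defaultdict


def preposition_hints(pairs, min_share=0.5):
    """Map each noun to the prepositions that make up at least `min_share`
    of its observed noun + preposition co-occurrences."""
    counts = defaultdict(Counter)
    for noun, preposition in pairs:
        counts[noun][preposition] += 1

    hints = {}
    for noun, by_preposition in counts.items():
        total = sum(by_preposition.values())
        hints[noun] = sorted(
            prep for prep, n in by_preposition.items() if n / total >= min_share
        )
    return hints


# Toy observations (invented): (noun, following preposition).
pairs = [
    ("behov", "for"),
    ("behov", "for"),
    ("behov", "i"),
    ("beslutning", "om"),
]
print(preposition_hints(pairs))  # {'behov': ['for'], 'beslutning': ['om']}
```

Because the result depends only on relative frequency, it expresses a tendency rather than a grammatical requirement, which matches the "hint" status described above.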
Related Projects
Other projects have dealt with semantic tagging of corpora. The project
SIMPLE - Semantic Information for Multifunctional Plurilingual LExicons - aimed
at developing wide-coverage semantic lexicons for 12 European languages.
SIMPLE used a common semantic model and semantic coding formalism and
distinguished 139 semantic types. Each semantic type was defined as a cluster of
structured semantic information, e.g. semantic class, domain, argument structure
of predicative expressions, selectional restrictions and qualia roles (cf.
Pustejovsky 1995).
The SIMPLE project was an extension of the European PAROLE project: Preparatory
Action for Linguistic Resources Organisation for Language Engineering. The main
objective of PAROLE was to produce corpora and lexical resources for each of the
14 participating European languages. One of the results of the PAROLE project
was a set of harmonised lexica containing a minimum of 20,000 entries provided
with morphosyntactic and syntactic information: 12,000 nouns, 3,000 verbs, 3,000
adjectives, 500 adverbs and 1,500 entries with other POS categories.
For information on SIMPLE and PAROLE for Danish, see the documentation on SIMPLE
and PAROLE provided by the Center for Language Technology.
References
Bick, Eckhard. 2000. The Parsing System "Palavras". Automatic
Grammatical Analysis of Portuguese in a Constraint Grammar Framework.
Århus: Aarhus University Press.
Karlsson, Fred et al. (eds.). 1995. Constraint Grammar. A
Language-Independent System for Parsing Unrestricted Text. Berlin: Mouton de
Gruyter.
Pustejovsky, James. 1995. The Generative Lexicon. Cambridge, MA/London,
England: MIT Press.