Eckhard Bick, vislcg how-to 6/2006
Basic how-to for vislcg
1.Command-line usage:
standard call: vislcg --grammar rulesfile
without mapping rules: --no-mapping
with rule-number traces for debugging: --verbosity minimal
limited number of n least heuristic constraint sections: --sections=n
special mapping prefix (default ='@'): --prefix='...'
Ordinarily, input is piped from a lexicon-based morphological multitagger, but
input from probabilistic taggers (Treetagger, TnT, Brill etc.) can
also be used, in which case the first rule section typically will be
a correction grammar rather than a morphological disambiguation
grammar. In order to prevent syntactic rules from interfering with
morphological ones (by being run on morphologically not-yet
disambiguated input), it is recommended to run vislcg twice - first
without, then with syntactic mapping. Finally, disambiguated/tagged
output can be piped directly to a file, or processed with layout
filters or further grammars in other formalisms (constituent grammar,
dependency grammar, field grammar etc.).
cat textfile | multitagger | vislcg --grammar rulesfile --no-mapping | vislcg --grammar rulesfile | postfilter > textfile.cg
with tracing:
cat textfile | multitagger | vislcg --grammar rulesfile --no-mapping --verbosity minimal | vislcg --grammar rulesfile --verbosity minimal | postfilter > textfile.cg
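The two-pass call can also be assembled programmatically. The following Python sketch only builds the command string from the flags listed above and does not run anything. The function name vislcg_pipeline is hypothetical, and multitagger and postfilter remain placeholders for whatever programs are actually used.

```python
import shlex


def vislcg_pipeline(textfile, rulesfile, trace=False):
    """Build the two-pass shell pipeline as a string (nothing is executed)."""
    first = ["vislcg", "--grammar", rulesfile, "--no-mapping"]
    second = ["vislcg", "--grammar", rulesfile]
    if trace:
        # rule-number traces for debugging, in both passes
        first += ["--verbosity", "minimal"]
        second += ["--verbosity", "minimal"]
    # "multitagger" and "postfilter" are placeholder program names
    stages = [["cat", textfile], ["multitagger"], first, second, ["postfilter"]]
    pipeline = " | ".join(shlex.join(stage) for stage in stages)
    return pipeline + " > " + shlex.quote(textfile + ".cg")


print(vislcg_pipeline("textfile", "rulesfile"))
```

Calling the function with trace=True adds --verbosity minimal to both vislcg passes.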
The multitagger or other input source has to deliver so-called
verticalized text, i.e. one token per line, with non-punctuation
tokens followed by a cohort of one or more possible analyses,
indented, one per line. Conventionally, cohort lines start with the
lexeme or base form (in quotes), followed by word class (PoS) and
inflexion tags in upper case. Secondary tags, meant to be used as
disambiguation context but not intended for disambiguation
themselves, such as subclass, valency and semantic tags, should be
placed in <...> brackets between lexeme and word class tags:
"<wordform>"
    "lexeme-1" <valency> .. <semantics> .. POS-1 INFLEXION
    "lexeme-1" <valency> .. <semantics> .. POS-2 INFLEXION
    "lexeme-2" <valency> .. <semantics> .. POS-3 INFLEXION
    "lexeme-2" <valency> .. <semantics> .. POS-4 INFLEXION
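To make the input layout concrete, here is a minimal Python sketch that groups verticalized text into cohorts. It is not part of vislcg. The function name parse_cohorts is hypothetical, and the sketch assumes that token lines start in the first column, that reading lines are indented, and that no tag contains whitespace.

```python
def parse_cohorts(text):
    """Group verticalized text into (wordform, readings) pairs.

    Assumptions: a token line starts in column 0, each reading line
    below it is indented, and tags are separated by whitespace.
    """
    cohorts = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0].isspace():
            if not cohorts:
                raise ValueError("reading line before any token line")
            cohorts[-1][1].append(line.split())  # one reading = list of tags
        else:
            cohorts.append((line.strip(), []))  # new token, empty cohort so far
    return cohorts


SAMPLE = '''"<como>"
    "comer" V PR 1S IND
    "como" ADV
"<.>"
'''

cohorts = parse_cohorts(SAMPLE)
for wordform, readings in cohorts:
    print(wordform, readings)
```

The sample reuses the "como" cohort discussed under REMOVE further down; the punctuation token ends up with an empty cohort.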
2.The rules file
A vislcg rules file consists of the following sections:
DELIMITERS (1 line, defines sentence boundaries)
SETS (1 or more sections of set definitions, compiled as one)
MAPPINGS (1 section of mapping rules, adding tags at the end of a reading line)
CORRECTIONS (1 section of correction rules, replacing tags anywhere in a reading)
CONSTRAINTS (1 or more sections of REMOVE or SELECT rules, with each section compiled and run separately)
END
Set sections contain LIST definitions of sets, written as lists of
ORed tags or tag chains (in parentheses). Once defined, sets may be
combined into new sets with a SET definition.
Mapping and Correction sections have MAP/ADD and SUBSTITUTE rules,
respectively. These rules are applied in strict sequential order. But
while MAP/ADD rules can't "see" in their context conditions what
earlier mapping rules have mapped, this is not true of SUBSTITUTE
rules, which do interact with the result of previous substitution
rules.
Constraint sections will be interpreted as
heuristicity batches, with safer rules in the first sections, and
more heuristic rules in later sections. Each section is repeated
until no further of its rules can be instantiated (i.e. meet their
context conditions), then the next section is run and the first
section re-run after second-section disambiguation to check for
changed contexts. After that, a third section is run, and the lower
ones rerun, etc.
Within one and the same constraint section, rules should be regarded
as "simultaneous", since their order may be changed by the
compiler for optimisation purposes. However, word form rules will be
run first, and SELECT rules (due to their greater disambiguation
potential) have priority over REMOVE rules with the same target.
Each set definition or rule is terminated with a
semicolon, but can run over several lines. As in several programming
languages, the #-symbol marks the rest of a line as a comment.
3.The individual operators
3.2.Delimiters
The vislcg compiler applies rules within a certain
context window, defined by delimiters. Typically,
delimiters will be sentence boundary markers (i.e. punctuation), but
paragraphs, corpus section markers or even specific stop-words could
be used. Rules can refer to the boundaries with the reserved symbols
>>> (left boundary) and <<< (right boundary).
DELIMITERS = "<.>" "<!>" "<?>" ;
The example defines a full stop, exclamation mark or
question mark as a delimiter. Note that punctuation notation follows
wordform notation, with quotes and angle brackets.
3.3.Set definitions
In both their targets and context conditions, CG rules
can refer not only to words, lexemes and tags, but also sets of
words, lexemes or tags, or even combinations of these three types.
Two kinds of set definitions are used:
(a) LIST set-name = followed by a list of tags or tag combinations
(the latter in parentheses), separated by spaces. The list
constitutes the set, and a rule targeting a set is equivalent to a
batch of rules targeting each set element separately.
(b) SET set-name = defining a new set as a mathematical operation on
existing sets. Sets used in a SET definition must occur earlier in
the grammar. Tags can be used as sets on the fly by enclosing them in
parentheses.
A set element can be:
(1) a tag, word form or lexeme, e.g. N [for noun], "<bought>" [word
form] or "buy" [lexeme]
(2) a combination of (1), as a kind of "snapshot" from a reading, in
parentheses. The snapshot may have "holes" (i.e. interfering tags
appearing in the reading but not in the set element). For instance,
(N M P) [for noun masculine plural], or ("eat" INF).
In a SET definition (b), sets can be combined with the following
operators:
union: OR or | , e.g. set1 OR set2 OR (tag3) OR (N F S)
concatenation: + , e.g. set1 + set2, yields all possible combinations
of the 2 sets' elements. Thus, a concatenation of LIST set1 = V and
LIST set2 = INF GER PCP covers all non-finite verb forms: (V INF)
(V GER) (V PCP).
negation: - , e.g. set1 - set2 (set1 but not set2), means set1 as
long as the reading in question does not contain elements from set2.
Thus, rather than just a removal of set2 elements from the set1 list
(i.e. set difference, as used in Tapanainen's cg2), vislcg interprets
the minus operation as a kind of NOT condition, so the presence of a
set2 element in a reading will block and override the presence of a
set1 reading. Thus, (N) - (P) means non-plural nouns. In the upcoming
visl-cg3, a clear distinction will be made between negation and set
difference.
The + and - operators have precedence over OR.
Note that the same operators, as well as the parenthesis
convention for creating sets on-the-fly, can be used in targets and
context conditions of REMOVE and SELECT rules in the CONSTRAINTS
section.
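The three set operators can be illustrated with a small Python model. This is a sketch of the behaviour described above, not vislcg's implementation. A set is modelled as a list of tag tuples, a reading as a list of tags, and all function names are hypothetical.

```python
from itertools import product


def matches(tag_set, reading):
    """True if any element of tag_set occurs in the reading (holes allowed)."""
    return any(all(tag in reading for tag in element) for element in tag_set)


def union(set1, set2):
    """OR: the elements of both sets."""
    return set1 + set2


def concat(set1, set2):
    """+: every combination of one element from each set."""
    return [a + b for a, b in product(set1, set2)]


def minus_matches(set1, set2, reading):
    """vislcg-style -: set1 matches and no set2 element is in the reading."""
    return matches(set1, reading) and not matches(set2, reading)


V = [("V",)]
NONFIN = [("INF",), ("GER",), ("PCP",)]
print(concat(V, NONFIN))  # [('V', 'INF'), ('V', 'GER'), ('V', 'PCP')]

# (N) - (P): non-plural nouns
print(minus_matches([("N",)], [("P",)], ["N", "F", "S"]))  # True
print(minus_matches([("N",)], [("P",)], ["N", "F", "P"]))  # False
```

Note that minus_matches is a test on a reading rather than an operation that returns a new list, which mirrors the NOT-like interpretation of the minus operator described above.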
3.4.Constraints
Constraint rules are ordered in sections, usually in
order to separate safer rules (to be used earlier) from more
heuristic rules (to be used later). Within one section, rules should
be regarded as simultaneous, though REMOVE rules will be used after
SELECT rules. One and the same grammar can be run at different levels
of heuristicity by using the --sections=n flag when calling vislcg,
meaning that only the first (=safest) n constraint sections of the
grammar will be used.
A CG rule has the following general form, with []
brackets indicating optional elements:
["<Wordform>"] OPERATION TARGET [[IF] (CONTEXT-1) (CONTEXT-2) ...] ;
OPERATION:
(a) REMOVE
Removes a reading from a cohort, if it contains a
TARGETed tag - unless this reading is the last surviving reading. In
the case of a morphological or PoS tag, this means that one (entire)
reading line, in a cohort of readings for a given token, will be
removed - for instance the reading line "comer" V PR 1S IND
will be removed from the analysis cohort of "como", if
either the V (verb) or PR (present tense) tags are TARGETed by a
successful REMOVE rule, leaving the "como" ADV reading to
survive. If the target is a MAPped tag (i.e. a @-tag), it is
removed from the reading line, and if it is the only or last
surviving MAPped tag, the whole reading line will be removed.
(b) SELECT
Selects a reading, if it contains a TARGETed tag. In
practice, selection is equivalent to a removal of all other
readings. In the case of an @-tag target, the reading line is cleared
of all other @-tags.
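The effect of the two operations on a cohort can be sketched in Python as follows. The sketch ignores context conditions and @-tags, the function names are hypothetical, and leaving the cohort untouched when every reading matches a REMOVE target is an assumption, since the text above only states that the last surviving reading is protected.

```python
def contains(reading, target):
    """True if the reading carries every tag of the target."""
    return all(tag in reading for tag in target)


def remove(readings, target):
    """REMOVE: drop readings that contain the target, never the last one."""
    if len(readings) <= 1:
        return readings  # the last surviving reading is protected
    kept = [r for r in readings if not contains(r, target)]
    # Assumption: if all readings match, the cohort is left as it is.
    return kept if kept else readings


def select(readings, target):
    """SELECT: keep only readings that contain the target."""
    chosen = [r for r in readings if contains(r, target)]
    return chosen if chosen else readings  # no match: nothing happens


como = [['"comer"', "V", "PR", "1S", "IND"], ['"como"', "ADV"]]

print(remove(como, ("V",)))      # [['"como"', 'ADV']]
print(select(como, ("ADV",)))    # [['"como"', 'ADV']]
print(remove(remove(como, ("V",)), ("ADV",)))  # still [['"como"', 'ADV']]
```

On this cohort, removing V and selecting ADV give the same result, which is the equivalence between SELECT and removal of all other readings mentioned above.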
WORDFORM:
Optional part of a rule, restricting the rule to the
wordform in question. Since the operation is case sensitive,
preprocessing (lowercasing) is necessary, if a rule targeting e.g. an
English noun also is to apply if the noun occurs in sentence-initial
position. VISL grammars use lowercasing of initials, storing the
uppercase information as a tag (<*>) instead.
WORDFORM may be a set of wordforms, but the set must not
include other tag types. Otherwise, the WORDFORM condition works like
a context condition for position 0 (self).
TARGET:
Obligatory part of a rule. A target is always a set,
either a predefined set from the SETS section, or a tag string
defined as a set on-the-fly by using parentheses, e.g. NOMINAL
(defined by LIST NOMINAL = N ADJ PCP) or (N) or (N F P). Using predefined
sets as targets, effectively fuses what in the cg-1 formalism was a
same-context batch of multiple rules, into one rule:
SELECT NOMINAL IF (-1C DET) ;
(same as 3 rules targeting (N), (ADJ) and (PCP)
separately).
CONTEXT:
One or more contexts can be used, but (heuristic) rules
without any context are allowed, too. Each context is enclosed in
parentheses. Contexts are applied as AND-linked conditions, i.e. all
conditions of a given rule must be true ("instantiated")
for the rule to apply. A context condition may contain the following
elements:
An obligatory position marker, consisting of a number indicating
relative distance in tokens. The default (positive number) is a right
context, while a negative number indicates a left context. A context
can be negated by using NOT in front of the position marker. An
asterisk (*), prefixed to the position marker number, means
"unbounded context". In this case, the context condition is searched
for all the way to the left (-) or right (+) sentence boundary - even
if the context search should cross the TARGET position (position 0).
A positive unbounded context condition is instantiated at the closest
possible position - unless a double asterisk (**) is used, which will
allow instantiation at the second or later occurrence. Later
instantiation is relevant only in the presence of LINKed contexts
(which might not be true of the first, but of a later occurrence of
the original condition). An at-sign (@) in front of a position number
means absolute context, e.g. @1 for the first token/cohort, @2 for
the second, and @-2 for the second-but-last token/cohort in the
sentence.
An obligatory context condition consists
of a (position-restricted) set (or set-ified tags or tag sequences).
As elsewhere, sets may be combined by set operators: OR
(union), + (concatenation in one and the same reading)
or AND (intersection, both tags in the same cohort).
A C (careful) condition attached to the position number means
that the context condition has to be a safe (i.e. the only)
reading of the cohort in question. For instance, (-1C N) denotes an
unambiguous noun one position to the left (i.e. left adjacent). A
word with both a noun (N) and a verb (V) reading in this position
would not fulfill the context condition.
An optional linked context, where
the word LINK chains 2 contexts (within the same
context parenthesis). The second, linked context condition is
written in the same fashion as the first one, but its relative
position is calculated from the instantiated first context rather
than the rule target. In other words, each LINK resets the context
position to 0. In this way, it is possible to create arbitrarily
long chains of LINKed context conditions. In practice, all links in
a chain point to the same side (i.e. either right or left), but in
theory, a change of direction is allowed.
An optional barrier context, where
the word BARRIER is used right after an unbounded
context (*-context). A barrier context blocks the preceding context
search, if the barrier condition is instantiated before the unbounded
context can be instantiated. As usual, barrier contexts may consist
of sets, set-ified tags or set combinations, but do not need a
position marker. For instance, (*1 VFIN BARRIER CLB) looks for a
finite verb (VFIN) anywhere to the right (*1), but only if there is
no interfering clause boundary (CLB) in between. A subordinator or
comma would thus block further VFIN-searching.
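As a rough illustration of how a single context condition is evaluated, the following Python sketch covers relative positions, the careful marker C, unbounded search and BARRIER. It leaves out NOT, LINK, ** and absolute (@) positions. All names are hypothetical, tags stand in for sets, and two points are assumptions rather than documented vislcg behaviour: C is read as "every reading of the cohort matches", and a cohort that satisfies both the condition and the barrier counts as a match.

```python
def cohort_has(cohort, tags, careful=False):
    """True if a reading of the cohort contains all `tags`.

    With careful=True every reading must contain them (assumed reading of C).
    """
    hits = [all(tag in reading for tag in tags) for reading in cohort]
    return bool(hits) and (all(hits) if careful else any(hits))


def context_holds(sentence, target, offset, tags,
                  careful=False, unbounded=False, barrier=None):
    """Test one context condition relative to the cohort at index `target`."""
    step = 1 if offset > 0 else -1
    pos = target + offset
    while 0 <= pos < len(sentence):
        if cohort_has(sentence[pos], tags, careful):
            return True
        if not unbounded:
            return False  # a bounded context looks at one position only
        if barrier is not None and cohort_has(sentence[pos], barrier):
            return False  # barrier met before the condition: search blocked
        pos += step
    return False  # ran into the sentence boundary


# A sentence is a list of cohorts; a cohort is a list of readings.
free = [[["PROP"]], [["ADV"]], [["VFIN"]]]
blocked = [[["PROP"]], [["KS"]], [["VFIN"]]]
ambiguous = [[["N"], ["V"]], [["ADJ"]]]

# (*1 VFIN BARRIER KS), seen from the first cohort
print(context_holds(free, 0, 1, ("VFIN",), unbounded=True, barrier=("KS",)))
print(context_holds(blocked, 0, 1, ("VFIN",), unbounded=True, barrier=("KS",)))
# (-1 N) versus (-1C N), seen from the second cohort
print(context_holds(ambiguous, 1, -1, ("N",)))
print(context_holds(ambiguous, 1, -1, ("N",), careful=True))
```

The four calls print True, False, True and False: the subordinator KS blocks the search for a finite verb, and the careful condition fails on the cohort that is ambiguous between N and V.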
3.5.Mappings
A MAPPING-rule has the following general layout:
OPERATION (MAPTAG-1 MAPTAG-2 ...) (TARGET) IF (CONTEXT-1) ... (CONTEXT-n)
Mapping rules add tags to a cohort line (i.e. reading),
if that line contains a certain TARGET and if certain optional
CONTEXTs are fulfilled. Context conditions are expressed as in the
CONSTRAINT section, and sets are used and constructed in the usual
way. Any kind of tag may be added. However, only mapped tags with a
special mapping-prefix (by default, @) will be treated as real @tags.
@tags are traditionally syntactic tags, added and disambiguated on
one cohort line
(itself representing a PoS/inflexion reading). During disambiguation,
@tags will be cut down to the last reading. If there is only one
reading in the cohort, this last @tag is untouchable, otherwise the
whole reading line dies together with its last @tag. When calling a
grammar with vislcg, the @-prefix may be changed by using the
--prefix='...' flag.
The following OPERATIONs are allowed in mapping rules:
MAP: This is the general mapping operator. It is a feature of the
special @tags
that MAP rules cannot apply if the targeted cohort line already
contains one or more @tags (from an earlier MAP rule or the
lexicon). Thus, if ambiguity is desired, the @tags in question have
to be MAPped at the same time (i.e. by the same rule). In order to
allow further mapping, ADD rules have to be used instead of MAP
rules.
ADD: Mapping of @tags is performed independently
of the presence of other @tags on the cohort line. Thus, @-mapping
may continue until a MAP rule "closes" the @tag-list for a
given cohort line.
REPLACE: This is a CG-2 operator deprecated in
vislcg in favour of the more powerful SUBSTITUTE operator. REPLACE
deletes all tags but the first one (normally the lexeme tag), and
adds the mapped tags instead.
Unlike constraint rules, mapping-rules are applied in
exactly the order they are given in. Mapping rules in the same
grammar (section?) cannot use earlier mapped tags as contexts.
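The difference between MAP and ADD can be sketched in Python as below. The sketch follows the description above literally: MAP is blocked by any tag carrying the mapping prefix, while ADD is not. Targets and contexts are left out, and the names PREFIX, map_tags and add_tags are hypothetical.

```python
PREFIX = "@"  # default mapping prefix, cf. --prefix


def has_mapped_tag(reading):
    """True if the reading already carries a tag with the mapping prefix."""
    return any(tag.startswith(PREFIX) for tag in reading)


def map_tags(reading, tags):
    """MAP: append tags only if the reading has no @tags yet."""
    if has_mapped_tag(reading):
        return reading  # MAP cannot apply to an already mapped reading
    return reading + list(tags)


def add_tags(reading, tags):
    """ADD: append tags regardless of @tags already present."""
    return reading + list(tags)


print(map_tags(["PROP"], ["@SUBJ>", "@ACC>"]))  # both tags mapped at once
print(map_tags(["PROP", "@SUBJ>"], ["@ACC>"]))  # unchanged
print(add_tags(["PROP", "@SUBJ>"], ["@ACC>"]))  # second tag added
```

This is why an ambiguous mapping such as @SUBJ> and @ACC> has to come from a single MAP rule, or from ADD rules.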
3.6.Corrections
Correction rules are used to correct faulty input - for instance from
a probabilistic tagger, or in a spell checker - by replacing tags
with other tags. Deletion can be handled by nil-replacements, and
insertion by replacing a tag with an appended version containing also
the new, inserted tag.
The general shape of a correction rule is the following:
SUBSTITUTE (TAG-1) (TAG-2) TARGET (TAG-3) IF (CONTEXT-1) ... (CONTEXT-n)
Here, TAG-1 is replaced with TAG-2 in cohort lines that contain the
target tag TAG-3, with (optional) context conditions structured in
the usual fashion. As usual, on-the-fly sets (as in the example) can
be used on par with predefined or combined sets.
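Replacement, deletion and insertion can all be expressed with one substitution step, as the following Python sketch suggests. It works on a single reading, ignores context conditions, and uses a hypothetical function name. Requiring all TAG-1 tags to be present before anything is replaced is an assumption.

```python
def substitute(reading, old, new, target):
    """SUBSTITUTE (old) (new) TARGET (target) on one reading.

    Assumption: the rule applies only if the reading contains both the
    target tags and all tags to be replaced.
    """
    if not all(tag in reading for tag in target):
        return reading
    if not all(tag in reading for tag in old):
        return reading
    result = []
    inserted = False
    for tag in reading:
        if tag in old:
            if not inserted:
                result.extend(new)  # new tags take the place of the old ones
                inserted = True
        else:
            result.append(tag)
    return result


print(substitute(["N", "S"], ("S",), ("P",), ("N",)))       # replacement
print(substitute(["N", "F", "S"], ("F",), (), ("N",)))      # deletion
print(substitute(["N", "S"], ("N",), ("N", "F"), ("N",)))   # insertion
```

The three calls give ['N', 'P'], ['N', 'S'] and ['N', 'F', 'S'], corresponding to a plain replacement, a nil-replacement, and a replacement with an appended version of the tag.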
4.Sample rules file
DELIMITERS = "<.>" "<!>" "<?>" ; # sentence window
SETS
LIST NOMINAL = N PROP ADJ PCP ; # nominals, i.e. potential nominal heads
LIST PRE-N = DET ADJ PCP ; # prenominals
LIST P = P S/P ; # plural
SET PRE-N-P = PRE-N + P ; # plural prenominals, equivalent to
# (DET P) (DET S/P) (ADJ P) (ADJ S/P) (PCP P) (PCP S/P)
LIST CLB = "<,>" KS (ADV <rel>) (ADV <interr>) ; # clause boundaries
LIST ALL = N PROP ADJ DET PERS SPEC ADV V PRP KS KC IN ; # all word classes
LIST V-SPEAK = "dizer" "falar" "propor" ; # speech verbs
LIST @MV = @FMV @IMV ; # main verbs
CONSTRAINTS
REMOVE (N S) IF (-1C PRE-N-P) ; # remove a singular noun reading if
# there is a safe plural prenominal directly to the left.
REMOVE NOMINAL IF (NOT 0 P) (-1C (DET) + P) ; # remove a nominal if it
# isn't plural but preceded by a safe plural determiner.
REMOVE (VFIN) IF (*1 VFIN BARRIER CLB OR (KC) LINK *1 VFIN BARRIER CLB OR (KC)) ;
# remove a finite verb reading if there are two more finite verbs to
# the right, neither of them barred by a clause boundary (CLB) or
# coordinating conjunction (KC).
"<que>" SELECT (KS) (*-1 V-SPEAK BARRIER ALL - (ADV)) ; # select the
# conjunction reading for the word form 'que', if there is a speech
# verb to the left with nothing but adverbs in between.
MAPPINGS
MAP (@SUBJ> @ACC>) TARGET (PROP) IF (*1C VFIN BARRIER ALL - (ADV)) (NOT -1 PROP OR PRP) (NOT *-1 VFIN) ;
# a proper noun can be either forward subject or forward direct
# object, if there follows a finite verb to the right with nothing
# but adverbs in between, provided there is no proper noun or
# preposition directly to the left, and no finite verb anywhere to
# the left.
CONSTRAINTS
REMOVE (@SUBJ>) IF (*1 @MV BARRIER CLB LINK *1C @<SUBJ BARRIER @MV) ;
# remove a forward subject if there is a safe backward subject to the
# right with only one main verb in between