Things to be aware of.
In CG-3 all cohorts have at least one reading. If none are given in the input, one is generated from the wordform. These magic readings can be the target of rules, which may not always be intended.
For example, given the input
"<word>"
	"word" N NOM SG
"<$.>"
a magic reading is made so the cohorts internally look like
"<word>"
	"word" N NOM SG
"<$.>"
	"<$.>" <<<
The above input combined with a rule à la
MAP (@X) (*) ;
will give the output
"<word>"
	"word" N NOM SG @X
"<$.>"
	"<$.>" <<< @X
because MAP promoted the last magic reading to a real reading.
If you do not want these magic readings to be the possible target of rules, you can use the cmdline option --no-magic-readings. Internally they will still be generated and contextual tests can still reference them, but rules cannot touch or modify them directly. SETCHILD is an exception.
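The generation step described above can be pictured with a small Python sketch. This is a toy model and not CG-3's actual code: the function names are hypothetical, the parser assumes wordform lines start with "< and reading lines are indented, and the window-end tag <<< is not modelled.

```python
# Toy model of magic reading generation; not CG-3's actual implementation.

def parse_cohorts(stream):
    """Split a CG-style stream into (wordform, readings) pairs.

    Assumption: wordform lines start with '"<' and reading lines are indented.
    """
    cohorts = []
    for line in stream.splitlines():
        if line.startswith('"<'):
            cohorts.append((line.strip(), []))
        elif line[:1].isspace() and line.strip() and cohorts:
            cohorts[-1][1].append(line.strip())
    return cohorts


def add_magic_readings(cohorts):
    """Give every cohort without readings one reading built from its wordform."""
    result = []
    for wordform, readings in cohorts:
        if readings:
            result.append((wordform, list(readings), False))
        else:
            # The third field marks the reading as magic.
            result.append((wordform, [wordform], True))
    return result


stream = '"<word>"\n\t"word" N NOM SG\n"<$.>"\n'
for wordform, readings, is_magic in add_magic_readings(parse_cohorts(stream)):
    print(wordform, readings, "(magic)" if is_magic else "")
```

Keeping a flag next to each generated reading mirrors the idea that the readings always exist internally, while an option such as --no-magic-readings only decides whether rules may target them.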
In CG-2 and VISLCG the keyword NOT behaved differently depending on whether it was in front of the first test or in front of a linked test. In the case of
(NOT 1 LSet LINK 1 KSet LINK 1 JSet)
the NOT would apply last, meaning it would invert the result of the entire chain, but in the case of
(1 LSet LINK NOT 1 KSet LINK 1 JSet)
it only inverts the result of the immediately following test.
CG-3 implements the NEGATE keyword to make the distinction clearer. This means that if you are converting grammars to CG-3 you must replace starting NOTs with NEGATEs to get the same functionality. So the first test should instead be
(NEGATE 1 LSet LINK 1 KSet LINK 1 JSet)
Alternatively you can use the option --vislcg-compat (short form -2) to work with older grammars that you do not wish to permanently update to use NEGATE.
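The difference between inverting one test and inverting the whole chain can be illustrated with a toy evaluator in Python. This is a simplified sketch under stated assumptions, not how CG-3 evaluates contexts: cohorts are plain sets of tags, the tags "L", "K" and "J" stand in for membership of LSet, KSet and JSet, and the helper names are hypothetical.

```python
# Toy evaluator for linked contextual tests; a simplification, not CG-3 itself.

def run_chain(cohorts, pos, tests):
    """Evaluate linked tests in order.

    Each test is (inverted, offset, tag); 'inverted' models a NOT that only
    flips the result of that single test.
    """
    for inverted, offset, tag in tests:
        pos += offset
        in_range = 0 <= pos < len(cohorts)
        matched = in_range and tag in cohorts[pos]
        if inverted:
            matched = not matched
        if not matched:
            return False
    return True


def negate(cohorts, pos, tests):
    """Model NEGATE: flip the result of the entire chain."""
    return not run_chain(cohorts, pos, tests)


cohorts = [{"X"}, {"L"}, {"K"}, {"Z"}]
chain = [(False, 1, "L"), (False, 1, "K"), (False, 1, "J")]
not_first = [(True, 1, "L"), (False, 1, "K"), (False, 1, "J")]

print(negate(cohorts, 0, chain))         # True: the chain fails on J, then is flipped
print(run_chain(cohorts, 0, not_first))  # False: only the first test is flipped
```

In this model a leading NOT read the CG-2 way corresponds to negate(), which is why such rules need NEGATE to keep their meaning in CG-3.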
PREFERRED-TARGETS is currently ignored in CG-3. See PREFERRED-TARGETS for details.
CG-3 will auto-detect the codepage from the environment, which in some cases is not what you want. It is not uncommon to work with UTF-8 data but have your environment set to US-ASCII which would produce some unfortunate errors. You can use the runtime option -C to override the default codepage, and you should always enforce it if you plan on distributing packages that depend on a certain codepage.
In CG-2 the - operator meant set difference; in VISLCG it meant set fail-fast; in CG-3 operator - means something in between. The new operator ^ takes the place of VISLCG's behavior, and operator \ takes the place of CG-2's behavior.
In CG-1 and some versions of CG-2, scanning tests could not pass the point of origin, but in CG-3 they can by default do so. The cmdline flag --no-pass-origin can set the default behavior to that of CG-1. See Scanning Past Point of Origin for more details.
In VISLCG the magic tags >>> and <<<, denoting sentence start and end respectively, could sometimes wind up in the output. In CG-3 they are never part of the output.
In CG-2 the order in which rules are applied on cohorts cannot be reliably predicted.
In VISLCG rules can be forced to be applied in the order they occur in the grammar, but VISLCG will try to run all rules on the current cohort before trying next cohort:
ForEach (Window)
  ForEach (Cohort)
    ForEach (Rule)
      ApplyRule
CG-3 always applies rules in the order they occur in the grammar, and will try the current rule on all cohorts in the window before moving on to the next rule. This yields a far more predictable result and cuts down on the need for many sections in the grammar.
ForEach (Window)
  ForEach (Rule)
    ForEach (Cohort)
      ApplyRule
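The two loop nestings differ only in which loop is outermost, which a short Python sketch makes concrete. It models a single window, and the function names and the cohort and rule labels are hypothetical placeholders.

```python
# Sketch of the two application orders within one window.

def vislcg_order(cohorts, rules):
    """Cohort-major: every rule is tried on a cohort before the next cohort."""
    return [(rule, cohort) for cohort in cohorts for rule in rules]


def cg3_order(cohorts, rules):
    """Rule-major: each rule is tried on every cohort before the next rule."""
    return [(rule, cohort) for rule in rules for cohort in cohorts]


cohorts = ["c1", "c2"]
rules = ["r1", "r2"]
print(vislcg_order(cohorts, rules))
# [('r1', 'c1'), ('r2', 'c1'), ('r1', 'c2'), ('r2', 'c2')]
print(cg3_order(cohorts, rules))
# [('r1', 'c1'), ('r1', 'c2'), ('r2', 'c1'), ('r2', 'c2')]
```

In the rule-major order, r2 only starts once r1 has been tried on every cohort in the window, so r2 sees the window in a single, well-defined state.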
Since any rule can be in any section, it is possible to write endless loops.
For example, this grammar will potentially loop forever:
SECTION
ADD (@not-noun) (N) (0 (V)) ;
ADD (@noun) (N) ;

SECTION
REMOVE (@noun) IF (0 (V)) ;
Since ADD is in a SECTION it will be run again after REMOVE, and since ADD does not block further appending of mapping tags it can re-add @noun each time, leading to REMOVE finding it and removing it, ad nauseam.
In order to prevent this, the REMOVE rule can in most cases be rewritten to:
REMOVE (N) IF (0 (@noun) + (N)) (0 (V)) ;
That is, the target of the REMOVE rule should be a non-mapping tag with the mapping tag as 0 context. This will either remove the entire reading or nothing, as opposed to a single mapping tag, and will not cause the grammar to rewind.
Similarly, it is possible to make loops with APPEND and SELECT/REMOVE/IFF combinations, and probably many other to-be-discovered mixtures of rules. Something to be aware of.
In CG-1, CG-2, and VISLCG it was not always clear when you could refer to a previously mapped tag in the same grammar. In VISL CG-3 all changes to readings become visible to the rest of the grammar immediately.
The contextual positions 'cc' and 'c*' may at first glance seem to behave exactly the same, but there is a subtle difference when combined with the left-of/right-of filters that can lead to wildly different cohorts being chosen in rules asking for TO/FROM contextual targets:
cc will first create a complete list of all children and grand-children, then apply any left-of/right-of filters.
c* will apply left-of/right-of filters at each step down the child tree, not following any branch which doesn't uphold the filter.
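A small Python sketch of the two traversal strategies shows how the results can diverge. It is a toy model only: cohorts are identified by their position in the window, the dependency tree is a plain dictionary, the left-of filter is taken relative to the origin cohort (an assumption of this sketch), and the function names are hypothetical.

```python
# Toy model of 'cc' versus 'c*' combined with a left-of filter.

def descendants(children, node):
    """All children, grandchildren, and so on of node."""
    found = []
    for child in children.get(node, []):
        found.append(child)
        found.extend(descendants(children, child))
    return found


def cc_left(children, origin):
    """cc style: collect every descendant first, then apply the filter."""
    return sorted(n for n in descendants(children, origin) if n < origin)


def c_star_left(children, origin):
    """c* style: filter at each step and do not follow failing branches."""
    found = []
    frontier = [origin]
    while frontier:
        node = frontier.pop()
        for child in children.get(node, []):
            if child < origin:
                found.append(child)
                frontier.append(child)
    return sorted(found)


# Cohort 3 has children 2 and 5; cohort 5 has child 1.
children = {3: [2, 5], 5: [1]}
print(cc_left(children, 3))      # [1, 2]
print(c_star_left(children, 3))  # [2]
```

Cohort 1 lies to the left of the origin but hangs below cohort 5, which is to the right. The cc style still finds it, while the c* style never descends into that branch.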
The CG-2 spec says that readings in the format
"word" tag @MAP @MUP ntag @MIP
should be equivalent to
"word" tag @MAP
"word" tag @MUP
"word" tag ntag @MIP
Since the tag order does not matter in CG-3, this is instead equivalent to
"word" tag ntag @MAP
"word" tag ntag @MUP
"word" tag ntag @MIP
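The two expansions can be reproduced with a short Python sketch. It is illustrative only: the function names are hypothetical, and mapping tags are assumed to be recognisable by an @ prefix.

```python
# Sketch of splitting one reading with several mapping tags into several readings.

def is_mapping(tag):
    # Assumption: mapping tags are recognised by the '@' prefix.
    return tag.startswith("@")


def expand_cg2(baseform, tags):
    """CG-2 style: each mapping tag gets only the plain tags that precede it."""
    readings, plain = [], []
    for tag in tags:
        if is_mapping(tag):
            readings.append(" ".join([baseform, *plain, tag]))
        else:
            plain.append(tag)
    return readings


def expand_cg3(baseform, tags):
    """CG-3 style: tag order is irrelevant, so every reading gets all plain tags."""
    plain = [t for t in tags if not is_mapping(t)]
    return [" ".join([baseform, *plain, t]) for t in tags if is_mapping(t)]


tags = ["tag", "@MAP", "@MUP", "ntag", "@MIP"]
print(expand_cg2('"word"', tags))
# ['"word" tag @MAP', '"word" tag @MUP', '"word" tag ntag @MIP']
print(expand_cg3('"word"', tags))
# ['"word" tag ntag @MAP', '"word" tag ntag @MUP', '"word" tag ntag @MIP']
```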
The CG-2 spec says that the first tag of a reading is the baseform, whether it looks like [baseform] or "baseform". This is not true for CG-3; only "baseform" is valid.
The reason for this is that CG-3 has to ignore all meta text such as XML, and the only way I can be sure what is a reading and what is meta text is to declare that a reading is only valid in the forms of
"baseform" tag tags moretags
"base form" tag tags moretags
and not in the forms of
[baseform] tag tags moretags
baseform tag tags moretags
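As a rough illustration of that rule, the following Python sketch accepts a candidate reading only if it opens with a double-quoted baseform. The regular expression and the function name are assumptions made for this example and are not taken from CG-3's parser; escaped quotes inside a baseform are not handled.

```python
import re

# Assumed pattern: a double-quoted baseform, then whitespace or end of line.
READING_START = re.compile(r'^"[^"]+"(\s|$)')


def looks_like_reading(line):
    """Return True if the line opens with a quoted baseform."""
    return READING_START.match(line.strip()) is not None


for candidate in ['"baseform" tag tags moretags',
                  '"base form" tag tags moretags',
                  '[baseform] tag tags moretags',
                  'baseform tag tags moretags']:
    print(candidate, "->", looks_like_reading(candidate))
```

Anything that fails the check would simply be treated as meta text and passed over.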