First some example tags as we know them from CG-2 and VISLCG:
"<wordform>" "baseform" <W-max> ADV @error (<civ> N)
Now some example tags as they may look in VISL CG-3:
"<Wordform>"i "^[Bb]ase.*"r /^@<?ADV>?$/r <W>65> (<F>=15> <F<=30>) !ADV ^<dem> (N <civ>)
The tag '>>>' is added to the 0th (invisible) cohort in the window, and the tag '<<<' is added to the last cohort in the window. They can be used as markers to see if scans have reached those positions.
Starting with the last of these, (N <civ>): this merely shows that tags with multiple parts do not have to match in order; (N <civ>) is the same as (<civ> N). This differs from previous versions of CG, but I deemed it unnecessary to spend extra time checking the tag order when hash lookups can verify the tags' existence so quickly.
The first two additions to the feature sheet both display what I refer to as literal string modifiers, of which there are two: 'i' for case-insensitive, and 'r' for a regular expression match. Using these modifiers will significantly slow down matching, as a hash lookup is no longer enough. You can combine them as 'ir' for case-insensitive regular expressions. Regular expressions are evaluated via ICU, so the ICU documentation is a good source. Regular expressions may also contain groupings that can later be used in variable string tags (see below).
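As a small illustration of order-independent matching, the two sets below should accept exactly the same readings. The set names are invented for this sketch.

```cg3
# Composite tags match regardless of the order of their parts,
# so these two sets are equivalent.
LIST CivNoun = (N <civ>) ;
LIST CivNounSwapped = (<civ> N) ;

# Either set can be used as a rule target with the same effect.
SELECT CivNoun ;
```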
Due to tags themselves needing the occasional escaping, regular expressions need double-escaping of symbols that have special meaning to CG-3. E.g. literal non-grouping () need to be written as "a\\(b\\)c"r. Metacharacters also need double-escaping, so \w needs to be written as \\w.
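The modifiers and the escaping rule can be sketched as follows. The set names and the "colou?r" pattern are invented for illustration; only the "a\\(b\\)c"r pattern is taken from the text above.

```cg3
# 'i': case-insensitive wordform; should also match "<The>" and "<THE>".
LIST TheAnyCase = "<the>"i ;
# 'r': regular expression on a baseform; should match "color" and "colour".
LIST Colour = "colou?r"r ;
# 'ir': case-insensitive regular expression.
LIST ColourAnyCase = "colou?r"ir ;
# Double-escaped literal parentheses: matches the baseform "a(b)c".
LIST LiteralParens = "a\\(b\\)c"r ;
# Double-escaped metacharacter: \\w+ here stands for the regex \w+.
LIST WordChars = "\\w+"r ;
```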
Literal string modifiers will not work on wordforms used as the first qualifier of a rule, e.g.:
"<wordform>"i SELECT (tag) ;
but those can be rewritten in a form such as
SELECT ("<wordform>"i) + (tag) ;
which will work, but be slightly slower.
Tags in the form //r and //i and //ri are general purpose regular expression and case-insensitive matches that may act on any tag type, and unlike Literal String Modifiers they can do partial matches. Thus a tag like /^@<?ADV>?$/r will match any of @<ADV, @<ADV>, @ADV>, and plain @ADV. A tag like /word/ri will match any tag containing a substring with any case-variation of the text 'word'. Aside from that, the rules and gotchas are the same as for Literal String Modifiers.
Tags in the form //l are regular expressions matched on the whole literal reading. They are used to match exact tag sequences. The special helper __ will expand to (^|$| | .+? ) and can be used to ensure there are only whole tags before, after, or between something.
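A brief sketch of how such tags might be used in sets and rules. The two patterns are the ones discussed above; the set names and the final rule are hypothetical.

```cg3
# Matches @<ADV, @<ADV>, @ADV>, and @ADV, whatever the tag type.
LIST AdvFunction = /^@<?ADV>?$/r ;
# Partial, case-insensitive match: any tag containing "word", "Word", "WORD", ...
LIST ContainsWord = /word/ri ;

# Hypothetical rule using the regex set in a contextual test.
SELECT (V) IF (1 AdvFunction) ;
```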
Variable string tags contain markers that are replaced with matches from the previously run grouping regular expression tag. Regular expression tags with no groupings will not have any effect on this behavior. Time also has no effect, so one could theoretically perform a group match in a previous rule and use the results later, though that would be highly unpredictable in practice.
Variable string tags are in the form of "string"v, "<string>"v, and <string>v, where variables matching $1 through $9 will be replaced with the corresponding group from the regular expression match. Multiple occurrences of a single variable are allowed, so e.g. "$1$2$1"v would contain group 1 twice.
An alternative syntax is prefixing with VSTR:. This is used to build tags that are not textual, or tags that need secondary processing such as regex or case-insensitive matching. E.g., VSTR:@m$1 would create a mapping tag, and VSTR:"$1.*"r would create a regex tag. To include spaces in such tags, escape them with a backslash, e.g. VSTR:"$1\ $2"r (otherwise it is treated as two tags).
One can also manipulate the case of the resulting tag via %U, %u, %L, and %l. %U upper-cases the entire following string. %u upper-cases the following single letter. %L lower-cases the entire following string. %l lower-cases the following single letter. The case folding is performed right-to-left one-by-one.
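As a sketch of the case modifiers, the rules below follow the same pattern as the lower-casing ADD example in this chapter, swapping in %U and %u. The intended results in the comments assume a cohort with the wordform "<house>".

```cg3
# Should add an upper-cased copy of the wordform, e.g. <HOUSE>.
ADD (<%U$1>v) TARGET ("<(.*)>"r) ;
# Should add a copy with only the first letter upper-cased, e.g. <House>.
ADD (<%u$1>v) TARGET ("<(.*)>"r) ;
```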
It is also possible to include references to unified $$sets or &&sets in {} where they will be replaced with the tags that the unification resulted in. If there are multiple tags, they will be delimited by an underscore _.
It should be noted that you can use varstring tags anywhere, not just when manipulating tags. When used in a contextual test, they are fleshed out with the information available at the time and then matched as usual.
    # Adds a lower-case <wordform> to all readings.
    ADD (<%L$1>v) TARGET ("<(.*)>"r) ;

    # Adds a reading with a normalized baseform for all suspicious wordforms ending in 'ies'
    APPEND ("$1y"v N P NOM) TARGET N + ("<(.*)ies>"r) IF (1 VFIN) ;

    # Merge results from multiple unified $$sets into a single tag
    LIST ROLE = human anim inanim (bench table) ;
    LIST OTHER = crispy waffles butter ;
    MAP (<{$$ROLE}/{$$OTHER}>v) (target tags) (-1 $$OTHER) (-2C $$ROLE) ;
Then there are the numerical matches, e.g. <W>65>. This will match tags such as <W:204> and <W=156> but not <W:32>. The second example, (<F>=15> <F<=30>), matches values in the range 15 <= F <= 30. These constructs are also slower than simple hash lookups.
The two special values MIN and MAX (both case-sensitive) will scan the cohort for their respective minimum or maximum value, and use that for the comparison. Internally the value is stored in a double, and the range is capped to -281474976710656.0 through +281474976710655.0; values beyond that range are treated as those limits.
    # Select the maximum value of W. Readings with no W will also be removed.
    SELECT (<W=MAX>) ;
    # Remove the minimum F. Readings with no F will not be removed.
    REMOVE (<F=MIN>) ;
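The two numerical examples can be put into rules as below. This is only a sketch; the choice of SELECT and REMOVE is arbitrary.

```cg3
# Keep readings whose W value is greater than 65:
# matches <W:204> and <W=156>, but not <W:32>.
SELECT (<W>65>) ;

# Remove readings that carry an F value between 15 and 30, inclusive.
REMOVE (<F>=15> <F<=30>) ;
```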
Table 17.1. Valid Operators
Operator | Meaning |
---|---|
= | Equal to |
!= | Not equal to |
< | Less than |
> | Greater than |
<= | Less than or equal to |
>= | Greater than or equal to |
<> | Not equal to |
Anywhere that an = is valid you can also use : for backwards compatibility.
Table 17.2. Comparison Truth Table
A | B | Result |
---|---|---|
= x | = y | True if x = y |
= x | != y | True if x != y |
= x | < y | True if x < y |
= x | > y | True if x > y |
= x | <= y | True if x <= y |
= x | >= y | True if x >= y |
< x | != y | Always true |
< x | < y | Always true |
< x | > y | True if x > y |
< x | <= y | Always true |
< x | >= y | True if x > y |
> x | != y | Always true |
> x | > y | Always true |
> x | <= y | True if x < y |
> x | >= y | Always true |
<= x | != y | Always true |
<= x | <= y | Always true |
<= x | >= y | True if x >= y |
>= x | != y | Always true |
>= x | >= y | Always true |
!= x | != y | True if x = y |
CG-3 will store and forward any data between cohorts, attached to the preceding cohort. META:/.../r and META:/.../ri let you query and capture from this data with regular expressions. Data before the first cohort is not accessible.
ADD (@header) (*) IF (-1 (META:/<h\d+>$/ri)) ;
I recommend keeping META tags in the contextual tests, since they cannot currently be cached and will be checked every time.
In the CG and Apertium stream formats, it is allowed to have tags after the word form / surface form. These tags behave as if they contextually exist in every reading of the cohort - they will not be seen by rule targets.
Global variables are manipulated with rule types SETVARIABLE and REMVARIABLE, plus the stream commands SETVAR and REMVAR. Global variables persist until unset and are not bound to any window, cohort, or reading.
You can query a global variable with the form VAR:name, or query whether a variable has a specific value with VAR:name=value. Both the name and the value test can be in the form of // regular expressions, such as VAR:/ame/r=value or VAR:name=/val.*/r, including capturing parts.
The runtime value of the mapping prefix is stored in the special variable named _MPREFIX.
    REMOVE (@poetry) IF (0 (VAR:news)) ;
    SELECT (<historical>) IF (0 (VAR:year=1764)) ;
I recommend keeping VAR tags in the contextual tests, since they cannot currently be cached and will be checked every time.
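A hedged sketch of setting, querying, and clearing a variable from within a grammar. It assumes the rule forms SETVARIABLE (name) (value) target and REMVARIABLE (name) target; the tags @chapter-1764 and @chapter-end are invented for the example.

```cg3
# Assumed form: SETVARIABLE (name) (value) target ;
SETVARIABLE (year) (1764) (@chapter-1764) ;

# Regular expression on the value, modelled on VAR:name=/val.*/r above:
# true for any value of 'year' that starts with 17.
SELECT (<historical>) IF (0 (VAR:year=/17.*/r)) ;

# Assumed form: REMVARIABLE (name) target ;
REMVARIABLE (year) (@chapter-end) ;
```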
Local variables are almost identical to global variables, but use LVAR instead of VAR, and are bound to windows.
Global variables are remembered on a per-window basis. When the current window has no more possible rules, the current variable state is recorded in the window. Later windows looking back with W can then query what a given variable's value was at that time. It is also possible to query future windows' variable values if the stream contains SETVAR and the window is in the lookahead buffer.
When LVAR queries the current window, it is the same as VAR.
A Fail-Fast tag is marked with the ^ prefix, such as ^<dem>. It is checked before the other tags in a set and, if found, blocks the set from matching, regardless of whether later independent tags could match. It is mostly useful for sets such as LIST SetWithFail = (N <bib>) (V TR) ^<dem>. This set will never match a reading with a <dem> tag, even if the reading matches (V TR).
If you are worried about typos or otherwise need to enforce a strict tagset, STRICT-TAGS is your friend. You can add tags to the list of allowed tags with STRICT-TAGS += ... ; where ... is a list of tags to allow. Any tag parsed while the STRICT-TAGS list is non-empty will be checked against the list, and an error will be thrown if the tag is not on the list.
It is currently only possible to add to the list, hence +=. Removing and assigning can be added if anyone needs those.
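The set from the paragraph above, written out with a rule and some hypothetical readings to show the intended effect:

```cg3
LIST SetWithFail = (N <bib>) (V TR) ^<dem> ;

SELECT SetWithFail ;

# Intended behaviour on some hypothetical readings:
#   "x" N <bib>      matches via (N <bib>)
#   "x" V TR         matches via (V TR)
#   "x" V TR <dem>   does not match: ^<dem> blocks the whole set
```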
STRICT-TAGS += N V ADJ etc ... ;
By default, STRICT-TAGS always allows wordforms, baseforms, regular expressions, case-insensitive, and VISL-style secondary tags ("<…>", "…", <…>), since those are too prolific to list individually. If you are extra paranoid, you can change that with OPTIONS.
To get a list of unique used tags, pass --show-tags to CG-3. To filter this list to the default set of interesting tags, cg-strictify can be used:
cg-strictify grammar-goes-here
For comparison, this yields 285 tags for VISL's 10000-rule Danish grammar. Edit the resulting list to remove any tags you can see are typos or should otherwise not be allowed, stuff it at the top of the grammar, and recompile the grammar. Any errors you get will be lines where forbidden tags are used, which can be whole sets if those sets aren't used in any rules.
Once you have a suitable STRICT-TAGS list, you can further trim the grammar by taking advantage of the fact that any tag listed in STRICT-TAGS may be used as an implicit set that contains only the tag itself. No more need for LIST N = N ; constructs.
LIST-TAGS is very similar to STRICT-TAGS, but only performs the final part of making LIST N = N ; superfluous. Any tag listed in LIST-TAGS has an implicit set created for it.
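A minimal sketch of a grammar opening that uses STRICT-TAGS both as a whitelist and as a source of implicit sets. The tag names are illustrative only.

```cg3
# Whitelist of allowed tags.
STRICT-TAGS += N V ADJ DET @SUBJ ;

# Listed tags double as implicit single-tag sets,
# so N and DET can be used without LIST N = N ; or LIST DET = DET ;
SELECT N IF (-1 DET) ;

# A misspelled tag such as ADJJ is not on the list, so uncommenting
# the rule below should produce an error when the grammar is parsed:
# REMOVE (ADJJ) IF (1 V) ;
```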
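A short sketch, assuming LIST-TAGS accepts the same += syntax as STRICT-TAGS (the text above does not show it). The tag names are illustrative.

```cg3
# Assumed syntax, mirroring STRICT-TAGS += ... ;
LIST-TAGS += N V ADJ ;

# N and V can now be used as sets directly, but tags that are not
# listed are still allowed elsewhere in the grammar.
SELECT N IF (1 V) ;
```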