The cg-conv tool converts between various stream formats.
The cg-proc front-end processes the Apertium stream format directly. Alternatively, Apertium input can be converted for use via cg-conv.
The VISL CG stream format is a verticalized list of word forms with readings and optional plain text in between. For example, the sentence "They went to the zoo to look at the bear." would in VISL format look akin to:
"<They>"
	"they" <*> PRON PERS NOM PL3 SUBJ
"<went>"
	"go" V PAST VFIN
"<to>"
	"to" PREP
"<the>"
	"the" DET CENTRAL ART SG/PL
"<zoo>"
	"zoo" N NOM SG
"<to>"
	"to" INFMARK>
"<look>"
	"look" V INF
"<at>"
	"at" PREP
"<the>"
	"the" DET CENTRAL ART SG/PL
"<bear>"
	"bear" N NOM SG
"<.>"
Or in CG terms:
"<word form>" static_tags
	"base form" tags
Also known as:
"<surface form>" static_tags
	"lexeme" tags
In more formal rules:
If the line begins with "< followed by non-quotes and/or escaped quotes followed by >"
(regex /^"<(.|\\")*>"/) then it opens a new cohort.
If the line begins with whitespace followed by " followed by non-quotes and/or escaped quotes followed by "
(regex /^\s+"(.|\\")*"/) then it is parsed as a reading, but only if a cohort is open at the time.
Thus, any such lines seen before the first cohort are treated as text.
Any line not matching the above is treated as text. Text is handled in two ways: If no cohort is open at the time, then it is output immediately. If a cohort is open, then it is appended to that cohort's buffer and output after the cohort. Note that text between readings will thus be moved to after the readings. Re-arranging cohorts will also re-arrange the text attached to them. Removed cohorts will still output their attached text.
This means that you can embed all kinds of extra information in the stream as long as you don't hit those exact
patterns. For example, we use <s id="unique-1234"> </s> tags around sentences to keep track of them
for corpus markup.
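The formal rules above can be sketched as a small line classifier. This is only an illustration in Python, not how cg-conv itself is implemented: the two regular expressions are copied from the rules, while the function name `classify_lines` and the label strings are made up for the example.

```python
import re

# Patterns copied from the formal rules above.
COHORT_RE = re.compile(r'^"<(.|\\")*>"')
READING_RE = re.compile(r'^\s+"(.|\\")*"')


def classify_lines(lines):
    """Label each line as 'cohort', 'reading', or 'text' (hypothetical helper)."""
    cohort_open = False
    labelled = []
    for line in lines:
        if COHORT_RE.match(line):
            cohort_open = True
            labelled.append(("cohort", line))
        elif cohort_open and READING_RE.match(line):
            # Reading lines only count once a cohort has been opened.
            labelled.append(("reading", line))
        else:
            # Everything else, including <s id="..."> markup, is text.
            labelled.append(("text", line))
    return labelled
```

A reading-like line that appears before the first cohort comes back labelled as text, and so does a line such as `<s id="unique-1234">`, which matches neither pattern.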
Niceline input can be converted for use via cg-conv.
The Niceline format is primarily used in VISL and GrammarSoft chains to make the output more readable. Using the same example as for VISL CG format, that would look like:
They	[they] <*> PRON PERS NOM PL3 SUBJ
went	[go] V PAST VFIN
to	[to] PREP
the	[the] DET CENTRAL ART SG/PL
zoo	[zoo] N NOM SG
to	[to] INFMARK>
look	[look] V INF
at	[at] PREP
the	[the] DET CENTRAL ART SG/PL
bear	[bear] N NOM SG
.
Or in CG terms:
word form TAB [base form] tags TAB [base form] tags
...or quotes...
word form TAB "base form" tags TAB "base form" tags
...or mixed...
word form TAB "base form" tags TAB [base form] tags
In more formal rules:
If the line does not begin with < and contains a tab (\t, 0x09), then it is a cohort. Anything up to the first tab is the word form. Readings are tab delimited, where if the first tag is contained in [] or "" then it is taken as the base form. Tags are otherwise whitespace delimited.
Any line not matching the above is treated as text, same rules as for VISL CG format. Note that a tab character is required for it to be a cohort - a word or punctuation without the tab will be treated as text.
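As a rough illustration of the Niceline rules, the following Python sketch splits one line into a word form and its readings. The function name `parse_niceline` and the returned tuple layout are assumptions made for this example. Escaped quotes inside a base form are not handled, which is a simplification rather than documented behaviour.

```python
def parse_niceline(line):
    """Return (wordform, readings) for a cohort line, or None for text.

    Each reading is a (baseform, tags) pair; baseform is None when the
    first tag is not wrapped in [] or "". Hypothetical helper.
    """
    line = line.rstrip("\n")
    # A cohort must not begin with '<' and must contain a tab.
    if line.startswith("<") or "\t" not in line:
        return None
    wordform, *fields = line.split("\t")
    readings = []
    for field in fields:
        field = field.strip()
        if not field:
            continue
        baseform = None
        # The base form may contain spaces, so look for its closing delimiter.
        closer = {"[": "]", '"': '"'}.get(field[0])
        if closer is not None:
            end = field.find(closer, 1)
            if end != -1:
                baseform = field[1:end]
                field = field[end + 1:]
        # Remaining tags are whitespace delimited.
        readings.append((baseform, field.split()))
    return wordform, readings
```

A line with several tab-separated readings yields one pair per reading, and a bare `.` without a tab returns `None`, meaning it is treated as text.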
Plain text can be tokenized for use via cg-conv. The built-in tokenizer is naive and is included only as a last resort; you should not rely on it. Five minutes in any scripting language should give you a much better tokenizer.
The tokenization rules are simple:
Split tokens on any kind of whitespace.
Split punctuation off the start and end of tokens. Each punctuation character becomes a separate token.
Detect whether the token is ALLUPPER, Firstupper, or MiXeDCaSe and add a tag denoting it.
The token then becomes a cohort with one reading with a lower-case variant of the token as base form.
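To make the four rules concrete, here is a minimal Python sketch of the same steps. It is not the cg-conv tokenizer. The tag names `<all-upper>`, `<first-upper>`, and `<mixed-case>` are placeholders invented for the example, and treating only ASCII punctuation as punctuation is likewise an assumption.

```python
import string

PUNCT = set(string.punctuation)  # assumption: ASCII punctuation only


def case_tag(token):
    """Return a placeholder case tag, or None for lower-case or caseless tokens."""
    if len(token) > 1 and token.isupper():
        return "<all-upper>"
    if token[0].isupper() and not any(c.isupper() for c in token[1:]):
        return "<first-upper>"
    if any(c.isupper() for c in token):
        return "<mixed-case>"
    return None


def tokenize(text):
    """Yield (wordform, baseform, tags) triples following the rules above."""
    for chunk in text.split():  # rule 1: split on any whitespace
        start, end = 0, len(chunk)
        while start < end and chunk[start] in PUNCT:
            start += 1
        while end > start and chunk[end - 1] in PUNCT:
            end -= 1
        # rule 2: each leading or trailing punctuation character is its own token
        pieces = list(chunk[:start])
        if start < end:
            pieces.append(chunk[start:end])
        pieces.extend(chunk[end:])
        for piece in pieces:
            tag = case_tag(piece)  # rule 3: case detection
            # rule 4: one reading, lower-cased token as base form
            yield piece, piece.lower(), [tag] if tag else []
```

For the input `The ZOO (iPhone).` this produces seven triples: `The`, `ZOO`, `(`, `iPhone`, `)`, and `.`, with the three words tagged as first-upper, all-upper, and mixed-case respectively and the punctuation left untagged. Punctuation inside a token, such as the apostrophe in `don't`, is left alone.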
The binary format can be generated by cg-conv and can be parsed either by cg-conv or by the Python bindings. It is designed for faster parsing than the textual formats. The intended use case is when the same input needs to be processed multiple times (such as when testing several grammars).
The stream begins with a header containing CGBF followed by a 4-byte version number (currently 1).
After that, each packet begins with 1 byte indicating its contents.
1 is a window, 2 is a command, and 3 is text.
Command packets have a second byte identifying the command: 1 for FLUSH, 2 for EXIT, 3 for IGNORE, and 4 for RESUME.
Commands which manipulate variables are represented in window packets.
Text packets consist of a 2-byte length followed by the contents in UTF-8.
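The framing described so far can be sketched in Python as follows. Several details are assumptions, because the text does not state them: all integers are read as little-endian, and the 4-byte length at the start of a window packet is taken to count the bytes that follow it. Window contents are returned undecoded. The names `read_exact` and `read_packets` are made up for the example and are not part of the Python bindings.

```python
import io
import struct

COMMANDS = {1: "FLUSH", 2: "EXIT", 3: "IGNORE", 4: "RESUME"}


def read_exact(stream, n):
    """Read exactly n bytes or raise EOFError."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("truncated stream")
    return data


def read_packets(stream):
    """Yield (kind, payload) pairs. Little-endian byte order is an assumption."""
    if read_exact(stream, 4) != b"CGBF":
        raise ValueError("missing CGBF header")
    (version,) = struct.unpack("<I", read_exact(stream, 4))
    if version != 1:
        raise ValueError(f"unsupported version {version}")
    while True:
        kind = stream.read(1)
        if not kind:
            return  # clean end of stream
        if kind[0] == 1:
            # Window: 4-byte length (assumed to exclude itself), body left raw.
            (length,) = struct.unpack("<I", read_exact(stream, 4))
            yield "window", read_exact(stream, length)
        elif kind[0] == 2:
            yield "command", COMMANDS[read_exact(stream, 1)[0]]
        elif kind[0] == 3:
            (length,) = struct.unpack("<H", read_exact(stream, 2))
            yield "text", read_exact(stream, length).decode("utf-8")
        else:
            raise ValueError(f"unknown packet type {kind[0]}")


# Hand-built stream: header, one text packet, one FLUSH command.
demo = b"CGBF" + struct.pack("<I", 1) + b"\x03" + struct.pack("<H", 2) + b"hi" + b"\x02\x01"
packets = list(read_packets(io.BytesIO(demo)))
```

Before using anything like this against real cg-conv output, check the byte order against a file it has actually produced.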
Each window packet begins with 4 bytes specifying the length of the block and then the following structure:
window flags [2]
    > 1 = has multi-window dependencies
tags [array of str]
variables [array]
    mode
        > 1 = SETVAR (var = val)
        > 2 = SETVAR (var = *)
        > 3 = REMVAR
    var [tag]
    val or 0 [tag]
text [str]
text_post [str]
cohorts [array]
    flags [2]
        > 1 = is target of a relation
    wordform [tag]
    static_tags [array of tag]
    dep_self [4]
    dep_parent or 0xFFFFFFFF [4]
    relations [array]
        tag [tag]
        head [4]
    text [str]
    wblank [str]
    readings [array]
        flags [2]
            > 1 = is subreading of predecessor
            > 2 = deleted
        baseform [tag]
        tags [array of tag]
Where arrays and strings are both encoded with a 2-byte length followed by the specified number of objects or UTF-8 bytes.
Each item of type [tag] is a 2-byte index into the window-wide tags array.
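The length-prefixed encoding can be illustrated with a small cursor class in Python. This is a sketch under stated assumptions, not the format's reference decoder: little-endian byte order is a guess, and the names `Reader` and `read_window_prefix` are hypothetical. It decodes only the first two fields of a window block, the window flags and the window-wide tags array.

```python
import struct


class Reader:
    """Cursor over one window block. Little-endian is an assumption."""

    def __init__(self, data):
        self.data = data
        self.pos = 0

    def u16(self):
        (value,) = struct.unpack_from("<H", self.data, self.pos)
        self.pos += 2
        return value

    def string(self):
        # 2-byte length followed by that many UTF-8 bytes.
        length = self.u16()
        raw = self.data[self.pos:self.pos + length]
        self.pos += length
        return raw.decode("utf-8")

    def array(self, read_item):
        # 2-byte count followed by that many items.
        return [read_item() for _ in range(self.u16())]


def read_window_prefix(block):
    """Decode window flags [2] and tags [array of str]; return the reader too."""
    reader = Reader(block)
    flags = reader.u16()
    tags = reader.array(reader.string)
    return flags, tags, reader
```

Later [tag] fields could then be read with `reader.u16()` and looked up in `tags`. Whether index 0 is reserved (as "val or 0" might suggest) is not stated here, so the sketch leaves that lookup out.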