Chapter 7. Input/Output Stream Format


Table of Contents

Apertium Format
HFST/XFST Format
VISL CG Format
Niceline CG Format
Plain Text Format
Binary Format

The cg-conv tool converts between various stream formats.

Apertium Format

The cg-proc front-end processes the Apertium stream format directly. Alternatively, Apertium input can be converted for use via cg-conv.

HFST/XFST Format

HFST/XFST input can be converted for use via cg-conv.

VISL CG Format

The VISL CG stream format is a verticalized list of word forms with readings and optional plain text in between. For example, the sentence "They went to the zoo to look at the bear." would look something like this in VISL format:

        "<They>"
            "they" <*> PRON PERS NOM PL3 SUBJ
        "<went>"
            "go" V PAST VFIN
        "<to>"
            "to" PREP
        "<the>"
            "the" DET CENTRAL ART SG/PL
        "<zoo>"
            "zoo" N NOM SG
        "<to>"
            "to" INFMARK>
        "<look>"
            "look" V INF
        "<at>"
            "at" PREP
        "<the>"
            "the" DET CENTRAL ART SG/PL
        "<bear>"
            "bear" N NOM SG
        "<.>"

Or in CG terms:

        "<word form>" static_tags
            "base form" tags

Also known as:

        "<surface form>" static_tags
            "lexeme" tags

In more formal rules:

If the line begins with "< followed by non-quotes and/or escaped quotes followed by >" (regex /^"<(.|\\")*>"/) then it opens a new cohort.
If the line begins with whitespace followed by " followed by non-quotes and/or escaped quotes followed by " (regex /^\s+"(.|\\")*"/) then it is parsed as a reading, but only if a cohort is open at the time. Thus, any such lines seen before the first cohort are treated as text.
Any line not matching the above is treated as text. Text is handled in two ways: If no cohort is open at the time, then it is output immediately. If a cohort is open, then it is appended to that cohort's buffer and output after the cohort. Note that text between readings will thus be moved to after the readings. Re-arranging cohorts will also re-arrange the text attached to them. Removed cohorts will still output their attached text.

This means that you can embed all kinds of extra information in the stream as long as you don't hit those exact patterns. For example, we use <s id="unique-1234"> </s> tags around sentences to keep track of them for corpus markup.
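The line classification rules above can be sketched in a few lines of Python. This is only an illustration, not how the parser is actually implemented: the two regular expressions are copied from the rules, while the function name classify_lines and the labels 'cohort', 'reading', and 'text' are made up for the example. Buffering of text attached to cohorts is left out.

```python
import re

# Patterns copied verbatim from the formal rules; everything else is illustrative.
COHORT_RE = re.compile(r'^"<(.|\\")*>"')
READING_RE = re.compile(r'^\s+"(.|\\")*"')


def classify_lines(lines):
    """Yield (kind, line) pairs, where kind is 'cohort', 'reading', or 'text'."""
    cohort_open = False
    for line in lines:
        if COHORT_RE.match(line):
            cohort_open = True
            yield "cohort", line
        elif cohort_open and READING_RE.match(line):
            yield "reading", line
        else:
            # Includes reading-like lines seen before the first cohort.
            yield "text", line


sample = [
    '<s id="unique-1234">',
    '    "stray" N',
    '"<They>"',
    '    "they" <*> PRON PERS NOM PL3 SUBJ',
    '"<went>"',
    '    "go" V PAST VFIN',
    '</s>',
]
kinds = [kind for kind, _ in classify_lines(sample)]
print(kinds)
```

Note that the reading-like line before the first cohort and both <s> tags come out as text, which is what allows markup to pass through the stream.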

Niceline CG Format

Niceline input can be converted for use via cg-conv.

The Niceline format is primarily used in VISL and GrammarSoft chains to make the output more readable. Using the same example as for VISL CG format, that would look like:

        They  [they] <*> PRON PERS NOM PL3 SUBJ
        went  [go] V PAST VFIN
        to    [to] PREP
        the   [the] DET CENTRAL ART SG/PL
        zoo   [zoo] N NOM SG
        to    [to] INFMARK>
        look  [look] V INF
        at    [at] PREP
        the   [the] DET CENTRAL ART SG/PL
        bear  [bear] N NOM SG
        .

Or in CG terms:

        word form TAB [base form] tags TAB [base form] tags
        ...or quotes...
        word form TAB "base form" tags TAB "base form" tags
        ...or mixed...
        word form TAB "base form" tags TAB [base form] tags

In more formal rules:

If the line does not begin with < and contains a tab (\t, 0x09), then it is a cohort. Anything up to the first tab is the word form. Readings are tab delimited, where if the first tag is contained in [] or "" then it is taken as the base form. Tags are otherwise whitespace delimited.
Any line not matching the above is treated as text, same rules as for VISL CG format. Note that a tab character is required for it to be a cohort - a word or punctuation without the tab will be treated as text.
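As a rough illustration of the rules above, the following Python sketch splits a single Niceline line into a word form and readings. The names parse_niceline and split_reading are hypothetical, and the sketch assumes that a base form ends at the first closing ] or " (escaped quotes inside a base form are not handled).

```python
def split_reading(field):
    """Split one tab-delimited field into (base form or None, list of tags)."""
    field = field.strip()
    closers = {"[": "]", '"': '"'}
    baseform = None
    if field and field[0] in closers:
        # Assumption: the base form ends at the first matching closer.
        end = field.find(closers[field[0]], 1)
        if end != -1:
            baseform = field[1:end]
            field = field[end + 1:]
    return baseform, field.split()


def parse_niceline(line):
    """Return ('cohort', wordform, readings) or ('text', line)."""
    line = line.rstrip("\n")
    if line.startswith("<") or "\t" not in line:
        return ("text", line)
    wordform, *fields = line.split("\t")
    readings = [split_reading(f) for f in fields if f.strip()]
    return ("cohort", wordform, readings)


cohort = parse_niceline('to\t[to] PREP\t"to" INFMARK>')
text = parse_niceline(".")
print(cohort)
print(text)
```

The second call shows the tab requirement in practice: a lone "." without a tab is passed through as text rather than becoming a cohort.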

Plain Text Format

Plain text can be tokenized for use via cg-conv. It is a naive tokenizer that you should not use, and is only included as a last resort. Five minutes in any scripting language should give you a much better tokenizer.

The tokenization rules are simple:

Split tokens on any kind of whitespace.
Split punctuation from the start and end of tokens into tokens. Each punctuation character becomes a separate token.
Detect whether the token is ALLUPPER, Firstupper, or MiXeDCaSe and add a tag denoting it.
The token then becomes a cohort with one reading with a lower-case variant of the token as base form.
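The four steps can be approximated in Python as below. This is a sketch of the described behavior, not the tool's code, and it rests on several assumptions: punctuation is taken to mean any character in a Unicode P* category, the tag names <allupper>, <firstupper>, and <mixedcase> are placeholders (the actual tag names are not given here), and a single upper-case letter is counted as Firstupper.

```python
import unicodedata


def is_punct(ch):
    # Assumption: "punctuation" means any Unicode category starting with P.
    return unicodedata.category(ch).startswith("P")


def split_punct(token):
    """Peel punctuation off both ends, one character per token."""
    start, end = 0, len(token)
    while start < end and is_punct(token[start]):
        start += 1
    while end > start and is_punct(token[end - 1]):
        end -= 1
    core = [token[start:end]] if start < end else []
    return list(token[:start]) + core + list(token[end:])


def case_tag(token):
    """Return a placeholder case tag, or None if the token has no upper case."""
    if not any(c.isupper() for c in token):
        return None
    if len(token) > 1 and token.isupper():
        return "<allupper>"
    if token[0].isupper() and not any(c.isupper() for c in token[1:]):
        return "<firstupper>"
    return "<mixedcase>"


def tokenize(text):
    """Return a list of (word form, base form, tags) tuples."""
    cohorts = []
    for chunk in text.split():
        for token in split_punct(chunk):
            tags = [t for t in (case_tag(token),) if t]
            cohorts.append((token, token.lower(), tags))
    return cohorts


cohorts = tokenize('They saw the ZOO, "McBear".')
for wordform, baseform, tags in cohorts:
    print(wordform, baseform, tags)
```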

Binary Format

The binary format can be generated by cg-conv and can be parsed either by cg-conv or by the Python bindings. It is designed for faster parsing than the textual formats. The intended use case is when the same input needs to be processed multiple times, such as when testing several grammars.

The stream begins with a header containing CGBF followed by a 4-byte version number (currently 1). After that, each packet begins with 1 byte indicating its contents: 1 is a window, 2 is a command, and 3 is text.

Command packets have a second byte identifying the command: 1 for FLUSH, 2 for EXIT, 3 for IGNORE, and 4 for RESUME. Commands which manipulate variables are represented in window packets.

Text packets consist of a 2-byte length followed by the contents in UTF-8.
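To make the packet framing concrete, here is a minimal Python reader for the header, command packets, and text packets. It is a sketch under stated assumptions, not a reference implementation: the byte order is not specified in this chapter, so little-endian is assumed, and window packets are assumed to carry a 4-byte length that counts only the bytes following it. Window bodies are returned undecoded. The names read_packets and read_exact are invented for the example, and the demo stream is built by hand rather than produced by cg-conv.

```python
import io
import struct

COMMANDS = {1: "FLUSH", 2: "EXIT", 3: "IGNORE", 4: "RESUME"}
BYTE_ORDER = "<"  # Assumption: little-endian; the text does not say.


def read_exact(stream, n):
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("truncated stream")
    return data


def read_packets(stream):
    """Yield ('version', n), then one tuple per packet in the stream."""
    if read_exact(stream, 4) != b"CGBF":
        raise ValueError("missing CGBF header")
    (version,) = struct.unpack(BYTE_ORDER + "I", read_exact(stream, 4))
    yield ("version", version)
    while True:
        kind = stream.read(1)
        if not kind:
            return  # clean end of stream
        if kind[0] == 1:
            # Assumption: the length covers only the bytes after this field.
            (length,) = struct.unpack(BYTE_ORDER + "I", read_exact(stream, 4))
            yield ("window", read_exact(stream, length))
        elif kind[0] == 2:
            code = read_exact(stream, 1)[0]
            yield ("command", COMMANDS.get(code, code))
        elif kind[0] == 3:
            (length,) = struct.unpack(BYTE_ORDER + "H", read_exact(stream, 2))
            yield ("text", read_exact(stream, length).decode("utf-8"))
        else:
            raise ValueError(f"unknown packet type {kind[0]}")


# Hand-built demo stream: header, one text packet, FLUSH, one tiny window.
demo = (
    b"CGBF" + struct.pack("<I", 1)
    + b"\x03" + struct.pack("<H", 6) + b"hello\n"
    + b"\x02\x01"
    + b"\x01" + struct.pack("<I", 2) + b"\x00\x00"
)
packets = list(read_packets(io.BytesIO(demo)))
print(packets)
```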

Each window packet begins with 4 bytes specifying the length of the block and then the following structure:

        window flags [2]
          > 1 = has multi-window dependencies
        tags [array of str]
        variables [array]
          mode
            > 1 = SETVAR (var = val)
            > 2 = SETVAR (var = *)
            > 3 = REMVAR
          var [tag]
          val or 0 [tag]
        text [str]
        text_post [str]
        cohorts [array]
          flags [2]
            > 1 = is target of a relation
          wordform [tag]
          static_tags [array of tag]
          dep_self [4]
          dep_parent or 0xFFFFFFFF [4]
          relations [array]
            tag [tag]
            head [4]
          text [str]
          wblank [str]
          readings [array]
            flags [2]
              > 1 = is subreading of predecessor
              > 2 = deleted
            baseform [tag]
            tags [array of tag]

Arrays and strings are both encoded with a 2-byte length followed by that number of objects (for arrays) or UTF-8 bytes (for strings). Each item of type [tag] is a 2-byte index into the window-wide tags array.
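The length-prefixed encoding can be illustrated by decoding the first two fields of a window body, the window flags and the tags array. As before, this is a hedged sketch: little-endian byte order is an assumption, [2] is read as a 2-byte field, and the helper names read_u16, read_str, read_str_array, and write_str are hypothetical. Resolving [tag] indexes, including how an index of 0 should be interpreted, is deliberately left out.

```python
import struct

BYTE_ORDER = "<"  # Assumption: little-endian; the text does not say.


def read_u16(buf, pos):
    (value,) = struct.unpack_from(BYTE_ORDER + "H", buf, pos)
    return value, pos + 2


def read_str(buf, pos):
    """Decode a 2-byte length followed by that many UTF-8 bytes."""
    length, pos = read_u16(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length


def read_str_array(buf, pos):
    """Decode a 2-byte count followed by that many strings."""
    count, pos = read_u16(buf, pos)
    items = []
    for _ in range(count):
        item, pos = read_str(buf, pos)
        items.append(item)
    return items, pos


def write_str(s):
    # Encoder used only to build the demo bytes below.
    data = s.encode("utf-8")
    return struct.pack(BYTE_ORDER + "H", len(data)) + data


tags = ['"<zoo>"', '"zoo"', "N"]
body = struct.pack(BYTE_ORDER + "H", 0)  # window flags
body += struct.pack(BYTE_ORDER + "H", len(tags))
body += b"".join(write_str(t) for t in tags)

flags, pos = read_u16(body, 0)
decoded, pos = read_str_array(body, pos)
print(flags, decoded)
```

Returning the new offset from each helper keeps the decoder free of shared state, so the remaining fields of the window could be read by chaining further calls in the order listed in the structure.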
