The Common Elementary Query Language (CEQL)

What is CEQL?

CEQL is a simplified query language which gives access to all the most frequently-used features of CQP, but in a form that is much more accessible and friendly for beginners. For example, rather than full regular expressions, CEQL uses simpler wildcard characters - such as a single star (*) instead of the regex dot-star (.*) to mean "any amount of anything".

As such it is particularly suited to being implemented in web interfaces. Its most prominent use is in BNCweb, where by default all queries use CEQL rather than CQP syntax. Indeed, much of its design is modelled around the annotation of the BNC (POS tags, simplified POS tags, and lemmata).

A more generalised and configurable version of CEQL is built into CQPweb. In CQPweb, the system administrator decides how the different query styles available in CEQL are linked to the underlying CWB corpus attributes.

In CQPweb and BNCweb, CEQL is referred to as “Simple Query Syntax” in an effort to reduce acronym prevalence. It's exactly the same thing, though!

Some other web interfaces with a CWB backend also allow queries to be expressed in CEQL.

The CEQL parser, which converts CEQL query strings into the corresponding CQP-syntax queries, is not part of the CWB core. Instead, a reference implementation is included in the main CWB Perl module (CWB::CEQL). Recent versions of CQPweb include an implementation of CEQL in PHP.

Where can you read more about CEQL?

For a comprehensive introduction to CEQL with many examples and exercises see Chapter 6 (pp. 93-117) of Hoffmann et al. (2008), Corpus Linguistics with BNCweb.

The full specification of CEQL syntax can be found below on this page. A short “cheat sheet” summary is built into both BNCweb and CQPweb.

Outside the web interfaces, there are other ways to get documentation. When you have installed CWB/Perl, the command perldoc CWB::CEQL will open a manual page describing the reference implementation of CEQL, including all grammar rules. Read this page if you want to use CEQL in your own software. If you wish to extend the CEQL grammar or implement your own simple query syntax, see perldoc CWB::CEQL::Parser for details on the grammar formalism and the parser implementation.

CEQL syntax specification

Wildcard patterns

CEQL is based on wildcard patterns for matching word forms and annotations. A wildcard pattern by itself finds all tokens whose surface form matches the pattern. Wildcard patterns must not contain blanks or other whitespace.

The basic wildcards are

    ?    a single arbitrary character
    *    zero or more characters
    +    one or more characters

These wildcards are often used for prefix or suffix searches, e.g. +able (all words ending in "-able" except for the word "able" itself). Clusters of wildcards specify a minimum number of characters, e.g. ???* for 3 or more.
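A few more illustrative patterns (the search words here are arbitrary examples):

    ?ay       three-letter words ending in "ay" ("bay", "day", "way", ...)
    super*    "super" itself and any longer word beginning with "super"
    un+       words beginning with "un-" except for "un" itself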

Most other characters match only themselves. However, all CEQL metacharacters (not just wildcards) must be escaped with a backslash \ to match the literal character (e.g. \? to find a question mark). The full set of metacharacters in the core CEQL grammar is

    ? * + , : ! @ / ( ) [ ] { } _ - < >

Some of them are only interpreted as metacharacters in particular contexts. It is safest, and recommended, to escape every literal ASCII punctuation character with a backslash.

Groups of alternatives are separated by commas and enclosed in square brackets, e.g. [north,south,west,east]. They can include wildcards and an empty alternative can be appended to make the entire set optional (e.g. walk[s,ed,ing,] to match any form of the verb "walk").
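Alternatives can also appear at the start of a pattern. As an arbitrary illustration, the following query combines an optional prefix with an optional suffix set:

    [un,]lock[s,ed,ing,]

This matches "lock", "locks", "locked" and "locking" as well as "unlock", "unlocks", "unlocked" and "unlocking".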

Various escape sequences, consisting of a backslash followed by a letter, match specific sets and sequences of characters. Escape sequences recognised by the core CEQL grammar are:

    \a   any single letter
    \A   any sequence of letters (one or more)
    \l   any single lowercase letter
    \L   any sequence of lowercase letters (one or more)
    \u   any single uppercase letter
    \U   any sequence of uppercase letters (one or more)
    \d   any single digit
    \D   any sequence of digits (one or more)
    \w   a single "word" character (letter, number, apostrophe, hyphen)
    \W   any sequence of "word" characters (one or more)

The escape sequences are guaranteed to work correctly for UTF-8 encoded corpora, but may not be fully supported for legacy 8-bit encodings (in which case they might only match ASCII letters and digits).

Wildcard patterns can be negated with a leading exclamation mark !; a negated pattern finds any string that does not match the pattern.
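Negation combines freely with wildcards and escape sequences. Two arbitrary illustrative patterns:

    !*ly    any token that does not end in "ly"
    !\D     any token that is not a pure sequence of digits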

Linguistic annotation

CEQL queries provide access to three items of token-level annotation in addition to surface forms. They are described below as lemma, POS (part-of-speech tag) and simple POS, reflecting the annotation scheme CEQL was originally designed around. Keep in mind, however, that corpus search interfaces may be configured to map these slots to other annotation layers (say, semantic tags instead of simple POS).

A lemma search is carried out by enclosing the wildcard pattern in curly braces, e.g. {go}. All elements of the wildcard pattern described above must be enclosed in the braces, including negation ({!go}). Note that word form and lemma constraints are mutually exclusive on the same token.

A single-token expression in CEQL combines such a lexical constraint with a part-of-speech tag, separated by an underscore _. The POS tag can either be matched directly with a wildcard pattern, or one of a pre-defined set of simple POS tags can be selected (in curly braces). There are four possible combinations for a full token expression:

    WORD_POS
    {LEMMA}_POS
    WORD_{Simple POS}
    {LEMMA}_{Simple POS}

Keep in mind that POS tags may differ between corpora and make sure to read documentation on the respective tagset for successful POS searches. Full POS constraints are wildcard patterns, which is convenient with complex tagsets. In particular, the pattern can be negated, e.g. can_!MD to exclude the frequent modal reading of can. Also keep in mind that simple POS tags are available only if they have been set up for the corpus at hand by an administrator. Even though simple POS constraints aren't wildcard patterns, they can be negated (e.g. {walk}_{!V}).

The lexical constraint can be omitted in order to match a token only by its POS tag. Assuming the Penn Treebank tagset and a simple POS tag A for adjectives, these four token expressions are fully equivalent:

    _JJ*     *_JJ*
    _{A}     *_{A}

Optional modifier flags can be appended to each constraint: :c for case-insensitive matching, :d to ignore diacritics (Unicode combining marks, including all accents and umlauts) and :cd for both. If an annotation defaults to case- or diacritic-insensitive mode, this can be overridden with an uppercase modifier :C, :D or :CD. (Mixed combinations are allowed, e.g. :Cd to override a case-insensitive default but ignore diacritics.) Keep in mind that modifiers go outside curly braces:

    {fiancee}:cd_N*:C

Phrase queries

Phrase queries match sequences of tokens. They consist of one or more token expressions separated by whitespace. Note that the query has to match the tokenization conventions of the corpus at hand. For example, a tag question (", isn't it?") is typically split into five tokens and can be found with the query

    \, is n't it \?

A single + stands for an arbitrary token, a single * for an optional token. Multiple + and/or * can (and should) be bundled for a flexible number of tokens, e.g. ++*** for 2 to 5 arbitrary tokens.

Groups of tokens can be enclosed in round parentheses within a phrase query. Such groups may contain alternatives delimited by pipe symbols (vertical bar, |):

    it was ( ...A... | ...B... | ...C... )

will find "it was" followed by a token sequence that matches either the phrase query A, the phrase query B or the phrase query C. Empty alternatives are not allowed in this case. Whitespace can be omitted after the opening parenthesis, around the pipe symbols and before the closing parenthesis.
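For instance (an arbitrary example, assuming lemma annotation is available):

    {kick} the ( bucket | habit )

finds any form of the verb "kick" followed by "the" and then either "bucket" or "habit".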

A quantifier can be appended to the closing parenthesis of a group, whether or not it includes alternatives. Note that there must not be any whitespace between the closing parenthesis and the quantifier (otherwise it would be interpreted as a separate token expression). Quantifiers specify repetition of the group:

    ( ... )?        0 or 1 (group is optional)
    ( ... )*        0 or more
    ( ... )+        1 or more
    ( ... ){N}      exactly N
    ( ... ){N,M}    between N and M
    ( ... ){N,}     at least N
    ( ... ){0,M}    at most M
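Two arbitrary illustrations, assuming the Penn Treebank tagset used in the examples above:

    the (_JJ*)? _NN*        "the" + an optional adjective + a noun
    the (_JJ*){1,3} _NN*    "the" + one to three adjectives + a noun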

Groups can contain further subgroups with alternatives and quantification. Note that group notation is needed to match an open-ended number of arbitrary tokens; it can also be more readable for finite ranges:

    (+)?            same as: *
    (+)*            any number of arbitrary tokens
    (+)+            at least one arbitrary token
    (+){2,5}        same as: ++***

You can think of the group (+) as a matchall symbol for an arbitrary token.

A token expression can be marked as an anchor point with an initial @ sign (for the "target" anchor). There must be no whitespace between the marker and the token expression. Numbered anchors are set with @0:, @1: through @9:. By default, @0: sets the "target" anchor and @1: sets the "keyword" anchor. Further numbered anchors need special support from the GUI software executing the CEQL queries.
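A sketch of anchor placement (an arbitrary example): the query below finds sequences such as "as good as" and marks the middle word as the target anchor.

    as @\A as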

Use XML tags to match the start and end of an s-attribute region, e.g. <s> for the start of a sentence and </s> for a sentence end. Since such tags denote token boundaries rather than full tokens, a tag by itself is not a valid query: always specify at least one token expression. A list of all <text> regions is obtained with

    <text> +

which matches the first token in each text. A pair of corresponding start and end tags matches a complete s-attribute region, e.g.

    <quote> (+)+ </quote>

which matches a <quote> region containing an arbitrary number of tokens (but keep in mind that CQP imposes limits on the number of tokens that can be matched, so very long quotations might not be found).

Attributes on XML start tags can be tested with the notation

    <tag_attribute=PATTERN>

where PATTERN is a wildcard pattern, possibly including negation and case/diacritic modifier flags. It is a quirk of the underlying CQP query language that every XML tag annotation is represented as a separate s-attribute following the indicated naming convention. Therefore, multiple start tags must be specified in order to test several annotations. Also keep in mind that an end tag with the same name is required for matching a full region. A named entity annotated in the input text as

    ... <ne type="ORG" status="fictional">Sirius Cybernetics Corp.</ne> ...

would be matched by the query

    <ne_type=org:c> <ne_status=fict*> (+)+ </ne_type>

Phrase queries can use different matching strategies, selected by a modifier at the start of the query. The default strategy (explicitly selected with (?standard)) includes optional elements at the start of the query, but uses non-greedy matching afterwards; in particular all optional elements at the end of the query are dropped. In some cases, the (?longest) strategy can be useful to include such optional elements and enable greedy matching of quantifiers. See the CQP Query Language Tutorial, Sec. 6.1 for details on matching strategies.

Proximity queries

Proximity queries match co-occurrence patterns. They also build on token expressions, but do not allow any of the constructions of phrase queries. Instead, tokens are filtered based on their co-occurrence with other tokens. There are six basic forms of co-occurrence tests:

    A <<N>> B       B occurs within N tokens around A
    A <<N<< B       B occurs within N tokens to the left of A
    A >>N>> B       B occurs within N tokens to the right of A
    A <<REG>> B     A and B occur in the same region of s-attribute REG

    A <<K,N<< B     B occurs within N tokens to the left of A,
                    but at a distance of at least K tokens
    A >>K,N>> B     B occurs within N tokens to the right of A,
                    but at a distance of at least K tokens

In each case, those occurrences of token expression A are returned which satisfy the constraint. The corresponding positions of B cannot be accessed in the query result. As an example,

    {bucket} <<s>> {kick}_V*

would return all instances of the lemma "bucket" that occur in the same sentence as the verb "kick", but not the matching instances of "kick".

A and B can also be proximity queries themselves, using parentheses to determine the order of evaluation. As an example,

    (A <<3<< B) <<s>> (C <<2>> D)

finds all instances of A that are preceded by B (within 3 tokens to the left) and that also occur in the same sentence as a combination of C and D (within 2 tokens). Proximity queries can be nested to arbitrary depth.

There are two special cases for sequences without parentheses:

    A <<5>> B <<3<< C <<s>> D

applies multiple tests to the instance of A, i.e. it is implicitly parenthesised as

    ((A <<5>> B) <<3<< C) <<s>> D

A sequence of token expressions without any co-occurrence specifiers in between is interpreted as neighbouring tokens, i.e.

    out of {coin}

is rewritten to

    out >>1>> of >>2>> {coin}

and therefore returns only the positions of "out".

Neither XML tags nor anchor points are supported by proximity queries. Likewise, co-occurrence constraints cannot be negated, i.e. you cannot test for non-cooccurrence.