Query language

Let’s use our example sentence again to demonstrate the power of CATMA’s query language in more detail. (If you’d rather jump ahead to a complete systematic description, call up the comprehensive manual.)

In the sentence

Snoopy had lunch, and Tigger had breakfast.

we had highlighted the words Snoopy and Tigger and then assigned the Tag <Animal> to both of them. In CATMA we can now search for all occurrences of this Tag by executing the query

tag = "Animal"

This can then be combined with another powerful feature of the query language: the Refinement. To find all occurrences of the word dog that have been annotated with <Animal>, the query would take the form:

"dog" where tag = "Animal"

But what if an annotation operates at a coarser level of granularity, such as sentences or paragraphs, or at a finer one, such as morphemes? The Refinement query offers a match mode for these cases too. To determine whether a segment of text and an annotation match, CATMA compares two so-called “offsets”:

  • Offset 1: What is the string position of the annotated segment’s first and last character? (Remember: by definition, string positions are fixed in a text. The first character has position 1, the second position 2, and so on.) In our example sentence

Snoopy had lunch, and Tigger had breakfast.

the “S” in Snoopy is at position 1 and the “y” is at position 6. These character positions determine the “offset” of the word Snoopy: 1 to 6.

  • Offset 2: What is the “range” of text that was highlighted when the relevant annotation was assigned? If you highlighted our example sentence like this:

Snoopy had lunch, and Tigger had breakfast.

and then assigned the Tag

<Animal>

to the highlighted segment, you will have an exact match: the Annotation <Animal> refers to the range of characters in the positions 1-6. If, however, you had highlighted just one character beyond the word boundary of Snoopy

Snoopy had lunch, and Tigger had breakfast.

the match would no longer be exact – even if that additional character is just an empty space: these, too, are counted, as are punctuation marks, line feeds, and paragraph markers.
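The character counting described above can be reproduced in a few lines of Python. This is only an illustrative sketch: the helper name word_offsets is hypothetical, and it follows this section's convention of counting the first character as position 1, which may differ from how CATMA stores offsets internally.

```python
def word_offsets(text, word):
    """Return (first, last) positions of every occurrence of `word`.

    Positions are 1-based and inclusive, as in the tutorial text.
    """
    spans = []
    start = text.find(word)  # str.find returns a 0-based index, or -1
    while start != -1:
        spans.append((start + 1, start + len(word)))
        start = text.find(word, start + 1)
    return spans


sentence = "Snoopy had lunch, and Tigger had breakfast."

print(word_offsets(sentence, "Snoopy"))   # [(1, 6)]
print(word_offsets(sentence, "Snoopy "))  # [(1, 7)]  trailing space is counted
print(word_offsets(sentence, "Tigger"))   # [(23, 28)]
```

The second call shows why a highlight that runs one character past the word no longer lines up with the word's own offset.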

Using this simple comparison logic, CATMA can distinguish between three types of match between text and annotation when it evaluates a query Refinement:

  1. exact – this is the default. In the example above the range of the <Animal> annotation must have the same start and end offset as the range of the occurrence of Snoopy.
  2. boundary – the occurrence of Snoopy needs to be within the boundaries of the annotation <Animal>.
  3. overlap – the ranges of Snoopy and annotation <Animal> need to overlap.

So if we had highlighted our example sentence like this:

Snoopy had lunch, and Tigger had breakfast.

and then annotated this particular segment using the Tag

<Animal>

a Refinement query would produce the results:

exact = false: the two offsets differ

boundary = true: Snoopy is within the boundaries of the annotation range for <Animal>

In this example, the overlap criterion would also be met, since any range that lies within another necessarily overlaps with it. To make the distinction between boundary and overlap Refinement clearer, let’s look at a second example where we accidentally highlighted Snoopy like this:

Snoopy had lunch, and Tigger had breakfast.

Neither the exact nor the boundary Refinement criterion would be met – but the third one would produce:

overlap = true
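The three criteria can be sketched as simple interval comparisons. The code below is not CATMA's implementation; the function name matches and the annotation ranges are assumptions chosen for illustration, using 1-based inclusive positions as in the text above.

```python
def matches(occurrence, annotation, mode="exact"):
    """Compare two (first, last) ranges, both 1-based and inclusive."""
    o_start, o_end = occurrence
    a_start, a_end = annotation
    if mode == "exact":
        # Same start and same end.
        return (o_start, o_end) == (a_start, a_end)
    if mode == "boundary":
        # The occurrence lies entirely inside the annotation.
        return a_start <= o_start and o_end <= a_end
    if mode == "overlap":
        # The two ranges share at least one character.
        return o_start <= a_end and a_start <= o_end
    raise ValueError(f"unknown match mode: {mode}")


snoopy = (1, 6)

# Hypothetical annotation ranges:
one_char_too_far = (1, 7)   # highlight runs one character past the word
partial_highlight = (4, 10)  # highlight starts inside the word and runs past it

for mode in ("exact", "boundary", "overlap"):
    print(mode,
          matches(snoopy, one_char_too_far, mode),
          matches(snoopy, partial_highlight, mode))
# exact False False
# boundary True False
# overlap True True
```

Note that every exact match also satisfies boundary, and every boundary match also satisfies overlap, so the three modes run from strictest to most permissive.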

Finally, let us assume we had just highlighted the entire sentence in order to make explicit through our annotation that this sentence, according to our understanding, deals with animals:

Snoopy had lunch, and Tigger had breakfast.

To get all occurrences of Tigger within this sentence, which is now annotated as a whole with the Tag <Animal>, the query would look like this:

"Tigger" where tag = "Animal" boundary

The query language also allows querying for Properties and their values. Let’s say annotation <Animal> has a property named family that allows the user to identify cats among the animals. A query like

tag = "Animal" property = "family" value = "cat"

would then find Tigger, but not Snoopy.
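Conceptually, such a query filters annotations first by Tag and then by Property value. The following sketch models this with a small hypothetical data structure; the Annotation class, the function find_annotations, and the value "dog" for Snoopy are assumptions made for illustration and are not part of CATMA.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    text: str                      # the annotated segment
    tag: str                       # name of the Tag
    properties: dict = field(default_factory=dict)


def find_annotations(annotations, tag, prop=None, value=None):
    """Return annotations with the given tag and, optionally, property value."""
    result = []
    for ann in annotations:
        if ann.tag != tag:
            continue
        if prop is not None and ann.properties.get(prop) != value:
            continue
        result.append(ann)
    return result


annotations = [
    Annotation("Snoopy", "Animal", {"family": "dog"}),  # assumed value
    Annotation("Tigger", "Animal", {"family": "cat"}),
]

cats = find_annotations(annotations, "Animal", prop="family", value="cat")
print([ann.text for ann in cats])  # ['Tigger']
```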

CATMA presents the query results grouped and counted either by text sequence or by Tag. Additionally, subgroups are created per document and, in the case of annotations, per Annotation Collection. From the grouped results the user can compile concordances that also link back to the annotated text within the documents.
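The grouping and counting can be pictured as follows. This is a rough sketch on made-up data: the hit records, the document names doc1 and doc2, and the helper count_by are hypothetical and only illustrate the idea of grouping by one key and then sub-grouping by another.

```python
from collections import Counter

# Hypothetical query hits; in CATMA these would come from a real query.
hits = [
    {"text": "Snoopy", "tag": "Animal", "document": "doc1"},
    {"text": "Tigger", "tag": "Animal", "document": "doc1"},
    {"text": "Tigger", "tag": "Animal", "document": "doc2"},
]


def count_by(hits, *keys):
    """Count hits grouped by the given keys (each group is a tuple)."""
    return Counter(tuple(hit[key] for key in keys) for hit in hits)


by_text = count_by(hits, "text")                   # grouped by text sequence
by_tag = count_by(hits, "tag")                     # grouped by Tag
by_text_and_doc = count_by(hits, "text", "document")  # subgroups per document

print(by_text[("Tigger",)])                 # 2
print(by_tag[("Animal",)])                  # 3
print(by_text_and_doc[("Tigger", "doc2")])  # 1
```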