One of CATMA’s central entities is the text range. All markup generated by the user references text in terms of character offsets, so the smallest referenceable text sequence is one character long; punctuation marks and blanks count as characters too. Another central entity is the Tag Instance, a typed set of key-value pairs that is attached to one or more text ranges. Attaching an instance to several ranges is what allows CATMA to support discontinuous markup. The type of a Tag Instance is defined by a Tag Definition, which is a named set of properties that contains at least a color property. Users can create Tag Definitions freely. CATMA is therefore not bound to the analysis of a specific topic and can easily be extended to any field of research that benefits from text analysis.
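The relationship between text ranges, Tag Definitions and Tag Instances can be illustrated with a minimal sketch. It is written in Python for brevity and is not CATMA's own code: the class names, fields, the half-open offset convention and the sample tag `indirect_speech` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TextRange:
    """Half-open character range [start, end) into the source text (assumed convention)."""
    start: int
    end: int

    def __post_init__(self):
        # The smallest referenceable sequence is one character long.
        if self.start < 0 or self.end <= self.start:
            raise ValueError("a range must cover at least one character")

    def of(self, text: str) -> str:
        return text[self.start:self.end]


@dataclass
class TagDefinition:
    """A named set of properties; hypothetical representation as name -> default value."""
    name: str
    properties: dict

    def __post_init__(self):
        if "color" not in self.properties:
            raise ValueError("a Tag Definition needs at least a color property")


@dataclass
class TagInstance:
    """Typed key-value pairs attached to one or more text ranges."""
    definition: TagDefinition
    ranges: list
    values: dict = field(default_factory=dict)

    def __post_init__(self):
        if not self.ranges:
            raise ValueError("a Tag Instance needs at least one text range")
        # Start from the definition's defaults, then apply instance values.
        self.values = {**self.definition.properties, **self.values}

    def covered_text(self, text: str) -> list:
        ordered = sorted(self.ranges, key=lambda r: r.start)
        return [r.of(text) for r in ordered]


text = "She said, softly, that it was over."
speech = TagDefinition("indirect_speech", {"color": "#3366cc", "speaker": ""})

# One instance, two non-adjacent ranges: discontinuous markup.
instance = TagInstance(speech, [TextRange(18, 34), TextRange(0, 8)], {"speaker": "she"})
print(instance.covered_text(text))
```

Because the instance only stores offsets, the text itself is never modified, and a single punctuation mark such as the comma at offset 8 can be tagged on its own.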
For better tool interoperability we offer a TEI export of the user-generated markup. To support the free creation of Tag Definitions, CATMA makes heavy use of the TEI Feature Declaration System and Feature Structures. We use standoff markup: text ranges (<seg>) are linked to Tag Instances through the @ana attribute, and the text itself is represented by <ptr> elements that reference the original Source Document. The Source Document, i.e. the text to be annotated, is not changed in any way. A detailed description of the TEI export format can be found here.
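The standoff idea can be sketched as follows. This is an illustration in Python, not CATMA's exporter and not schema-valid TEI: only <seg>, <ptr> and @ana come from the description above, <fs>, <f> and <string> are standard TEI feature structure elements, and everything else (the `#char=start,end` pointer syntax, the document layout, the function names, the restriction to non-overlapping ranges) is an assumption.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def add_ptr(parent, uri, start, end):
    # "#char=start,end" is an assumed fragment syntax for character offsets.
    ET.SubElement(parent, "ptr", target=f"{uri}#char={start},{end}")


def export_standoff(uri, length, instances):
    """instances: list of (instance_id, type_name, properties, [(start, end), ...]).

    Assumes the tagged ranges do not overlap.
    """
    root = ET.Element("TEI")
    text_el = ET.SubElement(root, "text")
    ab = ET.SubElement(text_el, "ab")

    spans = sorted(
        (start, end, inst_id)
        for inst_id, _, _, ranges in instances
        for start, end in ranges
    )
    pos = 0
    for start, end, inst_id in spans:
        if start > pos:
            add_ptr(ab, uri, pos, start)           # untagged stretch of text
        seg = ET.SubElement(ab, "seg", ana=f"#{inst_id}")
        add_ptr(seg, uri, start, end)              # tagged stretch of text
        pos = end
    if pos < length:
        add_ptr(ab, uri, pos, length)

    # One feature structure per Tag Instance, one feature per property.
    for inst_id, type_name, props, _ in instances:
        fs = ET.SubElement(text_el, "fs", {XML_ID: inst_id, "type": type_name})
        for name, value in props.items():
            f = ET.SubElement(fs, "f", name=name)
            ET.SubElement(f, "string").text = value
    return root


demo = export_standoff(
    "doc.txt", 35,
    [("i1", "indirect_speech", {"color": "#3366cc"}, [(0, 8), (18, 34)])],
)
print(ET.tostring(demo, encoding="unicode"))
```

Both <seg> elements point to the same feature structure through @ana, which is how a discontinuous Tag Instance stays a single annotation even though it covers two separate ranges.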
The Analyzer’s main component is the CATMA Query Language, which consists of several simple query types that can be combined to form complex queries. The query engine is based on the ANTLR library. The CATMA Query Language in particular is heavily inspired by TACT. Some query types are backed by a database containing index tables and are implemented in SQL. The language also offers a regular expression query, which uses Java’s native regular expression capabilities.
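The idea of simple queries that return text ranges and can be combined is sketched below. This is not the CATMA Query Language or its engine: the function names and the `union` combinator are hypothetical, and the sketch uses Python's `re` module where CATMA uses Java's regular expressions, whose syntax differs in some details.

```python
import re


def regex_query(text, pattern):
    """Return (start, end) character ranges of all non-empty regex matches."""
    return [
        (m.start(), m.end())
        for m in re.finditer(pattern, text)
        if m.end() > m.start()  # a range must cover at least one character
    ]


def word_query(text, word):
    """Return the ranges of all whole-word occurrences of a literal word."""
    return regex_query(text, r"\b" + re.escape(word) + r"\b")


def union(*results):
    """Combine several query results into one sorted, duplicate-free list."""
    return sorted(set().union(*results))


sample = "The cat sat. The dog sat too."
animals = regex_query(sample, r"\b[cd]\w+")
sat = word_query(sample, "sat")
combined = union(animals, sat)
print([sample[start:end] for start, end in combined])
```

Since every query type returns the same kind of result, a list of character ranges, combinators such as `union` can be applied regardless of how each individual result was computed.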