Technology and Prior CATMA Versions
The Annotate module: User-generated markup (annotations) in CATMA references the text by character offsets. The smallest referenceable text sequence is therefore one character long, and punctuation marks and blanks count as characters, too. An annotation is a typed set of key-value pairs that is attached to one or more text passages, so discontinuous annotations are possible. The tag definition, which the user can create freely, defines the type of an annotation and is a named set of properties; this set contains at least a color property. CATMA can be used in any field of text research and is not bound to the analysis of a specific topic.
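The annotation model described above can be illustrated with a minimal Python sketch. It is not CATMA's actual implementation: the class names `TextRange`, `TagDefinition` and `Annotation`, the half-open offset convention and the example tag are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TextRange:
    """A half-open span [start, end) of character offsets (assumed convention)."""
    start: int
    end: int

    def __post_init__(self):
        # The smallest referenceable sequence is one character long.
        if not 0 <= self.start < self.end:
            raise ValueError("a range must cover at least one character")


@dataclass
class TagDefinition:
    """A named set of property names that always contains 'color'."""
    name: str
    property_names: set = field(default_factory=set)

    def __post_init__(self):
        self.property_names = set(self.property_names) | {"color"}


@dataclass
class Annotation:
    """Typed key-value pairs attached to one or more text ranges."""
    tag: TagDefinition
    ranges: list
    properties: dict

    def __post_init__(self):
        if not self.ranges:
            raise ValueError("an annotation needs at least one range")
        unknown = set(self.properties) - self.tag.property_names
        if unknown:
            raise ValueError(f"properties not defined by the tag: {sorted(unknown)}")

    def passages(self, source):
        """Return the annotated passages in document order."""
        ordered = sorted(self.ranges, key=lambda r: r.start)
        return [source[r.start:r.end] for r in ordered]


# Hypothetical example: one discontinuous annotation over two passages.
source = "He said, quietly, that he would go."
speech = TagDefinition("indirect_speech", {"speaker"})
annotation = Annotation(
    tag=speech,
    ranges=[TextRange(0, 7), TextRange(18, 35)],
    properties={"color": "#ff0000", "speaker": "he"},
)
print(annotation.passages(source))
```

Because offsets count every character, the comma and the blanks in the example are addressable positions like any letter.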
The user-generated annotations can be exported as TEI for enhanced tool interoperability. To allow free tag creation, CATMA makes extensive use of the TEI Feature System Declaration and of feature structures. We use stand-off markup: text segments (<seg>) are linked to annotations by @ana attributes, and the text itself is represented by <ptr> elements that reference the original text document. The text document usually cannot be changed in any way. Small edits, however, can be made by creating a new copy in the GitLab backend. The TEI export format is described in detail here.
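To make the stand-off idea more concrete, the following Python sketch builds a small structure of <ptr>, <seg> and <fs> elements. It only illustrates the principle and does not reproduce CATMA's actual export: the function name `standoff_sketch`, the `#char=start,end` fragment syntax, the <ab> wrapper and the way properties are written as <f>/<string> are assumptions, and the result is not schema-valid TEI.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def standoff_sketch(doc_uri, text_length, annotations):
    """Build pointer/segment markup plus one feature structure per annotation.

    annotations: list of (ann_id, tag_name, properties, [(start, end), ...]).
    Returns (ab, fs_list); where these would sit in a TEI file is not shown.
    """
    # Cut the text at every annotation boundary so that each piece lies
    # either fully inside or fully outside any given annotation.
    cuts = {0, text_length}
    for _, _, _, ranges in annotations:
        for start, end in ranges:
            cuts.update((start, end))
    cuts = sorted(cuts)

    ab = ET.Element("ab")
    for start, end in zip(cuts, cuts[1:]):
        covering = [
            ann_id
            for ann_id, _, _, ranges in annotations
            if any(s <= start and end <= e for s, e in ranges)
        ]
        parent = ab
        if covering:
            # The ana attribute links the segment to its annotations.
            ana = " ".join("#" + ann_id for ann_id in covering)
            parent = ET.SubElement(ab, "seg", ana=ana)
        # The text is never copied; it is only pointed at.
        ET.SubElement(parent, "ptr", target=f"{doc_uri}#char={start},{end}")

    fs_list = []
    for ann_id, tag_name, properties, _ in annotations:
        fs = ET.Element("fs", {XML_ID: ann_id, "type": tag_name})
        for name, value in properties.items():
            f = ET.SubElement(fs, "f", name=name)
            ET.SubElement(f, "string").text = value
        fs_list.append(fs)
    return ab, fs_list


# Hypothetical input: one discontinuous annotation on a 35-character text.
ab, fs_list = standoff_sketch(
    "doc.txt",
    35,
    [("a1", "indirect_speech", {"color": "#ff0000"}, [(0, 7), (18, 35)])],
)
print(ET.tostring(ab, encoding="unicode"))
print(ET.tostring(fs_list[0], encoding="unicode"))
```

In this sketch the unannotated stretch between the two passages appears as a bare <ptr>, while both annotated stretches are wrapped in a <seg> that points to the same feature structure.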
The Analyze module combines the CATMA Query Language with interactive and manipulable Vega visualizations. CATMA’s query language is inspired by TACT and consists of several simple query types that can be combined to form complex queries. The query engine is based on the ANTLR library. Some of the query types are implemented in SQL and are backed by a database containing index tables. Regular expression queries (based on Java’s native regular expression capabilities) are part of the language.
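The idea of combining simple queries into complex ones can be sketched in Python as follows. This is an illustration only: it does not use CATMA's query syntax, its ANTLR grammar or its SQL index tables, it relies on Python's `re` module rather than Java's regular expressions, and the function names `regex_query`, `word_query`, `exclude` and `kwic` are hypothetical.

```python
import re


def regex_query(pattern, text):
    """Return the character ranges (start, end) of all non-empty matches."""
    return {m.span() for m in re.finditer(pattern, text) if m.end() > m.start()}


def word_query(word, text):
    """A simple query type: exact whole-word matches."""
    return regex_query(r"\b" + re.escape(word) + r"\b", text)


def exclude(a, b):
    """Combine two result sets: ranges found by a but not by b."""
    return a - b


def kwic(text, ranges, width=4):
    """Keyword-in-context rows: (left context, keyword, right context)."""
    rows = []
    for start, end in sorted(ranges):
        left = text[max(0, start - width):start]
        rows.append((left, text[start:end], text[end:end + width]))
    return rows


text = "The whale swam. The whales dived."
singular_or_plural = regex_query(r"\bwhales?\b", text)
plural_only = exclude(singular_or_plural, word_query("whale", text))
print(kwic(text, plural_only))
```

Since every simple query returns plain character ranges, results can be combined with ordinary set operations and mapped back to the text for display.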
CATMA is currently in its sixth version. Each version came with particular technological changes. The main improvements are listed below.
CATMA 5 offered a redesigned user interface, a graph database (Neo4j) and automated routines. The new UI replaced the window-based floating navigation with a fixed layout that allowed navigation via five module buttons. Neo4j was used to index the documents and to execute user-generated queries. The automated routines were developed in the heureCLÉA project and are accessible from CATMA via a pipeline. They allow for the automatic annotation of tenses and time signals in German texts. An automatic part-of-speech annotation routine for German texts was implemented as well.
CATMA 4 was implemented as a web application. In contrast to the older desktop versions, CATMA 4 facilitated collaborative work on texts thanks to the web environment. Further new features included an improved user interface, new query possibilities and the possibility to analyze whole corpora. CATMA 4.1 brought a major performance improvement: the rewritten persistence layer sped up communication between the server and the underlying database repository. It also included many usability features. Additionally, CATMA 4.2 provided the following major improvements:
- Upgrade of the underlying framework Vaadin from version 6 to version 7
- Complete rewrite of the Distribution Chart, now based on the small multiples visualization technique, with on-click jumps back to KWIC instances as well as to the full text
- Improved support for RTL languages (Hebrew, Arabic)
- Open Tagsets and Annotation Collections directly from Annotate
- Small UI improvements on buttons, menu and annotation info section
- Revised set-property-dialog in Annotate
- Graph database (Neo4j) backed indexing of documents for fast collocation, frequency, phrase and wildcard queries
- Buffered background indexing for fast document additions
- Queries executed in the background
- Support for XML based documents
CATMA 3 was fully Java-based, required JRE 1.6 or higher and was available for Mac, Windows and Unix/Linux-based systems. It included a slightly changed query language, a simple-to-use query builder, improved context-sensitive help and many more improvements, such as the possibility to search for properties and to jump from a graphic chart to the exact reference point within the document view. CATMA 3.2 included various bug fixes and minor improvements, such as the possibility to synchronize outdated tag databases contained in Annotation Collections with the local tag database. The user could also export data directly from CATMA to Voyant and take advantage of that tool's rich visualization possibilities.
CATMA 2 was a platform-independent, fully Java-based version with extended functionality. New features included, among others:
- punctuation aware indexing and searching facility
- phrase results when searching for annotated sequences
- display of tags in result view
- display of totals in selection views
- (basic) support for html, pdf, rtf and doc files
- search facility within the Annotate module
- selective multiple annotating and deleting of annotations
- annotating from result view
- distribution analysis
- collocation analysis
- full regular expression power in regex queries
- collocate query span per collocate (sub-)query
CATMA 1 was our first beta and was implemented for Windows only. CATMA 1.1 combined two separate but seamlessly integrated applications. For experimental reasons, we developed the analysis section in Microsoft C# / .NET, whereas the annotation section was programmed in Java.