The Annotate module: User-generated markup (annotations or tags) in CATMA references the text by character offsets, so the smallest referenceable text sequence is one character long. Punctuation marks and blanks count as characters, too. An annotation is a typed set of key-value pairs that is attached to one or more text passages, which means that discontinuous annotations are possible as well. The tag definition, which users can create freely, defines the type of an annotation and is a named set of properties. This set contains at least an author and a color property. CATMA can be used in any field of text research and is not bound to the analysis of a specific topic.
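The annotation model described above can be sketched in a few lines of Python. This is only an illustration, not CATMA's actual implementation: the class names (`Range`, `TagDefinition`, `Annotation`), the half-open offset convention, and the sample property values are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Range:
    """Character range into the source text (assumed half-open: [start, end))."""
    start: int
    end: int


@dataclass
class TagDefinition:
    """A named set of properties; author and color must always be present."""
    name: str
    properties: dict = field(default_factory=dict)

    def __post_init__(self):
        for required in ("author", "color"):
            if required not in self.properties:
                raise ValueError(f"missing required property: {required}")


@dataclass
class Annotation:
    """Key-value pairs, typed by a tag, attached to one or more text ranges."""
    tag: TagDefinition
    ranges: list
    values: dict = field(default_factory=dict)

    def passages(self, text):
        """Return the annotated passages in document order."""
        ordered = sorted(self.ranges, key=lambda r: r.start)
        return [text[r.start:r.end] for r in ordered]


text = "Call me Ishmael. Some years ago, never mind how long."

# Hypothetical tag with the two mandatory properties.
character = TagDefinition("character", {"author": "jdoe", "color": "#ff0000"})

# A discontinuous annotation: two separate ranges, one annotation.
note = Annotation(character, [Range(8, 15), Range(0, 4)], {"role": "narrator"})
passages = note.passages(text)

# Punctuation is addressable like any other character.
full_stop = text[15:16]
```

Here `passages` holds the two annotated strings in document order, and `full_stop` shows that a one-character range can point at a punctuation mark.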
The user-generated annotations can be exported as TEI-XML for enhanced tool interoperability. To allow free tag creation, CATMA makes extensive use of the TEI Feature System Declaration and Feature Structures. The export uses stand-off markup: text segments (<seg>) are linked to annotations via @ana attributes, and the text itself is represented by <ptr> elements that reference the original text document. The text document usually cannot be changed in any way. Small edits, however, can be made by creating a new copy in the GitLab backend. The TEI export format is described in detail here.
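To make the stand-off principle concrete, the following Python sketch resolves a small, simplified fragment back to the text it points at. It is not the actual CATMA export: the fragment omits the TEI namespace and header, and the file name `source.txt`, the `#char=start,end` pointer syntax and the annotation ids are assumptions chosen for illustration. The documented export format remains the authority.

```python
import xml.etree.ElementTree as ET

SOURCE_TEXT = "Call me Ishmael."

# Hypothetical, simplified stand-off fragment: the text is never inlined,
# only referenced by character ranges in the original document.
FRAGMENT = """
<body>
  <seg ana="#anno1"><ptr target="source.txt#char=0,7"/></seg>
  <ptr target="source.txt#char=7,8"/>
  <seg ana="#anno2"><ptr target="source.txt#char=8,15"/></seg>
  <ptr target="source.txt#char=15,16"/>
</body>
"""


def char_range(ptr):
    """Extract (start, end) from the assumed '#char=start,end' pointer syntax."""
    _, fragment = ptr.get("target").split("#char=")
    start, end = fragment.split(",")
    return int(start), int(end)


def resolve(fragment_xml, text):
    """Return (annotation reference, referenced text) for every annotated segment."""
    root = ET.fromstring(fragment_xml.strip())
    linked = []
    for seg in root.iter("seg"):
        for ptr in seg.iter("ptr"):
            start, end = char_range(ptr)
            linked.append((seg.get("ana"), text[start:end]))
    return linked


linked = resolve(FRAGMENT, SOURCE_TEXT)
```

Because the markup only points into the source, the original document stays untouched no matter how many annotations refer to it.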
The Analyze module combines the CATMA Query Language with interactive, manipulable Vega visualizations. CATMA’s query language is inspired by TACT and consists of several simple query types that can be combined to form complex queries. The query engine is built on the ANTLR library. Regular expression queries (based on Java’s native regular expression capabilities) are part of the language.
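As a rough illustration of what a regular expression query yields, the sketch below returns each match together with its character offsets. It uses Python's `re` module as a stand-in for Java's regular expression engine and does not use CATMA's query syntax; the function name `regex_query` is hypothetical.

```python
import re


def regex_query(pattern, text):
    """Return (start, end, matched text) for every match of pattern in text."""
    return [(m.start(), m.end(), m.group(0)) for m in re.finditer(pattern, text)]


text = "Call me Ishmael. Some years ago, never mind how long."

# Find capitalized words of at least two characters.
hits = regex_query(r"\b[A-Z]\w+", text)
```

Each hit carries the same kind of character range that annotations use, which is what allows query results to be mapped back onto the document.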
Current and Prior Versions
Each version of CATMA came with particular technological changes. The main improvements are listed below.
CATMA 6, the current version, is another major overhaul that includes multiple foundational changes:
- The relational database backend was replaced with the open-source GitLab platform and practically all storage was made file-based within Git repositories. JSON is the predominant file format being used.
- The Neo4J graph database was replaced with the in-memory TinkerGraph from the Apache TinkerPop™ project.
- Support for different document types was enhanced using Apache Tika™.
- The distribution and newly added word cloud visualizations were implemented using the Vega visualization grammar, affording users a new level of control over the visualizations, and allowing them to create new or derived visualizations.
- The user interface was redesigned, inspired in part by Google’s Material Design language.
Together, these changes take the versioning and collaborative aspects of CATMA in particular to the next level, open the door to direct integrations with other tools, and give power users complete access to and control over their data.
Version 6.1 added a collaborative comment feature to the Annotate module. Comments can also feed back into the analysis via the query language.
Version 6.2 made it easier for users to create so-called “Personal Access Tokens”, which can be used to give other tools access to project data within CATMA.
Version 6.3 came with some small internal improvements, including a way for users to be more easily notified about problems or important maintenance events directly on the login page.
Version 6.4 again introduced a number of smaller improvements, focusing mainly on user experience, security and stability.
Version 6.5 introduced a new experimental JSON API for interoperability with other tools, which can be used to retrieve project resources (documents as well as annotations and their corresponding tags) in a condensed/flattened format.
CATMA 5 offered a redesigned user interface and automated routines. The new UI replaced the window-based floating navigation with a fixed layout that allowed navigation via five module buttons. The automated routines were developed in the heureCLÉA project and are accessible from CATMA via a pipeline. They allow for the automatic annotation of tenses and time signals in German texts. An automatic part-of-speech annotation routine for German texts was implemented as well.
CATMA 4 was the first version that was implemented as a web application. In contrast to the older desktop versions, version 4 facilitated collaborative work on texts by taking advantage of the web environment. Additional new features included an improved user interface, new query possibilities and the ability to analyze whole corpora.
Version 4.1 brought a major performance improvement. The rewritten persistence layer sped up communication between the server and the underlying database repository. It also included many usability features.
Later, Version 4.2 came with the following major improvements:
- Upgrade of the underlying Vaadin web framework from version 6 to version 7
- Complete rewrite of the Distribution Chart: based on the small multiples visualization technique and supporting on-click navigation back to KWIC instances as well as the full text
- Improved support for RTL languages (Hebrew, Arabic)
- Opening of Tagsets and Annotation Collections directly from the Annotate module
- Small UI improvements on buttons, menu and annotation info section
- Revised Property dialog in the Annotate module
- Graph database (Neo4J) backed indexing of documents for fast collocation, frequency, phrase and wildcard queries
- Buffered background indexing for fast document additions
- Queries executed in the background
- Support for XML-based documents
CATMA 3 was a platform-independent, fully JAVA© based desktop application and an evolution of CATMA 2. It required a newer JAVA© runtime (JRE 1.6 or higher). Updates included a slightly changed query language, a simple-to-use query builder, improved context-sensitive help functionality as well as many more improvements, such as the possibility to search for properties and to jump from a graphic chart to the exact reference point within the document view.
Version 3.2 included various bug fixes and minor improvements, such as the ability to synchronize outdated tag databases contained in Annotation Collections with the local tag database. Users could also export data directly from CATMA to Voyant to take advantage of that tool's rich visualization possibilities.
CATMA 2 was a platform-independent, fully JAVA© based desktop application with extended functionality. New features included, among others:
- punctuation-aware indexing and searching
- phrase results when searching for annotated sequences
- display of tags in result view
- display of totals in selection views
- (basic) support for HTML, PDF, RTF and DOC files
- search facility within the Annotate module
- selective multiple annotating and deleting of annotations
- annotating from result view
- distribution analysis
- collocation analysis
- full regular expression power in regex queries
- collocate query span per collocate (sub-)query
CATMA 1 was our first beta and was implemented as a desktop application for Windows only. CATMA 1.1 combined two separate, but seamlessly integrated applications. For experimental reasons, we developed the analysis module in Microsoft© C# / .NET, whereas the annotation module was programmed in JAVA©.
Tools that We Use – Acknowledgement and Thanks
EJ Technologies were kind enough to provide us with an open source project license for their excellent Java profiler: JProfiler. Thank you very much!