CATMA works locally on the Git projects. When you synchronize your CATMA Project, you actually synchronize the local Git projects with their remote counterparts in the GitLab backend. This means that before you access your data via GitLab/Git, you should always synchronize your CATMA Project within CATMA.
A CATMA Project is equivalent to a GitLab Group.
The CATMA ProjectID is equivalent to the GitLab Group name. The name and the description of your CATMA Project are stored in the GitLab Group description.
CATMA uses the user, role and permission management of GitLab. So if you are working with a team on your CATMA Project you will find all participating team members in the Members section of the corresponding GitLab Group.
If you have a CATMA account you can use the same credentials to log in to the GitLab backend (your username and password or your Google account).
A CATMA Project usually contains several resources. A resource can either be a Document, an Annotation Collection or a Tagset. All resources are modelled as GitLab/Git projects. A CATMA resource is a GitLab/Git project within the GitLab Group namespace. In order to avoid confusing a CATMA Project with a GitLab/Git resource project we will use those names throughout this article explicitly.
The Root GitLab/Git Project
You will find all your CATMA resources as GitLab/Git projects within your GitLab Group. There is one additional GitLab/Git project in the GitLab Group which is special: the root GitLab/Git project. Its name is a concatenation of the CATMA ProjectID and _root. Let’s assume you have a CATMA Project named Shakespeare with the ID CATMA_900F812B-69EF-4326-A3E6-58BCFF509719_Shakespeare. Then the root GitLab/Git Project would be named:
CATMA_900F812B-69EF-4326-A3E6-58BCFF509719_Shakespeare_root
So what is this root GitLab/Git project all about? Changes to your resources, e.g. adding Annotations to a Collection, are all versioned within the GitLab/Git project that backs a CATMA resource. But what about changes to the structure of your CATMA Project that are made when you add or remove resources? Changes to this structure, the CATMA Project configuration so to speak, are recorded and versioned in the root GitLab/Git project. Luckily, Git offers a standard way to manage such a configuration: Git submodules. So all resources that are part of the current version of your CATMA Project are Git submodules of the root GitLab/Git project. We will come back to that later when we discuss how to actually work with your CATMA Project via Git. Note for now that CATMA versions your CATMA Project configuration with Git submodules.
When you add a resource to your CATMA Project it gets added as a GitLab/Git project to the GitLab Group and it gets added as a Git submodule to the root GitLab/Git project.
When you remove a resource from your CATMA Project it gets removed as a submodule from the root GitLab/Git project. It does not get removed as a GitLab/Git project from the GitLab Group automatically. So if you go back to an older version of your CATMA Project configuration, the resources are still available.
The folder structure of the root GitLab/Git project is as follows:
CATMA_900F812B-69EF-4326-A3E6-58BCFF509719_Shakespeare_root
\__ .git
\__ .gitmodules
\__ documents\
\__ collections\
\__ tagsets\
Besides the .git folder with the Git management and configuration files there is also a .gitmodules file which maintains the list of Git submodules and a subfolder for each of the resource categories of a CATMA Project.
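To see which resources belong to the current version of a CATMA Project, you can read the .gitmodules file. The following is a minimal sketch, assuming the file uses Git's usual layout of one submodule section per resource with a path and a url entry. The function name, the sample resource ID and the sample URL are made up for illustration.

```python
import configparser


def list_submodules(gitmodules_text: str) -> list[tuple[str, str]]:
    """Return (path, url) pairs for every submodule entry in a .gitmodules file."""
    # Git indents the key lines with a tab; strip it so configparser
    # never has to interpret leading whitespace.
    cleaned = "\n".join(line.strip() for line in gitmodules_text.splitlines())
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cleaned)
    entries = []
    for section in parser.sections():
        if section.startswith("submodule "):
            entries.append((parser[section]["path"], parser[section]["url"]))
    return entries


# Hypothetical sample content; the resource ID and the URL are placeholders.
sample = (
    '[submodule "documents/D_EXAMPLE"]\n'
    "\tpath = documents/D_EXAMPLE\n"
    "\turl = https://git.example.org/group/D_EXAMPLE.git\n"
)
print(list_submodules(sample))
```

The first path segment (documents, collections or tagsets) tells you the resource category of each entry.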
The Documents are all within the documents folder of your root GitLab/Git project.
Each Document has its own folder named after the CATMA ID of the Document.
Each Document folder contains four files:
- header.json – contains the metadata of the Document
- CATMA_ID_orig.EXT – the original file that was uploaded, with a format-specific extension
- CATMA_ID.txt – the extracted text in UTF-8 plain/text
- CATMA_ID.json – the indexed types with the start, end and token offset of their tokens, i.e. the word list
The Extracted Text
The file with the extracted text in UTF-8 plain/text is the most important file as all start and end offsets of the Annotations can be resolved against the character offsets of the extracted text.
The Tagsets are all within the tagsets folder of your root GitLab/Git project.
Each Tagset has its own folder named after the CATMA ID of the Tagset.
Each Tagset folder contains a header.json file with some metadata like the Tagset’s name.
Remember that a Tagset contains zero or more Tags that can form a hierarchical structure:
- A Tag has either no parent (top level Tag) or exactly one parent.
- A Tag can have zero or more child Tags.
Each Tag has its own folder named after the Tag’s CATMA ID.
Tag folders of Tags that have a parent Tag are located as subfolders of a folder named after the parent Tag’s ID.
This sounds more complicated than it is, because in the end each Tag is represented by a file named propertydefs.json. To load the Tags of a Tagset you just need to parse all propertydefs.json files that can be found in the subdirectories of the Tagset’s folder.
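Loading the Tags as described can be sketched as a recursive search for propertydefs.json files. This is a minimal sketch: the function name load_tags is an assumption, and the folder names and JSON keys in the demo are placeholders, not CATMA's actual field names.

```python
import json
import tempfile
from pathlib import Path


def load_tags(tagset_dir: str) -> list[dict]:
    """Parse every propertydefs.json below a Tagset folder, one dict per Tag."""
    tags = []
    for path in sorted(Path(tagset_dir).rglob("propertydefs.json")):
        with path.open(encoding="utf-8") as handle:
            tags.append(json.load(handle))
    return tags


# Demo with a throwaway Tagset folder; IDs and JSON keys are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    child = Path(tmp) / "PARENT_TAG_ID" / "CHILD_TAG_ID"
    child.mkdir(parents=True)
    (child.parent / "propertydefs.json").write_text('{"demo": "parent"}', encoding="utf-8")
    (child / "propertydefs.json").write_text('{"demo": "child"}', encoding="utf-8")
    demo_tags = load_tags(tmp)

print(len(demo_tags))
```

Because every Tag file carries its own ID and parent ID, the hierarchy can be rebuilt from the parsed data without looking at the folder nesting.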
The propertydefs.json file contains:
- the name
- the ID, which is a UUID
- the parent ID, which is a UUID but can be empty in case of a top level Tag
- two system Properties:
  - the author (catma_markupauthor)
  - the color (catma_displaycolor)
The values of the system Properties are to be found in the property named
The color is encoded as an integer value containing red, green and blue values encoded as bits: red component in bits 16-23, the green component in bits 8-15, and the blue component in bits 0-7. This corresponds to the color encoding in HTML.
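The bit layout above can be unpacked with shifts and masks. This is a minimal sketch; the function names are made up for illustration. It assumes you have already converted the stored value to an integer, and it ignores any bits above bit 23, so a value that was stored as a negative signed integer is handled as well.

```python
def decode_color(value: int) -> tuple[int, int, int]:
    """Split a packed color integer into (red, green, blue), each 0-255."""
    red = (value >> 16) & 0xFF   # bits 16-23
    green = (value >> 8) & 0xFF  # bits 8-15
    blue = value & 0xFF          # bits 0-7
    return red, green, blue


def to_html_hex(value: int) -> str:
    """Format the packed color as an HTML hex color string."""
    return "#{:02x}{:02x}{:02x}".format(*decode_color(value))


print(decode_color(0xFF8000), to_html_hex(0xFF8000))
```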
Besides the system Properties a Tag can have zero or more user-defined Properties. Each Property has a name and a list of possible or proposed values (possibleValueList) that are presented to the user when a Tag is applied.
The Collections are all within the collections folder of your root GitLab/Git project.
Each Collection has its own folder named after the CATMA ID of the Collection.
Each Collection folder contains a header.json file with some metadata like the Collection’s name and the ID of the Document it belongs to (sourceDocumentId).
The Annotations are located in a subfolder called annotations. Each Annotation has its own file named after the CATMA ID of the Annotation.
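Reading a Collection folder can be sketched as follows. This is a minimal sketch: the function name is made up, it assumes sourceDocumentId is a top-level key of header.json, and the file names and IDs in the demo are placeholders.

```python
import json
import tempfile
from pathlib import Path


def read_collection(collection_dir: str) -> tuple[str, list[Path]]:
    """Return the Document ID of a Collection and the paths of its Annotation files."""
    folder = Path(collection_dir)
    with (folder / "header.json").open(encoding="utf-8") as handle:
        header = json.load(handle)
    annotations_dir = folder / "annotations"
    # Git does not track empty folders, so a Collection without Annotations
    # may have no annotations subfolder at all.
    if not annotations_dir.is_dir():
        return header["sourceDocumentId"], []
    files = sorted(p for p in annotations_dir.iterdir() if p.is_file())
    return header["sourceDocumentId"], files


# Demo with a throwaway Collection folder; all IDs and file names are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "annotations").mkdir()
    (root / "header.json").write_text(
        json.dumps({"sourceDocumentId": "D_EXAMPLE"}), encoding="utf-8"
    )
    (root / "annotations" / "A_EXAMPLE.json").write_text("{}", encoding="utf-8")
    demo_document_id, demo_files = read_collection(tmp)

print(demo_document_id, [p.name for p in demo_files])
```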
Annotations follow the Web Annotation Data Model and are serialized as JSON-LD.
Each Annotation has a type, i.e. its Tag, and one or more references to text segments, which may be non-adjacent (discontinuous markup). Each Annotation has a timestamp and an author (not to be confused with the author of the Tag). The Annotation inherits the color, the name and the user-defined Properties from its Tag. A user-defined Property of an Annotation can have zero or more values, which are either drawn from the set of possible values of the Tag or defined by the user while annotating (ad-hoc values).
Within the Annotation’s file you’ll find a body section with the following subsections:
- tagset – the URL of the Tagset GitLab/Git project. This URL also contains the ID of the Tagset.
- tag – the URL of the Tag GitLab/Git project. This URL also contains the ID of the Tag.
- properties – The Properties and their Annotation specific values:
  - system – timestamp and author of the Annotation
  - user – user defined Properties with the CATMA ID and a list of values for each Property
- target – contains a list of TextPositionSelector selectors with start and end offsets that reference the aforementioned UTF-8 plain/text file with the extracted text
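Resolving the selectors against the extracted text can be sketched as below. This is a minimal sketch under several assumptions: the function names are made up, the nesting of the sample target is hypothetical (the search simply looks for TextPositionSelector objects at any depth), and the end offset is treated as exclusive, as in the Web Annotation Data Model. Python counts Unicode code points, so offsets may need adjusting if your text contains characters outside the Basic Multilingual Plane.

```python
def find_text_positions(node) -> list[tuple[int, int]]:
    """Collect (start, end) from every TextPositionSelector found inside node."""
    found = []
    if isinstance(node, dict):
        if node.get("type") == "TextPositionSelector":
            found.append((int(node["start"]), int(node["end"])))
        for value in node.values():
            found.extend(find_text_positions(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_text_positions(item))
    return found


def annotated_segments(annotation: dict, text: str) -> list[str]:
    """Return the text segments an Annotation refers to, in selector order."""
    return [text[start:end] for start, end in find_text_positions(annotation.get("target"))]


# Hypothetical sample; a real Annotation file carries more fields.
sample_text = "To be, or not to be"
sample_annotation = {
    "target": {
        "items": [
            {"selector": {"type": "TextPositionSelector", "start": 0, "end": 5}},
            {"selector": {"type": "TextPositionSelector", "start": 14, "end": 19}},
        ]
    }
}
print(annotated_segments(sample_annotation, sample_text))
```

In practice, sample_text would be the content of the CATMA_ID.txt file of the Document named in the Collection's header.json, read with UTF-8 encoding.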
Working with Git
The GitLab backend is accessible at git.catma.de. Once you’ve logged in with your CATMA account credentials (or your Google account) you can access your settings in the upper right corner. On the settings page, you will find the Access Tokens item in the menu on the left.
Add a Personal Access Token with the ‘api’ scope enabled.
Make sure you copy the token right after creation and put it somewhere safe. You won’t be able to see the token itself after you leave the page!
Before you start, please make sure that your local installation of Git has the right setting for the handling of line endings. The value of core.autocrlf needs to be set to false. You can check the value with:
git config --global --get core.autocrlf
If it doesn’t print anything out then it defaults to false. You can set the value with:
git config --global core.autocrlf false
Note that it is also possible to set this value per Git repository or on the system level. Just make sure that it is set to false for all CATMA Git repositories/projects!
Now you can work with the GitLab API.
For example to get a list of all of your CATMA Projects:
or with the parameter GROUPID_OR_NAME set to the CATMA Project ID to get a list of all the resource GitLab/Git projects and the root GitLab/Git project:
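Both requests can be sketched in Python as follows. This is a minimal sketch, assuming the backend exposes the standard GitLab REST API v4 under /api/v4 and accepts the Personal Access Token in the PRIVATE-TOKEN header. The helper names are made up for illustration, and "YOUR_TOKEN" is a placeholder.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the standard GitLab REST API path on the CATMA backend.
BASE_URL = "https://git.catma.de/api/v4"


def build_request(path: str, token: str) -> urllib.request.Request:
    """Prepare an authenticated GET request for the given API path."""
    return urllib.request.Request(f"{BASE_URL}{path}", headers={"PRIVATE-TOKEN": token})


def groups_request(token: str) -> urllib.request.Request:
    """Request for the GitLab Groups, i.e. the CATMA Projects, visible to the token."""
    return build_request("/groups", token)


def group_projects_request(group_id_or_name: str, token: str) -> urllib.request.Request:
    """Request for the resource projects and the root project of one Group."""
    quoted = urllib.parse.quote(group_id_or_name, safe="")
    return build_request(f"/groups/{quoted}/projects", token)


def fetch_json(request: urllib.request.Request):
    """Send the request and decode the JSON response (performs a network call)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


demo_request = group_projects_request(
    "CATMA_900F812B-69EF-4326-A3E6-58BCFF509719_Shakespeare", "YOUR_TOKEN"
)
print(demo_request.full_url)
```

Calling fetch_json(groups_request(token)) with a real token would return the first page of results; GitLab paginates list responses, so larger Projects need further pages.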
Using the Git URL of the corresponding root GitLab/Git project, you can also work with Git directly to clone a CATMA Project. For the Shakespeare CATMA Project mentioned above, the command would look like this:
git clone --recurse-submodules https://git.catma.de/CATMA_900F812B-69EF-4326-A3E6-58BCFF509719_Shakespeare/CATMA_900F812B-69EF-4326-A3E6-58BCFF509719_Shakespeare_root.git
Use the created access token as username and password. Alternatively you can also add an SSH key to your account and clone with SSH.
All you need is the URL of the root GitLab/Git repository. The resources of the current version get initialized automatically by the --recurse-submodules option.
You can use the cloned repository for backup purposes or to integrate external systems. However, be aware that CATMA can handle only a certain amount of complexity when resolving merge conflicts. You should therefore always resolve conflicts on your side.
CATMA always works on a local-only dev branch. Changes get merged into the local master branch. When synchronizing a CATMA Project the local master branches of the resource Git projects and of the root Git project get merged with their remote GitLab counterparts.
Note that you should stick to the folder and file structures and formats described above to avoid errors.