The CATMA-TEI Export Format
Each Annotation Collection in CATMA is represented as one TEI XML file. The TEI header contains a basic file description with the title, author and publisher of the represented Annotation Collection. It also contains a short source description with a version number that identifies the version of the CATMA TEI export format. As of this writing the format is at version 5, and we try to stay backward compatible with all earlier versions.
This is an example of the file description (at the bottom of the page you will find a link to the full version of the example):
<fileDesc>
  <titleStmt>
    <title>Alice Demo Markup</title>
    <author>Marco Petris</author>
  </titleStmt>
  <publicationStmt>
    <publisher>Marco Petris</publisher>
  </publicationStmt>
  <sourceDesc>
    <p>Demo Markup for Alice in Wonderland</p>
    <ab>
      <fs xml:id="CATMA_TECH_DESC">
        <f name="version">
          <string>4</string>
        </f>
      </fs>
    </ab>
  </sourceDesc>
</fileDesc>
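As a minimal sketch of how a consumer might read the format version from the source description, the following Python snippet parses the <sourceDesc> part of the example with the standard library. The function name read_format_version is made up for this illustration. The snippet is parsed without a namespace declaration, as printed here; a full export may declare the TEI namespace, in which case the element names would have to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

# The <sourceDesc> part of the example above, without a namespace declaration.
SOURCE_DESC = """
<sourceDesc>
  <p>Demo Markup for Alice in Wonderland</p>
  <ab>
    <fs xml:id="CATMA_TECH_DESC">
      <f name="version"><string>4</string></f>
    </fs>
  </ab>
</sourceDesc>
"""

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def read_format_version(source_desc):
    """Return the export format version as an int, or None if it is absent."""
    for fs in source_desc.iter("fs"):
        if fs.get(XML_ID) == "CATMA_TECH_DESC":
            value = fs.findtext("f[@name='version']/string")
            return int(value) if value is not None else None
    return None


version = read_format_version(ET.fromstring(SOURCE_DESC))
print(version)  # 4
```

Checking the version first lets a reader of the file decide early whether it knows how to handle the rest of the document.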
However, the most important part of the TEI header is the encoding description, which contains all the type information: the actual CATMA Tags. The encoding description represents the Tag Library of the Annotation Collection. This Tag Library lists all the Tagsets relevant for the Annotations contained in this Annotation Collection. Each Tagset is represented by a <fsdDecl> element. The @xml:id attribute holds the unique identifier of the Tagset. The @n attribute contains the name and a version timestamp of the Tagset:
<fsdDecl xml:id="CATMA_D044A28F-73CE-43F4-96D0-954B4D585862" n="Demo Tagset 2017-06-09T10:54:11.000+0200">
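The @n value can be taken apart again with a few lines of Python. This is only a sketch under two assumptions that the example suggests but the text does not state: the timestamp is the last whitespace-separated token, and it always has the shape shown above. The function name split_tagset_label is hypothetical.

```python
from datetime import datetime


def split_tagset_label(n_value):
    """Split '<name> <timestamp>' at the last space (assumed layout)."""
    name, _, stamp = n_value.rpartition(" ")
    # Assumed timestamp shape: 2017-06-09T10:54:11.000+0200
    return name, datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f%z")


tagset_name, tagset_version = split_tagset_label(
    "Demo Tagset 2017-06-09T10:54:11.000+0200"
)
print(tagset_name)                 # Demo Tagset
print(tagset_version.isoformat())  # 2017-06-09T10:54:11+02:00
```

Splitting at the last space keeps Tagset names that themselves contain spaces intact.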
The Tagset lists all its Tags as <fsDecl> elements. Each Tag has a unique @xml:id and a version timestamp in @n. The @type attribute holds the unique identifier again. Together, the @type and @baseTypes attributes build up the Tag hierarchy of the Tagset. The Tag hierarchy is a rooted directed tree: each Tag has zero or one parent Tag (encoded in @baseTypes) and zero or more child Tags. The @type attribute is also referenced by the actual Annotations at the bottom of the TEI document, but we’ll get to that later.
Each Tag has a name encoded in the mandatory <fsDescr> element. The Properties of the Tag are encoded as <fDecl> elements. Each Property has its unique identifier in @xml:id and its name in @name. An optional list of possible values for each Property is encoded as a <vRange><vColl> combination that contains one <string> element per value. If there are no predefined values, the <vColl> can be empty.
There are two system Properties that are managed by CATMA and that are always present: the Tag color (catma_displaycolor) and the Tag author (catma_markupauthor). The color is encoded as an integer whose bits hold the red, green and blue values: the red component in bits 16-23, the green component in bits 8-15, and the blue component in bits 0-7 (I know… it has historical reasons).
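The bit layout of the color can be unpacked with shifts and masks. The sketch below uses the value -5311554 from the examples on this page; the function name decode_display_color is made up for this illustration. Any bits above bit 23 are not covered by the description and are simply ignored here. Python's bitwise operators treat negative integers as two's complement, so the negative value needs no special handling.

```python
def decode_display_color(value):
    """Split a catma_displaycolor integer into (red, green, blue)."""
    number = int(value)
    red = (number >> 16) & 0xFF    # bits 16-23
    green = (number >> 8) & 0xFF   # bits 8-15
    blue = number & 0xFF           # bits 0-7
    return red, green, blue


rgb = decode_display_color("-5311554")
print(rgb)                                # (174, 243, 190)
print("#{:02x}{:02x}{:02x}".format(*rgb))  # #aef3be
```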
This is an example of a Tag named “Demo Tag 2” with the two system properties and two user defined properties. One of the user defined properties has a list of possible values:
<fsDecl xml:id="CATMA_FF67AFFF-98F4-41C1-8222-8E9B3BF4626E" n="2017-06-09T11:10:14.000+0200" type="CATMA_FF67AFFF-98F4-41C1-8222-8E9B3BF4626E">
  <fsDescr>Demo Tag 2</fsDescr>
  <fDecl xml:id="CATMA_F2712002-5212-4646-8727-CA37C995FD94" name="catma_markupauthor">
    <vRange>
      <vColl>
        <string>firstname.lastname@example.org</string>
      </vColl>
    </vRange>
  </fDecl>
  <fDecl xml:id="CATMA_FC0EA60F-8A26-49BA-BCE4-E5EE3BC61578" name="catma_displaycolor">
    <vRange>
      <vColl>
        <string>-5311554</string>
      </vColl>
    </vRange>
  </fDecl>
  <fDecl xml:id="CATMA_7A32A769-2B3A-4AF7-A97F-25B502054796" name="My_Property_with_values">
    <vRange>
      <vColl>
        <string>my first value</string>
        <string>my second value</string>
      </vColl>
    </vRange>
  </fDecl>
  <fDecl xml:id="CATMA_71955049-B496-43A8-9572-D540C83B4B7D" name="My_Property_without_values">
    <vRange>
      <vColl/>
    </vRange>
  </fDecl>
</fsDecl>
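To illustrate how such a Tag definition could be read, here is a small Python sketch that turns an <fsDecl> element into a dictionary and groups Tags by their parent. The example is shortened to the two user defined Properties, and the helper names read_tag and children_by_parent are hypothetical. As before, the snippet is parsed without a namespace declaration, which a full export may well have.

```python
import xml.etree.ElementTree as ET

# Shortened version of the example above (system Properties left out).
FS_DECL = """
<fsDecl xml:id="CATMA_FF67AFFF-98F4-41C1-8222-8E9B3BF4626E"
        n="2017-06-09T11:10:14.000+0200"
        type="CATMA_FF67AFFF-98F4-41C1-8222-8E9B3BF4626E">
  <fsDescr>Demo Tag 2</fsDescr>
  <fDecl xml:id="CATMA_7A32A769-2B3A-4AF7-A97F-25B502054796" name="My_Property_with_values">
    <vRange><vColl><string>my first value</string><string>my second value</string></vColl></vRange>
  </fDecl>
  <fDecl xml:id="CATMA_71955049-B496-43A8-9572-D540C83B4B7D" name="My_Property_without_values">
    <vRange><vColl/></vRange>
  </fDecl>
</fsDecl>
"""


def read_tag(fs_decl):
    """Collect id, parent, name and possible Property values of one Tag."""
    return {
        "id": fs_decl.get("type"),
        "parent": fs_decl.get("baseTypes"),  # None when the Tag has no parent
        "name": fs_decl.findtext("fsDescr"),
        "properties": {
            f_decl.get("name"): [s.text for s in f_decl.iter("string")]
            for f_decl in fs_decl.findall("fDecl")
        },
    }


def children_by_parent(tags):
    """Map each parent id (None for root Tags) to the ids of its child Tags."""
    tree = {}
    for tag in tags:
        tree.setdefault(tag["parent"], []).append(tag["id"])
    return tree


tag = read_tag(ET.fromstring(FS_DECL))
print(tag["name"])        # Demo Tag 2
print(tag["properties"])
print(children_by_parent([tag]))
```

Because each Tag has at most one parent, a plain mapping from parent id to child ids is enough to walk the hierarchy from the roots down.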
The next important part of the TEI document is the <text> element, which follows the header. It has two sections. The first is the <body> section, which contains the references to the text and the references to the Annotations. The second section contains one <fs> element for each Annotation.
Let’s first have a look at the Annotations. The Annotations are the instances of the Tags. Each Annotation has a unique identifier in @xml:id and a @type attribute that points to the Tag that is the type of this Annotation. The @type attribute contains the identifier of the type that is specified in the encoding description we talked about earlier. Each Annotation has exactly one type. All Properties of the Annotation are listed as <f> elements. The name in @name corresponds to a Property definition of the Annotation’s type in the encoding description. The Annotation-specific value of the Property is encoded with a <string> element. Note that all Annotations of a type have the same catma_displaycolor value, because the color of an Annotation is determined by its type. The catma_markupauthor can be different for each Annotation, as it holds the author of the Annotation, not the author of the Tag that types the Annotation.
This is an example of an Annotation with the Tag “Demo Tag 2”:
<fs xml:id="CATMA_651065DA-4D0A-4308-833E-83B751237B36" type="CATMA_FF67AFFF-98F4-41C1-8222-8E9B3BF4626E">
  <f name="catma_markupauthor">
    <string>email@example.com</string>
  </f>
  <f name="catma_displaycolor">
    <string>-5311554</string>
  </f>
  <f name="My_Property_with_values">
    <string>my first value</string>
  </f>
  <f name="My_Property_without_values">
    <string>this is an adhoc value </string>
  </f>
</fs>
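Reading an Annotation back is mostly a matter of collecting its <f> elements. The following Python sketch does that for the example above; read_annotation is a hypothetical helper name. Values are gathered into a list per Property as a defensive choice, although the example shows exactly one <string> per Property. The snippet again assumes that no namespace is declared.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

ANNOTATION = """
<fs xml:id="CATMA_651065DA-4D0A-4308-833E-83B751237B36"
    type="CATMA_FF67AFFF-98F4-41C1-8222-8E9B3BF4626E">
  <f name="catma_markupauthor"><string>email@example.com</string></f>
  <f name="catma_displaycolor"><string>-5311554</string></f>
  <f name="My_Property_with_values"><string>my first value</string></f>
  <f name="My_Property_without_values"><string>this is an adhoc value </string></f>
</fs>
"""


def read_annotation(fs):
    """Return the id, the Tag id and the Property values of one Annotation."""
    return {
        "id": fs.get(XML_ID),
        "tag": fs.get("type"),
        "properties": {
            f.get("name"): [s.text for s in f.findall("string")]
            for f in fs.findall("f")
        },
    }


annotation = read_annotation(ET.fromstring(ANNOTATION))
print(annotation["id"])   # CATMA_651065DA-4D0A-4308-833E-83B751237B36
print(annotation["tag"])  # CATMA_FF67AFFF-98F4-41C1-8222-8E9B3BF4626E
print(annotation["properties"]["My_Property_with_values"])  # ['my first value']
```

The "tag" entry is the key for looking up the Tag definition, and with it the Tag name, in the encoding description.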
The user defined Property “My_Property_with_values” has a value from the list of predefined values: “my first value”. The user defined Property “My_Property_without_values” has a value which wasn’t predefined but added on-the-fly: “this is an adhoc value.”
Now let’s have a look at the content of the <body> element. The whole text of the Source Document is referenced by <ptr> elements. Each has a @target attribute that holds a URI to the text. The start and end offsets of the referenced text chunk are encoded as a fragment identifier of that URI. Annotated text is enclosed by a <seg> element that references the Annotations in its @ana attribute by their @xml:id values. These are the identifiers of the <fs>-encoded Annotations we explained in the previous section. The character offsets are calculated on the UTF-8 version of the Source Document, which can also be exported with CATMA.
<body>
  <ab type="catma">
    <ptr target="catma://CATMA_AF494965F045#char=0,134" type="inclusion"/>
    <seg ana="#CATMA_314549B8-5603-4031-B9BF-C162067DE3EB">
      <ptr target="catma://CATMA_AF494965F045#char=134,139" type="inclusion"/>
    </seg>
    <ptr target="catma://CATMA_AF494965F045#char=139,397" type="inclusion"/>
    <seg ana="#CATMA_8F779E0C-B902-4AF1-83A1-C250F6B9EFC7">
      <ptr target="catma://CATMA_AF494965F045#char=397,402" type="inclusion"/>
    </seg>
    <ptr target="catma://CATMA_AF494965F045#char=402,447" type="inclusion"/>
    <seg ana="#CATMA_C04A09C1-64C4-48B3-BD61-18BB9EED889B">
      <ptr target="catma://CATMA_AF494965F045#char=447,450" type="inclusion"/>
    </seg>
    <ptr target="catma://CATMA_AF494965F045#char=450,691" type="inclusion"/>
    <seg ana="#CATMA_651065DA-4D0A-4308-833E-83B751237B36">
      <ptr target="catma://CATMA_AF494965F045#char=691,703" type="inclusion"/>
    </seg>
    <ptr target="catma://CATMA_AF494965F045#char=703,147781" type="inclusion"/>
  </ab>
</body>
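The following Python sketch shows one way to pull the offsets out of the @target fragment and to list which Annotation covers which range. It works on the first few elements of the example; the names parse_target and annotated_ranges are made up for this illustration. In the example each chunk starts exactly where the previous one ends, which suggests that the end offset is exclusive, but that is an inference from the example and not something stated here.

```python
import re
import xml.etree.ElementTree as ET

# The first elements of the example above, without a namespace declaration.
BODY = """
<ab type="catma">
  <ptr target="catma://CATMA_AF494965F045#char=0,134" type="inclusion"/>
  <seg ana="#CATMA_314549B8-5603-4031-B9BF-C162067DE3EB">
    <ptr target="catma://CATMA_AF494965F045#char=134,139" type="inclusion"/>
  </seg>
  <ptr target="catma://CATMA_AF494965F045#char=139,397" type="inclusion"/>
</ab>
"""

TARGET = re.compile(r"^(?P<uri>[^#]+)#char=(?P<start>\d+),(?P<end>\d+)$")


def parse_target(target):
    """Split a @target value into (uri, start, end)."""
    match = TARGET.match(target)
    if match is None:
        raise ValueError(f"unexpected target: {target!r}")
    return match["uri"], int(match["start"]), int(match["end"])


def annotated_ranges(ab):
    """Yield (annotation_id, start, end) for every <ptr> inside a <seg>."""
    for seg in ab.iter("seg"):
        # @ana may hold several whitespace separated references like "#ID".
        ids = [ref.lstrip("#") for ref in seg.get("ana", "").split()]
        for ptr in seg.iter("ptr"):
            _, start, end = parse_target(ptr.get("target"))
            for annotation_id in ids:
                yield annotation_id, start, end


ranges = list(annotated_ranges(ET.fromstring(BODY)))
print(ranges)
# [('CATMA_314549B8-5603-4031-B9BF-C162067DE3EB', 134, 139)]
```

With the exported UTF-8 Source Document loaded as a string, the annotated text could then be looked up by slicing with these offsets, assuming that the offsets count characters of the decoded text.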
Please note that a <seg> element can carry multiple Annotations by having a whitespace-separated list of references in the @ana attribute, and a single Annotation can be referenced by multiple <seg> elements, which can be adjacent to each other but don’t need to be. We call the case where a single Annotation is referenced by two or more segments that are not adjacent to each other “discontinuous markup”.
Here you have the full version of the Annotation Collection in TEI format (right-click on it to download) and the corresponding UTF-8 Source Document (right-click on it to download).
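For readers who want to process such files, the distinction between adjacent and non-adjacent segments of one Annotation can be sketched in Python as follows. The input is a list of (annotation id, start, end) triples with made-up ids and offsets, and all function names are hypothetical. Ranges that touch are merged; an Annotation that still has more than one range afterwards is treated as discontinuous.

```python
from collections import defaultdict


def merge_adjacent(ranges):
    """Merge (start, end) ranges that touch or overlap."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def group_by_annotation(triples):
    """Map each annotation id to its merged list of ranges."""
    grouped = defaultdict(list)
    for annotation_id, start, end in triples:
        grouped[annotation_id].append((start, end))
    return {key: merge_adjacent(value) for key, value in grouped.items()}


def is_discontinuous(merged_ranges):
    """True if the ranges of one Annotation are not all adjacent."""
    return len(merged_ranges) > 1


# Made-up data: "A" covers two adjacent segments, "B" two separate ones.
triples = [("A", 10, 15), ("A", 15, 20), ("B", 30, 35), ("B", 50, 55)]
spans = group_by_annotation(triples)
print(spans)                         # {'A': [(10, 20)], 'B': [(30, 35), (50, 55)]}
print(is_discontinuous(spans["A"]))  # False
print(is_discontinuous(spans["B"]))  # True
```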