The CLARIN service center of the Zentrum Sprache at the BBAW

Data Curation at the BBAW Language Center and German Text Archive

Digital research data comprising texts from the late 16th to the early 20th century can be published via the German Text Archive (DTA) at the BBAW Language Center. The DTA Extensions (DTAE) module provides the framework for publication. The research data are converted into the DTA Base Format (DTABf), the DTA's input format, and integrated into the DTA and CLARIN-D infrastructure. This way, they are available for research in external contexts and can also be used in combination with other corpora of the BBAW Language Center.

Preliminaries for Data Curation at BBAW

Contents

Suitable extensions to the DTA’s historical corpora are primary sources that were received by a large audience, that are key texts for notable discourses or epochs, or whose other characteristics make them worthwhile objects of research today.

Ideally, texts should date from between the late 16th and the early 20th century, the time frame of the vast majority of DTA corpus texts. However, there is some tolerance beyond these limits.

Technical

The images on which the transcription is based should be available in high resolution, in an uncompressed format (TIFF), and under a license that allows further reuse. The input format for texts is the DTA Base Format (DTABf), an XML format following the TEI P5 Guidelines. Text annotation has to be carried out according to the DTABf, or according to guidelines that are compatible with the DTABf and allow lossless conversion of the texts.
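As an illustration of what "an XML format following the TEI P5 Guidelines" implies at the most basic level, the following Python sketch runs a cheap pre-check on a transcription: it tests that the text is well-formed XML, that the root element is TEI in the TEI namespace, and that a teiHeader and a text element are present. The function name precheck_tei and the sample document are assumptions made for this example. The check does not replace validation against the DTABf schemas and says nothing about DTABf conformance.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def precheck_tei(xml_text: str) -> list[str]:
    """Return a list of problems; an empty list means the pre-check passed.

    Hypothetical helper: this is NOT a DTABf validator, only a sanity check.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]

    problems = []
    if root.tag != f"{{{TEI_NS}}}TEI":
        problems.append("root element is not <TEI> in the TEI namespace")
    # A TEI document carries its metadata in <teiHeader> and its content in <text>.
    for child in ("teiHeader", "text"):
        if root.find(f"{{{TEI_NS}}}{child}") is None:
            problems.append(f"missing <{child}> element")
    return problems


# Invented sample document, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc>
    <titleStmt><title>Sample</title></titleStmt>
    <publicationStmt><p>Unpublished sample</p></publicationStmt>
    <sourceDesc><p>Invented for illustration</p></sourceDesc>
  </fileDesc></teiHeader>
  <text><body><p>Hello</p></body></text>
</TEI>"""

if __name__ == "__main__":
    print(precheck_tei(SAMPLE) or "pre-check passed")
```

A file that passes this pre-check would still need to be validated against the DTABf schemas before it can be considered DTABf-conformant.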

The CLARIN center at BBAW offers support in planning and carrying out digitization projects, especially for the following tasks:

Image Digitization
The DTA cooperates with various libraries that specialize in collecting prints from the 17th to the 20th century. We can offer advice and connect you with the libraries most suitable for your digitization task.
Recording of Metadata
It is important to record bibliographic metadata as well as information on the digital facsimile and on the process of text recognition. The CLARIN center at BBAW provides a web form that helps you record metadata according to the DTABf Guidelines and obtain a DTABf-conformant TEI header.
Text Recognition
Two workflows are possible for text recognition. The DTA can either perform the entire text recognition task according to a standardized workflow or advise users on performing text recognition themselves. For the latter, we provide comprehensive documentation and schemas of the DTABf. In addition, a framework for the Author mode of the oXygen XML Editor is available, which facilitates text annotation in a WYSIWYG-like view.

DTAE Checklist

The following information is usually collected in order to estimate the effort required to integrate texts into the DTA:

General Information about the Document(s)

Short description of text selection criteria
Time frame of text origins (publication dates)
Language
Text type/Genre
Discourse
Print or manuscript
Complete or partial
Extent

General Information about the Project

Time frame of background project
Time frame for data integration
Responsible person(s)/institution(s)
Website of project
Contact person(s) and address

Text Recognition and Annotation

Text recognition (TR) completed y/n
Edition number of source
Metadata at hand y/n
Metadata format
Guidelines for TR
Deviation and general distance of TR guidelines from the DTA guidelines
Person/institution/company responsible for text digitization
Estimated quality of text recognition
Text annotation y/n
Guidelines for text annotation
Text annotation with TEI y/n
Text annotation according to DTABf y/n
Estimated effort necessary for conversion into DTABf
Text annotation in XML y/n
Licenses/conditions for reuse
Disclosure risk present y/n
Anonymization necessary y/n

Images

Images at hand y/n
Holding library and shelfmark
Format
License/conditions for reuse

Publication

Link of external publication
Self-link of DTAE publication
Static or dynamic corpus
Scheduled extensions

This information is also needed to create minimal metadata records for the DTA publication of each text.
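For projects that want to track their answers to the checklist programmatically, a record could be sketched as follows. This is a minimal illustration, not a DTA data model: the class ChecklistRecord, its field names (which paraphrase only a few of the checklist items), the helper open_items, and the example values are all assumptions made for this sketch.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ChecklistRecord:
    """Hypothetical record covering a subset of the checklist items.

    None means "not answered yet"; y/n items are stored as booleans.
    """
    selection_criteria: Optional[str] = None
    publication_dates: Optional[str] = None
    language: Optional[str] = None
    genre: Optional[str] = None
    print_or_manuscript: Optional[str] = None
    text_recognition_completed: Optional[bool] = None
    metadata_at_hand: Optional[bool] = None
    annotated_in_dtabf: Optional[bool] = None
    images_at_hand: Optional[bool] = None
    image_license: Optional[str] = None


def open_items(record: ChecklistRecord) -> list[str]:
    """List the fields that have not been answered yet.

    An explicit "no" (False) counts as answered; only None is open.
    """
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Invented example values, for illustration only.
example = ChecklistRecord(
    language="German",
    genre="Sermon",
    text_recognition_completed=True,
    images_at_hand=False,
)

if __name__ == "__main__":
    for name in open_items(example):
        print("still open:", name)
```

Keeping unanswered items distinct from items answered with "no" makes it easier to see which information is still missing before the integration effort can be estimated.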
