Data Curation at the BBAW Language Center and German Text Archive
Digital research data of texts from the late 16th to the early 20th century can be published via the German Text Archive (DTA) at the BBAW Language Center. The module DTA Extensions provides the framework for such publications. The research data are converted into the DTA Base Format (DTABf) and integrated into the DTA and CLARIN-D infrastructure. This way, they are not only available for research in external contexts but can also be used in combination with other corpora of the BBAW Language Center. For example, texts by Alexander von Humboldt were gathered from four different projects and can now be analysed as one corpus within the DTA infrastructure. Similarly, a corpus of historical newspapers, gathered from various sources and still being extended continually, can be queried.
Preliminaries for Data Curation at BBAW
Suitable extensions to the DTA’s historical corpora are primary sources which were received by a large audience, which are key texts for notable discourses or epochs, or which by other characteristics merit being an object of research today.
Ideally, texts should date from between the late 16th and the early 20th century, the time frame of the vast majority of DTA corpus texts. However, there is some tolerance beyond these limits.
The images on which the transcription is based should be available in high resolution, in an uncompressed format (TIFF), and under a license that allows for further reuse.
The input format for texts is the DTA Base Format (DTABf), an XML format following the TEI P5 Guidelines. Text annotation has to be carried out according to the DTABf, or according to guidelines which are compatible with the DTABf, allowing for lossless conversion of the texts.
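As a rough illustration of what "an XML format following the TEI P5 Guidelines" implies at the most basic level, the following Python sketch checks that a document is well-formed XML, has a `TEI` root element in the TEI namespace, and contains a `teiHeader` and a `text` element. It tests generic TEI P5 structure only and is no substitute for validation against the DTABf schemas; the function name `basic_tei_check` and the sample document are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the TEI P5 Guidelines.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def basic_tei_check(xml_text: str) -> list[str]:
    """Return a list of problems found; an empty list means the basic checks passed.

    This only tests generic TEI P5 structure, not DTABf conformance.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]

    problems = []
    if root.tag != f"{{{TEI_NS}}}TEI":
        problems.append(f"unexpected root element: {root.tag}")
    # teiHeader and text are expected as direct children of the root.
    for child in ("teiHeader", "text"):
        if root.find(f"{{{TEI_NS}}}{child}") is None:
            problems.append(f"missing <{child}> element")
    return problems


# Hypothetical minimal document, used only to exercise the check.
SAMPLE = f"""<TEI xmlns="{TEI_NS}">
  <teiHeader><fileDesc/></teiHeader>
  <text><body><p>Beispiel</p></body></text>
</TEI>"""

print(basic_tei_check(SAMPLE))
```

A check of this kind can catch gross structural problems early, before a text is submitted for schema validation and conversion.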
The CLARIN centre at BBAW offers support for planning and carrying out digitization projects, especially for the following tasks:
- Image Digitization
The DTA cooperates with various libraries which specialize in collecting prints from the 17th to the 20th century. We can offer advice and connect you with the libraries most suitable for your digitization task.
- Recording of Metadata
It is important to record bibliographic metadata as well as information on the digital facsimile and on the process of text recognition. The CLARIN centre at BBAW provides a web form that facilitates the proper recording of metadata according to the DTABf Guidelines and yields a DTABf-conformant TEI header.
- Text Recognition
For text recognition, two workflows are possible. The DTA can either perform the entire text recognition task according to a standardized workflow, or advise users in performing text recognition on their own. For the latter, we provide comprehensive documentation and schemas of the DTABf. Additionally, a framework for the Author mode of the oXygen XML Editor may be used which facilitates text annotation in a
Finally, the XSL transformation script for DTABf to HTML is freely available, so that annotators may easily visualize and check the outcome of their text.
The following information is usually collected in order to estimate the effort necessary for the integration of texts into the DTA:
General Information about the Document(s)
- Short description of text selection criteria
- Time frame of text origins (publication dates)
- Text type/Genre
- Print or manuscript
- Complete or partial
General Information about the Project
- Time frame of background project
- Time frame for data integration
- Responsible person(s)/institution(s)
- Website of project
- Contact person(s) and address
Text Recognition and Annotation
- Text recognition (TR) completed y/n
- Edition number of source
- Metadata at hand y/n
- Metadata format
- Guidelines for TR
- Deviation and general distance of TR guidelines from the DTA guidelines
- Person/institution/company responsible for text digitization
- Estimated quality of text recognition
- Text annotation y/n
- Guidelines for text annotation
- Text annotation with TEI y/n
- Text annotation according to DTABf y/n
- Estimated effort necessary for conversion into DTABf
- Text annotation in XML y/n
- Licenses/conditions for reuse
- Disclosure risk present y/n
- Anonymization necessary y/n
- Images at hand y/n
- Holding library and shelf mark
- License/conditions for reuse
- Link of external publication
- Self-link of DTAE publication
- Static or dynamic corpus
- Scheduled extensions
This information is also needed to create minimal metadata records for the DTA publication of each text.
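The checklist above can be thought of as one structured record per text. The following Python sketch shows one way such an intake record might be modelled so that unanswered questions are easy to spot. The class `IntakeRecord`, its field names, and the sample values are hypothetical; they cover only a few checklist items and do not represent an actual DTA data model.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class IntakeRecord:
    """A few checklist items for one text; None means 'not yet answered'."""
    text_type: Optional[str] = None             # Text type/Genre
    print_or_manuscript: Optional[str] = None   # "print" or "manuscript"
    complete: Optional[bool] = None             # Complete or partial
    tr_completed: Optional[bool] = None         # Text recognition completed y/n
    annotated_in_dtabf: Optional[bool] = None   # Annotation according to DTABf y/n
    images_at_hand: Optional[bool] = None       # Images at hand y/n
    license: Optional[str] = None               # Licenses/conditions for reuse

    def open_items(self) -> list[str]:
        """Names of checklist items that still lack an answer."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Hypothetical example values.
record = IntakeRecord(text_type="newspaper", print_or_manuscript="print",
                      tr_completed=True, images_at_hand=False)
print(record.open_items())  # ['complete', 'annotated_in_dtabf', 'license']
```

Using `None` for unanswered items keeps "no" (`False`) distinct from "unknown", which matters for the y/n questions in the checklist.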