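Jak stworzyć korpus równoległy „dla wszystkich”? O pracy nad Polsko-Niemieckim i Niemiecko-Polskim Korpusem Równoległym [How to create a parallel corpus “for all”? On the work on the Polish-German and German-Polish Parallel Corpus]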
von Waldenfels, Ruprecht
The article describes the Polish-German and German-Polish Parallel Corpus, currently under development under the auspices of the University of Mainz, Germany. The corpus comprises about 1 million tokens of text in both translation directions and from various genres, at present mainly press and fictional prose. Future expansion to other genres is planned, e.g. legal documents and other specialized text types. The texts are tagged, lemmatized, and automatically aligned at sentence and word level using standard tools (UPlug, Hunalign).
The article focuses on a new query interface that was developed on the basis of the existing ParaVoz interface and published as open source. The interface aims to be “for all” in the sense that it offers a graphical query builder and also lets the user enter sophisticated CQP queries directly. It thus combines ease of use with access to the full power of the CQP query language, a close relative of the query language used in the IPI PAN query interface to the NKJP. Besides being convenient, the interface has an educational aspect: inexperienced users can watch correct CQP queries being constructed on the fly from their choices in the graphical interface, which helps them learn a query language that is straightforward but formally and technically strict. The interface thus flattens what is often a steep learning curve for users who are not used to such query languages, such as many traditionally inclined linguists. The interface is available in German, Polish and English and is implemented in AngularJS, a modern framework that affords smooth interaction and uncomplicated customization and maintenance. Search facilities include queries by lemma and grammatical tag, as well as filtering of results by metadata, for example source language and genre.
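To illustrate how choices made in a graphical query builder might translate into a CQP query string, here is a minimal sketch in Python. It is not the interface's actual code (which is implemented in AngularJS). The attribute names `lemma` and `tag`, the `TokenChoice` class and the function names are hypothetical, and the sketch assumes that user input should be matched literally, so regular-expression metacharacters are escaped.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence

# Characters with a special meaning inside a CQP regular-expression value.
_CQP_SPECIAL = re.compile(r'([\\"\[\](){}.*+?|^$])')


def escape_value(value: str) -> str:
    """Backslash-escape metacharacters so the value is matched literally."""
    return _CQP_SPECIAL.sub(r"\\\1", value)


@dataclass
class TokenChoice:
    """One token slot as filled in by the user (hypothetical structure)."""
    lemma: Optional[str] = None
    tag: Optional[str] = None


def token_expression(choice: TokenChoice) -> str:
    """Build one bracketed CQP token expression; [] matches any token."""
    conditions = []
    if choice.lemma:
        # Attribute name "lemma" is an assumption about the corpus encoding.
        conditions.append(f'lemma="{escape_value(choice.lemma)}"')
    if choice.tag:
        # Attribute name "tag" is an assumption about the corpus encoding.
        conditions.append(f'tag="{escape_value(choice.tag)}"')
    return "[" + " & ".join(conditions) + "]"


def build_cqp_query(choices: Sequence[TokenChoice]) -> str:
    """Join the token expressions into a sequence query ending in ';'."""
    return " ".join(token_expression(c) for c in choices) + ";"


if __name__ == "__main__":
    print(build_cqp_query([TokenChoice(lemma="iść"), TokenChoice(tag="subst")]))
```

Showing the generated string next to the form fields is what would give such a builder its educational value: each change in the form is immediately reflected in the query text.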
The queries generated in this interface are evaluated by an Open Corpus Workbench (CWB) backend that has been modified to output XML. The output is transformed to HTML using client-side XSLT. Unlike earlier versions of the interface, word alignment is now routinely visualized: the equivalents of the word forms matched by the query in the first language are highlighted in the results in the second language. The article gives an in-depth description of the rationale and the solutions adopted, and concludes with an outlook on future developments.
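The idea behind the word-alignment highlighting can be sketched as follows. This is only an illustration under stated assumptions: the interface described in the article produces its HTML from XML via client-side XSLT, whereas this sketch builds the markup directly in Python. The function name, the representation of the alignment as (source index, target index) pairs, and the use of the `<mark>` element are all hypothetical.

```python
from html import escape
from typing import Iterable, Sequence, Tuple


def highlight_aligned(
    target_tokens: Sequence[str],
    alignment: Iterable[Tuple[int, int]],
    matched_source: Iterable[int],
) -> str:
    """Wrap target tokens aligned to matched source tokens in <mark>.

    alignment holds (source_index, target_index) pairs; matched_source
    holds the indices of source tokens that the query matched.
    """
    matched = set(matched_source)
    # Target positions linked to at least one matched source token.
    hits = {t for s, t in alignment if s in matched}
    parts = []
    for i, token in enumerate(target_tokens):
        text = escape(token)  # keep the output safe to embed in HTML
        parts.append(f"<mark>{text}</mark>" if i in hits else text)
    return " ".join(parts)


if __name__ == "__main__":
    # German source: "Ich gehe nach Hause"; the query matched "gehe" (index 1).
    target = ["Idę", "do", "domu"]
    links = [(0, 0), (1, 0), (2, 1), (3, 2)]
    print(highlight_aligned(target, links, [1]))
```

Because several source tokens may be aligned to one target token, and vice versa, the sketch collects target positions in a set so that each token is wrapped at most once.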
Using this material is possible in accordance with the relevant provisions of fair use or other exceptions provided by law. Other use requires the consent of the holder.