Institutional Excellence Program for the Higher Education (2018-)

Problem Solving Systems

Project team
  • Lead
    • Gábor Palkó
  • Researchers
    • Péter Horváth
    • Balázs Indig
    • Eszter Kovács
    • Tünde Molnár
    • Ádám Smrcz
    • Emma Takács
  • PhD students
    • Botond Szemes
    • Ádám Sebestyén

The research program is carried out within the framework of the Institutional Excellence Program for the Higher Education, established in 2018. The research focuses on the characteristics of, and differences between, analogue and digital approaches to textual analysis. Problem-solving methods differ considerably, especially when the problems themselves move from the analogue to the digital. The research is closely linked to the international trend of “distant reading” (Franco Moretti), which is now one of the leading fields in the humanities. Distant reading uses computational tools that, in contrast to the slow, close reading of a limited corpus, can handle enormous amounts of text.

Natural Language Processing

Hungarian is an agglutinative language. Compared to English, the amount of text needed to properly train a machine learning algorithm is much larger. The number of morphological variants per word is two orders of magnitude greater, which results in disproportionately more infrequent word forms (due to Zipf’s law). These forms do not occur in the same amount of training material and are therefore unseen during training. Because Hungarian literary texts of the various genres are available only in low volume, examining them properly requires many ingenious Hungarian-specific preprocessing modules, which are in constant development to keep up with current trends in natural language processing (NLP). These trends are claimed to be ‘language independent’, but in reality they are ‘English-driven’.
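The effect of rich morphology on unseen word forms can be illustrated with a minimal sketch. It assumes a toy vocabulary of inflected forms of "ház" (house) and "kert" (garden), and a hand-written lemma table (`LEMMAS`, a hypothetical stand-in for a real morphological analyser; it is not part of any existing tool). The sketch compares the share of unseen tokens before and after reducing the word forms to lemmas.

```python
def oov_rate(train_tokens, test_tokens):
    """Share of test tokens whose form never occurred in the training tokens."""
    vocabulary = set(train_tokens)
    unseen = [token for token in test_tokens if token not in vocabulary]
    return len(unseen) / len(test_tokens)


# Toy data: a few inflected forms seen in training, others only at test time.
train = ["ház", "házban", "kert", "kertek"]
test = ["házak", "házakban", "kertben", "ház"]

# Hypothetical lemma lookup standing in for a real morphological analyser.
LEMMAS = {
    "ház": "ház", "házban": "ház", "házak": "ház", "házakban": "ház",
    "kert": "kert", "kertek": "kert", "kertben": "kert",
}


def lemmatize(tokens):
    """Map each token to its lemma; unknown tokens are kept unchanged."""
    return [LEMMAS.get(token, token) for token in tokens]


# Three of the four surface forms in the test data are unseen ...
surface_oov = oov_rate(train, test)
# ... but every lemma in the test data was already seen in training.
lemma_oov = oov_rate(lemmatize(train), lemmatize(test))

print(f"surface forms: {surface_oov:.2f}, lemmas: {lemma_oov:.2f}")
```

In this toy setting, preprocessing removes the sparsity entirely. Real corpora are far less tidy, which is why the quality of the Hungarian-specific modules matters.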

The Center relies heavily on these tools for many applications, including but not limited to stylometry and the semantic web. The most current Hungarian NLP pipeline is developed by professional computational linguists using well-founded, mature methods that have proved to work for Hungarian, but these tools still have only a small number of applications. As resources are limited, our primary focus in their development is standardisation and code reuse, so that we can utilize as many usable language-independent parts of existing tools as possible while keeping the highest quality that can be achieved in technical and linguistic terms.

We are participating in the application-centric development and standardization of the e-magyar language processing system. Our goal is to make it a modular, open, and standards-compliant platform that meets the needs of the digital humanities and can be extended further on demand. The Center also gathers current Hungarian-language text from the internet into large corpora for web archiving purposes. Once these corpora have been processed with the e-magyar system into richly annotated data, they can be used with the currently popular neural-network-based vector embedding models, which the digital humanities and social sciences need.

Handwritten Text Recognition Using Machine Intelligence

In cooperation with the Institute for Literary Studies of the Hungarian Academy of Sciences (MTA ITI), ELTE.DH is currently working on the automatic recognition and transcription of handwritten letters by János Arany (1817-1882). The project relies on Transkribus, a comprehensive platform for the automated recognition, transcription, and searching of historical documents. Because it uses machine learning, it can in theory be applied to documents in any language, layout, or style. Once the manuscript images are uploaded to the Transkribus server, it can automatically recognise text regions as well as the baseline of each line. Automated text recognition requires at least 100 pages of transcribed images, which are used as training data for the machine learning algorithms. Transkribus offers a number of tools not only for word-by-word transcription but also for tagging metadata: personal names, places, dates, and organizations can all be marked with the corresponding tags.

When the training data is ready, a Handwritten Text Recognition (HTR) model can be created with the help of the Transkribus team. Accuracy differs from project to project, as it depends on a number of factors such as the quantity of the training data and the consistency of the handwriting, but there are promising projects with a Character Error Rate of around 5%.
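For orientation, the Character Error Rate is commonly defined as the edit distance between the automatic and the manual transcription, divided by the length of the manual one. The following is a minimal sketch of that common definition. It is not necessarily how Transkribus computes its figures, and the function names and sample strings are our own illustrative assumptions.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of character insertions, deletions and substitutions
    needed to turn `hypothesis` into `reference`."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from hypothesis
                current[j - 1] + 1,      # extra character in hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Invented sample: manual transcription versus a hypothetical HTR output.
manual = "handwritten letter"
automatic = "handwriten lettor"
print(f"CER: {character_error_rate(manual, automatic):.1%}")
```

Under this definition, a rate of around 5% corresponds to roughly one wrong character in every twenty.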

Annotation of Large-Scale Poetic Corpora

The goal of the research project is to create an annotated corpus of 19th- and 20th-century Hungarian poetry. The annotation is based on automatic methods. We annotate the structural elements of poems, and we are developing a program that annotates rhythm, rhyme patterns, and alliterations. At the word level, we also annotate part of speech and morphological and phonological features. In collaboration with the Research Group in Stylistics at ELTE, we are developing a subcorpus in which verbal constructions are annotated manually. This research focuses on person-marking constructions in poetic discourse. Currently, we are working on the annotation scheme for the manual annotation. In the future, we plan to broaden the scope of the project by creating subcorpora containing less canonical poetic texts, such as song lyrics and slam poetry. We hope that the corpus will be useful not only in literary and linguistic research but in education as well. One of our aims is to make the corpus accessible to everyone.
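To give an idea of what automatic annotation of rhyme patterns and alliterations involves, here is a deliberately crude sketch. It is not the program under development: it assumes that two lines rhyme when their last two letters match and that alliteration means adjacent words sharing an initial letter. A real annotator for Hungarian would have to work on phonological features (digraphs such as "sz" or "cs", vowel length, and so on). All function names and the English sample stanza are invented for illustration.

```python
import re
from string import ascii_lowercase


def words_of(line):
    """Lowercased word tokens of a line."""
    return re.findall(r"\w+", line.lower())


def rhyme_key(line, length=2):
    """Last `length` letters of the final word: a crude stand-in for a rhyme."""
    words = words_of(line)
    return words[-1][-length:] if words else ""


def rhyme_scheme(lines, length=2):
    """Label each line with a letter; equal rhyme keys get the same letter.
    Supports at most 26 distinct endings, which is enough for one stanza."""
    labels = {}
    scheme = []
    for line in lines:
        key = rhyme_key(line, length)
        if key not in labels:
            labels[key] = ascii_lowercase[len(labels)]
        scheme.append(labels[key])
    return "".join(scheme)


def alliterations(line):
    """Pairs of adjacent words that start with the same letter."""
    words = words_of(line)
    return [(a, b) for a, b in zip(words, words[1:]) if a[0] == b[0]]


# Invented sample stanza, used only to exercise the functions.
stanza = [
    "the river runs below the hill",
    "the evening sky is dark and deep",
    "the silent stones are standing still",
    "the weary wanderers fall asleep",
]
print(rhyme_scheme(stanza))
for line in stanza:
    print(alliterations(line))
```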

Stylometry

The goal of this research is to better understand the R programming language and the ‘stylo’ package in order to apply them in stylometric analysis. Stylometry can produce data about the linguistic and thematic structure of texts that human readers could not produce on their own. In contrast to human interpretation, computational analysis focuses on different levels of the text (e.g. the micro level, such as word frequency, or the macro level, such as the analysis of thousands of texts at once) and thus reveals different patterns. The ‘stylo’ package for R was created for this exact purpose. Plans for the future include the stylometric analysis of the works of Péter Nádas, chosen for the size of the texts, their thematic complexity, and their linguistic structure.
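The word-frequency level mentioned above can be sketched in a few lines. The example below is written in Python and does not use or reproduce the API of the ‘stylo’ package. It assumes a simple approach: each text is described by the relative frequencies of the corpus's most frequent words, and two texts are compared by the Manhattan distance between these profiles. This is a simpler stand-in for established measures such as Burrows's Delta. The function names and the toy texts are our own assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercased word tokens of a text."""
    return re.findall(r"\w+", text.lower())


def most_frequent_words(texts, n):
    """The n most frequent words over all texts in the corpus."""
    counts = Counter()
    for text in texts.values():
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(n)]


def profile(text, features):
    """Relative frequency of each feature word in one text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty texts
    return [counts[word] / total for word in features]


def manhattan(p, q):
    """Sum of absolute differences between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Invented toy corpus; real analyses use whole novels and hundreds of words.
corpus = {
    "text_a": "the cat and the dog",
    "text_b": "the cat and the dog",
    "text_c": "a bird sings a song",
}
features = most_frequent_words(corpus, 3)
profiles = {name: profile(text, features) for name, text in corpus.items()}
print(manhattan(profiles["text_a"], profiles["text_b"]))
print(manhattan(profiles["text_a"], profiles["text_c"]))
```

Identical texts have distance zero, while texts with different function-word habits drift apart. Stylometric tools turn such distances into cluster diagrams over many texts.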

Prosopographical Database Development

The research focuses on the 15th-16th-century humanist network of East Central Europe and on the possible digital exploitation (network analysis, construction of a database, data visualization, etc.) of the forthcoming book of the Lendület Research Team (HECE) led by Gábor Farkas Kiss, entitled Companion to Humanism in East Central Europe. The researchers analyse the biographies of Central European humanists to find common features in their careers and to establish a model for the construction of a prosopographical database. Building on their preliminary experience, they are currently preparing the HECEdata Policy, which focuses on consistent data curation. The SPARQL endpoint of HECEdata enables particularly effective control over the relational network of the raw database.
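The idea of deriving a network from biographical records can be sketched as follows. The schema of HECEdata and the vocabulary of its SPARQL endpoint are not described here, so the example does not query it. It assumes instead a hypothetical record format in which each person is linked to a set of career stations, and it connects two persons whenever they share a station. All names and stations are invented placeholders, not data from the project.

```python
from collections import defaultdict
from itertools import combinations

# Invented placeholder records (person -> career stations); not HECEdata content.
careers = {
    "Person A": {"University 1", "Court X"},
    "Person B": {"University 1", "Chapter Y"},
    "Person C": {"Court X", "Chapter Y"},
    "Person D": {"University 2"},
}


def shared_station_edges(records):
    """Connect every pair of persons who share at least one career station."""
    edges = {}
    for a, b in combinations(sorted(records), 2):
        shared = records[a] & records[b]
        if shared:
            edges[(a, b)] = sorted(shared)
    return edges


def degrees(edges):
    """Number of connections per person in the derived network."""
    counts = defaultdict(int)
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return dict(counts)


edges = shared_station_edges(careers)
for (a, b), stations in edges.items():
    print(f"{a} -- {b}: {', '.join(stations)}")
print(degrees(edges))
```

Persons without any shared station, such as "Person D" in the sample, stay outside the network. In curated data, such isolated records are often worth checking for missing information.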