ParliSearch – a system for large text corpus discourse analysis

The 7th Conference

Human Language Technologies - the Baltic Perspective

Riga, Latvia
October 6-7, 2016


Roberts DARĢIS (a), Guna RĀBANTE-BUŠA (a), Ilze AUZIŅA (a) and Sergejs KRUKS (b)

(a) Institute of Mathematics and Computer Science, University of Latvia

(b) Rīga Stradiņš University

 

ParliSearch is a system that enables easy discourse analysis of large text corpora by providing stem search combined with additional search criteria. The system contains verbatim reports of debates from plenary sittings of the European Parliament and the Saeima (the Parliament of Latvia).
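As an illustration of what stem search with additional criteria could look like, here is a minimal Python sketch. It is not the ParliSearch implementation: the `Utterance` record, the `stem_search` function, the choice of speaker and year as filter criteria, and the sample sentences are all hypothetical, and a stem is simply assumed to match any word that begins with it.

```python
import re
from dataclasses import dataclass


@dataclass
class Utterance:
    # Hypothetical record layout; the real corpus schema is not described here.
    speaker: str
    year: int
    text: str


def stem_search(utterances, stem, speaker=None, year_range=None):
    """Return utterances containing a word that starts with `stem`.

    Optional filters (assumed for illustration): an exact speaker name and
    an inclusive (first_year, last_year) range.
    """
    # \b and \w are Unicode-aware for str patterns, so Latvian letters work.
    pattern = re.compile(r"\b" + re.escape(stem) + r"\w*", re.IGNORECASE)
    hits = []
    for u in utterances:
        if speaker is not None and u.speaker != speaker:
            continue
        if year_range is not None and not (year_range[0] <= u.year <= year_range[1]):
            continue
        if pattern.search(u.text):
            hits.append(u)
    return hits


# Invented sample data, for demonstration only.
SAMPLE = [
    Utterance("A", 1995, "Budžeta projekts ir iesniegts."),
    Utterance("B", 2003, "Mēs runājam par budžetu."),
    Utterance("A", 2010, "Izglītības reforma ir svarīga."),
]
```

Calling `stem_search(SAMPLE, "budžet")` matches both inflected forms of the word, while adding `speaker="A"` or `year_range=(2000, 2014)` narrows the result to a single utterance each.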

Two corpora are currently available in the ParliSearch system:

  • saeima.korpuss.lv – the Corpus of the Saeima (the Parliament of Latvia)

  • europarl.korpuss.lv – the Corpus of the EuroParl (the European Parliament).

The Corpus of the Saeima

The data for the Corpus of the Saeima were taken from the Saeima's website, which publishes transcriptions of all Saeima sessions as well as information about the members of parliament.

The corpus contains transcriptions of plenary sittings of the Saeima from seven parliamentary terms (the 5th to the 11th), covering 1993 to 2014. There were 17 governments during this period.

The transcriptions in the Corpus of the Saeima contain 4.4 million tokens in 465,644 utterances by 647 speakers. Speakers are grouped into 7 categories and 83 subcategories.

The Corpus of the EuroParl

The material – transcriptions, translations and information about speakers – for the Corpus of the EuroParl was collected from the Talk of Europe project.

The transcriptions in the Corpus of the EuroParl contain 2,769,433 utterances in 24 languages by 2,260 speakers from 28 countries. Speakers are grouped into 28 categories.

Both the original and the translated versions of utterances are included, covering three parliamentary terms (1999-2014).