#101: Getting to Know Your Corpus
Adam Kilgarriff
Corpora are not easy to get a handle on. The usual way of getting to grips with text is to read it, but corpora are mostly too big to read (and not designed to be read). We show, with examples, how keyword lists (of one corpus vs. another) are a direct, practical and fascinating way to explore the characteristics of corpora, and of text types. Our method is to classify the top one hundred keywords of corpus1 vs. corpus2, and corpus2 vs. corpus1. This promptly reveals a range of contrasts between all the pairs of corpora we apply it to. We also present improved maths for keywords, and briefly discuss quantitative comparisons between corpora. All the methods discussed (and almost all of the corpora) are available in the Sketch Engine, a leading corpus query tool.
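As a minimal illustration of the keyword-list method (a simple smoothed frequency-ratio score, not necessarily the improved maths the paper itself presents), one might rank words of corpus1 against corpus2 like this:

```python
from collections import Counter

def keywords(corpus1_tokens, corpus2_tokens, k=100.0, top_n=100):
    """Rank words of corpus1 against corpus2 by a smoothed ratio of
    normalised frequencies (per million words). The constant k damps
    the effect of very rare words."""
    c1, c2 = Counter(corpus1_tokens), Counter(corpus2_tokens)
    n1, n2 = sum(c1.values()), sum(c2.values())
    scores = {}
    for word, f1 in c1.items():
        fpm1 = 1e6 * f1 / n1                 # frequency per million in corpus1
        fpm2 = 1e6 * c2.get(word, 0) / n2    # frequency per million in corpus2
        scores[word] = (fpm1 + k) / (fpm2 + k)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Classifying keywords(c1, c2) and keywords(c2, c1) by hand is the
# exploration step the abstract describes.
```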
#102: Coreference Resolution: to What Extent Does it Help NLP Applications?
Ruslan Mitkov, Richard Evans, Constantin Orasan, Iustin Dornescu, Miguel Rios
This paper describes a study of the impact of coreference resolution on NLP applications. Further to our previous study \citep{mitkov-DAARC-07}, in which we investigated whether anaphora resolution could be beneficial to NLP applications, we now seek to establish whether a different but related task --- coreference resolution --- could improve the performance of three NLP applications: text summarisation, recognising textual entailment and text classification. The study discusses experiments in which the aforementioned applications were implemented in two versions, one in which the BART coreference resolution system was integrated and one in which it was not, and then tested on input text. The paper discusses the results obtained.
#455: 2B -- Testing Past Algorithms in Nowadays Web
Hugo Rodrigues, Luísa Coheur
In this paper we look into Who Wants to Be a Millionaire, a contest of multiple-choice questions, as an answer selection subproblem. Answer selection, in Question Answering systems, is the step that boosts one or more correct candidate answers over the rest of the candidate set. In this subproblem we consider only a set of four candidate answers, one of which is correct. The platform we built is language independent and supports languages other than English with no extra effort. In this paper we compare several techniques for answer selection, applying them to both English and Portuguese in the context of Who Wants to Be a Millionaire. The results show that the strategy can be applied to more than one language without hurting its performance, reaching accuracies of around 73%.
#434: A Bilingual HMM-based Speech Synthesis System
Tadej Justin, Miran Pobar, Ivo Ipšić, France Mihelič, Janez Žibert
In this paper we investigate a bilingual HMM-based speech synthesis system developed for the Slovenian and Croatian languages. The primary goals of this research are to investigate the performance of an HMM-based synthesis built from two similar languages and to compare such a synthesis system with standard monolingual speaker-dependent HMM-based synthesis. The bilingual HMM synthesis is built by joining all the speech material from both languages, defining a proper mapping between Slovenian and Croatian phonemes, and adapting the acoustic models of Slovenian and Croatian speakers. The adapted acoustic models then serve as basic building blocks for speech synthesis in both languages. In this way we are able to obtain synthesized speech in both languages, but with the same speaker's voice. We make a quantitative comparison of this kind of synthesis with its monolingual counterparts and study the performance of the synthesis in relation to the amount of data used for building the synthesis system.
#470: A Genetic Programming Experiment in Grammar Engineering
Marcin Junczys-Dowmunt
This paper describes an experiment in grammar engineering for a shallow syntactic parser using Genetic Programming and a treebank. The goal of the experiment is to improve the Parseval score of a previously manually created seed grammar. We illustrate the adaptation of the Genetic Programming paradigm to the problem of grammar engineering and describe the genetic operators used. The performance of the evolved grammar after 1,000 generations on an unseen test set is improved by 2.7 points of F-score (3.7 points on the training set). Despite the large number of generations, no overfitting effect is observed.
#491: A Manually Annotated Corpus of Pharmaceutical Patents
Márton Kiss, Ágoston Nagy, Veronika Vincze, Attila Almási, Zoltán Alexin, János Csirik
The language of patent claims differs from ordinary language to a great extent, so tools specifically adapted to patent language are needed for patent processing. In order to evaluate these tools, manually annotated patent corpora are necessary. Thus, we constructed a corpus of English-language pharmaceutical patents belonging to the class A61K, on which several layers of manual annotation (such as named entities, keys, NucleusNPs, quantitative expressions, heads and complements, perdurants) were carried out and on which tools for patent processing can be evaluated.
#450: A New Annotation Tool for Aligned Bilingual Corpora
Georgios Petasis, Mara Tsoumari
This paper presents a new annotation tool for aligned bilingual corpora, which allows the annotation of a wide range of information, ranging from information about words (such as part-of-speech tags or named-entities) to quite complex annotation schemas involving links between aligned segments, such as co-reference or translation equivalence between aligned segments in the two languages. The annotation tool is implemented as a component of the Ellogon language engineering platform, exploiting its extensive annotation engine, its cross-platform abilities and its linguistic processing components, if such a need arises. The new annotation tool is distributed with an open source license (LGPL), as part of the Ellogon language engineering platform.
#367: A Romanian Language Corpus for a Commercial Text-To-Speech Application
Mihai Alexandru Ordean, Andrei Saupe, Mihaela Ordean, Gheorghe Cosmin Silaghi, Corina Giurgea
Text and speech corpora are a prerequisite for the development of an effective commercial text-to-speech system using concatenative technology. Given that such a system needs to synthesize both common and domain-specific discourses, the corpora under consideration are of major importance. This paper presents the authors' experience in creating a corpus for the Romanian language, designed to support a concatenative TTS system able to reproduce common and domain-specific sentences with naturalness.
#480: A Space-Efficient Phrase Table Implementation
Marcin Junczys-Dowmunt
We describe the structure of a space-efficient phrase table for phrase-based statistical machine translation with the Moses decoder. The new phrase table can be used in-memory or be partially mapped on-disk. Compared to the standard Moses on-disk phrase table implementation a size reduction by a factor of 6 is achieved. The focus of this work lies on the source phrase index which is implemented using minimal perfect hash functions. Two methods are discussed that reduce the memory consumption of a baseline implementation.
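The core idea, indexing source phrases compactly and storing only enough information to look up their target options, can be approximated in a few lines. The toy sketch below substitutes an ordinary hash map plus short fingerprints for a true minimal perfect hash; it is an illustration of the general technique, not the Moses implementation described in the paper.

```python
import hashlib

class CompactPhraseIndex:
    """Toy stand-in for a perfect-hash source-phrase index: stores only a
    short fingerprint per source phrase plus an offset into an external
    (e.g. on-disk) block of encoded target phrases."""
    def __init__(self):
        self._table = {}  # fingerprint -> offset of the target-phrase block

    @staticmethod
    def _fingerprint(phrase, bits=32):
        digest = hashlib.md5(phrase.encode("utf-8")).digest()
        return int.from_bytes(digest[: bits // 8], "little")

    def add(self, source_phrase, offset):
        self._table[self._fingerprint(source_phrase)] = offset

    def lookup(self, source_phrase):
        # May return a wrong offset on a fingerprint collision, mirroring the
        # small false-positive rate such compact structures usually accept.
        return self._table.get(self._fingerprint(source_phrase))
```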
#467: A Type-Theoretical Wide-Coverage Computational Grammar for Swedish
Malin Ahlberg, Ramona Enache
The work describes a wide-coverage computational grammar for Swedish. It is developed using GF (Grammatical Framework), a functional language specialized for grammar programming. We trained and evaluated the grammar by using Talbanken, one of the largest treebanks for Swedish. As a result 65% of the Talbanken trees were translated into the GF format in the training stage and 76% of the noun phrases were parsed during the evaluation. Moreover, we obtained a language model for Swedish which we use for disambiguation.
#464: Acoustic Segmentation Using Group Delay Functions...
Srikanth R. Madikeri, Hema A. Murthy
In this paper, a new approach to keyword spotting is presented that uses event-based signal processing to obtain approximate locations of sub-word units. A segmentation algorithm based on group delay functions is used to determine the boundaries of these sub-word units. The sub-word units are then used as individual inputs to an unconstrained-endpoints dynamic time warping (UE-DTW) template matching algorithm. Appropriate score normalisation is performed using scores of background words. The technique is tested using MFCC and Modified Group Delay features. A performance gain of 13.7% (relative) over the baseline is observed for clean speech. Further, for noisy speech, the degradation is graceful.
#406: Actionable Clause Detection from Non-imperative Sentences
Jihee Ryu, Yuchul Jung, Sung-Hyon Myaeng
Constructing a sophisticated experiential knowledge base for solving daily problems is essential for many intelligent human-centric applications. A key issue is to convert natural language instructions into a form that can be searched semantically or processed by computer programs. This paper presents a methodology for automatically detecting actionable clauses in how-to instructions. In particular, it focuses on processing non-imperative clauses to elicit implicit instructions or commands. Based on some dominant linguistic styles in how-to instructions, we formulate the problem of detecting actionable clauses using linguistic features, including syntactic and modal characteristics. The experimental results show that the features we have extracted are very promising for detecting actionable non-imperative clauses. This algorithm makes it possible to extract complete action sequences into a structured format for problem-solving tasks.
#387: Adaptive Language Modeling with A Set of Domain Dependent Models
Yangyang Shi, Pascal Wiggers, Catholijn M. Jonker
An adaptive language modeling method is proposed in this paper. Instead of using one static model for all situations, it applies a set of specific models to dynamically adapt to the discourse. We present the general structure of the model and the training procedure. In our experiments, we instantiated the method with a set of domain-dependent models trained according to different socio-situational settings (ALMoSD). We compare it with previous topic-dependent and socio-situational-setting-dependent adaptive language models and with a smoothed n-gram model in terms of perplexity and word prediction accuracy. Our experiments show that ALMoSD achieves perplexity reductions of up to almost 12% compared with the other models.
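A dynamic mixture of domain-dependent models of the kind described above can be sketched as a linear interpolation whose weights are re-estimated from the recent discourse. This is a generic sketch under that assumption, not the paper's exact training procedure; `domain_models` are assumed to expose a `prob(word, history)` method.

```python
def mixture_prob(word, history, domain_models, weights):
    """P(w|h) as a weighted sum over domain-dependent models."""
    return sum(w * m.prob(word, history) for w, m in zip(weights, domain_models))

def reestimate_weights(recent_words, history, domain_models, weights):
    """One EM-style update: each model's responsibility for the recently
    observed words becomes its new interpolation weight."""
    resp = [0.0] * len(domain_models)
    for word in recent_words:
        probs = [w * m.prob(word, history) for w, m in zip(weights, domain_models)]
        total = sum(probs) or 1.0
        for i, p in enumerate(probs):
            resp[i] += p / total
    total_resp = sum(resp) or 1.0
    return [r / total_resp for r in resp]
```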
#442: Aggression Detection in Speech Using Sensor and Semantic Information
Iulia Lefter, Leon J. M. Rothkrantz, Gertjan J. Burghouts
By analyzing a multimodal (audio-visual) database of aggressive incidents in trains, we have observed that there are no trivial fusion algorithms that successfully predict multimodal aggression from unimodal sensor inputs. We have proposed a fusion framework that contains a set of intermediate-level variables (meta-features) between the low-level sensor features and the multimodal aggression detection. In this paper we predict the multimodal level of aggression and two of the meta-features: Context and Semantics. We do this based on the audio stream, from which we extract both acoustic (nonverbal) and linguistic (verbal) information. Given the spontaneous nature of speech in the database, we rely on a keyword spotting approach in the case of verbal information. We found 6 semantic groups of keywords that have a positive influence on the prediction of aggression and of the two meta-features.
#446: An Ambiguity Aware Treebank Search Tool
Marcin Woliński, Andrzej Zaborowski
We present a search tool for constituency treebanks with some interesting new features. The tool has been designed for a treebank containing several alternative trees for any given sentence, with one tree marked as the correct one. The tool allows the user to compare the selected tree with the other candidates. The query language is modelled after TIGER Search, but we extend the use of the negation operator to support a class of universally quantified conditions in queries. The tool is built on top of an SQL engine, whose indexing facilities provide for efficient searches.
#352: An In-Car Speech Recognition System for Disabled Drivers
Jozef Ivanecký, Stephan Mehlhase
Automatic Speech Recognition (ASR) is becoming a standard in today's cars. However, ASR in cars is usually restricted to activities not directly influencing the driving process. Thus, the voice-controlled functions can rather be classified as comfort functions, e.g. controlling the air conditioning, the navigation and entertainment system or even the driver's mobile phone. Obviously, this usage of an ASR system could be extended in two directions: on the one side, the speech recognition system could be used to control secondary functions in the car such as lights, windscreen wipers or windows; on the other side, the comfort functions could be enriched by utilizing services like weather inquiries, SMS dictation or online traffic information. These extensions require a different approach than the one employed today. Controlling secondary functions in the car by voice demands a very reliable, real-time, local ASR, while a large-vocabulary ASR system is required for comfort functions like the dictation of messages. In this paper, we describe our efforts towards a hybrid speech recognition system to control secondary functions in the car, which also provides extended comfort functionality to the driver. The hybrid speech recognition system contains a fast, grammar-based, embedded recognizer and a remote, server-based, LM-based, large-vocabulary ASR system. We analyze different aspects of such a design and its integration into a car. The main focus of the paper is on maximizing the reliability of the embedded recognizer and designing an algorithm for switching dynamically between the embedded recognizer and the server-based ASR system.
#362: Analysis of the Influence of Speech Corpora in the PLDA Verification
Lukáš Machlica, Zbyněk Zajíc
In this paper, recent methods used in the task of speaker recognition are presented. At first, the extraction of so-called i-vectors from GMM-based supervectors is discussed. These i-vectors are of low dimension and lie in a subspace denoted as the Total Variability Space (TVS). The focus of the paper is on Probabilistic Linear Discriminant Analysis (PLDA), which is used as a generative model in the TVS. The influence of development data is analyzed utilizing distinct speech corpora. It is shown that it is preferable to cluster the available speech corpora into classes, train one PLDA model for each class and fuse the results at the end. Experiments are presented on NIST Speaker Recognition Evaluation (SRE) 2008 and NIST SRE 2010.
#435: Assigning Deep Lexical Types
João Silva, António Branco
Deep linguistic grammars provide complex grammatical representations of sentences, capturing, for instance, long-distance dependencies and returning semantic representations, making them suitable for advanced natural language processing. However, they lack robustness in that they do not gracefully handle words missing from the lexicon of the grammar. Several approaches have been taken to handle this problem, one of which consists in pre-annotating the input to the grammar with shallow processing machine-learning tools. This is usually done to speed-up parsing (supertagging) but it can also be used as a way of handling unknown words in the input. These pre-processing tools, however, must be able to cope with the vast tagset required by a deep grammar. We investigate the training and evaluation of several supertaggers for a deep linguistic processing grammar and report on it in this paper.
#414: Authorship Attribution: Single-layer and Double-layer Machine Learning
Jan Rygl, Aleš Horák
In the traditional authorship attribution task, forensic linguistic specialists analyse and compare documents to determine who their (real) author was. Nowadays, the number of anonymous documents is growing ceaselessly because of the expansion of the Internet, which is why the manual part of the authorship attribution process needs to be replaced with automatic methods. Specialized algorithms (SA) such as delta score and word-length statistics were developed to quantify the similarity between documents, but currently prevailing techniques build upon the machine learning (ML) approach. In this paper, two machine learning approaches are compared: single-layer ML, where the results of SA (similarities of documents) are used as input attributes for machine learning, and double-layer ML, where the numerical information characterizing the author is extracted from documents and divided into several groups; for each group a machine learning classifier is trained, and the outputs of these classifiers are used as input attributes for ML in the second step. Generating attributes for the machine learning in the first step of double-layer ML, which is based on SA, is described in detail here. Documents from Czech blog servers are used for the empirical evaluation of both approaches.
#425: Automatic Rating of Hoarseness by Cepstral and Prosodic Evaluation
Tino Haderlein, Cornelia Moers, Bernd Möbius, Elmar Nöth
The standard for the analysis of distorted voices is perceptual rating of read-out texts or spontaneous speech. Automatic voice evaluation, however, is usually done on stable sections of sustained vowels. In this paper, text-based and established vowel-based analysis are compared with respect to their ability to measure hoarseness and its subclasses. 73 hoarse patients (48.3 ± 16.8 years) uttered the vowel /e/ and read the German version of the text "The North Wind and the Sun". Five speech therapists and physicians rated roughness, breathiness, and hoarseness according to the German RBH evaluation scheme. The best human-machine correlations were obtained for measures based on the Cepstral Peak Prominence (CPP; up to |r| = 0.73). Support Vector Regression (SVR) on CPP-based measures and prosodic features improved the results further to r ≈ 0.8 and confirmed that automatic voice evaluation should be performed on a text recording.
#386: Captioning of Live TV Programs Through Speech Recognition and Re-speaking
Aleš Pražák, Zdeněk Loose, Jan Trmal, Josef V. Psutka, Josef Psutka
In this paper we introduce our complete solution for the captioning of live TV programs used by Czech Television, the public service broadcaster in the Czech Republic. Live captioning using speech recognition and re-speaking is on the increase and widely used, for example, by the BBC; however, many specific issues have to be solved each time a new captioning system is put into operation. Our concept of re-speaking assumes a complex integration of the re-speaker's skills, not only verbatim repetition with fully automatic processing. This paper describes the recognition system design with advanced re-speaker interaction, the distributed captioning system architecture and the often neglected re-speaker training. Some evaluation of our skilled re-speakers is presented too.
#516: Classification of Healthy and Pathological Speech
Klára Vicsi, Viktor Imre, Gábor Kiss
A number of experiments have been made in the field of speech diagnostic analysis in which researchers examined whether the acoustic characteristics of sustained vowels or of continuous speech are more appropriate for distinguishing healthy from pathological voice. Since in phoniatric practice doctors mainly use continuous speech, we also wanted to concentrate on the examination of continuous speech. In this paper we present a series of classification experiments showing how it is possible to separate healthy from pathological speech automatically on the basis of continuous speech. It is demonstrated that the results of the automatic classification of healthy vs. pathological voice are improved to a large extent by a multi-step processing methodology, in which most examples where uncertainties occurred in the measurement of the acoustic parameters can be handled separately. This multi-step processing can be especially useful when the pathological data is insufficient from a statistical point of view.
#484: Combining Manual and Automatic Annotation of a Learner Corpus
Tomáš Jelínek, Barbora Štindlová, Alexandr Rosen, Jirka Hana
We present an approach to building a learner corpus of Czech, manually corrected and annotated with error tags using a complex grammar-based taxonomy of errors in spelling, morphology, morphosyntax, lexicon and style. This grammar-based annotation is supplemented by a formal classification of errors based on surface alternations. To supply additional information about non-standard or ill-formed expressions, we aim at a synergy of manual and automatic annotation, deriving information from the original input and from the manual annotation.
#426: Common Sense Inference Using Verb Valency Frames
Zuzana Nevěřilová, Marek Grác
In this paper we discuss common-sense reasoning from verb valency frames. While seeing verbs as predicates is not a new approach, processing inference as a transformation of valency frames is a promising method we developed with the help of large verb valency lexicons. We went through the whole process and evaluated it on several levels: parsing, valency assignment, syntactic transformation, syntactic and semantic evaluation of the generated propositions. We have chosen the domain of cooking recipes. We built a corpus with marked noun phrases, verb phrases and dependencies among them. We have manually created a basic set of inference rules and used it to infer new propositions from the corpus. Next, we extended this basic set and repeated the process. At first, we generated 1,738 sentences from 175 rules. 1,633 sentences were judged as (syntactically) correct and 1,533 were judged as (semantically) true. After extending the basic rule set we generated 2,826 propositions using 276 rules. 2,598 propositions were judged correct and 2,433 of the propositions were judged true.
#490: Coupled Pragmatic and Semantic Automata
Jolanta Bachan
Dialogue managers are often based explicitly on finite state automata, but the present approach couples this type of dialogue manager with a semantic model (a city map) whose traversal is also formalised with a finite state automaton. The two automata are coupled in a scenario-specific fashion within an emergency rescue dialogue between an accident observer and an ambulance station, i.e. a stress scenario which is essentially different from traditional information negotiation scenarios. The purpose of this use of coupled automata is to develop a prototype dialogue system for investigating semantic alignment and non-alignment in a dialogue. The research on the alignment of interlocutors aims to improve human-computer communication in a Polish adaptive dialogue system, focusing on the stress scenario. The investigation was performed on two dialogue corpora and resulted in a working text-in-speech-out (TISO) dialogue system based on the two linked finite-state automata, evaluated with about 130 human users.
#432: Czech Expressive Speech Synthesis in Limited Domain
Martin Grůber, Zdeněk Hanzlíček
This paper deals with expressive speech synthesis in a limited domain restricted to conversations between humans and a computer on a given topic. Two different methods (unit selection and HMM-based speech synthesis) were employed to produce expressive synthetic speech, both with the same description of expressivity by so-called communicative functions. Such a discrete division is related to our limited domain and it is not intended to be a general solution for expressivity description. Resulting synthetic speech was presented to listeners within a web-based listening test to evaluate whether the expressivity is perceived as expected. The comparison of both methods is also shown.
#396: Dealing with Numbers in Grapheme-based Speech Recognition
Miloš Janda, Martin Karafiát, Jan Černocký
This article presents the results of grapheme-based speech recognition for eight languages. The need for this approach arises in the case of low-resource languages, where obtaining a pronunciation dictionary is time- and cost-consuming or impossible. In such scenarios, the use of grapheme dictionaries is the simplest and most straightforward option. The paper describes the process of automatic generation of pronunciation dictionaries with emphasis on the expansion of numbers. Experiments on the GlobalPhone database show that grapheme-based systems achieve results comparable to phoneme-based ones, especially for phonetic languages.
#400: Dependency Relations Labeller for Czech
Rudolf Rosa, David Mareček
We present a MIRA-based labeller designed to assign dependency relation labels to edges in a dependency parse tree, tuned for the Czech language. The labeller was created to be used as a second stage after unlabelled dependency parsers, but it can also improve the output of labelled dependency parsers. We evaluate two existing techniques which can be used for labelling and experiment with combining them together. We describe the feature set used. Our final setup significantly outperforms the best results from the CoNLL 2009 shared task.
#479: Detecting Errors in a Treebank of Polish
Katarzyna Krasnowska, Witold Kieraś, Marcin Woliński, Adam Przepiórkowski
The paper presents a modification --- aimed at highly inflectional languages --- of a recently proposed error detection method for syntactically annotated corpora. The technique described below is based on Synchronous Tree Substitution Grammar (STSG), i.e. a kind of tree transducer grammar. The method involves induction of STSG rules from a treebank and application of their subset meeting a certain criterion to the same resource. Obtained results show that the proposed modification can be successfully used in the task of error detection in a treebank of an inflectional language such as Polish.
#392: Detection of Semantic Compositionality Using Semantic Spaces
Lubomír Krčmář, Karel Ježek, Massimo Poesio
Any Natural Language Processing (NLP) system that does semantic processing relies on the assumption of semantic compositionality: the meaning of a compound is determined by the meaning of its parts and their combination. However, the compositionality assumption does not hold for many idiomatic expressions such as "blue chip". This paper focuses on the fully automatic detection of these, further referred to as non-compositional compounds. We have proposed and tested an intuitive approach based on replacing the parts of compounds by semantically related words. Our models determining compositionality combine simple statistical ideas with the COALS semantic space. For the evaluation, the shared dataset of the Distributional Semantics and Compositionality 2011 workshop (DISCO 2011) is used. A comparison of our approach with the traditionally used Pointwise Mutual Information (PMI) is also presented. Our best models outperform all the systems competing in DISCO 2011.
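For reference, the PMI baseline mentioned above can be stated in a few lines; this is the standard formulation, shown only to make the comparison concrete.

```python
import math

def pmi(count_w1, count_w2, count_pair, n_tokens, n_pairs):
    """Pointwise mutual information of a two-word compound:
    log2( p(w1, w2) / (p(w1) * p(w2)) ). A high PMI signals strong
    association, but association alone does not prove non-compositionality,
    which is why the paper's semantic-space models instead test what happens
    when compound parts are replaced by related words."""
    p_w1 = count_w1 / n_tokens
    p_w2 = count_w2 / n_tokens
    p_pair = count_pair / n_pairs
    return math.log2(p_pair / (p_w1 * p_w2))
```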
#462: Did You Say what I Think You Said?
Bernd Ludwig, Ludwig Hitzenberger
In this paper we discuss the problem that, in a dialogue system, speech recognizers should be able to guess whether speech recognition has failed, even if no correct transcription of the actual user utterance is available. Only with such a diagnosis available can the dialogue system choose an adequate repair strategy, try to recover from the interaction problem with the user, and avoid negative consequences for the successful completion of the dialogue. We present a data collection for a controlled out-of-vocabulary scenario and discuss an approach to estimate the success of a speech recognizer's results by exploring differences between the N-gram distribution in the best word chain and in the language model. We present the results of our experiments, which indicate that the differences are significant when speech recognition fails severely. From these results, we derive a quick test for failed recognition that is based on a negative language model.
#477: Disambiguating Word Translations with Target Language Models
André Lynum, Erwin Marsi, Lars Bungum, Björn Gambäck
Word Translation Disambiguation is the task of selecting the best translation(s) for a source word in a certain context, given a set of translation candidates. Most approaches to this problem rely on large word-aligned parallel corpora, resources that are scarce and expensive to build. In contrast, the method presented in this paper requires only large monolingual corpora to build vector space models encoding sentence-level contexts of translation candidates as feature vectors in high-dimensional word space. Experimental evaluation shows positive contributions of the models to overall quality in German-English translation.
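In its simplest form, the sentence-level context matching described above reduces to comparing a context vector with a vector for each translation candidate. The sketch below is a generic cosine-similarity illustration under that simplification, not the paper's full model; `word_vectors` and `candidate_vectors` are assumed to map words to NumPy arrays built from monolingual corpora.

```python
import numpy as np

def choose_translation(context_words, candidate_vectors, word_vectors):
    """Pick the translation candidate whose vector is closest (cosine) to the
    sum of the source sentence's context word vectors."""
    vecs = [word_vectors[w] for w in context_words if w in word_vectors]
    if not vecs:
        return None  # no known context words to compare against
    context = np.sum(vecs, axis=0)

    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    return max(candidate_vectors, key=lambda cand: cos(context, candidate_vectors[cand]))
```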
#360: Discretion of Speech Units for Automatic Transcription
Svatava Škodová, Michaela Kuchařová, Ladislav Šeps
In this paper we introduce an experiment aimed at improving the text post-processing phase of the automatic transcription of spoken documents stored in the large Czech Radio audio archive of oral documents. This archive contains the largest collection of spoken documents recorded during the last 90 years. The underlying aim of the project introduced in the paper is to transcribe a part of the audio archive and store the transcription in a database in which it will be possible to search for and retrieve information. The value of the search is that one can find the information on two linguistic levels: in the written form and in the spoken form. This doubled information storage is important especially for the comfortable retrieval of information and it greatly extends the possibilities of working with the information. One of the important issues in the conversion of spoken speech to written text is the automatic delimitation of speech units and sentences/clauses in the final text processing, which is connected with the punctuation important for the convenient perception of the rewritten texts. For this reason we decided to test Czech native speakers' perception of speech and their need for punctuation in the rewritten texts, and we compared their results with the punctuation added by an automaton. The results should serve to train a program for the automatic discretion of speech units and the correct insertion of punctuation. For the experiment we prepared a sample of texts spoken by typologically various speakers (30 minutes of speech; 5,247 words); these automatically rewritten texts were given to 59 respondents whose task was to supply punctuation to them. We used two special tools to run this experiment: NanoTrans, which was used by the respondents to supply the punctuation, and Transcription Viewer, a tool written especially for this probe for viewing and comparing the respondents' and the machine's performance. In the text we give detailed information about these comparisons. In the final part of the paper we propose further improvements and ideas for future research.
#530: E-V MT of Proper Names: Error Analysis and Some Proposed Solutions
Thi Thanh Thao Phan, Izabella Thomas
This paper presents some problems involved in the machine translation of proper names (PNs) from English into Vietnamese. Based on an English-Vietnamese comparable corpus of texts with numerous PNs, extracted from online BBC News and translated by four machine translation (MT) systems, we carry out PN error classification and analysis. Some pre-processing solutions for reducing and limiting errors are also proposed and tested with a manually annotated corpus in order to significantly improve the MT quality.
#403: Expanding Opinion Attribute Lexicons
Aleksander Wawer, Konrad Gołuchowski
The article focuses on acquiring new vocabulary used to express opinion attributes. We apply two automated expansion techniques to a manually annotated corpus of attribute-level opinions. The first method extracts opinion attribute words using patterns; it has been augmented by a second, wordnet- and similarity-based expansion procedure. We examine the types of errors and shortcomings of both methods and end up proposing two hybrid machine learning approaches that utilise all the available information: rules, lexical and distributional. One of them proves highly successful.
#515: Experiments and Results with Diacritics Restoration in Romanian
Cristian Grozea
The purpose of this paper is (1) to make an extensive overview of the field of diacritics restoration in Romanian texts, (2) to present our own experiments and results and to promote the use of the word-based Viterbi algorithm as a better accuracy solution used already in a free web-based TTS implementation, (3) to announce the production of a new, high-quality, high-volume corpus of Romanian texts, twice the size of the Romanian language subset of the JRC-Acquis.
#370: Exploration of Metaphor and Affect Sensing Using Semantic Interpretation...
Li Zhang
We developed a virtual drama improvisation platform to allow human users to be creative in their role-play while interacting with an AI agent. Previously, the AI agent was able to detect affect from users' inputs with strong affect indicators. In this paper, we integrate context-based affect detection to enable the intelligent agent to detect affect from inputs with weak or no affect signals. Topic theme detection using latent semantic analysis is applied to such inputs to identify their discussion themes and potential target audiences. Relationships between characters are also taken into account for affect analysis. Such semantic interpretation of the dialogue contexts also proves to be effective in the recognition of metaphorical phenomena.
#472: Foot-Syllable Grammars for Dialogue Systems
Daniel Couto Vale, Vivien Mast
This paper aims at improving the accuracy of user utterance understanding for an intelligent German-speaking wheelchair. We compare three different corpus-based context-free restriction grammars for its speech recognizer, which were tested for surface recognition and semantic feature extraction on a dedicated corpus of 135 utterances collected in an experiment with 13 participants. We show that grammars based on phonologically motivated units such as the foot and the syllable outperform phrase-structure grammars in complex scenarios where the extraction of a large number of semantic features is necessary.
#423: Heterogeneous Named Entity Similarity Function
Jan Kocoń, Maciej Piasecki
Many text processing tasks require recognizing and classifying Named Entities. Currently available morphological analysers for Polish cannot handle unknown words (those not included in the analyser's lexicon). Polish is a language with rich inflection, so comparing two word forms (even ones sharing the same lemma) is a non-trivial task. The aim of the similarity function is to match an unknown word form with its counterpart in a named-entity dictionary. In this article a complex similarity function is presented. It is based on a decision function implemented as a Logistic Regression classifier. The final similarity function is a combination of several simple metrics combined with the help of the classifier. The proposed function is very effective in the word form matching task.
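The general pattern of combining simple string metrics through a Logistic Regression classifier can be sketched as follows. The feature set here (character overlap, common-prefix ratio, length difference) is hypothetical and only illustrates the combination idea; the paper's actual metrics differ.

```python
from sklearn.linear_model import LogisticRegression

def char_overlap(a, b):
    # Jaccard overlap of character sets of the two word forms.
    return len(set(a) & set(b)) / max(len(set(a) | set(b)), 1)

def prefix_ratio(a, b):
    # Length of the common prefix relative to the longer form.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n / max(len(a), len(b), 1)

def features(unknown_form, dictionary_form):
    return [char_overlap(unknown_form, dictionary_form),
            prefix_ratio(unknown_form, dictionary_form),
            abs(len(unknown_form) - len(dictionary_form))]

def train_similarity(pairs, labels):
    """pairs: (unknown_form, dictionary_form) tuples; labels: 1 = same lexeme.
    Returns a similarity function built from the classifier's probability."""
    X = [features(a, b) for a, b in pairs]
    clf = LogisticRegression().fit(X, labels)
    return lambda a, b: clf.predict_proba([features(a, b)])[0][1]
```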
#482: Impact of Statistical and Semantic Features on Text Summarization
Tatiana Vodolazova, Elena Lloret, Rafael Muñoz, Manuel Palomar
This paper evaluates the impact of a set of statistical and semantic features as applied to the task of extractive summary generation for English. This set includes word frequency, inverse sentence frequency, inverse term frequency, corpus-tailored stopwords, word senses, resolved anaphora and textual entailment. The obtained results show that not all of the selected features equally benefit the performance. The term frequency combined with stopwords filtering is a highly competitive baseline that nevertheless can be topped when semantic information is included. However, in the selected experiment environment the recall values improved less than expected and we are interested in further investigating the reasons.
#496: Improved Phrase Translation Modeling Using MAP Adaptation
A. Ryan Aminzadeh, Jennifer Drexler, Timothy Anderson, Wade Shen
In this paper, we explore several methods of improving the estimation of translation model probabilities for phrase-based statistical machine translation given in-domain data sparsity. We introduce a hierarchical variant of maximum a posteriori (MAP) adaptation for domain adaptation with an arbitrary number of out-of-domain models. We note that domain adaptation can have a smoothing effect, and we explore the interaction between smoothing and the incorporation of out-of-domain data. We find that the relative contributions of smoothing and interpolation depend on the datasets used. For both the IWSLT 2011 and WMT 2011 English-French datasets, the MAP adaptation method we present improves on a baseline system by 1.5+ BLEU points.
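MAP adaptation of translation probabilities is commonly written as a count-based interpolation between in-domain counts and an out-of-domain model. The sketch below shows a minimal single-out-of-domain-model version of that standard formulation; it is not the hierarchical variant the paper introduces.

```python
def map_adapted_prob(count_in_fe, count_in_f, p_out, tau):
    """p_MAP(e|f) = (c_in(f,e) + tau * p_out(e|f)) / (c_in(f) + tau).
    tau controls how strongly the out-of-domain model smooths sparse
    in-domain counts; tau -> 0 recovers the in-domain relative frequency,
    large tau leans on the out-of-domain estimate."""
    return (count_in_fe + tau * p_out) / (count_in_f + tau)
```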
#401: Induction of Rules for Recognition of Relations between PNs
Michał Marcińczuk, Marcin Ptak
In the paper we present preliminary work on the automatic construction of rules for the recognition of semantic relations between pairs of proper names in Polish texts. Our goal was to check the feasibility of automatic rule construction using an existing inductive logic programming (ILP) system as an alternative or supporting method for manual rule creation. We present a set of predicates in first-order logic that is used to represent the semantic relation recognition task. The background knowledge encodes the morphological, orthographic and named entity-based features. We applied ILP to the proposed representation to generate rules for relation extraction, utilizing the existing ILP system Aleph. The performance of the automatically generated rules was compared with a set of hand-crafted rules developed on the basis of the training set for 8 categories of relations (affiliation, alias, creator, composition, location, nationality, neighbourhood, origin). Finally, we propose several ways to improve these preliminary results in future work.
#433: Integrating Dialogue Systems with Images
Ivan Kopeček, Radek Ošlejšek, Jaromír Plhák
The paper presents a novel approach in which images are integrated with a dialogue interface that enables them to communicate with the user. The structure of the corresponding dialogue system is supported by graphical ontologies and enables the system to learn from the dialogues. The Internet environment is used for retrieving additional information about the images as well as for solving more complex tasks related to exploiting other relevant knowledge. Further, the paper deals with some problems that arise from the system-initiative dialogue mode and discusses the structure and algorithms of the dialogue system. Some examples and applications of the presented approach are given as well.
#389: Investigation on Most Frequent Errors in Large-scale ASR Applications
Marek Boháč, Jan Nouza, Karel Blavka
When an automatic speech recognition (ASR) system is being developed for an application in which a large amount of audio documents is to be transcribed, we need feedback that tells us what the main types of errors are, why and where they occur, and what can be done to eliminate them. While the algorithm commonly used for counting the number of word errors is simple, it does not say much about the nature and source of the errors. In this paper, we introduce a scheme that offers a more detailed insight into the analysis of ASR errors. We apply it to the performance evaluation of a Czech ASR system whose main goal is to transcribe oral archives containing hundreds of thousands of spoken documents. The analysis is performed by comparing 763 hours of manually and automatically transcribed data. We list the main types of errors and present methods that try to eliminate at least the most relevant ones. We show that the proposed error locating method can also be useful when porting an existing ASR system to another language, where it can help in the efficient identification of errors in the lexicon.
#424: Joint Part-of-Speech Tagging and Named Entity Recognition Using Factor Graphs
György Móra, Veronika Vincze
We present a machine learning-based method for jointly labeling POS tags and named entities. This joint labeling is performed by utilizing factor graphs. The variables of part of speech and named entity labels are connected by factors so the tagger jointly determines the best labeling for the two labeling tasks. Using the feature sets of SZTENER and the POS-tagger magyarlanc, we built a model that is able to outperform both of the original taggers.
#443: Key Phrase Extraction of Lightly Filtered Broadcast News
Luís Marujo, Ricardo Ribeiro, David Martins de Matos, João P. Neto, Anatole Gershman, Jaime Carbonell
This paper explores the impact of light filtering on automatic key phrase extraction (AKE) applied to Broadcast News (BN). Key phrases are words and expressions that best characterize the content of a document. Key phrases are often used to index the document or as features in further processing. This makes improvements in AKE accuracy particularly important. We hypothesized that filtering out marginally relevant sentences from a document would improve AKE accuracy. Our experiments confirmed this hypothesis. Elimination of as little as 10% of the document sentences leads to a 2% improvement in AKE precision and recall. Our AKE is built on the MAUI toolkit, which follows a supervised learning approach. We trained and tested our AKE method on a gold standard made of 8 BN programs containing 110 manually annotated news stories. The experiments were conducted within a Multimedia Monitoring Solution (MMS) system for TV and radio news/programs, running daily and monitoring 12 TV and 4 radio channels.
#445: Language Modeling of Nonverbal Vocalizations in Spontaneous Speech
Dmytro Prylipko, Bogdan Vlasenko, Andreas Stolcke, Andreas Wendemuth
Nonverbal vocalizations are one of the characteristics of spontaneous speech distinguishing it from written text. These phenomena are sometimes regarded as a problem in language and acoustic modeling. However, vocalizations such as filled pauses enhance language models at the local level and serve some additional functions (marking linguistic boundaries, signaling hesitation). In this paper we consider a wider range of nonverbals, investigate their potential for language modeling of conversational speech, and compare different modeling approaches. We find that all nonverbal sounds, with the exception of breath, have little effect on the overall results. Due to its specific nature, as well as its frequency in the data, modeling breath as a regular language model event leads to a substantial improvement in both perplexity and speech recognition accuracy.
#523: Large-scale Experiments with NP Chunking of Polish
Adam Radziszewski, Adam Pawlaczek
The published experiments with shallow parsing for Slavic languages are characterised by the small size of the corpora used. With the publication of the National Corpus of Polish (NCP), a new opportunity opened up: to test several chunking algorithms on the 1-million-token manually annotated subcorpus of the NCP. We test three Machine Learning techniques: Decision Tree induction, Memory-Based Learning and Conditional Random Fields. We also investigate the influence of tagging errors on the overall chunker performance, which turns out to be quite substantial.
#493: Lemmatization and Summarization Methods in Topic Identification Module
Lucie Skorkovská
The paper presents experiments with the topic identification module which is a part of a complex system for acquiring and storing large volumes of text data. The topic identification module processes each acquired data item and assigns it topics from a defined topic hierarchy. The topic hierarchy is quite extensive -- it contains about 450 topics and topic categories. It can easily happen that for some narrowly focused topic there is not enough data for topic identification training. Lemmatization has been shown to improve the results when dealing with sparse data in the area of information retrieval, therefore the effect of lemmatization on topic identification results is studied in the paper. On the other hand, since the system is used for processing large amounts of data, a summarization method was implemented and the effect of using only the summary of an article on topic identification accuracy is studied as well.
#398: Literacy Demands and Information to Cancer Patients
Dimitrios Kokkinakis, Markus Forsberg, Sofie Johansson Kokkinakis, Frida Smith, Joakim Öhlen
This study examines the language complexity of written health information materials for patients undergoing colorectal cancer surgery. Written and printed patient information from 28 Swedish clinics is automatically analyzed by means of language technology. The analysis reveals different problematic issues that might have an impact on readability. The study is a first step, and part of a larger project about patients' health information seeking behavior in relation to written information material. Our study aims to provide support for producing more individualized, person-centered information materials according to preferences for complex and detailed or legible texts and thus enhance a movement from receiving information and instructions to participating in knowing. In the near future the study will continue by integrating focus groups with patients that may provide valuable feedback and enhance our knowledge about patients' use and preferences of different information material.
#395: Making Community and ASR Join Forces in Web Environment
Oldřich Krůza, Nino Peterek
The paper presents a system for combining human transcriptions with automated speech recognition to create a quality transcription of a large corpus in good time. The system uses the web as interface for playing back audio, displaying the automatically-acquired transcription synchronously, and enabling the visitor to correct errors in the transcription. The human-submitted corrections are then used in the statistical ASR to improve the acoustic as well as language model and re-generate the bulk of transcription. The system is currently under development. The paper presents the system design, the corpus processed as well as considerations for using the system in other settings.
#373: Mapping a Linguistic Resource to a Common Framework
Milena Slavcheva
In recent years the proliferation of language resources has brought up the question of their interoperability, reuse and integration. Currently, it is appropriate not only to produce a language resource, but to connect it to prominent frameworks and global infrastructures. This paper presents the mapping of SemInVeSt -- a knowledge base of the semantics of verb-centred structures in Bulgarian, French and Hungarian, to the Lexical Markup Framework (LMF) -- an abstract metamodel, providing a common, standardized framework for the representation of computational lexicons. SemInVeSt and LMF share their underlying models, that is, both are based on the four-layer metamodel architecture of the Unified Modeling Language (UML). A two-step mapping of the SemInVeSt and LMF models is considered: the first step provides an LMF conformant schema of SemInVeSt as a multilingual lexical resource with a reference to an external system containing the semantic descriptors of the lexical units; the second step implies an LMF conformant representation of the semantic descriptors themselves, which are a product of the application of the Unified Eventity Representation (UER) -- a cognitive theoretical approach to verb semantics and a graphical formalism, based on UML.
#473: Mining the Web for Idiomatic Expressions Using Metalinguistic Markers
Filip Graliński
In this paper, methods for the identification and delimitation of idiomatic expressions in large Web corpora are presented. The proposed methods are based on the observation that idiomatic expressions are sometimes accompanied by metalinguistic expressions, e.g. the word "proverbial", the expression "as they say" or quotation marks. Even though the frequency of such idiom-related metalinguistic markers is not very high, it is possible to identify new idiomatic expressions with a sufficiently large corpus (only type identification of idiomatic expressions is discussed here, not token identification). In this paper, we propose to combine infrequent but reliable idiom-related markers (such as the word "proverbial") with frequent but unreliable markers (such as quotation marks). The former can be used for the identification of idiom candidates, the latter -- for their delimitation. Experiments estimating the recall upper bound of the proposed methods are also presented. Even though the paper is concerned with the identification and delimitation of Polish idiomatic expressions, the approaches proposed here should also be feasible for other languages with sufficiently large web corpora, English in particular.
#526: Morphological Resources for Precise IR
Anne-Laure Ligozat, Brigitte Grau, Delphine Tribout
Question answering (QA) systems aim at providing a precise answer to a given user question. Their major difficulty lies in the lexical gap problem between question and answering passages. We present here the different types of morphological phenomena in question answering, the resources available for French, and in particular a resource that we built containing deverbal agent nouns. Then, we evaluate the results of a particular QA system, according to the morphological knowledge used.
#456: Natural Language Understanding: From Laboratory Predictions to Real Interactions
Pedro Mota, Luísa Coheur, Sérgio Curto, Pedro Fialho
In this paper we target Natural Language Understanding in the context of Conversational Agents that answer questions about their topics of expertise, and have in their knowledge base question/answer pairs, limiting the understanding problem to the task of finding the question in the knowledge base that will trigger the most appropriate answer to a given (new) question. We implement such an agent and different state of the art techniques are tested, covering several paradigms, and moving from lab experiments to tests with real users. First, we test the implemented techniques in a corpus built by the agent's developers, corresponding to the expected questions; then we test the same techniques in a corpus representing interactions between the agent and real users. Interestingly, results show that the best "lab" techniques are not necessarily the best for real scenarios, even if only in-domain questions are considered.
#458: Neural Network Language Model with Cache
Daniel Soutner, Zdeněk Loose, Luděk Müller, Aleš Pražák
In this paper we investigate whether a combination of statistical, neural network and cache language models can outperform a basic statistical model. These models have been developed, tested and exploited on Czech spontaneous speech data, which is very different from common written Czech and is characterized by the small amount of available data and the high inflection of words. As a baseline model we used a trigram model; after its training, several cache models interpolated with the baseline model were tested and measured in terms of perplexity. Finally, an evaluation of the model with the lowest perplexity was performed on speech recordings of phone calls.
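A cache model interpolated with a baseline model, as tested here, can be sketched as follows. This is a generic unigram-cache formulation, not the paper's exact setup (which additionally combines neural network models); the baseline is assumed to expose a `prob(word, history)` method.

```python
from collections import deque, Counter

class CacheLM:
    """Unigram cache over the last `size` words, linearly interpolated with a
    baseline language model."""
    def __init__(self, baseline, size=500, lam=0.1):
        self.baseline, self.lam = baseline, lam
        self.window = deque(maxlen=size)
        self.counts = Counter()

    def prob(self, word, history):
        cache_p = self.counts[word] / len(self.window) if self.window else 0.0
        return (1.0 - self.lam) * self.baseline.prob(word, history) + self.lam * cache_p

    def update(self, word):
        # Keep the counts in sync with the sliding window of recent words.
        if len(self.window) == self.window.maxlen:
            self.counts[self.window[0]] -= 1
        self.window.append(word)
        self.counts[word] += 1
```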
#471: On the Impact of Annotation Errors on Unit-Selection Speech Synthesis
Jindřich Matoušek, Daniel Tihelka, Luboš Šmídl
Unit selection is a very popular approach to speech synthesis. It is known for its ability to produce nearly natural-sounding synthetic speech, but, at the same time, also for its need for very large speech corpora. In addition, unit selection is also known to be very sensitive to the quality of the source speech corpus the speech is synthesised from and its textual, phonetic and prosodic annotations and indexation. Given the enormous size of current speech corpora, manual annotation of the corpora is a lengthy process. Despite this fact, human annotators do make errors. In this paper, the impact of annotation errors on the quality of unit-selection-based synthetic speech is analysed. Firstly, an analysis and categorisation of annotation errors is presented. Then, a speech synthesis experiment, in which the same utterances were synthesised by unit-selection systems with and without annotation errors, is described. Results of the experiment and the options for fixing the annotation errors are discussed as well.
#418: On the Impact of Non-Speech Sounds on Speaker Recognition
Artur Janicki
This paper investigates the impact of non-speech sounds on the performance of speaker recognition. Various experiments were conducted to check what the accuracy of speaker classification would be if non-speech sounds, such as breaths, were removed from the training and/or testing speech. Experiments were run using the GMM-UBM algorithm and speech taken from the TIMIT speech corpus, either original or transcoded using the G.711 or GSM 06.10 codecs. The results show a remarkable contribution of non-speech sounds to the overall speaker recognition performance.
#383: Opinion Mining on a German Corpus of a Media Response Analysis
Thomas Scholz, Stefan Conrad, Lutz Hillekamps
This contribution introduces a new corpus of a German Media Response Analysis called the pressrelations dataset which can be used in several tasks of Opinion Mining: Sentiment Analysis, Opinion Extraction and the determination of viewpoints. Professional Media Analysts created a corpus of 617 documents which contains 1,521 statements. The statements are annotated with a tonality (positive, neutral, negative) and two different viewpoints. In our experiments, we perform sentiment classifications by machine learning techniques which are based on different methods to calculate tonality.
#469: Optimizing Sentence Boundary Detection for Croatian
Frane Šarić, Jan Šnajder, Bojana Dalbelo Bašić
A number of natural language processing tasks depend on segmenting text into sentences. Tools that perform sentence boundary detection achieve excellent performance for some languages. We have tried to train a few publicly available language-independent tools to perform sentence boundary detection for Croatian. The initial results show that off-the-shelf methods used for English do not work particularly well for Croatian. After performing error analysis, we propose additional features that help in resolving some of the most common boundary detection errors. We use unsupervised methods on a large Croatian corpus to collect likely sentence starters, abbreviations, and honorifics. In addition to some commonly used features, we use these lists of words as features for a classifier that is trained on a smaller corpus with manually annotated sentences. The method we propose advances the state-of-the-art accuracy for Croatian sentence boundary detection on news corpora to 99.5%.
#404: PSI-Toolkit: How to Turn a Linguist into a Computational Linguist
Krzysztof Jassem
The paper presents PSI-Toolkit, a set of text processing tools being developed within a project funded by the Polish Ministry of Science and Higher Education. The toolkit serves two objectives: to deliver a set of advanced text processing tools (with the focus set on the Polish language) for experienced language engineers and to help linguists without any technological background learn to use linguistic toolkits. The paper describes how the second objective can be achieved: first, thanks to PSI-Toolkit, a linguist becomes a conscious user of NLP tools; next, he designs his own NLP applications.
#382: Question Classification with Active Learning
Domen Marinčič, Tomaž Kompara, Matjaž Gams
In a question answering system, one of the most important components is the analysis of the question. One of the steps of the analysis is the classification of the question according to the type of the expected answer. In this paper, two approaches to the classification are compared: the passive learning approach with a random choice of training examples, and the active learning approach, upgraded by a domain model, where the learning algorithm proposes the most informative examples for the training set. The experiments performed on a set of questions in Slovene show that the active learning algorithm outperforms the passive learning algorithm by about ten percentage points.
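The active learning loop described above, in which the algorithm proposes the most informative examples for labeling, follows the standard uncertainty-sampling pattern. Below is a generic sketch of that pattern under the assumption of a feature matrix `X_pool` and an oracle label array `y_oracle`; it is not the paper's domain-model-augmented variant.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning(X_pool, y_oracle, seed_idx, rounds=10, batch=5):
    """Iteratively ask the oracle (human annotator) to label the pool
    examples the current classifier is least certain about."""
    labeled = list(seed_idx)
    for _ in range(rounds):
        clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_oracle[labeled])
        probs = clf.predict_proba(X_pool)
        uncertainty = 1.0 - probs.max(axis=1)   # least-confident sampling
        uncertainty[labeled] = -1.0             # never re-query labeled items
        labeled.extend(np.argsort(uncertainty)[-batch:].tolist())
    return clf, labeled
```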
#416: Robust Adaptation Techniques
Zbyněk Zajíc, Lukáš Machlica, Luděk Müller
The most serious problem adaptation has to deal with is the lack of adaptation data. This work focuses on the feature Maximum Likelihood Linear Regression (fMLLR) adaptation, where the number of free parameters to be estimated decreases significantly in comparison with other adaptation methods. However, the number of free parameters of the fMLLR transform is still too high for it to be estimated properly from extremely small data sets. We describe and compare various methods used to avoid this problem, namely the initialization of the fMLLR transform and a linear combination of basis matrices, varying in the choice of the basis estimation (eigen decomposition, factor analysis, independent component analysis and maximum likelihood estimation). Initialization methods compensate for the absence of the test speaker's data by utilizing other suitable data. Methods using a linear combination of basis matrices reduce the number of estimated fMLLR parameters to a smaller number of weights. The experiments compare the results of the proposed basis and initialization methods.
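For reference, the standard fMLLR transform and its representation as a linear combination of basis matrices can be written as follows (notation ours; the paper's exact parameterization may differ):

    \hat{x}_t \;=\; A\,x_t + b \;=\; W\,\xi_t,
    \qquad \xi_t = \begin{bmatrix} x_t \\ 1 \end{bmatrix},
    \qquad W = [\,A \;\; b\,] \;=\; \sum_{i=1}^{n} w_i B_i,
    \qquad n \ll d\,(d+1),

where x_t is a d-dimensional feature vector, the B_i are fixed basis matrices, and only the n combination weights w_i have to be estimated from the adaptation data.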
#463: SBFC: An Efficient Approach to CLWSD
Dieter Mourisse, Els Lefever, Nele Verbiest, Yvan Saeys, Martine De Cock, Chris Cornelis
The Cross-Lingual Word Sense Disambiguation (CLWSD) problem is a challenging Natural Language Processing (NLP) task that consists of selecting the correct translation of an ambiguous word in a given context. Different approaches have been proposed to tackle this problem, but they are often complex and need tuning and parameter optimization. In this paper, we propose a new classifier, Selected Binary Feature Combination (SBFC), for the CLWSD problem. The underlying hypothesis of SBFC is that a translation is a good classification label for new instances if the features that occur frequently in the new instance also occur frequently in the training feature vectors associated with the same translation label. The advantage of SBFC over existing approaches is that it is intuitive and therefore easy to implement. The algorithm is fast, which allows processing of large text mining data sets. Moreover, no tuning is needed and experimental results show that SBFC outperforms state-of-the-art models for the CLWSD problem w.r.t. accuracy.
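A minimal sketch of the stated hypothesis: features seen frequently with a translation label in training vote for that label on a new instance. The thresholds and data structures are illustrative, not the paper's.

    # Selected Binary Feature Combination, sketched: per label, keep the features
    # that occur frequently in its training vectors; classify by overlap.
    from collections import Counter, defaultdict

    def train_sbfc(train_instances, min_count=2):
        """train_instances: iterable of (set_of_binary_features, translation_label)."""
        per_label = defaultdict(Counter)
        for feats, label in train_instances:
            per_label[label].update(feats)
        return {lab: {f for f, c in cnt.items() if c >= min_count}
                for lab, cnt in per_label.items()}

    def classify_sbfc(model, feats):
        # the label whose frequent features overlap most with the new instance wins
        return max(model, key=lambda lab: len(feats & model[lab]))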
#524: SOM-based Corpus Modeling for Disambiguation Purposes in MT
George Tambouratzis, George Tsatsanifos, Ioannis Dologlou, Nikolaos Tsimboukakis
The PRESEMT project constitutes a novel approach to the machine translation (MT) task. This project aims to develop a language-independent MT system architecture that is readily portable to new language pairs. PRESEMT falls within the Corpus-based MT (CBMT) paradigm, using a small bilingual parallel corpus and a large TL monolingual corpus. The present article investigates the process of selecting the best translation for a given token by choosing over a set of suggested translations. For this disambiguation task, a dedicated module based on the SOM model (Self-Organizing Map) is presented. Though the SOM has been studied extensively for text processing applications, the present application to translation disambiguation is novel. The actual features employed, which project textual data onto the SOM lattice, are described. Details are provided on the modifications required to model very large corpora and on experimental results of integrating the SOM into the PRESEMT system.
#361: Semantic Similarity Functions in Word Sense Disambiguation
Łukasz Kobyliński, Mateusz Kopeć
This paper presents a method of improving the results of automatic Word Sense Disambiguation by generalizing nouns appearing in a disambiguated context to concepts. A corpus-based semantic similarity function is used for that purpose, by substituting appearances of particular nouns with a set of the most closely related words. We show that this approach may be applied to both supervised and unsupervised WSD methods and in both cases leads to an improvement in disambiguation accuracy. We evaluate the proposed approach by conducting a series of lexical sample WSD experiments on both a domain-restricted dataset and a general, balanced Polish-language text corpus.
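A minimal sketch of the generalization step, assuming some corpus-based similarity function is available; the function name and interface are illustrative, not the authors'.

    # Each noun in the context is kept and additionally expanded with its k most
    # similar words, so the WSD classifier sees concept-level features.
    # `most_similar` stands in for the corpus-based similarity function.
    def generalise_context(tokens, is_noun, most_similar, k=5):
        expanded = []
        for tok in tokens:
            expanded.append(tok)
            if is_noun(tok):
                expanded.extend(most_similar(tok, k))  # e.g. "dog" -> ["cat", "puppy", ...]
        return expanded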
#453: Semi-Supervised Acquisition of Croatian Sentiment Lexicon
Goran Glavaš, Jan Šnajder, Bojana Dalbelo Bašić
Sentiment analysis aims to recognize subjectivity expressed in natural language texts. Subjectivity analysis tries to answer whether a text unit is subjective or objective, while polarity analysis determines whether a subjective text is positive or negative. The sentiment of sentences and documents is often determined using some sort of sentiment lexicon. In this paper we present three different semi-supervised methods for automated acquisition of a sentiment lexicon that do not depend on pre-existing language resources: latent semantic analysis, graph-based propagation, and topic modelling. The methods are language independent and corpus-based, hence especially suitable for languages for which resources are very scarce. We use the presented methods to acquire a sentiment lexicon for the Croatian language. The performance of the methods was evaluated on the task of determining subjectivity and polarity at once (the subjectivity + polarity task) and on the task of determining the polarity of subjective words (the polarity-only task). The results indicate that the methods are especially suitable for the polarity-only task.
#378: Sentence Classification...
Yu Nagai, Tomohisa Senzai, Seiichi Yamamoto, Masafumi Nishida
Computer Assisted Language Learning (CALL) systems are one of the key technologies in assisting learners to master a second language. The progress in automatic speech recognition has advanced research on CALL systems that recognize speech produced by students. Reliable recognition of speech by second-language speakers, which contains pronunciation, lexical, and grammatical errors, is still difficult. We developed a dialogue-based CALL system using a learner corpus. The system uses two kinds of automatic speech recognizers, one based on n-grams and the other on a finite state automaton (FSA). We also propose a method for classifying the speech recognition results of the FSA-based recognizer as accepted or rejected. The classification method uses the differences in acoustic likelihoods of both recognizers, as well as the edit distance between the strings of output words from both recognizers and coverage estimation by the FSA over various expressions.
#391: Sentence Modality Assignment in PDT
Magda Ševčíková, Jiří Mírovský
The paper focuses on the annotation of sentence modality in the Prague Dependency Treebank (PDT). Sentence modality (as the contrast between declarative, imperative, interrogative etc. sentences) is expressed by a combination of several means in Czech, from which the category of verbal mood and the final punctuation of the sentence are the most important ones. In PDT 2.0, sentence modality was assigned semi-automatically to the root node of each sentence (tree) and further to the roots of parenthesis and direct speech subtrees. As this approach was too simple to adequately represent the linguistic phenomenon in question, the method for assigning the sentence modality has been revised and elaborated for the forthcoming version of the treebank (PDT 3.0).
#380: Spoken Dialogue System Design in 3 Weeks
Tomáš Valenta, Jan Švec, Luboš Šmídl
This article describes knowledge-based spoken dialogue system design from scratch. It covers all stages which were performed during a period of three weeks: definition of semantic goals and entities, data collection and recording of sample dialogues, data annotation, parser and grammar design, dialogue manager design, and testing. The work was focused mainly on rapid development of such a dialogue system. The final implementation was written in dynamically generated VoiceXML. A large vocabulary continuous speech recognition system was used, and the language understanding module was implemented using non-recursive probabilistic context-free grammars which were converted to finite-state transducers. The design and implementation have been verified on a railway information service task with a real large-scale database. The paper describes an innovative combination of data, expert knowledge and state-of-the-art methods which allows fast spoken dialogue system design.
#514: State Relevance in HMM-based Feature Extraction Method
Rok Gajšek, Simon Dobrišek, France Mihelič
In the article we evaluate the importance of different HMM states in an HMM-based feature extraction method used to model paralinguistic information. Specifically, we evaluate the distribution of the paralinguistic information across different states of the HMM in two different classification tasks: emotion recognition and alcoholization detection. In the task of recognizing emotions we found that the majority of emotion-related information is incorporated in the first and third state of a 3-state HMM. Surprisingly, in the alcoholization detection task we observed a roughly equal distribution of task-specific information across all three states, with results consistently improving as more states are utilized.
#437: Supervised Distributional Semantic Relatedness
Alistair Kennedy, Stan Szpakowicz
Distributional measures of semantic relatedness determine word similarity based on how frequently a pair of words appear in the same contexts. A typical method is to construct a term-context matrix, then re-weight it using some measure of association, and finally take the vector distance as a measure of similarity. This has largely been an unsupervised process, but in recent years more work has been done devising methods of using known sets of synonyms to enhance relatedness measures. This paper examines and expands on one such measure, which learns a weighting of a term-context matrix by measuring associations between words appearing in a given context and sets of known synonyms. In doing so we propose a general method of learning weights for term-context matrices, and evaluate it on a word similarity task. This method works with a variety of measures of association and can be trained with synonyms from any resource.
#520: TENOR: A Lexical Normalisation Tool for Spanish Web 2.0 Texts
Alejandro Mosquera, Paloma Moreda
Its lexical richness and ease of access to large volumes of information make the Web 2.0 an important resource for Natural Language Processing. Nevertheless, the frequent presence of non-normative linguistic phenomena can make any automatic processing challenging. We therefore propose in this study the normalisation of non-normative lexical variants in Spanish Web 2.0 texts. We evaluate our system by restoring the canonical version of Twitter texts, increasing the F1 measure of a state-of-the-art approach for English texts by 10%.
#431: Taggers Gonna Tag:...
Adam Radziszewski, Szymon Acedański
Usually tagging of inflectional languages is performed in two stages: morphological analysis and morphosyntactic disambiguation. A number of papers have been published where the evaluation is limited to the second part, without asking the question of what a tagger is supposed to do. In this article we highlight this important question and discuss possible answers. We also argue that a fair evaluation requires assessment of the whole system, which is very rarely the case in the literature. Finally we show results of the full evaluation of three Polish morphosyntactic taggers. The discrepancy between our results and those published earlier is striking, showing that these issues do make a practical difference.
#494: The Role of Nasal Contexts on Quality of Vowel Concatenations
Milan Legát, Radek Skarnitzl
This paper deals with the traditional problem of audible discontinuities occurring at concatenation points at diphone boundaries in concatenative speech synthesis. We present the results of an analysis of the effects of nasal context mismatches on the quality of concatenations in five short Czech vowels. The study was conducted with two voices (one male and one female), and the results suggest that the female voice vowels /a/, /e/ and /o/ are prone to concatenation discontinuities due to nasalized contexts.
#419: The Rule-Based Approach to Czech Grammaticalized Alternations
Václava Kettnerová, Markéta Lopatková, Zdeňka Urešová
Under the term grammaticalized alternations, we understand changes in valency frames of verbs corresponding to different surface syntactic structures of the same lexical unit of a verb. Czech grammaticalized alternations are expressed either (i) by morphological means (diatheses), or (ii) by syntactic means (reciprocity). These changes are limited to changes in morphemic form(s) of valency complementations; moreover, they are regular enough to be captured by formal syntactic rules. In this paper a representation of Czech grammaticalized alternations and their possible combination is proposed for the valency lexicon of Czech verbs, VALLEX.
#390: The Soundex Phonetic Algorithm Revisited for SMS Text Representation
David Pinto, Darnes Vilariño, Yuridiana Alemán, Helena Gómez, Nahun Loya, Héctor Jiménez-Salazar
The growing use of information technologies such as mobile devices has had a major social and technological impact, one example being the growing use of Short Message Services (SMS), a communication system broadly used by cellular phone users. In 2011, it was estimated that over 5.6 billion mobile phones were sending between 30 and 40 SMS messages a month, hence the great importance of analyzing representation and normalization techniques for this kind of text. In this paper we show an adaptation of the Soundex phonetic algorithm for representing SMS texts. We use the modified version of the Soundex algorithm for codifying SMS, and we evaluate the presented algorithm by measuring the degree of similarity between two codified texts: one originally written in natural language, and the other originally written in the SMS "sub-language". Our main contribution is basically an improvement of the Soundex algorithm which allows us to raise the level of similarity between texts in SMS and their corresponding texts in English or Spanish.
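For concreteness, a sketch of the classical (unmodified) English Soundex code that the paper adapts; the modified SMS-oriented version and its handling of Spanish are not reproduced here.

    # Classical English Soundex: first letter kept, remaining consonants mapped
    # to digit classes, adjacent identical codes collapsed, padded to length 4.
    SOUNDEX_CODES = {
        **dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
        **dict.fromkeys("DT", "3"), "L": "4", **dict.fromkeys("MN", "5"), "R": "6",
    }

    def soundex(word):
        word = "".join(c for c in word.upper() if c.isalpha())
        if not word:
            return ""
        codes, prev = [], SOUNDEX_CODES.get(word[0], "")
        for c in word[1:]:
            code = SOUNDEX_CODES.get(c, "")
            if code and code != prev:
                codes.append(code)
            if c not in "HW":        # H and W do not separate identical codes
                prev = code
        return (word[0] + "".join(codes) + "000")[:4]

    # soundex("Robert") == soundex("Rupert") == "R163"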
#461: Towards a Constraint Grammar Based Morphological Tagger for Croatian
Hrvoje Peradin, Jan Šnajder
A Constraint Grammar (CG) uses context-dependent hand-crafted rules to disambiguate the possible grammatical readings of words in running text. In this paper we describe the development of a CG-based morphological tagger for the Croatian language. Our CG tagger uses a morphological analyzer based on an automatically acquired inflectional lexicon and an elaborate tagset based on MULTEXT-East and the Croatian Verb Valence Lexicon. Currently our grammar has 290 rules, organized into cleanup and mapping rules, disambiguation rules, and heuristic rules. The grammar is implemented in the CG3 formalism and compiled with the vislcg3 open-source compiler. The preliminary tagging performance is P: 96.1%, R: 99.8% for POS tagging and P: 88.2%, R: 98.1% for complete morphosyntactic tagging.
#374: Unsupervised Clustering of Prosodic Patterns
András Beke, György Szaszák
Dealing with spontaneous speech constitutes a big challenge both for linguists and for speech technology engineers. In earlier experiments on read speech, prosody was used for automatic decomposition into phonological phrases with a supervised method (HMM). However, when trying to adapt this automatic approach to spontaneous speech, the clustering of phonological phrase types becomes problematic: it is unknown which types are characteristic and hence worth modelling. The authors decided to carry out a more flexible, unsupervised learning to cluster the data, in order to evaluate and analyse whether some typical "spontaneous" patterns become selectable in spontaneous speech based on this automatic approach. This paper presents a method for clustering the typical prosody patterns of spontaneous speech based on k-means clustering.
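A minimal sketch of the clustering step, assuming each phonological phrase has already been reduced to a fixed-length prosodic feature vector; the feature extraction used by the authors is not reproduced here.

    # k-means over fixed-length prosodic contours (e.g. F0 sampled at a fixed
    # number of points per phrase).
    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_prosody(contours, n_clusters=8):
        """contours: array of shape (n_phrases, n_points), e.g. resampled F0 tracks."""
        contours = np.asarray(contours, dtype=float)
        # z-score each contour so clusters reflect shape rather than pitch level
        contours = (contours - contours.mean(axis=1, keepdims=True)) / (
            contours.std(axis=1, keepdims=True) + 1e-8)
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(contours)
        return km.labels_, km.cluster_centers_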
#440: Unsupervised Synchronization of Hidden Subtitles
Petr Stanislav, Jan Švec, Luboš Šmídl
This paper deals with the processing of hidden subtitles and with the assignment of subtitles without time alignment to the corresponding parts of audio records. The first part of the paper describes the processing of hidden subtitles using a software framework designed for handling large volumes of language modelling data. It evaluates the characteristics of a corpus built from publicly available subtitles and compares them with corpora created from other sources of data, such as news articles. The corpus consistency and similarity to other data sources are evaluated using standard Spearman rank correlation coefficients. The second part presents a novel algorithm for unsupervised alignment of hidden subtitles to the corresponding audio. The algorithm uses no prior time alignment information. The method is based on a keyword spotting algorithm, which provides only an approximate alignment, because the obtained results contain a large amount of redundant information. The longest common subsequence algorithm then determines the best alignment of the audio and the subtitles. The method was verified on a set of real data (a set of TV shows with hidden subtitles).
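A minimal sketch of the longest-common-subsequence step, assuming the keyword spotter returns (word, time) hits; the names and data shapes are illustrative.

    # Keep the longest chain of keyword hits that respects both the subtitle
    # order and the audio (time) order.
    def lcs_alignment(subtitle_words, hits):
        """subtitle_words: list of words; hits: list of (word, time) sorted by time."""
        n, m = len(subtitle_words), len(hits)
        dp = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n - 1, -1, -1):
            for j in range(m - 1, -1, -1):
                if subtitle_words[i] == hits[j][0]:
                    dp[i][j] = 1 + dp[i + 1][j + 1]
                else:
                    dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
        aligned, i, j = [], 0, 0
        while i < n and j < m:
            if subtitle_words[i] == hits[j][0]:
                aligned.append((i, hits[j][1]))   # subtitle word i anchored at this time
                i, j = i + 1, j + 1
            elif dp[i + 1][j] >= dp[i][j + 1]:
                i += 1
            else:
                j += 1
        return aligned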
#411: User Adaptation in a Hybrid MT System
Susanne Preuß, Hajo Keffer, Paul Schmidt, Georgios Goumas, Athanasia Asiki, Ioannis Konstantinou
In this paper we present the User Adaptation (UA) module implemented as part of a novel Hybrid MT translation system. The proposed UA module allows the user to enhance core system components such as synchronous grammars and system dictionaries at run-time. It is well-known that allowing users to modify system behavior raises the willingness to work with MT systems. However, in statistical MT systems user feedback is only `a drop in the ocean' in the statistical resources. The hybrid MT system proposed here uses rule-based synchronous grammars that are automatically extracted out of small parallel annotated bilingual corpora. They account for structural mappings from source language to target language. Subsequent monolingual statistical components further disambiguate the target language structure. This approach provides a suitable substrate to incorporate a lightweight and effective UA module. User corrections are collected from a post-editing engine and added to the bilingual corpus, whereas the resulting additional structural mappings are provided to the system at run-time. Users can also enhance the system dictionary. User adaptation is organized in a user-specific commit-and-review cycle that allows the user to revise user adaptation input. Preliminary experimental evaluation shows promising results on the capability of the system to adapt to user structural preferences.
#368: User Modeling for Language Learning in Facebook
Maria Virvou, Christos Troussas, Jaime Caro, Kurt Junshean Espinosa
The rise of Facebook presents new challenges for matching users with content of their preferences. In this way, the educational aspect of Facebook is accentuated. In order to emphasize the educational usage of Facebook, we implemented an educational application, which is addressed to Greek users who want to learn the Conditionals grammatical structure in Filipino and vice versa. Given that educational applications are targeted to a heterogeneous group of people, user adaptation and individualization are promoted. Hence, we incorporated a student modeling component, which retrieves data from the user’s Facebook profile and from a preliminary test to create a personalized learning profile. Furthermore, the system provides advice to each user, adapted to his/her knowledge level. To illustrate the modeling component, we presented a prototype Facebook application. Finally, this study indicates that the wider adoption of Facebook as an educational tool can further benefit from the user modeling component.
#481: Using Cognates to Improve Lexical Alignment Systems
Mirabela Navlea, Amalia Todirascu
In this paper, we describe a cognate detection module integrated into a lexical alignment system for French and Romanian. Our cognate detection module uses lemmatized, tagged and sentence-aligned legal parallel corpora. As a first step, this module applies a set of orthographic adjustments based on orthographic and phonetic similarities between French - Romanian pairs of words. Then, statistical techniques and linguistic information (lemmas, POS tags) are combined to detect cognates in our corpora. We automatically align the set of obtained cognates and the multiword terms containing cognates. We study the impact of cognate detection on the results of a baseline lexical alignment system for French and Romanian. We show that the integration of cognates in the alignment process improves the results.
#500: Using Dependency-Based Annotations for Authorship Identification
Charles Hollingsworth
Most statistical approaches to stylometry to date have focused on lexical methods, such as relative word frequencies or type-token ratios. Explicit attention to syntactic features has been comparatively rare. Those approaches that have used syntactic features typically either used very shallow features (such as parts of speech) or features based on phrase structure grammars. This paper investigates whether typed dependency grammars might yield useful stylometric features. An experiment was conducted using a novel method of depicting information about typed dependencies. Each token in a text is replaced with a "DepWord," which consists of a concise representation of the chain of grammatical dependencies from that token back to the root of the sentence. The resulting representation contains only syntactic information, with no lexical or orthographic information. These DepWords can then be used in place of the original words as the input for statistical language processing methods. I adapted a simple method of authorship attribution --- nearest neighbor based on word frequency rankings --- for use with DepWords, and found it performed comparably to the same technique trained on words or parts of speech, even outperforming lexical methods in some cases. This indicates that the grammatical dependency relations between words contain stylometric information sufficient for distinguishing authorship. These results suggest that further research into typed-dependency-based stylometry might prove fruitful.
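A minimal sketch of DepWord construction, assuming a dependency parse is given as per-token head indices and relation labels; this input format is chosen for illustration and is not the paper's.

    # One "DepWord" per token: the chain of typed dependency labels from the
    # token up to the sentence root.
    def depwords(heads, labels):
        """heads[i]: index of token i's head, -1 for the root; labels[i]: its relation."""
        out = []
        for i in range(len(heads)):
            chain, j = [], i
            while j != -1:
                chain.append(labels[j])
                j = heads[j]
            out.append("_".join(chain))
        return out

    # Toy parse of "the dog barks":
    # depwords([1, 2, -1], ["det", "nsubj", "root"])
    # -> ['det_nsubj_root', 'nsubj_root', 'root']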
#439: Using a Double Clustering Approach to Build Extractive Multi-document Summaries
Sara Botelho Silveira, António Branco
This paper presents a method for extractive multi-document summarization that explores a two-phase clustering approach. First, sentences are clustered by similarity, and one sentence per cluster is selected, to reduce redundancy. Then, in order to group them according to topics, those sentences are clustered considering the collection of keywords that represent the topics in the set of texts. Evaluation reveals that the approach pursued produces highly informative summaries, containing much relevant data and no repeated information.
#536: Contribution of Terminological Paradigm to English Linguistic Discourse
Olha Ivashchyshyn
The paper focuses on research into language diversity in English linguistic discourse. Semantic and structural methods of linguistic analysis have been applied to determine the contribution of terms to creating linguistic discourse. Investigations in the area of discourse (Dijk, 2000; Karasik, 2004; Bhatia, 2010) have pointed to the fact that discourse is a complex communicative phenomenon reflecting linguo-cognitive processes with linguistic and extra-linguistic constituents. Linguistic discourse is aimed at achieving cognitive and communicative purposes within the linguistic domain. Terminology theory (Kyyak, 2003; Wright & Budin, 1997) has defined terminology as a complex system of language units expressing specific concepts and belonging to the theory and practice of a certain branch. The investigation has shown that linguistic terminology is a determining factor of linguistic discourse since it expresses the concepts of this specific discourse, which sets it apart from discourses with other cognitive and communicative purposes. The discussion of the pragmatic functions of terminological units and the correlation between term, meaning and context, as well as the structural and semantic organization of terminological phrases in linguistic discourse, is at the heart of the paper. The presentation will demonstrate the results of the theoretical research and discuss the outcomes of its practical implementation.
#524: A SOM-based method for word translation disambiguation
George Tambouratzis (additional author names to be added)
The PRESEMT project constitutes a novel approach to the machine translation (MT) task. The aim of PRESEMT is to develop a language-independent MT system architecture that is readily portable to new language pairs or adaptable to specialised domains with minimal effort by the developer. PRESEMT falls within the Corpus-based MT (CBMT) paradigm, using a small bilingual parallel corpus and a large TL monolingual corpus. The PRESEMT approach is characterised by (a) the employment of cross-disciplinary techniques from the machine learning and computational intelligence domains and (b) the use of inexpensive-to-collect language resources that can largely be retrieved over the web. The present article investigates the process of selecting the best translation for a given token by choosing over a set of suggested translations, as provided by a bilingual lexicon. The motivation for the proposed module originates from earlier document organization systems based on the SOM model (Self-Organizing Map). The SOM model is an unsupervised neural network that performs clustering tasks based on the similarity of patterns in terms of features. It is the choice of input features to the SOM model that differentiates the applications and can lead to the optimal retrieval performance. Though the SOM has been studied extensively for text processing applications, the present article represents a novel application of the SOM for disambiguation in machine translation tasks. In the article, the actual features employed are described, which project textual data (more precisely lemmas) onto the two-dimensional SOM lattice, so that related lemmas are situated at small distances from each other. This description is followed by details on the actual maps being developed to model very large corpora and on experimental results of integrating the maps into the PRESEMT MT system.
#402: ABBYY Syntactic and Semantic Parser
Konstantin Druzhkin, Eugene Indenbom, Philip Minlos
The paper presents a parser developed by ABBYY Software House. The linguistic descriptions compatible with the software are now available both for Russian and for English. The resulting parses are used in further processing, such as machine translation and information retrieval. The parser is based on a hand-crafted description of morphology, syntax and semantics. The other major source of data is statistical processing of corpora, which supplements the declarative description with weights. The syntax model is rich enough to account for non-local dependencies (anaphora, ellipsis, raising, equi-verbs). Lexical items are stored in a hierarchical thesaurus tree. The structure of the tree (its branches, so to speak) is language-independent; that is, it is currently shared by the descriptions of Russian and English.
#545: SYNC3: A System for Synergistically Structuring News Content from Traditional Media and the Blogosphere
Georgios Petasis
This demo will present the SYNC3 platform along with the annotation infrastructure of the Ellogon language engineering platform, a customisation of which is presented in paper 450. The user will have the opportunity to use the Web UI of the SYNC3 platform and to see how SYNC3 structures news content from traditional media and the blogosphere. In addition, the user will have the opportunity to see and use the aligned bilingual corpora annotation tool, or any other instantiation of the Ellogon annotation engine. News and social media are emerging as a dominant source of information for numerous applications. However, their vast unstructured content presents challenges to efficient extraction of such information. The SYNC3 system aims to intelligently structure content from both traditional news media and the blogosphere. To achieve this goal, SYNC3 incorporates innovative algorithms that first model news media content statistically, based on fine clustering of articles into so-called “news events”. Such models are then adapted and applied to the blogosphere domain, allowing its content to be mapped to the traditional news domain. Furthermore, appropriate algorithms are employed to extract news event labels and relations between events, in order to efficiently present news content to the system end users.
#548: Commonest match
Vít Baisa
The aim of this demonstration is to present a new functionality which extracts the so-called commonest match for a lemma and its collocate. Sometimes the pair itself (e.g. see - final) may be confusing, since it may not be obvious in which relation and context the two words occur, especially when there are intervening words (saw world cup finals). The result may be very useful, especially for lexicographers. We will show examples of the commonest match implemented inside the Sketch Engine, an advanced corpus manager.
#543: Fips multilingual parser/tagger
Eric Wehrli
Fips is a robust, "deep linguistic" multilingual parser (English, French, German, Italian, Spanish) based on generative linguistics concepts and object-oriented design. The demo will focus on the flexibility (the user can select phrase-structure or POS tagging representations, with a large variety of other options for the latter), and show how Fips can cope with multiword expressions (in particular collocations) as well as some cases of anaphora resolution.
#544: PSI-Toolkit - a hands-on NLP toolkit
Filip Graliński, Marcin Junczys-Dowmunt, Krzysztof Jassem
PSI-Toolkit is a toolchain for automatic processing of Polish (and - to a lesser extent - other languages: English, German, French, Spanish and Russian) with the focus on machine translation. PSI-Toolkit is designed for linguists, language engineers or any users who need to process natural language texts and get the result now! PSI-Toolkit provides an easy-to-use web service as well as a set of advanced tools for users with some computer skills.
#542: LDA-Frames: an Unsupervised Approach to Generating Semantic Frames
Jiří Materna
LDA-frames is an unsupervised approach to identifying semantic frames from semantically unlabelled text corpora. There are many frame formalisms, but most of them suffer from the problem that all frames must be created manually and the set of semantic roles must be predefined. The LDA-Frames approach, based on Latent Dirichlet Allocation, avoids both these problems by employing statistics on a syntactically tagged corpus. The only information that must be given is the number of semantic frames and the number of semantic roles to be identified. In a future implementation, however, this limitation will be removed by automatic estimation of both these parameters.
#537: Semantic Orientation Extraction of Chinese Phrases by Discriminative Model and Global Features
Xiao Sun, Degen Huang, Fuji Ren
Extracting the semantic orientation of phrases, especially newly generated phrases from the internet, is an important task for sentiment analysis of web texts and other real-world texts. This paper proposes a novel algorithm which attempts to attack this problem by integrating a discriminative model and a latent value model. Although Chinese phrases consist of multiple words, the semantic orientation of a phrase is not just a simple integration of the orientations of the component words, as some words can invert the orientation, so that the phrase can have a totally different semantic orientation. In order to capture this property of such phrases, a hidden semi-CRF, which includes a latent variable layer, is adopted. The method is tested experimentally on a manually labeled set of positive and negative phrases, and the experiments have shown very promising results, comparable to the best values reported so far.
#546: Morphosyntactic toolchain for Polish
Adam Radziszewski
The demonstration presents a set of tools for processing Polish developed at Wrocław University of Technology. The tools include a morphosyntactic tagger, a syntactic chunker, as well as a formalism and a utility to generate morphosyntactic features from tagged text. The software is available under the GNU LGPL 3.0, which allows for both scientific and commercial use.
#547: Inforex — a web-based tool for text corpora management and annotation
Michał Marcińczuk
Inforex is a web-based system for text corpora management. The system supports a wide range of tasks related to text corpus construction, including: document clean-up and selection, annotation of semantic chunks and named entities, annotation of relations between semantic chunks and named entities, annotation of anaphora relations, word senses and document metadata description. Several features make the system unique in its own way.
#549: Automatic Collocation Dictionaries
Miloš Husák
The Automatic Collocation Dictionary (ACD) is an experimental application based on corpora provided in the Sketch Engine. Using Word Sketches and an algorithm for finding Good Dictionary EXamples, we provide sample sentences illustrating the usage of tens of thousands of words and their most salient collocates. The demo includes a presentation of the method for creating the ACDs and a live preview of dictionaries for several available languages.
#550: TEDDCLOG - Testing English with Data-Driven CLOze Generation
Avinesh PVS
English is the first language of 309-400 million people and the second language of 380 million to 1.8 billion. An English language learner has to be fluent in the vocabulary, grammar and syntax of the language. Gap-fill questions play an important role in language teaching: they allow students to demonstrate their proficiency in a language. It is time-consuming and difficult for an item-setter to prepare a good test set. We present a system, TEDDCLOG, that generates draft test items using a very large corpus of English, using functions for finding collocates, distractors and carrier sentences in the Sketch Engine, a leading corpus query tool.
#551: Multiword Sketches
Vojtěch Kovář
Word Sketch is a one-page, automatic, corpus-derived summary of a word's grammatical and collocational behaviour, built into the Sketch Engine corpus management system. It has been available since 2004 and is used very intensively in professional lexicography and advanced language teaching. So far, it has been available for single words and lemmas. Now we present its extension to multiword units, making it possible to show word sketches for noun phrases, phrasal verbs, etc. The demo will include an explanation of the method used and examples available online on the Sketch Engine website.
#552: Corpus similarity in the Sketch Engine
Vít Suchomel
Large corpora for over sixty languages are available on the Sketch Engine website. One approach to studying the content of a large corpus is to measure its similarity to (or difference from) other corpora. The method was proposed by Kilgarriff and is a part of his invited talk at this conference. A new functionality based on Kilgarriff's method has been developed in the Sketch Engine. It allows users to measure the similarity of their own corpora and/or the inbuilt corpora. The demonstration will include a description of the implementation as well as similarity cross-tables for the corpora available on the Sketch Engine website.
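A minimal sketch of a Kilgarriff-style similarity score: a chi-square statistic computed over the most frequent words of the two corpora taken together, where lower values mean more similar corpora. The Sketch Engine implementation may differ in its details.

    # Chi-square corpus similarity over the joint top-n word list.
    from collections import Counter

    def corpus_similarity(words_a, words_b, n_top=500):
        fa, fb = Counter(words_a), Counter(words_b)
        na, nb = sum(fa.values()), sum(fb.values())
        top = [w for w, _ in (fa + fb).most_common(n_top)]
        chi2 = 0.0
        for w in top:
            o_a, o_b = fa[w], fb[w]
            e_a = (o_a + o_b) * na / (na + nb)   # expected counts if both corpora
            e_b = (o_a + o_b) * nb / (na + nb)   # shared one word distribution
            chi2 += (o_a - e_a) ** 2 / e_a + (o_b - e_b) ** 2 / e_b
        return chi2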