Detecting and understanding multiword terms is not an easy problem. Researcher Valia Kordoni, who is visiting us from the renowned DFKI laboratory in Saarbrücken, Germany, will talk to us about it: how to extract multiword terms automatically and how to use them to build multilingual grammars.

Topic: Automated Annotation and Acquisition of Linguistic Knowledge for Efficient Multilingual Grammar Engineering
(Automatic extraction of multiword terms for building multilingual grammars).
Venue: Room 3.2, Informatika Fakultatea (Faculty of Informatics)
Speaker: Valia Kordoni (LT-Lab DFKI GmbH & Dept. of Computational Linguistics, Saarland University).
Date: November 18

Time: 16:00-17:00


In this talk, I mainly deal with automated acquisition
of linguistic knowledge as a means of enhancing
robustness of lexicalised grammars for real-life applications. The
case study I focus on for most of this talk is
Multiword Expressions (henceforward MWEs). Specifically,
in the first part of the talk I take a closer
look at the linguistic properties of MWEs, in particular,
their lexical, syntactic, as well as semantic
characteristics. The term Multiword Expressions has been
used to describe expressions for which the
syntactic or semantic properties of the whole expression
cannot be derived from its parts (cf., Sag et al., 2002),
including a large number of related but distinct
phenomena, such as phrasal verbs (e.g., “come along”),
nominal compounds (e.g., “frying pan”), institutionalised
phrases (e.g., “bread and butter”), and many
others. Jackendoff (1997) estimates the number of MWEs in a
speaker’s lexicon to be comparable to the number of single
words. However, due to their heterogeneous characteristics,
MWEs present a tough challenge for both
linguistic and computational work (cf., Sag et al., 2002).
For instance, some MWEs are fixed, and do not present
internal variation, such as “ad hoc”, while
others allow different degrees of internal variability and
modification, such as “spill beans” (“spill
several/musical/mountains of beans”). With the observations
about the linguistic properties of MWEs at hand, I turn in
the second part of the talk to methods for the automated
acquisition of these properties for robust grammar
engineering. To this effect, I first investigate
the hypothesis that MWEs can be detected by the distinct statistical
properties of their component words, regardless of their
type, comparing various statistical measures, a
procedure which leads to extremely
interesting conclusions. I then investigate the
influence of the size and quality of different
corpora, using the BNC and the Web search engines Google and
Yahoo. I conclude that, in terms of language usage, web-generated
corpora are fairly similar to more
carefully built corpora, like the BNC, indicating that the
lack of control and balance in these corpora is probably
compensated for by their size.
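
As an illustration of the kind of statistical association measure such a comparison might involve, the following is a minimal sketch that ranks adjacent word pairs by pointwise mutual information (PMI). The abstract does not name the measures used, so the choice of PMI, the function name `pmi_scores`, the `min_count` threshold, and the toy corpus are all assumptions made for illustration only.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Hypothetical helper: PMI is only one of many possible association
    measures and is not necessarily one of those compared in the talk.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # very rare pairs give unreliable PMI estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy corpus (invented); a real experiment would use counts from a
# large corpus such as the BNC or from Web search engine hit counts.
toy_corpus = (
    "the frying pan is hot . the dog is old . "
    "she bought a frying pan . the frying pan broke ."
).split()

scores = pmi_scores(toy_corpus)
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for pair, score in ranked:
    print(" ".join(pair), round(score, 2))
```

On this toy corpus the pair “frying pan” receives the highest score, because its two words co-occur more often than their individual frequencies would predict.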
Then, I show a qualitative evaluation of the results of
automatically adding extracted MWEs to existing
linguistic resources. To this effect, I first discuss two
main approaches commonly employed in NLP for treating MWEs:
the words-with-spaces approach, which models an MWE as a
single lexical entry and can adequately capture fixed
MWEs like “by and large”, and compositional approaches which
treat MWEs by general and compositional methods of
linguistic analysis, being able to capture more
syntactically flexible MWEs, like “rock boat”, which cannot
be satisfactorily captured by a words-with-spaces
approach, since this would require lexical entries to be
added for all the possible variations of an MWE (e.g.,
“rock/rocks/rocking this/that/his…boat”). On this basis, I
argue that the automatic addition of
extracted MWEs to existing linguistic resources improves qualitatively
if a more compositional approach to grammar/lexicon
extension is adopted.
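
The contrast between the two approaches can be sketched with a toy example. It is a deliberately simplified stand-in, not the grammar-based method described in the talk: the words-with-spaces lexicon is modelled as a set of fixed strings, and the compositional entry for “rock boat” as a regular expression with an inflected verb and one open slot. The names `FIXED_LEXICON`, `ROCK_BOAT`, `matches_fixed` and `matches_compositional` are hypothetical.

```python
import re

# Words-with-spaces: each MWE is stored as one fixed string.
FIXED_LEXICON = {"by and large", "ad hoc", "rock the boat"}


def matches_fixed(phrase):
    """Return True only if the phrase is listed verbatim in the lexicon."""
    return phrase.lower() in FIXED_LEXICON


# Toy stand-in for a compositional entry: the verb may be inflected and
# at most one word (e.g. a determiner) may appear before the noun.
# A real lexicalised grammar would express this with lexical types and
# syntactic rules rather than a regular expression.
ROCK_BOAT = re.compile(r"\brock(?:s|ed|ing)?\s+(?:\w+\s+)?boat\b", re.IGNORECASE)


def matches_compositional(phrase):
    """Return True if the phrase contains a variant of 'rock ... boat'."""
    return ROCK_BOAT.search(phrase) is not None


for phrase in ["by and large", "rock the boat", "rocks that boat", "rocking his boat"]:
    print(phrase, matches_fixed(phrase), matches_compositional(phrase))
```

The fixed lexicon recognises only the exact strings it lists, so covering “rocks that boat” or “rocking his boat” would need one entry per variant, whereas the single pattern covers all of them.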
Finally, I also propose that the methods developed for
the acquisition of linguistic knowledge in the case of the
English MWEs can be tuned to enhance robustness
of lexicalised grammars for languages with richer morphology
and freer word order, as is the case for German, and can
benefit from gold standard syntactically and
semantically annotated corpora, for the (semi-automated)
development of which I briefly
present a very simple statistical ranking model that
significantly improves treebanking efficiency by
directing human annotators to the most relevant linguistic
annotation decisions.