Language, Speech and Multimedia Technologies Observatory
11/16/2010 - 08:30

It all started with an email to the baypiggies mailing list. An acquisition editor for Packt was looking for authors to expand their line of Python cookbooks. For some reason I can’t remember, I thought they wanted to put together a multi-author cookbook, where each author contributes a few recipes. That sounded doable, because I’d already written a number of articles that could serve as the basis for a few recipes. So I replied with links to the following articles:

The reply back was:

The next step is to come up with around 8-14 topics/chapters and around 80-100 recipes for the book as a whole.

My first reaction was “WTF?? No way!” But luckily, I didn’t send that email. Instead, I took a couple days to think it over, and realized that maybe I could come up with that many recipes, if I broke my knowledge down into small pieces. I also decided to choose recipes that I didn’t already know how to write, and use them as motivation for learning & research. So I replied back with a list of 92 recipes, and got to work. Not surprisingly, the original list of 92 changed significantly while writing the book, and I believe the final recipe count is 81.

I was keenly aware that there’d be some necessary overlap with the original NLTK book, Natural Language Processing with Python. But I did my best to minimize that overlap, and to present a different take on similar content. And there are a number of recipes that (as far as I know) you can’t find anywhere else, the largest group of which can be found in Chapter 6, Transforming Chunks and Trees. I’m very pleased with the result, and I hope everyone who buys the book is too. I’d like to think that Python Text Processing with NLTK 2.0 Cookbook is the practical companion to the more teaching-oriented Natural Language Processing with Python.

If you’d like a taste of the book, check out the online sample chapter (pdf), Chapter 3, Creating Custom Corpora, which details how many of the included corpus readers work, how to use them, and how to create your own corpus readers. The last recipe shows you how to create a corpus reader on top of MongoDB, and it should be fairly easy to modify for use with any other database.
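To give a rough idea of what a database-backed corpus reader looks like, here is a minimal sketch. It is not the recipe from the book and it does not use NLTK's own corpus reader classes. It assumes SQLite (from the Python standard library) in place of MongoDB, and the class name `SQLiteCorpusReader`, the `documents` table and the `text` column are all hypothetical names chosen for illustration. The sentence splitting is deliberately naive.

```python
import sqlite3


class SQLiteCorpusReader:
    """Toy corpus reader (hypothetical): one database row per document."""

    def __init__(self, conn, table="documents", text_column="text"):
        self.conn = conn
        self.table = table
        self.text_column = text_column

    def texts(self):
        # Stream the raw document strings straight from the database.
        query = f"SELECT {self.text_column} FROM {self.table} ORDER BY rowid"
        for (text,) in self.conn.execute(query):
            yield text

    def sents(self):
        # Naive split on periods and whitespace; a real reader would
        # plug in proper sentence and word tokenizers here.
        for text in self.texts():
            for sent in text.split("."):
                words = sent.split()
                if words:
                    yield words

    def words(self):
        # Flatten the sentences into a single stream of words.
        for sent in self.sents():
            yield from sent


# Build a tiny in-memory corpus to exercise the reader.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (text TEXT)")
conn.executemany(
    "INSERT INTO documents (text) VALUES (?)",
    [("Hello world. This is a test.",), ("Another document here.",)],
)
reader = SQLiteCorpusReader(conn)
print(list(reader.sents()))
```

Because only `texts()` touches the database, swapping in another backend should mostly mean rewriting that one method.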

Packt has also published two excerpts from Chapter 8, Distributed Processing and Handling Large Datasets, which are partially based on those original 2 articles:
11/15/2010 - 13:30
Wikicorpus, v. 1.0: Catalan, Spanish and English portions of the Wikipedia.

The Wikicorpus contains portions of the Catalan, Spanish, and English Wikipedias
based on a 2006 dump. The corpora have been automatically tagged with lemma and
part of speech information using the open source library FreeLing. Also, they have
been WordNet-sense annotated with the state-of-the-art Word Sense Disambiguation
algorithm UKB. In its current version, the corpora have the following sizes:

* Catalan: around 50 million words
* Spanish: around 120 million words
* English: around 600 million words

We provide access to the corpora in their raw text and tagged versions, under the
same license as Wikipedia itself. To our knowledge, these are the largest Catalan
and Spanish corpora freely available for download. Moreover, we also provide an
open source Java-based parser for Wikipedia pages developed for the construction
of the corpus. For more information and download, please visit the project's page:
11/15/2010 - 13:30
*Call for Papers

Translation and Technology Conference*
/Paris (France), March 3-4, 2011/

We no longer translate as we did 50, 20 or even 10 years ago. What form will the translation process take 10, 20 or 50 years from now? What
will be the demand for translation and what kind of tools, what kind of approach, will we need to meet that demand? Who will be the translators
11/15/2010 - 13:30

Unlike simple instructions or an automatic answering machine, an avatar or virtual person carrying out the tasks as an assistant enables its user to have “more intuitive and natural” communication, states Ms María del Puy Carretero, computer engineer and researcher at the Vicomtech-IK4 centre, who is working on perfecting avatars.
11/15/2010 - 13:30

Unlike simple instructions or an automatic answering machine, an avatar or virtual character that performs assistant tasks lets the user have more intuitive and natural communication. So says María del Puy Carretero, a computer engineer and researcher at the Vicomtech-IK4 centre, who is working on perfecting avatars.
11/15/2010 - 13:30

Unlike with bare instructions or an automatic answering machine, communication with an avatar or virtual character that does assistant work is more intuitive and natural for the user. So says Maria del Puy Carretero, a computer engineer and researcher at the Vicomtech-IK4 centre. She is working on improving these virtual characters.
11/11/2010 - 18:40

I know “semantic web” probably sounds like something the Chief Technology Officer should worry about, as opposed to the people who create and manage the content of a site. But for traditional content publishers, the realm of digital is such a vastly different landscape that understanding the unique opportunities – as well as the challenges – will offer a real advantage to content professionals who prepare themselves today.

For the first couple decades of digital content distribution, media companies primarily focused on how the digitization of their product made it harder for them to control the distribution channels. Initial technological developments attempted to recreate the controls and restrictions of traditional media distribution, keeping conditions as close as possible to the familiar expectations of content producers, advertisers, and consumers. But these attempts to slow the march of progress have been, at best, only partially successful. continued…

11/11/2010 - 13:00
18-19 November 2010, Oxford Circus, London

There is still time to book your place for this year’s exciting conference!

This will be the thirty-second conference in the series and is supported by BCS NLTSG, EAMT, ITI and TILP. The Aslib event draws together a diverse group of delegates, who will gain new insights and brainstorm ideas on the use of information technology for translation. The conference has a single stream, facilitating discussion on issues highlighted by the wide mix of presentations from both users and developers. In addition to the presentations, there will be a panel discussion on “The Right to Access to Content in your Language needs to be extended beyond the G20”. Why not join us for lively discussion, sharing new ideas and networking with other professionals?

Full details, including the programme, registration fees and how to exhibit, can be found at:<>

I hope you can attend and I look forward to
11/11/2010 - 08:20

My new book, Python Text Processing with NLTK 2.0 Cookbook, has been published. You can find it at both Packt and Amazon. For those of you who pre-ordered it, thank you, and I hope you receive your copy soon.

The Packt page has a lot more details, including the Table of Contents and a sample chapter (pdf). The sample chapter is Chapter 3, Creating Custom Corpora, which covers the following:

  • creating your own corpora
  • using many of the included corpus readers
  • creating custom corpus readers
  • creating a corpus reader on top of MongoDB

I hope you find Python Text Processing with NLTK 2.0 Cookbook useful, informative, and maybe even fun.
11/10/2010 - 08:20

Posted by Pedro J. Moreno, Staff Research Scientist and Johan Schalkwyk, Senior Staff Engineer


Today we’re introducing Voice Search support for Zulu and Afrikaans, as well as South African-accented English. The addition of Zulu in particular represents our first effort in building Voice Search for underrepresented languages.

We define underrepresented languages as those which, while spoken by millions, have little presence in electronic and physical media, e.g., webpages, newspapers and magazines. Underrepresented languages have also often received little attention from the speech research community. Their phonetics, grammar, acoustics, etc., haven’t been extensively studied, making the development of ASR (automatic speech recognition) voice search systems challenging.

We believe that the speech research community needs to start working on many of these underrepresented languages to advance progress and build speech recognition, translation and other Natural Language Processing (NLP) technologies. The development of NLP technologies in these languages is critical for enabling information access for everybody. Indeed, these technologies have the potential to break language barriers.

We also think it’s important that researchers in these countries take a leading role in advancing the state of the art in their own languages. To this end, we’ve collaborated with the Multilingual Speech Technology group at South Africa’s North-West University led by Prof. Etienne Barnard (also of the Meraka Research Institute), an authority in speech technology for South African languages. Our development effort was spearheaded by Charl van Heerden, a South African intern and a student of Prof. Barnard. With the help of Prof. Barnard’s team, we collected acoustic data in the three languages, developed lexicons and grammars, and Charl and others used those to develop the three Voice Search systems. A team of language specialists traveled to several cities collecting audio samples from hundreds of speakers in multiple acoustic conditions such as street noise, background speech, etc. Speakers were asked to read typical search queries into an Android app specifically designed for audio data collection.

For Zulu, we faced the additional challenge of few text sources on the web. We often analyze the search queries from local versions of Google to build our lexicons and language models. However, for Zulu there weren’t enough queries to build a useful language model. Furthermore, since it has few online data sources, native speakers have learned to use a mix of Zulu and English when searching for information on the web. So for our Zulu Voice Search product, we had to build a truly hybrid recognizer, allowing free mixture of both languages. Our phonetic inventory covers both English and Zulu and our grammars allow natural switching from Zulu to English, emulating speaker behavior.
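As a rough illustration of why a single model over both languages permits free switching, here is a toy sketch. It is not the Voice Search system and makes no claim about how that system is built. It assumes a plain bigram language model with add-one smoothing, trained on a handful of mixed-language queries that were invented for this example; the function names are hypothetical too.

```python
from collections import Counter


def train_bigram_model(queries):
    """Count unigram contexts and bigrams over one shared vocabulary."""
    unigrams = Counter()
    bigrams = Counter()
    vocab = set()
    for query in queries:
        tokens = ["<s>"] + query.lower().split() + ["</s>"]
        vocab.update(tokens)
        unigrams.update(tokens[:-1])          # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams, vocab


def bigram_prob(model, prev, word):
    """Add-one smoothed estimate of P(word | prev)."""
    unigrams, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))


# Invented toy queries mixing Zulu and English tokens (illustration only).
queries = [
    "izindaba namhlanje",
    "weather namhlanje",
    "izindaba johannesburg",
    "weather durban",
]
model = train_bigram_model(queries)

# A switch seen in the training queries scores higher than an unseen one,
# but both are possible because the vocabulary is shared across languages.
seen = bigram_prob(model, "weather", "namhlanje")
unseen = bigram_prob(model, "namhlanje", "weather")
print(seen, unseen)
```

With two separate monolingual models, a query that changes language midway would fall outside both vocabularies; training one model on the mixed queries keeps those transitions in scope.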

This is our first release of Voice Search in a native African language, and we hope that it won’t be the last. We’ll continue to work on technology for languages that have until now received little attention from the speech recognition community.

Salani kahle!**

* “Welcome” in Afrikaans
** “Stay well” in Zulu
