Language, Speech and Multimedia Technologies Observatory
12/02/2010 - 21:20

Posted by Yun-hsuan Sung (宋雲軒) and Martin Jansche, Google Research

On November 30th 2010, Google launched Cantonese Voice Search in Hong Kong. Google Search by Voice has been available in a growing number of languages since we launched our first US English system in 2008. In addition to US English, we already support Mandarin for Mainland China, Mandarin for Taiwan, Japanese, Korean, French, Italian, German, Spanish, Turkish, Russian, Czech, Polish, Brazilian Portuguese, Dutch, Afrikaans, and Zulu, along with special recognizers for English spoken with British, Indian, Australian, and South African accents.

Cantonese is widely spoken in Hong Kong, where it is written using traditional Chinese characters, similar to those used in Taiwan. Chinese script is much harder to type than the Latin alphabet, especially on mobile devices with small or virtual keyboards. People in Hong Kong typically use either “Cangjie” (倉頡) or “Handwriting” (手寫輸入) input methods. Cangjie (倉頡) has a steep learning curve and requires users to break the Chinese characters down into sequences of graphical components. The Handwriting (手寫輸入) method is easier to learn, but slow to use. Neither is an ideal input method for people in Hong Kong trying to use Google Search on their mobile phones.

Speaking is generally much faster and more natural than typing. Moreover, some Chinese characters – like “滘” in “滘西州” (Kau Sai Chau) and “砵” in “砵典乍街” (Pottinger Street) – are so rarely used that people often know only the pronunciation, and not how to write them. Our Cantonese Voice Search begins to address these situations by allowing Hong Kong users to speak queries instead of entering Chinese characters on mobile devices. We believe our development of Cantonese Voice Search is a step towards solving the text input challenge for devices with small or virtual keyboards for users in Hong Kong.

There were several challenges in developing Cantonese Voice Search, some unique to Cantonese, some typical of Asian languages and some universal to all languages. Here are some examples of problems that stood out:

  • Data Collection: In contrast to English, there are few existing Cantonese datasets that can be used to train a recognition system. Building a recognition system requires both audio and text data so it can recognize both the sounds and the words. For audio data, our efficient DataHound collection technique uses smartphones to record and upload large numbers of audio samples from local Cantonese-speaking volunteers. For text data, we sample from anonymized search query logs to obtain the large amounts of data needed to train language models.
  • Chinese Word Boundaries: Chinese writing doesn’t use spaces to indicate word boundaries. To limit the size of the vocabulary for our speech recognizer and to simplify lexicon development, we use characters, rather than words, as the basic units in our system and allow multiple pronunciations for each character.
  • Mixing of Chinese Characters and English Words: We found that Hong Kong users mix more English into their queries than users in Mainland China and Taiwan. To build a lexicon for both Chinese characters and English words, we map English words to a sequence of Cantonese pronunciation units.
  • Tone Issues: Linguists disagree on how many tones Cantonese has – some say 6, some say 7, or 9, or 10. In any case, it’s a lot. We decided to model tone-plus-vowel combinations as single units. To limit the complexity of the resulting model, some rarely used tone-vowel combinations are merged into single models.
  • Transliteration: We found that some users use English words while others use the Cantonese transliteration (e.g., “Jordan” vs. “佐敦”). This makes it challenging to develop and evaluate the system, since it’s often impossible for the recognizer to distinguish between an English word and its Cantonese transliteration. During development we use a metric that simply checks whether the correct search results are returned.
  • Different Accents and Noisy Environment: People speak in different styles with different accents. They use our systems in a variety of environments, including offices, subways, and shopping malls. To make our system work in all these different conditions, we train it using data collected from many different volunteers in many different environments.
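
Two of the ideas above, the character-based lexicon with multiple pronunciations and the result-based evaluation metric, can be illustrated with a minimal Python sketch. It is not Google's implementation: the lexicon entries, the Jyutping-style pronunciation strings, the toy search index and all function names are hypothetical and chosen only for illustration.

```python
from itertools import product

# Toy character lexicon (hypothetical): each Chinese character is one unit and
# may list several pronunciations. The Jyutping-style strings are illustrative.
CHAR_LEXICON = {
    "佐": ["zo2"],
    "敦": ["deon1"],
    "行": ["hang4", "hong4"],
}

# Toy English lexicon (hypothetical): each English word maps to one or more
# sequences of Cantonese pronunciation units.
ENGLISH_LEXICON = {
    "jordan": [["zo1", "dan4"]],
}


def tokenize(query):
    """Split a query into single Chinese characters and whole ASCII words."""
    tokens, word = [], ""
    for ch in query:
        if ch.isascii() and ch.isalpha():
            word += ch          # keep building the current English word
            continue
        if word:
            tokens.append(word.lower())
            word = ""
        if not ch.isspace():
            tokens.append(ch)   # every other character is its own unit
    if word:
        tokens.append(word.lower())
    return tokens


def pronunciations(query):
    """List every pronunciation sequence the toy lexicons allow for a query."""
    options = []
    for token in tokenize(query):
        if token in ENGLISH_LEXICON:
            options.append([tuple(p) for p in ENGLISH_LEXICON[token]])
        else:
            # Raises KeyError for characters missing from the toy lexicon.
            options.append([(p,) for p in CHAR_LEXICON[token]])
    return [sum(combo, ()) for combo in product(*options)]


# Toy search index (hypothetical): an English query and its transliteration
# lead to the same results.
TOY_INDEX = {"jordan": {"result-a"}, "佐敦": {"result-a"}, "旺角": {"result-b"}}


def toy_search(query):
    """Look up a query in the toy index; unknown queries return an empty set."""
    return TOY_INDEX.get(query.strip().lower(), set())


def results_match(hypothesis, reference, search=toy_search):
    """Count a recognition as correct if it yields the reference's results."""
    expected = search(reference)
    return bool(expected) and search(hypothesis) == expected


print(pronunciations("Jordan 行"))
print(results_match("Jordan", "佐敦"))
```

Using characters as units keeps the lexicon small, at the cost of enumerating several pronunciation sequences per query. Comparing search results rather than transcripts means that recognizing “Jordan” when the user said “佐敦” is not counted as an error in this sketch.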

Cantonese is Google’s third spoken language for Voice Search in the Chinese linguistic family, after Mandarin for Mainland China and Mandarin for Taiwan. We plan to continue to use our data collection and language modeling technologies to help speakers of Chinese languages easily input text and look up information.
11/30/2010 - 17:25

The company says the unique user interface will make communicating easier.
11/30/2010 - 17:25

The Call for Presentations (CFP) for the 2011 Semantic Technology Conference is now open!  Full details on the CFP requirements and process are available here:

SemTech 2011 Call for Presentations

Presentations for the conference will be reviewed by the Program Advisory Committee (PAC) – a group of semantic technology experts who have offered their time and expertise to help ensure that the educational agenda is once again representative of the best work in the field.  I’m grateful for the assistance of all the PAC members, who are listed here:

SemTech 2011 Program Advisory Committee

We’re interested in all sorts of presentations, from business and consumer applications through to fundamental technology discussions.  With each year, the emphasis of the program gradually shifts more towards practical experience as semantic technology matures and is more widely deployed, so we’re very much looking for semantic case studies and project experience. If you are developing a tool, or using semantics to create an exciting new start-up company, we want to hear your story as well.  And having said all that, we still keep timeslots available for cutting edge research and thought-leadership.  So SemTech really is an all-inclusive program for the entire community.

Feel free to email me with any questions. And I’ll reiterate something I said a couple of weeks ago – if you know someone who is doing interesting work, please tell me about him or her so I can invite that person to participate.  We don’t claim to know everyone in the field so your assistance in finding good speakers is invaluable to the process of creating a great educational forum.

Tony Shaw

SemTech Program Co-Chair

11/26/2010 - 22:50

Besides mentioning Qwiki and Commons, the Albisteak 2.0 programme gave special attention to OpenMT (Machine Translation) and Wikipedia. The aim is to improve machine translation for Basque, and the work that can be done with Wikipedia is especially important for that.
11/24/2010 - 10:15
Status: stable
Last release: 3.1.0 (20th August 2010)
License: LGPL
Affiliation: University of Manchester
Web resources

The OWL API is a Java interface and implementation for the W3C Web Ontology Language (OWL), used to represent Semantic Web ontologies. The API focuses on OWL DL and the upcoming OWL 2. The OWL API also offers specific bindings for the OWL DL reasoners FaCT++ and Pellet.

The OWL API includes the following components:
11/19/2010 - 19:30

This week of course saw the winners of the Elsevier 2010 Semantic Web Challenge revealed. Among the distinguishing features of the winners: The marriage of meaty semantic web technology and accessible user interfaces.

In the past, entries might veer to the cool, funky and flashy on the surface, but a bit weak in the supporting infrastructure, or to some very compelling technical stuff that didn’t have a lot of interface appeal to the end user. But that wasn’t so much the case this time around, says Diana Maynard, one of this year’s co-chairs and a research associate in the computer science department at the University of Sheffield, where she focuses on NLP.

“People are starting to combine the two,” she says – good looks and good legs, so to speak.

That matters if semantic web technologies are to continue making headway outside the research communities and into some real practical applications. “You have to make the UI appealing to the outside user, especially with the take-up of semantic web technologies that we are starting to see in the real world,” Maynard says. “It’s important that these applications are not just research but things you can use in real life.” (The Semantic Web Blog recently looked at the UI issue in some depth here.)


11/19/2010 - 18:20

Nuance will power aisle411, an app that lets users navigate a store using voice.
11/19/2010 - 08:50

Ksenia Security will use Loquendo TTS for the security and home automation market
11/19/2010 - 08:50

GanganJH creates innovative language learning tool for iPhone, iPad, and iPod Touch.
11/16/2010 - 17:40

"Ulysses, my navigation maps have been erased." There was no TomTom in the days of Ulysses 31!

In the science-fiction series and cartoons I was crazy about as a child, one of the most mind-blowing things was what were then called ‘computers’. Never mind Google or whatever comes after Google! It was enough to ask a question out loud, and the machine answered in a flash, showing the images and videos needed to complete the answer at the same time… That is what the members of G Komandoa, Ulysses 31 and the whole crew of Star Trek did.

Well, today I was blown away. I had the chance to try the alpha version of a website called Qwiki and, leaving aside the fact that questions have to be typed in, that is exactly what it does: it looks up the search term on a number of sites, extracts text, images and video from them, and presents it all in the neatest possible way. A pleasant voice reads out clickable text, and while we listen to the narration we can view related videos and images.

It struck me as a very attractive way of presenting information, especially when the goal is to get a general overview. Qwiki will hardly give us the depth that a Wikipedia article can, but for quick lookups it seems ideal. Here, for instance, is what it answers when asked about ‘Bilbao’. (If you type ‘Bilbo’, it gives information about Bilbo Baggins from The Lord of the Rings!)

Imagine what this could become if support were added for asking questions by voice and for other languages. Mind-blowing, isn't it?

(By the way, if you want an invitation to try the alpha version, all you have to do is ask!)
