Language, Speech and Multimedia Technologies Observatory
12/20/2010 - 20:15

VoiceDJ allows users to control their music libraries entirely with their voices.
12/20/2010 - 10:00

Mans Hulden, a Finnish researcher from the University of Helsinki, will be with us until May. Mans has developed Foma, a tool for working with automata and transducers. Foma consists of a compiler, a programming language, and a C library. Given a set of rules written in a special format, it converts those rules into finite-state transducers and automata. It produces very efficient implementations, and it is free software.
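To give a feel for what "rules compiled into finite-state transducers" means, here is a minimal hand-written sketch in Python. It is not Foma's rule syntax or API: the rule (rewrite n as m before p) and the function name `apply_n_to_m_before_p` are assumptions chosen for illustration only. The machine has two states, and in the second one it holds back an 'n' until the next symbol shows whether the rule applies.

```python
def apply_n_to_m_before_p(word):
    """Hand-built two-state transducer for the illustrative rule n -> m / _ p.

    State 0: nothing pending. State 1: an 'n' was read but not yet emitted.
    """
    out = []
    holding_n = False  # True means we are in state 1
    for ch in word:
        if holding_n:
            if ch == "p":
                # The held 'n' is followed by 'p': emit 'm' instead of 'n'.
                out.append("mp")
                holding_n = False
            elif ch == "n":
                # Emit the held 'n' unchanged and hold the new one.
                out.append("n")
            else:
                out.append("n" + ch)
                holding_n = False
        elif ch == "n":
            holding_n = True
        else:
            out.append(ch)
    if holding_n:
        # End of input: flush the pending 'n' unchanged.
        out.append("n")
    return "".join(out)


print(apply_n_to_m_before_p("inpossible"))  # impossible
```

A rule compiler of the kind described above would derive such a machine automatically from the written rule rather than requiring it to be coded by hand.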

We are working on the Xuxen spelling checker with Foma. At present Xuxen's rules and lexicons are implemented with the XFST program, but XFST is not free software. If we redefined Xuxen with Foma, we could also integrate it in its entirety into freely distributed programs. That is our motivation for studying Foma, but the tool is also of interest for any other kind of finite-state application.

In January Mans will give a course on Foma and, of course, you are invited to attend:

Dates: January 11, 12, and 13

Schedule: 9:30-12:30

Venue: Informatika Fakultatea (Faculty of Informatics)

In the meantime, you can look at the material from the tutorial that Mans and Iñaki Alegria prepared for the LREC2010 conference last May.
12/20/2010 - 08:50

Since my earlier post on the new trending tool provided by Google Books, I've been thinking more about the service. While I've found plenty of interesting trends (more on which later), I've also been considering the underlying data and interface. Many of these considerations are common to any trending or other data probing interface (such as BlogPulse).

While plenty of highly visible copy has been written about the opportunities presented by the data set (the potential to understand trends in our culture and linguistics), this enthusiastic data geekery is somewhat lacking in data diligence. The original article in Science, for example, doesn't describe the data in the most basic terms.

At the very least, the data needs to be described in terms of design, accuracy and bias.

By data design I mean the intentions of the data. These intentions are somewhat exposed in the interface (where one can choose from things like 'American English', 'British English', etc.). I'd love to understand the rationale behind some of the corpora - e.g. English 1 Million - and the reason for missing corpora (we have English Fiction but not English Non-Fiction).

The accuracy of the data, with respect to the design, can at least be considered in terms of the current specifications. How accurate are the years associated with the articles? How accurate is the origin of publication? In addition, as Google points out, the accuracy of the OCR is also of great interest, especially for older texts (Danny Sullivan has an interesting post on this topic).

Finally, given any data designed along a set of dimensions, one can always take another set of dimensions and see how they are distributed and correlated, if at all. For example, what is the mixture of fiction and non-fiction in the English corpus? What is the distribution of topics? Are these representative with respect to historically accurate accounts of linguistic and cultural shifts (e.g. the introduction of the novel, the impact of the Enlightenment on the mixture of fiction and non-fiction)? What is the sampling from different publishing houses, and is that representative of the number of books, or the number of copies sold? This last point is intriguing: does a book with 1 million copies in circulation have more 'culturonomic' impact than a book with only a single copy out there?

While the data sets are clearly labeled as 'American English' and 'British English' the books in those collections are not actually classified as such. Rather they are defined by their country of publication. With this in mind, how do we interpret the color v colour graph from my earlier post? As Natalie pointed out in an email, the trend in 'British English' of the difference between these terms could be described either by an underlying cultural shift towards the American spelling, or by a change in the ratio of American books published in the UK without editorial 'translations'.


Searching for foreign terms in certain languages brings up hits for the foreign language (e.g. 'because' in the Spanish corpus, 'pourquoi' in the English corpora).

Regarding the English Fiction corpus, I was surprised to see mentions of figures and tables in works of fiction.


Drilling down on these in the interface surfaces what are clearly non-fiction publications (but it is not clear if this search is filtered by the various corpora visualized in the ngram interface). When looking at these anomalies, it is also important to bear in mind the volume of hits. Here we are seeing very small fractions of the overall corpus containing what look like terms indicating false positives.

Another subtle and easily missed (I missed it!) aspect of the interface is that it is case sensitive. This allows us to do interesting queries like 'however' versus 'However'.


How do we interpret this? The most obvious interpretation might be that 'however' at the beginning of a sentence is becoming more frequent. We could also conclude that 'however' in general is becoming more frequent (imagine if we could combine the lines). Alternatively, it could mean that sentence length in the corpus is shifting. Given that we don't know the exact cultural mix of the 'British English' corpus, it could be somehow related to the mixture of American and British content. Finally, it could be due to the mix of fiction and non-fiction. Interestingly, the 'American English' corpus has quite a different signal.
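The "combine the lines" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `combine_series` is hypothetical, and the yearly counts below are made up, not taken from the ngram interface. Summing the two case variants gives the overall trend for the word, while the share of the capitalised form isolates the sentence-initial effect.

```python
def combine_series(*series):
    """Sum several {year: count} series year by year; missing years count as 0."""
    years = sorted(set().union(*series))
    return {year: sum(s.get(year, 0) for s in series) for year in years}


# Made-up counts for illustration only, not real ngram data.
lower = {1900: 40, 1950: 45, 2000: 50}  # 'however'
upper = {1900: 10, 1950: 20, 2000: 35}  # 'However'

combined = combine_series(lower, upper)
# Fraction of all occurrences that are capitalised, per year.
share_upper = {year: upper.get(year, 0) / combined[year] for year in combined}

print(combined)           # {1900: 50, 1950: 65, 2000: 85}
print(share_upper[1900])  # 0.2
```

If the combined line rises while the capitalised share stays flat, the word itself is becoming more frequent; if the share rises, the shift is towards sentence-initial use.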


When investigating temporal data, it is always interesting to try to discover things that don't change over time. What words would we expect to be relatively stable? From a simple initial probing, it seems like numbers and days of the week are reasonably stable. In looking at this, I did find that certain colours come and go in a very correlated pattern.


Overall, I find this to be a hugely exciting project. I'm disappointed in the general lack of analysis given to the data set before jumping to conclusions, but perhaps this is more a reflection of the blogosphere and the quality of writing. I'd love to see a more in-depth analysis of the corpora provided by the team that wrote the Science article.

 Update: Read more at The Binder Blog (1, 2), and at the Language Log.
12/17/2010 - 18:45

I love this new feature in Google's book search product which allows you to look at the time series trends for terms according to the publication dates of books. The example below shows the trend for the tokens 'colour' and 'color'.


This type of statistical analysis brings up lots of questions, both about the occurrence of terms in books in general and about the distribution of books in Google's collection. Does it show a decline in the ratio of British to American publications, a decline in the British spelling of colour, or a bias in the corpus towards recent American publications and earlier British publications? Hard to say, and it is interesting to ask whether one could find out by probing only via this tool.

Update: I'm actually quite serious about wanting to understand both the distribution of terms in our language and the nature of the collection. While this article on ReadWriteWeb rightly celebrates the insights that this data set can bring, it does not question how representative the underlying data set is.

Update: It seems Chomsky throws the tool off. In searching for 'manufacturing consent' I'm shown this artwork:

This is probably a positive sign indicating a lot of interest in the tool - though surprising to see this in a Google product.
12/16/2010 - 23:30

The scientific book publisher enables text-to-speech on all its titles, increasing access for people with print disabilities.
12/16/2010 - 23:30

When I was a student at the end of the 1970s, I never dared imagine, even in my wildest dreams, that the scientific community would one day have the means of analyzing computerized text corpora of several hundred billion words. At the time, I marvelled at the Brown Corpus, which contained the extraordinary quantity of one million words of American English and which, after serving to compile the American Heritage Dictionary, was made widely available to scientists. This corpus, despite a size that now seems derisory, enabled an impressive number of studies and contributed greatly to the development of language technologies... The study to be published tomorrow in Science by a team comprising scientists from Google, Harvard, MIT, the Encyclopaedia Britannica and Houghton Mifflin Harcourt (publisher of the American Heritage Dictionary) deals with the largest linguistic corpus of all time: 500 billion words. This is the data collected by Google in its (sometimes controversial) programme to digitise books, used, for the first time to my knowledge, for an extensive linguistic study.

I was lucky to have access to the study before publication, and I felt rather light-headed on reading it... My fingers were itching to talk about it on this blog, but I was forced to respect the embargo (I think the team has organised a bit of a buzz; you will hear about it in the press, as far as I can tell from all the journalists calling me). This corpus contains 4% of all the books ever published on Earth. As the authors say, to read only the texts published in the year 2000 (i.e. a tiny fraction of the whole), without pausing to eat or sleep, you would need 80 years, a whole lifetime for us humans. The sequence of letters in the whole corpus is 1000 times longer than our genome, and if it were all written on one line, it would reach to the moon and back 10 times!

Let's not get carried away, though: the corpus will not be accessible to common mortals, who will have to make do with pre-calculated results, namely the list of words and "n-grams" (i.e. sequences of n consecutive words, limited to 5 words) extracted from the corpus, for English and six other languages, including French. That is already a lot, so let's not be churlish, all the more so as the data are consolidated by year, allowing for some very interesting studies, and can already be sampled from the on-line interface.
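For readers unfamiliar with n-grams, the following Python sketch shows how such per-year tables could be built. It is an illustration under stated assumptions, not the team's actual pipeline: the `(year, text)` input format, the whitespace tokenisation and the function names are all hypothetical.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def counts_by_year(books, max_n=5):
    """Count all n-grams with n = 1..max_n, grouped by publication year.

    `books` is an iterable of (year, text) pairs (a hypothetical format).
    """
    table = defaultdict(Counter)
    for year, text in books:
        tokens = text.split()  # naive whitespace tokenisation, an assumption
        for n in range(1, max_n + 1):
            table[year].update(ngrams(tokens, n))
    return table


# Toy data for illustration only.
books = [
    (1999, "the blog is new"),
    (2000, "the blog is read and the blog is old"),
]
table = counts_by_year(books)
print(table[2000][("the", "blog")])  # 2
```

Only the counts survive in such a table, which is why the published lists allow trend studies without giving access to the full text of the books.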

The authors provide a few examples, illustrated with curves that are rather like those from the Chronologue (some readers may remember this tool I made in 2005 for French, which unfortunately died with the decline of the search engine by Free, where I was working at the time). Except, of course, that I had neither the resources nor the material collected by Google, which make it possible to trace lexical curves over more than two centuries! The fields covered are as varied as the evolution of grammar (compared usage of regular and irregular forms of English verbs such as burnt/burned) and the effect of censorship (the disappearance of names such as Marc Chagall during the Nazi period)...

The correlation between the use of names of diseases and peaks in epidemics especially struck me, as it reminded me exactly of the curves I obtained on bird 'flu [fr], except that these new data go all the way back to the 19th century! I won't take an image from Science, I'll let you read the article, but here's another image, from an internal team report, that illustrates peaks in the use of the word cholera since 1800. The bluish zones correspond to the terrible epidemics that hit the United States and Europe (in particular the south of France, the area where I live, with thousands of deaths in Marseille, Toulon, etc.).

For the occasion, the team came up with a new word, culturomics, to describe this new activity: a portmanteau word that starts with culture and ends like genomics. It is interesting to note that, apart from computer scientists (Dan Clancy and Peter Norvig at Google, for example) and lexicographers (including Joe Pickett, the current director of the American Heritage Dictionary), the team includes cognitive scientists and biologists, such as the well-known Steven Pinker and Martin Nowak, and two mathematician-biologists: Jean-Baptiste Michel, the first author of the study (a Frenchman, from the Ecole Polytechnique and doing a post-doc at Harvard), and Erez Lieberman Aiden. This is no coincidence: biology and language processing share many things when it comes to algorithms and mathematics (I gave one example myself with phylogenetic trees: for example here, here or here).

And for French? Well, it all remains to be done. My sleeves are rolled up! Here's the very first curve, obtained exclusively thanks to the complicity of the team, who in passing, I would like to thank warmly. It's for the word blog in French, the adoption of which from English we can see as it happened [see update below]...

Today, I am feeling the fascination that astronomers must have felt when they turned Hubble for the first time on an unexplored corner of the universe. Something has happened, a giant step has been taken in the tools available to the linguist.

Will linguists (French ones anyway) be aware of it? That's a whole other story. There is often a huge gap between numbers and letters...


Update: superimposed curves for blog in French (light blue) and in American English (dark blue). The shift between the two languages is clearly visible (NB: vertical scales do not match).

12/16/2010 - 20:40

When I was a student, at the end of the 1970s, I would never have dared imagine, even in my wildest dreams, that the scientific community would one day have the means to analyse computerised text corpora of several hundred billion words. At the time, I marvelled at the Brown Corpus, which contained the extraordinary quantity of one million words of American English and which, after being used to compile the American Heritage Dictionary, had been made fairly widely available to researchers. This corpus, despite a size that now seems derisory, made possible an impressive number of studies and contributed greatly to the rise of language technologies... The study that will be published tomorrow in Science by a team of researchers from Google, Harvard, MIT, the Encyclopaedia Britannica and Houghton Mifflin Harcourt (publisher of the American Heritage Dictionary) deals with the largest linguistic corpus of all time: 500 billion words. It consists of the data gathered by Google in its (sometimes controversial) book digitisation programme, which are thus being used, to my knowledge for the first time, for a large-scale linguistic study.

I was lucky enough to have access to the study before publication, and it made me somewhat dizzy... My fingers were itching to write about it on this blog, but I forced myself to respect the embargo (I believe the team has organised something of a buzz; you should see it in the press, judging by the calls I have received from journalists). This corpus contains 4% of the books ever published on Earth. As the authors say, to read only the texts from the year 2000 (that is, a very small portion of the whole, which spans more than two centuries!), without stopping to eat or sleep, a human being would need 80 years, an entire lifetime. The sequence of letters in the corpus as a whole is 1000 times longer than our genome, and if it were all written on a single line, that line would stretch from the Earth to the Moon and back 10 times!

Alas, we should not dream too much all the same: the corpus will not be accessible to ordinary mortals, who will have to make do with precomputed results, namely the list of words and "n-grams" (that is, sequences of n consecutive words) extracted from the corpus (with a limit of 5 words), for English and six other languages, including French. But that is already a lot, so let us not spoil our pleasure, especially since the data are consolidated by year, which will allow some very interesting studies.

The authors give a few examples, illustrated by curves reminiscent of those of the Chronologue; some readers may remember this tool, which I built for fun in 2005 (and which sadly died with the decline of the Free search engine, with which I was collaborating at the time). Except that, of course, I had neither the means nor the material gathered by Google, which make it possible to trace such lexical curves over more than two centuries! The fields covered are as varied as grammatical evolution (the comparative usage of regular and irregular forms of English verbs such as burnt/burned) and the effect of censorship (the disappearance of names such as Marc Chagall during the Nazi period)...

The correlation between the use of disease names and epidemic peaks struck me in particular, because it reminded me very precisely of the curves I had obtained on bird flu, except that the new data go back to the 19th century! I am not going to reproduce an image from Science; I will let you read the article there, but here is another image, from an internal report by the team, which illustrates the peaks in usage of the word cholera (in English) since 1800. The bluish zones correspond to the terrible periods of epidemic that struck the United States and Europe (notably the south of France, with thousands of deaths in Marseille, Toulon, etc.).

For the occasion, the team coined a word, culturomics, to describe this new type of activity, a portmanteau that begins like culture and ends like genomics. It is quite interesting to note that, apart from computer scientists (Dan Clancy and Peter Norvig of Google, for example) and lexicographers (including Joe Pickett, the current director of the American Heritage Dictionary), the team includes several biologists, including the well-known Steven Pinker and Martin Nowak, and brilliant young mathematician-biologists: Jean-Baptiste Michel, first author of the study (a Frenchman, a graduate of the Ecole Polytechnique and a post-doc at Harvard), and Erez Lieberman Aiden. This is no accident: biology and language processing share a great deal in terms of algorithms and mathematics (I have given an example of this myself with phylogenetic trees, for example here, here or here).

And for French? Well, everything remains to be done. I am rolling up my sleeves! Here is the very first curve, obtained as a preview thanks to the complicity of the team, whom I warmly thank in passing. It is for the word blog, whose birth we can watch live...

Today I feel the fascination that astronomers no doubt felt when they first pointed Hubble at an unexplored corner of the universe. Something has happened; a milestone has been passed in the tools available to the linguist.

Will linguists (French ones at any rate) be aware of it? That is another story. Between numbers and letters, there is sometimes a very wide gap...

To find out more

Nuance Releases a Gaming Speech Command Set
12/14/2010 - 18:20
12/09/2010 - 09:20

Gamers can use voice commands to play PC games; the software comes with standard Dragon Dictation.
12/05/2010 - 17:25

Nuance releases Dragon Dictation and Dragon Search in Spanish.
