Language, Speech and Multimedia Technologies Observatory
04/14/2011 - 16:30

The Multimodal Interaction Working Group has published a last call working draft and preliminary vocabularies standard.
04/14/2011 - 16:30

A new article from Victoria Nicks takes a look at OWL, the Web Ontology Language. OWL “consists of various families and specifications, and, like other web ontology languages, is used by computer programs to interpret information on the Internet. One use for OWL is the interpretation of web pages for use in other algorithms, such as search engines.”

Nicks continues, “There are a variety of Web ontology languages, operating on different principles but with the same basic goals. In order to grasp the concept of OWL, you’ll need to understand terms such as: web, ontologies, agents, languages and the semantic web.” Nicks goes on to give her definitions of each. continued…

New Career Opportunities Daily: The best jobs in media.
04/14/2011 - 16:30

In my most recent post, I introduced RDF as a flexible and schema-less data model. Some of you may then think that working with RDF data is going to be a complete mess. In some cases, that may be true, and that’s fine! There are use cases in which all you want is messy data. But what if you want to do more interesting things with your RDF data, like infer new knowledge? This is where ontologies come in.
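To make "infer new knowledge" concrete, here is a minimal sketch in plain Python, not the post's own code. It assumes a tiny set of made-up triples (the `ex:` names are hypothetical) and applies one standard RDFS rule: if a resource has type C and C is a subclass of D, the resource also has type D.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def infer_types(triples: Set[Triple]) -> Set[Triple]:
    """Return the input triples plus every rdf:type triple implied by rdfs:subClassOf."""
    inferred = set(triples)
    changed = True
    while changed:  # repeat until no new triple can be derived
        changed = False
        for s, p, o in list(inferred):
            if p != "rdf:type":
                continue
            for c, q, d in list(inferred):
                if q == "rdfs:subClassOf" and c == o:
                    new = (s, "rdf:type", d)
                    if new not in inferred:
                        inferred.add(new)
                        changed = True
    return inferred


# Hypothetical example data: the ex: names are illustrative only.
data = {
    ("ex:Rex", "rdf:type", "ex:Dog"),
    ("ex:Dog", "rdfs:subClassOf", "ex:Mammal"),
    ("ex:Mammal", "rdfs:subClassOf", "ex:Animal"),
}
closure = infer_types(data)
```

The class hierarchy plays the role of a very small ontology here: the data never says that Rex is an animal, but the subclass statements let a program derive it.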

What is an Ontology?

Let me scare you for a minute. The computer science definition of ontology is:

a formal and explicit specification of a shared conceptualization


Let’s break this down and get our hands dirty. continued…

03/28/2011 - 22:15

A new article provides some insight into HTML5 and RDF for those of us who don’t work on the technical side of things. The article explains that “since the data on the web is often in forms that make it computationally complex to parse or recognize, new HTML tags and standards had to be developed and integrated with HTML5 to provide this functionality.” continued…

03/26/2011 - 00:25

You know Semantic Web technologies are going mainstream when the company that is so closely associated with making PCs mainstream is getting in on the action. That company is Dell, and who knows but that the work it’s pursuing in the Semantic Web today won’t have just as much of an impact as its supply chain innovations did to help drive its success in those early PC days?

The proof-of-concept Semantic Web work at Dell is taking place under the direction of Yijing (Jenna) Zhou, enterprise architecture consultant, and Chary Tamirisa, enterprise architecture senior consultant. What’s the impetus for Dell to pursue this? Zhou and Tamirisa provided some insight into the whys, whats, and hows in an email discussion with The Semantic Web Blog.

“The questions raised initially were:  why Semantic Web and how can Dell benefit from its use?” Zhou and Tamirisa note. “Our answer is as follows: Semantic technology is a key enabler for Dell to model enterprise business objects to enable end-to-end mapping and reuse across current and future business models, processes, and systems. We are leveraging enterprise architecture management support for semantic technology and ontology modeling to build broader awareness and knowledge across our business and IT stakeholders.  Our long-term plan is to provide tangible value propositions that address current and future business challenges and opportunities. We are also focused on developing the change management strategies required to enable and adopt the techniques and technologies related to semantic-based solutions.”


03/18/2011 - 09:00
"Beyond Search: enabling biomedical knowledge
discovery through natural language processing"
Speaker: Karin Verspoor
Research Assistant Professor
Professor Larry Hunter's research lab,
Center for Computational Pharmacology,
University of Colorado Denver
Place: Seminar room 3.1, Faculty of Informatics
Date: January 18
Time: 15:30
03/11/2011 - 09:30

Posted by Slav Petrov and Ryan McDonald, Research Team

One major hurdle in organizing the world’s information is building computer systems that can understand natural, or human, language. Such understanding would advance if systems could automatically determine syntactic and semantic structures.

This analysis is an extremely complex inferential process. Consider, for example, the sentence "A hearing is scheduled on the issue today." A syntactic parser needs to determine that "is scheduled" is a verb phrase, that "hearing" is its subject, that the prepositional phrase "on the issue" modifies "hearing", and that "today" is an adverb modifying the verb phrase. Of course, humans do this all the time without realizing it. For computers, it is non-trivial because it requires a fair amount of background knowledge, typically encoded in a rich statistical model. Consider "I saw a man with a jacket" versus "I saw a man with a telescope". In the former, we know that a jacket is something people wear and is not a mechanism for viewing people. So syntactically, "jacket" must be a property associated with "man" and not with the verb "saw"; that is, I did not see the man by using a jacket to view him. In the latter, we know that a telescope is something with which we can view people, so it can also be a property of the verb. The sentence remains ambiguous, of course: maybe the man is carrying the telescope.
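The two readings of the telescope sentence can be written down as data. The sketch below is only an illustration with hand-written dependency arcs (one common convention, in which the preposition heads its noun); it is not the output of any parser mentioned in the post, and all names in it are hypothetical.

```python
from typing import Dict, List

TOKENS: List[str] = ["I", "saw", "a", "man", "with", "a", "telescope"]

# Each parse maps a token index to the index of its head; -1 marks the root.
# Reading 1: the man has the telescope ("with" attaches to "man").
NOUN_ATTACHMENT: Dict[int, int] = {0: 1, 1: -1, 2: 3, 3: 1, 4: 3, 5: 6, 6: 4}
# Reading 2: the telescope is the instrument of seeing ("with" attaches to "saw").
VERB_ATTACHMENT: Dict[int, int] = {0: 1, 1: -1, 2: 3, 3: 1, 4: 1, 5: 6, 6: 4}


def head_of(parse: Dict[int, int], index: int) -> str:
    """Return the head word of the token at `index`, or "ROOT" for the root."""
    head = parse[index]
    return "ROOT" if head == -1 else TOKENS[head]


def differing_tokens(a: Dict[int, int], b: Dict[int, int]) -> List[str]:
    """List the tokens whose head differs between two parses of TOKENS."""
    return [TOKENS[i] for i in range(len(TOKENS)) if a[i] != b[i]]


print(head_of(NOUN_ATTACHMENT, 4))  # head of "with" in reading 1
print(head_of(VERB_ATTACHMENT, 4))  # head of "with" in reading 2
print(differing_tokens(NOUN_ATTACHMENT, VERB_ATTACHMENT))
```

The two analyses differ in a single arc, which is why a parser needs background knowledge about jackets and telescopes to choose between them.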

Linguistically inclined readers will of course notice that this parse tree has been simplified by omitting empty clauses and traces.

Computer programs with the ability to analyze the syntactic structure of language are fundamental to improving the quality of many tools millions of people use every day, including machine translation, question answering, information extraction, and sentiment analysis. Google itself is already using syntactic parsers in many of its projects. For example, one paper describes a system in which a syntactic dependency parser is used to make translations more grammatical between languages with different word orderings. Another paper uses the output of a syntactic parser to help determine the scope of negation within sentences, which is then used downstream to improve a sentiment analysis system.

To further this work, Google is pleased to announce a gift to the Linguistic Data Consortium (LDC) to create new annotated resources that can facilitate research progress in the area of syntactic parsing. The primary purpose of the gift is to generate data sets that language technology researchers can use to evaluate the robustness of new parsing methods in several web domains, such as blogs and discussion forums. The goal is to move parsing beyond its current focus on carefully edited text such as print news (for which annotated resources already exist) to domains with larger stylistic and topical variability (where spelling errors and grammatical mistakes are more common).

The Linguistic Data Consortium is a non-profit organization that produces and distributes linguistic data to researchers, technology developers, universities and university libraries. The LDC is hosted by the University of Pennsylvania and directed by Mark Liberman, Christopher H. Browne Distinguished Professor of Linguistics.

The LDC is the leader in building linguistic data resources and will annotate several thousand sentences with syntactic parse trees like the one shown in the figure. The annotation will be done manually by specially trained linguists who will also have access to machine analysis and can correct errors the systems make. Once the annotation is completed, the corpus will be released to the research community through the LDC catalog. We look forward to seeing what they produce and what the natural language processing research community can do with the rich annotation resource.
03/03/2011 - 21:55

We bring three pieces of news from the OPENMT-2 project (2010-2012):

Conclusions of Gorka Labaka's thesis

In this thesis Gorka studied statistical machine translation for Basque; more precisely, how to use morphological and syntactic knowledge to improve translation results.

EUSMT: Incorporating Linguistic Information to Statistical Machine Translation for Basque

Translating into Basque from the surrounding languages is not easy, either by hand or automatically:

  • Basque morphology is very rich. This poses a great difficulty for statistical translation: because Basque has many more word forms (etxe, etxea, etxera, etxetik...), it is harder to find high occurrence counts for every word in bilingual corpora (previously translated texts).
  • The word order is very different.
  • Because Basque has few speakers, far fewer translated texts can be collected than for the surrounding languages. And those texts are what the statistics rest on!

Given that situation, Gorka Labaka developed two techniques to improve the quality of statistical translation:

  • Segmenting words, that is, separating lemmas from suffixes. He examined four different ways of doing so, in order to address the problem of infrequent word forms.
  • Reordering the source-language words, at the noun-phrase level and at the sentence level, bringing them closer to the order their equivalents will have in Basque. This reordering is very helpful to the statistical decoder when it searches for suitable translations.
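A toy sketch of why segmentation helps with sparse word forms follows. It is not one of the four segmentation schemes studied in the thesis: the mini-lexicon and suffix list are assumptions made up for this illustration, and a real system would rely on a morphological analyzer.

```python
from collections import Counter
from typing import List

KNOWN_LEMMAS = {"etxe"}        # hypothetical mini-lexicon ("etxe" = house)
SUFFIXES = ["tik", "ra", "a"]  # hypothetical suffix list, longest first


def segment(word: str) -> List[str]:
    """Split a word into lemma and suffix when the lemma is in the lexicon."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and stem in KNOWN_LEMMAS:
            return [stem, "+" + suffix]
    return [word]  # leave unknown or bare forms untouched


# The four word forms cited in the text, each seen once in a toy corpus.
corpus = ["etxe", "etxea", "etxera", "etxetik"]

surface_counts = Counter(corpus)
segmented_counts = Counter(tok for word in corpus for tok in segment(word))

print(surface_counts)    # every surface form occurs only once
print(segmented_counts)  # the shared lemma now pools all occurrences
```

Before segmentation each form is seen once; after it, the lemma is seen four times, which gives a statistical model more evidence per unit.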

Lately most researchers give statistical translation systems all the prominence, and many dismiss rule-based systems. According to the evaluation of Gorka Labaka's results, however, that is not the right approach.

Among other things, Gorka reached these conclusions:

  • A rule-based system (Matxin) and a standard statistical system using an 8-million-word corpus achieve results of the same level.
  • The EUSMT statistical system built with his improvements achieves better results than those two (10% better on the HTER measure).
  • A better system can be built by combining the two. An "oracle" system that compared the outputs of both systems and returned the better one could be a further 10% better.
    In 55% of cases it would have to take the EUSMT proposal, in 41% the Matxin one, and in the remaining 4% the one found through patterns in translation memories.
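The "oracle" idea can be sketched in a few lines. This is only an illustration under stated assumptions: the per-sentence scores are invented toy numbers (lower is better, as with an error rate such as HTER), not results from the thesis, and a real oracle needs reference translations to score against.

```python
from collections import Counter
from typing import Dict, List, Tuple


def oracle_select(scores: List[Dict[str, float]]) -> List[Tuple[str, float]]:
    """For each sentence, pick the system with the lowest error score."""
    return [min(s.items(), key=lambda kv: kv[1]) for s in scores]


# Invented toy scores for three sentences; lower means fewer edits needed.
toy_scores = [
    {"EUSMT": 0.30, "Matxin": 0.45},
    {"EUSMT": 0.50, "Matxin": 0.20},
    {"EUSMT": 0.10, "Matxin": 0.40},
]

picks = oracle_select(toy_scores)
share = Counter(name for name, _ in picks)
oracle_avg = sum(score for _, score in picks) / len(picks)
system_avg = {
    name: sum(s[name] for s in toy_scores) / len(toy_scores)
    for name in ("EUSMT", "Matxin")
}

print(share)       # how often each system is chosen
print(oracle_avg)  # average error of the oracle combination
print(system_avg)  # average error of each system on its own
```

Because the oracle always takes the per-sentence minimum, its average error can never exceed that of either system alone. It therefore gives an upper bound on what a practical combination strategy could gain.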

In view of those conclusions, we have set the direction of our research on hybridization and post-editing. We are looking for effective ways to combine the results of Matxin and EUSMT. The other two pieces of news below follow from those lines of work.

The researcher Lluis Marquez will be visiting us until the summer.

To do research on hybridization for combining translation systems, Lluis Marquez, who heads the UPC group within the OPENMT-2 project, will be with us until the summer. He is an international expert in language technology, especially in the use of machine learning techniques. Gorka Labaka's experiments confirmed that better results can be obtained by combining the Matxin and EUSMT systems. We are now looking for the most suitable type of combination.

Collaboration with the Basque Wikipedia to research post-editing.

Within the project we have launched an initiative to add 50 long Wikipedia articles about computer science. The Matxin translation system will produce the first drafts by translating from the Spanish Wikipedia, and then a group of volunteers, coordinated by the eu.wikipedia association, will correct those drafts (using the OmegaT program) and publish them.

It will be an enriching experience in both directions. Wikipedia will benefit because 50 new articles will be created, and machine translation will benefit too, because a 100,000-word corpus of manually post-edited translations will be gathered. That corpus will be a key resource for improving the quality of the machine translation system using statistical techniques. (See the presentation submitted to IEB2011, or, in English, the one submitted to Wikimania2010.)
03/01/2011 - 20:55

Semantic Web Journal special issue: Linked Data for Science and Education
02/25/2011 - 15:45

Read about the FreeRBMT 2011 workshop which was held in Barcelona on January 20th and 21st...
