Google releases TyDi QA, a data set that aims to capture the uniqueness of languages

February 7, 2020

Google hopes to spur the development of AI capable of understanding the different ways in which languages express meaning. To this end, company researchers today detailed TyDi QA, a question-answering data set covering 11 languages that was inspired by typological diversity, the notion that different languages express meaning in structurally unique ways.

TyDi QA is something of a complement to the English-language Natural Questions corpus Google released last year, and it attempts to capture the idiosyncrasies and features of tongues like Japanese and Arabic. The researchers point out, for instance, that English changes words to indicate one object (“book”) versus many (“books”), and that Arabic has a third form to indicate if there are two of something (“كتابان”, kitaban) beyond just singular (“كتاب”, kitab) or plural (“كتب”, kutub).

“Because we selected a set of languages that are typologically distant from each other for this corpus, we expect models performing well on this dataset to generalize across a large number of the languages in the world,” wrote Google Research scientist Jonathan Clark in a blog post.

TyDi QA includes over 200,000 question-answer pairs from languages representing a “diverse range” of linguistic phenomena and data challenges, many of which use non-Latin alphabets (such as Arabic, Bengali, Korean, Russian, Telugu, and Thai) and form words in complex ways (including Arabic, Finnish, Indonesian, Kiswahili, and Russian). The languages also range from those with an abundance of available data on the web (English and Arabic) to those with very little (Bengali and Kiswahili).

The questions were collected from people who wanted an answer but didn’t yet know it, which heads off questions that merely reuse the words of the answer. To inspire questions, the researchers showed contributors a passage from Wikipedia written in their native language. They then had them ask a question, any question, as long as it wasn’t answered by the passage and they actually wanted to know the answer (e.g., “Does a passage about ice make you think about popsicles in summer? Great! Ask who invented popsicles.”). Importantly, the questions were written directly in each language, not translated, so many questions are unlike those seen in an English-first corpus (e.g., সফেদা ফল খেতে কেমন?, or “What does sapodilla taste like?”).

For each of the questions, the researchers performed a Google Search for the best-matching Wikipedia article in the appropriate language and asked a person to find and highlight the answer within that article. In some languages, they found that words were represented very differently in question and answer — so differently that they expect designing a system to successfully select an answer out of a Wikipedia article will prove to be a challenge.
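To make the matching difficulty concrete, here is a minimal Python sketch using the three Arabic forms quoted earlier in this article. It is only an illustration under simple assumptions, not the researchers' method: the helper name `shares_surface_form` is hypothetical, and plain substring matching stands in for whatever string comparison a naive system might use.

```python
# Illustrative sketch only; not taken from the TyDi QA tooling.
# The three forms of "book" are the ones quoted in the article.
singular = "كتاب"   # kitab
dual = "كتابان"     # kitaban
plural = "كتب"      # kutub


def shares_surface_form(a: str, b: str) -> bool:
    """Hypothetical helper: True if one string appears verbatim inside the other."""
    return a in b or b in a


# The dual is built by adding a suffix to the singular, so substring matching works.
print(shares_surface_form(singular, dual))

# The plural changes the word internally, so substring matching fails.
print(shares_surface_form(singular, plural))
```

The second comparison fails even though both words refer to books, which is the kind of mismatch between question and answer wording that simple string matching cannot bridge.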

To track the community’s progress, they’ve established a leaderboard where participants can evaluate the quality of their machine learning systems. “It is our hope that this dataset will push the research community to innovate in ways that will create more helpful question-answering systems for users around the world,” wrote Clark.

VentureBeat
