Allen Institute launches GENIE, a leaderboard for human-in-the-loop language model benchmarking

January 20, 2021

Recent years have seen an explosion of natural language processing (NLP) datasets aimed at testing various AI capabilities. Many of these datasets have accompanying leaderboards, which provide a means of ranking and comparing models. But the adoption of leaderboards has thus far been limited to setups with automatic evaluation, like classification and knowledge retrieval. Open-ended tasks requiring natural language generation, such as language translation, often have many correct solutions, and they lack techniques that can reliably evaluate a model’s quality automatically.

To remedy this, researchers at the Allen Institute for Artificial Intelligence, the Hebrew University of Jerusalem, and the University of Washington created GENIE, a leaderboard for human-in-the-loop evaluation of text generation. GENIE posts model predictions to a crowdsourcing platform (Amazon Mechanical Turk), where human annotators evaluate them according to predefined, dataset-specific guidelines for fluency, correctness, conciseness, and more. In addition, GENIE incorporates various automatic metrics for machine translation, question answering, summarization, and common-sense reasoning, including BLEU and ROUGE, to show how well they correlate with the human assessment scores.
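
To make the idea of correlating automatic metrics with human scores concrete, here is a minimal Python sketch. It is not GENIE's code: the article does not say which correlation statistic the leaderboard reports, so Pearson correlation is assumed here, and the function name `pearson` and both score lists are hypothetical illustration data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (covariance, up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-system scores (made up for illustration):
# one automatic metric and the mean human rating for the same four systems.
metric_scores = [0.21, 0.35, 0.28, 0.40]
human_scores = [2.9, 3.8, 3.1, 4.2]

r = pearson(metric_scores, human_scores)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 would suggest that the automatic metric ranks systems much as human annotators do, while a value near 0 would suggest the metric says little about human judgments.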

As the researchers note, human-evaluation leaderboards raise a couple of novel challenges, first and foremost potentially high crowdsourcing fees. To avoid deterring submissions from researchers with limited resources, GENIE aims to keep submission costs around $100, with initial submissions to be paid by academic groups. In the future, the coauthors plan to explore other payment models, including requesting payment from tech companies while subsidizing the cost for smaller organizations.

Evaluating generated text is hard! No automated method so far is up to the challenge.
Today we’re announcing GENIE🧞‍♂️, a human-in-the-loop leaderboard for streamlining text evaluation.
Learn more in today’s post on the AI2 Blog from @DanielKhashabi: https://t.co/I8v5egi9J7
— Allen Institute for AI (@allen_ai) January 19, 2021

To mitigate another potential issue — the reproducibility of human annotations over time across various annotators — the researchers use techniques including estimating annotator variance and spreading the annotations over several days. Experiments show that GENIE achieves “reliable scores” on the included tasks, they claim.
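
The two techniques mentioned above can be illustrated with a short Python sketch. It rests on simplifying assumptions: the article does not describe GENIE's actual estimator or scheduling, so the variance of per-annotator mean ratings stands in for "annotator variance" and a round-robin split stands in for spreading work across days. All names and ratings are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Hypothetical (annotator_id, item_id, rating) records on a 1-5 scale.
ratings = [
    ("a1", "x", 4), ("a1", "y", 5), ("a1", "z", 4),
    ("a2", "x", 3), ("a2", "y", 4), ("a2", "z", 3),
    ("a3", "x", 4), ("a3", "y", 4), ("a3", "z", 5),
]


def annotator_means(records):
    """Mean rating given by each annotator."""
    by_annotator = defaultdict(list)
    for annotator, _item, rating in records:
        by_annotator[annotator].append(rating)
    return {a: mean(rs) for a, rs in by_annotator.items()}


def between_annotator_variance(records):
    """Population variance of the per-annotator means (assumed proxy)."""
    return pvariance(list(annotator_means(records).values()))


def spread_over_days(item_ids, num_days):
    """Assign items to days round-robin so each day gets a similar share."""
    schedule = defaultdict(list)
    for i, item in enumerate(item_ids):
        schedule[i % num_days].append(item)
    return dict(schedule)


variance = between_annotator_variance(ratings)
schedule = spread_over_days(["x", "y", "z", "w"], num_days=2)
print(variance, schedule)
```

A large variance between annotators would indicate that scores depend heavily on who happens to rate a submission, which is the kind of effect that collecting annotations on different days from different workers is meant to average out.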

“[GENIE] standardizes high-quality human evaluation of generative tasks, which is currently done in a case-by-case manner with model developers using hard-to-compare approaches,” Daniel Khashabi, a lead developer on the GENIE project, explained in a Medium post. “It frees model developers from the burden of designing, building, and running crowdsourced human model evaluations. [It also] provides researchers interested in either human-computer interaction for human evaluation or in automatic metric creation with a central, updating hub of model submissions and associated human-annotated evaluations.”

The coauthors believe that the GENIE infrastructure, if widely adopted, could alleviate the evaluation burden for researchers while ensuring high-quality, standardized comparison against previous models. Moreover, they anticipate that GENIE will facilitate the study of human evaluation approaches, addressing challenges like annotator training, inter-annotator agreement, and reproducibility — all of which could be integrated into GENIE to compare against other evaluation metrics on past and future submissions.

“We make GENIE publicly available and hope that it will spur progress in language generation models as well as their automatic and manual evaluation,” the coauthors wrote in a paper describing their work. “This is a novel deviation from how text generation is currently evaluated, and we hope that GENIE contributes to further development of natural language generation technology.”

VentureBeat
