Google today open-sourced a machine learning model that can point to answers to natural language questions (for example, “Which wrestler had the most number of reigns?”) in spreadsheets and databases. The model’s creators claim it’s capable of finding even answers spread across cells or that might require aggregating multiple cells.
Much of the world’s information is stored in the form of tables, points out Google Research’s Thomas Müller in a blog post, like global financial statistics and sports results. But these tables often lack an intuitive way to sift through them — a problem Google’s AI model aims to fix.
To process questions like “Average time as champion for top 2 wrestlers?,” the model jointly encodes the question as well as the table content row by row. It leverages a Transformer-based BERT architecture — an architecture that’s both bidirectional (allowing it to access content from past and future directions) and unsupervised (meaning it can ingest data that’s neither classified or labeled) — extended along with numerical representations called embeddings to encode the table structure.
A key addition was the embeddings used to encode the structured input, according to Müller. Learned embeddings for the column index, the row index, and one special rank index indicate to the model the order of elements in numerical columns.
For each table cell, the model generates a score indicating the probability that the cell will be part of the answer. In addition, it outputs an operation (e.g., “AVERAGE,” “SUM,” or “COUNT”) indicating which operation (if any) must be applied to produce the final answer.
To pre-train the model, the researchers extracted 6.2 million table-text pairs from English Wikipedia, which served as a training data set. During pre-training, with relatively high accuracy, the model learned to restore words in both tables and text that had been removed — 71.4% of items were restored correctly for tables unseen during training.
After pre-training, Müller and team fine-tuned the model via weak supervision, using limited sources to provide signals for labeling the training data. They report that the best model outperformed the state-of-the-art for the Sequential Answering Dataset, a Microsoft-created benchmark for exploring the task of answering questions on tables, by 12 points. It also bested the previous top model on Stanford’s WikiTableQuestions, which contains questions and tables sourced from Wikipedia.
“The weak supervision scenario is beneficial because it allows for non-experts to provide the data needed to train the model and takes less time than strong supervision,” said Müller.