Curating biological studies is an important yet labor-intensive process performed by researchers in the life sciences. Among other tasks, curators must recognize experiment methods, identifying the underlying protocols that produced the figures published in research articles. In other words, “biocurators” need to take figures, captions, and more into consideration and decide how each figure was derived. This requires careful labeling, which doesn’t scale well when the experiments to classify number in the hundreds or thousands.
In search of a solution, researchers at the University of California, Los Angeles; the University of Southern California; Intuit; and the Chan Zuckerberg Initiative developed a dataset called Multimodal Biomedical Experiment Method Classification (“Melinda” for short) containing 5,371 labeled data records, including 2,833 figures from biomedical papers paired with corresponding text captions. The idea was to see whether state-of-the-art machine learning models could curate studies as well as human reviewers by benchmarking those models on Melinda.
Automatically identifying methods in studies poses challenges for AI systems. One is grounding visual concepts to language; most multimodal algorithms rely on object detection modules to ground finer-grained visual and linguistic concepts. However, scientific images often lack ground-truth object annotations, because producing them requires extra effort from experts and is therefore more expensive. This hurts the performance of pretrained detection models, because those labels are how the models learn to make classifications.
In Melinda, each data entry consists of a figure, an associated caption, and an experiment method label from the IntAct database. IntAct stores manually annotated labels for experiment method types in an ontology, paired with figure identifiers and the ID of the paper featuring the figures. The papers (1,497 in total) came from the Open Access PubMed Central, an archive of freely available life sciences journals.
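As a rough illustration of the record structure described above, one entry could be modeled as follows. This is only a sketch: the field names, the `MelindaRecord` class, and the `group_by_paper` helper are hypothetical and are not taken from the dataset's actual schema or release format.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class MelindaRecord:
    """One figure/caption/label entry (hypothetical field names)."""
    paper_id: str      # ID of the paper featuring the figure
    figure_id: str     # identifier of the figure within that paper
    figure_path: str   # assumed location of the figure image on disk
    caption: str       # text caption associated with the figure
    method_label: str  # experiment method label drawn from IntAct


def group_by_paper(records: Iterable[MelindaRecord]) -> Dict[str, List[MelindaRecord]]:
    """Collect records under the paper they came from."""
    grouped: Dict[str, List[MelindaRecord]] = defaultdict(list)
    for record in records:
        grouped[record.paper_id].append(record)
    return dict(grouped)
```

Grouping by paper is one plausible way to keep all figures from the same article together, for instance when splitting data into training and test sets.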
In experiments, the researchers benchmarked several vision, language, and multimodal models against Melinda. Specifically, they looked at unimodal models that take an image (image-only) or a caption (caption-only) as input and multimodal models that take both.
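The three input configurations can be sketched in a few lines. This is a minimal illustration under assumptions: `ModelInput`, `build_input`, and the mode strings are hypothetical names chosen for this example, not part of the researchers' code.

```python
from typing import NamedTuple, Optional


class ModelInput(NamedTuple):
    """What a model is allowed to see for one record (hypothetical type)."""
    image: Optional[str]    # path to the figure, or None if withheld
    caption: Optional[str]  # caption text, or None if withheld


def build_input(figure_path: str, caption: str, mode: str) -> ModelInput:
    """Select the inputs for a unimodal or multimodal setting."""
    if mode == "image-only":
        return ModelInput(image=figure_path, caption=None)
    if mode == "caption-only":
        return ModelInput(image=None, caption=caption)
    if mode == "multimodal":
        return ModelInput(image=figure_path, caption=caption)
    raise ValueError(f"unknown mode: {mode!r}")
```

Holding the records fixed and varying only the mode is what makes the unimodal and multimodal results comparable.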
The results suggest that although the multimodal models generally outperformed the others, there’s room for improvement. The best-performing multimodal model, VL-BERT, achieved between 66.49% and 90.83% accuracy, a far cry from the 100% accuracy of which human reviewers are capable.
The researchers hope the release of Melinda motivates advances in multimodal models, particularly with respect to low-resource domains and reliance on pretrained object detectors. “The Melinda dataset could serve as a good testbed for benchmarking,” they wrote in a paper describing their work. “The recognition [task] is fundamentally multimodal [and challenging], where justification of the experiment methods takes both figures and captions into consideration.”