Medical Report Generation from Jointly Learned Medical Image and Text Representations

Published in Novartis NIBR internal publications, 2019

Recommended citation: Filipavicius, Modestas et al. (2019). "Medical Report Generation from Jointly Learned Medical Image and Text Representations." NIBR Publications, 12(2). http://mfilipav.github.io/files/2019cxr.pdf

Hierarchical LSTM Co-Attention model architecture:

[Figure: model architecture diagram]

  • Encoding begins with VGG-19 CNN visual features extracted from image patches; these are fed into a multi-label classifier to predict disease tags, which are represented as 512-dimensional word2vec embedding (semantic feature) vectors.

  • The tag embeddings are then fed, together with the visual features, into a co-attention module that generates a context vector attending to both sources.

  • The decoder takes the context vector and produces sentences hierarchically: the context vector is fed into a sentence LSTM, which unrolls for a few steps and produces a 512-dim topic vector at each step.

  • A word LSTM then generates the words of each sentence from its topic vector (a minimal PyTorch sketch of this pipeline follows the list).
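A minimal PyTorch sketch of the encoder side described above, assuming a 224x224 input, mean-pooled patch features for the multi-label classifier, and a learned tag-embedding table standing in for the word2vec vectors. The class name, tag count, and top-k tag selection are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class VisualTagEncoder(nn.Module):
    """VGG-19 patch features -> multi-label disease tags -> 512-dim tag embeddings."""

    def __init__(self, n_tags=210, embed_dim=512, top_k=10):
        super().__init__()
        vgg = models.vgg19()                       # load ImageNet weights in practice
        self.backbone = vgg.features               # conv stack: (B, 512, 7, 7) for 224x224 input
        self.mlc = nn.Linear(512, n_tags)          # multi-label classifier over disease tags
        self.tag_embed = nn.Embedding(n_tags, embed_dim)  # stands in for word2vec tag vectors
        self.top_k = top_k

    def forward(self, images):                     # images: (B, 3, 224, 224)
        fmap = self.backbone(images)               # (B, 512, 7, 7)
        B, C, H, W = fmap.shape
        visual = fmap.view(B, C, H * W).permute(0, 2, 1)   # (B, 49, 512) patch features
        tag_logits = self.mlc(visual.mean(dim=1))          # pool patches -> (B, n_tags)
        top_tags = tag_logits.topk(self.top_k, dim=-1).indices
        semantic = self.tag_embed(top_tags)                # (B, top_k, 512) semantic features
        return visual, semantic, tag_logits
```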
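And a sketch of the decoder side, assuming a simplified additive co-attention driven by the sentence-LSTM hidden state and greedy word decoding; the exact attention formulation, the sentence-LSTM stop control, and teacher forcing used by Jing et al. (2018) are omitted here.

```python
import torch
import torch.nn as nn


class CoAttention(nn.Module):
    """Attends over visual (B, Nv, D) and semantic (B, Ns, D) features,
    conditioned on the sentence-LSTM hidden state; returns a joint context vector."""

    def __init__(self, dim=512):
        super().__init__()
        self.w_v, self.w_s, self.w_h = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.score_v, self.score_s = nn.Linear(dim, 1), nn.Linear(dim, 1)

    def _attend(self, feats, proj, score, hidden):
        e = score(torch.tanh(proj(feats) + self.w_h(hidden).unsqueeze(1)))  # (B, N, 1)
        return (torch.softmax(e, dim=1) * feats).sum(dim=1)                 # (B, dim)

    def forward(self, visual, semantic, hidden):
        v_ctx = self._attend(visual, self.w_v, self.score_v, hidden)
        s_ctx = self._attend(semantic, self.w_s, self.score_s, hidden)
        return torch.cat([v_ctx, s_ctx], dim=-1)                            # (B, 2 * dim)


class HierarchicalDecoder(nn.Module):
    """Sentence LSTM emits one 512-dim topic vector per step; a word LSTM
    then generates that sentence's words from the topic vector."""

    def __init__(self, dim=512, vocab_size=2000, max_sents=6, max_words=15):
        super().__init__()
        self.co_attn = CoAttention(dim)
        self.sent_lstm = nn.LSTMCell(2 * dim, dim)
        self.to_topic = nn.Linear(dim, dim)
        self.word_lstm = nn.LSTM(dim, dim, batch_first=True)
        self.word_out = nn.Linear(dim, vocab_size)
        self.dim, self.max_sents, self.max_words = dim, max_sents, max_words

    def forward(self, visual, semantic):
        B = visual.size(0)
        h = visual.new_zeros(B, self.dim)
        c = visual.new_zeros(B, self.dim)
        sentences = []
        for _ in range(self.max_sents):
            ctx = self.co_attn(visual, semantic, h)          # (B, 2 * dim) joint context
            h, c = self.sent_lstm(ctx, (h, c))               # unroll sentence LSTM one step
            topic = self.to_topic(h)                         # (B, dim) topic vector
            # Greedy word generation conditioned on the topic (teacher forcing omitted).
            word_h, _ = self.word_lstm(topic.unsqueeze(1).repeat(1, self.max_words, 1))
            sentences.append(self.word_out(word_h).argmax(dim=-1))   # (B, max_words) token ids
        return torch.stack(sentences, dim=1)                 # (B, max_sents, max_words)
```

With the encoder sketch above, `visual, semantic, _ = VisualTagEncoder()(images)` followed by `HierarchicalDecoder()(visual, semantic)` would yield token indices for up to `max_sents` sentences per image.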

Abstract

This study explored multimodal deep learning applications for medical images and text, in particular automatic report generation from chest X-ray images, which are used daily in hospitals to diagnose chest diseases. Reading the images and writing reports requires considerable training and time investment. We speculate that, to match physician-level disease understanding, the representations learned in an unsupervised manner for images and text should be jointly embedded into the same vector space. This would allow cross-modal retrieval, i.e., asking the model to generate a paragraph with Findings and Impressions sections for a given image.

To learn the necessary skill set and gain first-hand experience with training a multimodal encoder-decoder architecture and its computational resource requirements, we first re-implemented several automatic image captioning models based on the Microsoft COCO dataset. Next, we attempted to address issues specific to medical paragraph generation, namely generating several sentences rather than a single one and associating each word with the relevant image patch.

We focused on an image-encoder, text-decoder architectural variant called the Hierarchical LSTM Co-Attention model by Jing et al. (2018) and implemented this closed-source paper in Python and PyTorch. Although our implementation could not reproduce their results, we are encouraged to have successfully applied the earlier encoder-decoder captioning model (with attention) to the Open-I dataset. We hope this exploratory study will encourage future research into generating representations with multimodal learning in a pharmaceutical setting.

Download paper here