Scaling Up Vision-Language Pre-training for Image Captioning
The recent advances in neural language models have also been successfully
applied to the field of chemistry, offering generative solutions for classical
problems in molecular design and synthesis planning. These new methods have the
potential to fuel a new era of data-driven automation in scientific discovery.
However, specialized models are still typically required for each task, which
demands problem-specific fine-tuning and neglects task
interrelations. The main obstacle in this field is the lack of a shared
representation for natural language and chemical structures, which
complicates and limits human-machine interaction. Here, we propose the first
multi-domain, multi-task language model that can solve a wide range of tasks in
both the chemical and natural language domains. Our model can handle chemical
and natural language concurrently, without requiring expensive pre-training on
single domains or task-specific models. Interestingly, sharing weights across
domains remarkably improves our model when benchmarked against state-of-the-art
baselines on single-domain and cross-domain tasks. In particular, sharing
information across domains and tasks gives rise to large improvements in
cross-domain tasks, the magnitude of which increases with scale, as measured by
more than a dozen relevant metrics. Our work suggests that such models can
robustly and efficiently accelerate discovery in physical sciences by
superseding problem-specific fine-tuning and enhancing human-model
interactions.