JS Fake Chorales: a Synthetic Dataset of Polyphonic Music with Human Annotation
High-quality datasets for learning-based modelling of polyphonic symbolic
music remain less readily accessible at scale than in other domains, such as
language modelling or image classification. Deep learning algorithms show great
potential for enabling the widespread use of interactive music generation
technology in consumer applications, but the lack of large-scale datasets
remains a bottleneck for the development of algorithms that can consistently
generate high-quality outputs. We propose that models with narrow expertise can
serve as a source of high-quality, scalable synthetic data, and open-source the
JS Fake Chorales, a dataset of 500 pieces generated by a new learning-based
algorithm, provided in MIDI form.
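Because the dataset is distributed as standard MIDI files, it can be inspected
with any general-purpose MIDI library. The following minimal sketch uses the
pretty_midi package; the local directory path js-fakes/midi/ is a hypothetical
placeholder for wherever the dataset is unpacked.

```python
import glob

import pretty_midi  # pip install pretty_midi

# Hypothetical local path to the unpacked dataset; adjust to your checkout.
paths = sorted(glob.glob("js-fakes/midi/*.mid"))

pieces = []
for path in paths:
    midi = pretty_midi.PrettyMIDI(path)
    # Collect (onset time, MIDI pitch) pairs across all voices of the chorale.
    notes = [(note.start, note.pitch)
             for instrument in midi.instruments
             for note in instrument.notes]
    pieces.append(sorted(notes))

print(f"Loaded {len(pieces)} pieces; first piece has {len(pieces[0])} notes")
```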
We take consecutive outputs from the algorithm, avoiding cherry-picking, in
order to validate the potential to scale this dataset further on demand. We
conduct an online experiment for human evaluation, designed to be as fair to
the listener as possible, and find that respondents were on average only 7%
better than random guessing at distinguishing JS Fake Chorales from real
chorales composed by JS Bach. Furthermore, we make the anonymised data
collected from these experiments available along with the MIDI samples.
Finally, we conduct
ablation studies to demonstrate the effectiveness of using the synthetic pieces
for research in polyphonic music modelling, and find that we can improve on
the state-of-the-art validation-set loss for the canonical JSB Chorales
dataset with a known algorithm, simply by augmenting the training set with the
JS Fake Chorales.
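As a rough illustration of this augmentation, the sketch below combines a real
JSB Chorales training split with the synthetic pieces before training. The
loader names (load_jsb_chorales, load_js_fakes) and the tokenised-piece
representation are hypothetical, since the abstract does not fix a specific
model or data pipeline; the key point is that only the training split is
augmented, so validation loss remains comparable to prior work.

```python
import random

def augment_training_set(jsb_train, js_fakes, seed=0):
    """Return an augmented training set: real JSB Chorales training pieces
    plus synthetic JS Fake Chorales. Each element is assumed to be one
    tokenised piece; real and synthetic pieces are shuffled together."""
    augmented = list(jsb_train) + list(js_fakes)
    random.Random(seed).shuffle(augmented)
    return augmented

# Hypothetical loaders; the validation and test splits stay purely
# JSB Chorales so that reported losses remain comparable.
# jsb_train, jsb_valid, jsb_test = load_jsb_chorales()
# js_fakes = load_js_fakes()
# train_set = augment_training_set(jsb_train, js_fakes)
```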