出处:2018 AAAI

SourceCode:https://github.com/salu133445/musegan

abstract

(写得不错 值得借鉴)重点阐述了生成音乐和生成图片,视频及语音的不同。首先音乐是基于时间序列的;其次音符在和弦、琶音(arpeggios)、旋律、复音等规则的控制之下的;同时一首歌曲是多track的。总之不能简单堆叠音符。本文基于GAN提出了三种模型来生成音乐:jamming model, the composer model and the hybrid model。作者从摇滚音乐中挑选出了10万个bar来进行训练,生成5个轨道的piano-rolls:bass, drums, guitar, piano and strings。同时作者使用了一些intra-track and inter-track objective metrics来衡量生成的音乐质量(?)。

Introduction:

GAN在文字,图片,视频上取得了巨大的成就,音乐方面也有些进展,但问题在于:

(1)音乐有自己的基于时间的架构,如下图所示:

(2)音乐是多轨道/多乐器的

现代管弦乐(orchestra)常常有4个部分:brass, strings, woodwinds and percussion,摇滚乐队常用的是bass, a drum set, guitars and possibly a vocal,音乐理论要求这些元素按时间展开后harmony并且counterpoint.

(3)musical notes are often grouped into chords,arpeggios or melodies.所以,单音(monophonic)的音乐和NLP的生成都不能直接被引入来生成复调(polyphonic)的音乐。

由于上述三个问题,许多已有工作做了一些简化的处理方式,生成单轨单音音乐,introducing a chronological ordering of notes for polyphonic music,组合单音音乐变成复音音乐等。作者的目标是摒弃这些简化手法,1) harmonic and rhythmic structure, 2) multi-track interdependency, and 3) temporal structure。该模型能够产生音乐from scratch (i.e. without human inputs),也能follow the underlying temporal structure of a track given a priori by human.作者提出了三种方式来处理track之间的交互

(1)每个track独立生成 one generates tracks independently by their private generators (one for each)

(2)所有track由一个生成器生成 another generates all tracks jointly with only one generator

(3)在(1)的基础上,每个track生成时有额外的input信息,以保证harmonious and coordinated

为了突出group的性质,作者关注bars(这点参考了[1]),而不是notes,并使用CNN来提取隐藏特征。

除了刚才提到的测量标准,最后居然找了144个路人来对生成音乐进行评测。

contribution:

接下来介绍了GAN 和WGAN,WGAN-GP并最终选用WGAN-GP。

Proposed Model:

这里再次强调了关注的是Bar[1],并列举了一些理由。

  • 数据表示

使用了multiple-track piano-roll representation表示方式,a piano-roll representation is a binary-valued, scoresheet-like matrix representing the presence of notes over different time steps, and a multiple-track piano-roll is defined as a set of piano-rolls of different tracks。一个有M个track,每个trank有R个time_step,候选bar数量为S的bar记录为X,其数据形式为$X^{RxSxM}$,T个bar则被表示为 $\{X^{t}\}_{t=1}^{T}$。因此每个X的矩阵大小是固定的,有利于CNN训练特征

  • 构建Tranck间的相关性(Interdependency)

提出了三种谱曲方式

Jamming Model--每一个Track拥有自己的一组G和D,及独立的隐空间变量Zi。

Composer Model -- 全局一组G和D,公用Z来生成所有的Track

Hybrid Model -- 混合上面两种模式,每个track一个Gi接受独立的Zi(intra-track random vector)及全局的Z(inter-track random vector)共同组合成的输入向量,同时公用一个D来生成track。与Composer Model相比,混合模式更加灵活,可以在G模型中使用不同的参数(如层数,卷积核大小等),将音轨的独立生成和全局和谐结合起来。

  • 构建时序相关性(Temporal Structure)

上面提到的结构目的在于怎样在不同音轨中生成单个的bar,bar与bar之间的时序关联需要其他的结构来补充生成。作者采用了两种方式:

Generation from Scratch -- 将G分为两个sub network:$G_{temp}$和$G_{bar}$,$G_{temp}$将z映射成一个隐空间向量的序列,作者希望它能承载一些时序信息,随后被送入$G_{bar}$,序列化地生成piano-rolls。

Track-conditional Generation--这种方式假定了各个track的n个bar已经被给定了,即为,这里添加了一个编码器E,负责将映射为(这个也是从[1]里参考得来的)

  • MuseGAN

模型的输入由4部分构成:

an inter-track time-dependent random vectors $z_t$        轨道间全局  时间相关 向量

an inter-track time-independent random vectors z            轨道间全局 时间无关 向量

an intra-track time-independent random vectors $z_i$      轨道内单独 时间无关 向量

an intra-track time-dependent random vectors $z_{i,t}$    轨道内单独  时间相关 向量

从该生成公式上可以清楚地看出,各轨道间的输入变量(分为时间相关和无关)和全局输入变量(分为时间相关和无关)如何结合起来,形成MuseGan生成系统

Dataset

MuseGAN的piano-roll训练数据是基于Lakh MIDI dataset (LMD)[3],原数据集噪声很大,使用了三步来做清理(如下图),midi解析使用了pretty midi[2]

要注意的是,(1)一些track上的note非常稀疏,这里作者对这种不平衡数据做了merge操作(merging tracks of similar instruments by summing their piano-rolls,具体可能需要看代码),对于非bass, drums, guitar, piano and strings 这5类的track统一归纳到string上去.[5,6]中对track类型进行了较好的数据预归类;(2)选取piano-roll时选取higher confidence score in matching的,rock标签,4/4拍;(3)piano-roll的segment采用了state_of_art的方式structural features[7],每4个小节为一个phrase。Notably, although we use our models to generate fixed-length segments only, the track-conditional model is able to generate music of any length according to the input.(4)音域使用C1到C8(钢琴最右键)。

最终输出一首歌的tensor为:4 (bar) × 96 (time step) × 84 (note) × 5 (track)

模型设置:

根据WGAN的理论,update G once every five updates of D and apply batch normalization only to G。其余略。

Objective Metrics for Evaluation:

使用了4个intra-track和1个inter-track(最后一个)度量标准

  • EB: ratio of empty bars (in %)
  • UPC: number of used pitch classes per bar (from 0 to 12)
  • QN: ratio of “qualified” notes (in %) 一个长度不少于3个time_step的音符被认为是qualified的。这个指标可以衡量是否生成的音乐是否过于碎片化。
  • DP, or drum pattern: ratio of notes in 8- or 16-beat patterns, common ones for Rock songs in 4/4 time (in %).
  • TD: or tonal distance [8]. It measures the hamornicity between a pair of tracks. Larger TD implies weaker inter-track harmonic relations.调式距离?

[9]:综述 [10]RNN生成music [11]生成chorales [12]:Song from PI [13]:C-RNN-GAN [14]:seqGAN(combined GANs and reinforcement learning to gen sequences of discrete tokens. It has been applied to generate monophonic music, using the note event representation) [15]:midi_net(convolutional GANs to generate melodies that follows a chord sequence given a priori, either from scratch or conditioned on the melody of previous bars)

[1]Yang, L.-C.; Chou, S.-Y.; and Yang, Y.-H. 2017. MidiNet: A convolutional generative adversarial network for symbolicdomain music generation. In ISMIR.

[2]Raffel, C., and Ellis, D. P. W. 2014. Intuitive analysis, creation and manipulation of MIDI data with pretty midi. In ISMIR Late Breaking and Demo Papers.
[3]Raffel, C., and Ellis, D. P. W. 2016. Extracting ground truth information from MIDI files: A MIDIfesto. In ISMIR.
[4]Raffel, C. 2016. Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching. Ph.D. Dissertation, Columbia University.

[5]Chu, H.; Urtasun, R.; and Fidler, S. 2017. Song from PI: A musically plausible network for pop music generation. In ICLR Workshop.

[6]Yang, L.-C.; Chou, S.-Y.; and Yang, Y.-H. 2017. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. In ISMIR.

[7]Serrà, J.; Mller, M.; Grosche, P.; and Arcos, J. L. 2012. Unsupervised detection of music boundaries by time series structure features. In AAAI.

[8]Harte, C.; Sandler, M.; and Gasser, M. 2006. Detecting harmonic change in musical audio. In ACM MM workshop on Audio and music computing multimedia.

[9]Briot, J.-P.; Hadjeres, G.; and Pachet, F. 2017. Deep learning techniques for music generation: A survey. arXiv preprint arXiv:1709.01620.

[10] Sturm, B. L.; Santos, J. F.; Ben-Tal, O.; and Korshunova, I. 2016. Music transcription modelling and composition using deep learning. In Conference on Computer Simulation of Musical Creativity.

[11]Hadjeres, G.; Pachet, F.; and Nielsen, F. 2017. DeepBach:A steerable model for Bach chorales generation. In ICML.

[12]Chu, H.; Urtasun, R.; and Fidler, S. 2017. Song from PI: A musically plausible network for pop music generation. In ICLR Workshop.

[13]Mogren, O. 2016. C-RNN-GAN: Continuous recurrent neural networks with adversarial training. In NIPS Worshop on Constructive Machine Learning Workshop.

[14]Yu, L.; Zhang, W.; Wang, J.; and Yu, Y. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI.

[15]Yang, L.-C.; Chou, S.-Y.; and Yang, Y.-H. 2017. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. In ISMIR.

《MuseGAN: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment》论文阅读笔记的更多相关文章

  1. (转)Introductory guide to Generative Adversarial Networks (GANs) and their promise!

    Introductory guide to Generative Adversarial Networks (GANs) and their promise! Introduction Neural ...

  2. 生成对抗网络(Generative Adversarial Networks,GAN)初探

    1. 从纳什均衡(Nash equilibrium)说起 我们先来看看纳什均衡的经济学定义: 所谓纳什均衡,指的是参与人的这样一种策略组合,在该策略组合上,任何参与人单独改变策略都不会得到好处.换句话 ...

  3. 生成对抗网络(Generative Adversarial Networks, GAN)

      生成对抗网络(Generative Adversarial Networks, GAN)是一种深度学习模型,是近年来复杂分布上无监督学习最具前景的学习方法之一.   GAN 主要包括了两个部分,即 ...

  4. StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks 论文笔记

    StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks  本文将利 ...

  5. 论文笔记之:Semi-Supervised Learning with Generative Adversarial Networks

    Semi-Supervised Learning with Generative Adversarial Networks 引言:本文将产生式对抗网络(GAN)拓展到半监督学习,通过强制判别器来输出类 ...

  6. 《Self-Attention Generative Adversarial Networks》里的注意力计算

    前天看了 criss-cross 里的注意力模型  仔细理解了  在: https://www.cnblogs.com/yjphhw/p/10750797.html 今天又看了一个注意力模型 < ...

  7. Paper Reading: Perceptual Generative Adversarial Networks for Small Object Detection

    Perceptual Generative Adversarial Networks for Small Object Detection 2017-07-11  19:47:46   CVPR 20 ...

  8. SalGAN: Visual saliency prediction with generative adversarial networks

    SalGAN: Visual saliency prediction with generative adversarial networks 2017-03-17 摘要:本文引入了对抗网络的对抗训练 ...

  9. Generative Adversarial Networks,gan论文的畅想

    前天看完Generative Adversarial Networks的论文,不知道有什么用处,总想着机器生成的数据会有机器的局限性,所以百度看了一些别人 的看法和观点,可能我是机器学习小白吧,看完之 ...

随机推荐

  1. 【概率dp】D. Card Collector

    https://www.bnuoj.com/v3/contest_show.php?cid=9147#problem/D [题意] 为了集齐n张卡片,必须要买多少袋零食?题目给定每种卡片出现在零食中的 ...

  2. hdu 4474

    //因为n是小于等于10000可以利用这点进行搜索对n取余则余数为零时就为所找的.因为他的余数肯定小于10000所以不会无休止下去 #include<stdio.h> #include&l ...

  3. jQuery的对象访问函数(get,index,size,each)

    1.get() 元素集合 取得所有匹配的 DOM 元素集合. 这是取得所有匹配元素的一种向后兼容的方式(不同于jQuery对象,而实际上是元素数组). 如果你想要直接操作 DOM 对象而不是 jQue ...

  4. iOS 混合变换旋转 CGAffineTransform 的使用

    在ios 中, Core Graphics 提供了一系列的函数可以在一个变换的基础上做深层次的变换,如果做一个既要缩放又要旋转的变换,以下的方法比较实用. CGAffineTransformScale ...

  5. Django学习之 - 基础视图函数

    视图:Views 获取用户请求的方法: 1: request.GET 2: request.POST 3: request.FILES # checkbox 等多选文件 4:request.POST. ...

  6. BZOJ 2440 莫比乌斯函数+容斥+二分

    2440: [中山市选2011]完全平方数 Time Limit: 10 Sec  Memory Limit: 128 MBSubmit: 5473  Solved: 2679[Submit][Sta ...

  7. FatMouse's Speed--hdu1160(dp+输出路径)

    Problem Description FatMouse believes that the fatter a mouse is, the faster it runs. To disprove th ...

  8. RabbitMQ Hello World

    RabbitMQ Hello World rabbitmq operation: C:\Program Files\RabbitMQ Server\rabbitmq_server-3.7.2\sbin ...

  9. C# 性能优化 之 秒表 Stopwatch。 Dapper一个和petapoco差不多的轻量级ORM框架

    Sweet小马 小马同学的编程日记. C# 性能优化 之 秒表 Stopwatch. 生词解释:Diagnostics[,daɪəg'nɑstɪks] n.诊断学 using System.Diagn ...

  10. Android中View窗口getWidth和getMeasuredWidth的差别

    今天在研究自己定义listview的下拉刷新的效果.想移植到项目需求中,再看自己定义源代码时发现了一个问题就是getWidth和getMeasuredWidth两个方法有什么差别,求教万能的百度,经调 ...