1
Generating Music through RNNs + RL
Heewon Hah
2
Generating Music How many of you have made music, whether it be composing professionally or just whistling random notes? As many of you probably know, machine learning can also be used to generate music.
3
Generating Music Examples include generating Celtic folk music and performing Blues improvisation. Google also has an AI that can be used to create compositions.
4
Generating Music with Deep Neural Networks
When a DNN is used to generate music, we typically train a Recurrent Neural Network (RNN) to learn to predict a sequence of notes: given a melody so far, it predicts the next note in a monophonic melody. This model is called the Note RNN, and it is often implemented using a Long Short-Term Memory (LSTM) network.
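As a rough sketch, a Note RNN of this kind could look like the following Keras model. The vocabulary size, embedding width, and optimizer are illustrative assumptions, not the paper's exact setup; only the single recurrent layer mirrors the architecture described later in the talk.

```python
import tensorflow as tf

NUM_NOTES = 38  # hypothetical vocabulary of note events (pitches, rest, hold)

# A single LSTM layer followed by a softmax over the next note.
note_rnn = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=NUM_NOTES, output_dim=64),  # note id -> dense vector
    tf.keras.layers.LSTM(100, return_sequences=True),               # one recurrent layer
    tf.keras.layers.Dense(NUM_NOTES),                               # logits for the next note
])
note_rnn.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```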
5
LSTM Networks in which each recurrent cell learns to control the storage of information through the use of an input gate, an output gate, and a forget gate. The input and output gates control whether information is able to flow into and out of the cell; the forget gate controls whether the contents of the cell should be reset. By using these three gates, LSTMs are good at learning long-term dependencies in data and can adapt quickly to new data.
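For intuition, here is a minimal NumPy sketch of a single LSTM step showing how the three gates control the cell; parameter names and shapes are illustrative, not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W (4H x D), U (4H x H), and b (4H,) hold the stacked
    parameters for the input, forget, and output gates plus the candidate update."""
    z = W @ x + U @ h_prev + b                     # all gate pre-activations at once
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
    g = np.tanh(g)                                 # candidate cell contents
    c = f * c_prev + i * g                         # forget gate resets; input gate admits new info
    h = o * np.tanh(c)                             # output gate controls what leaves the cell
    return h, c
```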
6
Probability and Training
A softmax function is applied to the final outputs of the network to obtain the probability the network places on each note. The LSTM can then be trained with a softmax cross-entropy loss using backpropagation through time (BPTT).
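A minimal sketch of the softmax over note logits and the cross-entropy loss on the true next note (plain NumPy, illustrative only):

```python
import numpy as np

def softmax(logits):
    """Turn the network's final outputs into a probability over notes."""
    z = logits - logits.max()       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, target_note):
    """Softmax cross-entropy loss for the true next note in the melody."""
    return -np.log(softmax(logits)[target_note])
```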
7
Melody Generation To generate melodies from this model, prime it with a short sequence of notes.
At each time step, choose the next note by sampling from the output distribution given by the softmax layer, then feed the sampled note back into the network as the input for the next time step.
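A sketch of this sampling loop, assuming a hypothetical `note_rnn` callable that returns logits over the next note given the melody so far, and reusing the `softmax` helper above:

```python
import numpy as np

def generate_melody(note_rnn, primer, length=64, rng=None):
    """Prime the model with `primer`, then repeatedly sample the next note
    and feed it back in as the next input."""
    rng = rng or np.random.default_rng()
    melody = list(primer)
    for _ in range(length):
        probs = softmax(note_rnn(melody))              # distribution from the softmax layer
        next_note = rng.choice(len(probs), p=probs)    # sample the next note
        melody.append(int(next_note))                  # sampled note becomes the next input
    return melody
```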
8
Problem Melodies tend to wander and lack musical structure.
Notes repeat excessively.
9
Solution Reinforcement learning (RL), specifically deep Q-learning
Use reinforcement learning (RL) to impose structure on the trained LSTM.
10
RL Basics An agent interacts with an environment
State, s - what is currently happening; the current scenario. Action, a - how the agent affects the environment in a way that changes the state. Policy, π(a|s) - a function that determines the agent's action, i.e. the probability P[a|s] of taking action a in state s. Reward, r(s,a) - a scalar feedback signal at time t indicating how well the agent is doing. Given the state, the agent takes an action according to the policy and receives a reward; the environment then transitions to a new state according to the state transition probability.
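A minimal sketch of this interaction loop, assuming hypothetical `env` and `policy` objects (these interfaces are illustrative, not from the paper):

```python
def run_episode(env, policy):
    """Generic agent-environment loop. env.reset() -> state,
    env.step(a) -> (state, reward, done), policy.sample(s) -> action ~ pi(a | s)."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy.sample(state)             # a ~ pi(a | s)
        state, reward, done = env.step(action)    # environment moves to a new state
        total_reward += reward                    # accumulate the scalar feedback r(s, a)
    return total_reward
```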
11
RL Goal Maximize reward over a sequence of actions. These long-term rewards are modeled as the return, G_t, a sum of rewards in which future rewards are weighted by a discount factor.
The discount factor, γ, allows modeling uncertainty about the future. The Q function of a policy is an evaluation function of the return, and the optimal deterministic policy π* satisfies the Bellman optimality equation, written out below.
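In standard notation, the return, the Q function of a policy, and the Bellman optimality equation are:

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}
\qquad
Q^{\pi}(s, a) = \mathbb{E}\left[ G_t \mid s_t = s,\ a_t = a,\ \pi \right]
\qquad
Q^{*}(s, a) = r(s, a) + \gamma \,\mathbb{E}_{s'}\!\left[ \max_{a'} Q^{*}(s', a') \right]
```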
12
Q-learning Off-policy control
Q-learning explores possible policies to update the Q function without making assumptions about the behavior policy. It learns the optimal Q function by iteratively minimizing the Bellman residual. The Bellman optimality equation is the one shown on the last slide, and the Bellman residual is the difference between its two sides; minimizing that difference drives the two sides toward equality.
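A sketch of the tabular Q-learning update, which nudges Q(s, a) toward the Bellman target and thereby shrinks the Bellman residual. Here `Q` is a hypothetical nested dict of Q-values, and `alpha` and `gamma` are illustrative values.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Tabular Q-learning step on Q[state][action]."""
    target = r + gamma * max(Q[s_next].values())   # right-hand side of the Bellman equation
    Q[s][a] += alpha * (target - Q[s][a])          # shrink the Bellman residual (target - Q(s, a))
```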
13
Deep Q-learning uses a deep Q-network (DQN), a neural network, to approximate the Q function, Q(s, a; θ). The parameters θ (the trainable weights of the network) are learned by applying stochastic gradient descent updates with respect to the loss function. An ε-greedy method is used for exploration (the exploration policy β), to gather more experience and improve the Q estimate. θ⁻ denotes the parameters of the target Q-network, which are held fixed during the gradient computation.
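A sketch of the DQN loss and ε-greedy action selection, assuming hypothetical `q_net` and `target_net` callables that map a state to a vector of Q-values; the batching and autodiff machinery of an actual implementation are omitted.

```python
import numpy as np

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Mean squared Bellman error over a batch of (s, a, r, s') transitions.
    The target network's parameters (theta minus) are held fixed while this
    loss is differentiated with respect to theta."""
    errors = []
    for s, a, r, s_next in batch:
        target = r + gamma * np.max(target_net(s_next))  # bootstrapped target, no gradient
        errors.append((target - q_net(s)[a]) ** 2)
    return np.mean(errors)

def epsilon_greedy(q_values, epsilon, rng=None):
    """With probability epsilon take a random action (explore), else the greedy one."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```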
14
Model Design Now, going back to our model for generating music, we refine the Note RNN using RL. The trained Note RNN supplies the initial weights for the three networks in our model: the Q network and the target Q network, which are part of the DQN mentioned on the last slide, and the Reward RNN. The state consists of the notes placed in the composition so far plus the internal state of the LSTM cells of both the Q network and the Reward RNN. An action places the next note in the composition. The reward is made up of two parts: the Reward RNN computes the log probability of a note (action) given the composition (state), log p(a|s), and a set of music-theory-based rules imposes constraints on the melody through a reward signal r_MT. The total reward is the sum of these two parts, with a constant used to control the emphasis placed on the music-theory reward.
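A sketch of that combined reward, assuming one common form in which the music-theory term is divided by the constant; the value of c is illustrative, and the softmax helper from the earlier sketch is reused.

```python
import numpy as np

def total_reward(reward_rnn_logits, action, r_music_theory, c=0.5):
    """Combined reward: log probability of the chosen note under the Reward RNN
    plus the music-theory reward, balanced by a constant c.
    c = 0.5 is an illustrative value, not the one used in the paper."""
    log_p = np.log(softmax(reward_rnn_logits)[action])  # log p(a | s)
    return log_p + r_music_theory / c
```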
15
Loss Function and Learned Policy Updates
16
Experiment Training the Note RNN, training the RL RNN model, and model evaluation.
Training the Note RNN: monophonic melodies were extracted from a corpus of 30,000 MIDI songs; the architecture consisted of 1 LSTM layer of 100 cells; the network was trained for 30,000 iterations with a batch size of 128, reaching a validation accuracy of 92% and a log perplexity score of 0.2536. Training the RL RNN model: 3,000,000 iterations with a batch size of 32. Model evaluation: every 100,000 training epochs, 100 compositions were generated and assessed on average r_MT and log p(a|s).
17
Results Quantitative results based on the music-theory rules the model was trained on. These are statistics over 100,000 compositions randomly generated from the model using just the Note RNN and from our model trained with deep Q-learning and RL.
18
Results Now, we want to see whether our model also retained information about the training data. This figure shows the average log probability, as output by the Reward RNN, for compositions generated by the models every 100,000 training epochs. Ignore the red and green lines; the dark blue line is our model, and the light blue line is an RL-only model trained using only the music-theory rewards, with no information about the log probability. This acts as the baseline and essentially represents a random policy with respect to the distribution defined by the Note RNN.
19
Conclusion A combination of ML and RL can correct unwanted behavior generated by RNNs by imposing constraints, while still retaining information learned from the training data.
Here, we used this combination to generate more pleasing melodies, but the approach of using RL to fine-tune RNN models could be used in other applications as well, e.g. text generation or automatic question answering.
20
Reference Douglas Eck, Shixiang Gu, Natasha Jaques, and Richard E. Turner. Generating Music by Fine-Tuning Recurrent Neural Networks with Reinforcement Learning.