Separate Training and Validation Data Automatically in Keras with validation_split

Chris Achard

Published 6 years ago
Updated 5 years ago

When training and testing a neural net, it's important to separate training data from validation data, so that you aren't checking the accuracy of the model with the same data that you use to train it. This will help you detect overfitting. We'll use the validation_split parameter when fitting our model to automatically split data up into a training set and a validation set, and use that to check the validation loss of our model.

Instructor: [00:01] We've gotten a low loss in only 100 epochs, but that loss number is calculated with the same data that we're using to train the network, which might mean that we're just overfitting to that data. Keras has a built-in way to split data into training and validation data sets.

[00:16] We'll use it by supplying a validation_split parameter to the fit function. validation_split is a decimal between zero and one, which represents the fraction of the training data to hold out as the validation data set.
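As a minimal sketch of that call: the toy data, the layer sizes, and the epoch count below are assumptions made up for illustration, not the lesson's actual code. Only the validation_split argument to fit is the point here.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Hypothetical toy data: six samples with three features each.
# The target is the mean of each row (an assumed task, for illustration only).
x_train = np.array([
    [1, 2, 3],
    [4, 6, 2],
    [10, 9, 8],
    [5, 5, 5],
    [7, 1, 4],
    [3, 8, 4],
], dtype="float32")
y_train = x_train.mean(axis=1, keepdims=True)

# Assumed architecture, kept small on purpose.
model = Sequential([
    Dense(8, activation="relu", input_shape=(3,)),
    Dense(1),
])
model.compile(optimizer="adam", loss="mean_squared_error")

history = model.fit(
    x_train,
    y_train,
    epochs=5,
    validation_split=0.2,  # hold out 20% of the samples as validation data
    verbose=0,
)

# The history now records a validation loss next to the training loss.
print(sorted(history.history.keys()))
```

The value 0.2 is just one choice; any fraction between zero and one can be passed the same way.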

[00:29] For this example, we'll split into 20 percent validation and 80 percent training. Common values are 0.2 or 0.33, but you can try different values to see what works best for your data set. There's one very important note when using Keras's automatic validation split, though.

[00:48] It always takes the last portion of the data that you give it, so if you have ordered data of any kind, you may want to shuffle your data before training. If we shuffle, though, we want to make sure to keep each Y value paired with its correct X value.
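To see why ordering matters, here is a plain-Python sketch of a tail split. The helper tail_split is hypothetical and only mimics the behavior described above; it is not Keras's own code.

```python
def tail_split(samples, validation_split):
    """Hypothetical helper: the head becomes training data, the tail validation data."""
    split_at = int(len(samples) * (1 - validation_split))
    return samples[:split_at], samples[split_at:]


# Ordered data: the validation set ends up holding only the largest values,
# so it is not representative of what the model was trained on.
ordered = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
train, val = tail_split(ordered, 0.2)

print(train)  # [10, 20, 30, 40, 50, 60, 70, 80]
print(val)    # [90, 100]
```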

[01:03] Let's first make a permutation of numbers that matches the size of the output array, which gives us a random array of array indices that we can use to reorder our X and Y arrays. When we retrain the network with the validation set, there is now an extra val_loss output value, which represents the mean squared error loss on the validation set.
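The shuffling step might look roughly like this with NumPy. The array names and values are assumptions for illustration; the idea is that indexing both arrays with the same permutation keeps every X row lined up with its Y value.

```python
import numpy as np

# Hypothetical data: four samples, with the row mean as the target.
x_train = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]], dtype="float32")
y_train = x_train.mean(axis=1)

# A random ordering of the indices 0..n-1, one index per output value.
perm = np.random.permutation(y_train.size)

# Apply the same ordering to both arrays so the pairs stay matched.
x_shuffled = x_train[perm]
y_shuffled = y_train[perm]
```

Shuffling the two arrays separately (for example, calling a shuffle function once on each) would break the pairing, which is why a single shared index array is used.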

[01:30] We want to see low values for the validation loss, and hopefully around the same values as the regular loss. If we rerun this several times, the validation and training losses tend to jump all over the place. This tells us that sometimes, the network is overfitting the data, and sometimes, the network is underfitting the data.

[01:53] For this specific example, this is because we don't have enough data. We only have six input data points. After we take 20 percent off to be the validation set, we're only left with four to train on. It makes sense that we can't get the network to fit properly all of the time.

[02:06] The real key thing to look for, though, is that the loss and the validation loss are more consistent as you increase the amount of data that you give the network.

[02:17] For example, if we double the number of inputs and outputs so that we have 12 and rerun the training, we're now training on 9 data points instead of just 4. The loss and validation loss numbers are lower and more consistent across runs.
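The sample counts mentioned above can be checked with a little arithmetic. This sketch assumes the training count is rounded down, which matches the 4 and 9 quoted in the lesson; split_sizes is a hypothetical helper, not part of Keras.

```python
import math


def split_sizes(num_samples, validation_split):
    """Hypothetical helper: return (training count, validation count)."""
    # Assumption: the number of training samples is rounded down.
    train = math.floor(num_samples * (1 - validation_split))
    return train, num_samples - train


print(split_sizes(6, 0.2))   # (4, 2)
print(split_sizes(12, 0.2))  # (9, 3)
```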