Separate Training and Validation Data Automatically in Keras with validation_split

Chris Achard
python ^3.0.0

When training and testing a neural net, it's important to separate the training data from the validation data, so that you aren't checking the accuracy of the model with the same data you used to train it. This helps you detect overfitting. We'll use the validation_split parameter when fitting our model to automatically split the data into a training set and a validation set, and use that to check the validation accuracy of our model.

Transcript

Instructor: We've gotten a low loss in only 100 epochs, but that loss number is calculated with the same data that we're using to train the network, which might mean that we're just overfitting to that data. Keras has a built-in way to split data into training and validation data sets.

We'll use it by supplying a validation_split parameter to the fit function. validation_split is a decimal between zero and one, which represents the fraction of the training data to hold out as the validation data set.

For this example, we'll split into 20 percent validation and 80 percent training. Common values are 0.2 or 0.33, but you can try different values to see what works best for your data set.
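
Here's a minimal sketch of what that fit call might look like. The data, model, and hyperparameters below are hypothetical stand-ins for the lesson's code, using the tf.keras API:

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    # Hypothetical stand-in data: six samples, one feature each.
    x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
    y = np.array([[2.0], [4.0], [6.0], [8.0], [10.0], [12.0]])

    # A tiny regression model, similar in spirit to the lesson's network.
    model = Sequential([
        Dense(4, activation='relu', input_shape=(1,)),
        Dense(1),
    ])
    model.compile(optimizer='adam', loss='mean_squared_error')

    # validation_split=0.2 holds out the last 20 percent of x and y
    # as the validation set.
    model.fit(x, y, epochs=100, validation_split=0.2)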

There's one very important note when using Keras's automatic validation split, though: it always takes the last X percent of the data that you give it, so if you have ordered data of any kind, you may want to shuffle your data before training. If we shuffle, though, we want to make sure to keep the correct Y value paired with the correct X value.

Let's first make a permutation of numbers that matches the size of the output array, which gives us a random array of indices that we can use to reorder our X and Y arrays together.
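
One way to do that with NumPy, as a sketch (assuming x and y are NumPy arrays of the same length, as above):

    # A random permutation of the indices 0..len(y)-1.
    perm = np.random.permutation(len(y))

    # Index both arrays with the same permutation, so each y value
    # stays paired with its original x value.
    x_shuffled = x[perm]
    y_shuffled = y[perm]

    model.fit(x_shuffled, y_shuffled, epochs=100, validation_split=0.2)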

When we retrain the network with the validation split, there is now an extra val_loss output value, which represents the mean squared error loss on the validation set. We want to see low values for the validation loss, ideally around the same as the regular training loss. If we rerun this several times, though, the validation and training losses tend to jump all over the place, which tells us that sometimes the network is overfitting the data and sometimes it's underfitting.
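
To compare the two losses programmatically, note that fit returns a History object whose history dict records both metrics per epoch (a sketch, continuing the example above):

    history = model.fit(x_shuffled, y_shuffled, epochs=100,
                        validation_split=0.2)

    # 'loss' is the training loss; 'val_loss' is the loss on the
    # held-out validation set.
    print(history.history['loss'][-1])      # final training loss
    print(history.history['val_loss'][-1])  # final validation loss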

For this specific example, that's because we don't have enough data. We only have six input data points, and after Keras takes the last 20 percent off to be the validation set, we're left with only four for training. It makes sense that we can't get the network to fit properly all of the time.

The key thing to look for, though, is that the loss and the validation loss become more consistent as you increase the amount of data that you give the network.

For example, if we double the number of inputs and outputs so that we have 12 and rerun the training, we're now training on 9 data points instead of just 4. The loss and validation loss numbers are lower and more consistent across runs.
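
To see where the 4 and the 9 come from: Keras computes the split index by flooring len(x) * (1 - validation_split), then trains on the samples before that index and validates on the rest. A sketch of the arithmetic:

    # With validation_split=0.2, the first floor(n * 0.8) samples
    # are used for training and the remainder for validation.
    for n in (6, 12):
        split_at = int(n * (1 - 0.2))
        print(n, 'samples ->', split_at, 'training,', n - split_at, 'validation')
    # 6 samples -> 4 training, 2 validation
    # 12 samples -> 9 training, 3 validation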