Regression involves analyzing multiple sources of data and predicting a response based on that data. It takes the k-nearest neighbors algorithm a step further by choosing the N closest neighbors and averaging their values. Let's write a function that tries to predict how many pies our restaurant should make today!
Instructor: [00:00] Let's say that I own a restaurant that sells pies, and lately I've been having trouble making the right number of pies for the day. Sometimes I make too many and sometimes too few. For the past few days, I've been keeping track of the weather, whether it was a weekend or not, and how many pies I sold that day.
[00:16] If it was a weekend, I recorded a one; if not, a zero. The lower the temperature, the colder it was that day. I also recorded the total number of pies I sold. Using this data, I want to calculate how many pies I should prepare for today.
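As a sketch, that tracked data might look something like this array of objects (the numbers here are invented for illustration; the lesson's exact values aren't shown in the transcript):

```js
// Hypothetical sample of the tracked data: isWeekend (1 or 0),
// the day's temperature, and total pies sold that day.
const results = [
  { isWeekend: 1, temperature: 75, pies: 120 },
  { isWeekend: 0, temperature: 60, pies: 80 },
  { isWeekend: 0, temperature: 55, pies: 70 },
  { isWeekend: 1, temperature: 80, pies: 130 },
  { isWeekend: 0, temperature: 72, pies: 95 },
];
```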
[00:31] In order to do this, we first need a function that calculates the distance between today's data and each entry in our array of previous results. This calcDistance function uses the Euclidean distance to calculate that distance, or in other words, puts a number to how similar today's data is to our previous results.
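A minimal sketch of calcDistance, assuming it takes two feature arrays (the signature is inferred from the description, not shown in the transcript):

```js
// Euclidean distance between two feature arrays, e.g.
// calcDistance([pastIsWeekend, pastTemperature], [isWeekend, temperature]).
const calcDistance = (a, b) =>
  Math.sqrt(a.reduce((sum, value, i) => sum + (value - b[i]) ** 2, 0));
```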
[00:51] Let's begin by destructuring the first and second results we'll get back from the calcNeighbors function we need to create, passing in our previous results and today's information. We'll say it is a weekend and the temperature outside is mild. Now let's create our calcNeighbors function, which takes the results array, and let's destructure today's information from the second argument.
[01:14] We're going to return the result of reducing over our results array. Inside the reducer, we call our calcDistance function with the features we've defined: whether it's a weekend and today's temperature.
[01:31] Once we have those features outlined, we return a new array each time, spreading what we've accumulated so far and adding a new object with a dist property that holds our calculated distance.
[01:45] At the end, we sort so that the object with the smallest distance comes first. This way, when we destructure first and second on line 19, we're getting the two closest neighbors.
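Putting those steps together, calcNeighbors might look something like this sketch (the dist property name follows the transcript; the exact shape of the lesson's code is an assumption):

```js
// Reduce over the previous results, attaching a dist property that
// measures how similar each past day is to today, then sort ascending
// so the closest neighbors come first.
const calcNeighbors = (results, { isWeekend, temperature }) =>
  results
    .reduce((acc, result) => {
      const dist = calcDistance(
        [result.isWeekend, result.temperature],
        [isWeekend, temperature]
      );
      return [...acc, { ...result, dist }];
    }, [])
    .sort((a, b) => a.dist - b.dist);

// Today: it's a weekend and the temperature outside is mild.
const [first, second] = calcNeighbors(results, {
  isWeekend: 1,
  temperature: 72,
});
```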
[01:59] Again, we're looping over each of our data sets and calculating how similar today's information is to the previous data. We return a sorted array that puts the closest neighbors, meaning the most similar to today's data, first. This algorithm of calculating the distance to neighbors is called the k-nearest neighbors algorithm.
[02:20] Then if we console.log first.pies plus second.pies, divided by two, we'll get the number of pies we should make today. This process of finding the N closest neighbors and averaging their values is called regression. It's a simple but powerful way of predicting future events based on past events.
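Written out, that prediction is just the average of the two nearest neighbors:

```js
// Average the pies sold on the two most similar past days.
console.log((first.pies + second.pies) / 2);
```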
[02:42] In our example we're only averaging the two closest neighbors, but this is just one use case, and there's no magic number of neighbors that gives the best answer. A general rule of thumb is to use the square root of the number of items we have as the number of neighbors to calculate against.
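As a hypothetical extension of the lesson's code, that rule of thumb could be applied like this:

```js
// Rule-of-thumb choice of k: the square root of the dataset size.
const k = Math.round(Math.sqrt(results.length));

// Average the pies across the k nearest neighbors instead of just two.
const neighbors = calcNeighbors(results, {
  isWeekend: 1,
  temperature: 72,
}).slice(0, k);
const prediction =
  neighbors.reduce((sum, n) => sum + n.pies, 0) / neighbors.length;
```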
[02:59] As you can probably imagine, the more features we have to work with, the more accurate our function will become. There's nothing special about using just two features for today's data.