# Use Linear Regression To Estimate Continuous Values with Python and Scikit-learn

Instructor: Hannah Davis

Published 5 years ago
Updated 3 years ago

Linear regression is a linear model that is used for regression problems, or problems where the goal is to predict a value on a continuous spectrum (as opposed to a discrete category).

We’ll use linear regression to estimate continuous values. In this case, we’ll predict house prices in Boston. We'll also look at how to visualize our results with matplotlib, and how to evaluate our models with different metrics for regression problems.

Instructor: [00:00] We're going to import matplotlib.pyplot as plt for visualization later. From scikit-learn, which is sklearn, we'll import scikit-learn's datasets. We'll import metrics to evaluate our model. From sklearn.model_selection, we'll import train_test_split, which will allow us to make training and test data. Finally, from sklearn.linear_model, we'll import LinearRegression.
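As a rough sketch, the imports described here would look something like the following. It assumes scikit-learn and matplotlib are already installed.

```python
# Plotting library, used later to visualize predictions.
import matplotlib.pyplot as plt

# Built-in datasets and evaluation metrics.
from sklearn import datasets, metrics

# The linear model itself.
from sklearn.linear_model import LinearRegression

# Helper for splitting data into training and test sets.
from sklearn.model_selection import train_test_split
```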

[00:43] The first thing we want to do is load in our dataset, and scikit-learn has built-in datasets that we'll be using. First, we'll be working with Boston housing prices. To load this dataset, we say datasets.load_boston. It's always a good idea to explore the dataset a little bit, so let's see what this dataset contains.

[01:04] We can print boston.keys(), and my filename here is linear.py. We can see four keys here: data, which is the data; feature_names, which are the feature names; DESCR, which describes the dataset; and target, which are our target labels.

[01:23] Let's examine these a little bit more. We can say print boston.feature_names, and we can see 13 variables here. If we print boston.DESCR, we can take a look at some of these features. We can see that the features are things like crime rate, whether it's next to the river, the average number of rooms, the distance to work, et cetera.

[01:52] Let's also look at the data. We'll just look at the first five data points. Let's also look at the target values. We can see that each data point has 13 variables and that the target is one number, which is the housing price in thousands of dollars. This is old data. We're using these 13 variables to predict that housing price.

[02:20] One more thing we can do is print boston.data.shape and boston.target.shape. We can see that there are 506 data points, with 13 variables each. For the target, there are 506 single values. Great.
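Here is a minimal sketch of these exploration steps. One caveat: `load_boston` was removed in scikit-learn 1.2, so this sketch assumes a recent scikit-learn and substitutes the bundled `load_diabetes` dataset, which exposes the same `data`, `target`, `feature_names`, and `DESCR` keys. The variable name `dataset` is just illustrative; on an older scikit-learn you could call `datasets.load_boston()` instead.

```python
from sklearn import datasets

# Assumption: load_diabetes stands in for load_boston, which recent
# scikit-learn releases no longer include.
dataset = datasets.load_diabetes()

# Which fields does the dataset object expose?
print(dataset.keys())

# Names of the input variables.
print(dataset.feature_names)

# First five data points and their target values.
print(dataset.data[:5])
print(dataset.target[:5])

# Number of rows and columns in the features, and number of targets.
print(dataset.data.shape)
print(dataset.target.shape)
```

The shapes printed here will differ from the 506 by 13 reported in the lesson, because the stand-in dataset is a different size.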

[02:42] We're going to assign x to be boston.data, and y will be boston.target, our target labels. One important component of machine learning is to have training data and test data. We can create this with the train_test_split function. We'll say x_train, x_test, y_train, and y_test equals train_test_split.

[03:12] This takes x, y, and the proportion of data that we want to go into our test data. This number can vary, though you want your training data to be significantly larger. We'll say a third of the data can go into testing. The last thing is to add the optional random_state argument. This just makes the split deterministic, so that the data will get split the same way every time we run it. This can be any integer; we'll just say 16.
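The split might be written roughly as follows. This is a sketch under two assumptions: the bundled `load_diabetes` dataset stands in for the Boston data (which recent scikit-learn releases no longer ship), and "a third" is written as `test_size=0.33`.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Assumption: a stand-in dataset, since load_boston is unavailable
# in recent scikit-learn releases.
dataset = datasets.load_diabetes()
x = dataset.data
y = dataset.target

# Hold out roughly a third of the rows for testing. random_state fixes
# the shuffle so the split is the same on every run.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=16
)

print(x_train.shape, x_test.shape)
```

Running the script twice with the same `random_state` should yield identical splits; changing or omitting it reshuffles the rows.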

[03:42] From there, we just say model equals LinearRegression(), and then model.fit(x_train, y_train). That's all there is to it. Then, if we want to use this model to make predictions, we say model.predict and pass in our test data. We can see a whole bunch of housing predictions here.
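Fitting and predicting could look something like this. As before, the sketch assumes `load_diabetes` as a stand-in for the Boston data and `test_size=0.33` for "a third"; the name `predictions` is illustrative.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumption: stand-in dataset in place of load_boston.
dataset = datasets.load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.33, random_state=16
)

# Create the model and fit it on the training data only.
model = LinearRegression()
model.fit(x_train, y_train)

# Predict target values for the held-out test data.
predictions = model.predict(x_test)
print(predictions[:5])
```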

[04:10] We can visualize these by saying plt.scatter and passing in our actual y test labels versus the predicted labels. We'll add an x label, which is the actual prices, and a y label, which is the predicted prices. Then, we say plt.show. We can see here that this looks pretty good. It's not perfectly linear, but the model's predicting the prices OK.
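A sketch of the scatter plot follows, again using `load_diabetes` as a stand-in dataset. To keep it runnable on machines without a display, it assumes the non-interactive `Agg` backend and saves the figure to a hypothetical file, `predictions.png`, instead of calling `plt.show()`. When working interactively, you could drop the `matplotlib.use` line and call `plt.show()` as in the lesson.

```python
from pathlib import Path

import matplotlib

# Non-interactive backend so the script runs without a display.
matplotlib.use("Agg")
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumption: stand-in dataset in place of load_boston.
dataset = datasets.load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.33, random_state=16
)

model = LinearRegression()
model.fit(x_train, y_train)
predictions = model.predict(x_test)

# Actual values on the x axis, predicted values on the y axis.
# Points near the diagonal are good predictions.
plt.scatter(y_test, predictions)
plt.xlabel("Actual values")
plt.ylabel("Predicted values")

# Hypothetical output file name; replace with plt.show() interactively.
output_path = Path("predictions.png")
plt.savefig(output_path)
```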

[04:40] From here, we'll look at a couple of metrics that will help us evaluate this model. Most models have a built-in score, which you can access by typing model.score and passing in the x test data and the y test labels.

[04:53] For linear regression, the built-in score is based on r^2, also known as the coefficient of determination. This is a number that is at most 1, or 100 percent, and that basically describes how well the model fits the data. It usually falls between zero and one, but it can be negative when the model fits worse than always predicting the mean. Generally, though not always, it's better to have a higher r^2 number, and that's worth reading more on.

[05:15] There's also the mean squared error, which can be found by typing metrics.mean_squared_error and passing the y test data, or the actual labels, and the predicted labels. This is a metric that calculates the error of the model. It's generally better for it to be lower. Here, we can see our r^2 gave us about 64 percent, and our mean squared error is 23.4.
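Both metrics can be computed along these lines. The sketch assumes `load_diabetes` as a stand-in for the Boston data, so the numbers it prints will not match the 64 percent and 23.4 quoted in the lesson.

```python
from sklearn import datasets, metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumption: stand-in dataset in place of load_boston.
dataset = datasets.load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.33, random_state=16
)

model = LinearRegression()
model.fit(x_train, y_train)
predictions = model.predict(x_test)

# Built-in score: r^2 on the test data (higher is generally better).
r2 = model.score(x_test, y_test)

# Mean squared error between actual and predicted values
# (lower is generally better).
mse = metrics.mean_squared_error(y_test, predictions)

print("r^2:", r2)
print("mean squared error:", mse)
```

Note that the mean squared error is expressed in squared units of the target, so it is only comparable between models trained on the same target.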

Andrew
~ 5 years ago

Looks awesome. I'm a newcomer to Python.
Is there a quick setup explanation for this tutorial?

Hannah Davis (instructor)
~ 5 years ago

Scikit-learn and matplotlib need to be installed. If you have pip, typing `pip install scikit-learn` and `pip install matplotlib` into terminal generally works (same with conda). Otherwise, you can install pip by typing `sudo easy_install pip`. For lesson 5, pandas_ml will also need to be installed if you want to visualize their confusion matrix. For lesson 6, graphviz needs to be installed if you want to visualize the decision tree.

Hannah Davis (instructor)
~ 5 years ago

I'm using Python 2.7 for this course. You can see your version number by typing `python --version` into terminal. Please let me know any obstacles you come across so I can make it clearer to get up and running! :)

Andrew
~ 5 years ago

Thanks Hannah,

is there a way to download multiple Py versions and switch like with Node and nvm use?

Andrew
~ 5 years ago

Since pip and Conda seem to be somewhat interchangeable (like npm and yarn...?) I was able to do this:

`conda install python=2.7`

Hannah Davis (instructor)
~ 5 years ago

Pip and conda can both install python packages, yup! But you generally want to use one or the other (they can install packages in different places and it can get messy....)

The best thing to do is create a virtual environment. You can do that with conda by typing:

`conda create -n yourenvname python=2.7`

and then

`source activate yourenvname`

Then install your packages, and do the course from there!

Alan O'Donnell
~ 5 years ago

One small mistake: r squared scores don't necessarily need to be positive; they can be anything <= 1.
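To illustrate, here is a small plain-Python sketch of the usual definition, r^2 = 1 - SS_res / SS_tot. The helper `r_squared` and the toy numbers are made up for this example and are not taken from the lesson.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Squared error of the predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of always predicting the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0]

# Perfect predictions give the maximum score of 1.
perfect = r_squared(actual, [1.0, 2.0, 3.0])

# Always predicting the mean gives 0.
mean_only = r_squared(actual, [2.0, 2.0, 2.0])

# Predictions worse than the mean give a negative score.
worse = r_squared(actual, [3.0, 2.0, 1.0])

print(perfect, mean_only, worse)  # prints 1.0 0.0 -3.0
```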

Elias Moreno
~ 4 years ago

I keep getting this hanging error: "UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment. 'Matplotlib is building the font cache using fc-list.' " I have looked in various places and I can't seem to find a fix. please help.

Hannah Davis (instructor)
~ 4 years ago

Elias - that sounds like a problem on your system. Have you followed the instructions here? https://stackoverflow.com/questions/34771191/matplotlib-taking-time-when-being-imported

Hannah Davis (instructor)
~ 4 years ago

Alan - thank you! I'll update that soon.

Elias Moreno
~ 4 years ago

I fixed it. Thank you!!

Jerry
~ 4 years ago

I decided to use JetBrains' PyCharm Community Edition as my development environment for these lessons, and I ran into problems right out of the box. You can tell me Python is easy, and maybe once, long ago, it was, but no more. I'm a Python newbie but not new to programming, so I can safely say Python is no cakewalk.

I installed Python 3 using brew on my Mac. When I ran lesson 1, I kept getting a "Python framework not installed" error. This error occurred when it tried to process the import statement:

`import matplotlib.pyplot as plt`

When I checked for the framework, it was there in the Library/Frameworks folder. A Google search told me that my matplotlib "backend" (really the display driver, the thing that displays the plot) was not defined, and that I would have to create a matplotlibrc file in the .matplotlib folder located in my root folder for the problem to go away. Well, it didn't, because PyCharm defines its own environment (i.e. a venv). I then re-read the docs about using matplotlib, and what they told me was that you can define the backend several ways, but if you set it in your script, that trumps all the rest. So I defined the backend in the script file itself as follows:

`import matplotlib`

`matplotlib.use('TkAgg')  # TkAgg is the backend; there are more backends`

However, you need to do this before you execute the statement:

`import matplotlib.pyplot as plt`

Otherwise, it will not work.

Once I figured that out, it got past the framework error, and then it told me I was missing scipy. I did a pip install for that, and the plot rendered in a separate window.