⚠️ This lesson is retired and might contain outdated information.

Predict Categories in Python using Scikit-learn's Logistic Regression module

Hannah Davis

Despite its often confusing name, logistic regression is a linear model that is used for classification, or estimating discrete values.

We'll use scikit-learn's built-in iris dataset to classify irises into three categories. We'll also look at metrics and tools to evaluate our classification models, including the accuracy score, classification report, and confusion matrix.

Instructor: [00:00] From sklearn, we'll import datasets and we'll import metrics. From sklearn.model_selection, we'll import train_test_split. From sklearn.linear_model, we'll import LogisticRegression. We'll be working with the iris dataset, which we load with datasets.load_iris.
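As a minimal sketch, the imports and the load step described above might look like this. The variable name iris is the one the transcript uses later; everything else is standard scikit-learn.

```python
# Modules named in the transcript.
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# load_iris() returns a dictionary-like object holding the data,
# the target labels, and some descriptive metadata.
iris = datasets.load_iris()
```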

[00:32] Let's explore this. Let's print iris.keys(). We'll print the target names. We'll print the feature names. Let's print a couple of lines of data and a couple of the target labels. Let's also print the shape of the data and the shape of the target.

[01:05] We can see our three iris classes. We can see four features. We can see that each data point has a value for each feature. We can see that the first three targets are all class zero. We can see that our data has 150 data points, with four features each, and our target also has 150 points.
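The exploration above can be sketched roughly as follows. Printing the first three rows is an assumption about what "a couple" means here; the shapes in the comments are the ones the transcript reports.

```python
from sklearn import datasets

iris = datasets.load_iris()

print(iris.keys())           # the fields available on the dataset object
print(iris.target_names)     # the three iris classes
print(iris.feature_names)    # the four measured features
print(iris.data[:3])         # first three data points, four values each
print(iris.target[:3])       # first three labels, all class 0
print(iris.data.shape)       # (150, 4)
print(iris.target.shape)     # (150,)
```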

[01:28] We'll assign X to our iris.data and Y to iris.target. Then, we're going to split our data into training and test datasets. We say X_train, X_test, Y_train, Y_test equals train_test_split. We pass it our X, our Y, and a test size, which is the fraction of our data that we want to go into the test dataset, so we'll say 0.15, or 15 percent. Then our random state, which we'll say is 42, so the split is the same every time we run it.
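A short sketch of the split described above, using the same variable names and the same test_size and random_state values as the transcript. With 150 samples and a test size of 0.15, scikit-learn rounds the test set up to 23 samples, leaving 127 for training.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data
Y = iris.target

# test_size is a fraction of the data; random_state fixes the shuffle
# so the same split is produced on every run.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.15, random_state=42
)

print(X_train.shape, X_test.shape)  # (127, 4) (23, 4)
```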

[02:09] From there, we say model equals LogisticRegression() and call model.fit, passing it our training data. Then we can make predictions by calling model.predict on our X_test data. Let's print our Y_test data, which holds the true labels, and the predicted labels. We can see that the predictions match the true labels.
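Fitting and predicting, as described above, could be sketched like this. The max_iter=1000 argument is an assumption added here so the solver has room to converge on recent scikit-learn versions; the transcript uses the default constructor.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.15, random_state=42
)

# max_iter=1000 is an assumption, not part of the lesson.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, Y_train)

# One predicted class label per test sample.
predictions = model.predict(X_test)

print(Y_test)       # true labels
print(predictions)  # predicted labels
```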

[02:40] To evaluate the model, we can call model.score and pass it our X_test data and our Y_test data. We'll print that out, and that's a perfect score. For classifiers like logistic regression, the default score is the accuracy score, which is the fraction of samples that were labeled correctly. This is the same as calling metrics.accuracy_score and passing in the Y_test data, which holds the true labels, and the predicted labels.
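To illustrate that model.score and metrics.accuracy_score agree, here is a small sketch that also computes the same number by hand as the fraction of matching labels. As before, max_iter=1000 is an assumption added for convergence, and the name manual is only for illustration.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.15, random_state=42
)
model = LogisticRegression(max_iter=1000)  # max_iter is an assumption
model.fit(X_train, Y_train)
predictions = model.predict(X_test)

score = model.score(X_test, Y_test)                   # predicts internally, then scores
accuracy = metrics.accuracy_score(Y_test, predictions)
manual = np.mean(predictions == Y_test)               # fraction of correct labels

print(score, accuracy, manual)  # three ways to get the same value
```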

[03:16] Scikit-learn also comes with a classification report, which you can access by calling metrics.classification_report and passing in the true labels and the predicted labels. There are four columns here. Precision is the number of true positives over the number of true positives plus false positives.

[03:43] This is like the probability of a positive prediction being correct. Recall is the number of true positives over the number of true positives plus false negatives. This is basically saying, "What is the probability that the model will pick up on a true positive?" The F1 score is the harmonic mean of precision and recall, and support is the number of samples of each class in the true labels.
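Because the iris predictions in this lesson are perfect, every precision, recall, and F1 value in that report is 1.0, which makes the definitions hard to see. The sketch below uses a small set of hypothetical labels, made up purely for illustration, and computes the four report columns by hand for class 1 before printing scikit-learn's report for comparison.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels, invented for this example (not iris model output).
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 2])

cls = 1  # treat class 1 as "positive" and everything else as "negative"
tp = np.sum((y_pred == cls) & (y_true == cls))  # predicted 1, truly 1
fp = np.sum((y_pred == cls) & (y_true != cls))  # predicted 1, truly something else
fn = np.sum((y_pred != cls) & (y_true == cls))  # truly 1, predicted something else

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
support = np.sum(y_true == cls)                     # true samples of class 1

print(precision, recall, f1, support)

# The row for class 1 in the report should show the same values, rounded.
print(metrics.classification_report(y_true, y_pred))
```

Here class 1 has two true positives, one false positive, and one false negative, so precision and recall both come out to two thirds.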

[04:10] One last tool that scikit-learn gives us is the confusion matrix. We can call metrics.confusion_matrix and pass in our true labels and our predicted labels. This is a matrix of how the model performed, with the actual classes as the rows and the predicted classes as the columns.

[04:34] This confusion matrix shows a perfect prediction, with every count on the diagonal: the eight samples in class zero were correctly predicted, the nine samples in class one were correctly predicted, and the six samples in class two were correctly predicted.
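A sketch of building the confusion matrix for the same split might look like the following. The labels=[0, 1, 2] argument and max_iter=1000 are assumptions added here: the first pins the row and column order, the second helps the solver converge. Exact counts can vary with the scikit-learn version, so none are hard-coded.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.15, random_state=42
)
model = LogisticRegression(max_iter=1000)  # max_iter is an assumption
model.fit(X_train, Y_train)
predictions = model.predict(X_test)

# Row i, column j counts samples whose true class is i and predicted class is j.
cm = metrics.confusion_matrix(Y_test, predictions, labels=[0, 1, 2])
print(cm)

# Correct predictions sit on the diagonal; anything off it is a mistake.
correct = np.trace(cm)
print(correct, "of", cm.sum(), "test samples on the diagonal")
```

Each row sums to the number of test samples that truly belong to that class, which is the same number the classification report lists as support.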