Instructor: From sklearn, we'll import our datasets. We'll import metrics. From sklearn.feature_extraction.text, we'll import the TfidfVectorizer, which will help make our text understandable to the model. Then from sklearn.naive_bayes, we'll import the multinomial Naive Bayes.
We'll also import matplotlib.pyplot as plt. If you'd like to visualize the confusion matrix at the end of this, from pandas_ml import ConfusionMatrix.
We're going to be working with the newsgroups dataset. We access this a little differently. Our newsgroups_train will be datasets.fetch_20newsgroups, and we'll pass in an argument subset='train', and newsgroups_test = datasets.fetch_20newsgroups(subset='test'). The data has already been split into training and test datasets for us. Let's explore this dataset a little bit.
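As a rough sketch of the loading and exploring step just described: the helper names load_newsgroups and summarize below are made up for illustration and are not part of scikit-learn. The only library call used is datasets.fetch_20newsgroups with the subset argument.

```python
from sklearn import datasets


def load_newsgroups():
    """Return the predefined train and test splits.

    fetch_20newsgroups downloads the data on first use and caches it.
    """
    train = datasets.fetch_20newsgroups(subset="train")
    test = datasets.fetch_20newsgroups(subset="test")
    return train, test


def summarize(bunch, n=3):
    """Hypothetical helper: collect a few facts about a dataset object.

    The object returned by fetch_20newsgroups supports key access
    (bunch["data"]) as well as attribute access (bunch.data).
    """
    return {
        "keys": sorted(bunch.keys()),
        "n_documents": len(bunch["data"]),
        "first_targets": list(bunch["target"][:n]),
        "target_names": list(bunch["target_names"]),
    }
```

Calling newsgroups_train, newsgroups_test = load_newsgroups() and then print(summarize(newsgroups_train)) would show the keys, the number of documents, the first few integer labels, and the category names.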
We see the common keys. Now let's print a couple of items from the data and a couple of target labels. Let's also print our category names, or target names. We can see that each data point is a bunch of text. The target labels we printed are integers, and each one is an index into the target names. The target names include these 20 categories of news text. We have baseball, for sale, motorcycles, religious talk, political talk, etc.
The next thing we need to do is vectorize our text, and this means turning it from words into a model-understandable vector of features represented by numbers. To do this, we can say vectorizer = TfidfVectorizer(). Tfidf stands for term frequency-inverse document frequency. It's a metric commonly used for analyzing text: a word counts for more the more often it shows up in a document, and for less the more documents it shows up in. We'll be using this as the lens through which to examine our text.
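To make the idea of tf-idf a bit more concrete, here is a small sketch on a made-up three-sentence corpus (the sentences are invented for illustration, not taken from the newsgroups data). It uses TfidfVectorizer with its default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus: "the" appears in every document, "innings" in only one.
docs = [
    "the game went to extra innings",
    "the bike needs a new engine",
    "the engine on the bike is loud",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

vocab = vectorizer.vocabulary_  # maps each word to its column index
idf = vectorizer.idf_           # inverse document frequency per column

# A word found in every document gets a lower idf than a rare word.
print("idf of 'the':    ", idf[vocab["the"]])
print("idf of 'innings':", idf[vocab["innings"]])

# Within the first document, the rare word therefore gets the larger weight.
print("weight of 'the' in doc 0:    ", X[0, vocab["the"]])
print("weight of 'innings' in doc 0:", X[0, vocab["innings"]])
```

With the default tokenizer, one-letter words such as "a" are dropped, so they never become features.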
We'll say X_train = vectorizer.fit_transform(newsgroups_train.data). X_test will be vectorizer.transform(newsgroups_test.data). We only fit on the training data, so the test data gets mapped onto the same vocabulary. Our y_train will be our newsgroups_train.target. Our y_test will be newsgroups_test.target. From there, we can say model = MultinomialNB().
We call model.fit with our X training data and our y training data. Then our predictions will be model.predict on our X_test data. We can print model.score on the test data and metrics.classification_report with our true labels and our predictions. For 20 classes, that's not bad.
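The vectorize, fit, predict, and report steps can be sketched end to end on a tiny made-up dataset, so it runs without downloading anything. The sentences and the two category names here are invented stand-ins. With the real data you would pass newsgroups_train.data, newsgroups_train.target, and so on instead.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up stand-in data: 0 = baseball, 1 = motorcycles.
train_texts = [
    "pitcher threw a fastball in the ninth inning",
    "home run wins the baseball game",
    "motorcycle engine needs new oil",
    "rode my bike and motorcycle on the highway",
]
y_train = [0, 0, 1, 1]
test_texts = ["the baseball pitcher won the game", "new motorcycle engine"]
y_test = [0, 1]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learn vocabulary and idf here
X_test = vectorizer.transform(test_texts)        # reuse them; do not refit

model = MultinomialNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

accuracy = model.score(X_test, y_test)
print(accuracy)
print(metrics.classification_report(
    y_test, predictions, target_names=["baseball", "motorcycles"]
))
```

Words in the test text that never appeared in training, like "won" here, are simply ignored by transform.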
One last thing we can do is visualize our confusion matrix. For this many classes, let's make the visualization a little bit better. We'll use pandas_ml to do that.
First, we're going to make a variable called labels, which is a list of our newsgroups_train.target_names, so those are the category names, and then cm = ConfusionMatrix. We pass in our true labels, our predictions, and the label names. We say cm.plot() and, finally, plt.show().
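If pandas_ml isn't available in your environment, a similar table can be built with scikit-learn alone. This is a swapped-in alternative, not the approach used in this lesson, and the labels and predictions below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up stand-ins for three of the category names.
labels = ["atheism", "christian", "medicine"]
y_true = [0, 0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 1, 1, 2, 1]

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Zero out the diagonal (correct predictions) to find the biggest mix-up.
off_diagonal = cm.copy()
np.fill_diagonal(off_diagonal, 0)
i, j = np.unravel_index(off_diagonal.argmax(), off_diagonal.shape)
i, j = int(i), int(j)
print(f"most common mistake: true '{labels[i]}' predicted as '{labels[j]}'")
```

For a plot instead of a printed array, recent scikit-learn versions also offer metrics.ConfusionMatrixDisplay.from_predictions, which can be followed by plt.show().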
This confusion matrix helps us a lot. We can see that in general the model does a decent job predicting most of these categories. It also makes it very easy to spot the areas where there was confusion.
We can see that a lot of posts from other categories were predicted as religion.christian. Looking at the categories, it kind of makes sense. The atheism category probably has a big overlap, and medicine a tiny bit. General, miscellaneous religious talk will also have a lot of overlap. This is a great tool for seeing what you can focus on and optimize next.