Text Feature Extraction

Count Vectorization

Document Term Matrix (DTM) : a matrix of counts, with one row per document and one column per word in the vocabulary.

Term Frequency (TF)

Inverse Document Frequency (IDF)

TF-IDF : Term Frequency times Inverse Document Frequency

TF-IDF weighs how important a word is to a document relative to the entire corpus, rather than just counting how often the word appears in a single document.
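Concretely, scikit-learn's default "smooth idf" variant computes the following (classic textbook definitions differ slightly):

tf-idf(t, d) = tf(t, d) * idf(t),   where   idf(t) = ln((1 + n) / (1 + df(t))) + 1

Here tf(t, d) is the raw count of term t in document d, n is the number of documents in the corpus, and df(t) is the number of documents containing t. Each document's tf-idf vector is then l2-normalized, which is why the values in the TfidfVectorizer output below fall between 0 and 1.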

Text Feature Extraction using Scikit-Learn’s Vectorizer classes

from sklearn.feature_extraction.text import CountVectorizer

messages = ['Hey, lets go to the game today!', 'Call your sister.', 'Want to go walk your dog?']
vect = CountVectorizer()
count = vect.fit_transform(messages)
print(vect.get_feature_names())  # in scikit-learn >= 1.0 this is get_feature_names_out()
print(count.todense())
['call', 'dog', 'game', 'go', 'hey', 'lets', 'sister', 'the', 'to', 'today', 'walk', 'want', 'your']
[[0 0 1 1 1 1 0 1 1 1 0 0 0]
 [1 0 0 0 0 0 1 0 0 0 0 0 1]
 [0 1 0 1 0 0 0 0 1 0 1 1 1]]
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvect = TfidfVectorizer()
tfidfcount = tfidfvect.fit_transform(messages)
print(tfidfvect.get_feature_names())
print(tfidfcount.todense())
['call', 'dog', 'game', 'go', 'hey', 'lets', 'sister', 'the', 'to', 'today', 'walk', 'want', 'your']
[[0.         0.         0.40301621 0.30650422 0.40301621 0.40301621
  0.         0.40301621 0.30650422 0.40301621 0.         0.
  0.        ]
 [0.62276601 0.         0.         0.         0.         0.
  0.62276601 0.         0.         0.         0.         0.
  0.4736296 ]
 [0.         0.45954803 0.         0.34949812 0.         0.
  0.         0.         0.34949812 0.         0.45954803 0.45954803
  0.34949812]]
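We can reproduce the second row ("Call your sister.") by hand with the smooth-idf formula above. A minimal sketch, assuming the vectorizer's default l2 row normalization:

import numpy as np

def idf(doc_freq, n_docs=3):
    # smooth idf, as scikit-learn computes it by default
    return np.log((1 + n_docs) / (1 + doc_freq)) + 1

# terms in "Call your sister.": 'call' (df=1), 'sister' (df=1), 'your' (df=2), each with tf=1
row = np.array([idf(1), idf(1), idf(2)])
row /= np.linalg.norm(row)  # l2-normalize the document row
print(row)  # [0.62276601 0.62276601 0.4736296 ] -- matches the nonzero entries above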

Transform Counts to Frequencies with Tf-idf

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()

# X_train_counts is the count matrix produced earlier by CountVectorizer().fit_transform(X_train)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
# (3733, 7082)

The fit_transform() method performs two operations in one call: it fits the estimator to the data, then transforms our count matrix into a tf-idf representation.
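The two steps can also be run separately. What matters is that the transformer is fit only on the training counts, and the test set later goes through transform() alone. A minimal sketch (X_test_counts is a hypothetical count matrix for the test set):

tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(X_train_counts)                        # learn idf weights from the training counts
X_train_tfidf = tfidf_transformer.transform(X_train_counts)
# X_test_tfidf = tfidf_transformer.transform(X_test_counts)  # transform only -- never re-fit on test data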

Train a Classifier

Here we’ll introduce an SVM classifier that’s similar to SVC, called LinearSVC. LinearSVC handles sparse input better, and scales well to large numbers of samples.

from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train_tfidf,y_train)
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

Build a Pipeline

Remember that only our training set has been vectorized into a full vocabulary. In order to perform an analysis on our test set we’ll have to submit it to the same procedures. Fortunately scikit-learn offers a Pipeline class that behaves like a compound classifier.

from sklearn.pipeline import Pipeline
# from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.svm import LinearSVC

text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)

Test the classifier and display results

# Form a prediction set
predictions = text_clf.predict(X_test)

# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))
[[1586    7]
 [  12  234]]
# Print a classification report
print(metrics.classification_report(y_test,predictions))
              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1593
        spam       0.97      0.95      0.96       246

   micro avg       0.99      0.99      0.99      1839
   macro avg       0.98      0.97      0.98      1839
weighted avg       0.99      0.99      0.99      1839

Using the text of the messages, our model performed exceedingly well: its overall accuracy was 98.97% (1820 of 1839 test messages classified correctly)!

We can also feed brand-new messages to the trained pipeline:

text_clf.predict(["Hello, how are you?"])
# array(['ham'], dtype=object)

Now let's apply what we've learned to a text classification project involving positive and negative movie reviews.

Text Classification Project

Step 1 - Import and load the dataset

The dataset contains the text of 2000 movie reviews. 1000 are positive, 1000 are negative, and the text has been preprocessed as a tab-delimited file.
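A minimal loading sketch (the filename moviereviews.tsv is an assumption; the label and review column names match the code in Step 3):

import pandas as pd

df = pd.read_csv('moviereviews.tsv', sep='\t')  # assumed path; the file is tab-delimited
df.head()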

Step 2 - Prep dataset, check for missing values

We have intentionally included records with missing data. Some have NaN values, others have short strings composed of only spaces. This might happen if a reviewer declined to provide a comment with their review. We will show two ways using pandas to identify and remove records containing empty data.

Detect & remove NaN values
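Before dropping anything, it's worth counting the missing values per column, a standard pandas check:

df.isnull().sum()  # number of NaN values in each column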

df.dropna(inplace=True)

By setting inplace=True, we permanently affect the DataFrame currently in memory, and this can’t be undone. However, it does not affect the original source data. If we needed to, we could always load the original DataFrame from scratch.

Detect & remove empty strings

Technically, we’re dealing with “whitespace only” strings. If the original .tsv file had contained empty strings, pandas .read_csv() would have assigned NaN values to those cells by default.

In order to detect these strings we need to iterate over each row in the DataFrame. The .itertuples() pandas method is a good tool for this as it provides access to every field. For brevity we’ll assign the names i, lb and rv to the index, label and review columns.

blanks = []  # start with an empty list

for i, lb, rv in df.itertuples():  # iterate over the DataFrame
    if type(rv) == str:            # avoid NaN values
        if rv.isspace():           # test 'review' for whitespace
            blanks.append(i)       # add matching index numbers to the list

df.drop(blanks, inplace=True)
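The same check can also be done without an explicit loop, using pandas string methods. A sketch, assuming NaN values have already been dropped as above:

# vectorized alternative: keep only rows whose review is not whitespace-only
df = df[~df['review'].str.isspace()]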

Step 3 - Split the data into train & test sets

from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Step 4 - Build pipelines to vectorize the data, then train and fit a model

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Naïve Bayes:
text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()),
('clf', MultinomialNB()),
])

# Linear SVC:
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()),
('clf', LinearSVC()),
])

Step 5 - Feed the training data through the first pipeline

text_clf_nb.fit(X_train, y_train)

Step 6 - Run predictions and analyze the results (naïve Bayes)

# Form a prediction set
predictions = text_clf_nb.predict(X_test)

# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

# Print a classification report
print(metrics.classification_report(y_test,predictions))

# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))
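The linear SVC pipeline goes through the same steps; a sketch mirroring the naïve Bayes block above:

# Fit and evaluate the Linear SVC pipeline the same way
text_clf_lsvc.fit(X_train, y_train)
predictions = text_clf_lsvc.predict(X_test)
print(metrics.accuracy_score(y_test, predictions))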

Advanced Topic - Adding Stopwords to CountVectorizer

By default, CountVectorizer and TfidfVectorizer do not filter stopwords. However, they offer some optional settings, including passing in your own stopword list.

There are some known issues using Scikit-learn’s built-in stopwords list. Some words that are filtered may in fact aid in classification. In this section we’ll pass in our own stopword list, so that we know exactly what’s being filtered.

The CountVectorizer class accepts the following arguments:

CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

TfidfVectorizer supports the same arguments and more. Under stop_words we have the following options:

stop_words : string {'english'}, list, or None (default)

That is, we can run TfidfVectorizer(stop_words='english') to accept scikit-learn's built-in list,
or TfidfVectorizer(stop_words=['a', 'and', 'the']) to filter these three words. In practice we would assign our list to a variable and pass that in instead.

Scikit-learn’s built-in list contains 318 stopwords. However, there are words in this list that may influence a classification of movie reviews. With this in mind, let’s trim the list to just 60 words:

stopwords = ['a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'been', 'but', 'by', 'can',
             'even', 'ever', 'for', 'from', 'get', 'had', 'has', 'have', 'he', 'her', 'hers', 'his',
             'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'just', 'me', 'my', 'of', 'on', 'or',
             'see', 'seen', 'she', 'so', 'than', 'that', 'the', 'their', 'there', 'they', 'this',
             'to', 'was', 'we', 'were', 'what', 'when', 'which', 'who', 'will', 'with', 'you']

# Pass our own list in, rather than the built-in 'english' list
text_clf_lsvc2 = Pipeline([('tfidf', TfidfVectorizer(stop_words=stopwords)),
                           ('clf', LinearSVC()),
])
text_clf_lsvc2.fit(X_train, y_train)
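To see whether the trimmed stopword list helps, evaluate this pipeline the same way as before (results will depend on the data and split):

predictions = text_clf_lsvc2.predict(X_test)
print(metrics.confusion_matrix(y_test, predictions))
print(metrics.accuracy_score(y_test, predictions))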

Feed new data into a trained model

myreview = "I enjoyed this movie"
print(text_clf_nb.predict([myreview]))
# ['pos']
print(text_clf_lsvc.predict([myreview]))
# ['neg']