【NLP】Text Feature Extraction & Classification
Text Feature Extraction
Count Vectorization
Document-Term Matrix (DTM): a matrix of raw counts, with one row per document and one column per word.
Term Frequency (TF): how often a term appears in a given document.
Inverse Document Frequency (IDF): a measure of how rare a term is across the whole corpus.
TF-IDF: Term Frequency times Inverse Document Frequency.
TF-IDF allows us to understand the importance of a word across an entire corpus of documents, instead of just its raw frequency within a single document.
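Concretely, one common formulation weighs each term $t$ in document $d$ as (scikit-learn applies a smoothed, normalized variant of the same idea):

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}$$

where $N$ is the total number of documents and $\mathrm{df}(t)$ is the number of documents containing $t$. A term that appears often in one document but rarely elsewhere receives a high weight.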
Text Feature Extraction using Scikit-Learn’s Vectorizer classes
from sklearn.feature_extraction.text import CountVectorizer

# The vocabulary learned from a small sample of text messages:
['call', 'dog', 'game', 'go', 'hey', 'lets', 'sister', 'the', 'to', 'today', 'walk', 'want', 'your']
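Below is a minimal sketch that reproduces the vocabulary above. The three sample messages are illustrative stand-ins chosen to yield exactly these thirteen words, not the course data itself:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative messages (assumed for this sketch)
messages = ["Hey lets go to the game today",
            "Call your sister",
            "Want to go walk the dog today"]

cv = CountVectorizer()
dtm = cv.fit_transform(messages)   # sparse Document-Term Matrix

# the learned vocabulary (use get_feature_names() on scikit-learn < 1.0)
print(cv.get_feature_names_out())
print(dtm.toarray())               # raw counts: one row per message, one column per word
```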
from sklearn.feature_extraction.text import TfidfVectorizer

# The same vocabulary is learned; the matrix values are tf-idf weights instead of raw counts:
['call', 'dog', 'game', 'go', 'hey', 'lets', 'sister', 'the', 'to', 'today', 'walk', 'want', 'your']
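TfidfVectorizer performs both steps at once, counting and tf-idf weighting; a short sketch continuing the example above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(messages)  # 'messages' as defined in the sketch above
print(X.shape)                     # (3, 13): 3 documents, 13 vocabulary terms
```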
Transform Counts to Frequencies with Tf-idf
from sklearn.feature_extraction.text import TfidfTransformer

The fit_transform() method actually performs two operations in one call: it fits an estimator to the data and then transforms our count matrix into a tf-idf representation.
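A short sketch of the equivalence, reusing the count matrix dtm from the CountVectorizer example above:

```python
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()

# Two separate calls...
transformer.fit(dtm)
tfidf_dtm = transformer.transform(dtm)

# ...or one combined call, with the same result
tfidf_dtm = transformer.fit_transform(dtm)
```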
Train a Classifier
Here we’ll introduce an SVM classifier that’s similar to SVC, called LinearSVC. LinearSVC handles sparse input better, and scales well to large numbers of samples.
from sklearn.svm import LinearSVC

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True, ...)
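Training then follows the usual scikit-learn pattern. In this sketch, X_train_tfidf and y_train are assumed to come from the vectorization and splitting steps above:

```python
from sklearn.svm import LinearSVC

clf = LinearSVC()
clf.fit(X_train_tfidf, y_train)  # fit the classifier to the tf-idf training matrix
```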
Build a Pipeline
Remember that only our training set has been vectorized into a full vocabulary. In order to perform an analysis on our test set we’ll have to submit it to the same procedures. Fortunately scikit-learn offers a Pipeline class that behaves like a compound classifier.
from sklearn.pipeline import Pipeline
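A sketch of the compound classifier; the step names 'tfidf' and 'clf' are conventional labels rather than requirements, and X_train/y_train are assumed to come from an earlier train/test split:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

text_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),  # vectorize raw text into tf-idf features
    ('clf', LinearSVC()),          # classify the resulting feature vectors
])

# The pipeline applies the same vectorization to training and test data
text_clf.fit(X_train, y_train)
```

Because the vectorizer is part of the pipeline, calling predict() on raw test text automatically applies the vocabulary learned during fitting.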
Test the classifier and display results
# Form a prediction set
predictions = text_clf.predict(X_test)

[[1586    7]
 [  12  234]]
# Print a classification report
from sklearn import metrics
print(metrics.classification_report(y_test, predictions))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1593
        spam       0.97      0.95      0.96       246

   micro avg       0.99      0.99      0.99      1839
   macro avg       0.98      0.97      0.98      1839
weighted avg       0.99      0.99      0.99      1839
Using the text of the messages, our model performed exceedingly well: it classified 98.97% of the test messages correctly (overall accuracy)!
Before moving on, note that the trained pipeline can classify brand-new text directly:

text_clf.predict(["Hello, how are you?"])

Now let’s apply what we’ve learned to a text classification project involving positive and negative movie reviews.
Text Classification Project
Step 1 - Import and load the dataset
The dataset contains the text of 2000 movie reviews: 1000 positive and 1000 negative, stored as a tab-delimited (.tsv) file.
Step 2 - Prep dataset, check for missing values
We have intentionally included records with missing data. Some have NaN values, others have short strings composed of only spaces. This might happen if a reviewer declined to provide a comment with their review. We will show two ways using pandas to identify and remove records containing empty data.
- NaN records are efficiently handled with .isnull() and .dropna()
- Strings that contain only whitespace can be handled with .isspace(), .itertuples(), and .drop()
Detect & remove NaN values
df.isnull().sum()        # count NaN values in each column
df.dropna(inplace=True)  # remove the affected records
By setting inplace=True, we permanently affect the DataFrame currently in memory, and this can’t be undone. However, it does not affect the original source data. If we needed to, we could always load the original DataFrame from scratch.
Detect & remove empty strings
Technically, we’re dealing with “whitespace only” strings. If the original .tsv file had contained empty strings, pandas .read_csv() would have assigned NaN values to those cells by default.
In order to detect these strings we need to iterate over each row in the DataFrame. The .itertuples() pandas method is a good tool for this as it provides access to every field. For brevity we’ll assign the names i, lb and rv to the index, label and review columns.
blanks = []  # start with an empty list
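A sketch of the loop described above; it assumes the DataFrame has exactly the label and review columns, so each tuple unpacks into index, label and review:

```python
blanks = []  # start with an empty list

for i, lb, rv in df.itertuples():  # iterate over (index, label, review)
    if type(rv) == str:            # skip non-string entries such as NaN
        if rv.isspace():           # test 'review' for whitespace-only strings
            blanks.append(i)       # record the index of each blank review

df.drop(blanks, inplace=True)      # remove the blank records
```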
Step 3 - Split the data into train & test sets
from sklearn.model_selection import train_test_split
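A sketch of the split; the test size and random state are illustrative choices, not prescribed values:

```python
from sklearn.model_selection import train_test_split

X = df['review']  # the feature: raw review text
y = df['label']   # the target labels (e.g. 'pos'/'neg')

# Hold out a third of the data for testing (illustrative parameters)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
```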
Step 4 - Build pipelines to vectorize the data, then train and fit a model
from sklearn.pipeline import Pipeline
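A sketch of the two pipelines: text_clf_nb is the name used in the next step, while text_clf_lsvc is an assumed counterpart name for the LinearSVC model:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Naive Bayes pipeline
text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()),
                        ('clf', MultinomialNB())])

# Linear SVC pipeline (assumed name)
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()),
                          ('clf', LinearSVC())])
```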
Step 5 - Feed the training data through the first pipeline
text_clf_nb.fit(X_train, y_train)
Step 6 - Run predictions and analyze the results (naïve Bayes)
# Form a prediction set
predictions = text_clf_nb.predict(X_test)
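Evaluation follows the same pattern as the spam example; a sketch:

```python
from sklearn import metrics

print(metrics.confusion_matrix(y_test, predictions))
print(metrics.classification_report(y_test, predictions))
print(metrics.accuracy_score(y_test, predictions))
```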
Advanced Topic - Adding Stopwords to CountVectorizer
By default, CountVectorizer and TfidfVectorizer do not filter stopwords. However, they offer some optional settings, including passing in your own stopword list.
There are some known issues using Scikit-learn’s built-in stopwords list. Some words that are filtered may in fact aid in classification. In this section we’ll pass in our own stopword list, so that we know exactly what’s being filtered.
The CountVectorizer class accepts the following arguments:
CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
TfidfVectorizer supports the same arguments and more. Under stop_words we have the following options:

stop_words : string {'english'}, list, or None (default)

That is, we can run TfidfVectorizer(stop_words='english') to accept scikit-learn’s built-in list, or TfidfVectorizer(stop_words=['a', 'and', 'the']) to filter just these three words. In practice we would assign our list to a variable and pass that in instead.
Scikit-learn’s built-in list contains 318 stopwords. However, there are words in this list that may influence a classification of movie reviews. With this in mind, let’s trim the list to just 60 words:
stopwords = ['a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'been', 'but', 'by', 'can', ...]  # trimmed list continues up to 60 words
text_clf_lsvc2 = Pipeline([('tfidf', TfidfVectorizer(stop_words=stopwords)),
                           ('clf', LinearSVC())])
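Retraining with the custom stopword list then follows the same pattern as before; a sketch, assuming metrics is imported as above:

```python
text_clf_lsvc2.fit(X_train, y_train)
predictions = text_clf_lsvc2.predict(X_test)
print(metrics.accuracy_score(y_test, predictions))
```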
Feed new data into a trained model
myreview = "I enjoyed this movie"

text_clf_lsvc2.predict([myreview])  # note: predict() expects a list of documents