Applying Machine Learning to Sentiment Analysis

1. Obtaining the IMDb movie review dataset

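A compressed archive of the movie review dataset is available at http://ai.stanford.edu/~amaas/data/sentiment/

The code below assumes the reviews have already been assembled into a single CSV file, movie_data.csv, with a review column and a sentiment column (1 = positive, 0 = negative).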
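The post does not show how movie_data.csv was produced. The following is a minimal Python 3 sketch of one way to do it, assuming the archive unpacks to an aclImdb/ folder with train/ and test/ subfolders that each contain pos/ and neg/ folders of .txt reviews. The function names collect_reviews and write_movie_csv are hypothetical helpers for illustration, not part of any library.

```python
import csv
import os
import random

# Same label encoding as the sentiment column: 1 = positive, 0 = negative
LABELS = {'pos': 1, 'neg': 0}


def collect_reviews(basepath):
    """Return a list of (review_text, label) pairs read from basepath."""
    rows = []
    for split in ('test', 'train'):
        for name, label in LABELS.items():
            folder = os.path.join(basepath, split, name)
            for fname in sorted(os.listdir(folder)):
                path = os.path.join(folder, fname)
                with open(path, encoding='utf-8') as f:
                    rows.append((f.read(), label))
    return rows


def write_movie_csv(basepath, out_path, seed=0):
    """Shuffle the reviews and write them to a review,sentiment CSV file."""
    rows = collect_reviews(basepath)
    # Shuffle so that a later "first half / second half" split mixes classes
    random.Random(seed).shuffle(rows)
    with open(out_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['review', 'sentiment'])
        writer.writerows(rows)
    return len(rows)
```

A call such as write_movie_csv('aclImdb', './datasets/movie/movie_data.csv') would then create the file read below, provided the output directory already exists. The shuffle matters because the reviews are read folder by folder, so an unshuffled file would be ordered by class.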

import pandas as pd
df = pd.read_csv('./datasets/movie/movie_data.csv')
print('Excerpt of the movie dataset', df.head(3))

('Excerpt of the movie dataset',                                               review  sentiment
0  In 1974, the teenager Martha Moxley (Maggie Gr...          1
1  OK... so... I really like Kris Kristofferson a...          0
2  ***SPOILER*** Do not read this, if you think a...          0)

2. Introducing the bag-of-words model

2.1 Transforming words into feature vectors

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining and the weather is sweet'])
bag = count.fit_transform(docs)
print('Vocabulary', count.vocabulary_)
print('bag.toarray()', bag.toarray())

('Vocabulary', {u'and': 0, u'weather': 6, u'sweet': 4, u'sun': 3, u'is': 1, u'the': 5, u'shining': 2})

('bag.toarray()', array([[0, 1, 1, 1, 0, 1, 0],
       [0, 1, 0, 0, 1, 1, 1],
       [1, 2, 1, 1, 1, 2, 1]], dtype=int64))
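To make the printed vocabulary and count matrix easier to read, here is a rough pure-Python sketch of what the bag-of-words step does. It assumes CountVectorizer's default behavior: lowercase the text, keep tokens of two or more word characters, and assign column indices in alphabetical order. The helper names tokenize and count_vector are made up for this illustration.

```python
import re

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining and the weather is sweet']


def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r'\b\w\w+\b', doc.lower())


# Map every unique token to a column index, in alphabetical order
unique_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: idx for idx, tok in enumerate(unique_tokens)}


def count_vector(doc):
    # One slot per vocabulary entry, incremented for every occurrence
    vec = [0] * len(vocabulary)
    for tok in tokenize(doc):
        vec[vocabulary[tok]] += 1
    return vec


bag = [count_vector(doc) for doc in docs]
for row in bag:
    print(row)
```

Each row is one document and each column is one word, so the 2 in the third row at index 1 means that 'is' occurs twice in the third sentence.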

2.2 Assessing word relevancy via term frequency-inverse document frequency

from sklearn.feature_extraction.text import TfidfTransformer
np.set_printoptions(precision=2)
tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[ 0.    0.43  0.56  0.56  0.    0.43  0.  ]
 [ 0.    0.43  0.    0.    0.56  0.43  0.56]
 [ 0.4   0.48  0.31  0.31  0.31  0.48  0.31]]
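The numbers above can be checked by hand. The sketch below recomputes them from the count matrix of section 2.1, assuming the smoothed formula that scikit-learn documents for smooth_idf=True, idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1, followed by L2 normalization of each row. The variable and function names are chosen for this illustration only.

```python
import math

# Raw term counts from section 2.1 (rows = documents, columns = words)
counts = [[0, 1, 1, 1, 0, 1, 0],
          [0, 1, 0, 0, 1, 1, 1],
          [1, 2, 1, 1, 1, 2, 1]]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents does each word occur?
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]


def tfidf_row(row):
    # Weight each count by its idf, then scale the row to unit L2 length
    raw = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


tfidf_matrix = [tfidf_row(row) for row in counts]
for row in tfidf_matrix:
    print([round(v, 2) for v in row])
```

Words that occur in every document, such as 'is' and 'the', get the minimum idf of 1, which is why they score 0.43 in the first row while the less common 'sun' and 'shining' score 0.56 despite the same raw count.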

2.3 Cleaning text data

print('Excerpt:\n\n', df.loc[0, 'review'][-50:])

('Excerpt:\n\n', 'is seven.<br /><br />Title (Brazil): Not Available')

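import re


def preprocessor(text):
    # strip HTML markup such as <br />
    text = re.sub(r'<[^>]*>', '', text)
    # collect emoticons such as :) ;-( =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # lowercase, turn non-word runs into spaces, append emoticons without '-'
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return text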

print('Preprocessor on Excerpt:\n\n', preprocessor(df.loc[0, 'review'][-50:]))

('Preprocessor on Excerpt:\n\n', 'is seven title brazil not available')

res = preprocessor("</a>This :) is :( a test :-)!")
print('Preprocessor on "</a>This :) is :( a test :-)!":\n\n', res)
df['review'] = df['review'].apply(preprocessor)

('Preprocessor on "</a>This :) is :( a test :-)!":\n\n', 'this is a test :) :( :)')

2.4 Processing documents into tokens

from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()


def tokenizer(text):
    return text.split()


def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]


t1 = tokenizer('runners like running and thus they run')
print("Tokenize: 'runners like running and thus they run'")
print(t1)

t2 = tokenizer_porter('runners like running and thus they run')
print("\nPorter-Tokenize: 'runners like running and thus they run'")
print(t2)

Tokenize: 'runners like running and thus they run'

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

Porter-Tokenize: 'runners like running and thus they run'

[u'runner', 'like', u'run', 'and', u'thu', 'they', 'run']

3. Training a logistic regression model for document classification

import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')
stop = stopwords.words('english')

# .loc slices include the end label, so stop at 24999 to keep the
# 25,000 training rows disjoint from the 25,000 test rows
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values


tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf': [False],
               'vect__norm': [None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

# liblinear supports both the 'l1' and 'l2' penalties searched above
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0,
                                                solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

gs_lr_tfidf.fit(X_train, y_train)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

CV Accuracy: 0.897

clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.899

Reference: Python Machine Learning
