Text analysis has become important for revealing information hidden in text content. With advances in machine learning algorithms, text analysis is now a flexible and interesting area of natural language processing. The purpose of this project was to predict a game's rating from the comments left by its users. For the analysis, a 30,000-row unbalanced random sample and a 20,000-row balanced sample were taken from the original BoardGameGeek data. To obtain the best accuracy, experiments were run with MNB, MNB with n-grams, linear SVC, and ensemble models combining the balanced and unbalanced SVCs under both a joining condition and a VotingClassifier condition. On the unbalanced data set, the accuracies were: MNB 27% (with hyperparameter alpha = 1), MNB with n-grams 27%, and SVC 29%. However, because these models were trained on unbalanced data, they failed to capture most of the negative ratings (<5) and the very high positive ratings (>8). To overcome this problem, a 20,000-row balanced sample was created by taking 2,000 samples from each rating. The SVC model was then trained on the balanced sample and used to predict on the unbalanced data set. The balanced training model's accuracy was lower than the unbalanced models' (balanced SVC 23%, and 20% on the unbalanced test set), but it captured all rating categories. Only the SVC was trained at this stage, since SVC outperformed MNB and MNB with n-grams in prediction accuracy. Two types of ensemble models were built: the voting ensemble reached 29% accuracy, while the joining ensemble reached 66%, an outstanding result for this project. In this best model, the balanced and unbalanced SVC models were joined to predict on the unbalanced data. This study concludes that the ensemble model performs best among all models tested, with the highest accuracy. The main challenges of this project were selecting the sample size, finding the best machine learning algorithm, and implementing it properly.
The Board Game Geek (BGG) database is a collection of data and information on traditional board games. The game information is recorded for posterity, historical research, and user-contributed ratings. All the information in the database is entered meticulously and voluntarily, on a game-by-game basis, by board game users, and it is freely available through flexible queries and "data mining". BoardGameGeek's ranking charts are ordered by the BGG Rating, which is based on the average rating; ratings are on a scale of 1 to 10 and express the user's sentiment. Understanding a game's popularity therefore depends on the information provided by users, which is very important. In this project, board game reviews were used to predict game ratings with machine learning algorithms. Several algorithms (MNB, MNB with n-grams, SVC, and ensemble models) were used for the whole project.
The original board game data is vast (a 1 GB review file with 13,170,073 rows) and time-consuming to clean, demanding a lot of computer memory. To avoid this complexity, the project works with a 30,000-row unbalanced sample and a 20,000-row balanced sample for the analysis and model development. The file contains, at minimum, the user, rating, and comment fields used throughout this analysis.
The main purpose of the project is to predict the rating of a game from the given reviews, to understand how text-classification machine learning algorithms work, and to improve on the outputs of existing references. A secondary purpose is to provide good documentation of the whole process.
The Naive Bayes classifier is a simple probabilistic classifier based on Bayes' theorem with strong (naive) independence assumptions. It is one of the most basic text classification techniques, with applications in email spam detection, personal email sorting, document categorization, sexually explicit content detection, language detection, and sentiment detection. Despite its naive design and oversimplified assumptions, Naive Bayes performs well in many complex real-world problems.
The Multinomial Naive Bayes (MNB) algorithm has been widely used in text classification due to its computational advantage and simplicity. MNB maximizes likelihood rather than conditional likelihood or accuracy. The task of text classification can be approached from a Bayesian learning perspective, which assumes that the word distributions in documents are generated by a specific parametric model whose parameters can be estimated from the training data. The equation below shows the Multinomial Naive Bayes (MNB) model, one such parametric model commonly used in text classification:

$$P(c \mid d) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)^{f_i}$$

where $f_i$ is the number of occurrences of a word $w_i$ in a document $d$, $P(w_i \mid c)$ is the conditional probability that a word $w_i$ may occur in a document given the class value $c$, and $n$ is the number of unique words appearing in the document $d$. The conditional probability $P(w_i \mid c)$ can be determined using the relative frequency of the word $w_i$ in documents belonging to class $c$:

$$P(w_i \mid c) = \frac{f_{ic}}{f_c}$$

where $f_{ic}$ is the number of times the word $w_i$ appears in all documents with the class label $c$, and $f_c$ is the total number of words in documents with class label $c$ in the training set $T$.
One advantage of the Multinomial Naive Bayes model is that it can make predictions efficiently.
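As a quick illustration, here is a minimal sketch of MNB on a made-up toy corpus (the comments and labels below are assumptions for illustration, not BGG data):
# Minimal MNB sketch on a made-up toy corpus (illustration only)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
toy_comments = ["great fun strategy", "boring and slow", "fun with friends", "slow boring rules"]
toy_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative
vec = CountVectorizer()
X_toy = vec.fit_transform(toy_comments)       # word-count features f_i
clf = MultinomialNB().fit(X_toy, toy_labels)  # estimates P(w_i|c) from relative word frequencies (with smoothing)
print(clf.predict(vec.transform(["fun strategy"])))  # -> [1]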
An n-gram is defined either as a textual sequence of length n or, equivalently, as a sequence of n adjacent 'textual units', in both cases extracted from a particular document. A 'textual unit' can be identified at the byte, character, or word level, depending on the context of interest. N-grams are a basic, statistically based method for text categorization, where n is the number of tokens grouped together when dividing the input text. Depending on n, the n-grams are called 2-grams (bigrams), 3-grams (trigrams), and so on.
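For example (a toy sentence assumed purely for illustration), scikit-learn's CountVectorizer can extract word-level 1- and 2-grams:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
vec.fit(["this game is not fun"])
print(vec.get_feature_names_out())  # newer scikit-learn; older versions use get_feature_names()
# -> ['fun' 'game' 'game is' 'is' 'is not' 'not' 'not fun' 'this' 'this game']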
Linear SVM is an extremely fast machine learning algorithm for solving multiclass classification problems on very large data sets, commonly trained with a cutting-plane style optimization algorithm. The objective of a Linear SVC (Support Vector Classifier) is to fit the data and return a "best fit" hyperplane that divides, or categorizes, the training data. Once the hyperplane is obtained, test samples can be fed to the model to obtain their predicted class.
SVM uses a kernel function to find the hyperplane that separates the classes with the maximum margin. The data points lying closest to this boundary (the support vectors) are what determine the decision boundary between the two classes.
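A minimal sketch of this idea on two made-up 2-D point clouds (illustration only, not BGG data):
import numpy as np
from sklearn.svm import LinearSVC
X_toy = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
clf = LinearSVC().fit(X_toy, y_toy)   # learns the separating hyperplane w.x + b = 0
print(clf.coef_, clf.intercept_)
print(clf.predict([[2, 2], [6, 6]]))  # -> [0 1]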
Ensemble modeling is a process where multiple diverse models are created to predict an outcome, either by using different modeling algorithms or by using different training data sets. The ensemble model then aggregates the predictions of its base models into one final prediction for the unseen data. The motivation for using ensemble models is to reduce the generalization error of the prediction: every model has its strengths and weaknesses, and combining individual models can help hide the weaknesses of any single one.
Voting classification techniques in an ensemble predict based on majority votes. For example, if we use three models and they predict [1, 0, 1] for the target variable, the final prediction of the ensemble would be 1, since two out of the three models predicted 1.
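A minimal sketch of this majority (hard-voting) rule on the example above:
from collections import Counter
model_predictions = [1, 0, 1]  # one prediction per base model for a single sample
majority_vote = Counter(model_predictions).most_common(1)[0][0]
print(majority_vote)  # -> 1, since two of the three models predicted 1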
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
import random
import string
from collections import Counter
from math import sqrt

import numpy as np
import pandas as pd
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
from nltk.corpus import stopwords

from sklearn import svm, linear_model
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, r2_score,
                             mean_squared_error, mean_absolute_error)
from sklearn.model_selection import (KFold, train_test_split,
                                     cross_val_score, cross_val_predict)

sns.set(color_codes=True)
review_data0 = pd.read_csv('../input/boardgamegeek-reviews/bgg-13m-reviews.csv', index_col=0)
review_data0.head()
We will use .shape to see the number of rows and columns in our data file.
review_data0.shape
So, remove all rows with NaN in the comment column:
# Keep only rows whose comment is present; na=True marks missing comments, which ~ then drops
# (comments containing the literal text "NaN" are also dropped)
review_data2=review_data0[~review_data0.comment.str.contains("NaN",na=True)]
review_data2.head()
review_data2.shape
review_data2.describe()
#plot histogram of ratings
num_bins = 70
n, bins, patches = plt.hist(review_data2.rating, num_bins, facecolor='green', alpha=0.9)
#plt.xticks(range(9000))
plt.title('Histogram of Ratings')
plt.xlabel('Ratings')
plt.ylabel('Count')
plt.show()
review_data2.head()
review_data3=review_data2.sample(n=30000)
review_data3.head()
review_data3.dtypes
review_data3.isna().sum()
# Count words per comment (str.len alone would count characters, so split into words first)
review_data3['word_count'] = review_data3.comment.str.split().str.len()
num_bins = 70
n, bins, patches = plt.hist(review_data3.word_count, num_bins, facecolor='green', alpha=0.9)
#plt.xticks(range(9000))
plt.title('Histogram of Word Count')
plt.xlabel('Word Count')
plt.ylabel('Count')
plt.show()
#lowercase and remove punctuation
review_data3['cleaned'] = review_data3['comment'].str.lower().apply(lambda x:''.join([i for i in x if i not in string.punctuation]))
# stopword list to use
stopwords_list = stopwords.words('english')
stopwords_list.extend(('game','play','played','players','player','people','really','board','games','one','plays','cards','would'))
stopwords_list[-10:]
#remove stopwords
review_data3['cleaned'] = review_data3['cleaned'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords_list)]))
review_data3.head()
We lower-cased all the words in the comments and removed punctuation and stop words to get unique, meaningful, and clean text for the analysis. Lower-casing maps different spellings of the same word to one format, and stop words carry no meaningful significance. A quick check of the cleaning is shown below.
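# Quick check of the cleaning steps on one made-up comment (illustration only),
# reusing string and stopwords_list from above
example = "This game is REALLY fun, but the rules are long!"
lowered = ''.join(ch for ch in example.lower() if ch not in string.punctuation)
print(' '.join(w for w in lowered.split() if w not in stopwords_list))  # -> 'fun rules long'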
num_bins = 70
n, bins, patches = plt.hist(review_data3.rating, num_bins, facecolor='green', alpha=0.9)
#plt.xticks(range(9000))
plt.title('Histogram of Ratings')
plt.xlabel('Ratings')
plt.ylabel('Count')
plt.show()
So it is clear that our unbalanced sample has a rating distribution similar to the original data.
Counter(" ".join(review_data3["cleaned"]).split()).most_common(50)[:50]
from wordcloud import WordCloud
neg = review_data3.loc[review_data3['rating'] < 3]
pos = review_data3.loc[review_data3['rating'] > 8]
words = Counter([w for w in " ".join(pos['cleaned']).split()])
wc = WordCloud(width=400, height=350,colormap='plasma',background_color='white').generate_from_frequencies(dict(words.most_common(100)))
plt.figure(figsize=(20,15))
plt.imshow(wc, interpolation='bilinear')
plt.title('Common Words in Positive Reviews', fontsize=20)
plt.axis('off');
plt.show()
words = Counter([w for w in " ".join(neg['cleaned']).split()])
wc = WordCloud(width=400, height=350,colormap='plasma',background_color='white').generate_from_frequencies(dict(words.most_common(100)))
plt.figure(figsize=(20,15))
plt.imshow(wc, interpolation='bilinear')
plt.title('Common Words in Negative Reviews', fontsize=20)
plt.axis('off');
plt.show()
print('Mean: ', review_data3.rating.mean())
print('Median: ', review_data3.rating.median())
print('Mode: ', review_data3.rating.mode())
def calc_rmse(errors, weights=None):
    n_errors = len(errors)
    if weights is None:
        result = sqrt(sum(error ** 2 for error in errors) / n_errors)
    else:
        result = sqrt(sum(weight * error ** 2 for weight, error in zip(weights, errors)) / sum(weights))
    return result
#if the score is far from mean (high or low scores), weight those reviews and ratings more when assessing model accuracy
def calc_weights(scores):
    peak = 6.851  # roughly the mean rating of the data set (see the mean printed above)
    return tuple((10 ** (0.3556 * (peak - score))) if score < peak else (10 ** (0.2718 * (score - peak))) for score in scores)
def assess_model(model_name, test, predicted):
    error = test - predicted
    rmse = calc_rmse(error)
    mae = mean_absolute_error(test, predicted)
    weights = calc_weights(test)
    weighted_rmse = calc_rmse(error, weights=weights)
    print(model_name)
    print('RMSE:', rmse)
    print('Weighted RMSE:', weighted_rmse)
    print('MAE:', mae)
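As a quick sanity check of these helpers on made-up numbers:
# Errors of +1 and -1 give RMSE 1.0, and ratings far from the peak get weights > 1
print(calc_rmse([1, -1]))                # -> 1.0
print(calc_weights([1.0, 6.851, 10.0]))  # the lowest rating gets the largest weight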
The 30,000-row unbalanced sample was split into train and test sets for modeling, and a pipeline was then used to tune the model.
count_vectorizer - breaks the text up into a matrix, with each word (called a "token" in NLP) as a column and the count of occurrences as its value.
ngram_range - optional parameter to extract the text in groups of two or more adjacent words. This is useful because modifiers such as 'not' can change the following word's meaning.
Model performance will be judged by the accuracy value.
X_train, X_test, y_train, y_test = train_test_split(review_data3.cleaned, review_data3.rating, random_state=44,test_size=0.20)
model_nb = Pipeline([
('count_vectorizer', CountVectorizer(lowercase = True, stop_words = stopwords.words('english'))),
('tfidf_transformer', TfidfTransformer()), #weighs terms by importance to help with feature selection
('classifier', MultinomialNB()) ])
model_nb.fit(X_train,y_train.astype('int'))
labels = model_nb.predict(X_test)
mat = confusion_matrix(y_test.astype('int'), labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()
assess_model("Multinomial NB", y_test,labels)
acc = accuracy_score(y_test.astype('int'),labels, normalize=True) * float(100)
print('\n****Test accuracy is',(acc))
#Experimented with adding different numbers of n-grams, 1-2 seems to have best performance
model_nb2 = Pipeline([
('count_vectorizer', CountVectorizer( ngram_range=(1,2), lowercase = True, stop_words = stopwords.words('english'))),
('tfidf_transformer', TfidfTransformer()), #weighs terms by importance to help with feature selection
('classifier', MultinomialNB()) ])
model_nb2.fit(X_train,y_train.astype('int'))
labels = model_nb2.predict(X_test)
mat = confusion_matrix(y_test.astype('int'), labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()
assess_model("Multinomial NB n-grams 1-2", y_test,labels)
acc = accuracy_score(y_test.astype('int'),labels, normalize=True) * float(100)
print('\n****Test accuracy is',(acc))
# Convert the text data into TF-IDF vector format
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))
tf_idf_train = tf_idf_vect.fit_transform(X_train)
tf_idf_test = tf_idf_vect.transform(X_test)
# Candidate alpha values; note that sklearn warns on alpha=0 and clips it to a tiny positive value
alpha_range = list(np.arange(0,30,1))
len(alpha_range)
y_train=y_train.astype('int')
alpha_scores=[]
for a in alpha_range:
    clf = MultinomialNB(alpha=a)
    scores = cross_val_score(clf, tf_idf_train, y_train, cv=5, scoring='accuracy')
    alpha_scores.append(scores.mean())
    print(a, scores.mean())
# Misclassification error = 1 - mean CV accuracy for each alpha
MSE = [1 - x for x in alpha_scores]
optimal_alpha_bnb = alpha_range[MSE.index(min(MSE))]
# plot misclassification error vs alpha
plt.plot(alpha_range, MSE)
plt.xlabel('hyperparameter alpha')
plt.ylabel('Misclassification Error')
plt.show()
optimal_alpha_bnb
model_nb = Pipeline([
('count_vectorizer', CountVectorizer(lowercase = True, stop_words = stopwords.words('english'))),
('tfidf_transformer', TfidfTransformer()), #weighs terms by importance to help with feature selection
('classifier', MultinomialNB(alpha=optimal_alpha_bnb)) ])
model_nb.fit(X_train,y_train.astype('int'))
labels = model_nb.predict(X_test)
mat = confusion_matrix(y_test.astype('int'), labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()
assess_model("Multinomial NB", y_test,labels)
acc = accuracy_score(y_test.astype('int'),labels, normalize=True) * float(100)
print('\n****Test accuracy is',(acc))
# Linear-kernel SVC; probability=True is needed for soft voting in the ensemble later
model_svc = make_pipeline(TfidfVectorizer(ngram_range=(1,3)), svm.SVC(kernel="linear",probability=True))
model_svc.fit(X_train, y_train.astype('int'))
labels = model_svc.predict(X_test)
mat = confusion_matrix(y_test.astype('int'), labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()
assess_model("Linear SVC model", y_test,labels)
acc = accuracy_score(y_test.astype('int'),labels, normalize=True) * float(100)
print('\n****Test accuracy is',(acc))
review_data2.head()
# Build the balanced sample: draw 2,000 reviews for each integer rating value 1-10
balanced_parts = []
for r in range(1, 11):
    rating_subset = review_data2[review_data2['rating'] == r]
    balanced_parts.append(rating_subset.sample(2000))
review_balance = pd.concat(balanced_parts)  # pd.concat replaces the deprecated DataFrame.append
review_balance.head()
review_balance.shape
#lowercase and remove punctuation
review_balance['cleaned'] = review_balance['comment'].str.lower().apply(lambda x:''.join([i for i in x if i not in string.punctuation]))
# stopword list to use
stopwords_list = stopwords.words('english')
stopwords_list.extend(('game','play','played','players','player','people','really','board','games','one','plays','cards','would'))
stopwords_list[-10:]
#remove stopwords
review_balance['cleaned'] = review_balance['cleaned'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords_list)]))
review_balance.head()
#plot histogram of ratings
num_bins = 70
n, bins, patches = plt.hist(review_balance.rating, num_bins, facecolor='green', alpha=0.9)
#plt.xticks(range(9000))
plt.title('Histogram of Ratings')
plt.xlabel('Ratings')
plt.ylabel('Count')
plt.show()
X_train1, X_test1, y_train1, y_test1 = train_test_split(review_balance.cleaned, review_balance.rating, test_size=0.20)
model_svc_balance = make_pipeline(TfidfVectorizer(ngram_range=(1,3)), svm.SVC(kernel="linear",probability=True))
model_svc_balance.fit(X_train1, y_train1.astype('int'))
labels = model_svc_balance.predict(X_test1)
mat = confusion_matrix(y_test1.astype('int'), labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()
assess_model("Linear SVC Balanced model", y_test1,labels)
acc = accuracy_score(y_test1.astype('int'),labels, normalize=True) * float(100)
print('\n****Test accuracy is',(acc))
Although the balanced model now captures every rating category, its accuracy is lower than with the unbalanced sample.
# Evaluate the balanced-trained SVC on a fresh split of the unbalanced sample
# (review_balance and review_data3 are both drawn from review_data2, so some overlap is possible)
X_train, X_test, y_train, y_test = train_test_split(review_data3.cleaned, review_data3.rating, test_size=0.20)
labels = model_svc_balance.predict(X_test)
mat = confusion_matrix(y_test.astype('int'), labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()
assess_model("Linear SVC model", y_test,labels)
acc = accuracy_score(y_test.astype('int'),labels, normalize=True) * float(100)
print('\n****Test accuracy of the balanced SVC on unbalanced data is',(acc))
X_train, X_test, y_train, y_test = train_test_split(review_data3.cleaned, review_data3.rating, test_size=0.20)
# Soft-voting ensemble of the unbalanced and balanced SVC pipelines.
# Note that VotingClassifier.fit re-fits clones of both estimators on this (unbalanced) training split.
Ensemble = VotingClassifier(estimators=[('model_svc_unbalance', model_svc), ('model_svc_balance', model_svc_balance)],
                            voting='soft',
                            weights=[3, 1])
Ensemble.fit(X_train, y_train.astype(int))
labels = Ensemble.predict(X_test)
mat = confusion_matrix(y_test.astype(int), labels)
ax = sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()
assess_model("Ensemble model", y_test,labels)
acc = accuracy_score(y_test.astype('int'),labels, normalize=True) * float(100)
print('\n****Test accuracy of Ensemble SVC is',(acc))
X_train, X_test, y_train, y_test = train_test_split(review_data3.cleaned, review_data3.rating, test_size=0.20)
labels = model_svc.predict(X_test)
labels_2 = model_svc_balance.predict(X_test)
pred = pd.concat([pd.DataFrame(y_test).reset_index().rating,pd.Series(labels),pd.Series(labels_2)],axis=1)
pred.columns = ['rating','model_1','model_2']
# Joining rule: keep the balanced model's prediction when it is extreme (<3 or >9),
# otherwise use the unbalanced model's prediction
pred['final'] = np.where(pred.model_2 >= 3, np.where(pred.model_2 <= 9, pred.model_1, pred.model_2), pred.model_2)
#pred['final'] = np.where(pred.model_2 <= 9, pred.model_1, pred.model_2)
pred.tail()
mat = confusion_matrix(pred.rating.astype(int), pred.final)
ax = sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()
assess_model("Ensemble model", pred.rating,pred.final)
acc = accuracy_score(pred.rating.astype(int),pred.final, normalize=True) * float(100)
print('\n****Test accuracy of Ensemble SVC is',(acc))