Kaggle: Detecting Insults in Social Commentary
I competed in the following competition (up to the milestone deadline):
http://www.kaggle.com/c/detecting-insults-in-social-commentary
I was focusing on another competition, the Jubatus Challenge, so I did not compete for the final result. The following is the code I used for the competition.
I used LIBSVM with a binary bag-of-words feature vector.
To preprocess the test data, I used nltk's PunktWordTokenizer(). I did not remove stop words at this stage, since they had already been removed from the training data and therefore never enter the feature dictionary anyway.
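As an illustration before the full script, here is a minimal sketch of how one comment becomes the sparse {index: 1} dictionary that LIBSVM's Python interface accepts. It uses nltk.word_tokenize as a stand-in, since PunktWordTokenizer has been removed from recent versions of nltk, and the vocabulary shown is made up for the example.

from nltk.tokenize import word_tokenize  # stand-in for PunktWordTokenizer

# Hypothetical vocabulary mapping words to LIBSVM feature indices (>= 1).
vocab = {"you": 1, "are": 2, "nice": 3}

comment = "You are nice. Really, you are."
tokens = [w.lower() for w in word_tokenize(comment.replace(".", " "))]

# Binary bag-of-words: 1 for every vocabulary word present in the comment.
features = {vocab[w]: 1 for w in tokens if w in vocab}
print(features)  # {1: 1, 2: 1, 3: 1}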
# -*- coding:utf-8 -*-
import sys, csv
sys.path.append('/home/fujinuma/libsvm-3.1/python/')
from svm import *
from svmutil import *
from nltk.tokenize.punkt import PunktWordTokenizer

def posdic_entry(word, dic, i):
    # Assign the next unused feature index to a previously unseen word.
    if word not in dic:
        dic[word] = i
        i = i + 1
    return i

def neg_entry(line, dic, i):
    # Register every token of a comment in the word -> feature-index
    # dictionary (used for both classes despite the name).
    line = line.replace(".", " ")
    words = PunktWordTokenizer().tokenize(line)  # excludes quotation marks
    for word in words:
        i = posdic_entry(word.lower(), dic, i)
    return i

if __name__ == '__main__':
    text0 = open("train0.txt", "r")  # non-insulting comments
    text1 = open("train1.txt", "r")  # insulting comments
    test = open("test.csv", "r")
    label = []  # target values of the training data
    data = []   # one sparse {feature index: 1} dict per comment
    dic = {}    # word -> feature index
    i = 1       # LIBSVM feature indices start at 1

    for line in text1:
        label.append(1)
        i = neg_entry(line, dic, i)
        line = line.replace(".", " ")
        dic_revised = {}
        for word in PunktWordTokenizer().tokenize(line):
            dic_revised[dic[word.lower()]] = 1  # binary bag-of-words
        data.append(dic_revised)

    for line in text0:
        label.append(-1)
        i = neg_entry(line, dic, i)
        line = line.replace(".", " ")
        dic_revised = {}
        for word in PunktWordTokenizer().tokenize(line):
            dic_revised[dic[word.lower()]] = 1
        data.append(dic_revised)

    problem = svm_problem(label, data)
    parameter = svm_parameter('-s 0 -t 0')  # C-SVC with a linear kernel
    t = svm_train(problem, parameter)

    outWriter = csv.writer(open('submit_svm_balanced.csv', 'wb'))
    outWriter.writerow(('Insult', 'Date', 'Comment'))
    reading = csv.reader(test)
    next(reading)  # skip the header line
    for row in reading:
        test_dic = {}
        string = row[1][:-1].replace("_", " ")
        string = string.replace("-", " ")
        string = string.replace(".", " ")
        string = string.replace("xa0", " ")  # literal escape residue in the raw data
        for word in PunktWordTokenizer().tokenize(string):
            word = word.lower()
            if word not in dic:
                continue  # skip words never seen in training
            test_dic[dic[word]] = 1
        if not test_dic:
            prediction = 0.0  # no known words: default to non-insulting
        else:
            p_labs, p_acc, p_vals = svm_predict([0], [test_dic], t)
            prediction = p_labs[0]
        if prediction == -1.0:
            prediction = 0.0  # map the SVM's -1 label to the submission's 0
        outWriter.writerow((prediction, row[0], row[1]))
train0.txt consists of non-insulting comments preprocessed with the R package 'tm'; train1.txt consists of insulting comments preprocessed the same way. I used an R package because I was not familiar with nltk at the time and the sample code used 'tm'.
The following code, which preprocesses the training data, is adapted from http://stackoverflow.com/questions/7927367/r-text-file-and-text-mining-how-to-load-data
library(tm)
setwd('F:/My Documents/My texts')
# Build a corpus from the text files and normalize the comments.
a <- Corpus(DirSource("/My Documents/My texts"), readerControl = list(language = "lat"))
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a, stripWhitespace)
a <- tm_map(a, tolower)
a <- tm_map(a, removeWords, stopwords("english"))
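For comparison (and as a note to my future self), a rough nltk equivalent of the tm pipeline above might look like the sketch below. This is not the code I used; the regular expressions are my own approximation of removeNumbers and removePunctuation, and it assumes the nltk stopwords corpus has been downloaded.

import re
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

def preprocess(text):
    # Approximate the tm pipeline: lowercase, drop numbers and punctuation,
    # collapse whitespace, and remove English stop words.
    text = text.lower()
    text = re.sub(r"[0-9]+", " ", text)   # removeNumbers
    text = re.sub(r"[^\w\s]", " ", text)  # removePunctuation
    stop = set(stopwords.words("english"))
    return " ".join(w for w in text.split() if w not in stop)

print(preprocess("You are 100% WRONG, my friend!"))  # -> "wrong friend"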
The precision was around 60%, which placed me around 100th out of 120 teams.
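In hindsight, I could have estimated this locally before submitting: LIBSVM's -v option runs n-fold cross-validation, and with it svm_train returns the cross-validation accuracy instead of a model. A minimal sketch, reusing the label and data lists built in the script above:

# With '-v 5', svm_train performs 5-fold cross-validation and returns
# the accuracy (a float) rather than a trained model.
cv_parameter = svm_parameter('-s 0 -t 0 -v 5')
accuracy = svm_train(svm_problem(label, data), cv_parameter)
print(accuracy)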
I should have included the timestamp (the Date column) as a feature to improve the precision.
I need a lot more effort and practice with nltk.