Kaggle: Detecting Insults in Social Commentary

I competed in the following competition (up to the milestone deadline):
http://www.kaggle.com/c/detecting-insults-in-social-commentary

I was focusing on another competition, the Jubatus Challenge, so I did not compete for the final result.

The following code was used for the competition.
I used LIBSVM with a bag-of-words feature vector.
For preprocessing the test data, I used NLTK's PunktWordTokenizer. I did not remove stop words there, since they had already been removed from the training data.
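
To make the representation concrete, here is a minimal sketch of how a single comment becomes the sparse {feature_index: 1} dictionary that LIBSVM's Python interface accepts. The example comment and the inline word-to-index mapping are mine; the tokenizer and the binary bag-of-words encoding mirror the full script below.

# Minimal bag-of-words sketch (illustrative input text, not from the data set)
from nltk.tokenize.punkt import PunktWordTokenizer

dic = {}                          # word -> feature index; LIBSVM indices start at 1
features = {}                     # sparse vector for one comment
comment = "you have no idea what you are talking about"
for word in PunktWordTokenizer().tokenize(comment.lower()):
    if word not in dic:
        dic[word] = len(dic) + 1  # assign the next unused index
    features[dic[word]] = 1       # binary presence feature, no counts
print features                    # e.g. {1: 1, 2: 1, 3: 1, ...}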

# -*- coding:utf-8 -*-

import sys, csv
sys.path.append('/home/fujinuma/libsvm-3.1/python/')
from svm import *       # libsvm-3.1's python/ directory provides svm.py ...
from svmutil import *   # ... and svmutil.py on the path appended above
from nltk.tokenize.punkt import PunktWordTokenizer

def posdic_entry(word, dic, i):
    # assign the next feature index to an unseen word and return the updated counter
    if word not in dic:
        dic[word] = i
        i = i + 1
    return i

def pos_entry(line, dic, i):
    # register every token of a comment in the word-index dictionary
    words = PunktWordTokenizer().tokenize(line)  # excluding quotation marks
    for word in words:
        i = posdic_entry(word, dic, i)
    return i

def neg_entry(line, dic, i):
    # register every token of a comment, lowercased, in the word-index dictionary
    line = line.replace(".", " ")
    words = PunktWordTokenizer().tokenize(line)  # excluding quotation marks
    for word in words:
        word = word.lower()
        i = posdic_entry(word, dic, i)
    return i
        

if __name__ == '__main__':              # run only when executed as a script
    dic = {}                            # word -> feature index
    text0 = open("train0.txt", "r")     # non-insulting training comments
    text1 = open("train1.txt", "r")     # insulting training comments
    test = open("test.csv", "r")
    label = []                          # target values of the training data
    data = []                           # sparse bag-of-words vectors, one dict per comment
    i = 1                               # LIBSVM feature indices start at 1
    for line in text1:
        label.append(1)
        i = neg_entry(line, dic, i)     # make sure every word has an index
        line = line.replace(".", " ")
        words = PunktWordTokenizer().tokenize(line)
        dic_revised = {}
        for word in words:
            word = word.lower()
            dic_revised[dic[word]] = 1  # add the feature instead of overwriting the dict
        data.append(dic_revised)
    for line in text0:
        label.append(-1)
        i = neg_entry(line, dic, i)
        line = line.replace(".", " ")
        words = PunktWordTokenizer().tokenize(line)
        dic_revised = {}
        for word in words:
            word = word.lower()
            dic_revised[dic[word]] = 1
        data.append(dic_revised)
    problem = svm_problem(label, data)
    parameter = svm_parameter('-s 0 -t 0')  # C-SVC with a linear kernel
    t = svm_train(problem, parameter)
    
    outWriter = csv.writer(open('submit_svm_balanced.csv', 'wb'))
    outWriter.writerow(('Insult', 'Date', 'Comment'))
    reading = csv.reader(test)
    next(reading)                            # skip the header line
    for row in reading:
        test_dic = {}
        string = row[1][:-1].replace("_", " ")
        string = string.replace("-", " ")
        string = string.replace(".", " ")
        string = string.replace("xa0", " ")  # strip literal escape fragments left in the raw text
        words = PunktWordTokenizer().tokenize(string)
        for word in words:
            word = word.lower()
            if word not in dic:
                continue                     # skip words never seen in training
            test_dic[dic[word]] = 1          # add the feature instead of overwriting the dict
        if test_dic == {}:
            prediction = 0.0                 # no known words: default to "not insulting"
        else:
            p_labs, p_acc, p_vals = svm_predict([0], [test_dic], t)
            prediction = p_labs[0]
            if prediction == -1.0:
                prediction = 0.0             # map the SVM's negative class back to 0
        print prediction
        outWriter.writerow((prediction, row[0], row[1]))

train0.txt consists of non-insulting comments preprocessed with the R package 'tm'. I used an R package because I was not familiar with NLTK at the time and the sample code I started from used 'tm'.
train1.txt consists of insulting comments preprocessed in the same way.

The following code is adapted from http://stackoverflow.com/questions/7927367/r-text-file-and-text-mining-how-to-load-data
and was used to preprocess the training data (a rough NLTK equivalent is sketched after it).

library(tm)
setwd('F:/My Documents/My texts') 
a<-Corpus(DirSource("/My Documents/My texts"), readerControl = list(language="lat"))
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a , stripWhitespace)
a <- tm_map(a, tolower)
a <- tm_map(a, removeWords, stopwords("english")) 
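
For comparison, roughly the same cleanup could be done in Python with NLTK instead of tm. This is only a sketch under my own assumptions (regex-based number and punctuation removal, NLTK's English stopword list); it is not the pipeline I actually used.

# Sketch of an NLTK/regex equivalent of the tm preprocessing above
import re
from nltk.corpus import stopwords

def preprocess(comment):
    comment = comment.lower()                        # tolower
    comment = re.sub(r"[0-9]+", " ", comment)        # removeNumbers
    comment = re.sub(r"[^\w\s]", " ", comment)       # removePunctuation
    comment = re.sub(r"\s+", " ", comment).strip()   # stripWhitespace
    stop = set(stopwords.words("english"))           # removeWords, stopwords("english")
    return " ".join(w for w in comment.split() if w not in stop)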

The precision was around 60%, which placed me around 100th out of 120 teams.
I should have included the comment timestamp as a feature to improve the precision; a rough sketch of that idea follows.
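
As a sketch of that idea, the Date column from test.csv could be turned into an extra LIBSVM feature appended after the word indices. The timestamp format assumed here (YYYYMMDDhhmmssZ) and the helper itself are illustrative assumptions, not something I implemented for the submission.

# Hypothetical helper: append a scaled hour-of-day feature to a sparse vector.
# Assumes Date strings look like 'YYYYMMDDhhmmssZ'; adjust the parsing otherwise.
def add_time_feature(feature_dict, date_string, num_word_features):
    if len(date_string) >= 10 and date_string[:10].isdigit():
        hour = int(date_string[8:10])                      # hour of day, 0-23
        feature_dict[num_word_features + 1] = hour / 24.0  # keep the value in [0, 1)
    return feature_dict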

I need a lot more effort and practice with NLTK.