Going Beyond Corr-LDA for Detecting Specific Comments on News & Blogs

Mrinal Kanti Das, Trapit Bansal, Chiranjib Bhattacharyya.
International Conference on Web Search and Data Mining (WSDM), 2014.

Download Paper

Abstract

Understanding user-generated comments in response to news and blog posts is an important area of research. After ignoring irrelevant comments, one finds that a large fraction, approximately 50%, of the comments are very specific and can be related to certain parts of the article rather than the entire story. For example, in a recent product review of the Google Nexus 7 in ArsTechnica (a popular blog), the reviewer discusses the prospect of a “Retina equipped iPad mini” in a few sentences. It is interesting that although the article is about the Nexus 7, a significant number of comments focus on this specific point regarding the “iPad”. We pose the problem of detecting such comments as the specific comments location (SCL) problem. SCL is an important open problem with no prior work.
SCL can be posed as a correspondence problem between comments and the parts of the relevant article, and one could potentially use Corr-LDA type models. Unfortunately, such models do not give satisfactory performance as they are restricted to using a single topic vector per article-comments pair. In this paper we go beyond the single topic vector assumption and propose a novel correspondence topic model, namely SCTM, which admits multiple topic vectors (MTV) per article-comments pair. The resulting inference problem is quite complicated because of MTV and has no off-the-shelf solution. One of the major contributions of this paper is to show that by using a stick-breaking process as a prior over MTV, one can derive a collapsed Gibbs sampling procedure, which empirically works well for SCL.
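To illustrate the kind of prior involved, here is a minimal sketch of a truncated stick-breaking construction, which draws a vector of topic proportions by repeatedly breaking off a Beta-distributed fraction of the remaining unit stick. This is a generic illustration of the stick-breaking idea, not the paper's SCTM sampler; the function name, the concentration parameter `alpha`, and the truncation level are all illustrative choices.

```python
import numpy as np

def stick_breaking(alpha, num_topics, rng):
    """Draw topic proportions via a truncated stick-breaking construction.

    Each step breaks off a Beta(1, alpha) fraction of the stick that is
    still left; the broken-off piece lengths are the topic weights.
    """
    # Fractions broken off at each step.
    betas = rng.beta(1.0, alpha, size=num_topics)
    # Length of stick remaining before each break: 1, (1-b1), (1-b1)(1-b2), ...
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    weights = betas * remaining
    # Absorb the leftover mass into the last weight so the vector sums to 1.
    weights[-1] = 1.0 - weights[:-1].sum()
    return weights

rng = np.random.default_rng(0)
w = stick_breaking(alpha=2.0, num_topics=10, rng=rng)
```

Smaller values of `alpha` concentrate mass on the first few topics, while larger values spread it out; in a model like SCTM, each article-comments pair would get its own such topic vector rather than a single shared one.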
SCTM is rigorously evaluated on three datasets, crawled from Yahoo! News (138,000 comments) and two ArsTechnica (AT) blogs, AT-Science (90,000 comments) and AT-Gadget (160,000 comments). We observe that SCTM performs better than Corr-LDA, not only on metrics like perplexity and topic coherence, but it also discovers more unique topics. This leads to an order-of-magnitude improvement in F1 score over Corr-LDA for SCL.

Datasets

Download All Datasets
Read Me for Data Format
ArsTechnica Science (Gold Standard)
ArsTechnica Gadgets
Yahoo! News

Code

Download or Fork on GitHub

Contact

Please feel free to contact “mrinal at csa dot iisc dot ernet dot in” or “trapit at csa dot iisc dot ernet dot in” with any queries or comments.
If you have any thoughts on or criticism of this work, we will be happy to discuss.