Topic Modeling and Visualization for Big Data in Social Sciences

2017 SIAM Conference on Computational Science and Engineering

Abstract. Topic modeling is a widely used approach for analyzing large text collections. In particular, Latent Dirichlet Allocation (LDA) is one of the most popular topic modeling methods; it groups vocabulary from a document corpus into latent "topics". However, learning meaningful topic models from massive collections containing millions of documents and billions of tokens is challenging, given the complexity of the data and the difficulty of distributing the computation across multiple computing nodes. In recent years, data processing frameworks and toolkits such as Spark, MALLET, and others have been developed to analyze large volumes of unlabeled text from various domains in a scalable and efficient manner. In this paper, we present a preliminary case study demonstrating the scholarship achieved in the study of political consumerism using XSEDE resources. The experimental study showcases the use of digitized social science data and text analytics toolkits to generate topic models and visualize topics, supporting intersectional research in sociology on the relationship between consumption and race, class, and gender. This comparative big data textual analysis, which combines JSTOR data, LDA modeling toolkits, visualization techniques, and computational components, is of particular importance to academic researchers working on social science applications involving big data.

Authors