Twitter data analysis using Spark

Yash Bopardikar

Thesis

Open access

Master of Science (MS), California State University, Sacramento

12/19/2016

Handle: https://hdl.handle.net/10211.3/182694

Abstract

Keywords: Apache Spark; Twitter data analysis; Natural language processing

The idea is to help people who face a dilemma when choosing one product or service over another. The objective is to collect data from Twitter feeds on trending topics such as operating systems, new technologies, and different products, and to categorize it as needed. The project will use Scala and a Spark cluster. All the data fetched will be stored in a MySQL database. The data will then be classified using MapReduce APIs and data mining algorithms, with the Naïve Bayes algorithm used for sentiment analysis. Twitter provides a streaming API for streaming real-time Twitter data. A Java library named Twitter4J is available to access the Streaming API and download Twitter data. The data is filtered by a supplied list of keywords. The analysis will employ a distributed data processing system, Apache Spark, running on a master node and several worker nodes. The cluster is scalable and can handle millions of records. To filter the large volume of data, a MapReduce technique will be applied on Spark. The input file will contain one JSON object per tweet. This file will be loaded into Spark, which will replicate the data and distribute it across multiple nodes; the mapper then takes all files present in the directory and classifies them according to the filter set. The filtered tweets will pass through the data mining algorithms, giving a one-to-one comparison of the data that can help with difficult decisions.
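The pipeline described in the abstract (keyword filtering of JSON tweets in a map step, grouping in a reduce step, then Naïve Bayes sentiment classification) can be illustrated with a minimal sketch. It is not the project's implementation: it uses plain Python rather than Scala on a Spark cluster so that it runs on a single machine, and the sample tweets, the training sentences, the `text` field name, and the names `filter_tweets`, `group_by_keyword`, and `NaiveBayes` are all assumptions made for illustration.

```python
import json
import math
from collections import Counter, defaultdict

# Hypothetical sample data, invented for this sketch.
TRAIN = [
    ("love this fast phone", "positive"),
    ("great battery and great screen", "positive"),
    ("hate this slow phone", "negative"),
    ("terrible battery and awful screen", "negative"),
]
TWEETS = [
    '{"text": "I love my Android phone, great screen"}',
    '{"text": "iOS update is slow and terrible"}',
    '{"text": "Nothing to do with phones"}',
]


def filter_tweets(json_lines, keywords):
    """Map step: parse one JSON object per line and emit
    (keyword, text) for every keyword found in the tweet text."""
    for line in json_lines:
        text = json.loads(line).get("text", "")
        lowered = text.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                yield keyword, text


def group_by_keyword(pairs):
    """Reduce step: collect the matching tweet texts under each keyword."""
    grouped = defaultdict(list)
    for keyword, text in pairs:
        grouped[keyword].append(text)
    return dict(grouped)


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, add-one smoothing."""

    def fit(self, labeled_texts):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        for text, label in labeled_texts:
            self.label_counts[label] += 1
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n / total)
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes().fit(TRAIN)
grouped = group_by_keyword(filter_tweets(TWEETS, ["android", "ios"]))
# Sentiment label for each filtered tweet, organized per keyword.
results = {
    keyword: [(text, model.predict(text)) for text in texts]
    for keyword, texts in grouped.items()
}
```

Here `results` maps each keyword to its matching tweets and their predicted labels, which is the kind of side-by-side comparison the abstract refers to. On Spark, the map and reduce steps would presumably be expressed as transformations on a distributed dataset rather than as Python generators.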

Files and links (1)

Bopardikar_Yash_Masters_project_Report_508CompliantCopy (pdf, 5.67 MB)

This document has been made accessible/508 compliant by Sacramento State University Library. Open Access

Metrics

191 File views/downloads

153 Record Views

Details

Title: Twitter data analysis using Spark
Creators: Yash Bopardikar
Contributors: Ying Jin (Committee Member); Jun Dai (Advisor)
Academic Unit: Computer Science Department
Theses and Dissertations: Master of Science (MS); Computer Science; California State University, Sacramento; 12/02/2016
Publication Details: 12/19/2016
Identifiers: 99257831121801671; https://hdl.handle.net/10211.3/182694
Resource Type: Masters Project
Language: English