Abstract
The malware industry is a billion-dollar market whose products are designed to evade conventional security measures. Once a system is breached, it becomes highly vulnerable to all types of cyberattacks, causing real financial and emotional pain to its users. Anti-malware organizations face challenges of their own. One of their main tasks is to examine and process the massive amounts of data communicated and stored within computer systems. Organizations such as Microsoft scan millions of computers per month to study malware detection on their systems [24]. This is costly both financially and in terms of time and resources. To understand and process such massive data, we need algorithms suited to datasets of this scale. Gradient boosting algorithms such as LightGBM and XGBoost can process such large datasets and can be used for regression as well as classification. They typically combine an ensemble of decision trees with a low learning rate. In this project, we applied variants of the gradient boosting algorithms LightGBM and XGBoost, along with machine learning techniques such as SVM. We also compared CPU and GPU performance based on the execution time of the different models. The main goals of our project are to minimize training time, to increase the accuracy of malware prediction, and to study extensively the behavior of CPUs and GPUs under different model execution scenarios. We used an unprecedented open-source malware dataset published by Microsoft on Kaggle for the data science community.
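To make the idea of a low learning rate model built from decision trees concrete, the following is a minimal sketch of gradient boosting with squared-error loss, written in pure Python. It is not LightGBM or XGBoost and is not the implementation used in this project. The function names (`fit_stump`, `fit_gradient_boosting`, `predict`), the ten-sample toy data, and the hyperparameter values are all assumptions chosen for illustration. The timing call only shows how wall-clock training time on a CPU might be measured; it does not reproduce any CPU versus GPU comparison.

```python
import time


def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (a stump) on one feature by least squares."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave samples on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all feature values identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return (xs[0], mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_gradient_boosting(xs, ys, n_rounds=200, learning_rate=0.1):
    """Each round fits a stump to the current residuals (the negative gradient
    of squared-error loss) and adds a shrunken copy of it to the ensemble."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Hypothetical toy data: one numeric feature and a 0/1 label.
xs = [float(i) for i in range(10)]
ys = [0.0] * 5 + [1.0] * 5

start = time.perf_counter()
model = fit_gradient_boosting(xs, ys, n_rounds=200, learning_rate=0.1)
elapsed = time.perf_counter() - start

print(f"trained {len(model[2])} stumps in {elapsed:.4f} s")
print(f"score for x=2: {predict(model, 2.0):.3f}, score for x=7: {predict(model, 7.0):.3f}")
```

Because each stump's contribution is scaled by the learning rate, a smaller rate needs more boosting rounds to reach the same fit, which is one reason training time becomes a concern on large datasets.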