Abstract
A meme is a socially produced image used to comment on an event, typically built from a template of high-quality online pictures overlaid with text. Memes can spread humor, but they can also hurt particular groups or individuals. Memes frequently pair abusive images with offensive text, which makes the classification of hateful memes essential. Classifying abusive memes is challenging because a model must capture the combined multimodal context of the image and the text. The main goal of this project is to reduce hatefulness on online platforms by detecting hateful memes.
This work aims to identify and categorize hateful content shared via memes on social media platforms, and so to create safer online spaces, using the Vision Transformer (ViT) architecture and Bidirectional Encoder Representations from Transformers (BERT). The project extracts the text from each meme image using Optical Character Recognition (OCR), then builds a model that uses ViT for image analysis and BERT for text processing. The model is trained on the concatenated outputs of ViT and BERT. The "Hateful memes" dataset is sourced from the Hugging Face platform and is designed specifically for hateful meme classification. Each meme in the dataset is labeled as either hateful or not hateful.
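The concatenation step can be illustrated with a minimal sketch. It is not the project's implementation: plain Python lists stand in for the ViT and BERT embeddings, the embedding sizes are arbitrary, the weights are random rather than trained, and the function name `fuse_and_classify` and the logistic classification head are assumptions made for illustration.

```python
import math
import random


def fuse_and_classify(image_embedding, text_embedding, weights, bias):
    """Concatenate two embeddings and apply a logistic classification head."""
    # Fusion by concatenation: the image vector followed by the text vector.
    fused = list(image_embedding) + list(text_embedding)
    if len(fused) != len(weights):
        raise ValueError("weight vector must match the fused embedding length")
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    # Sigmoid maps the logit to a probability that the meme is hateful.
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical stand-ins: in the described pipeline these vectors would be
# the outputs of ViT (image) and BERT (OCR-extracted text).
rng = random.Random(0)
image_embedding = [rng.uniform(-1, 1) for _ in range(4)]
text_embedding = [rng.uniform(-1, 1) for _ in range(4)]
weights = [rng.uniform(-0.5, 0.5) for _ in range(8)]  # untrained, illustrative

probability = fuse_and_classify(image_embedding, text_embedding, weights, bias=0.0)
label = "hateful" if probability >= 0.5 else "not hateful"
```

In a trained model the weights of the head would be learned jointly from the labeled memes, so that the decision depends on both modalities at once rather than on the image or the text alone.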