Abstract
Breast cancer is a cancer that starts in the tissues of the breast. Over the course of a lifetime, 1 in 8 women will be diagnosed with breast cancer. Existing cancer treatments include chemotherapy, radiation therapy, and targeted therapy. Targeted therapy acts on specific mechanisms of cancer cell growth, division, and life cycle. Existing targeted therapies for breast cancer include Avastin, Herceptin, Iressa, and Tykerb, each of which has specific effects on cancer cells. There remains a need for new drugs, and the motivation of this project is to produce results that are useful for new drug design. Transcriptional modules (TMs) consist of groups of co-regulated genes and the transcription factors (TFs) that regulate their expression. Transcription factors are one of the groups of proteins that read and interpret the genetic "blueprint" in the DNA. They bind DNA and help initiate a program of increased or decreased gene transcription. As such, they are vital for many important cellular processes. Currently available breast cancer profile data is large and may be noisy and incomplete, and it is not in a format suitable as input to a data mining algorithm. This project therefore first develops a data preprocessing method that transforms the original data into a format a data mining algorithm can use. The second step is to derive association rules between sets of transcription factors and sets of genes using an association rule mining method. The project procedure is as follows: (1) Analyze the breast cancer profile data, identifying its most important attributes and removing redundant ones. (2) Survey data mining algorithms and identify a suitable algorithm for association rule mining.
(3) Develop a data preprocessing method that ranks and sorts the data and makes it usable as input to the chosen association rule mining algorithm. Data preprocessing techniques such as data cleaning, data integration, data transformation, and data reduction were used. (4) Apply the association rule mining algorithm to the transformed data and analyze the results. Association rule mining is performed in two stages. Stage 1: generate association rules using an association rule mining method in WEKA. Stage 2: generate association rules using data filtered on the basis of p-value and support.
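As an illustration of the kind of rule the procedure above aims to derive, the following is a minimal Python sketch of support- and confidence-based rule mining between sets of transcription factors and sets of genes. It is not the WEKA method used in the project. The toy transactions, the `TF_`/`GENE_` naming scheme, the function names, and the threshold values are all hypothetical and chosen only for illustration. The p-value filter of Stage 2 is not shown.

```python
from itertools import combinations

# Hypothetical toy transactions: each row lists the transcription factors (TF_*)
# and genes (GENE_*) observed together in one sample. Names are illustrative only.
transactions = [
    {"TF_A", "TF_B", "GENE_1", "GENE_2"},
    {"TF_A", "TF_B", "GENE_1"},
    {"TF_A", "GENE_2"},
    {"TF_B", "GENE_1", "GENE_2"},
    {"TF_A", "TF_B", "GENE_1", "GENE_2"},
]


def support(itemset, rows):
    """Fraction of rows that contain every item in itemset."""
    return sum(1 for row in rows if itemset <= row) / len(rows)


def mine_rules(rows, min_support=0.4, min_confidence=0.7, max_size=2):
    """Enumerate rules {TFs} -> {genes} that meet both thresholds (assumed values)."""
    items = sorted(set().union(*rows))
    tfs = [i for i in items if i.startswith("TF_")]
    genes = [i for i in items if i.startswith("GENE_")]
    rules = []
    for n_tf in range(1, max_size + 1):
        for lhs in combinations(tfs, n_tf):
            lhs_support = support(set(lhs), rows)
            if lhs_support < min_support:
                continue  # an infrequent antecedent cannot yield a frequent rule
            for n_gene in range(1, max_size + 1):
                for rhs in combinations(genes, n_gene):
                    both = support(set(lhs) | set(rhs), rows)
                    if both < min_support:
                        continue
                    confidence = both / lhs_support
                    if confidence >= min_confidence:
                        rules.append((lhs, rhs, both, confidence))
    return rules


rules = mine_rules(transactions)
for lhs, rhs, sup, conf in rules:
    print(f"{', '.join(lhs)} -> {', '.join(rhs)}  support={sup:.2f} confidence={conf:.2f}")
```

The sketch enumerates candidate rules exhaustively, which is only practical for small item sets. Apriori-style algorithms, such as those available in WEKA, prune the candidate space more aggressively.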