A data mart for annotated protein sequence extracted from UniProt database

Maulik Vyas

Thesis

Open access

Master of Science (MS), California State University, Sacramento

01/13/2012

Handle: https://hdl.handle.net/10211.9/1436

Abstract

Keywords: Data warehouse; Star schema; Data cube

Organizations use data warehouses, together with their supporting tools and architectures, to organize, understand, and use data for strategic decisions. A biological data warehouse such as the annotated protein sequence database is a subject-oriented, volatile collection of data related to protein synthesis that is used in bioinformatics. A data mart contains a subset of the enterprise data in a data warehouse that is of value to a specific group of users. I implemented a data mart for a protein sequence database, based on data warehouse design principles and techniques, using data provided by the Swiss Institute of Bioinformatics. While the data warehouse contains information about many protein sequence areas, the data mart focuses on one or more subject areas. It brings together experimental results, computed features, and scientific conclusions by implementing a star schema and a data cube that support the data warehouse, making it easier for organizations to distribute data within a unit. This enables them to deploy, manipulate, and develop the protein sequence data in any way they see fit. The main goal of this project is to provide consistent, accurate annotated protein sequence data to a group of researchers working on protein sequences. I extracted a portion of this data from the warehouse, transformed it, and loaded it into a staging area. I used HJSplit to split the XML protein sequence data into equal parts and extracted information using an XML editor. I populated the database tables in Microsoft Access 2010 from the XML file. Once the database was set up, I used MySQL Workbench 5.2 CE to generate queries related to the star schema. Finally, I implemented the star schema, OLAP operations, the data cube, and drill-up and drill-down operations, all based on SQL queries, for strategic analysis of the protein sequence database. This ensured explicit support for dimensions, aggregation, and long-range analysis.
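
The star schema and the drill-up and drill-down operations mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the project report: it uses Python's built-in `sqlite3` module in place of the Microsoft Access and MySQL Workbench setup the abstract describes, and every table name, column name, and data row (`dim_organism`, `dim_feature_type`, `fact_annotation`, the `ACC000x` accessions) is a hypothetical placeholder rather than real UniProt content.

```python
import sqlite3


def build_star_schema(conn):
    """Create one fact table surrounded by two dimension tables (hypothetical names)."""
    conn.executescript("""
        CREATE TABLE dim_organism (
            organism_id INTEGER PRIMARY KEY,
            kingdom     TEXT NOT NULL,
            species     TEXT NOT NULL
        );
        CREATE TABLE dim_feature_type (
            feature_type_id INTEGER PRIMARY KEY,
            feature_type    TEXT NOT NULL
        );
        CREATE TABLE fact_annotation (
            accession       TEXT NOT NULL,
            organism_id     INTEGER NOT NULL REFERENCES dim_organism(organism_id),
            feature_type_id INTEGER NOT NULL REFERENCES dim_feature_type(feature_type_id),
            feature_count   INTEGER NOT NULL
        );
    """)


def load_placeholder_rows(conn):
    """Insert made-up rows; none of these values come from UniProt."""
    conn.executemany("INSERT INTO dim_organism VALUES (?, ?, ?)", [
        (1, "Eukaryota", "species A"),
        (2, "Eukaryota", "species B"),
        (3, "Bacteria", "species C"),
    ])
    conn.executemany("INSERT INTO dim_feature_type VALUES (?, ?)", [
        (1, "domain"),
        (2, "binding site"),
    ])
    conn.executemany("INSERT INTO fact_annotation VALUES (?, ?, ?, ?)", [
        ("ACC0001", 1, 1, 2),
        ("ACC0001", 1, 2, 1),
        ("ACC0002", 2, 1, 4),
        ("ACC0003", 3, 1, 3),
        ("ACC0003", 3, 2, 5),
    ])


def totals_by_species(conn):
    """Drill-down view: aggregate the fact table at the (kingdom, species) level."""
    return conn.execute("""
        SELECT o.kingdom, o.species, SUM(f.feature_count)
        FROM fact_annotation AS f
        JOIN dim_organism AS o ON o.organism_id = f.organism_id
        GROUP BY o.kingdom, o.species
        ORDER BY o.kingdom, o.species
    """).fetchall()


def totals_by_kingdom(conn):
    """Roll-up view: the same measure aggregated one level higher, by kingdom only."""
    return conn.execute("""
        SELECT o.kingdom, SUM(f.feature_count)
        FROM fact_annotation AS f
        JOIN dim_organism AS o ON o.organism_id = f.organism_id
        GROUP BY o.kingdom
        ORDER BY o.kingdom
    """).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_star_schema(connection)
    load_placeholder_rows(connection)
    print(totals_by_species(connection))
    print(totals_by_kingdom(connection))
```

In this sketch, moving between the two queries only changes the `GROUP BY` level on a dimension hierarchy, which is the sense in which a star schema supports drill-up and drill-down. A MySQL-based setup could likely express the roll-up in a single query with `GROUP BY ... WITH ROLLUP`; SQLite has no such modifier, so the two levels are written as separate queries here.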

Files and links (2)

Maulik_Vyas_MS_Project_Report_DW_on_UniProt (doc, 735.00 kB): Main Project-Word, Open Access

Maulik_Vyas_MS_Project_Report_DW_on_UniProt (pdf, 733.40 kB): Main Project-PDF, Open Access

Details

Title: A data mart for annotated protein sequence extracted from UniProt database
Creators: Maulik Vyas
Contributors: Meiliu Lu (Advisor)
Academic Unit: Computer Science Department
Theses and Dissertations: Master of Science (MS); Computer Science; California State University, Sacramento; 11/29/2011
Publication Details: 01/13/2012
Identifiers: 99257830926501671; https://hdl.handle.net/10211.9/1436
Resource Type: Masters Project
Language: English