Abstract
Increases in network and storage capacity, together with decreases in the price of digital cameras, mean that image data is being collected at an ever faster rate. The volume of data is already orders of magnitude larger than humans can handle, and it will become increasingly important to develop better techniques for classifying images by machine. Astronomy in particular suffers from this overwhelming volume of data. Studying the morphology, or shape, of galaxies, for example, allows us to learn about how they formed, and therefore about the history of our universe. In this work, we look at several methods of using ensembles of Support Vector Machines to classify galaxies into one or more of thirty-seven different categories. We used images from the Sloan Digital Sky Survey, which has a publicly available database of ten million galaxies including both images and metadata. The brightest 243,000 of these were classified by Galaxy Zoo 2, a citizen science project in which tens of thousands of volunteers performed the classifications. We used the publicly available data from Galaxy Zoo 2 and the Sloan Digital Sky Survey to train a number of different ensembles of Support Vector Machines, in order to see which combinations of techniques and data performed better than others. Based on our original work, using only a single grayscale image for each galaxy seemed likely to give good results; however, this approach performed better than consistently picking the largest category for only a single question. Using bagging improved the results, as did including information on how spread out the light in the image is, but even this performed better than picking the largest category on only 3 of the 11 questions.