FARMS: Efficient MapReduce speculation for failure recovery in short jobs

Huansong Fu; Haiquan Chen; Yue Zhu; Weikuan Yu

doi:10.1016/j.parco.2016.10.004

Journal article

Open access

Peer reviewed

Parallel Computing, Vol. 61, pp. 68-82

01/2017

DOI: https://doi.org/10.1016/j.parco.2016.10.004

Handle: https://hdl.handle.net/20.500.12741/rep:7798

Keywords: Failure recovery; MapReduce; YARN; Speculation

Abstract

• The existing speculation mechanism has fundamental flaws in mitigating intra-node and completed-task stragglers, which are often caused by node failure.
• These issues cause a performance breakdown of more than an order of magnitude for small jobs and serious performance degradation for large jobs upon failures.
• A hybrid solution combines a speculation mechanism that copes with these issues and a scheduling policy that enhances failure awareness and recovery.
• The implementation of the solution shows striking performance improvement for MapReduce failure recovery.

With the ever-increasing size of software and hardware components and the complexity of configurations, large-scale analytics systems face the challenge of frequent transient faults and permanent failures. As an indispensable part of big data analytics, MapReduce is equipped with a speculation mechanism to cope with run-time stragglers and failures. However, we reveal that the existing speculation mechanism has major drawbacks that hinder its efficiency during failure recovery, which we refer to as the speculation breakdown. We use a representative implementation of MapReduce, YARN, and its speculation mechanism as a case study to demonstrate that the speculation breakdown causes significant performance degradation among MapReduce jobs, especially those with shorter turnaround times. As our experiments show, a single node failure can slow a job down by up to 9.2 times. To address the speculation breakdown, we introduce a failure-aware speculation scheme and a refined task scheduling policy. Moreover, we have conducted a comprehensive set of experiments to evaluate the performance of both the individual components and the whole framework. Our experimental results show that our new framework achieves dramatic performance improvement in handling node failures compared to the original YARN.

Files and links (1)

url

https://doi.org/10.1016/j.parco.2016.10.004

Published (Version of record) Open

Details

Title: FARMS: Efficient MapReduce speculation for failure recovery in short jobs
Creators: Huansong Fu - Florida State University, 600 W College Ave, Tallahassee, FL 32306, United States
Haiquan Chen - Valdosta State University, 1500 N Patterson St, Valdosta, GA 31698, United States
Yue Zhu - Florida State University, 600 W College Ave, Tallahassee, FL 32306, United States
Weikuan Yu - Florida State University, 600 W College Ave, Tallahassee, FL 32306, United States
Academic Unit: Computer Science Department
Publisher: Elsevier B.V.
Publication Details: 01/2017
Identifiers: 99257880259501671; https://hdl.handle.net/20.500.12741/rep:7798; https://doi.org/10.1016/j.parco.2016.10.004
Language: English