Abstract
Producing a good instruction schedule is vital for the run-time performance of compiler generated executables. Unfortunately, instruction scheduling is an NP hard problem. Since the overall compilation time is a primary design consideration, production compilers typically use greedy algorithms for instruction scheduling. While such algorithms typically produce good schedules, they are not provably optimal, and prior research has found that such algorithms may produce sub optimal schedules [1] [2]. Exact algorithms for this problem have super-polynomial complexity, and may not terminate within a reasonable compile time. Many techniques have been proposed to improve the performance of exact algorithms [3][4][5], but even the best-known exact algorithms often fail to find optimal schedules within a reasonable compile time. In this work, we will propose a novel algorithm which takes advantage of current multi-core processors and parallelizes an existing state-of-the-art exact scheduling algorithm that uses branch-and-bound. The value of the parallel branch-and-bound algorithm lies in its ability to find optimal schedules of instructions more quickly than the sequential algorithm. Our experimental evaluation using AMD’s rocPRIM benchmark on a 16 core AMD ThreadRipper CPU shows an average speedup of 6.1 across all regions of interest, which amounts to a speedup of 1.2 in overall scheduling time. These speedups are achieved while maintaining the same execution-time improvement of the sequential algorithm, which gives an average improvement of 5% in the execution times of the resultant executables when run on an AMD Vega20 GPU.