Streaming Applications on Heterogeneous Platforms
Using multiple streams can improve overall system performance by mitigating the data transfer overhead on heterogeneous systems. To date, however, only a few codes have been streamed to demonstrate the performance impact, and a systematic investigation of when and how to stream over a large number of test cases remains a gap. In this paper, we use a total of 56 benchmarks to build a statistical view of the data transfer overhead and give an in-depth analysis of its impacting factors. Among the heterogeneous codes, we identify two types of non-streamable codes and three types of streamable codes, for which we propose a streaming approach. Our experimental results on a CPU-MIC platform show that, with multiple streams, application performance improves by up to 90%. Our work can serve as a generic workflow for using multiple streams on heterogeneous platforms.
Keywords: Multiple streams · Heterogeneous platforms · Performance
Heterogeneous platforms are increasingly popular in many application domains. Combining a host CPU with a specialized processing unit (e.g., a GPGPU or an Intel Xeon Phi) has been shown in many cases to improve application performance significantly. Typically, the host part of a heterogeneous platform manages the execution context, while the time-consuming code is offloaded to the coprocessor. Leveraging such platforms not only enables high peak performance, but also improves the performance-per-Watt ratio.
Given a heterogeneous platform, realizing its performance potential remains challenging. In particular, programmers need to explicitly move data between host and device over PCIe before and/or after running kernels. This overhead matters when data transfers take a significant amount of time, and it determines whether offloading is worthwhile at all [2, 3, 5]. To hide this overhead, kernel execution must be overlapped with data movement. To this end, multiple streams (i.e., a streaming mechanism) have been introduced, e.g., CUDA streams, OpenCL command queues, and Intel's hStreams. These implementations spawn more than one stream/pipeline so that the data movement stage of one pipeline overlaps the kernel execution stage of another.
Prior work on multiple streams mainly focuses on GPUs, where the potential of using multiple streams is shown to be significant [4, 7, 9, 17]. Liu et al. give a detailed study of how to achieve optimal task partitioning within an analytical framework for AMD and NVIDIA GPUs. Other authors model the performance of asynchronous data transfers of CUDA streams to determine the optimal number of streams. However, these studies cover only a very limited number of cases, which leaves two questions unanswered: (1) is using multiple streams necessary and worthwhile for a given application on a given heterogeneous platform? (2) is each candidate application streamable (i.e., overlappable), and if so, how can we stream the code? In this work, we make the following contributions:
We build a statistical view of the data transfer fraction (R) with a large number of test cases and analyze its impacting factors (Sect. 3).
We categorize the heterogeneous codes based on task dependency and present our approach to stream three types of applications (Sect. 4).
We demonstrate the performance impact of using multiple streams on the CPU-MIC platform with 13 streamed benchmarks (Sect. 5).
2 Related Work
In this section, we list the related work on pipelining, multi-tasking, workload partitioning, multi-stream modeling, and offloading necessity.
Pipelining is widely used in modern computer architectures. Specifically, the pipeline stages of an instruction run on different functional units, e.g., arithmetic units or data-loading units. In this way, stages from different instructions can occupy the same functional unit in different time steps, thus improving overall system throughput. Likewise, the execution of a heterogeneous application is divided into stages (H2D, KEX, D2H) and can exploit the idea of software pipelining on heterogeneous platforms.
Multi-tasking provides concurrent execution of multiple applications on a single device. Adriaens et al. propose and make the case for a GPU multitasking technique called spatial multitasking. Their experimental results show that spatial multitasking can obtain higher performance than cooperative multitasking. Wende et al. investigate the concurrent kernel execution mechanism that enables multiple small kernels to run concurrently on Kepler GPUs. The same authors also evaluate the Xeon Phi offload models with multi-threaded and multi-process host applications performing concurrent coprocessor offloading. Both multitasking and multiple streams share the idea of spatial resource sharing. Unlike multi-tasking, using multiple streams requires partitioning the workload of a single application (rather than multiple applications) into many tasks.
Workload Partitioning: There is a large body of workload partitioning techniques, which intelligently partition the workload between a CPU and a coprocessor at the algorithm level [20, 21] or during program execution [14, 15]. Partitioning workloads aims to exploit the unique architectural strengths of the processing units and improve resource utilization. In this work, we focus on how to efficiently utilize the coprocessing device with multiple streams. Ultimately, we need to leverage both workload partitioning and multiple streams to minimize the end-to-end execution time.
Multiple Streams Modeling: Gomez-Luna et al. present performance models for asynchronous data transfers on different GPU architectures. The models allow programmers to estimate the optimal number of streams into which the computation on the GPU should be broken up. Werkhoven et al. present an analytical performance model that indicates when to apply which overlapping method on GPUs. Their evaluation shows that the model is capable of correctly classifying the relative performance of the different implementations. Liu et al. carry out a systematic investigation into task partitioning to achieve maximum performance gains on AMD and NVIDIA GPUs. Unlike these works, we aim to evaluate the necessity of using multiple streams and investigate how to use streams systematically. Using a model on the Phi to determine the number of streams is left as future work.
Offloading Necessity: Meswani et al. have developed a framework for predicting the performance of applications executing on accelerators. Using automatically extracted application signatures and a machine profile based on benchmarks, they aim to predict the application's running time before it is ported. Evaluating offloading necessity is a prerequisite step to applying multiple streams. In this work, we evaluate the necessity of using multiple streams with a statistical approach.
3 A Statistical View
In this section, we give a statistical view of how many applications are worth streaming on heterogeneous platforms, and analyze the factors that impact the necessity of streaming.
3.1 Benchmarks and Datasets
Table 1. Applications, inputs and configurations
3.2 Experimental Platforms
The heterogeneous platform used in this work includes a dual-socket Intel Xeon CPU (12 cores per socket) and an Intel Xeon Phi 31SP (57 cores per card). The host CPUs and the cards are connected over PCIe. As for the software, the host CPU runs Red Hat Linux v7.0 (kernel version 3.10.0-123.el7.x86_64), while the coprocessor runs a customized uOS. Intel's MPSS (v3.6) is used as the driver and the communication backbone between the host and the coprocessor. Also, we use Intel's multi-stream implementation hStreams (v3.5.2) and Intel's OpenCL SDK (v14.2). Note that the applications in Table 1 are in OpenCL, while the pipelined versions are in hStreams.
3.3 Measurement Methodology
A typical heterogeneous code has three parts: (1) transferring data from host to device (H2D), (2) kernel execution (KEX), and (3) moving data back from device to host (D2H). To measure the percentage of each stage, we run the codes in a strictly stage-by-stage manner. Moreover, we perform 11 runs and take the median value. Before uploading datasets, buffers must be allocated on the device side. Due to the lazy allocation policy, the allocation overhead is often counted into H2D. Thus, we argue that the measured H2D might be larger than the actual host-to-device data transfer time.
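The stage-by-stage measurement above can be sketched as follows. This is an illustrative Python sketch, not the paper's harness; it assumes R is defined as the H2D share of the total stage time, with the median taken over repeated runs as described in the text.

```python
import statistics

def transfer_fraction(h2d, kex, d2h):
    """Data-transfer fraction R for one stage-by-stage run:
    R = t_H2D / (t_H2D + t_KEX + t_D2H)."""
    return h2d / (h2d + kex + d2h)

def median_fraction(runs):
    """Median R over repeated runs; each run is an (h2d, kex, d2h)
    tuple of measured stage times, mirroring the 11-run methodology."""
    return statistics.median(transfer_fraction(*r) for r in runs)
```

For example, three runs with stage times (1, 8, 1), (2, 6, 2), and (1, 9, 0) yield fractions 0.1, 0.2, and 0.1, so the reported R is 0.1.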
3.4 Results and Analysis
To summarize, we observe that R varies over platforms, benchmarks, code variants, and input configurations. Each benchmark has a unique balance between computation and memory accesses. Different code variants lead to differences in the amount of transferred data and in kernel execution time. Also, input configurations can change the amount of transferred data and/or the kernel execution time. Furthermore, R depends on hardware capabilities (e.g., the PCIe interconnect and the accelerator).
We use R as an indicator for deciding whether a target application is worth streaming. Figure 1 shows that H2D takes only 10% of the total execution time for over 50% of the test cases. We argue that, on the one hand, applications are not worth streaming when R is small. This is due to two factors: (1) code streaming introduces overheads for filling and emptying the pipeline, and (2) streaming an application requires extra programming effort to reconstruct data structures and manage streams. Thus, streaming such applications might degrade performance compared with the non-streamed code. On the other hand, when R is too large (e.g., 90%), it is equally not worthwhile to apply streams: when the fraction of H2D is that large, using accelerators may already lead to a performance drop compared with using only CPUs, let alone using streams. In real-world cases, users need to make the streaming decision based on the value of R and the coding effort.
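The decision rule above can be expressed as a small predicate. The cut-off values are assumptions lifted from the examples in the text (a small R around 10% and a large R around 90%), not thresholds prescribed by the paper:

```python
def worth_streaming(r, low=0.10, high=0.90):
    """Illustrative decision rule: streaming pays off only when the
    transfer fraction R is neither negligible (pipeline fill/drain and
    coding overheads dominate) nor dominant (offloading itself is
    already questionable). `low` and `high` are assumed cut-offs."""
    return low < r < high
```

In practice, as the text notes, the final call also weighs the extra coding effort, which this predicate does not capture.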
4 Our Streaming Approach
After determining the necessity of applying the pipelining/streaming mechanism, we further investigate how to do so. Generally, we divide applications into tasks which are mapped onto different processing cores. As mentioned above, each task includes the subtasks of data transfer and kernel execution. To pipeline a code, we must guarantee that there exist independent tasks that can run concurrently. Once independent tasks are discovered, we can overlap the H2D of one task with the KEX of another (Fig. 5(a)). Within a single task, KEX depends on H2D.
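The benefit of this overlap can be illustrated with a simple makespan model. The sketch below is our own idealized model (each of H2D, KEX, D2H occupies a dedicated resource, tasks flow through in order); it is not taken from the paper:

```python
def serial_makespan(n, h, k, d):
    """No streaming: each task runs H2D, KEX, D2H back to back."""
    return n * (h + k + d)

def pipelined_makespan(n, h, k, d):
    """3-stage pipeline (H2D, KEX, D2H) over n equal tasks with
    per-task stage times h, k, d; each stage has its own resource.
    Standard pipeline recurrence: a stage starts when both its
    predecessor stage (same task) and its own previous task finish."""
    fh = fk = fd = 0.0
    for _ in range(n):
        fh = fh + h                # H2D engine is busy back to back
        fk = max(fh, fk) + k       # KEX waits for its own H2D
        fd = max(fk, fd) + d       # D2H waits for its own KEX
    return fd
```

With n = 4 tasks and stage times (1, 2, 1), the serial schedule takes 16 time units while the pipelined one takes 10, matching the closed form h + k + d + (n-1)·max(h, k, d).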
Based on this dependency analysis, we categorize the codes listed in Table 1 as streamable codes (Sect. 4.2) and non-streamable codes. The first non-streamable pattern (SYNC) occurs when the H2D data is shared by all tasks of an application. In this case, the whole data transfer has to finish before kernel execution. The second non-streamable pattern is characterized as Iterative, where KEX is invoked iteratively once the data resides on the device. Although such cases can be streamed by overlapping the data transfer with the first iteration of kernel execution, we argue that this overlap brings no performance benefit when the number of iterations is large.
4.2 Code Streaming
We divide the streamable/overlappable applications into three categories based on task dependency: (1) embarrassingly independent, (2) false dependent, and (3) true dependent. Tasks are generated based on input or output data partitioning, and thus task dependency shows as a form of data dependency. We will explain them one by one.
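For the embarrassingly independent category, tasks can be generated by simply partitioning the output range into disjoint contiguous chunks, so the H2D/KEX/D2H stages of different tasks may overlap freely. A minimal sketch of such a partitioner (illustrative, not the paper's code):

```python
def partition_tasks(n_elems, n_tasks):
    """Split an output range of n_elems elements into n_tasks
    near-equal contiguous (start, end) half-open intervals. With no
    data shared between intervals, each task is fully independent."""
    base, rem = divmod(n_elems, n_tasks)
    tasks, start = [], 0
    for i in range(n_tasks):
        size = base + (1 if i < rem else 0)  # spread the remainder
        tasks.append((start, start + size))
        start += size
    return tasks
```

For false- and true-dependent codes, the partitioning additionally has to account for the data shared or carried between chunks, as discussed next.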
True Dependent. The third category of overlappable applications is similar to the second in that data dependencies exist between tasks. The difference is that the dependency is true (i.e., read-after-write, RAW). This is complicated for programmers not only because there is a dependence between the input data elements, but also because the input data must be updated during the calculation. Thus, the output elements depend on the updated input data, and the order of calculation must be controlled. The key for this pattern is to discover concurrency while respecting the dependency.
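A toy example of this pattern is a chunked prefix sum (our own illustration, not one of the paper's benchmarks): each chunk carries a RAW dependency on the final running total of its predecessor, yet once that carry-in value is known, all elements within the chunk can be computed independently, which is exactly the concurrency this pattern exposes:

```python
def chunked_prefix_sum(data, chunk):
    """Prefix sum computed chunk by chunk. The chunks must be ordered
    (true dependency via `carry`); the per-chunk local scan is the
    independent work that could overlap with transfers of later chunks."""
    out, carry = [], 0
    for start in range(0, len(data), chunk):
        block = data[start:start + chunk]
        local, s = [], 0
        for x in block:            # local scan: independent of other chunks
            s += x
            local.append(s)
        out.extend(carry + v for v in local)
        if out:
            carry = out[-1]        # RAW: next chunk reads this updated value
    return out
```

For instance, chunking [1, 2, 3, 4] in pairs yields [1, 3, 6, 10], identical to the sequential scan, while respecting the inter-chunk ordering.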
5 Experimental Results
In this section, we discuss the performance impact of using multiple streams. We use the CPU-MIC heterogeneous platform detailed in Sect. 3.2. Due to time and space limitations, we port 13 applications from Table 1 to hStreams. As shown in Table 2, these 13 benchmarks fall into different categories, and we use the corresponding approach to stream each of them.
Figure 9 shows the overall performance comparison. We see that using multiple streams outperforms using a single stream, with a performance improvement of 8%–90%. In particular, for nn, FastWalshTransform, ConvolutionFFT2D, and nw, the improvement is around 85%, 39%, 38%, and 52%, respectively. However, for applications such as lavaMD, we cannot obtain the expected performance improvement with multiple streams, as discussed below.
For the False Dependent applications, if the extra overhead of transferring boundary elements is non-negligible, code streaming is not beneficial. For FWT, one element is related to 254 elements, which is far fewer than the subtask data size of 1048576. Therefore, although extra boundary values have to be transferred, the overall streaming performance impact is positive. However, when the number of boundary elements is close to the subtask size, the overhead introduced by boundary transmission cannot be ignored. LavaMD calculates particle potential and relocation due to mutual forces between particles within a large 3D space. In our experiment, one element of lavaMD depends on 222 elements, 111 lying before the target element and the other half behind it. The task data size is 250, which is close to the boundary element count. Thus, we cannot obtain the expected performance improvement, and the experimental results confirm this. Specifically, with a fixed task size of 250, the single-stream H2D and KEX times are 0.3476 s and 0.3380 s, respectively, while the overall execution time with multiple streams is 0.7242 s. Therefore, it is not beneficial to stream overlappable applications like lavaMD.
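The contrast between FWT and lavaMD comes down to a single ratio: the boundary (halo) volume relative to the task payload. A quick check with the numbers quoted above:

```python
def halo_overhead(boundary, task_size):
    """Extra transfer volume per task relative to its payload. When
    this ratio approaches 1, the halo traffic cancels the overlap gain."""
    return boundary / task_size

# Numbers quoted in the text:
fwt_ratio = halo_overhead(254, 1048576)   # tiny -> streaming pays off
lavamd_ratio = halo_overhead(222, 250)    # ~0.89 -> overlap benefit eaten up
```

The roughly three-orders-of-magnitude gap between the two ratios is consistent with the measurements: FWT gains 39% from streaming, while lavaMD's streamed run (0.7242 s) is slower than even the sum of its single-stream H2D and KEX times.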
In this paper, we summarize a systematic approach that helps programmers determine whether an application needs, and would benefit from, the streaming mechanism, and how to stream the code: (1) obtain the ratio R by running the code stage by stage and recording the H2D and KEX times; (2) judge whether the application is overlappable; (3) stream the code by either eliminating or respecting data dependencies. Our experimental results on 13 streamed benchmarks show a performance improvement of up to 90%.
The process of analyzing whether a code is streamable and of transforming the code is currently performed manually. Thus, we plan to develop a compiler analysis and tuning framework to automate this effort. Based on the streamed code, we will further investigate how to obtain optimal performance by setting a proper task and/or resource granularity. Ultimately, we plan to autotune these parameters by leveraging machine learning techniques. Also, we want to investigate the streaming mechanism on heterogeneous platforms beyond the CPU-MIC one.
We would like to thank the reviewers for their constructive comments. This work was partially funded by the National Natural Science Foundation of China under Grant No. 61402488, No. 61502514 and No. 61602501, the National High-tech R&D Program of China (863 Program) under Grant No. 2015AA01A301, the National Research Foundation for the Doctoral Program of Higher Education of China (RFDP) under Grant No. 20134307120035 and No. 20134307120031.
- 1. Adriaens, J.T., Compton, K., Kim, N.S., Schulte, M.J.: The case for GPGPU spatial multitasking. In: 2012 IEEE 18th International Symposium on High Performance Computer Architecture (HPCA), pp. 1–12. IEEE, February 2012
- 2. Boyer, M., Meng, J., Kumaran, K.: Improving GPU performance prediction with data transfer modeling. In: 2013 IEEE 27th International Parallel and Distributed Processing Symposium Workshops and PhD Forum (IPDPSW), pp. 1097–1106. IEEE, May 2013
- 3. Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J.W., Lee, S.-H., Skadron, K.: Rodinia: a benchmark suite for heterogeneous computing. In: IEEE International Symposium on Workload Characterization, IISWC 2009, pp. 44–54. IEEE, October 2009
- 5. Gregg, C., Hazelwood, K.: Where is the data? Why you cannot debate CPU vs. GPU performance without the answer. In: 2011 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 134–144. IEEE, April 2011
- 8. Intel Inc.: hStreams Architecture document for Intel MPSS 3.5, April 2015
- 12. NVIDIA Inc.: CUDA C Best Practices Guide Version 7.0, March 2015
- 14. Pienaar, J.A., Raghunathan, A., Chakradhar, S.: MDR: performance model driven runtime for heterogeneous parallel platforms. In: Proceedings of the International Conference on Supercomputing, ICS 2011, pp. 225–234. ACM, New York (2011)
- 15. Takizawa, H., Sato, K., Kobayashi, H.: SPRAT: runtime processor selection for energy-aware computing. In: 2008 IEEE International Conference on Cluster Computing, pp. 386–393. IEEE (2008)
- 16. The Khronos OpenCL Working Group: OpenCL - The open standard for parallel programming of heterogeneous systems, January 2016. http://www.khronos.org/opencl/
- 17. Werkhoven, B.V., Maassen, J., Seinstra, F.J., Bal, H.E.: Performance models for CPU-GPU data transfers. In: 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 11–20. IEEE, May 2014
- 19. Wende, F., Steinke, T., Cordes, F.: Multi-threaded kernel offloading to GPGPU using Hyper-Q on Kepler architecture. Technical report 14-19, ZIB, Takustr. 7, 14195 Berlin (2014)
- 20. Yang, C., Wang, F., Du, Y., Chen, J., Liu, J., Yi, H., Lu, K.: Adaptive optimization for petascale heterogeneous CPU/GPU computing. In: 2010 IEEE International Conference on Cluster Computing (CLUSTER), pp. 19–28. IEEE (2010)
- 21. Yang, C., Xue, W., Fu, H., Gan, L., Li, L., Xu, Y., Lu, Y., Sun, J., Yang, G., Zheng, W.: A peta-scalable CPU-GPU algorithm for global atmospheric simulations. In: Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2013, pp. 1–12. ACM, New York (2013)