Tim Kaldewey, Jeff Hagen, Andrea Di Blas, Eric Sedlar
Oracle Server Technologies - Special Projects
{tim.kaldewey, jeff.hagen, andrea.di.blas, eric.sedlar}@oracle.com
Today's enterprise online-transaction processing (OLTP) systems rarely need to access data not in memory [22]. The growth rates of main memory size have outstripped the growth rates of structured data in the enterprise, particularly when ignoring historical data. In such cases, database performance is governed by main memory latency [1], aggravated by the ever-widening gap between main memory and CPU speeds. The term memory wall was coined in the 1990s [23] and remains an accurate description of the situation today [16].
The paradigm shift from bandwidth- to throughput-oriented, parallel computing [13] comes with new opportunities to circumvent the memory wall. Interleaving memory requests from many cores and threads theoretically allows for much higher memory throughput than optimizing an individual core could provide. Leveraging the massive parallelism of GPUs, with up to 128 cores [18], we put this approach to the test.
Optimizing enterprise-class (parallel) database systems for throughput usually means exploiting thread-level parallelism (TLP), pairing each query with a single thread to avoid the cost associated with thread synchronization. The GPU's threading model as exposed to the programmer suggests the same mapping, although modern GPU architectures actually consist of multiple SIMD processors, each comprising multiple processing elements (Fig. 1). While for certain workloads this approach already outperforms similarly priced CPUs, we demonstrate that algorithms exploiting the nature of the GPU's SIMD architecture can go much further (Tab. 1).
We developed a parallel search algorithm, named p-ary search, with the goal of improving response time on massively parallel architectures like the GPU, which were originally designed to improve throughput rather than individual response time. The "p" in the name refers to the number of processors that can be synchronized within a few compute cycles, which determines the convergence rate of the algorithm. As it turns out, p-ary search implemented on the GPU outperforms conventional approaches not only in terms of response time but also throughput, despite significantly more memory accesses and theoretically lower throughput. However, small workloads yield poor performance, due to the overhead incurred to invoke GPU computation. As the GPU operates in batch mode, the often critical time-to-first-result equals the time-to-completion, so that larger workloads/batches incur higher latency.
Despite promising performance results, the GPU is not yet ready to be adopted as a query co-processor, for multiple reasons. First, the GPU operates in batch processing mode. Second, the lack of global synchronization primitives, small caches, and the absence of dynamic memory allocation make it difficult to develop efficient parallel algorithms for more complex operations such as joins [14]. Third, the development environment is still in its infancy, offering limited debugging and profiling capabilities.
However, GPU computing is progressing so rapidly that between the time this research was conducted and its publication, some of these issues have already been addressed in the latest hardware and software generation. For example, global synchronization and asynchronous communication have been added to the hardware feature list. While emerging parallel programming APIs like OpenCL [17] already blur the frontiers between CPU and GPU computing, future architectures like Larrabee [7] integrate both.
At the beginning of the computer graphics era, the CPU was in charge of all graphics operations. Progressively, more and more complex operations were offloaded to the GPU. When, thanks to their massively parallel architecture, GPUs started becoming more powerful than the CPU itself, many programmers began exploring the use of the GPU for non-graphics computations, a practice referred to as general-purpose computing on graphics processing units, or GPGPU. However, programming the GPU had to be done using standard graphics APIs, such as OpenGL or DirectX. Even when using it for general-purpose computations, the developer had to map data and variables to graphics objects, using single-precision floating-point values, the only data type available on the GPU. Algorithms had to be expressed in terms of geometry and color transformations, and actually performing a computation required pretending to draw a scene. This task was usually rather challenging even for simple applications, due to the rigid limitations in terms of functionality, data types, and memory access.
Nvidia's Compute Unified Device Architecture (CUDA), an extension to the C programming language that allows programming the GPU directly, was a major leap ahead [19]. At the top level, a CUDA application consists of two parts: a serial program running on the CPU and a parallel part, called a kernel, running on the GPU.
The kernel is organized as a number of blocks of threads, with one block running all its own threads to completion on one of the several streaming multiprocessors, SMs, available. When the number of blocks as defined by the programmer exceeds the number of physical multiprocessors, blocks are queued automatically. Each SM has eight processing elements, PEs (Fig. 1). PEs in the same streaming multiprocessor execute the same instruction at the same time in Single Instruction-Multiple Data, SIMD, mode [9].
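To make this organization concrete, the following minimal CUDA sketch (with illustrative names, not code from our system) launches a kernel in which every thread processes one array element; the hardware distributes the blocks over the available SMs.

#include <cuda_runtime.h>

// Each thread scales one element; blocks are scheduled onto SMs by the hardware.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) data[i] *= factor;                    // guard against overshoot
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // 256 threads per block; enough blocks to cover all n elements.
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}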
To optimize SM utilization, within a block the GPU groups threads following the same code path into so-called warps for SIMD-parallel execution. Due to this mechanism, Nvidia calls its GPU architecture Single Instruction-Multiple Threads (SIMT). Threads running on the same SM share a set of registers as well as a low-latency shared memory located on the processor chip. This shared memory is small (16 KB on our G80) but about 100× faster than the larger global memory on the GPU board. A careful memory access strategy is even more important on the GPU than on the CPU because caching on the GPU is minimal and mainly the programmer's responsibility.
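As an illustration of how the on-chip shared memory and the intra-block barrier are used, the hedged sketch below (names and sizes are ours) stages a tile of global memory into shared memory before a block-local reduction.

#define TILE 256   // illustrative tile size; launch with TILE threads per block

// Stage one tile of global memory into the SM's shared memory, then let
// thread 0 reduce it. __syncthreads() is the barrier within a block.
__global__ void sumTiles(const int *in, int *out, int n) {
    __shared__ int tile[TILE];                    // on-chip, shared by the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0;      // one global load per thread
    __syncthreads();                              // wait until the tile is complete
    if (threadIdx.x == 0) {
        int s = 0;
        for (int j = 0; j < TILE; ++j) s += tile[j];
        out[blockIdx.x] = s;                      // one partial sum per block
    }
}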
To compensate for the small local memories and caches, GPUs employ massive multithreading to effectively hide memory latency. The scheduler within an SM decides each cycle which group of threads (warp) to run, such that warps with threads accessing memory can be suspended at no cost until the requested data is available. This seamless multithreading is made possible by thousands of registers in each SM, so that each thread keeps its variables in registers and context switching is free. Effectively, this approach implements what we would naively describe as event-based scheduling (Fig. 2) and benefits large, latency-bound workloads.
On the other hand, CPUs employ larger caches (4 MB on our Q6700) but rely on a single set of registers, such that context switches require preserving the state of execution of the current thread before loading the next. As context switching is expensive and schedulers are implemented in software, CPU scheduling is based on time quanta such that in case of a cache miss a thread sits idle until the memory request returns or its time quantum expires.
These characteristics make the GPU an interesting platform for parallel database processing.
Video cards have been explored as coprocessors for a variety of non-graphics related applications [20], including database operations [11]. Sorting on the GPU [6,12], including very large data sets that require multi-pass sorting [10], has used variants of Batcher's bitonic sorting networks [3]. While geospatial databases, whose data sets are similar to graphics data, were the first to adopt GPUs for more complex operations like join [2], this has only recently been considered for general-purpose databases [14]. GPGPU research prior to 2007, before the release of Nvidia's CUDA [19], required the use of graphics-specific APIs that only supported floating-point numbers. However, fast searching -- which is fundamental even beyond database applications -- has not yet been explored for general-purpose data on the GPU.
The obvious way of implementing search on the GPU is to exploit data parallelism, omnipresent in large-scale database applications, by handling thousands of queries simultaneously. Multiple threads run the same serial algorithm on different problem instances, which is no different from CPU multi-threading where each select operation is paired with a single thread. While this approach does not improve response time for a single problem instance, it returns multiple results at the same time while requiring only minimal synchronization. To reduce response time for a single problem instance, we suggest exploring functional parallelism following the divide-and-conquer principle, assigning different sub-problems to different PEs. The efficiency of this approach is limited only by the amount of communication and the synchronization overhead imposed by the architecture. Since synchronization within an SM is relatively fast, this appears to be a promising avenue.
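A minimal CUDA sketch of this data-parallel mapping follows; it assumes integer keys for brevity (our data set uses short strings) and illustrative names. Each thread runs an independent serial binary search for its own query.

// One thread per query: every thread performs a serial binary search over
// the same sorted array residing in global memory.
__global__ void batchBinarySearch(const int *sorted, int n,
                                  const int *queries, int *results,
                                  int numQueries) {
    int q = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's query
    if (q >= numQueries) return;

    int key = queries[q];
    int lo = 0, hi = n - 1, found = -1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int pivot = sorted[mid];                     // uncached global memory access
        if (pivot == key) { found = mid; break; }
        if (pivot < key)  lo = mid + 1;
        else              hi = mid - 1;
    }
    results[q] = found;                              // index of the key, or -1
}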
Figure 3 shows a parallel binary search of four keys, one per PE, in the same search space of characters. In our example PE0, PE2, and PE3 find their keys quickly and have to idle until PE1 finishes. The larger the number of PEs in an SM, the more likely it is that the search requires worst-case execution time.
This approach reduces the search range by a factor of p at each iteration, yielding a worst-case execution time of O(log_p(n)). The response time of this algorithm is significantly lower than that of the previous one, but it delivers only one result per run instead of p. However, it has higher efficiency since PEs never idle and the synchronization overhead is minimal on SIMD architectures. Neighboring PEs can share a boundary key, reducing the number of global memory accesses per iteration from 2p to p+1. This can be further reduced to p-1 by re-using the result's boundary keys from the previous iteration and by setting the lower bound of the first subrange to -infinity and the upper bound of the last to +infinity.
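The sketch below shows one way to express a single p-ary search instance in CUDA: the p threads of a block cooperate through shared memory, splitting the current range into p subranges each iteration. It is a simplified rendering of the algorithm described above, without the boundary-key sharing optimization, and the names and the choice of P are illustrative.

#define P 32   // number of cooperating threads; launch with P threads per block

// One block performs one p-ary search for keys[blockIdx.x] in the sorted
// array 'data' of length n. Each iteration splits [lo, hi) into P subranges;
// every thread probes the lower boundary of one subrange, and the range
// shrinks by a factor of P until fewer than P candidates remain.
__global__ void parySearch(const int *data, int n, const int *keys, int *results) {
    __shared__ int lo, hi;
    __shared__ int vote[P];

    int key = keys[blockIdx.x];                       // one query per block
    if (threadIdx.x == 0) { lo = 0; hi = n; results[blockIdx.x] = -1; }
    __syncthreads();

    while (hi - lo > P) {
        int len = (hi - lo) / P;                      // subrange length
        int start = lo + threadIdx.x * len;           // this thread's boundary
        vote[threadIdx.x] = (data[start] <= key);     // can the key lie at or after this boundary?
        __syncthreads();
        if (threadIdx.x == 0) {
            int t = 0;                                // last subrange whose boundary key <= search key
            for (int s = 1; s < P; ++s)
                if (vote[s]) t = s;
            int newLo = lo + t * len;
            hi = (t == P - 1) ? hi : newLo + len;     // last subrange absorbs the remainder
            lo = newLo;
        }
        __syncthreads();
    }

    // Fewer than P candidates remain: probe them in parallel.
    int idx = lo + threadIdx.x;
    if (idx < hi && data[idx] == key)
        results[blockIdx.x] = idx;                    // unique keys: at most one match
}

A launch of the form parySearch<<<numQueries, P>>>(d_index, n, d_queries, d_results) would then run one such search per block.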
However, in practice p-ary search outperforms multithreaded binary search by 30% on the same architecture (Sec. 4). The reasons why p-ary search is nevertheless able to achieve better throughput are manifold.
First, memory requests by multiple PEs can be served in parallel thanks to the GPU's wide memory buses (384-bit on our G80), as long as they do not conflict. Without caching, conflicting memory requests on the GPU are serialized. P-ary search produces converging memory strides, with a stride length determined by the size of the subset searched divided by the number of PEs. Memory conflicts can only occur in the last iteration, if the remaining subset contains fewer entries than processors.
Second, multithreaded binary search produces many memory conflicts, as all PEs start with the same pivot element before they diverge. The first iteration is guaranteed to produce p conflicts, one per PE, resulting in serialized memory accesses. While the probability of conflicts decreases from iteration to iteration, binary search is guaranteed to produce a minimum of p memory conflicts, whereas p-ary search produces conflicts only in its last iteration, if at all.
Third, p-ary search has a smaller footprint in terms of register and local memory usage. This allows more instances (threads) of p-ary search than of binary search to run on a streaming multiprocessor. Since these additional instances only pay off once there are enough queries to occupy them, performance gains over binary search only become apparent for larger workloads (Fig. 6).
The data set was designed to resemble database structures, with a maximum size limited by the video card memory, which also has to accommodate queries, results and the search application. The data consists of randomly-generated 15-character null-terminated ASCII strings organized as in a database table. For the evaluation of our index search implementations we use a sorted column with 36 million unique entries, approximately 550 MB in size, resembling an index column of a large database table.
The following experiments resemble a busy OLTP database server with a varying number of requests in its input queues. For the GPU implementation we permanently store the data set in video memory and for the CPU implementation in main memory, as in-memory databases would. Transferring our 550 MB data set from main memory to GPU memory takes 400 ms, but is only required at startup. An efficient GPU accelerated database would require either the memory being shared between CPU and GPU, or updates taking place directly in the GPU memory, such that transfer time is irrelevant.
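A hedged sketch of this one-time staging step (error handling omitted, names illustrative): the sorted column is copied into video memory once at startup and stays resident, so only queries and results cross the PCIe bus afterwards.

#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t numEntries = 36UL * 1000 * 1000;     // ~36M index entries
    const size_t entrySize  = 16;                     // 15-char string + '\0'
    const size_t bytes      = numEntries * entrySize; // ~550 MB

    char *h_index = (char *)malloc(bytes);            // host copy of the sorted column
    // ... load or generate the sorted index into h_index ...

    char *d_index = NULL;
    cudaMalloc((void **)&d_index, bytes);                         // allocate video memory
    cudaMemcpy(d_index, h_index, bytes, cudaMemcpyHostToDevice);  // one-time ~400 ms transfer

    // Per batch: copy queries in, launch the search kernel, copy results back.

    cudaFree(d_index);
    free(h_index);
    return 0;
}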
As the low performance of GPU search implementations on small workloads in the previous experiment already indicates, offloading work to the GPU involves significant startup overhead. Thus we measure the time that each step in the offloading process requires, i.e. API launch time, processing time and copying queries to and results from the GPU (Fig. 7). For a detailed analysis of execution time of database functions on the CPU we would like to refer to the work by Ailamaki et al. [1].
As expected, Figure 7 shows that for small workloads the total time is dominated by the cost of offloading the processing to the GPU; e.g., for a workload of 32 queries approximately 70% of the time is spent launching the API and copying data back and forth. As the workload size increases, this overhead diminishes and execution time becomes the dominant factor. This crossover also marks the point at which offloading to the GPU becomes efficient.
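For illustration, a breakdown of this kind can be obtained with CUDA event timers around each offloading step; the sketch below assumes the buffers and the batchBinarySearch kernel from the earlier sketches and is not our actual measurement harness.

#include <cuda_runtime.h>
#include <cstdio>

// Measure copy-in, kernel execution, and copy-out separately with CUDA events.
void timeOffload(const int *d_index, int n,
                 const int *h_queries, int *d_queries,
                 int *h_results, int *d_results, int numQueries) {
    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0);
    cudaMemcpy(d_queries, h_queries, numQueries * sizeof(int), cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    batchBinarySearch<<<(numQueries + 255) / 256, 256>>>(d_index, n, d_queries,
                                                         d_results, numQueries);
    cudaEventRecord(t2);
    cudaMemcpy(h_results, d_results, numQueries * sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventRecord(t3);
    cudaEventSynchronize(t3);

    float msIn, msKernel, msOut;
    cudaEventElapsedTime(&msIn,     t0, t1);          // host-to-device copy
    cudaEventElapsedTime(&msKernel, t1, t2);          // kernel execution
    cudaEventElapsedTime(&msOut,    t2, t3);          // device-to-host copy
    printf("copy-in %.3f ms, kernel %.3f ms, copy-out %.3f ms\n", msIn, msKernel, msOut);
}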
As p-ary search and our research analyzing memory performance [15] demonstrate, parallel memory accesses are a way not only to hide memory latency but also to scale throughput. With rapidly increasing main memory sizes, it is expected that soon even traditionally I/O-bound database applications like Business Intelligence associated with Data Warehousing can leverage in-memory analytics [21]. With more and more applications becoming memory bound, overcoming the memory wall [23,16] is imperative.
We will continue our research on parallel algorithms, leveraging parallelism to improve the performance of memory-bound applications, using in-memory databases as an example. To validate this concept, we are porting p-ary search to other parallel architectures, including Intel Nehalem, Sun Niagara, and IBM Cell. While implementing p-ary search on multi-core architectures is straightforward, it is also possible to map p-ary search to vector architectures, for example x86 SSE, which is going to double in width in the next generation [8]. The rapid improvements in GPU programmability and efforts like Intel Larrabee [7] indicate a fusion of GPU and CPU architectures, such that algorithms developed today will remain applicable to future architectures. We focus on algorithms designed to scale with core count, such that even in the current era of throughput-oriented computing [13], response time will improve as well.