\section{Accelerating String Search}
The structure of a hardware/software (HW/SW) system can evolve quite dramatically to reflect changing requirements, or during design exploration. In this section, we consider several different implementations of a simple string search application~\cite{mpAlgo}. Each variation represents a step in the iterative refinement process, intended to enhance performance or enable a different software execution environment.
\begin{figure}[!h]
\centering
\includegraphics[width=0.43\textwidth]{platform.pdf}
\caption{\label{Fig:Platform0}Target platform for string search application}
\end{figure}
Figure~\ref{Fig:Platform0} shows the target platform for our example. The pertinent components of the host system are the multi-core CPU, system memory, and PCI Express (PCIe) bus. The software components of our application will be run on the CPU in a Linux environment. Connected to the host is a PCIe expansion card containing (among other things) a high-performance FPGA chip and a large array of flash memory. The FPGA board was designed as a platform to accelerate ``big data'' analytics by moving more processing power closer to the storage device.
\subsection{Initial Implementation}
\begin{figure}[!h]
\centering
\includegraphics[width=0.43\textwidth]{data_accel_logical0.pdf}
\caption{\label{Fig:StringSearch0}Logical components of the string search system}
\end{figure}
The design process really begins with a pure software implementation of the algorithm, but the first attempt we consider is the initial inclusion of HW acceleration shown in Figure~\ref{Fig:StringSearch0}. The search functionality is executed by software running in user space, which communicates with the hardware accelerator through a device driver running in the Linux kernel. The hardware accelerator, implemented in the FPGA fabric, executes searches over data stored in the flash array as directed by the software.
The FPGA has direct access to the massive flash memory array, so if we implement the search kernel in hardware, we can avoid bringing data into the CPU cache (an important consideration if we intend to run other programs simultaneously). By exploiting the high parallelism of the execution fabric as well as application-aware caching of data, an FPGA implementation can outperform the same search executed on the CPU.
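To make the communication path concrete, the following is a minimal user-space sketch of handing a search command to the kernel driver. The device node name (\texttt{/dev/strsearch0}), the command layout, and the use of \textbf{write} as the submission mechanism are illustrative assumptions, not details taken from the actual driver interface.
\begin{verbatim}
/* Sketch: submit a search command to the accelerator through its
 * device driver.  Device node name and command layout are assumed. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct search_cmd {
    uint64_t flash_addr;   /* start address in the flash array */
    uint64_t length;       /* number of bytes to scan */
    char     pattern[64];  /* search string (NUL-terminated) */
};

int main(void)
{
    int fd = open("/dev/strsearch0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct search_cmd cmd = { .flash_addr = 0, .length = 1 << 20 };
    strncpy(cmd.pattern, "needle", sizeof(cmd.pattern) - 1);

    /* Hand the command to the kernel driver, which forwards it to
     * the FPGA over PCIe. */
    if (write(fd, &cmd, sizeof(cmd)) != (ssize_t)sizeof(cmd))
        perror("write");

    close(fd);
    return 0;
}
\end{verbatim}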
\subsection{Multithreading the Software}
The efficient use of flash memory requires a relatively sophisticated management strategy. Our first refinement is based on the observation that there are four distinct tasks which the application software executes (mostly) independently:
\begin{enumerate}
\item Send search commands to the hardware.
\item Receive search results from the hardware.
\item Send commands to the hardware to manage the flash arrays.
\item Receive responses from the flash management hardware.
\end{enumerate}
To exploit the task-level parallelism in our application, we can assign one thread to each of the four enumerated tasks. To further improve efficiency, the two threads receiving data from the hardware put themselves to sleep by calling \textbf{poll} and are woken up only when a message has been received, as sketched below.
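Here is a minimal C sketch of one of the two receiving threads blocking in \textbf{poll} until the driver reports that a message is available; the device node name and message size are assumptions for the example.
\begin{verbatim}
/* Sketch: a receive thread sleeps in poll() until the driver
 * signals that a result message is available. */
#include <fcntl.h>
#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *result_rx_thread(void *arg)
{
    (void)arg;
    int fd = open("/dev/strsearch0", O_RDONLY);
    if (fd < 0) { perror("open"); return NULL; }

    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    char msg[256];

    for (;;) {
        /* Block (thread sleeps) until the driver marks fd readable. */
        if (poll(&pfd, 1, -1) < 0) { perror("poll"); break; }
        if (pfd.revents & POLLIN) {
            ssize_t n = read(fd, msg, sizeof(msg));
            if (n > 0) {
                /* ... hand the search result to the application ... */
            }
        }
    }
    close(fd);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, result_rx_thread, NULL);
    pthread_join(t, NULL);
    return 0;
}
\end{verbatim}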
\begin{figure}[!h]
\centering
\includegraphics[width=0.43\textwidth]{data_accel_logical1.pdf}
\caption{\label{Fig:StringSearch1}Using a mutex to coordinate user-level access to the hardware accelerator}
\end{figure}
With the introduction of multithreading, we will need a synchronization mechanism to enforce coherent access to the hardware resources. Because the tasks which need coordinating are all executed as user-space threads, the access control must be implemented in software as well. As shown in Figure~\ref{Fig:StringSearch1}, a mutex is used to coordinate access to the shared hardware resource among the user-level threads.
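A minimal sketch of this coordination, assuming a pthreads implementation; the helper function and device node are illustrative, not part of the actual application.
\begin{verbatim}
/* Sketch: mutex-protected access to the shared accelerator
 * command interface. */
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t hw_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called by any of the four worker threads before touching the
 * shared device fd; the mutex serializes command submission. */
static void send_hw_command(int fd, const void *cmd, size_t len)
{
    pthread_mutex_lock(&hw_lock);
    write(fd, cmd, len);   /* exclusive access to the accelerator */
    pthread_mutex_unlock(&hw_lock);
}

int main(void)
{
    int fd = open("/dev/strsearch0", O_RDWR);  /* assumed node */
    char cmd[16] = {0};
    if (fd >= 0) send_hw_command(fd, cmd, sizeof(cmd));
    return 0;
}
\end{verbatim}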
\subsection{Refining the Interfaces}
\begin{figure}[!h]
\centering
\includegraphics[width=0.43\textwidth]{data_accel_logical2.pdf}
\caption{\label{Fig:StringSearch2}Movement of functionality from user to kernel space. Software-based coordination between kernel and user processes is prohibitively expensive.}
\end{figure}
Figure~\ref{Fig:StringSearch2} shows a further refinement to our system in which we have reimplemented the Flash Management functionality as a block-device driver. Instead of directly operating on physical addresses, the string search now takes a file descriptor as input and uses a Linux system call to retrieve the file block addresses through the file system. This refinement permits other developers to write applications which can take advantage of the accelerator without any knowledge of the internal details of the underlying storage device. It also enables support for different file systems, as we now use a POSIX interface to generate physical block lists for the storage device hardware. The problem with this refinement is that we no longer have an efficient SW mechanism to synchronize the block device driver running in kernel space with the application running in user space.
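One standard Linux mechanism for this kind of lookup is the FIEMAP ioctl, which returns the physical extents backing a file; the sketch below uses it to build the kind of block list described above. (The text does not name the exact call, so FIEMAP here is an assumption.)
\begin{verbatim}
/* Sketch: translate a file descriptor into physical block extents
 * using the FIEMAP ioctl. */
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define MAX_EXTENTS 32

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    size_t sz = sizeof(struct fiemap)
              + MAX_EXTENTS * sizeof(struct fiemap_extent);
    struct fiemap *fm = calloc(1, sz);
    fm->fm_start = 0;
    fm->fm_length = FIEMAP_MAX_OFFSET;   /* map the whole file */
    fm->fm_extent_count = MAX_EXTENTS;

    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) { perror("ioctl"); return 1; }

    /* Each extent is a physical byte range that can be handed to
     * the storage hardware as part of a block list. */
    for (unsigned i = 0; i < fm->fm_mapped_extents; i++)
        printf("logical %llu -> physical %llu (%llu bytes)\n",
               (unsigned long long)fm->fm_extents[i].fe_logical,
               (unsigned long long)fm->fm_extents[i].fe_physical,
               (unsigned long long)fm->fm_extents[i].fe_length);

    free(fm);
    close(fd);
    return 0;
}
\end{verbatim}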
\begin{figure}[htb]
\centering
\includegraphics[width=0.43\textwidth]{data_accel_logical3.pdf}
\caption{\label{Fig:StringSearch3}Correct interface design removes the need for coordination between user and kernel threads.}
\end{figure}
To solve this problem, we can remove the need for explicit SW coordination altogether by giving each thread uncontested access to its own dedicated HW resources mapped into disjoint address regions, as shown in Figure~\ref{Fig:StringSearch3}. (There will, of course, be implicit synchronization through the file system.)
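As a sketch of the idea, each thread could \textbf{mmap} a disjoint window of the device's register space, after which no software lock is required because no two threads ever share a mapping. The region size, per-thread offsets, and device node are illustrative assumptions.
\begin{verbatim}
/* Sketch: give each thread a private hardware interface by mapping
 * disjoint regions of the device's address space. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE 4096   /* one page of registers per thread */

/* Map the hardware region dedicated to thread 'id'; no lock is
 * needed because the mappings never overlap. */
static volatile uint32_t *map_hw_region(int fd, int id)
{
    void *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, (off_t)id * REGION_SIZE);
    return (p == MAP_FAILED) ? NULL : (volatile uint32_t *)p;
}

int main(void)
{
    int fd = open("/dev/strsearch0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Thread 0 and thread 1 each get a private register window. */
    volatile uint32_t *regs0 = map_hw_region(fd, 0);
    volatile uint32_t *regs1 = map_hw_region(fd, 1);
    if (!regs0 || !regs1) { perror("mmap"); return 1; }

    regs0[0] = 1;   /* e.g., kick off a command on interface 0 */
    regs1[0] = 1;   /* independently, on interface 1 */

    close(fd);
    return 0;
}
\end{verbatim}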