Summary of this article
In many industries, data acceleration is the key to building efficient and intelligent systems. Traditional general-purpose processors do not offer enough performance to help users break through throughput and latency limitations. Many accelerator technologies based on custom chips, graphics processors, or dynamically reconfigurable hardware have emerged to fill the gap, but the key to their success lies in their ability to integrate into a high-throughput, low-latency environment that is easy to develop for. The board-level platform jointly developed by Achronix and BittWare has been optimized for these applications, giving developers a quick way to deploy high-throughput data acceleration.
Increasing demand for distributed acceleration
In cloud computing and edge computing, the industry is eager for high performance that can support various applications. To meet this demand, operators of data centers, network clusters, and edge computing sites are turning to customized accelerator technology.
Users who need a high-performance computing platform commonly turn to dedicated accelerators; they no longer rely on traditional general-purpose CPUs, such as the Intel Xeon series, to keep up with growing data throughput. The core problem with general-purpose CPUs is that although Moore’s Law has continued to double the number of transistors integrated per square millimeter of silicon roughly every two years, it no longer delivers clock rate growth. In addition, parallelism within the CPU quickly reached its ceiling. Other technologies are therefore better suited to new workloads such as machine learning, genome research, mathematical and statistical analysis, speech and image recognition, and data mining and search.
Unlike traditional database-driven applications, these new workloads usually do not map well to the traditional CPU pipeline. For example, some neural network training has been shown to run well on GPUs, where the algorithms can use hundreds of parallel floating-point shader cores to iteratively update a large network over the trillions of steps required. Genome research and data search, on the other hand, involve a large number of comparison steps and operate on low-precision integer data. Although these workloads can be processed by a CPU or GPU, their computational efficiency and energy efficiency on those platforms are relatively low. Custom ASIC or FPGA-based accelerators can provide greater throughput at lower power consumption, because they let designers build dedicated circuits optimized for these operations and data types.
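To make the point about comparison-heavy, low-precision integer workloads concrete, the following is a minimal sketch in Python of comparing two DNA sequences packed at two bits per base. The encoding table and the function names are assumptions chosen for illustration; they are not taken from any Achronix or BittWare design. A dedicated circuit could perform the same XOR-and-count step across very wide words every clock cycle, which is the kind of operation that maps poorly to a general-purpose pipeline.

```python
# Hypothetical 2-bit encoding for the four DNA bases (an assumption for illustration).
BASE_CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_bases(sequence: str) -> int:
    """Pack a DNA string into one integer, two bits per base."""
    packed = 0
    for base in sequence:
        packed = (packed << 2) | BASE_CODES[base]
    return packed


def count_mismatches(a: str, b: str) -> int:
    """Count positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    # XOR leaves a non-zero 2-bit lane wherever the bases differ.
    diff = pack_bases(a) ^ pack_bases(b)
    # Fold each 2-bit lane down to its low bit, then keep only the low bits.
    low_bit_mask = int("01" * len(a), 2) if a else 0
    lanes = (diff | (diff >> 1)) & low_bit_mask
    return bin(lanes).count("1")


print(count_mismatches("ACGT", "ACGA"))
```

The whole comparison reduces to a handful of bitwise operations on narrow integer fields, with no floating-point arithmetic involved.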
Hyperscale data center operators in fields such as Internet search and social media have adopted accelerators to keep their server workloads running efficiently. Voice response systems are now part of daily life and are supported by artificial intelligence algorithms running on a combination of traditional blade servers and custom accelerators. As demand for applications based on machine learning and data mining continues to grow, a large number of enterprise users are turning to accelerator-based solutions to keep up. According to a forecast by Research and Markets, the market for data center accelerators alone will grow from US$2.8 billion in 2018 to US$21.2 billion in 2023, a compound annual growth rate of nearly 50%.
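The growth rate quoted above can be checked with the standard compound annual growth rate formula, shown here as a short Python sketch. Only the two market figures and the two years come from the text; the function name is an illustrative choice.

```python
def compound_annual_growth_rate(start_value: float, end_value: float, years: int) -> float:
    """Return the constant yearly growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1 / years) - 1


# Market size in billions of US dollars, as quoted in the text.
cagr = compound_annual_growth_rate(2.8, 21.2, 2023 - 2018)
print(f"{cagr:.1%}")
```

The result comes out just under 50%, which matches the "nearly 50%" figure in the forecast.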
Beyond this growth, the use of accelerators is also set to expand outside the data center. Fields such as virtual reality, autonomous driving, robotics, and Industry 4.0 cannot tolerate the communication latency incurred when information is relayed through remote data centers. More and more computing power will need to be deployed in edge computing racks installed in roadside cabinets, next to mobile base stations, or in campus cabinets.
Data center and edge computing use cases share many common requirements, such as energy efficiency, the ability to adapt to rapid change, and scalability. Energy efficiency is a core requirement for reducing cooling cost and complexity and for minimizing electricity costs. Low-power operation is essential in edge computing equipment, because ambient temperature control at such sites is limited and maintenance needs to be kept to a minimum.
In many areas, rapid change is inevitable and creates new requirements, so applications must be adjusted and reworked as changes occur. This is not just a matter of updating existing applications; new use cases often challenge the user's ability to react in time. These use cases may require applications that combine different technologies and concepts, such as adding artificial intelligence (AI) functions to mathematical modeling and data mining systems. To cope with these changes, users need accelerator technologies that work well together, with each component able to communicate with the others over a network connection.
Scalability is equally important. As the customer base for a specific service grows, operators need to know that they can easily add capacity. Highly programmable solutions with efficient communication capabilities support this scalability by increasing parallelism. Support for protocols such as 100 Gbps Ethernet and faster links ensures that distributed processing can be used to accommodate growth. For example, edge applications may fall back on cloud resources until the local cabinet is upgraded with additional processing power.
Hardware platform for acceleration
Accelerator hardware can take many forms. The ideal configuration combines PCI Express (PCIe) and high-speed Ethernet connections, with the option of adding custom connections that support topologies such as ring, mesh, and daisy chain to meet the application's data throughput demands. PCIe support integrates the acceleration engine with the host processor and with other accelerators through a memory-mapped interface. The ability to exchange data through shared structures over an interface such as PCIe can greatly simplify the development of distributed applications.
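As a rough illustration of exchanging data through a shared structure in a memory-mapped region, the Python sketch below uses an anonymous memory mapping as a stand-in for a PCIe memory window. The descriptor layout, status values, and function names are all hypothetical; a real card would define its own register map and would be accessed through its driver rather than through an anonymous mapping.

```python
import mmap
import struct

# Hypothetical descriptor layout (little-endian): command id, payload length, status.
DESCRIPTOR = struct.Struct("<III")
PAYLOAD_OFFSET = DESCRIPTOR.size
STATUS_IDLE, STATUS_READY = 0, 1


def host_post_job(region: mmap.mmap, command_id: int, payload: bytes) -> None:
    """Host side: write the payload first, then publish the descriptor."""
    region[PAYLOAD_OFFSET:PAYLOAD_OFFSET + len(payload)] = payload
    DESCRIPTOR.pack_into(region, 0, command_id, len(payload), STATUS_READY)


def accelerator_fetch_job(region: mmap.mmap):
    """Accelerator side (simulated): read the job, then mark the slot idle."""
    command_id, length, status = DESCRIPTOR.unpack_from(region, 0)
    if status != STATUS_READY:
        return None
    payload = bytes(region[PAYLOAD_OFFSET:PAYLOAD_OFFSET + length])
    DESCRIPTOR.pack_into(region, 0, command_id, length, STATUS_IDLE)
    return command_id, payload


# An anonymous mapping stands in for a PCIe memory window in this sketch.
shared_region = mmap.mmap(-1, 4096)
host_post_job(shared_region, command_id=7, payload=b"search:ACGT")
job = accelerator_fetch_job(shared_region)
print(job)
```

Both sides agree only on the layout of the structure, so the host and the accelerator logic can be developed independently, which is the simplification the paragraph above describes.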
Ethernet connections operating at 100 Gbps or faster extend the reach further. By using their own Ethernet ports instead of routing packets through the host's main network interface, accelerators can coordinate with each other efficiently. For example, in a distributed storage configuration, each accelerator card can be connected directly to embedded NVMe (Non-Volatile Memory Express) storage modules, and the independent search engine in each module can use messages sent over its Ethernet connection to locate data scattered across multiple nodes, making coordination straightforward.
Whether used as the main acceleration technology or alongside GPUs and other technologies, FPGAs are well suited to the needs of data center and edge computing applications. A key advantage of an FPGA is that it can be programmed in-system to implement a wide variety of digital circuits. Software can select the configuration bitstream for the target application and send it to configure the FPGA. By loading a new configuration into the logic array on the device, the FPGA can be updated dynamically to take on new tasks as needed. This programmability creates software-defined hardware, which lets users dynamically change not only their applications but also the hardware that supports them. Combining hardware programmability with the ability to connect multiple accelerators gives users great flexibility.
Many computing users have recognized the power of FPGAs in acceleration applications. For example, Microsoft's Project Catapult uses FPGAs to build accelerators for its search services, and its Project Brainwave uses FPGAs for high-speed artificial intelligence inference. Amazon offers FPGAs in the cloud through its F1 service, which makes it easy for remote users to deploy the technology.
FPGA acceleration has also been established in other fields for some time. For example, FPGA logic arrays have been used for many years for radar processing in the military and aerospace fields and for real-time imaging in the medical field. As industry adopts concepts such as real-time machine and equipment health monitoring as part of the move toward Industry 4.0, users can turn to FPGAs to improve the quality and responsiveness of their algorithms.
Compared with GPU-based data acceleration, FPGA-based implementations usually benefit from lower latency and higher energy efficiency. A key problem with GPUs is that their achieved computational efficiency is usually only a small fraction of their theoretical throughput. Because GPUs are optimized for 3D graphics rendering, with an execution pipeline designed around high data reuse, their shader cores typically run out of a relatively small local memory. Streaming workloads offer fewer opportunities for data reuse, which means that this memory must be refilled with new data more often, and that costs processing time. The cache-oriented memory subsystem in a CPU suffers from similar problems. An FPGA can implement a complete pipeline through which data flows freely, so it can deliver much higher computational efficiency than a GPU or CPU. For example, benchmarks for genomic research applications show that FPGA-based hardware can deliver an 80-fold speedup over CPU-based implementations.
In high-performance computing and cloud computing environments, architects are turning to FPGA acceleration to avoid bottlenecks in other parts of the system. By shifting more work to the storage subsystem itself, data center users can greatly improve efficiency. Database acceleration, data analysis, and other forms of processing suited to computational storage can be deployed on the accelerator along with low-level service functions such as encryption, deduplication, and secure erasure coding.
With the spread of concepts such as software-defined networking (SDN) and network function virtualization (NFV), blade servers are playing a more important role in communication management within and between data centers. However, as line rates rise to 100 Gbps and beyond, the processing burden on Xeon-class server processors becomes very heavy, and data center operators are keen to offload many SDN functions to nearby accelerator cards. In the emerging architecture, general-purpose server CPUs handle exception events, while accelerators handle the bulk of the network traffic. When new requirements, applications, and security threats emerge, FPGAs can be updated with new algorithms and network protocol processing, which makes them an ideal platform for network acceleration.
Implementing effective acceleration
The first accelerators adopted by hyperscale users such as Amazon, Facebook, and Microsoft were all heavily customized designs. These companies have the economies of scale needed to create their own board designs, whether based on their own application-specific integrated circuits (ASICs) or on off-the-shelf FPGAs and GPUs. For enterprise data center and edge computing users, custom chip-level design of this kind is hard to justify in terms of cost and time. However, it is not necessary to design custom ASICs and boards. The demand for standard interfaces such as Ethernet and PCIe makes standard board-level products not only feasible but desirable.
As a long-term supplier of hardware acceleration products, BittWare has been designing FPGA-based boards in PCIe form factors for customers in many fields, from high-performance computing to cloud acceleration to instrumentation, and has accumulated extensive experience in doing so. Now, as a subsidiary of the Molex Group, BittWare can take advantage of the group's global supply network and deep relationships with server vendors such as Dell and HP Enterprise. BittWare is the only significant volume supplier that works with multiple mainstream FPGA vendors and can meet the qualification, validation, product lifecycle management, and support needs of enterprise customers who want to deploy FPGA accelerators at scale for mission-critical applications.
In these applications, an important differentiator for BittWare is the extensive software support it provides for its FPGA-based accelerators. Each accelerator card comes with driver software for Linux and Windows, so it can be integrated quickly into a wide range of systems through PCIe and Ethernet connections. In addition to supporting communication between the host CPU and the accelerator card, the driver also provides access to the embedded firmware on the card. This firmware handles many management and self-test functions.
These firmware functions allow the FPGA to be reconfigured as new functions are required, and they also monitor power consumption, voltage, and temperature. If cooling in the host system fails, the firmware, acting as a supervisor, can shut down the accelerator card to avoid thermal overload. In addition, the software package includes a range of reference designs that let developers quickly build configurations to test the functions of the accelerator card and start work on their own applications.
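The supervisory behavior described above can be sketched as a simple threshold check. The Python example below is only an illustration of the idea: the threshold values, class name, and function name are hypothetical and do not describe BittWare's actual firmware or its interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThermalLimits:
    """Hypothetical thresholds; real limits would come from the card's documentation."""
    warn_celsius: float = 85.0
    shutdown_celsius: float = 100.0


def supervise(readings_celsius, limits=ThermalLimits()):
    """Return the action taken for each reading, stopping at the first shutdown."""
    actions = []
    for temperature in readings_celsius:
        if temperature >= limits.shutdown_celsius:
            actions.append("shutdown")
            break  # the card is powered down, so later readings are not processed
        if temperature >= limits.warn_celsius:
            actions.append("warn")
        else:
            actions.append("ok")
    return actions


print(supervise([62.0, 88.5, 101.2, 70.0]))
```

Running the check on the card itself, rather than on the host, means the protection still works when the host's own cooling or software has failed.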
For its latest generation of accelerator cards, BittWare works closely with Achronix. Achronix is the only FPGA supplier that offers both standalone FPGA chips and embedded FPGA (eFPGA) semiconductor intellectual property (IP). The VectorPath™ S7t-VG6 accelerator card uses Achronix’s Speedster®7t FPGA, which is built on a 7 nm process and combines many capabilities. It not only provides high-throughput data acceleration internally, but also supports the highly distributed, networked architectures required by current systems, from machine learning to advanced instrumentation.
Figure 1: VectorPath S7t-VG6 accelerator card
Software-friendly hardware provides maximum flexibility
By providing direct support for the distributed architecture, the Speedster7t FPGA chip used in the VectorPath S7t-VG6 accelerator card marks a major change from the traditional FPGA architecture. It makes it easier for software-oriented developers to build customized processing units. This innovative new architecture is completely different from traditional FPGAs produced by vendors such as Intel and Xilinx. The design focus of traditional FPGAs is not on data acceleration.
In designing the Speedster7t architecture, Achronix created an FPGA that maximizes system throughput while also improving ease of use for computer architects and developers. A key difference from traditional FPGA architectures is that the Speedster7t FPGA includes an innovative two-dimensional network on chip (2D NoC), which moves data between the processing units in the logic array and the various on-chip high-speed interfaces and memory ports.
Traditional FPGAs require users to design circuits that connect their accelerators to high-speed Ethernet or PCIe data ports and to memory ports. A typical system consists of multiple accelerators connected to multiple high-speed ports. For example, Figure 2 illustrates a scenario in which two accelerators are connected to two memory ports to share a memory space. This scenario uses FIFOs to manage the clock domain crossing (CDC) between the memory clock and the FPGA clock. In addition, a switching function is needed in the FPGA logic fabric to manage addressing, arbitration, and backpressure. In traditional FPGAs, this function consumes a large amount of FPGA resources, and its complexity is enough to reduce system performance and complicate timing closure.
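To give a sense of what the arbitration part of such a switching function has to do, here is a minimal behavioral sketch in Python of round-robin arbitration between requesters that share one memory port. It is a software model written for illustration, not hardware description code and not Achronix's implementation; the function name and the one-grant-per-cycle assumption are hypothetical. Requests that are not granted simply wait in their queue, which is the backpressure the text refers to.

```python
from collections import deque


def arbitrate(request_queues, cycles):
    """Grant one request per cycle to a shared port, rotating priority fairly.

    request_queues: list of deques, one per accelerator.
    Returns a list of (cycle, requester_index, request) grants.
    """
    grants = []
    pointer = 0  # index of the requester with highest priority this cycle
    count = len(request_queues)
    for cycle in range(cycles):
        for offset in range(count):
            index = (pointer + offset) % count
            if request_queues[index]:
                grants.append((cycle, index, request_queues[index].popleft()))
                pointer = (index + 1) % count  # the winner gets lowest priority next
                break
    return grants


queues = [deque(["a0", "a1", "a2"]), deque(["b0"])]
for grant in arbitrate(queues, cycles=4):
    print(grant)
```

In a real FPGA design this logic must also be combined with address decoding and clock domain crossing FIFOs, and it must meet timing at the port's clock rate, which is where the resource cost and timing-closure difficulty come from.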
Achronix takes a software-friendly approach to hardware, in which Ethernet and other high-speed I/O ports can be connected easily to custom accelerator functions through the two-dimensional network on chip (2D NoC). With the Speedster7t NoC, designers no longer need to build CDC and switching functions to connect accelerators to high-speed data or memory ports. Simply attaching these functions to the NoC eliminates the connection problems, which simplifies the design, reduces FPGA resource consumption, improves performance, and simplifies timing closure.
Figure 2: Challenges faced by traditional FPGA design
Figure 3: Speedster7t two-dimensional on-chip network supports software-friendly hardware
To achieve high-performance arithmetic, each Speedster7t device has a large array of programmable compute elements, organized into machine learning processor (MLP) blocks. The MLP is a highly configurable, compute-intensive block that can perform up to 32 multiply-accumulate (MAC) operations per cycle. In an accelerator-centric design, the MLP blocks allow resources to be shared effectively between fully programmable logic and hard-wired arithmetic units.
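The figure of up to 32 MAC operations per cycle can be related to a concrete workload with a small, idealized model. The Python sketch below processes a dot product in chunks of 32 multiply-accumulates and counts the chunks as cycles. It assumes a single block running at its quoted upper bound and ignores pipeline latency, data movement, and number formats, so it is a back-of-the-envelope model rather than a description of the actual MLP.

```python
MACS_PER_CYCLE = 32  # upper bound per MLP block quoted in the text


def dot_product_in_chunks(weights, activations, macs_per_cycle=MACS_PER_CYCLE):
    """Compute a dot product chunk by chunk and count idealized cycles."""
    if len(weights) != len(activations):
        raise ValueError("vectors must have the same length")
    accumulator = 0
    cycles = 0
    for start in range(0, len(weights), macs_per_cycle):
        end = start + macs_per_cycle
        # One idealized cycle: up to macs_per_cycle multiplies feeding one accumulator.
        accumulator += sum(w * a for w, a in zip(weights[start:end], activations[start:end]))
        cycles += 1
    return accumulator, cycles


result, cycles = dot_product_in_chunks(list(range(100)), [1] * 100)
print(result, cycles)
```

For a 100-element vector the model needs four cycles, since the last chunk is only partly filled; sizing vectors to multiples of the block width avoids that waste.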
Although some FPGAs use HBM2 memory, in which the FPGA and memory are assembled into an expensive 2.5D package, the Speedster7t family uses the standard GDDR6 memory interface. This interface provides the highest performance available from off-chip memory today at a significantly lower cost, making it easier for teams to implement accelerators with high-bandwidth memory arrays. Each GDDR6 memory controller supports a bandwidth of 512 Gbps. The VectorPath S7t-VG6 accelerator card provides eight banks of GDDR6 memory, for a total memory bandwidth of about 4 Tbps. In addition, the board has a DDR4 interface, which can be used for data that is accessed less often or does not need GDDR6 throughput.
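The aggregate bandwidth figure follows directly from the per-controller figure. The short Python calculation below uses only the two numbers given in the text (512 Gbps per controller and eight memory groups); the variable names are illustrative.

```python
GDDR6_CONTROLLER_GBPS = 512   # per-controller bandwidth quoted in the text
GDDR6_CONTROLLER_COUNT = 8    # memory groups on the card, as quoted in the text

total_gbps = GDDR6_CONTROLLER_GBPS * GDDR6_CONTROLLER_COUNT
total_tbps = total_gbps / 1000
total_gigabytes_per_second = total_gbps / 8  # 8 bits per byte

print(total_gbps, total_tbps, total_gigabytes_per_second)
```

The product is 4,096 Gbps, which is the roughly 4 Tbps total quoted above, or 512 gigabytes per second when expressed in bytes.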
The VectorPath S7t-VG6 accelerator card provides many high-performance interfaces to support distributed architectures and high-speed host communication. The card currently offers PCIe Gen 3 x16 compliance and certification, with a path to Gen 4 and Gen 5 qualification. For Ethernet, the card uses widely supported optical interface modules based on the QSFP-DD and QSFP56 standards and can handle line rates of up to 400 Gbps.
An OCuLink expansion port at the other end of the accelerator card supports many other low-latency application scenarios. For example, the OCuLink port can connect the card to various peripheral devices, such as NVMe storage arrays for computational storage or database acceleration. Compared with the PCIe interface to the host processor, the OCuLink connection can be a better choice because it provides a highly deterministic link that eliminates system-level latency and jitter. The OCuLink port can also bring in additional network connections, making it possible to support port types other than QSFP-DD or QSFP56.
Figure 4: VectorPath’s network and storage interfaces
The front panel of the VectorPath S7t-VG6 accelerator card also includes multiple clock inputs, which are usually required when synchronizing several accelerator cards. The two SMB clock input connectors accept 1 PPS and 10 MHz clock inputs. These inputs pass through a jitter cleaner before they enter the FPGA. Once inside the FPGA, the clocks can be multiplied or divided to the frequency required by a specific application.
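As a worked example of multiplying and dividing a reference clock, the Python sketch below reduces the ratio between a 10 MHz reference and a target frequency to its simplest multiply and divide factors. The 10 MHz reference comes from the text; the 156.25 MHz target is only an example value, and the function name is hypothetical. Real clock generators place further limits on the permitted factors, which this sketch does not model.

```python
from fractions import Fraction


def clock_ratio(reference_hz: int, target_hz: int) -> Fraction:
    """Return target/reference in lowest terms: numerator multiplies, denominator divides."""
    return Fraction(target_hz, reference_hz)


ratio = clock_ratio(10_000_000, 156_250_000)  # example target, not from the source
print(f"multiply by {ratio.numerator}, divide by {ratio.denominator}")
```

Because every card derives its clocks from the same external reference, cards configured with the same ratio stay frequency-locked to one another.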
The card can be expanded further through its general-purpose digital I/O port. This port supports single-ended 3.3 V connections and low-voltage differential signaling (LVDS), allowing custom signals such as external clocks, triggers, and dedicated I/O to connect directly to the Speedster7t FPGA. The expansion port can also be used to retrofit the VectorPath accelerator card into legacy hardware.
Figure 5: VectorPath clock input and GPIO
Suitable for small and large volumes
The design of the VectorPath S7t-VG6 accelerator card attends to every detail, including support for passive and active air cooling as well as liquid cooling. In addition, BittWare and Achronix ensure long-term supply and support for fields that require longer product life cycles, such as medical care. In these markets, the short product life cycle of GPU-based PCIe accelerator cards conflicts with the need for system service support lasting more than 10 years.
For larger volumes, especially in scenarios such as edge computing, customers can use BittWare’s cost reduction program to simplify the hardware so that the design supports only the I/O options they need. BittWare can also provide the circuit board design files together with rights to use the software and drivers that accompany the VectorPath S7t-VG6 accelerator card. With Achronix’s Speedcore eFPGA IP, customers can go further and move to custom system-on-chip (SoC) devices, building their own SoC that includes Speedster7t programmability but has the cost structure of an ASIC.
To simplify development and deployment, BittWare can supply the VectorPath S7t-VG6 accelerator card pre-integrated in a multi-core server through its TeraBox platform. Form factors range from 2U to 5U. The TeraBox rack-mounted chassis can accommodate up to 16 BittWare PCIe accelerator cards and is managed by dual Intel Xeon processors. As a complete solution, TeraBox gives customers the fastest route to getting FPGA development up and running. With the BittWorks II and FPGA Devkit software, users can start development on TeraBox immediately. Alternatively, customers can purchase pre-configured servers that include BittWare accelerator cards from Dell and HP Enterprise.
Figure 6: Deployment of TeraBox platform
Because users need data acceleration in a wide variety of applications, BittWare and Achronix have created a highly flexible, easy-to-deploy engine that can be used on its own or as part of a large heterogeneous processing array. As the core chip of the accelerator card, the Speedster7t FPGA lets developers build high-throughput applications that take full advantage of programmable logic, PCIe, and Ethernet connections of up to 400 Gbps. BittWare’s software and support ensure that these developers can start work as soon as the card is installed. The flexibility of the FPGA and the Speedster7t NoC means that these accelerator cards can maximize their service life as applications change and evolve.