General-purpose AI chips: a breakthrough of the “memory wall” bottleneck is just around the corner

At present, the artificial intelligence (AI) industry is transitioning from infancy toward maturity, and its applications are still exploratory, which is why dedicated AI chips keep emerging in an endless stream. But an ASIC built for one specific application scenario may already be obsolete by the time it ships. That is not to say ASICs are infeasible everywhere; different links in the application chain differ. On the device side, where the scenario is fixed and clearly defined, an ASIC-based product can work well. The closer one moves toward the cloud, however, the more the applications change, and under such churn it is hard to keep any particular ASIC-based processor deployed. Meanwhile the cloud, the cloud edge, and the enterprise market all place very high demands on computing power. General-purpose AI processors have therefore become the more reasonable choice.

Compared with dedicated AI chips, general-purpose AI processors cover a broader range of applications and represent the direction in which AI hardware is heading. In this field, GPUs and CPUs are currently the most popular.

As application requirements expand both broader and deeper, GPUs are running into increasingly obvious bottlenecks in the AI field. GPUs and CPUs are, after all, traditional processors that were not designed specifically for AI computation. They were adequate in the initial stage, but in the second, third, and subsequent stages of development, faced with more complex models and techniques, the limitations of their computing architectures begin to show.

It is against this backdrop that the IPU appeared. The processor was created by the British start-up Graphcore to serve the new computational demands of machine intelligence. The more than 1,200 processor cores in its first-generation IPU can each work on a completely independent task while communicating with one another, supporting full multiple-instruction, multiple-data (MIMD) parallelism, a basic requirement of next-generation machine intelligence.

At the Zhongguancun Forum a few days ago, Graphcore co-founder and CEO Nigel Toon and Graphcore Senior Vice President and General Manager of China Lu Tao were invited to speak at the Zhongguancun Cloud Forum and the Global Science and Technology Youth Forum.

According to Nigel Toon, the IPU supports large models with efficient sparse computation in both training and deployment. It not only drives innovation but also makes these new models practical to deploy, and more efficient computation lowers total system cost. Users can run training and inference on the same IPU hardware and can flexibly change the number of IPUs attached to each CPU.

Overall, Graphcore’s business divides into three parts: 1. the IPU processor, designed from scratch for AI; 2. the Poplar SDK and development tools; 3. IPU platforms, such as the IPU-Machine, IPU servers that can be purchased through Inspur and Dell, and the IPU-Pod64, which supports large-scale horizontal scaling.

In July of this year, Graphcore released the second-generation IPU (Mk2 IPU), an AI processor built on TSMC’s 7 nm process that integrates 59.4 billion transistors on an 823 mm² die. The Mk2 IPU offers 250 TFLOPS of AI compute and 900 MB of in-processor storage, with 1,472 independent processor cores running nearly 9,000 independent parallel threads (six threads per core, 8,832 in total). Compared with the first-generation IPU (Mk1 IPU), system-level performance is more than 8 times higher.

The company also launched the IPU-Machine M2000 (IPU-M2000), a slim data-center blade that provides 1 PFLOPS of AI compute (four Mk2 IPUs at 250 TFLOPS each) and carries IPU-Fabric, a built-in network architecture for scaling AI workloads out horizontally. Whether you are a start-up that needs only one IPU-M2000 or a cloud company that wants to connect thousands of IPU-M2000s together, the IPU-M2000 can meet your needs.

Technology Highlights

Compared with competing products, the IPU has many bright spots in storage, versatility, software support, and ecosystem.

In terms of storage, a GPU performing AI computation relies on HBM, which can reach a bandwidth of 1.6 TB/s and a capacity of 40 GB. Graphcore puts forward an innovative concept: IPU Exchange Memory. According to Lu Tao (Jason Lu), Senior Vice President and General Manager of Graphcore China, IPU Exchange Memory combines on-chip storage with streaming storage; an IPU-M2000 system can provide 180 TB/s of bandwidth and 450 GB of capacity, a very large improvement over the GPU in both respects.

Specifically, the IPU Exchange Memory proposed by Graphcore is composed of two types of storage: In-Processor Memory, the on-chip storage, and Streaming Memory. The Mk2 IPU integrates 900 MB of on-chip storage, whereas a mainstream CPU may carry only tens of MB on chip.

Compared with DDR or HBM, ample on-chip storage can deliver a 50- to 100-fold increase in bandwidth together with lower latency. In the Mk2 IPU the distance between storage and compute is greatly shortened, and the combination of 900 MB of on-chip storage with streaming storage makes large-scale expansion possible.

In a CPU system there is an MMU (Memory Management Unit), one of whose important units is the TLB, which supports paging operations against external memory. Because the Mk2 IPU has 900 MB of on-chip storage, it can expand to hundreds of GB of space through remote streaming storage; it does not need the constant shuttling between 32 MB or 64 MB of on-chip storage and DDR or HBM that a GPU or CPU requires.

Through the combination of on-chip storage and streaming storage in the Mk2 IPU, the IPU-M2000 obtains a total capacity of 450 GB, and on-chip storage bandwidth is also greatly improved.
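Putting the quoted numbers side by side makes the gap concrete. A back-of-the-envelope comparison in Python, using only the figures cited above:

```python
# Figures quoted in this article (treat them as the article's claims).
gpu_hbm_bw_tbs, gpu_hbm_cap_gb = 1.6, 40     # GPU with HBM
m2000_bw_tbs, m2000_cap_gb = 180.0, 450      # IPU-M2000 Exchange Memory

print(f"bandwidth: {m2000_bw_tbs / gpu_hbm_bw_tbs:.0f}x")  # ~112x
print(f"capacity:  {m2000_cap_gb / gpu_hbm_cap_gb:.1f}x")  # ~11x
```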

For comparison with competing products, Lu Tao pointed to one of the IPU’s highlights: “NVIDIA claims that its new TF32 data format can raise FP32 computing throughput. We believe the most standard things are the most open. FP32 is a data format specified by IEEE; developers can run FP32 computation on GPUs, IPUs, and CPUs alike. But developers who adopt NVIDIA’s TF32 data format will be locked in.”
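To make the format difference concrete: TF32 keeps FP32’s 8-bit exponent but only 10 of its 23 mantissa bits. Here is a minimal Python sketch of the resulting precision loss; simple truncation is used for clarity, and the rounding behavior of real TF32 hardware may differ:

```python
import struct

def tf32_round(x: float) -> float:
    """Keep only the top 10 of FP32's 23 mantissa bits, as TF32 does."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~((1 << 13) - 1)  # clear the 13 low mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

x = 1.0 + 2**-12             # exactly representable in FP32
print(x, tf32_round(x))      # 1.000244140625 -> 1.0 under TF32
```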

The IPU also has an advantage in cost-effectiveness. Lu Tao offered a comparison based on training EfficientNet-B4: to match the training throughput of 8 IPU-M2000s, you would need 16 DGX A100 systems, an investment of more than 3 million US dollars plus electricity and other running costs. In other words, obtaining the EfficientNet-B4 training performance of 8 IPU-M2000s with DGX A100s would cost more than 10 times as much.
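As a rough sanity check of that claim, one can plug in publicly quoted 2020 list prices. Both prices below are assumptions taken from launch announcements, not from Lu Tao’s remarks:

```python
# Assumed 2020 launch list prices in USD (not from the article).
IPU_M2000_PRICE = 32_450
DGX_A100_PRICE = 199_000

ipu_total = 8 * IPU_M2000_PRICE    # ~$0.26M for 8 IPU-M2000s
dgx_total = 16 * DGX_A100_PRICE    # ~$3.18M for 16 DGX A100s

# Hardware only; electricity and other running costs excluded.
print(f"{dgx_total / ipu_total:.1f}x")  # ~12.3x, i.e. "more than 10 times"
```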

In terms of software and development-environment support, Graphcore designed the Poplar SDK from scratch around the computational graph as its core abstraction, giving users a completely consistent experience whether they run a single IPU-M2000, a single PCIe card, a thousand IPUs, or even tens of thousands. The Poplar SDK interfaces with industry-standard machine learning frameworks such as TensorFlow, PyTorch, ONNX, and PaddlePaddle.
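As an illustration of what the PyTorch path looks like, here is a minimal sketch using PopTorch, Graphcore’s PyTorch front end to the Poplar SDK. The option names follow PopTorch documentation from this period; verify them against the SDK release you install:

```python
import torch
import poptorch  # ships with Graphcore's Poplar SDK

# An ordinary PyTorch model; nothing IPU-specific in its definition.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)

opts = poptorch.Options()
opts.replicationFactor(2)   # replicate the graph across 2 IPUs
opts.deviceIterations(16)   # device-side loops per host call

# Compile for the IPU. Wrapping the same model with
# poptorch.trainingModel() would target training on identical hardware.
ipu_model = poptorch.inferenceModel(model, options=opts)

# Host batch = micro-batch x deviceIterations x replicationFactor.
out = ipu_model(torch.randn(4 * 16 * 2, 128))
print(out.shape)  # torch.Size([128, 10])
```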

In July of this year, Graphcore released the PopLibs source code. Lu Tao said: “Part of Graphcore’s spirit is to hand power to AI developers so that they can modify, optimize, and innovate on their own.” At the same time, Graphcore is vigorously building the IPU developer community, a very important part of which, the IPU Developer Cloud, is already online in China. The IPU Developer Cloud provides different models such as the Inspur IPU server NF5568M5, the Dell IPU server DSS8440, and the IPU-Pod64, and is currently open for applications.

Developers can obtain IPUs very conveniently in two main ways: through the cloud, currently via Microsoft Azure and Kingsoft Cloud, or by using Dell or Inspur IPU servers to build their own private cloud or on-premises pool of computing resources.

Speaking of openness and innovation, Lu Tao said: “Graphcore’s IPU platform, whether IPU-M2000 or IPU-Pod64, was designed with chips, systems, clusters, and hardware-software integration in mind. Graphcore is committed to empowering AI innovators to make new breakthroughs; those who only follow the GPU route can experiment only within its limits. Providing support for innovators, developers, and researchers is therefore an important driving force behind Graphcore’s R&D. If hardware shackles are keeping your excellent work from reaching its ideal performance, Graphcore welcomes you to explore and experiment on the IPU.”

Customers

Turning to IPU applications, Lu Tao said that the IPU has developed rapidly and drawn considerable attention in five major fields: hyperscale data centers and the Internet, universities and research institutions, medical and life sciences, finance, and automotive. So far, Graphcore has shipped more than 10,000 IPU processors, serving more than 100 different organizations around the world.

“One of our early customers, Carmot Capital, used our product to train its financial-market prediction model and improved performance by 26 times,” Lu Tao said. “Microsoft is using the IPU for chest X-ray imaging to help diagnose pneumonia and COVID-19; it is 10 times faster, with accuracy far better than the GPU achieves.”

Microsoft is an early partner of Graphcore. It not only uses IPU technology for its internal AI workloads but also made IPUs available to users of its Azure cloud computing platform in November 2019, accelerating the work of AI innovators.

In addition, many companies that understand the relationship between innovation and applications, including Microsoft, BMW, Bosch, Dell, and Samsung, have invested in Graphcore.

China Business

Regarding the Chinese market, Nigel Toon said bluntly: “The most direct demand for new technology lies in China. China is a leader in the field of artificial intelligence, and it recognizes that AI innovation is inseparable from long-term economic development. Graphcore’s technology has already begun to support some very successful Chinese companies, and it will help propel China’s fastest-growing and most innovative AI start-ups. In the near future we will be able to talk more about some of Graphcore’s partners in China and share the details of our cooperation.”

Graphcore has also settled on a Chinese name for the company, and it is expanding its China team in order to provide customers with fully localized response and support. Nigel Toon said: “Our goal is to build the company into an important Chinese company.”

As for cooperation with Chinese universities, since the launch of the IPU Developer Cloud, Graphcore has received applications from roughly 30 to 40 top university AI laboratories and research institutions. It has begun discussing cooperation with some of them, and several are already working on the IPU Developer Cloud.

On application scenarios, Lu Tao believes the Chinese market is developing very rapidly in natural-language-processing applications and holds huge potential there. Training such models also demands very high computing power, which makes this segment very important for the IPU.

