The rapid expansion of Artificial Intelligence (AI) and Machine Learning (ML) applications into all aspects of business and everyday life is generating an explosion in Big Data. This advancement comes at a price, however: the need for frequent training, retraining, and hyperparameter tuning leads to longer run times than are now the norm. In addition, AI/ML requires enormous amounts of processing power for model training. Compute-intensive Machine Learning algorithms take extended times to complete on hardware without acceleration features, resulting in poor overall application performance and reduced ROI. With this growing demand for AI/ML applications, enterprise data centers must accommodate budget, space, and IT resources while also reducing this training-time bottleneck. With no end in sight to expanding datasets, nor to compute- and memory-intensive applications, data center managers must rapidly secure the necessary processing horsepower and matching AI/ML platforms to satisfy their business needs. With the proper selection of vendors, these hardware-plus-application solutions will help users identify trends and patterns while improving throughput and training times, leading to a positive cycle of advancement. This paper describes one such AI/ML solution from Supermicro.
SUPERMICRO AI / ML SOLUTION GENERAL DESCRIPTION
As Artificial Intelligence and Machine Learning solutions become more accessible and more mature, global organizations will come to realize the value that these solutions can deliver in solving advanced business challenges. The Supermicro AI/ML solution features a best-in-class hardware platform with the enterprise-ready Canonical Distribution of Kubernetes (CDK) and software-defined storage capabilities from Ceph. Through its reference architecture, the solution integrates network, compute, and storage. The recommended starting implementation is a single rack, with the capability to scale to many racks as required.
AI / ML REFERENCE ARCHITECTURE
The reference architecture is a ready-to-deploy, end-to-end AI/ML solution that includes an AI software stack, orchestration, and containers. The optimized reference design fits machine learning training and inference applications. At a high level, the architecture comprises software, network switches, control, compute, storage, and support services. The reference design shown in Figure 1 contains two data switches, two management switches, three infrastructure nodes that act as foundation nodes for MAAS and Juju, and six cloud nodes. It is built on the Kubernetes platform and provides Canonical-hardened packages for Kubernetes containers and Ceph. Kubeflow provides a machine learning toolkit for Kubernetes.
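The component counts in the reference design can be captured as a small inventory, which is handy when checking a bill of materials against the design. The following Python sketch is illustrative only: the class name RackComponent, the role labels, and the helper count_by_keyword are hypothetical and not part of any Supermicro or Canonical tooling. Only the quantities come from the reference design description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RackComponent:
    role: str   # role named in the reference design (label wording is illustrative)
    count: int  # quantity in the single-rack starting configuration


# Quantities as stated for the reference design in Figure 1.
BASELINE_RACK = (
    RackComponent("data switch", 2),
    RackComponent("management switch", 2),
    RackComponent("infrastructure node", 3),  # foundation nodes for MAAS and Juju
    RackComponent("cloud node", 6),
)


def count_by_keyword(components, keyword):
    """Sum the counts of all components whose role contains the keyword."""
    return sum(c.count for c in components if keyword in c.role)


if __name__ == "__main__":
    print("switches:", count_by_keyword(BASELINE_RACK, "switch"))
    print("nodes:", count_by_keyword(BASELINE_RACK, "node"))
```

The sketch covers the single-rack starting point only. How the counts change as racks are added is not specified here, so it is deliberately left out.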
The key highlights include a certified reference architecture with validated and tested components, racks that scale out from one to many, green Resource Saving servers for the cloud that save hundreds of dollars per server, industry-leading performance, optional consulting and support services, and an optimized solution for certified Intel AI partners. This solution is built and validated on the Supermicro Ultra and BigTwin™ server families. It also uses Supermicro Ethernet switches: SSE-G3648B (management/IPMI traffic switch), SSE-X3648S (10 GbE data network switch), SSE-F3548S (25 GbE data network switch), and SSE-C3632S (40 GbE data network switch). The solution is optimized for performance and designed to provide the highest levels of reliability, quality, and scalability.
As data grows exponentially on the order of terabytes and petabytes, a network infrastructure requires a reliable scale-out storage solution. Ceph is the preferred storage system for achieving a stable, robust network infrastructure. The highly scalable, fault-tolerant storage cluster transforms the network into a high-performance infrastructure by handling users' data throughput and transaction requirements. Additionally, the AI/ML solution comprises dual management switches (IPMI and Kubernetes), dual data switches, three infrastructure nodes, and six cloud nodes. The management switch supports 1 Gbps connectivity and is common to all three networking options: 10 Gbps, 25 Gbps, and 40 Gbps. The data switch is selected to match the chosen option. The 10 GbE and 40 GbE data switches require Cumulus OS, whereas the 25 GbE data switch requires the Supermicro (SMIS) OS.
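The switch options described above amount to a lookup from the chosen data network speed to a switch model and network operating system. The following Python sketch is illustrative only. The function plan_network and the dictionary layout are hypothetical. The model numbers, speeds, and operating system names are those given in this paper. The operating system of the management switch is not stated, so it is omitted.

```python
# Management switch is common to all three networking options (1 Gbps).
MANAGEMENT_SWITCH = {"model": "SSE-G3648B", "speed_gbps": 1}

# Data switch model and required network OS, keyed by data network speed in Gbps.
DATA_SWITCHES = {
    10: {"model": "SSE-X3648S", "network_os": "Cumulus"},
    25: {"model": "SSE-F3548S", "network_os": "SMIS"},
    40: {"model": "SSE-C3632S", "network_os": "Cumulus"},
}


def plan_network(data_speed_gbps):
    """Return the switch selection for one of the three networking options.

    Raises ValueError for a speed that is not one of the documented options.
    """
    if data_speed_gbps not in DATA_SWITCHES:
        supported = ", ".join(str(s) for s in sorted(DATA_SWITCHES))
        raise ValueError(
            f"unsupported data speed {data_speed_gbps} Gbps; choose one of: {supported}"
        )
    data = DATA_SWITCHES[data_speed_gbps]
    return {
        "management_switch": MANAGEMENT_SWITCH["model"],
        "management_speed_gbps": MANAGEMENT_SWITCH["speed_gbps"],
        "data_switch": data["model"],
        "data_speed_gbps": data_speed_gbps,
        "data_switch_os": data["network_os"],
    }


if __name__ == "__main__":
    for speed in sorted(DATA_SWITCHES):
        print(plan_network(speed))
```

Keeping the mapping in one table makes the operating system requirement explicit at selection time, for example when a 25 GbE deployment must be planned around the SMIS OS instead of Cumulus.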
The second-generation Intel® Xeon® Scalable processors delivered approximately 25% higher performance than previous-generation systems in both training and inference in CNN benchmark testing.
BigTwin with the Intel Xeon Platinum 8260L Scalable processor showed improved throughput for both training and inference.