Intelligent Tech Channels Issue 15 | Page 20

ENTERPRISE TECHNOLOGY

Lifecycle of data

This lifecycle applies to any type of parallelised machine learning, not just neural networks or deep learning. Standard machine learning frameworks rely on CPUs instead of GPUs, but the data ingest and training workflows are the same.

Ingest the data from an external source into the training system. Each data point is often a file or object, and inference may already have been run on it. After the ingest step, the data is stored in raw form and is often also backed up in that raw form. Any associated labels may arrive with the data or in a separate ingest stream.

Clean and transform the data and save it in a format convenient for training, linking each data sample to its associated label. This second copy of the data is not backed up, because it can be recomputed if needed.

Explore parameters and models, quickly test with a smaller dataset, and iterate to converge on the most promising models to push into the production cluster.

Training phases select random batches of input data, including both new and older samples, and feed them into production GPU servers for computation to update model parameters.

Evaluation uses a holdout portion of the data, kept out of training, to measure model accuracy.

As seen above, each stage in the AI data pipeline has varying requirements of the underlying storage architecture. To innovate and improve AI algorithms, storage must deliver uncompromised performance for all manner of access patterns: small to large files, random to sequential access, low to high concurrency, and with the ability to scale linearly and non-disruptively to grow capacity and performance. Often, a production pipeline runs alongside an experimental pipeline operating on the same dataset.
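The stages described above can be sketched end to end in a few lines. This is a minimal illustration only: the function names, toy numeric data, and batch sizes are assumptions for the example, not details from the pipeline itself.

```python
# Sketch of the AI data lifecycle: ingest -> transform -> train -> evaluate.
import random

def ingest(source):
    """Stage 1: pull raw samples (here, plain numbers) into the training system."""
    return list(source)  # raw copy; in practice stored and backed up as-is

def transform(raw):
    """Stage 2: clean/normalise each sample and link it to its label."""
    return [(x / 10.0, x % 2) for x in raw]  # (feature, label) pairs

def train(dataset, holdout_fraction=0.2, epochs=3):
    """Stages 3-5: hold out evaluation data, then train on random batches."""
    random.shuffle(dataset)
    split = int(len(dataset) * holdout_fraction)
    holdout, training = dataset[:split], dataset[split:]
    for _ in range(epochs):
        batch = random.sample(training, k=min(4, len(training)))
        # ... feed batch to GPU servers and update model parameters ...
    return training, holdout

raw = ingest(range(20))
training, holdout = train(transform(raw))
```

Note that the holdout split happens before any batch is drawn, which is what guarantees the evaluation data was never used in training.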
Further, the DGX-1 GPUs can be used independently for different models or joined together to train a single larger model, even spanning multiple DGX-1 systems for distributed training. A single shared storage data hub creates a coordination point throughout the lifecycle, without the need for extra data copies among the ingest, preprocessing and training stages. Rarely is the ingested data used for only one purpose; shared storage gives the flexibility to interpret the data in different ways, train multiple models, or apply traditional analytics to it. A centralised data hub in a deep learning architecture increases the productivity of data scientists and makes scaling and operating simpler and more agile for the data architect. For legacy storage systems, this is an impossible design point to meet, forcing data architects to introduce complexity that only slows down the pace of development.

In the first stage, data is ideally ingested and stored on the same data hub, so that the following stages do not require excess data copying. The next two steps can run on a standard compute server, optionally with a GPU; in the fourth and final stage, full production training jobs run on powerful GPU-accelerated servers like the DGX-1. If the shared storage tier is slow, data must be copied to local storage for each phase, wasting time staging data onto different servers. The ideal data hub for the AI training pipeline delivers performance similar to data held in system RAM, while also having the simplicity and performance for all pipeline stages to operate concurrently.

(Source: From Bytes to AI: Why it’s all about the data lifecycle, by Joshua Robinson.)
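The distributed training pattern mentioned above, where GPUs split a batch and then combine their updates, can be illustrated abstractly. The worker count, toy gradient, and averaging step below are illustrative assumptions, not DGX-1 specifics.

```python
# Sketch of data-parallel training: each worker computes a gradient on its
# shard of the batch, then the updates are averaged (an "all-reduce").
def worker_gradient(shard, weight):
    # Toy squared-error gradient for fitting y = 2 * x with one weight.
    return sum(2 * (weight * x - 2 * x) * x for x in shard) / len(shard)

def all_reduce_mean(grads):
    # Stand-in for the collective that combines per-GPU gradients.
    return sum(grads) / len(grads)

def train_step(batch, weight, num_workers=4, lr=0.01):
    shards = [batch[i::num_workers] for i in range(num_workers)]
    grads = [worker_gradient(shard, weight) for shard in shards]
    return weight - lr * all_reduce_mean(grads)

weight = 0.0
batch = list(range(1, 9))
for _ in range(200):
    weight = train_step(batch, weight)
# weight converges toward 2.0, the slope of the target function
```

Because every worker reads a different shard of the same dataset concurrently, this is exactly the access pattern a shared data hub must serve without forcing per-server copies.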