ENTERPRISE TECHNOLOGY
Lifecycle of data
This lifecycle applies to any type of parallelised machine learning, not just neural networks or
deep learning. Standard machine learning frameworks often rely on CPUs instead of GPUs, but the
data ingest and training workflows are the same.
Ingest the data from an external source
into the training system. Each data
point is often a file or object. Inference
may also have been run on this data.
After the ingest step, the data is stored in
raw form and is often also backed up in
this raw form. Any associated labels may
come in with the data or in a separate
ingest stream.
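To make the step concrete, here is a minimal Python sketch of an ingest job; the /mnt/source/incoming drop zone and the /mnt/datahub/raw area on the shared data hub are hypothetical paths chosen for illustration only.

    # Minimal ingest sketch: copy newly arrived files into a raw, untouched
    # area on the shared data hub. Transforms happen in a later stage.
    import shutil
    from pathlib import Path

    SOURCE = Path("/mnt/source/incoming")   # hypothetical external drop zone
    RAW = Path("/mnt/datahub/raw")          # hypothetical shared data hub area

    def ingest():
        RAW.mkdir(parents=True, exist_ok=True)
        for item in SOURCE.glob("*"):
            if item.is_file():
                shutil.copy2(item, RAW / item.name)   # keep original bytes as-is

    if __name__ == "__main__":
        ingest()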
Clean and transform the data and
save it in a format convenient for training,
including linking each data sample with its
associated label. This second copy of the
data is not backed up because it can be
recomputed if needed.
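A minimal sketch of this transform step might pair each raw sample with its label and write a training-friendly manifest; the labels.csv layout (filename and label columns) and the output location are assumptions for illustration.

    # Transform sketch: link each raw sample to its label in one manifest file.
    import csv
    import json
    from pathlib import Path

    RAW = Path("/mnt/datahub/raw")
    LABELS = RAW / "labels.csv"                       # assumed: filename,label rows
    OUT = Path("/mnt/datahub/prepared/manifest.jsonl")

    def prepare():
        OUT.parent.mkdir(parents=True, exist_ok=True)
        with open(LABELS) as f, open(OUT, "w") as out:
            for row in csv.DictReader(f):
                sample = RAW / row["filename"]
                if sample.exists():
                    # One JSON record per sample keeps data and label linked.
                    out.write(json.dumps({"path": str(sample),
                                          "label": row["label"]}) + "\n")

    if __name__ == "__main__":
        prepare()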
Explore parameters and models,
and quickly test with a smaller dataset
and iterate to converge on the most
promising models to push into the
production cluster.
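An exploration pass of this kind can be as simple as sweeping one hyperparameter over a small subset; the sketch below uses scikit-learn with synthetic stand-in data purely to show the shape of the loop.

    # Exploration sketch: test a few hyperparameter values on a small dataset
    # before committing a full run to the production cluster.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Stand-in for a down-sampled slice of the prepared dataset.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    for C in (0.01, 0.1, 1.0, 10.0):
        model = LogisticRegression(C=C, max_iter=1000)
        score = cross_val_score(model, X, y, cv=3).mean()
        print(f"C={C}: cross-validated accuracy = {score:.3f}")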
Training phases select random
batches of input data, including both new
and older samples, and feed those into
production GPU servers for computation
to update model parameters.
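The batch-feeding pattern can be sketched in PyTorch with toy tensors standing in for the real dataset; the batch size, model and learning rate here are illustrative only.

    # Training sketch: draw random batches from the full dataset (new and old
    # samples alike) and push each batch to the GPU to update model parameters.
    import torch
    from torch.utils.data import TensorDataset, DataLoader

    device = "cuda" if torch.cuda.is_available() else "cpu"
    data = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))
    loader = DataLoader(data, batch_size=256, shuffle=True)   # random batch selection

    model = torch.nn.Linear(128, 10).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for x, y in loader:
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()                        # parameter update from this batch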
Evaluation uses a holdout portion
of the data, never seen during training,
to measure model accuracy on
unseen samples.
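The holdout idea looks like this in a scikit-learn sketch, again with synthetic stand-in data: a slice is set aside before training and accuracy is reported only on that slice.

    # Evaluation sketch: reserve a holdout split that training never sees.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=0.2, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("holdout accuracy:", accuracy_score(y_hold, model.predict(X_hold)))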
As seen above, each stage in
the AI data pipeline has varying
requirements from the underlying
storage architecture. To innovate and
improve AI algorithms, storage must
deliver uncompromised performance
for all manner of access patterns, from
small to large files, from random to
sequential access, and from low to
high concurrency, with the ability to
scale capacity and performance linearly
and non-disruptively.
Often, there is a production
pipeline alongside an experimental
pipeline operating on the same dataset.
Further, the DGX-1 GPUs can be used
independently for different models or
joined together to train on one larger
model, even spanning multiple DGX-1
systems for distributed training.
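Joining GPUs to train a single model is commonly done with a data-parallel framework; the sketch below uses PyTorch DistributedDataParallel and assumes a torchrun launch with one process per GPU. It is a generic illustration, not a DGX-1-specific recipe.

    # Distributed training sketch, launched with:
    #   torchrun --nproc_per_node=<gpus> train.py
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group("nccl")              # NCCL backend for GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # gradients sync across processes

    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    x = torch.randn(256, 128).cuda(local_rank)
    y = torch.randint(0, 10, (256,)).cuda(local_rank)

    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()
    dist.destroy_process_group()

The same script spans multiple systems when torchrun is given a shared rendezvous endpoint, which is how training stretches beyond a single server.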
A single shared storage data hub
creates a coordination point throughout
the lifecycle without the need for
extra data copies among the ingest,
preprocessing, and training stages.
Rarely is the ingested data used for only
one purpose, and shared storage gives
the flexibility to interpret the data in
different ways, train multiple models, or
apply traditional analytics to the data.
A centralised data hub in a deep learning architecture increases the productivity of data scientists
and makes scaling and operating simpler and more agile for the data architect.
For legacy storage systems, this is
an impossible design point to meet,
forcing data architects to introduce
complexity that only slows the pace
of development.
In the first stage, data is ideally ingested
and stored onto the same data hub so
that subsequent stages do not require excess
data copying. The next two steps can be
done on a standard compute server that
optionally includes a GPU, and then in
the fourth and last stage, full training
production jobs are run on powerful GPU-
accelerated servers like the DGX-1.
If the shared storage tier is slow, then
data must be copied to local storage for
each phase, resulting in wasted time
staging data onto different servers.
The ideal data hub for the AI training
pipeline delivers performance as though
the data were stored in system RAM
while also having the simplicity and
performance for all pipeline stages to
operate concurrently.
(Source: From Bytes to AI: Why
it’s all about the data lifecycle, by
Joshua Robinson.)