In this podcast, we look at how to size storage for artificial intelligence (AI) and machine learning (ML) workloads that can range from batches of large images to many, many tiny files, with Curtis Anderson, software architect at Panasas.
Anderson talks about the ways different AI/ML frameworks pull data into compute and the implications of that for how data is stored.
Anderson also talks about whether to go all-flash on-premise, when to use the cloud, and the benchmarking organisations that can help organisations size their infrastructure for AI/ML.
Antony Adshead: What are the challenges for storage with AI and ML?
Curtis Anderson: Let’s set some context here. The first place to start is that you, the customer, want to accomplish some goal, whether it is identifying cancer cells in an x-ray image or, based on a customer’s purchase history, recommending a product they might want to purchase next.
Your data scientist is going to take that requirement and build a model. The model then leverages a framework, which is a software layer below that. PyTorch and TensorFlow are the two most popular frameworks today. It’s actually the framework that determines how the training data is stored on the storage subsystem.
With PyTorch, for example, if you’re talking about images, it’ll store one JPEG per file, so, say, a million JPEGs in a tree of directories. TensorFlow, on the other hand, will put 1,000 or a couple of thousand images into a much larger file in a format specific to TensorFlow, and will have a much smaller number of large files.
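The two on-disk shapes Anderson describes can be mocked up in plain Python. This is a toy sketch of the layouts only, not real PyTorch or TensorFlow code (TensorFlow's actual container is the TFRecord format, and the length-prefix scheme below is a simplification of it):

```python
import os
import struct
import tempfile

root = tempfile.mkdtemp()

# PyTorch-style layout: one image file per sample, grouped into directories.
for label in ("cat", "dog"):
    os.makedirs(os.path.join(root, "per_file", label))
    for i in range(3):
        path = os.path.join(root, "per_file", label, f"img_{i}.jpg")
        with open(path, "wb") as f:
            f.write(bytes(100))  # stand-in for JPEG bytes

# TensorFlow-style layout: many samples packed into one large container file,
# each record length-prefixed so it can be read back sequentially.
packed = os.path.join(root, "packed.records")
with open(packed, "wb") as f:
    for _ in range(6):
        payload = bytes(100)                      # stand-in for one image
        f.write(struct.pack("<Q", len(payload)))  # 8-byte length prefix
        f.write(payload)

small_files = sum(len(files) for _, _, files in os.walk(os.path.join(root, "per_file")))
print(small_files, os.path.getsize(packed))  # 6 small files vs one 648-byte file
```

The first layout generates millions of small-file metadata operations; the second turns the same data into a few large sequential streams, which is exactly why the two stress storage so differently.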
It’s actually the framework that determines how data is stored on the storage and therefore has a big impact on the performance of the storage. Lots of small files stress the storage in a different way than really high bandwidth coming out of a small number of files.
So, that’s the first order of challenges that AI brings to storage.
Adshead: How do we size storage for different AI/ML workloads?
Anderson: So, that’s where it does get really interesting. Again, I need to set some context, and then I can go through the more direct answer to the question.
When you’re training your neural network model, the framework pulls data from the storage, puts it into memory in what we call a batch, and hands the entire batch across to the GPU [graphics processing unit] to crunch on, to calculate.
While the GPU is crunching on that piece of data, on that batch, the framework is going to read more data to build the next batch. As soon as the first batch is done, the next one gets dumped in the GPU and the framework goes out and reads more data.
Where it gets complicated is that for neural networks to operate correctly, to train correctly, the data needs to be randomised. The data itself isn’t changed, but you pull in this image as part of one batch, and next time you pull in a different image at random as part of the next batch. Each batch picks a different set of images. So, randomness is just part of the requirement.
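The shuffle-per-epoch behaviour Anderson describes can be sketched in plain Python. This is a toy model, not framework code; in PyTorch the same effect comes from a shuffling DataLoader:

```python
import random

dataset = list(range(10))  # stand-ins for 10 training images
batch_size = 4

def epoch_batches(data, size, rng):
    """Yield batches of samples in a fresh random order each call (epoch)."""
    order = list(range(len(data)))
    rng.shuffle(order)  # a new random permutation every epoch
    for i in range(0, len(order), size):
        yield [data[j] for j in order[i:i + size]]

rng = random.Random(0)
epoch1 = list(epoch_batches(dataset, batch_size, rng))
epoch2 = list(epoch_batches(dataset, batch_size, rng))
# Every epoch still sees all ten samples, but grouped into different batches.
```

For storage, the consequence is that reads arrive in effectively random order, so sequential-read optimisations help far less than they would for a conventional streaming workload.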
When the GPU is crunching away on the current batch and the framework is trying to assemble the next batch, if the storage can’t respond fast enough, the GPU finishes its calculation and goes idle, and waits for the next batch to finish building in memory.
Well, that’s an expensive resource. GPUs are not cheap. You don’t want them going idle, so the storage has to be fast enough to keep up with building batches and keeping the GPU fed.
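This overlap of GPU compute and batch loading can be captured in a simplified timing model (pure Python, illustrative numbers only), which shows why slow storage translates directly into GPU idle time:

```python
def training_time(n_batches, gpu_time, load_time):
    """Simplified model: the framework prefetches the next batch while the
    GPU works on the current one, so each step takes the slower of the two."""
    total = load_time  # the first batch must be loaded before anything runs
    for _ in range(n_batches):
        total += max(gpu_time, load_time)  # if load_time wins, the GPU idles
    return total

fast_storage = training_time(100, gpu_time=1.0, load_time=0.5)  # GPU-bound
slow_storage = training_time(100, gpu_time=1.0, load_time=1.5)  # storage-bound
print(fast_storage, slow_storage)  # 100.5 vs 151.5 time units
```

In the second case the expensive GPU sits idle a third of each step, which is the cost Anderson is warning about.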
Again, more complexity. The amount of computation on an image is much larger than you would need on a recommender. With someone’s purchase history, there’s just a lot less data than there is in a large x-ray image.
When the GPU is processing a batch of x-ray images, for example, the storage has a lot more time to pull the data in and build the batch. On a recommender, where the computation per byte of data is small, the storage has to be much faster.
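As a back-of-envelope calculation, the throughput the storage must sustain is roughly the batch size in bytes divided by the GPU's compute time per batch. All numbers below are hypothetical, chosen only to illustrate the contrast Anderson draws:

```python
# Sizing rule of thumb: storage must deliver the next batch in no more time
# than the GPU spends computing on the current one.
def required_mb_per_s(sample_mb, batch_size, gpu_seconds_per_batch):
    return sample_mb * batch_size / gpu_seconds_per_batch

# Large x-ray images, heavy compute per byte: storage gets plenty of time.
xray = required_mb_per_s(sample_mb=30.0, batch_size=16, gpu_seconds_per_batch=2.0)

# Tiny purchase-history records, trivial compute per byte: storage is pressed.
recsys = required_mb_per_s(sample_mb=0.001, batch_size=4096,
                           gpu_seconds_per_batch=0.01)

print(f"x-ray: {xray:.0f} MB/s, recommender: {recsys:.0f} MB/s")
```

Even with far smaller records, the recommender in this sketch demands more sustained bandwidth than the imaging workload, because so little computation happens per byte.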
Most people don’t know what problems they’re going to be working on, so they say, “Fine, we’ll just spend the money and buy an all-flash storage solution and that’ll keep up with anything.” That works perfectly fine, it’s a good answer.
If you start experimenting in the cloud, you click here instead of there and you get an all-flash storage sub-system.
If you have 200TB or 300TB [terabytes] of data that you want to train against, which is the size of most projects these days, then all-flash is an affordable solution. If you have 2PB to 3PB [petabytes] of data, then you are talking serious money and you have to think harder about whether it’s justified or not, so there’s just a bit more complexity there.
If you’ve heard of MLCommons or MLPerf: MLCommons is an industry consortium, a non-profit that helps AI practitioners, and MLPerf is its family of benchmarks. They’re in the process of building a benchmark for storage systems that support AI environments and have published results from different vendors, so that’s a place to get data when comparing storage products.
Adshead: How would you summarise storage requirements for AI/ML workloads?
Anderson: If your problem is small, then just buy the all-flash, if you’re on-prem. Most AI projects start in the cloud because you just rent the GPUs you need for the time you need them. Cloud is not cheap. It’s generally cheaper to purchase and have gear on-prem, but you have to have a committed project that you know is working; you have to go into production on that project before buying gear is justified.
The simple answer is, buy all-flash storage products and you’ll be able to keep up. When problems get larger, you need a lot more detail, a deeper analysis, to not end up wasting money. That’s where the MLPerf stuff comes in.
The short version is, just start with an all-flash product.