If you’d asked an IT professional whether object storage is any good for databases, the answer over most of the past decade or so would have been a resounding “no”.
That response would have seemed obvious because databases, especially in busy, mission-critical environments, handle a high volume of changes from many users, often simultaneously or near-simultaneously.
Databases need IOPS (input/output operations per second) and they need some way of enforcing consistency of data, and block-access SAN storage has long been the way to deliver both.
However, that “no” answer may have changed over the past few years to something more like a “maybe”.
With the rise of databases that use in-memory storage, IOPS are in plentiful supply close to compute, so it has become possible for object storage to be the site for bulk storage of datasets, with segments moved to memory during processing.
But does this constitute database operations using object storage?
SANs good for IOPS but capacity a bottleneck
For a couple of decades, SAN block access storage was the go-to for running databases and the enterprise applications built on them. IOPS was king, as potentially numerous I/O requests hit the database from client systems.
To cope, SANs got ever larger and more performant, with the eventual mainstream adoption of very quick – in IOPS terms – flash media. NetApp’s current advice on storage sizing for SAP HANA, for example, is in terms of (presumably on-site) SAS HDDs and flash capacity.
But while SANs can deliver in terms of IOPS – even if that becomes partially redundant as application databases go in-memory – they still have a limit, and that limit is scale.
Here, object storage excels. It cannot produce the kind of IOPS that a SAN can provide to database operations, but it can give throughput in large volumes.
There are two good reasons why that’s a big deal right now. First, the volumes of data being analysed have grown ever larger for several years. With block or file access storage starting to become unwieldy above 100TB, object storage looks like a good bet with its ability to scale to PBs.
Second, object storage has become the de facto storage mode for the cloud, adding to its prevalence both off-site and on-site. In addition, as part of the backdrop, we’ve seen the emergence of in-memory database-based applications such as SAP HANA delivered in the cloud.
A big benefit of cloud object storage is the low cost. On Amazon, for example, file or block storage can cost 10x more than object storage.
Object storage, S3 as bulk storage for databases, AI/ML
With the emergence of object storage, and in particular S3, we have seen the rise of its use as bulk storage that can be delivered to in-memory database work and for AI/ML analytics.
Alongside this trajectory has been the emergence of databases that will work with S3 (or S3-compatible storage) as a data store, such as MongoDB, CockroachDB, MariaDB and Teradata. Cloud data warehousing phenomenon Snowflake is also S3-based.
But we need to remember the limitations. SANs work well with databases because user requests read from and write to parts of files directly, and locking mechanisms exist to limit the potential for clashes between concurrent requests.
Object storage cannot serve as a data source for databases in the way files can – there are no blocks within an object that can be manipulated in place – but it can store data at much greater capacities.
The architecture for database work with object storage is different: data is staged to memory for processing and then written back to the object backing store.
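That stage-then-write-back pattern can be sketched in a few lines of Python. Here a plain dict stands in for the object store – objects are opaque blobs addressed by key, read and written only as whole units – and the key name and sample data are illustrative, not from any real system.

```python
import csv
import io

# A dict stands in for an object store: whole-object GET and PUT only.
object_store = {
    "analytics/orders.csv": b"id,amount\n1,10\n2,25\n3,40\n",  # illustrative object
}

def stage_to_memory(store: dict, key: str) -> list[dict]:
    """GET the whole object and parse it into in-memory records."""
    blob = store[key]  # whole-object read; no sub-file access
    return list(csv.DictReader(io.StringIO(blob.decode("utf-8"))))

def write_back(store: dict, key: str, rows: list[dict]) -> None:
    """Serialise the processed records and PUT the whole object back."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    store[key] = buf.getvalue().encode("utf-8")  # whole-object write replaces the object

# All work happens in memory between the GET and the PUT.
rows = stage_to_memory(object_store, "analytics/orders.csv")
for row in rows:
    row["amount"] = str(int(row["amount"]) * 2)  # example in-memory processing
write_back(object_store, "analytics/orders.csv", rows)
```

The point of the sketch is structural: the object store only ever sees whole objects going in and out, while all record-level manipulation happens in memory.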
Does that make object storage more like an archive in these cases? Possibly, although suppliers such as Minio have taken on the challenge of providing rapid access to datasets stored in object storage for database and analytics use cases. Minio calls it “warm” storage, so it’s clearly not blisteringly quick in I/O terms and majors on throughput.
S3 Select – S3 meets SQL
The idea behind S3 Select is that, given a very large object store, you can choose just the subset of data you want and work on that. You use SQL query statements to filter the contents of a stored object and pull only the data you need. This cuts data egress costs and gives you a smaller dataset and lower latency.
S3 Select works on objects stored in CSV, JSON and Apache Parquet formats, returning results as CSV or JSON, and you can perform queries from the AWS console, command line or via APIs, meaning it’s quite possible to select data from S3 for analysis on faster, local compute.
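Via the API, such a query can be sketched with boto3’s `select_object_content` call. The bucket, key and SQL expression below are hypothetical; the call itself, and its serialization parameters, are boto3’s real S3 interface.

```python
# Sketch of assembling an S3 Select request for boto3's
# select_object_content(). Bucket, key and query are hypothetical.
def build_select_request(bucket: str, key: str, expression: str) -> dict:
    """Assemble the keyword arguments for s3.select_object_content()."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # Input object is CSV with a header row; results come back as CSV.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }

request = build_select_request(
    "example-analytics-bucket",   # hypothetical bucket
    "sales/2023/orders.csv",      # hypothetical key
    "SELECT s.region, s.amount FROM S3Object s WHERE CAST(s.amount AS INT) > 100",
)

# With AWS credentials configured, the query would then be sent like this:
# import boto3
# s3 = boto3.client("s3")
# response = s3.select_object_content(**request)
# for event in response["Payload"]:
#     if "Records" in event:
#         print(event["Records"]["Payload"].decode("utf-8"))
```

Note that only the matching rows and columns cross the network, which is where the egress and latency savings come from.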
It’s not really a database, however. And with data coming in specific formats, such as CSV and JSON, it will need some coding to wrangle it into your analysis tool of choice.
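That wrangling step can be quite modest. The sketch below turns streamed CSV result chunks into typed Python records; the payload contents and field names are made up for illustration.

```python
import csv
import io

# S3 Select streams results back as raw byte chunks with no header row
# (the header was consumed on the input side), so a little wrangling is
# needed before an analysis tool can use them. Payload below is made up.
payload_chunks = [b"eu-west,120\n", b"us-east,340\nap-south,210\n"]

def records_from_payload(chunks: list[bytes], fieldnames: list[str]) -> list[dict]:
    """Join the streamed chunks and parse them into typed records."""
    text = b"".join(chunks).decode("utf-8")
    reader = csv.DictReader(io.StringIO(text), fieldnames=fieldnames)
    return [{"region": r["region"], "amount": int(r["amount"])} for r in reader]

records = records_from_payload(payload_chunks, ["region", "amount"])
```

From here the records could be handed to whatever analysis tool is in use, for example as the input to a dataframe.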
So, is object storage any good for databases?
The verdict is: it depends – or, not really.
Object storage can’t store databases in a way that provides multiple users with access consistent with atomicity, consistency, isolation, durability (ACID) principles, which rules it out for transactional processing use cases. It doesn’t have the I/O performance or the sub-file locking mechanisms to do this.
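The missing isolation is easy to demonstrate. In this sketch a dict again stands in for an object store, and two “clients” each perform the only operation available to them – a whole-object read-modify-write – with no locking; the second write silently discards the first client’s change, the classic lost update that ACID isolation exists to prevent.

```python
import json

# A dict stands in for an object store; the object is a made-up ledger.
store = {"accounts.json": json.dumps({"alice": 100, "bob": 100}).encode("utf-8")}

# Both clients read the same version of the whole object.
copy_a = json.loads(store["accounts.json"])
copy_b = json.loads(store["accounts.json"])

copy_a["alice"] -= 30   # client A debits alice in its private copy
copy_b["bob"] -= 50     # client B debits bob in its private copy

store["accounts.json"] = json.dumps(copy_a).encode("utf-8")  # A writes back
store["accounts.json"] = json.dumps(copy_b).encode("utf-8")  # B overwrites A

final = json.loads(store["accounts.json"])
# alice is back to 100: client A's debit has been lost.
```

A block device under a database avoids this because the database can lock and update individual pages in place; an object store offers neither.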
Object storage does, however, excel at storing large volumes of data, of which a subset can be downloaded – at potentially high rates of throughput – for retention in memory during local processing. These are the types of approaches pushed by the likes of Minio, and possibly in use cases that use AWS S3 Select. But we are then mostly talking about batch processing of data in analytics settings.