1010data: Operating Well in a Parallel Universe?

Updated: September 16, 2010

Since the 1980s, people have been wrestling with the problem of read and write locks on data. The idea is that if one person updates a datum while another is trying to read it, the two may see different values, or the reader can't predict which value he/she will see. To avoid this, the updater can block all other access via a write lock - which drastically slows down everyone else; or the reader, running the proverbial "query from hell", can block updaters with a read lock on all the data. In a data warehouse, updates are typically held and then rushed through at set times (end of day or week) in order to avoid locking problems. In another approach, columnar databases also sometimes provide what is called "versioning", in which previous values of a datum are kept around, so that the updater can operate on one value while readers operate on another.
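
To make the versioning idea concrete, here is a minimal Python sketch - purely illustrative, and not any vendor's actual mechanism - of a datum that keeps prior values around so that a reader never blocks on an updater:

    import threading

    class VersionedValue:
        """Keep prior versions of a datum so readers never wait on a writer.

        A toy illustration of the "versioning" idea described above, not any
        vendor's implementation: the writer appends a new version, and each
        reader keeps seeing whatever version was current when it started.
        """
        def __init__(self, initial):
            self._versions = [initial]        # version history, oldest first
            self._lock = threading.Lock()     # coordinates writers only

        def snapshot(self):
            """A reader captures the index of the latest committed version."""
            return len(self._versions) - 1

        def read(self, snapshot_index):
            """Reads never block on writers, because old versions persist."""
            return self._versions[snapshot_index]

        def write(self, new_value):
            """The updater appends a new version instead of overwriting."""
            with self._lock:                  # brief lock, among writers only
                self._versions.append(new_value)

    # A reader holding snapshot s still sees the old value after an update:
    v = VersionedValue(100)
    s = v.snapshot()
    v.write(250)
    assert v.read(s) == 100 and v.read(v.snapshot()) == 250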

How is 1010data different? The company provides a data warehouse/business intelligence solution as a remote service - a "database as a service" variant of SaaS/public cloud. However, 1010data's solution does not start by worrying about locking. Instead, it worries about how to provide each end user with a consistent "slice of time" database of his/her own. It appears to do this as follows: all data is divided into what the company calls "master" tables (as in "master data management", or MDM, of customer and supplier records), which are small, and time-associated/time-series "transactional" tables, which are very large.

Master tables change rarely, so a full copy of the table after each update (really, a "burst" of updates) can be stored on disk and loaded into main memory when an end user needs it, with little storage or processing overhead. This isn't feasible for transactional tables; but 1010data sees old versions of these as integral parts of the time series, not as superseded data, so the actual amount of "excess" data "appended" to a table - if the maximum session length for an end user is a day - is small in all realistic circumstances. As a result, two versions of a transactional table amount to a pointer to a common ancestor plus a small "append" each. That is, the storage overhead of the additional versioning data is small compared to some other columnar technologies, and not much more than that of row-oriented relational databases.
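
As a rough illustration of that "common ancestor plus a small append" idea - a toy model of my own, not 1010data's internals - consider an append-only table where each end-user session is pinned to the row count it saw at session start:

    class TransactionalTable:
        """Toy model of "version = common ancestor + small append"."""
        def __init__(self):
            self._rows = []                      # append-only, time-ordered rows

        def append_burst(self, rows):
            """An update burst only ever adds rows to the end of the table."""
            self._rows.extend(rows)

        def open_session(self):
            """A session's "slice of time" is just a high-water mark."""
            return len(self._rows)

        def scan(self, high_water_mark):
            """Queries in a session see rows [0, high_water_mark): the common
            ancestor plus whatever appends existed when the session began."""
            return self._rows[:high_water_mark]

    t = TransactionalTable()
    t.append_burst([("2010-09-14", "SKU1", 3), ("2010-09-15", "SKU2", 1)])
    session_a = t.open_session()                 # this session sees two rows
    t.append_burst([("2010-09-16", "SKU1", 7)])  # a later burst of updates
    session_b = t.open_session()                 # this session sees three rows
    assert len(t.scan(session_a)) == 2 and len(t.scan(session_b)) == 3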

Now the other shoe drops: in my very rough approximation, versioning entire tables instead of particular bits of data allows you to keep those bits of data pretty much sequential on disk - hence the lack of need for indexing. It is as if each burst of updates comes with an online reorganization that restores the sequentiality of the resulting table version, so that reads during queries nearly eliminate seek time. The storage overhead means that more data must be loaded from disk; but that is more than compensated for by eliminating the need to jerk the disk head from one end of the platter to the other in order to inhale all the needed data.
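
A back-of-the-envelope calculation shows why that trade works out. Using illustrative circa-2010 disk figures of my own choosing (roughly 10 ms per random seek, 100 MB/s sequential transfer), streaming an entire, slightly padded table version handily beats seeking to hundreds of thousands of scattered rows:

    # Illustrative disk figures (my assumptions, not 1010data's measurements):
    SEEK_TIME_S = 0.010          # ~10 ms per random seek
    TRANSFER_MB_PER_S = 100.0    # ~100 MB/s sustained sequential transfer

    def sequential_scan_seconds(table_mb, overhead_factor=1.2):
        """One seek, then stream the whole (slightly padded) table version."""
        return SEEK_TIME_S + (table_mb * overhead_factor) / TRANSFER_MB_PER_S

    def indexed_fetch_seconds(row_count, page_kb=8):
        """One seek plus one page transfer per scattered row found via an index."""
        return row_count * (SEEK_TIME_S + (page_kb / 1024.0) / TRANSFER_MB_PER_S)

    print(sequential_scan_seconds(2048))      # 2 GB table version: ~25 seconds
    print(indexed_fetch_seconds(500_000))     # 500,000 scattered rows: ~5,000 seconds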

So here's my take: 1010data's claim to better performance, as well as to competitive scalability, is credible. We live in a universe in which indexing to minimize disk seek time, plus keeping added storage lean to minimize disk accesses in the first place, lets us push against the limits of locking constraints; so we are (and should be) appreciative of columnar technology's additional storage savings and bit-mapped indexes, which let us keep more data in memory. 1010data lives in a universe in which locking never happens and data is stored pretty much sequentially, so it can happily forget indexes, squander a little disk storage, and still perform better.

1010data Loves Sushi

At this point, I could say that I have summarized 1010data's technical value-add, and move on to considering best use cases. However, to do that would be to ignore another way that 1010data does not operate in the same universe as its database peers: the company loves raw data. In fact, it would prefer to operate on data before any detection of errors and inconsistencies, as it views these problems as important data in their own right.

As a strong proponent of improving the quality of data provided to end users, I might be expected to disagree strongly. However, as a proponent of "data usefulness", I feel that the potential drawbacks of 1010data's approach are counterbalanced by some significant real-world advantages.

In the first place, 1010data is not doctrinaire about ETL (Extract, Transform, Load) technology. Rather, 1010data allows you to apply ETL at system implementation time, to simply start with an existing "sanitized" data warehouse (although it is philosophically opposed to these approaches), or to apply transforms online, at query time. It's nice that skipping the transform step when you start up the data warehouse speeds implementation. It's also nice that you have the choice of going raw or staying baked.
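
Here is a small, hypothetical Python sketch of that choice - the same cleanup applied once at load time versus lazily at query time, with the raw rows, errors and all, still available; the function and field names are mine, not 1010data's:

    # Raw rows as they arrive, including a data-quality problem worth keeping:
    raw_rows = [
        {"store": "NYC ", "sales": "1200"},
        {"store": "bos",  "sales": "n/a"},
    ]

    def transform(row):
        """Normalize a row; flag rather than silently drop bad values."""
        sales = row["sales"].strip()
        is_number = sales.replace(".", "", 1).isdigit()
        return {
            "store": row["store"].strip().upper(),
            "sales": float(sales) if is_number else None,
            "suspect": not is_number,
        }

    # Option 1: transform at load time (the warehouse stores only cleaned rows).
    loaded = [transform(r) for r in raw_rows]

    # Option 2: store raw data, and transform on the fly at query time.
    def query_total_sales(rows):
        cleaned = (transform(r) for r in rows)    # applied per query
        return sum(r["sales"] for r in cleaned if r["sales"] is not None)

    assert query_total_sales(raw_rows) == 1200.0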

In the second place, data quality is not the only factor that can decrease the usefulness of data. Another key consideration is how, and how well, a wide array of end users can employ warehoused data to perform more in-depth analysis. 1010data offers a user interface built on the Excel spreadsheet metaphor and supporting column/time-oriented analysis (as well as an Excel add-in), thus providing better rolling/ad-hoc time-series analysis to any business user familiar with Excel. Of course, someone else may come along and develop a similarly flexible interface, although 1010data would seem to have a lead as of now; but in the meanwhile, the company's wider range of end users and additional analytic capabilities appear to compensate for any problems with operating on incorrect data - especially when 1010data provides features to ensure that analyses take possible incorrectness into account.
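
For a flavor of the rolling, column-oriented calculation involved (the code and numbers are mine, illustrating the kind of analysis rather than 1010data's interface):

    daily_sales = [120, 135, 128, 160, 155, 149, 170]   # one column, time-ordered

    def rolling_average(column, window):
        """Trailing moving average over a time-ordered column."""
        out = []
        for i in range(len(column)):
            values = column[max(0, i - window + 1): i + 1]
            out.append(sum(values) / len(values))
        return out

    print(rolling_average(daily_sales, window=3))
    # [120.0, 127.5, 127.66..., 141.0, 147.66..., 154.66..., 158.0]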

Caveat

To me, some of the continuing advantages of 1010data's approach depend fundamentally on the idea that users of large transactional tables require ad-hoc historical analysis. To put it another way, if users really don't need to keep historical data around for more than an hour in their databases, and require frequent updates/additions for "real-time analysis" (or online transaction processing), then tables will require frequent reorganizing and will carry a lot of storage-wasting historical data, so that 1010data's performance advantages will decrease or vanish.

However, there will always be a need for ad-hoc, in-depth queries, and these are pretty likely to be leveraged for historical analysis. So while 1010data may or may not be the be-all, end-all data-warehousing database for all verticals forever, it is very likely to continue to offer distinct advantages for particular end users, and therefore should always be a valuable complement to a data warehouse that handles vanilla querying on a "no such thing as yesterday" basis.