What Is Google's Bigtable System?

Updated: April 30, 2009

Introduction

In the world of cloud computing, one essential ingredient is a database that can accommodate a very large number of users on an on-demand basis. Many Web applications use databases, typically of the SQL variety, for various tasks. What's needed is a database that can be carved into subsets accessible to various users, is distributed across many servers and is highly responsive; it should also be able to accommodate a virtually infinite variety of data tables.

Google has had a proprietary database, called Bigtable, since early 2005. Bigtable is the basis of Google's search technology, as well as many other applications such as Google Finance, Google Maps and Google Earth. Bigtable was developed with very high speed, flexibility and extremely high scalability in mind. A Bigtable database can be petabytes in size and span thousands of distributed servers.

In April 2008, Google announced that it was making Bigtable available to outside developers as part of Google App Engine, the company's cloud-computing platform. The only other large company that offers a database for cloud computing is Amazon.com Inc., so Google's entry into the market is a pretty big deal.

Understanding Bigtable's architecture is a job for Ph.D.s. Google has released one highly technical document describing Bigtable's plumbing, and it is recommended for potential developers who want to understand the database's technical details.

Analysis

Basic Architecture of Bigtable

Bigtable is described as a fast and extremely scalable DBMS (database management system). It is based on the proprietary Google File System, which gives Bigtable the ability to scale across hundreds or thousands of commodity servers that collectively can store petabytes of data.

Each table is a multidimensional sparse map. The table consists of rows and columns, and each cell has a time stamp. There can be multiple versions of a cell with different time stamps. The time stamp allows for operations such as "select 'n' versions of this Web page" or "delete cells that are older than … "
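
The data model described above can be pictured with a small sketch. This is a toy, in-memory illustration only, not Google's implementation or API: the class name SparseTable, its method names and the sample row key are all assumptions made up for this example.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a sparse map: (row, column) -> timestamped versions."""

    def __init__(self):
        # Only cells that have been written occupy space, hence "sparse".
        # Each cell holds a list of (timestamp, value) pairs, newest first.
        self._cells = defaultdict(list)

    def put(self, row, column, timestamp, value):
        versions = self._cells[(row, column)]
        versions.append((timestamp, value))
        versions.sort(key=lambda v: v[0], reverse=True)

    def get_versions(self, row, column, n=1):
        """Return up to n of the newest (timestamp, value) pairs for a cell."""
        return self._cells.get((row, column), [])[:n]

    def delete_older_than(self, cutoff):
        """Drop every cell version whose timestamp is earlier than cutoff."""
        for key in list(self._cells):
            kept = [v for v in self._cells[key] if v[0] >= cutoff]
            if kept:
                self._cells[key] = kept
            else:
                del self._cells[key]


# Hypothetical usage: three versions of one Web page.
table = SparseTable()
for ts in (1, 2, 3):
    table.put("com.example/index.html", "contents", ts, f"<html>v{ts}</html>")

newest_two = table.get_versions("com.example/index.html", "contents", n=2)
table.delete_older_than(3)
remaining = table.get_versions("com.example/index.html", "contents", n=10)
```

Here newest_two plays the role of "select 'n' versions of this Web page", and delete_older_than mirrors the time-based delete.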

In order to manage the huge tables, Bigtable splits tables at row boundaries and saves the pieces as tablets. Each tablet is around 200MB, and each server holds about 100 tablets. This setup allows tablets from a single table to be spread among many machines. It also allows for fine-grained load balancing: if one tablet is receiving many queries, its server can shed other tablets or move the busy tablet to another machine that is not so busy. Also, if a machine goes down, its tablets can be spread across many other machines so that the performance impact on any given machine is minimal.
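
A rough sketch of the two ideas in this paragraph, splitting at row boundaries and spreading a failed machine's tablets, might look like the following. The function names, the size threshold and the "fewest tablets" placement rule are assumptions chosen for illustration; the article does not say how Bigtable actually picks a target machine.

```python
def split_into_tablets(rows, max_tablet_bytes):
    """Group sorted (row_key, size_bytes) pairs into contiguous tablets.

    A split only ever falls between two rows, never inside one.
    """
    tablets, current, current_size = [], [], 0
    for key, size in sorted(rows):
        if current and current_size + size > max_tablet_bytes:
            tablets.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        tablets.append(current)
    return tablets


def reassign_after_failure(assignment, failed_server):
    """Spread a failed server's tablets across the surviving servers."""
    orphans = assignment.pop(failed_server)
    for tablet in orphans:
        # Assumed policy: give each orphan to the survivor with the fewest tablets.
        target = min(assignment, key=lambda server: len(assignment[server]))
        assignment[target].append(tablet)
    return assignment


# Hypothetical data: five 100-byte rows and a 200-byte tablet limit.
rows = [("a", 100), ("b", 100), ("c", 100), ("d", 100), ("e", 100)]
tablets = split_into_tablets(rows, max_tablet_bytes=200)

servers = {"s1": ["t1", "t2", "t3", "t4"], "s2": ["t5"], "s3": ["t6"]}
after_failure = reassign_after_failure(servers, "s1")
```

Because the orphaned tablets are handed out one at a time, no single survivor absorbs the whole load of the failed machine.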

Tables are stored as immutable SSTables and a tail of logs (one log per machine). When a machine's system memory is full, it compresses some tablets using Google's proprietary compression techniques, such as BMDiff and Zippy. Minor compactions involve only a few tablets, while major compactions involve the whole table system and recover hard-disk space.
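
The interplay between a log, in-memory data and immutable SSTables can be sketched loosely as follows. This is a simplified single-tablet model under stated assumptions: the class TabletStore, its memory_limit parameter and the flush rule are invented for the example, compression is left out entirely, and the compaction logic is a generic log-structured approach rather than a description of Google's code.

```python
class TabletStore:
    """Toy store: a log tail, an in-memory buffer and immutable sorted runs."""

    def __init__(self, memory_limit=2):
        self.log = []        # append-only record of every write
        self.memory = {}     # recent writes held in memory
        self.sstables = []   # immutable sorted runs, oldest first
        self.memory_limit = memory_limit  # assumed threshold, in entries

    def put(self, key, value):
        self.log.append((key, value))
        self.memory[key] = value
        if len(self.memory) >= self.memory_limit:
            self.minor_compaction()

    def minor_compaction(self):
        # Freeze the in-memory buffer into a new immutable sorted run.
        self.sstables.append(tuple(sorted(self.memory.items())))
        self.memory = {}

    def major_compaction(self):
        # Merge every run into one; newer values overwrite older ones,
        # so space held by superseded values is reclaimed.
        merged = {}
        for run in self.sstables:
            merged.update(run)
        self.sstables = [tuple(sorted(merged.items()))]

    def get(self, key):
        if key in self.memory:
            return self.memory[key]
        for run in reversed(self.sstables):  # newest run first
            for k, v in run:
                if k == key:
                    return v
        return None


store = TabletStore(memory_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    store.put(key, value)
runs_before = len(store.sstables)
store.major_compaction()
```

After the four writes there are two runs, one of which still holds the stale value for "a"; the major compaction collapses them into a single run with only the latest values.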

The locations of Bigtable tablets are themselves stored in cells. The lookup of any particular tablet is handled by a three-tiered system. Clients get a pointer to the META0 table, of which there is only one. The META0 table keeps track of many META1 tablets, which in turn contain the locations of the tablets being looked up. Both META0 and META1 make heavy use of pre-fetching and caching to minimize bottlenecks in the system.
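
The three-tiered lookup can be sketched as a walk through two levels of row-range indexes with a client-side cache. Everything concrete here is an assumption for illustration: the sample row ranges, the server names, the find_range helper, the Client class and the per-row-key cache (a real system would more plausibly cache whole ranges).

```python
import bisect


def find_range(index, row_key):
    """Return the target of the first (end_row_key, target) entry covering row_key.

    index must be sorted by end_row_key; each entry covers keys up to its end key.
    """
    end_keys = [end for end, _ in index]
    pos = bisect.bisect_left(end_keys, row_key)
    if pos == len(index):
        raise KeyError(row_key)
    return index[pos][1]


# Hypothetical metadata: META1 tablets map row ranges to tablet servers,
# and the single META0 table maps row ranges to META1 tablets.
LAST = "\uffff"  # sentinel end key meaning "everything after"
META1_A = [("g", "server-1"), ("m", "server-2")]
META1_B = [("t", "server-3"), (LAST, "server-4")]
META0 = [("m", META1_A), (LAST, META1_B)]


class Client:
    def __init__(self, meta0):
        self.meta0 = meta0
        self.cache = {}          # row_key -> server, filled on first lookup
        self.metadata_walks = 0  # how often the full META0 -> META1 walk ran

    def locate(self, row_key):
        if row_key in self.cache:
            return self.cache[row_key]
        self.metadata_walks += 1
        meta1 = find_range(self.meta0, row_key)  # tier 1: META0
        server = find_range(meta1, row_key)      # tier 2: META1
        self.cache[row_key] = server             # tier 3 is the tablet itself
        return server


client = Client(META0)
first = client.locate("apple")
again = client.locate("apple")  # served from the cache, no metadata walk
```

The second call to locate never touches META0 or META1, which is the point of the caching the paragraph mentions.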

Bigtable's Release

Bigtable was released in May 2008 as part of Google App Engine. As is typical with Google offerings, Bigtable is free to use, and the service is described as "in beta," even though Google has been using Bigtable internally for more than three years. The first 10,000 developers who signed up for the Google App Engine service received 500MB of storage and enough computing power and bandwidth to handle 5 million page views per month, all for free.

The system allows developers to use Google's tremendous infrastructure, which enables applications to handle very large spikes in traffic that would otherwise require extensive revisions to database architecture. Google App Engine allows developers to concentrate on their applications, while Google handles maintenance chores such as load balancing and replication.

Google has opened Bigtable to the development community in order to further its vision of cloud computing and the on-demand paradigm of computing-resources sales.

Misgivings about Bigtable

Developers have been generally positive about Bigtable and Google App Engine, as well as competitor Amazon Web Services. After all, Google App Engine gives developers access to very powerful Web platforms at no cost.

"Companies like Google and Amazon have a tonne of bandwidth that they can load share really well," John Allsopp, managing director of Web-development company Western Civilisation Pty. Ltd., told ZDNet Australia. "As a developer, when you launch something, you might get a big hit on it, so you really want a system that can provide the bandwidth when you need it."

But developers are nonetheless leery of the proprietary Bigtable platform, which locks applications into the Google stable. The same is true of Amazon Web Services and other cloud-computing services.

The danger is that cloud-computing vendors may decide to discontinue services upon which Web applications depend, and there will be no way to move those applications to other platforms.

Google has announced a pricing structure for Google App Engine and Bigtable, and it contains pleasant surprises for start-ups and other cost-conscious developers. It seems that a 4TB database application on Google App Engine costs one-tenth of what it would on Amazon Web Services. Google App Engine's price structure is on par with Amazon.com's S3 (Simple Storage Service).
