Cloud Bigtable

  • Cloud Bigtable is a fully managed, scalable NoSQL big data database service for large analytical and operational workloads
  • Scales to petabytes with consistent sub-10ms latency
  • Learns and adjusts to access patterns
  • Useful for machine learning applications
  • Powers Google services such as Search, Maps and Gmail
  • Supports high read and write throughput with low latency
  • Integrates with popular big data tools such as Hadoop and Cloud Dataflow, and supports the HBase API
  • Consider Cloud Bigtable if you require
    • > 1 TB of structured data
    • Very high rate of writes
    • Read/write latency < 10 ms
    • Strong consistency
    • Compatibility with the Hadoop HBase API

Storage Model

  • Stores data in massively scalable key-value sorted tables
  • Rows are indexed with a single key
  • Related columns are grouped into column-families
  • Each column is identified by a combination of column-family and column-qualifier
[Figure: Cloud Bigtable schema]
  • In this table:
    • Column Family = follows
    • Column qualifiers are used as data, e.g. tjefferson
    • Tables are sparsely populated (not all cells have a value)
  • Each row/column intersection can contain multiple cells (versions) at different timestamps, thereby providing a history
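
The storage model above can be sketched as a small in-memory structure. This is only a toy illustration, not the Cloud Bigtable client API: the class name SparseTable, its methods, and the row keys gwashington and jadams are assumptions made up for the example; only the follows column family and the tjefferson qualifier come from the notes.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a Bigtable-style table (hypothetical, not the real client API)."""

    def __init__(self):
        # row key -> column family -> column qualifier -> [(timestamp, value), ...]
        self._rows = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

    def write(self, row_key, family, qualifier, value, timestamp):
        """Add a new version of a cell; older versions are kept as history."""
        cells = self._rows[row_key][family][qualifier]
        cells.append((timestamp, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)  # newest first

    def read_versions(self, row_key, family, qualifier):
        """Return every (timestamp, value) version, newest first; [] if empty."""
        row = self._rows.get(row_key, {})
        return list(row.get(family, {}).get(qualifier, []))

    def read_latest(self, row_key, family, qualifier):
        """Return the newest value, or None when the cell has no value (sparse)."""
        versions = self.read_versions(row_key, family, qualifier)
        return versions[0][1] if versions else None

    def scan(self):
        """Return row keys in lexicographic order, as a sorted table would."""
        return sorted(self._rows)


table = SparseTable()
table.write("jadams", "follows", "tjefferson", 1, timestamp=100)
table.write("jadams", "follows", "tjefferson", 0, timestamp=200)
table.write("gwashington", "follows", "jadams", 1, timestamp=100)

print(table.read_latest("jadams", "follows", "tjefferson"))
print(table.read_versions("jadams", "follows", "tjefferson"))
print(table.read_latest("gwashington", "follows", "tjefferson"))
print(table.scan())
```

Cells that were never written simply do not exist in the nested maps, which is what makes the table sparse; each written cell keeps its earlier timestamped versions.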

Bigtable Architecture

  • All client requests go through a front end server
  • Nodes are organised into a Cloud Bigtable Cluster belonging to a Cloud Bigtable Instance – a container for the cluster
  • Cluster throughput can be increased by adding Nodes
[Figure: Cloud Bigtable architecture]
  • Cloud Bigtable data is sharded into blocks of contiguous rows called tablets

Bigtable maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing.

Bigtable: A Distributed Storage System for Structured Data
  • Tablets are stored on Colossus, Google’s file system, in SSTable format
  • An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings
  • Performance scales linearly with the number of nodes in a cluster
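
Because rows are kept in lexicographic order and each tablet covers a contiguous row range, finding the tablet for a row key amounts to a binary search over the tablets' start keys. The sketch below illustrates that idea only: build_tablets, find_tablet, the fixed rows_per_tablet size, and the user# row keys are all hypothetical, and the real service partitions dynamically rather than by a fixed row count.

```python
import bisect


def build_tablets(sorted_row_keys, rows_per_tablet):
    """Split lexicographically sorted row keys into contiguous tablets."""
    return [
        sorted_row_keys[i:i + rows_per_tablet]
        for i in range(0, len(sorted_row_keys), rows_per_tablet)
    ]


def find_tablet(tablets, row_key):
    """Return the index of the tablet whose row range covers row_key."""
    start_keys = [tablet[0] for tablet in tablets]
    # Last tablet whose start key is <= row_key; keys before the first
    # start key fall into tablet 0.
    return max(bisect.bisect_right(start_keys, row_key) - 1, 0)


row_keys = sorted(
    ["user#0042", "user#0007", "user#0913", "user#0100", "user#0555", "user#0300"]
)
tablets = build_tablets(row_keys, rows_per_tablet=2)

print(tablets)
print(find_tablet(tablets, "user#0300"))
print(find_tablet(tablets, "user#0200"))
```

A key that is not stored yet (user#0200 here) still maps to exactly one tablet, since each tablet owns a range of the key space rather than a set of individual keys.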

Load Balancing

  • Each Cloud Bigtable zone is managed by a primary process, which balances workload and data volume within clusters
  • This process balances load by splitting busier or larger tablets in half
  • Conversely, smaller and less-busy tablets are merged, thereby reducing fragmentation
  • Balancing of traffic and split/merge activity is handled automatically
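
The split/merge behaviour described above can be sketched as follows. This is a simplified illustration under stated assumptions, not how the service is implemented: the function rebalance and the thresholds split_above and merge_below are hypothetical, and tablet size (row count) stands in for both size and traffic, whereas the real process also considers how busy a tablet is.

```python
def rebalance(tablets, split_above, merge_below):
    """Split oversized tablets in half, then merge adjacent undersized ones.

    tablets is a list of contiguous, sorted lists of row keys. Choose
    thresholds so that half of a split tablet is not below merge_below,
    otherwise the halves would be merged straight back together.
    """
    # Pass 1: split any tablet larger than split_above into two halves.
    after_split = []
    for tablet in tablets:
        if len(tablet) > split_above:
            mid = len(tablet) // 2
            after_split.extend([tablet[:mid], tablet[mid:]])
        else:
            after_split.append(tablet)

    # Pass 2: merge neighbouring tablets that are both smaller than merge_below.
    merged = []
    for tablet in after_split:
        if merged and len(merged[-1]) < merge_below and len(tablet) < merge_below:
            merged[-1] = merged[-1] + tablet
        else:
            merged.append(tablet)
    return merged


before = [["a", "b", "c", "d", "e", "f"], ["g"], ["h"], ["i", "j", "k"]]
after = rebalance(before, split_above=4, merge_below=2)
print(after)
```

Both passes only cut or join neighbouring ranges, so the overall row order is preserved and every row key still belongs to exactly one tablet.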