HBase is a distributed column-oriented database built on top of HDFS. HBase is the Hadoop application to use when you require real-time read/write random access to very large datasets.
Applications store data in labeled tables. Tables are made of rows and columns. Table cells, the intersections of row and column coordinates, are versioned. By default, their version is a timestamp auto-assigned by HBase at the time of cell insertion. A cell’s content is an uninterpreted array of bytes. Table row keys are also byte arrays, so theoretically anything can serve as a row key, from strings to binary representations of long or even serialized data structures. Table rows are sorted by row key, the table’s primary key. The sort is byte-ordered. All table accesses are via the table primary key.
Row columns are grouped into column families. All column family members have a common prefix, so, for example, the columns temperature:air and temperature:dew_point are both members of the temperature column family, whereas station:identifier belongs to the station family. The column family prefix must be composed of printable characters. The qualifying tail, the column family qualifier, can be made of any arbitrary bytes.
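To make the byte-ordered sort and the family:qualifier naming concrete, here is a minimal sketch in plain Python. It does not use any HBase client API. The helper names encode_long and split_column are hypothetical, and the choice of an 8-byte big-endian encoding for numeric keys is an assumption made for illustration.

```python
import struct

def encode_long(n: int) -> bytes:
    """Encode a non-negative integer as an 8-byte big-endian key (illustrative choice)."""
    return struct.pack(">q", n)

# Decimal strings are compared byte by byte, so "10" and "100" land before "9".
string_keys = sorted(str(n).encode("ascii") for n in (9, 10, 100))

# Fixed-width big-endian encodings of non-negative values sort in numeric order.
binary_keys = sorted(encode_long(n) for n in (9, 10, 100))
decoded = [struct.unpack(">q", key)[0] for key in binary_keys]

def split_column(name: bytes):
    """Split b'family:qualifier' at the first colon (hypothetical helper)."""
    family, _, qualifier = name.partition(b":")
    return family, qualifier

print(string_keys)
print(decoded)
print(split_column(b"temperature:dew_point"))
```

The practical point is that row key design determines scan order: keys that should be read in numeric order need an encoding whose byte order matches that numeric order.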
A table’s column families must be specified up front as part of the table schema definition, but new column family members can be added on demand. For example, a new column station:address can be offered by a client as part of an update, and its value persisted, as long as the column family station is already in existence on the targeted table.
Physically, all column family members are stored together on the filesystem. So although earlier we described HBase as a column-oriented store, it would be more accurate if it were described as a column-family-oriented store. Because tunings and storage specifications are done at the column-family level, it is advised that all column family members have the same general access pattern and size characteristics.
In synopsis, HBase tables are like those in an RDBMS, only cells are versioned, rows are sorted, and columns can be added on the fly by the client as long as the column family they belong to preexists.
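The data model summarized above can be sketched as a toy in-memory structure. This is only an illustration of the rules stated in the text (fixed column families, qualifiers added on the fly, versioned cells, rows sorted by key). The class ToyTable and its methods are hypothetical and are not the HBase client API or its storage layout. The millisecond timestamp used as the default version is likewise an assumption of this sketch.

```python
import time

class ToyTable:
    """In-memory stand-in for one table; not how HBase actually stores data."""

    def __init__(self, families):
        self.families = set(families)  # fixed when the schema is defined
        self.rows = {}                 # row key -> column -> version -> value

    def put(self, row, column, value, version=None):
        family = column.split(b":", 1)[0]
        if family not in self.families:
            # New qualifiers are fine, new families are not.
            raise KeyError("unknown column family: %r" % family)
        if version is None:
            version = int(time.time() * 1000)  # assumed default: current time in ms
        self.rows.setdefault(row, {}).setdefault(column, {})[version] = value

    def get(self, row, column):
        """Return the value stored under the highest version of the cell."""
        versions = self.rows[row][column]
        return versions[max(versions)]

    def scan(self):
        """Return row keys in byte order."""
        return sorted(self.rows)


table = ToyTable(families=[b"temperature", b"station"])
table.put(b"row2", b"temperature:air", b"25", version=1)
table.put(b"row2", b"temperature:air", b"27", version=2)
# station:address was never declared, but its family exists, so the put succeeds.
table.put(b"row1", b"station:address", b"somewhere", version=1)
```

Here table.get(b"row2", b"temperature:air") returns the value written with the higher version, table.scan() lists row1 before row2, and a put to a column in an undeclared family such as wind:speed raises KeyError.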
Internally, HBase keeps special catalog tables named -ROOT- and .META., within which it maintains the current list, state, and location of all regions afloat on the cluster. The -ROOT- table holds the list of .META. table regions. The .META. table holds the list of all user-space regions. Entries in these tables are keyed by region name, where a region name is made of the table name the region belongs to, the region’s start row, its time of creation, and finally, an MD5 hash of all of the former