Overview

  • An HBase Table consists of multiple rows.
  • A Table is split into Regions based on the lexicographical order of its row keys.
  • A Store corresponds to a column family for a table for a given region.
  • A Store hosts a MemStore and 0 or more StoreFiles (HFiles).
  • The MemStore holds in-memory modifications to the Store.
  • StoreFiles are composed of Blocks.

A widely circulated diagram of the HBase architecture is shown below.

HBase Architecture

I drew a simplified version from the region’s viewpoint.

Region

Region

HBase stores rows of data in tables. For high performance and availability, tables are split into “regions”. A region contains a contiguous range of rows. Because rows are sorted lexicographically by row key, every row whose key falls between the region’s start key (inclusive) and end key (exclusive) is stored in that region. A single row belongs to exactly one region at any time. A region is served by exactly one RegionServer at a time, while one RegionServer can host many regions. Regions are distributed across the cluster, so client requests can be processed independently by different RegionServer processes.
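To make the key-range idea concrete, here is a minimal sketch in plain Java with no HBase dependency. It maps a row key to the region whose start key is the greatest one not exceeding that row key. The class name, region names and split points are hypothetical, and String keys stand in for HBase's byte[] row keys. The real client locates regions differently, so treat this only as an illustration of the sorted-range lookup.

```java
import java.util.Map;
import java.util.TreeMap;

final class RegionLookupSketch {
    // start key (inclusive) -> region name; the first region starts at the empty key
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    void addRegion(String startKey, String regionName) {
        regionsByStartKey.put(startKey, regionName);
    }

    // The owning region is the one with the greatest start key <= rowKey.
    String locate(String rowKey) {
        Map.Entry<String, String> entry = regionsByStartKey.floorEntry(rowKey);
        if (entry == null) {
            throw new IllegalStateException("no region covers row key: " + rowKey);
        }
        return entry.getValue();
    }

    // Hypothetical table with three regions: ["", "g"), ["g", "p"), ["p", end)
    static RegionLookupSketch demo() {
        RegionLookupSketch table = new RegionLookupSketch();
        table.addRegion("", "region-1");
        table.addRegion("g", "region-2");
        table.addRegion("p", "region-3");
        return table;
    }

    public static void main(String[] args) {
        RegionLookupSketch table = demo();
        for (String row : new String[] {"apple", "g", "monkey", "zebra"}) {
            System.out.println(row + " -> " + table.locate(row));
        }
    }
}
```

Because the start keys are kept sorted, each row key resolves to exactly one region, which matches the statement that a row belongs to a single region at any time.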

An HBase table can be pre-split into regions at creation time. After that, a region splits automatically when it reaches a configured size threshold. You can also force a split manually, but there are some rules of thumb to respect. Conversely, regions can be merged if there are too many of them.
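The threshold-driven split can be pictured with a toy model in plain Java. This sketch is deliberately simplified and makes assumptions that are not in HBase: the threshold counts rows rather than bytes, the split point is the median row key, and all names (RegionSplitSketch, MAX_ROWS_PER_REGION) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class RegionSplitSketch {
    // Assumed threshold for illustration; HBase thresholds are size based.
    static final int MAX_ROWS_PER_REGION = 4;

    // Regions in key order; each one holds a contiguous, sorted range of rows.
    private final List<TreeMap<String, String>> regions = new ArrayList<>();

    RegionSplitSketch() {
        regions.add(new TreeMap<>()); // a new table starts with a single region
    }

    void put(String row, String value) {
        int i = indexOfRegionFor(row);
        TreeMap<String, String> region = regions.get(i);
        region.put(row, value);
        if (region.size() > MAX_ROWS_PER_REGION) {
            split(i);
        }
    }

    // Pick the last region whose first key is <= row; region 0 catches the rest.
    private int indexOfRegionFor(String row) {
        int index = 0;
        for (int i = 1; i < regions.size(); i++) {
            if (regions.get(i).firstKey().compareTo(row) <= 0) {
                index = i;
            }
        }
        return index;
    }

    // Replace the parent with two daughters divided at the median row key.
    private void split(int i) {
        TreeMap<String, String> parent = regions.get(i);
        String splitKey = new ArrayList<String>(parent.keySet()).get(parent.size() / 2);
        TreeMap<String, String> lower = new TreeMap<>(parent.headMap(splitKey, false));
        TreeMap<String, String> upper = new TreeMap<>(parent.tailMap(splitKey, true));
        regions.set(i, lower);
        regions.add(i + 1, upper);
    }

    int regionCount() { return regions.size(); }
    String firstKeyOf(int i) { return regions.get(i).firstKey(); }
    String lastKeyOf(int i) { return regions.get(i).lastKey(); }

    public static void main(String[] args) {
        RegionSplitSketch table = new RegionSplitSketch();
        for (String row : new String[] {"a", "b", "c", "d", "e"}) {
            table.put(row, "v");
        }
        System.out.println("regions after five puts: " + table.regionCount());
    }
}
```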

OK, let the source code tell the truth.

public interface Region extends ConfigurationObserver {
    /**
    * @return a list of the Stores managed by this region
    */
    List<? extends Store> getStores();

    /**
    * @return the Store for the given family
    */
    Store getStore(byte[] family);

    /**
    * Check the region's underlying store files, open the files that have not
    * been opened yet, and remove the store file readers for store files no
    * longer available.
    * @throws IOException
    */
    boolean refreshStoreFiles() throws IOException;

    /**
    * @return memstore size for this region, in bytes. It just accounts data size of cells added to
    *         the memstores of this Region. Means size in bytes for key, value and tags within Cells.
    *         It wont consider any java heap overhead for the cell objects or any other.
    */
    long getMemStoreSize();

    /**
    * Puts some data in the table.
    * @param put
    * @throws IOException
    */
    void put(Put put) throws IOException;

    /**
    * Perform one or more increment operations on a row.
    * @param increment
    * @return result of the operation
    * @throws IOException
    */
    Result increment(Increment increment) throws IOException;

    /**
    * Do a get based on the get parameter.
    * @param get query parameters
    * @return result of the operation
    */
    Result get(Get get) throws IOException;
}

public class HRegion implements HeapSize, PropagatingConfigurationObserver, Region {
    private final WAL wal;
    private final HRegionFileSystem fs;


    protected final Map<byte[], HStore> stores =
     new ConcurrentSkipListMap<>(Bytes.BYTES_RAWCOMPARATOR);
}
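The stores map in HRegion is keyed by column family name as byte[] and built with an explicit comparator. A small stand-alone sketch shows why that matters: Java arrays do not implement Comparable and compare by identity, so a sorted map needs a content-based comparator to find an entry from a fresh byte[]. The comparator below is a hand-written stand-in for Bytes.BYTES_RAWCOMPARATOR (assumed here to order arrays lexicographically by unsigned byte value), and String values are placeholders for HStore instances.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

final class StoresByFamilySketch {
    // Lexicographic comparison of unsigned bytes; shorter array wins on a shared prefix.
    static final Comparator<byte[]> RAW = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    };

    // column family name -> store (a String placeholder in this sketch)
    final Map<byte[], String> stores = new ConcurrentSkipListMap<>(RAW);

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        StoresByFamilySketch region = new StoresByFamilySketch();
        region.stores.put(bytes("cf1"), "store for cf1");
        region.stores.put(bytes("cf2"), "store for cf2");
        // A different byte[] instance with the same content still finds the entry.
        System.out.println(region.stores.get(bytes("cf1")));
    }
}
```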

Store

A Store corresponds to a column family for a table for a given region.

public interface Store {
    Collection<? extends StoreFile> getStorefiles();

    FileSystem getFileSystem();

    /**
    * @return The size of this store's memstore.
    */
    MemStoreSize getMemStoreSize();
}

public class HStore implements Store, HeapSize, StoreConfigInformation, PropagatingConfigurationObserver {
    protected final MemStore memstore;
    // This stores directory in the filesystem.
    protected final HRegion region;
    private final ColumnFamilyDescriptor family;
    private final HRegionFileSystem fs;
}

MemStore

A Store hosts a MemStore and 0 or more StoreFiles (HFiles). The MemStore holds in-memory modifications to the Store. Modifications are Cells/KeyValues.

When the MemStore reaches a given size (hbase.hregion.memstore.flush.size), it flushes its contents to a StoreFile.
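The flush behaviour can be sketched with a toy model in plain Java. It is not HBase code: String keys and values replace Cells, the size accounting simply adds key and value lengths, and the class and method names are hypothetical. The constructor argument plays the role of hbase.hregion.memstore.flush.size.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

final class MemStoreFlushSketch {
    private final long flushSizeBytes;
    private TreeMap<String, String> memstore = new TreeMap<>(); // sorted, in memory
    private long memstoreBytes = 0;
    // Each flush produces one immutable, sorted "store file"; newest is last.
    private final List<SortedMap<String, String>> storeFiles = new ArrayList<>();

    MemStoreFlushSketch(long flushSizeBytes) {
        this.flushSizeBytes = flushSizeBytes;
    }

    void put(String key, String value) {
        String old = memstore.put(key, value);
        if (old != null) {
            memstoreBytes -= key.length() + old.length(); // replaced entry no longer counts
        }
        memstoreBytes += key.length() + value.length();
        if (memstoreBytes >= flushSizeBytes) {
            flush();
        }
    }

    // Freeze the current memstore as a store file and start an empty one.
    private void flush() {
        storeFiles.add(Collections.unmodifiableSortedMap(memstore));
        memstore = new TreeMap<>();
        memstoreBytes = 0;
    }

    // Reads check the memstore first, then store files from newest to oldest.
    String get(String key) {
        if (memstore.containsKey(key)) {
            return memstore.get(key);
        }
        for (int i = storeFiles.size() - 1; i >= 0; i--) {
            String value = storeFiles.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    int storeFileCount() { return storeFiles.size(); }
    long memstoreSize() { return memstoreBytes; }

    public static void main(String[] args) {
        MemStoreFlushSketch store = new MemStoreFlushSketch(10);
        store.put("a", "12345");
        store.put("b", "12345");
        System.out.println("store files: " + store.storeFileCount());
    }
}
```

Keeping the in-memory map sorted is what lets a flush write the data out as a sorted file without an extra sort step.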

public interface MemStore {
    /**
    * Write the updates
    * @param cells
    * @param memstoreSizing The delta in memstore size will be passed back via this.
    *        This will include both data size and heap overhead delta.
    */
    void add(Iterable<Cell> cells, MemStoreSizing memstoreSizing);
}

StoreFile (HFile)

StoreFiles are where your data lives. The HFile file format is based on the SSTable file described in the BigTable [2006] paper and on Hadoop’s TFile. StoreFile wraps HFile, a file of sorted key/value pairs. Both keys and values are byte arrays.

public interface StoreFile {
    /**
    * @return True if this is HFile.
    */
    boolean isHFile();

    /**
    * Get the first key in this store file.
    */
    Optional<Cell> getFirstKey(); 
}

/**
 * A Store data file.  Stores usually have one or more of these files.  They
 * are produced by flushing the memstore to disk.
 */
public class HStoreFile implements StoreFile {
    private final StoreFileInfo fileInfo;
    private final FileSystem fs;

    // firstKey, lastkey and cellComparator will be set when openReader.
    private Optional<Cell> firstKey;
    private Optional<Cell> lastKey;
    private CellComparator comparator;

    /**
    * Bloom filter type specified in column family configuration. Does not
    * necessarily correspond to the Bloom filter type present in the HFile.
    */
    private final BloomType cfBloomType;
}

Before describing how the HFile format evolved, it is useful to give a short overview of the original (HFile version 1) format.

The block index in version 1 is very straightforward. For each entry, it contains:

  • Offset (long)
  • Uncompressed size (int)
  • Key (a serialized byte array written using Bytes.writeByteArray):
      • Key length as a variable-length integer (VInt)
      • Key bytes

The number of entries in the block index is stored in the fixed file trailer, and has to be passed in to the method that reads the block index. One of the limitations of the block index in version 1 is that it does not provide the compressed size of a block, which turns out to be necessary for decompression.
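As a rough illustration of the version 1 index entry layout, the sketch below serializes one entry with java.nio.ByteBuffer and reads it back. It is not the HBase reader or writer. The class name is hypothetical, and the variable-length key length is simplified: the sketch assumes that lengths up to 127 fit in a single byte, as in Hadoop's VInt encoding, and rejects longer keys.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class V1IndexEntrySketch {
    final long offset;          // file offset of the block
    final int uncompressedSize; // note: no compressed size in version 1
    final byte[] key;           // first key of the block

    V1IndexEntrySketch(long offset, int uncompressedSize, byte[] key) {
        if (key.length > 127) {
            throw new IllegalArgumentException("sketch only handles keys up to 127 bytes");
        }
        this.offset = offset;
        this.uncompressedSize = uncompressedSize;
        this.key = key;
    }

    static V1IndexEntrySketch of(long offset, int uncompressedSize, String key) {
        return new V1IndexEntrySketch(offset, uncompressedSize,
                key.getBytes(StandardCharsets.UTF_8));
    }

    // Layout: offset (8 bytes) | uncompressed size (4) | key length (1) | key bytes
    byte[] toBytes() {
        return ByteBuffer.allocate(8 + 4 + 1 + key.length)
                .putLong(offset)
                .putInt(uncompressedSize)
                .put((byte) key.length)
                .put(key)
                .array();
    }

    static V1IndexEntrySketch fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long offset = buf.getLong();
        int uncompressedSize = buf.getInt();
        byte[] key = new byte[buf.get()];
        buf.get(key);
        return new V1IndexEntrySketch(offset, uncompressedSize, key);
    }

    String keyAsString() {
        return new String(key, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        V1IndexEntrySketch entry = of(65536L, 4096, "row-0042");
        V1IndexEntrySketch copy = fromBytes(entry.toBytes());
        System.out.println(copy.offset + " " + copy.uncompressedSize + " " + copy.keyAsString());
    }
}
```

Nothing in the entry records how many bytes the block occupies on disk once compressed, which is the limitation described above.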

HFile version 2 fixed this limitation and was introduced in HBase 0.92.

In version 2, every block in the data section contains the following fields:

  1. 8 bytes: Block type, a sequence of bytes equivalent to version 1’s “magic records”.
  2. Compressed size of the block’s data, not including the header (int). Can be used for skipping the current data block when scanning HFile data.
  3. Uncompressed size of the block’s data, not including the header (int). This is equal to the compressed size if the compression algorithm is NONE.
  4. File offset of the previous block of the same type (long). Can be used for seeking to the previous data/index block.
  5. Compressed data (or uncompressed data if the compression algorithm is NONE).
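The four header fields listed above add up to 24 bytes (8 + 4 + 4 + 8). The following sketch packs and unpacks such a header with java.nio.ByteBuffer and shows how the compressed size lets a reader jump to the next block. It is an illustration of the layout only, not the HBase implementation: the class name is hypothetical, big-endian byte order is assumed, and "DATABLK*" is used as an example 8-byte block type.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class V2BlockHeaderSketch {
    static final int HEADER_SIZE = 8 + 4 + 4 + 8; // block type + two ints + one long

    final String blockType;     // 8 ASCII bytes
    final int compressedSize;   // on-disk size of the data, header excluded
    final int uncompressedSize; // size of the data after decompression
    final long prevBlockOffset; // offset of the previous block of the same type

    V2BlockHeaderSketch(String blockType, int compressedSize,
                        int uncompressedSize, long prevBlockOffset) {
        if (blockType.getBytes(StandardCharsets.US_ASCII).length != 8) {
            throw new IllegalArgumentException("block type must be exactly 8 bytes");
        }
        this.blockType = blockType;
        this.compressedSize = compressedSize;
        this.uncompressedSize = uncompressedSize;
        this.prevBlockOffset = prevBlockOffset;
    }

    byte[] toBytes() {
        return ByteBuffer.allocate(HEADER_SIZE)
                .put(blockType.getBytes(StandardCharsets.US_ASCII))
                .putInt(compressedSize)
                .putInt(uncompressedSize)
                .putLong(prevBlockOffset)
                .array();
    }

    static V2BlockHeaderSketch fromBytes(byte[] header) {
        ByteBuffer buf = ByteBuffer.wrap(header);
        byte[] type = new byte[8];
        buf.get(type);
        return new V2BlockHeaderSketch(new String(type, StandardCharsets.US_ASCII),
                buf.getInt(), buf.getInt(), buf.getLong());
    }

    // Skipping a block: jump over the header plus the compressed data.
    long nextBlockOffset(long thisBlockOffset) {
        return thisBlockOffset + HEADER_SIZE + compressedSize;
    }

    public static void main(String[] args) {
        // -1 is used here as an assumed "no previous block" marker.
        V2BlockHeaderSketch header = new V2BlockHeaderSketch("DATABLK*", 1000, 4096, -1L);
        V2BlockHeaderSketch copy = fromBytes(header.toBytes());
        System.out.println(copy.blockType + " next block at " + copy.nextBlockOffset(0L));
    }
}
```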

The above format of blocks is used in the following HFile sections:

  • Scanned block section: The section is named so because it contains all data blocks that need to be read when an HFile is scanned sequentially. It also contains leaf block index and Bloom chunk blocks.
  • Non-scanned block section: This section still contains unified-format v2 blocks, but it does not have to be read when doing a sequential scan. This section contains “meta” blocks and intermediate-level index blocks.

Version 2 supports “meta” blocks the same way version 1 did, even though Bloom filter data is no longer stored in these blocks.

Block

StoreFiles are composed of blocks. The blocksize is configured on a per-ColumnFamily basis.

HFile doesn’t know anything about the structure or type of keys and values (row key, qualifier, family, timestamp, …). As in Hadoop’s block-compressed SequenceFile, keys and values are grouped into blocks, and blocks contain records. Each record starts with two int values holding the key length and the value length, followed by the key and value byte arrays [8].
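The record layout described above (key length, value length, key bytes, value bytes) can be sketched as follows. This is a stand-alone illustration, not HBase code: the class name is hypothetical, strings stand in for opaque byte arrays, and big-endian 4-byte lengths are assumed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class BlockRecordsSketch {
    private static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Each record: key length (int) | value length (int) | key bytes | value bytes
    static byte[] encode(String[][] records) {
        int size = 0;
        for (String[] r : records) {
            size += 4 + 4 + utf8(r[0]).length + utf8(r[1]).length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (String[] r : records) {
            byte[] key = utf8(r[0]);
            byte[] value = utf8(r[1]);
            buf.putInt(key.length).putInt(value.length).put(key).put(value);
        }
        return buf.array();
    }

    // Walk the block record by record, using the two lengths to find boundaries.
    static List<String[]> decode(byte[] block) {
        ByteBuffer buf = ByteBuffer.wrap(block);
        List<String[]> records = new ArrayList<>();
        while (buf.hasRemaining()) {
            byte[] key = new byte[buf.getInt()];
            byte[] value = new byte[buf.getInt()];
            buf.get(key);
            buf.get(value);
            records.add(new String[] {
                new String(key, StandardCharsets.UTF_8),
                new String(value, StandardCharsets.UTF_8)
            });
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] block = encode(new String[][] {{"row1", "a"}, {"row2", "bc"}});
        for (String[] record : decode(block)) {
            System.out.println(record[0] + " = " + record[1]);
        }
    }
}
```

Because the lengths come first, a reader can step over a record without interpreting the key or value, which fits the point that HFile treats both as plain bytes.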

Reference

  1. HBase Notes: Storage Structure (HBase笔记:存储结构)
  2. Regions
  3. HBASE ARCHITECTURE
  4. HBase – Memstore Flush in Depth (HBase – Memstore Flush深度解析)
  5. The HFile File Format and HBase Reads and Writes (HFile文件格式与HBase读写)
  6. HFile
  7. HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs
  8. HBase I/O: HFile
  9. HBase file format with inline blocks (version 2)
  10. An In-Depth Look at the HBase Architecture