At Conversant we operate a 12-node OpenTSDB cluster that ingests metrics from hundreds of servers across our multiple data centers. We have talked about its many uses before, but to summarize: it is used by many different teams for a variety of tasks, from basic time-series visualization to report generation.
This heavy use led to a problem. Ingesting 550 million data points a day for years meant the cluster was rapidly approaching its 75 TB capacity. Something had to be done.
There are two ways to reduce a disk footprint: deleting data and compressing data. OpenTSDB uses HBase (a distributed, scalable datastore) to store its time-series data, and HBase has functionality that can be used to implement both of these strategies.
HBase has a cell-level TTL (time to live) which can be defined per column family of a table. Since all TSDB data is stored in a single table and column family in HBase (appropriately named tsdb and t respectively), the TTL can be used to define a global data retention policy on all time-series data. Selecting an appropriate TTL is very situational, but for us 2 years is sufficient.
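For reference, a 2-year TTL (63,072,000 seconds) can be applied to the t column family from the HBase shell along these lines (a sketch, not our exact commands; verify the alter syntax against your HBase version):

```
hbase> alter 'tsdb', {NAME => 't', TTL => 63072000}
```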
Overriding the default HBase TTL provides both short- and long-term benefits. The short-term benefit is the immediate removal of a large amount of old data; the TSDB cluster had been running for years, so a large portion of the stored data was older than 2 years. In the long term it limits the rate of growth of the
tsdb table: as new data points are added, old data points are deleted. Of course the rate of ingest is growing, so the table size won't be completely static, but it will grow slowly.
HBase supports both compression and data block encoding to help reduce tables’ disk footprints. Both are helpful but work in slightly different ways. Compression uses general purpose compression algorithms to compress the data while data block encoding leverages specifics of the way key-value pairs are stored in HBase.
Compression
HBase compression applies a general-purpose compression algorithm to the byte data of a block. For reference, an HBase block is the smallest unit of data that can be accessed in a single unit of I/O, and HFiles are essentially collections of blocks. HBase supports the GZ, LZO, SNAPPY, and LZ4 compression algorithms. However, for anything other than GZ, which is supported by Java itself, native packages need to be installed. This limited our choices to SNAPPY and GZ because our TSDB environment only had the SNAPPY packages installed.
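To see why a general-purpose compressor does so well here, consider that adjacent TSDB cells share almost all of their bytes (same metric uid, same tags, slowly changing timestamps). A rough illustration using Python's stdlib gzip on synthetic key bytes (the uid and timestamps are made up, and this is not HBase's actual block format):

```python
import gzip
import struct

# Synthetic "rowkey" bytes: a fixed 3-byte metric uid followed by a 4-byte
# hourly timestamp, mimicking the repetitive structure of TSDB keys.
keys = b"".join(
    b"\x00\x00\x01" + struct.pack(">I", 1500000000 + hour * 3600)
    for hour in range(10000)
)
compressed = gzip.compress(keys)
ratio = len(keys) / len(compressed)
print(f"raw: {len(keys)} bytes, gzipped: {len(compressed)} bytes, ratio: {ratio:.1f}x")
```

Highly repetitive byte streams like this compress far better than random data, which is why time-series tables are such good candidates for block compression.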
Data Block Encoding
Data block encoding is an often overlooked feature that takes advantage of the sorted nature of HFiles to help reduce the key storage overhead, which is particularly useful for TSDB since the majority of the data is stored in the key. A TSDB HBase entry (a single data point) is made up of a rowkey, a column qualifier, and a value. The rowkey for a TSDB entry begins with the metric's uid, followed by a timestamp truncated to the hour and the various tag key and tag value pairs. The column qualifier contains a time offset relative to the timestamp in the rowkey and some format information about the value. There are two possible formats for the column qualifier: a 2-byte and a 4-byte version. The 2-byte version holds an offset in seconds, while the 4-byte version holds a millisecond offset. The value is a 1-8 byte integer or float.
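The layout above can be sketched in a few lines of Python. The 3-byte uid widths and the 2-byte qualifier bit layout follow OpenTSDB's documented defaults; the uid values themselves are invented for illustration:

```python
import struct

def make_rowkey(metric_uid, timestamp, tags):
    """Rowkey: 3-byte metric uid + 4-byte hour-aligned timestamp + tag pairs."""
    base_ts = timestamp - (timestamp % 3600)          # truncate to the hour
    key = metric_uid.to_bytes(3, "big") + struct.pack(">I", base_ts)
    for tagk_uid, tagv_uid in sorted(tags.items()):   # 3-byte tagk/tagv uids
        key += tagk_uid.to_bytes(3, "big") + tagv_uid.to_bytes(3, "big")
    return key

def make_qualifier(timestamp, value_len=8, is_float=False):
    """2-byte qualifier: 12 bits of second offset, then 4 bits of value flags."""
    offset = timestamp % 3600                         # seconds past the hour
    flags = ((1 << 3) if is_float else 0) | (value_len - 1)
    return struct.pack(">H", (offset << 4) | flags)

rowkey = make_rowkey(1, 1500000123, {7: 42})          # hypothetical uids
qualifier = make_qualifier(1500000123)
print(rowkey.hex(), qualifier.hex())
```

Note how much of each entry is key material: here the rowkey plus qualifier is 15 bytes against a value of at most 8, which is why an encoding that shrinks keys pays off so well for TSDB.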
Data block encoding essentially stores the difference between adjacent keys. This is an effective strategy since HBase stores keys in sorted order, minimizing the differences between each key (especially in the case of TSDB). You can find a more in-depth explanation of data block encoding and its various settings in this Cloudera blog post.
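A toy sketch of the idea (not HBase's actual on-disk format): because the keys arrive sorted, each key can be stored as the length of the prefix it shares with the previous key plus only the differing suffix.

```python
def diff_encode(sorted_keys):
    """Store each key as (shared-prefix length, remaining suffix)."""
    encoded, prev = [], b""
    for key in sorted_keys:
        common = 0
        while common < min(len(prev), len(key)) and prev[common] == key[common]:
            common += 1
        encoded.append((common, key[common:]))
        prev = key
    return encoded

def diff_decode(encoded):
    """Rebuild the full keys from prefix lengths and suffixes."""
    keys, prev = [], b""
    for common, suffix in encoded:
        key = prev[:common] + suffix
        keys.append(key)
        prev = key
    return keys

# Adjacent TSDB-style rowkeys differ only in the trailing timestamp bytes,
# so the stored suffixes are tiny compared with the full keys.
keys = [b"\x00\x00\x01" + ts.to_bytes(4, "big")
        for ts in range(1_500_000_000, 1_500_003_600, 60)]
encoded = diff_encode(keys)
assert diff_decode(encoded) == keys
```

After the first full key, each subsequent entry stores only a couple of bytes of suffix, which is exactly the win FAST_DIFF exploits on TSDB's long, mostly identical rowkeys.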
To test the effectiveness of the various combinations of compression and data block encoding algorithms, they were applied to a sample of the tsdb table.
|Compression Algorithm|Data Block Encoding|Table Size|Compression Ratio|
The testing showed that TSDB data is highly compressible; compression ratios over 20x were observed.
After evaluating the test results we decided to use a combination of SNAPPY compression, FAST_DIFF data block encoding, and a 2-year TTL. SNAPPY was chosen over GZ due to GZ's poor compression/decompression speed. The choice of TTL was less measured and based more on our business needs (many metrics are irrelevant or deprecated after 2 years).
So we altered the table, then ran a script that slowly iterated through the regions, triggering a major compaction on each (major compactions are the only way to reclaim space from deleted cells), and watched.
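The alter step looks roughly like the following in the HBase shell (a sketch, not our exact script; major_compact also accepts an individual region name, which is what a region-by-region script would use to spread out the load):

```
hbase> alter 'tsdb', {NAME => 't', COMPRESSION => 'SNAPPY', DATA_BLOCK_ENCODING => 'FAST_DIFF', TTL => 63072000}
hbase> major_compact 'tsdb'
```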
Disk usage went from ~70 TB to ~4 TB, and of that 4 TB used on the cluster, the
tsdb table accounts for only 1.29 TB! The growth of the table has slowed significantly as well.
These changes not only solved our impending capacity issue and gave us room to grow for the foreseeable future, but they also allow us to do cool things like split the cluster into two.