Near real time processing of time series data with HBase
A known problem when storing time series data in HBase is that using timestamps as row keys creates hot regions: the keys increase monotonically, so new writes all land in the same region. A common solution is to prefix each key with a salt so that the data is distributed over multiple regions. This presents a problem when one wants to process the data ordered by timestamp in a Map/Reduce job, because currently only one Scan object can serve as input. One approach is to start a separate Map/Reduce job for each prefix. Another is to allow multiple Scan objects, one for each prefix, to serve as input to a single Map/Reduce job by implementing a MultiSegmentTableInputFormat. Using the MultiSegmentTableInputFormat has two advantages: a prefix can still be used to avoid hot regions when writing data, and the data can be processed ordered by timestamp in a single Map/Reduce job, thus improving performance. This talk will provide details on how we use HBase for near real time processing of time series data from a real-world application.
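The salting idea and the "one scan per prefix" idea can be illustrated without an HBase cluster. The following is only a minimal sketch under stated assumptions, not the MultiSegmentTableInputFormat from the talk: the bucket count, the key layout (one salt byte followed by an 8-byte big-endian timestamp), the CRC-based salt, and all function names are hypothetical, and a sorted Python list stands in for the table.

```python
import bisect
import heapq
import struct
import zlib

NUM_BUCKETS = 4  # assumed number of salt prefixes; not taken from the talk


def salted_key(timestamp_ms, source_id):
    """Build a row key: 1 salt byte + 8-byte big-endian timestamp (assumed layout)."""
    salt = zlib.crc32(source_id.encode("utf-8")) % NUM_BUCKETS
    return bytes([salt]) + struct.pack(">Q", timestamp_ms)


def timestamp_of(key):
    """Recover the timestamp from a salted key."""
    return struct.unpack(">Q", key[1:9])[0]


def scan_ranges(start_ms, stop_ms):
    """One (start_row, stop_row) pair per salt prefix for the same time window."""
    return [
        (bytes([s]) + struct.pack(">Q", start_ms),
         bytes([s]) + struct.pack(">Q", stop_ms))
        for s in range(NUM_BUCKETS)
    ]


def scan(table, start_row, stop_row):
    """Stand-in for a range scan over a sorted key list (stop row exclusive)."""
    lo = bisect.bisect_left(table, start_row)
    hi = bisect.bisect_left(table, stop_row)
    return table[lo:hi]


def merge_by_timestamp(segments):
    """Merge per-prefix segments, each already time-ordered, into one ordered stream."""
    return heapq.merge(*segments, key=timestamp_of)


if __name__ == "__main__":
    # Simulated table: keys sorted lexicographically, as a key-ordered store keeps them.
    table = sorted(salted_key(t, "sensor-%d" % (t % 7)) for t in range(1000, 1020))
    segments = [scan(table, a, b) for a, b in scan_ranges(1005, 1015)]
    for key in merge_by_timestamp(segments):
        print(key[0], timestamp_of(key))
```

Because the timestamp is stored big-endian, keys that share a salt byte sort by time, so each per-prefix segment is already ordered and only a merge is needed to obtain a single time-ordered stream.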
Watch the video of Christian Richter's talk here.