Saturday, February 9, 2013

Metric States: How fresh is my metric?

Push Summary: Implementation of a STATE code in ChronicleTS and H2.

As part of a larger implementation, APMRouter core now maintains a state code on each created metric. Possible states are:

  • ACTIVE: Data has been submitted for this metric within the last n periods.
  • STALE: No activity has been seen for this metric within the last n periods.
  • OFFLINE: No activity has been seen for this metric for a full time-series live-tier window.
These states are also represented in the enum EntryStatus.
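
To make the three states concrete, here is a minimal sketch of what such an enum could look like when each state is backed by a one-byte code. The name EntryStatusSketch, the numeric codes and the forCode helper are my own assumptions for illustration; the real EntryStatus may be laid out differently.

```java
/**
 * Hypothetical sketch of a status enum backed by a one-byte code.
 * The numeric codes are assumptions, not the values APMRouter uses.
 */
enum EntryStatusSketch {
    ACTIVE((byte) 0),
    STALE((byte) 1),
    OFFLINE((byte) 2);

    private final byte code;

    EntryStatusSketch(byte code) {
        this.code = code;
    }

    /** The byte that would be written to storage for this status. */
    byte code() {
        return code;
    }

    /** Decodes a stored byte back into a status, rejecting unknown codes. */
    static EntryStatusSketch forCode(byte code) {
        for (EntryStatusSketch s : values()) {
            if (s.code == code) {
                return s;
            }
        }
        throw new IllegalArgumentException("Unknown status code: " + code);
    }
}
```

Using an explicit code rather than ordinal() keeps the stored values stable if the constants are ever reordered.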

Under heavy load, we don't want to set a flag in the metric catalog every time we receive data for a metric. This is what the Realtime MetricCatalog does, and the latency is noticeable. By and large, the metric catalog does not deal with metric data, and we don't want to stress the DB by pushing all incoming data through the catalog just to keep the metric states up to date. At the same time, we want to be able to:
  1. Filter metric queries by state (e.g. only populate metric tree branches with non-OFFLINE metrics)
  2. Broadcast metric state changes as events, in the same way we broadcast new-metric events.
So here's how this works (it's not 100% done yet, but this is what's in place so far):

  • The chronicle live tier's entry headers have a new field (offset H_SIZE, 1 byte) which represents the STATE of the metric represented by that entry.
  • The H2 table METRIC has a new TINYINT column, STATE.
  • The ChronicleTSManager now accepts registrations of EntryStatusChangeListeners which are notified of state changes of metrics in the live tier.
  • A background scheduler runs every x time-series periods and choreographs a scan of all the metrics in the live tier, using CORES_AVAILABLE threads. Entries are processed as follows:
    • OFFLINE entries are ignored.
    • STALE entries that have not seen data in a full live-tier window are set to OFFLINE.
    • ACTIVE entries that have not seen data in n periods are set to STALE.
    • These changes are broadcast to the status change listeners (in aggregate) at the end of the scan.
    • The duration of a period and of a tier (expressed as a number of periods) is defined in the ChronicleTSManager using a time-series model expression, which defaults to p=15s,t=5m, meaning that a period is 15 seconds and a tier is 5 minutes.
    • So without any data coming in, all metrics slowly degrade from ACTIVE -> STALE -> OFFLINE.
  • When incoming data is written into the time-series live tier, the state of the entry is automatically set to ACTIVE. If the prior state was not ACTIVE, the change is propagated to the state change listeners through the tsManager. So as long as data is coming in for a metric, it will be kept active.
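
The transition rules above can be sketched as a couple of pure functions. This is only an illustration under stated assumptions: the class and method names are hypothetical, the parser handles only the s and m units that appear in the default expression, and the number of idle periods before an entry goes stale (n) is passed in because the post does not fix a value.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the scan and write transitions; not the ChronicleTSManager code. */
final class StatusScanSketch {
    enum Status { ACTIVE, STALE, OFFLINE }

    /** Parses one duration token such as "15s" or "5m" into milliseconds. */
    static long parseDurationMs(String token) {
        String t = token.trim();
        long value = Long.parseLong(t.substring(0, t.length() - 1));
        char unit = t.charAt(t.length() - 1);
        switch (unit) {
            case 's': return TimeUnit.SECONDS.toMillis(value);
            case 'm': return TimeUnit.MINUTES.toMillis(value);
            default: throw new IllegalArgumentException("Unknown unit in: " + token);
        }
    }

    /** Reads the value for a key ("p" or "t") out of a model such as "p=15s,t=5m". */
    static long modelValueMs(String model, String key) {
        for (String part : model.split(",")) {
            String[] kv = part.trim().split("=");
            if (kv.length == 2 && kv[0].trim().equals(key)) {
                return parseDurationMs(kv[1]);
            }
        }
        throw new IllegalArgumentException("Missing key " + key + " in: " + model);
    }

    /** Status of an entry after a scan, given how long it has gone without data. */
    static Status afterScan(Status current, long idleMs, long periodMs, long tierMs, int stalePeriods) {
        switch (current) {
            case OFFLINE:
                return Status.OFFLINE; // offline entries are ignored by the scan
            case STALE:
                return idleMs >= tierMs ? Status.OFFLINE : Status.STALE;
            default: // ACTIVE
                return idleMs >= stalePeriods * periodMs ? Status.STALE : Status.ACTIVE;
        }
    }

    /** Status of an entry after data is written: always ACTIVE, whatever it was before. */
    static Status afterWrite(Status current) {
        return Status.ACTIVE;
    }
}
```

With the default model, modelValueMs("p=15s,t=5m", "p") gives 15000 and the tier works out to 20 periods. Note that in this sketch an ACTIVE entry only ever drops to STALE in a single scan, so it takes at least two scans to reach OFFLINE, matching the gradual degradation described above.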
The metric catalog registers itself as an EntryStatusChangeListener and processes all the received events, updating STATE and LAST_SEEN (to be renamed STATE_TS). Updates to the catalog database are therefore limited to actual state changes, rather than happening every time new data comes in.
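
To illustrate why this keeps the catalog quiet, here is a rough sketch of the write path notifying a listener only on a change. The StatusChangeListener interface, its method signature and the RecordingCatalog class are hypothetical stand-ins; the real EntryStatusChangeListener may well look different, and the real catalog would issue a database update where this one just records a string.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of change-only notification; not the APMRouter listener API. */
final class CatalogListenerSketch {
    enum Status { ACTIVE, STALE, OFFLINE }

    /** Assumed listener shape: one call per batch of metrics that moved to the same status. */
    interface StatusChangeListener {
        void onStatusChange(Status newStatus, long timestampMs, long[] metricIds);
    }

    /** Stands in for the metric catalog by recording the updates it would perform. */
    static final class RecordingCatalog implements StatusChangeListener {
        final List<String> updates = new ArrayList<>();

        @Override
        public void onStatusChange(Status newStatus, long timestampMs, long[] metricIds) {
            for (long id : metricIds) {
                updates.add(id + " -> " + newStatus + " @ " + timestampMs);
            }
        }
    }

    /** Write path: the entry becomes ACTIVE, and the listener hears about it only on a change. */
    static Status write(Status prior, long metricId, long nowMs, StatusChangeListener listener) {
        if (prior != Status.ACTIVE) {
            listener.onStatusChange(Status.ACTIVE, nowMs, new long[] {metricId});
        }
        return Status.ACTIVE;
    }
}
```

Writing to a STALE metric and then writing to it twice more produces a single recorded update, since the second and third writes find the entry already ACTIVE.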

In a steady state, this works OK, but when a new agent comes online and generates a bunch of new metrics, it puts some serious stress on the system, and I have been seeing H2 timeouts when this happens. Below is the jconsole view of the last elapsed time of a background status check after I started a JBoss instance that publishes about 200 metrics:

[Screenshot: jconsole view of the last elapsed time of the background status check]

Needs more work.

Since the STATE is stored as a TINYINT, I added a new method to the H2StoredProcedure to do a decode, and this is exposed in H2 as the user defined function STATUS(int). Now we can query the METRIC table as follows:

select status(state), count(*) from metric group by status(state)

Results:

STATUS(STATE)   COUNT(*)
ACTIVE          228
OFFLINE         327
STALE           34
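
For anyone curious what the decode might look like, here is a minimal sketch of a static Java method of the kind H2 can expose through CREATE ALIAS. The class name, the package in the alias statement and the numeric codes are assumptions for illustration, not the actual H2StoredProcedure code.

```java
/** Hypothetical sketch of a decode function that could back an H2 alias named STATUS. */
final class StatusFunctionSketch {

    /**
     * Registration statement, assuming this class lived in a package named com.example.
     * H2 requires both the class and the method to be public; the class is left
     * package-private here only so the snippet compiles from any file name.
     */
    static final String CREATE_ALIAS_SQL =
        "CREATE ALIAS STATUS FOR \"com.example.StatusFunctionSketch.status\"";

    /** Decodes the stored state code into its name; the 0/1/2 mapping is an assumption. */
    public static String status(int state) {
        switch (state) {
            case 0: return "ACTIVE";
            case 1: return "STALE";
            case 2: return "OFFLINE";
            default: return "UNKNOWN(" + state + ")";
        }
    }
}
```

Returning a placeholder for unknown codes, rather than throwing, keeps a group-by query like the one above from failing on a single unexpected row.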

