Thursday, February 14, 2013

MetricURI: A metric query processor

MetricURI Expression and Processor

MetricURI is a syntax-compliant java.net.URI that represents an apmrouter query for a set of metric definitions. That is, it is used to query metric meta-data, not metric data. The processor is implemented in the class org.helios.apmrouter.dataservice.json.catalog.MetricURI. The format of the URI is as follows:

<Domain>/<Host>/<Agent>/<Namespace>[:<MetricName>][?<OptionKey1>=<OptionValue1>[&...&<OptionKeyn>=<OptionValuen>]]

Here's a quick example of the simplest instance:

DefaultDomain/njw810/APMRouterServer/platform=os/resource=netstat

The URI is broken out into two parts: the path and the options.

Path Expression Components
The path has a mandatory pattern, Domain/Host/Agent/Namespace, and the expression parser will fail if any of these components is not defined. The MetricName is optional and is expressed by appending ":<MetricName>" to the end of the path. By default, if a metric name is not defined, the query will return all metrics that are direct children of the evaluated path, although some options may filter out some or all of the metrics. Any path component, including the metric name, can be wildcarded (see Wildcards below).
  • Domain: The domain of the target host[s]. (DefaultDomain in the example above)
  • Host: The host of the target agent[s]. (njw810 in the example above)
  • Agent: The agent name of the target namespace[s]. (APMRouterServer in the example above)
  • Namespace: The namespace of the target metric[s]. (platform=os/resource=netstat in the example above)
  • MetricName: The optional metric name of the target metric. (not shown in the example above, but it would look like this: DefaultDomain/njw810/APMRouterServer/platform=os/resource=netstat:TCPOutbound )
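The path rules above can be sketched in a few lines of Java. This is only an illustration of the documented format, not the actual MetricURI class: the class name MetricUriPathSketch and its fields are made up for the example, and it assumes the metric name is whatever follows the last ":" in the path.

```java
import java.net.URI;

/** Illustrative only: splits a MetricURI expression into its documented parts. */
final class MetricUriPathSketch {
    final String domain;
    final String host;
    final String agent;
    final String namespace;
    final String metricName; // null when no ":<MetricName>" suffix is present
    final String query;      // raw option string, null when there is no "?"

    private MetricUriPathSketch(String domain, String host, String agent,
                                String namespace, String metricName, String query) {
        this.domain = domain;
        this.host = host;
        this.agent = agent;
        this.namespace = namespace;
        this.metricName = metricName;
        this.query = query;
    }

    static MetricUriPathSketch parse(String expression) {
        // URI.create throws IllegalArgumentException on syntactically invalid input.
        URI uri = URI.create(expression);
        // Domain, Host and Agent are the first three segments; the remainder is the namespace.
        String[] parts = uri.getPath().split("/", 4);
        if (parts.length < 4 || parts[3].isEmpty()) {
            throw new IllegalArgumentException(
                "Expected Domain/Host/Agent/Namespace but got: " + expression);
        }
        String namespace = parts[3];
        String metricName = null;
        int colon = namespace.lastIndexOf(':'); // assumption: the name follows the last ':'
        if (colon >= 0) {
            metricName = namespace.substring(colon + 1);
            namespace = namespace.substring(0, colon);
        }
        return new MetricUriPathSketch(parts[0], parts[1], parts[2],
                                       namespace, metricName, uri.getQuery());
    }

    public static void main(String[] args) {
        MetricUriPathSketch p = parse(
            "DefaultDomain/njw810/APMRouterServer/platform=os/resource=netstat:TCPOutbound");
        System.out.println(p.domain + " | " + p.host + " | " + p.agent
            + " | " + p.namespace + " | " + p.metricName);
    }
}
```

Splitting with a limit of 4 keeps the namespace intact even though it contains "/" characters of its own.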

Option Expression Components

Options are specified by appending a "?" to the end of the path, followed by one or more key=value pairs. Pairs should be separated by "&" characters.
  • maxd: The maximum depth of the query with respect to the namespace. This option causes the expression parser to treat the namespace as wildcard terminated and limits the level of the metrics retrieved. The level attribute of a metric is the materialized count of how many entries are in the metric's namespace, so it can be likened to the nested depth of a directory tree in a filesystem. For example, platform=os/resource=netstat has a level of 2 and platform=os/resource=if/if=lo has a level of 3. If you specified a namespace of platform=os/resource=if and a maxd of 3, the query would return the applicable metrics for all registered interfaces found under platform=os/resource=if, but no metrics that might be under that at level 4+. For example, to find all the RXBytes metrics for a given agent: DefaultDomain/njw810/APMRouterServer/platform=os/resource=if:RXBytes?maxd=3
  • type: Filters the metric type of the returned metrics as defined in the enum org.helios.apmrouter.metric.MetricType. By default, only numeric (long) metric types are returned. Types are specified as comma separated values, each of which can be either the enum name or the enum ordinal (int); names and ordinals can be mixed. The type name will automatically be trimmed and uppercased. For example, to retrieve all DELTA type netstat metrics for a given agent, any of the following expressions would work:
    • DefaultDomain/njw810/APMRouterServer/platform=os/resource=netstat?type=DELTA_GAUGE,DELTA_COUNTER
    • DefaultDomain/njw810/APMRouterServer/platform=os/resource=netstat?type=2,3
    • DefaultDomain/njw810/APMRouterServer/platform=os/resource=netstat?type=DELTA_GAUGE,3
  • st: Filters the metric status of the returned metrics as defined in the enum org.helios.apmrouter.catalog.EntryStatus. By default, the status filter is ACTIVE and STALE. Statuses are specified as comma separated values, each of which can be either the enum name or the enum ordinal (int); names and ordinals can be mixed. The status name will automatically be trimmed and uppercased. For example, to extend the type filter above to include all metrics regardless of status, any of the following would work:
    • DefaultDomain/njw810/APMRouterServer/platform=os/resource=netstat?type=DELTA_GAUGE,DELTA_COUNTER&st=ACTIVE,STALE,OFFLINE
    • DefaultDomain/njw810/APMRouterServer/platform=os/resource=netstat?type=2,3&st=1,2,3
    • DefaultDomain/njw810/APMRouterServer/platform=os/resource=netstat?type=DELTA_GAUGE,3&st=ACTIVE,2,OFFLINE
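To make the option handling above concrete, here is a rough Java sketch of three pieces: splitting the option string into key/value pairs, resolving a comma separated list of names and ordinals against an enum, and computing a namespace's level. None of this is taken from the apmrouter source. The class MetricUriOptionsSketch is hypothetical, and DemoType is a stand-in enum whose constants and order are invented for the demo, so the real MetricType ordinals may well differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: option parsing as described in the text above. */
final class MetricUriOptionsSketch {
    /** Stand-in enum; names and ordering are assumptions made for this demo. */
    enum DemoType { LONG_GAUGE, LONG_COUNTER, DELTA_GAUGE, DELTA_COUNTER }

    /** Splits "k1=v1&k2=v2" into an ordered map; null or empty input yields an empty map. */
    static Map<String, String> parseOptions(String query) {
        Map<String, String> options = new LinkedHashMap<>();
        if (query == null || query.isEmpty()) {
            return options;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) { // pairs without a key or without '=' are skipped
                options.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return options;
    }

    /** Resolves comma separated tokens, each an enum name or an ordinal, to enum constants. */
    static <E extends Enum<E>> List<E> resolve(Class<E> type, String csv) {
        E[] constants = type.getEnumConstants();
        List<E> resolved = new ArrayList<>();
        for (String raw : csv.split(",")) {
            String token = raw.trim().toUpperCase(Locale.ROOT); // trimmed and uppercased
            if (token.isEmpty()) {
                continue;
            }
            if (token.matches("\\d+")) {
                resolved.add(constants[Integer.parseInt(token)]); // ordinal; out of range throws
            } else {
                resolved.add(Enum.valueOf(type, token));          // name; unknown name throws
            }
        }
        return resolved;
    }

    /** The level of a namespace is the number of entries it contains. */
    static int level(String namespace) {
        return namespace.split("/").length;
    }

    public static void main(String[] args) {
        Map<String, String> options = parseOptions("type=DELTA_GAUGE,3&maxd=3");
        System.out.println(resolve(DemoType.class, options.get("type")));
        System.out.println(level("platform=os/resource=if/if=lo"));
    }
}
```

A maxd filter would then amount to keeping only those metrics whose level is less than or equal to the requested value.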

Wildcards

Wildcards can be specified on path fragments. The wildcard characters and their meaning are the same as standard SQL wildcards:
  • _: Represents a single character wildcard. Must be escaped as "\\_" as this is a valid path fragment character.
  • %: Represents a multiple character wildcard. Must be escaped as "\\%" as this is a valid path fragment character.
  • *: Synonym for % that does not require escaping.
For example, to list the CPU PercentUsage for all agents on a given host:

DefaultDomain/njw810/*/platform=JVM/category=cpu:PercentUsage
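As a rough illustration of how the "*" wildcard behaves, the following Java sketch matches a path fragment pattern against a candidate value. It is not how the processor is implemented (the SQL-style wildcards suggest the matching is done in the database), and the class name WildcardSketch is made up. Only "*" is handled here; the escaped "_" and "%" forms are deliberately left out.

```java
import java.util.regex.Pattern;

/** Illustrative only: '*' matches any run of characters, everything else is literal. */
final class WildcardSketch {
    static Pattern compile(String fragment) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : fragment.toCharArray()) {
            if (c == '*') {
                // Flush the literal text collected so far, quoted so that
                // characters such as '.' or '=' carry no regex meaning.
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(".*");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString());
    }

    static boolean matches(String fragment, String value) {
        return compile(fragment).matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("*", "APMRouterServer"));
        System.out.println(matches("JBoss*", "APMRouterServer"));
    }
}
```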

The following is a screen shot of this example as implemented in the console:

Roadmap Notes

  • Wildcards are not currently supported on any of the options.
  • A pure regex expression is on the road map.

Implementation

To be added.

JSON Request and Response Examples

To be added.

Saturday, February 9, 2013

Metric States: How fresh is my metric?

Push Summary: Implementation of a STATE code in ChronicleTS and H2.

As part of a larger implementation, APMRouter core now maintains a state code on each created metric. Possible states are:

  • ACTIVE: Data has been submitted for this metric within the last n periods.
  • STALE: No activity has been seen for this metric within the last n periods.
  • OFFLINE: No activity has been seen for this metric within the last full time-series live-tier window.
These states are also represented in the enum EntryStatus.

Under heavy load, we don't want to set a flag in the metric catalog every time we receive data for a metric. This is what the Realtime MetricCatalog does, and the latency is noticeable. By and large, the metric catalog does not deal with metric data, and we don't want to stress the DB by pushing all incoming data through the catalog just to keep the metric states up to date. At the same time, we want to be able to:
  1. Filter metric queries by state (e.g. only populate metric tree branches with non-OFFLINE metrics)
  2. Broadcast metric state change events, in the same way we broadcast new metric events.
So here's how this works (it's not 100% done yet, but this is what's done so far):

  • The chronicle live tier's entry headers have a new field (offset H_SIZE, 1 byte) which represents the STATE of the metric represented by that entry.
  • The H2 table METRIC has a new TINYINT column, STATE.
  • The ChronicleTSManager now accepts registrations of EntryStatusChangeListeners which are notified of state changes of metrics in the live tier.
  • A background scheduler runs every x time-series periods, and choreographs a scan of all the metrics in the live tier. (CORES_AVAILABLE threads are used to do this). Entries are processed as follows:
    • Offline entries are ignored.
    • Stale entries that have not seen data in a full live-tier window are set to offline.
    • Active entries that have not seen data in n periods are set to stale.
    • These changes are broadcast to the status change listeners (in aggregate) at the end of the scan.
    • The duration of a period and of a tier (which spans a number of periods) is defined in the ChronicleTSManager using a time-series model expression which defaults to p=15s,t=5m, meaning that a period is 15 seconds and a tier is 5 minutes.
    • So without any data coming in, all metrics slowly degrade from active -> stale -> offline.
  • When incoming data is written into the timeseries live-tier, the state of the entry is automatically set to ACTIVE. If the prior state was not ACTIVE, the change is propagated to the state change listeners through the tsManager. So as long as data is coming in for a metric, it will be kept active.
The metric catalog registers itself as an EntryStatusChangeListener and processes all the received events, updating the STATE and LAST_SEEN (to be renamed STATE_TS) columns. Updates to the catalog database are therefore limited to actual state changes, rather than happening every time new data comes in.
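The state transitions described above can be summarized in a small single-threaded Java sketch. It is a simplified model and not the ChronicleTSManager code: StateScanSketch, its Listener interface and the in-memory map are all hypothetical, the real scan runs on multiple threads against the chronicle live tier, and the number of periods before an entry goes stale (n) is passed in because the text does not give a value.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: ACTIVE -> STALE -> OFFLINE degradation with change notifications. */
final class StateScanSketch {
    enum State { ACTIVE, STALE, OFFLINE }

    /** Hypothetical stand-in for a status change listener. */
    interface Listener { void onChanges(Map<String, State> changes); }

    private static final class Slot {
        State state = State.ACTIVE;
        long lastSeen;
    }

    private final Map<String, Slot> slots = new LinkedHashMap<>();
    private final long staleAfterMs;
    private final long offlineAfterMs;
    private final Listener listener;

    StateScanSketch(long periodMs, int stalePeriods, long tierMs, Listener listener) {
        this.staleAfterMs = periodMs * stalePeriods; // "n periods"
        this.offlineAfterMs = tierMs;                // one full live-tier window
        this.listener = listener;
    }

    /** Incoming data refreshes the timestamp and reactivates the entry if needed. */
    void onData(String metric, long now) {
        Slot slot = slots.get(metric);
        if (slot == null) {
            slot = new Slot(); // new metrics start ACTIVE; no state change event
            slots.put(metric, slot);
        }
        slot.lastSeen = now;
        if (slot.state != State.ACTIVE) {
            slot.state = State.ACTIVE;
            listener.onChanges(Collections.singletonMap(metric, State.ACTIVE));
        }
    }

    /** Background scan: degrade idle entries and broadcast all changes in aggregate. */
    void scan(long now) {
        Map<String, State> changes = new LinkedHashMap<>();
        for (Map.Entry<String, Slot> e : slots.entrySet()) {
            Slot slot = e.getValue();
            long idle = now - slot.lastSeen;
            if (slot.state == State.STALE && idle >= offlineAfterMs) {
                slot.state = State.OFFLINE;
            } else if (slot.state == State.ACTIVE && idle >= staleAfterMs) {
                slot.state = State.STALE;
            } else {
                continue; // OFFLINE entries and fresh entries are left alone
            }
            changes.put(e.getKey(), slot.state);
        }
        if (!changes.isEmpty()) {
            listener.onChanges(changes);
        }
    }

    State stateOf(String metric) {
        Slot slot = slots.get(metric);
        return slot == null ? null : slot.state;
    }
}
```

With the default model of p=15s,t=5m and an assumed n of 2, an entry that stops receiving data would be marked stale by the first scan after 30 seconds of silence and offline by the first scan after 5 minutes.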

In a steady state, this works OK, but when a new agent comes online and generates a bunch of new metrics, this puts some serious stress on the system, and I have been seeing H2 timeouts when this happens. Below is the jconsole view of the last elapsed time of a background status check after I started a JBoss instance that publishes about 200 metrics:



Needs more work.

Since the STATE is stored as a TINYINT, I added a new method to the H2StoredProcedure to do a decode, and this is exposed in H2 as the user defined function STATUS(int). Now we can query the METRIC table as follows:

select status(state), count(*) from metric group by status(state)

Results:

ACTIVE   228
OFFLINE  327
STALE    34
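For readers unfamiliar with H2 user defined functions, the decode function mentioned above boils down to a static Java method that H2 can call once it is registered with CREATE ALIAS. The sketch below shows the general shape only. StatusDecodeSketch and DemoStatus are hypothetical, and the mapping of codes to names (here simply the enum ordinal) is an assumption; the real H2StoredProcedure method and EntryStatus codes may differ.

```java
/** Illustrative only: decodes a numeric STATE code into a readable name. */
final class StatusDecodeSketch {
    /** Stand-in for EntryStatus; constants and ordering are assumptions for this demo. */
    enum DemoStatus { ACTIVE, STALE, OFFLINE }

    /**
     * Returns the status name for a code, or null for an unknown code.
     * In H2, a public static method on a public class could be registered with
     * something like: CREATE ALIAS STATUS FOR "com.example.StatusDecodeSketch.status"
     * (the package name here is hypothetical).
     */
    public static String status(int code) {
        DemoStatus[] all = DemoStatus.values();
        return (code >= 0 && code < all.length) ? all[code].name() : null;
    }

    public static void main(String[] args) {
        System.out.println(status(1));
    }
}
```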