SpurTracer - Docs

How SpurTracer Works...

How Does SpurTracer Monitor Components?

SpurTracer is an agent-less monitoring server that relies on notifications which it auto-correlates, instead of performing active or passive service checks.

The basic idea is to implement stateless notifications in your components. From these notifications SpurTracer correlates the notification flow to detect component interactions, i.e. triggered interfaces. For both interfaces and components SpurTracer tracks call, timeout and error rates as well as performance metrics.

Which Metrics Are Collected?

Currently SpurTracer gathers

  • Error Rate,
  • Timeout Rate,
  • Call Duration
per
  • Host,
  • Component Type,
  • Component Instance,
  • Interface Type,
  • Interface Instance
in 4 fixed intervals
  • Last Hour,
  • Last Day,
  • Last Week,
  • Last Year.
SpurTracer provides many different views, both on general aspects of your system and on specific deployed instances.

SpurTracer Architecture

The following diagram summarizes the SpurTracer architecture.

SpurTracer Architecture

Design Decisions

The current SpurTracer design is driven by the wish to get useful results quickly. This means concentrating on detailed metrics without complex configuration, instead of providing an "eye candy" style GUI that, as in many commercial offerings, lacks easy access to the real measurements.

Therefore the most influential requirements were zero configuration and simple deployment. These requirements had the following consequences:

  1. Simple basic software stack.
    1. No dependency on a 3rd party HTTP daemon for the user interface.
    2. Simple zero configuration key-value store instead of a relational database.
  2. No initial configuration needed.
  3. Simple external notification interface.
  4. No heavy UI toolkit.
  5. No direct Nagios integration.

1.) As a suitable runtime stack we chose Perl, as it provides the necessary networking features and modules for virtually all 3rd party databases and key-value stores. Additionally, several Perl modules provide drop-in implementations of sophisticated, scalable HTTP daemons.

1.a) To be able to run without an existing 3rd party HTTP daemon (such as Apache), SpurTracer is built upon the Perl HTTPd implementation Net::Server::HTTP. This CPAN module provides multiple concurrency implementations as well as many typical HTTPd configuration options. SpurTracer uses the multi-process worker implementation to allow multiple concurrent event submissions.

1.b) While there are many key-value stores available (e.g. MemCache), we chose Redis as it additionally provides simple data structures (lists, sorted lists and hashes) and pattern-based querying, which SpurTracer uses for its different filtering features. With the "Redis" Perl module there is also a simple native binding available. Most importantly, Redis allows the concurrent HTTPd workers to write keys concurrently.

2.) By not relying on a 3rd party HTTPd, SpurTracer can be started without any prerequisites besides a running Redis instance on port 5667. All "settings" have defaults and SpurTracer can receive notifications right after starting. All further settings are only needed to fine-tune alarming and to configure an optional Nagios integration.

3.) For now SpurTracer only supports a simple HTTP GET API for external components to submit notifications. The assumption is that every Unix SW stack at every abstraction layer is able to submit events via HTTP GET (e.g. using wget/curl from shell scripts, sockets in C, java.net.HttpURLConnection in Java...). This is in contrast to a higher-level interface such as HTTP REST, which would impose an additional library on almost all abstraction layers, one that might not be available in the framework a component relies on. Therefore, to avoid imposing dependencies on external components, we provide an HTTP GET based notification API.

4.) For the initial SpurTracer implementation we decided on a simplistic UI solution: XML responses delivered via HTTP GET and rendered with XSLT stylesheets. This allows rendering generic data dumps using specific XSLT views. In the long term this could be replaced with a more user-friendly jQuery based GUI framework fetching data asynchronously and on demand.

5.) One of the most important decisions was not to integrate directly into Nagios as _the_ standard monitoring solution. This is due both to a conceptual mismatch (component monitoring vs. service monitoring) and to requirement 2.) "no initial configuration". Following the Nagios service model, SpurTracer would have to be pre-configured with all service definitions instead of being able to discover components and interfaces from their notifications.

For this reason there is only a light-weight Nagios integration that allows the user to explicitly connect statistic objects (as detected by SpurTracer) with service instances in the user's Nagios configuration. Once connected, SpurTracer will submit service check results via the standard NSCA interface.
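SpurTracer performs this submission itself once a mapping is configured. Purely for illustration, a passive check result on the NSCA interface is a tab-separated line piped into send_nsca, roughly like the following sketch (host, service name, Nagios server and config path are placeholders, not SpurTracer defaults):

# Hypothetical passive check result submission via NSCA
# (tab-separated fields: host, service description, return code, plugin output)
printf "appserver1\tSpurTracer Payment error rate\t0\tOK - error rate below limit\n" |\
	send_nsca -H nagios.example.com -c /etc/send_nsca.cfg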


Notification Flow

The following simplified web shop sequence diagram illustrates the notification schema to be used with SpurTracer. Below you see the interaction of a session-managing frontend with a payment and a delivery service, which themselves use a messaging service to send mails to the customer.

Web Shop Sequence Diagram

SpurTracer can monitor the interactions of the actors in this sequence diagram if each actor notifies SpurTracer when a context is created and when processing is finished. SpurTracer can additionally monitor the interface invocations if each actor calling another one announces the call to SpurTracer. This behaviour is shown in the graph below:

SpurTracer Notification Sequence Diagram

From the notifications shown in the above sequence diagram SpurTracer can build the relations of the created context types SID, PID, DID and MsgId. Determining such a context diagram is the core function of SpurTracer. The context diagram is what SpurTracer considers a "spur" (in the German meaning of the word, which could be translated as the track of someone or something you can follow).

For each edge in this diagram (representing an interface invocation) SpurTracer will gather failure rates and performance metrics.
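For illustration only, the frontend/payment part of this flow could be reported with calls like the following (component names, hosts and context ids are example values; the exact parameters are documented in the Push Notification API section below):

SERVER="http://localhost:8080"
# $(date +%s%3N) yields a [ms] timestamp with GNU date

# Frontend starts processing session context SID=4711
curl -s "${SERVER}/set?time=$(date +%s%3N)&host=web1&type=n&component=Frontend&ctxt=4711&status=started"

# Frontend announces the upcoming payment service execution context PID=815
curl -s "${SERVER}/set?time=$(date +%s%3N)&host=web1&type=c&component=Frontend&ctxt=4711&newcomponent=Payment&newctxt=815"

# Payment service reports start and end of its execution context
curl -s "${SERVER}/set?time=$(date +%s%3N)&host=pay1&type=n&component=Payment&ctxt=815&status=started"
curl -s "${SERVER}/set?time=$(date +%s%3N)&host=pay1&type=n&component=Payment&ctxt=815&status=finished"

# Frontend finishes its own context
curl -s "${SERVER}/set?time=$(date +%s%3N)&host=web1&type=n&component=Frontend&ctxt=4711&status=finished"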


Installing SpurTracer

The SpurTracer source distribution provides an autotools-based installation. To learn more about installing SpurTracer please refer to the Download page!


Implementing SpurTracer Support

Extending Existing SW Components

To integrate one or more of your SW components with SpurTracer you need to extend them to produce push notifications using the SpurTracer API. This requires modifying your SW. Before starting an implementation it might be good to go over the prerequisites.

Implementation Considerations

While it is possible to integrate standalone components into SpurTracer monitoring one by one, more consideration is necessary when doing this as a means of monitoring the whole system, i.e. when the main goal is to monitor the components, their interactions and their non-functional aspects.

For SpurTracer to be able to monitor component interactions - interfaces - it must be able to correlate component contexts, which it identifies using unique component context ids. So one precondition is that each component reports a unique context id per execution.

To correlate components using announcements, SpurTracer requires an announcing component to already know the context id of the future component execution context. This can be realized by the calling and the called component both deriving the context id reported to SpurTracer by the same schema from the information passed over the interface. One of the basic assumptions of SpurTracer is that it is always possible for two components in a caller/callee relationship to use predictable ids when following a convention for the callee execution context id.
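A tiny sketch of such a convention (the order id and the "payment-" prefix are purely hypothetical): the caller predicts the callee context id from data it passes over the interface anyway, and the callee derives its own context id the same way.

# Caller: announces the callee context it predicts from the order id
ORDER_ID=20111215-0042            # hypothetical business object id passed over the interface
newctxt="payment-${ORDER_ID}"     # predicted callee execution context id

# Callee: builds its own context id from the same information by the same schema
ctxt="payment-${ORDER_ID}"        # identical to the announced id, so SpurTracer can correlate both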

How to get good context ids? The following questions might help:

  • Maybe you have unique business object ids?
  • What unique string do you use when searching logs?
  • Do you have transaction ids in the processing?

The answer to those questions might be a valid context id!

Aside from the context id you should also consider which alarms to generate and how long to set the timeout duration per component type.

Check List

Here is a short check list to be considered before starting an implementation.

  1. Identify Objects
    1. Which Components do you want to monitor?
    2. Which Interfaces do you want to monitor?
    3. On which hosts? -> Component and Interface instances
  2. Identify Component Context Id Schema
    • For each component determine the id schema
    • Ensure the ids are at least unique per component instance
    • Globally unique ids are optimal.
  3. Ensure Callee Component Context Id Prediction:
    • Calling components must know the component name of the called component. Both must send notifications using this name.
    • Each calling component must announce the called interface.
    • Announce only interfaces you want to track.
    • You can only track interfaces when both components involved are tracked.
  4. Component SLAs
    1. Maximum error rate -> error alarm setting
    2. Maximum timeout rate -> timeout alarm setting
    3. Maximum execution time -> timeout setting
  5. Interface SLAs
    1. Maximum error rate -> error alarm setting
    2. Maximum timeout rate -> timeout alarm setting
  6. Objects to be Checked via Nagios
    • Determine which objects you want to add Nagios alarms for.
    • Determine which SLA (error rate, timeouts...) you want to monitor.
    • Plan mappings from components to Nagios service definitions.

Using points 1.) to 3.) you can start implementing the notifications in your component code, and using 4.) to 6.) you can configure SpurTracer to correctly handle those notifications.

Using a Notification Proxy

In case you cannot or do not want to modify your existing SW components you might want to consider implementing a notification proxy. Such a proxy could:

  • Monitor log files and produce notifications.
  • Extend the logging framework you use with log levels or special log messages that also issue notifications.

Unix Shell Script Example

A shell script based log file monitor could look like this:

#!/bin/sh

# Note: Change the server location to match your setup
SERVER="http://localhost:8080"

HOST=`uname -n`

# Determine input file
infile=/var/log/myapp.log
if [ ! -f $infile ]; then
        echo "ERROR: No input file found ($infile)!"
        exit 1
fi

# FIXME: Test for curl

################################################################################
# Print the current Unix timestamp in [ms]
################################################################################
get_timestamp() {
	# Getting a [ms] timestamp is a bit complicated...
	nanos=`date +%N`
	nanos=`expr $nanos / 1000000`
	time=`date +%s`
	time=`printf "%d%03d" $time $nanos`
	echo "$time"
}

################################################################################
# Announce an interface
#
# $1	name of triggering component
# $2	id of triggering context
# $3	name of triggered component
# $4	id of announced context
################################################################################
announce() {
	# FIXME: HTTP URI encode parameters 

	curl -s "${SERVER}/set?time=$(get_timestamp)&host=$HOST&type=c&component=$1&ctxt=$2&newcomponent=$3&newctxt=$4"
}

################################################################################
# Send a notification
#
# $1	name of triggering component
# $2	id of triggering context
# $3	status (started|failed|finished)
# $4	optional description
################################################################################
notify() {
	# FIXME: HTTP URI encode parameters 

	curl -s "${SERVER}/set?time=$(get_timestamp)&host=$HOST&type=n&component=$1&ctxt=$2&status=$3&desc=$4"
}

################################################################################
# Main loop
################################################################################

# Note: Use tail --follow=name to stop reading on log rotation
# and loop to reopen the file when recreated...
while true; do
	tail --follow=name $infile |\
	while read date time component ctxt event rest
	do
		# Conditions:
		#
		# 1.) $ctxt must be unique
		#
		# 2.) $date and $time are not used as notification timestamp;
		#     get_timestamp() takes the current time instead. Started and
		#     finished events therefore need to be processed at different
		#     times, otherwise the order might be confused and no
		#     performance metrics can be calculated.
		#
		# 3.) The log format for this example script should be like this:
		#
		#    2011-12-15 14:32:52.10 MyComp 6473753 Started
		#    2011-12-15 14:33:23.04 MyComp 6473753 Invoke OtherComp 6473753
		#    2011-12-15 14:33:40.92 MyComp 6473753 Exception OutOfMemoryException in ...
		#    2011-12-15 14:33:44.12 MyComp 6473753 Finished

		case $event in
			Started)
				notify "$component" "$ctxt" 'started'
				;;
			Invoke)
				# Parse target component and context
				# (avoid piping into 'read', which would run in a subshell)
				set -- $rest
				newcomponent=$1
				newctxt=$2

				announce "$component" "$ctxt" "$newcomponent" "$newctxt"
				;;
			Exception)
				notify "$component" "$ctxt" 'failed' "$rest"
				;;
			Finished)
				notify "$component" "$ctxt" 'finished'
				;;
		esac
	done

	sleep 2
done
					

For more shell script examples have a look at the "simulate_*.sh" scripts in the tools subdirectory of the source distribution. Those are used for testing purposes but can still illustrate the notification schema.

Java Example

In the context of the European research project TRIDEC a Java integration of SpurTracer was implemented. This code can serve as an example of a typical Java integration and can be found in the SVN source repository:

http://svnext.gfz-potsdam.de/fosslab/libraries/trunk/org.fosslab.spurtracer.sender

Push Notification API

The current API is an HTTP GET interface supporting two types of notifications: event notifications and context announcements, described below.

Parameter Encoding

The following parameter encoding limitations exist:

  • Encoding must be UTF-8
  • You must not use the "!" character

URI Schema for Event Notifications

set?type=n&host=<host>&component=<component>&ctxt=<ctxt>&status=<status>&time=<time>&desc=<desc>

The GET parameters have the following meaning:

  type        Always 'n' for event notifications.
  host        The name of the host the notification originates from. Not
              necessarily fully qualified; as short as possible while still
              being unique.
  component   The name of the component that produced the notification.
  ctxt        Unique context identifier. As short as possible. Not necessarily
              descriptive. Used for event correlation.
  status      Execution status: 'started', 'failed' or 'finished'.
  time        Unix time stamp in [ms].
  desc        Human readable event description (optional).
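For example, a 'started' notification for a component Payment on host pay1 (illustrative values; $(date +%s%3N) gives a [ms] timestamp with GNU date) could be submitted with:

$ curl -s "http://localhost:8080/set?type=n&host=pay1&component=Payment&ctxt=815&status=started&time=$(date +%s%3N)"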

URI Schema for Context Announcements

set?type=c&host=<host>&newcomponent=<announced component>&newctxt=<announced ctxt>\
&component=<component>&ctxt=<ctxt>&time=<time>&desc=<desc>

Most of the GET parameters are exactly as for event notifications (see the table above) with the following additional/different parameters:

  type          Always 'c' for context announcements.
  newcomponent  The name of the component whose execution is announced.
  newctxt       Unique context identifier of the announced execution context.
  desc          Human readable name of the interface (optional).
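A matching announcement of that Payment execution by the Frontend (again with illustrative values) would look like:

$ curl -s "http://localhost:8080/set?type=c&host=web1&component=Frontend&ctxt=4711&newcomponent=Payment&newctxt=815&time=$(date +%s%3N)"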

Notification Responses

On success SpurTracer returns HTTP status 200. In case of errors it returns HTTP status 400 with the error message "Invalid data". Usually this indicates missing parameters or invalid encoding; in that case check that you have not used a "!" character or non-UTF-8 encoded characters.
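From a shell script you can check the returned status code with curl, e.g. (using the example notification from above):

$ curl -s -o /dev/null -w '%{http_code}\n' \
    "http://localhost:8080/set?type=n&host=pay1&component=Payment&ctxt=815&status=started&time=$(date +%s%3N)"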


Internal Data Schema

Namespaces

As SpurTracer uses the Redis key-value store for fast data access and persistence, namespaces are needed to group the different types of data. The namespace organisation is as follows:

EBNF notation:

schema ::= ('stats' interval? name property | 'announce' type event | 'event' id event | 'spuren' id)

The global namespace separator used by SpurTracer is the '!' character. If you want to view the live keys used in your Redis setup you might want to run:

$ redis-cli keys "*" | less

Each namespace is described in detail in the following sections.

Events

The event namespace contains a list of all events tracked so far. While it allows time range access, its most important function is to allow event filtering. Therefore the key schema uses field identifier characters that allow matching on one or more fields using Redis wildcards.

The key schema has two variants based on the type field 't': 'n' for notification events and 'c' for context announcements. While notification events have a mandatory status 's' and an optional description 'd', announcements have no further key fields.

EBNF notation:

event ::= 'event' 'd' timestamp 'h' host 'n' component 'c' ctxt 't' (('n' 's' status ('d' description)?)|'c')
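Assuming the '!' separator from above, a notification event key might look roughly like event!d1323959572000!hweb1!nFrontend!c4711!tn!sfinished (the exact concatenation is an implementation detail and may differ); filtering all events of one component then becomes a simple wildcard query:

$ redis-cli keys 'event!*!nFrontend!*'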

Announcements

The announcement namespace is a list of pending events used for lookup and timeout detection. Therefore the key schema must allow easy lookup and the value contains the event that caused the announcement.

EBNF notation:

announcement ::= 'announce' (('interface' host component component ctxt) | ('component' host component ctxt))

Statistic Objects

To correlate events and to provide summaries of the different objects as well as non-functional measurements, SpurTracer needs a special counter logic - the so-called "Statistic Objects".

Each statistic object keeps data per interval and has a type specific name. Currently there are four fixed intervals.

EBNF notation:

statistic_object ::= 'stats' ( 'hour' | 'day' | 'week' | 'year' )? name property

Statistic Object Names

Statistic object names encode both the type and the identity of the statistic object. We distinguish between statistic objects providing information about infrastructure element types ('objects' like host, component, interface) and 'instances' of these types (actual components and interfaces).

EBNF notation:

name ::= (('object' (('host' host) | ('component' component) | ('interface' source_component target_component)) | 'instance' (('component' host component) | ('interface' source_host source_component target_component))))

Statistic Object Properties

Each statistic object has different properties. Currently there are two types: event counters and performance metrics. While event counters are simply incremented on matching events, the perf_samples counter is incremented each time a performance value is added. All properties are kept as interval-sized arrays which are accessed with ring buffer semantics.

EBNF notation:

property ::= ((('started' | 'failed' | 'timeout' | 'finished') array) | (('perf_values' | 'perf_samples') array))
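To illustrate the ring buffer semantic: the array slot to update can be derived from the current time with a modulo operation. The slot count and resolution below are made up for this sketch and are not necessarily what SpurTracer uses internally:

# Hypothetical ring buffer indexing: 60 one-minute slots for the 'hour' interval
SLOTS=60
now=$(date +%s)
slot=$(( (now / 60) % SLOTS ))    # index of the array element to update for the current sample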