Protecting Against Textfile Collector Script Failure

November 20, 2019
Internet & Technology
Oliver

Prometheus textfile collector scripts for bespoke metrics can fail in a way that leaves all your monitors looking fine while there is actually a problem. Here's how to prevent that.

When you want to monitor a custom metric with the Prometheus monitoring system, one approach is to run the node exporter with the --collector.textfile.directory flag. The exporter then exposes the metrics from any *.prom files in that directory, which lets us add custom metrics quickly and easily.
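As a minimal sketch of what such a file might contain, the snippet below writes a single gauge in the Prometheus text format. The helper name write_sample_metric, the file name my_batch_job.prom and the metric name are illustrative assumptions, not anything the node exporter defines. It writes the file directly for brevity; writing it safely is covered in the next section.

```shell
#!/bin/sh
# Hypothetical helper: write one example gauge into a textfile collector directory.
# $1 is assumed to be the directory the node exporter was started with, e.g.
#   node_exporter --collector.textfile.directory="$1"
write_sample_metric() {
    dir=$1
    {
        echo '# HELP my_batch_job_last_run_timestamp_seconds Unix time of the last run (example metric).'
        echo '# TYPE my_batch_job_last_run_timestamp_seconds gauge'
        echo "my_batch_job_last_run_timestamp_seconds $(date +%s)"
    } > "$dir/my_batch_job.prom"
}
```

Calling write_sample_metric with the collector directory as its only argument produces a file the node exporter can pick up on its next scrape.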

Textfile Collector Cron Jobs

The recommended approach is to use a cron job to generate these text files. Here's an example based on the node exporter README:

scriptToOutputMetrics > metrics.prom.$$
mv metrics.prom.$$ metrics.prom

There is an alternative approach recommended in the community node exporter textfile collector scripts repo:

scriptToOutputMetrics | sponge metrics.prom

Both of these approaches stop the node exporter from reading metrics.prom while it's still being written (so-called "atomic" changes), but what happens if scriptToOutputMetrics fails to run?

The Problem

Neither of the scripts above checks whether scriptToOutputMetrics succeeded. If a failed run leaves metrics.prom untouched, for example because the wrapper runs under set -e and aborts before the mv, the previous copy of metrics.prom isn't overwritten! When the node exporter next reads the file, as far as it's concerned the values in metrics.prom have simply stayed the same.
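The staleness is easy to reproduce in a scratch directory. The sketch below assumes the cron wrapper runs under sh -e, so a failing generator aborts the job before the mv. The function run_failing_job is hypothetical, and false stands in for a scriptToOutputMetrics run that fails without output.

```shell
#!/bin/sh
# Simulate a cron wrapper running under `sh -e` (an assumption about how the
# wrapper is written). `false` is a stand-in for a failing scriptToOutputMetrics,
# so the child shell exits before it reaches the mv.
run_failing_job() {
    sh -ec '
        cd "$1"
        false > metrics.prom.$$
        mv metrics.prom.$$ metrics.prom
    ' sh "$1"
}

demo_dir=$(mktemp -d)
echo 'my_metric 1' > "$demo_dir/metrics.prom"   # left behind by an earlier, successful run
run_failing_job "$demo_dir" || echo "job failed"
cat "$demo_dir/metrics.prom"                     # the old sample is still there
```

The job reports a failure, yet metrics.prom still holds the sample from the earlier run, which is exactly what the node exporter would keep serving.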

The Solution

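All that needs to change to mitigate this problem is to check whether scriptToOutputMetrics succeeded, and to delete the metrics.prom file if it didn't. Testing the mv alone isn't enough, because the shell creates metrics.prom.$$ for the redirect even when the script fails:

scriptToOutputMetrics > metrics.prom.$$ \
  && mv metrics.prom.$$ metrics.prom \
  || rm -f metrics.prom.$$ metrics.prom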

This causes those metrics to disappear, which can then be detected and trigger an alert.
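For cron jobs that publish several files, the same idea can be wrapped in a small function. This is only a sketch: publish_metrics and its argument order are hypothetical, and it treats any non-zero exit status from the generator as a failure.

```shell
#!/bin/sh
# Hypothetical wrapper: publish_metrics TARGET COMMAND [ARGS...]
# Runs COMMAND and publishes its output to TARGET only if COMMAND succeeds.
# On any failure both the temp file and TARGET are removed, so the metrics
# disappear instead of going stale.
publish_metrics() {
    target=$1
    shift
    tmp="$target.$$"
    if "$@" > "$tmp" && mv "$tmp" "$target"; then
        return 0
    fi
    rm -f "$tmp" "$target"
    return 1
}
```

It would be called as publish_metrics metrics.prom scriptToOutputMetrics. On the Prometheus side, an expression such as absent(my_metric), with my_metric standing in for one of your own metric names, is one way to notice that the series has gone missing.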
