Protecting Against Textfile Collector Script Failure
Textfile collector scripts that feed bespoke metrics to Prometheus can fail in a way that leaves all your monitors looking fine while there is actually a problem. Here's how to prevent that.
When you want to monitor a metric with the Prometheus monitoring system, one approach is to use the node exporter with the --collector.textfile.directory flag. This lets us add custom metrics easily and quickly.
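As a rough illustration, a file picked up by the textfile collector is simply a set of metrics in the Prometheus text exposition format. The sketch below writes one by hand. The scratch directory and the metric name my_batch_job_last_run_seconds are invented for this example and are not part of any real setup.

```shell
# Hypothetical directory; in practice this is whatever path you pass to
# the node exporter via --collector.textfile.directory.
textfile_dir="$(mktemp -d)"

# Write a single gauge in the Prometheus text exposition format.
# The metric name is made up for this example.
{
  echo '# HELP my_batch_job_last_run_seconds Unix time of the last batch run.'
  echo '# TYPE my_batch_job_last_run_seconds gauge'
  echo "my_batch_job_last_run_seconds $(date +%s)"
} > "$textfile_dir/my_batch_job.prom"

# Show what the node exporter would read.
cat "$textfile_dir/my_batch_job.prom"

# The node exporter would then be started along these lines:
#   node_exporter --collector.textfile.directory="$textfile_dir"
```

Files need the .prom extension to be picked up, which is why the examples in this post all write to metrics.prom.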
Textfile Collector Cron Jobs
The recommended approach is to use a cron job to generate these text files. Here's an example based on the node exporter README:
scriptToOutputMetrics > metrics.prom.$$ && mv metrics.prom.$$ metrics.prom
There is an alternative approach recommended in the community node exporter textfile collector scripts repo:
scriptToOutputMetrics | sponge metrics.prom
Both of these approaches prevent the metrics.prom file from being read while it's being written to (so-called "atomic" changes). But what happens if scriptToOutputMetrics fails to run?
The Problem
In both of the scripts above, if scriptToOutputMetrics exits without writing anything to stdout, the previous copy of metrics.prom isn't overwritten! This means that when the node exporter next reads the file, as far as it's concerned the values in metrics.prom have simply stayed the same, so stale data keeps being exported and nothing looks wrong.
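The failure mode can be reproduced in a scratch directory. This is only a sketch: it assumes the two commands of the cron job are chained with &&, and it replaces scriptToOutputMetrics with a stand-in shell function whose FAIL switch and demo_metric value are made up for the demonstration.

```shell
# Work in a scratch directory so nothing real is touched.
workdir="$(mktemp -d)"

# Hypothetical stand-in for scriptToOutputMetrics: it succeeds unless
# the made-up FAIL switch is set to 1.
scriptToOutputMetrics() {
  if [ "${FAIL:-0}" = 1 ]; then
    return 1
  fi
  echo "demo_metric 42"
}

# First run: the script works, so metrics.prom is created.
scriptToOutputMetrics > "$workdir/metrics.prom.$$" \
  && mv "$workdir/metrics.prom.$$" "$workdir/metrics.prom"

# Second run: the script fails, so the mv is skipped...
FAIL=1
scriptToOutputMetrics > "$workdir/metrics.prom.$$" \
  && mv "$workdir/metrics.prom.$$" "$workdir/metrics.prom"

# ...and the old value is still there for the node exporter to read.
cat "$workdir/metrics.prom"   # prints: demo_metric 42
```

Nothing in the file indicates that the second run failed, which is exactly why dashboards and alerts built on this metric keep looking healthy.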
The Solution
All that needs to change to mitigate this problem is to delete the existing metrics.prom file whenever a new one can't be moved into place:
scriptToOutputMetrics > metrics.prom.$$ && mv metrics.prom.$$ metrics.prom || rm metrics.prom
This causes those metrics to disappear, which can then be detected and trigger an alert.
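To check that the fix behaves as described, the sketch below runs it against a script that always fails. As before, the stand-in function, the scratch directory, and demo_metric are assumptions made for the example. The extra cleanup of the temporary file and the use of rm -f are small additions of this sketch rather than part of the one-liner above.

```shell
workdir="$(mktemp -d)"
target="$workdir/metrics.prom"
tmp="$target.$$"

# Hypothetical stand-in for scriptToOutputMetrics that always fails.
scriptToOutputMetrics() {
  return 1
}

# Pretend an earlier, successful run left metrics behind.
echo "demo_metric 42" > "$target"

# The fix: if the script or the mv fails, remove the stale file.
scriptToOutputMetrics > "$tmp" && mv "$tmp" "$target" || rm -f "$target"

# Tidy up the temporary file left behind by the failed run.
rm -f "$tmp"

if [ -e "$target" ]; then
  echo "stale metrics still present"
else
  echo "metrics removed, so their absence can be alerted on"
fi
```

On the Prometheus side, one way to detect the missing series is an alert expression built on the absent() function, for example absent(demo_metric) for the made-up metric used here.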