Protecting Against Textfile Collector Script Failure
Prometheus textfile collector scripts that export bespoke metrics can fail silently: every monitor looks fine even though there is a problem. Here's how to prevent that.
When you want to monitor a metric with the Prometheus monitoring system, one approach is to run the node exporter with the --collector.textfile.directory flag. This makes it quick and easy to add custom metrics.
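As a rough illustration of what the textfile collector picks up, the sketch below writes a single gauge in the Prometheus text exposition format. The directory, the file name example.prom and the metric name my_batch_job_last_run_seconds are assumptions made up for this example, not names from any real setup.

```shell
#!/bin/sh
# Hypothetical directory; in production this would be the path passed to
# node_exporter via --collector.textfile.directory.
textfile_dir="${TEXTFILE_DIR:-/tmp/textfile_collector_demo}"
mkdir -p "$textfile_dir"

# Write one gauge to a temporary name first, then rename it into place.
# my_batch_job_last_run_seconds is an illustrative metric name.
cat > "$textfile_dir/example.prom.$$" <<EOF
# HELP my_batch_job_last_run_seconds Unix time of the last batch job run.
# TYPE my_batch_job_last_run_seconds gauge
my_batch_job_last_run_seconds $(date +%s)
EOF
mv "$textfile_dir/example.prom.$$" "$textfile_dir/example.prom"

# The exporter would then be started along these lines:
#   node_exporter --collector.textfile.directory="$textfile_dir"
```

The collector only reads files ending in .prom, so the temporary file is ignored until it has been renamed.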
Textfile Collector Cron Jobs
The recommended approach is to use a cron job to generate these text files. Here's an example based on the node exporter README:
scriptToOutputMetrics > metrics.prom.$$ && mv metrics.prom.$$ metrics.prom
There is an alternative approach recommended in the community node exporter textfile collector scripts repo:
scriptToOutputMetrics | sponge metrics.prom
Both of these approaches prevent the metrics.prom file from being read while it is still being written (a so-called "atomic" update). But what happens if scriptToOutputMetrics fails to run?
In both of the scripts above, if scriptToOutputMetrics fails and exits without writing to stdout, the previous copy of metrics.prom isn't overwritten! When the node exporter next reads the file, the values in metrics.prom simply appear unchanged, so nothing looks wrong.
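The failure mode can be reproduced in a scratch directory. This is only a sketch: it assumes the redirect and the mv are chained with && so that mv runs only when the script succeeds, and the scriptToOutputMetrics function below is a stand-in that fails when the hypothetical FAIL variable is set to 1.

```shell
#!/bin/sh
# Work in a throwaway directory so nothing real is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Stand-in for scriptToOutputMetrics: succeeds unless FAIL=1 is set.
scriptToOutputMetrics() {
    if [ "${FAIL:-0}" = 1 ]; then
        return 1
    fi
    echo "my_metric 42"
}

# Run 1: the script works, so metrics.prom is published.
scriptToOutputMetrics > metrics.prom.$$ && mv metrics.prom.$$ metrics.prom

# Run 2: the script fails, mv is skipped, and the old file is left in place.
FAIL=1
scriptToOutputMetrics > metrics.prom.$$ && mv metrics.prom.$$ metrics.prom

cat metrics.prom   # still shows the value from run 1
```

From the exporter's point of view the second run never happened: the file still parses and still carries the old value.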
All that needs to change to mitigate this problem is to delete the metrics.prom file whenever the script or the move fails:
scriptToOutputMetrics > metrics.prom.$$ && mv metrics.prom.$$ metrics.prom || rm -f metrics.prom
This causes those metrics to disappear, and their absence can then be detected (for example, with PromQL's absent() function) and used to trigger an alert.
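For cron jobs that need a little more care, the same idea can be wrapped in a small helper. The sketch below is one possible shape rather than anything taken from the node exporter project: publish_metrics, good and bad are hypothetical names. It also treats empty output as a failure, since the shell creates the temporary file as soon as the redirect is set up, even if the script then produces nothing.

```shell
#!/bin/sh
# publish_metrics CMD TARGET: run CMD, publish its output atomically to TARGET,
# or remove TARGET if CMD fails or prints nothing.
publish_metrics() {
    cmd=$1
    target=$2
    tmp="$target.$$"

    # -s is true only if the file exists and is non-empty.
    if "$cmd" > "$tmp" && [ -s "$tmp" ]; then
        mv "$tmp" "$target"
    else
        # Drop both the stale metrics and the leftover temporary file.
        rm -f "$target" "$tmp"
        return 1
    fi
}

# Illustrative generators: one that works and one that fails.
good() { echo "my_metric 1"; }
bad() { return 1; }

demo_dir=$(mktemp -d)
publish_metrics good "$demo_dir/metrics.prom"   # metrics.prom now exists
publish_metrics bad "$demo_dir/metrics.prom" ||
    echo "generator failed; stale metrics removed"
```

Removing the temporary file as well keeps failed runs from leaving a trail of metrics.prom.PID files in the collector directory.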