Monitoring

Tools Used

node-exporter

Currently, all NixOS boxes run node-exporter, which exposes system metrics on port 9100. node-exporter is installed with various collectors enabled to gather those metrics.

node-exporter is configured by nixos here
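
As a rough illustration of what node-exporter serves on port 9100, the sketch below fetches the `/metrics` endpoint and parses the Prometheus text format into a dictionary. The function names are hypothetical, and the parser assumes samples carry no trailing timestamp, which is typical of node-exporter output.

```python
from urllib.request import urlopen


def parse_metrics(text):
    """Parse Prometheus text exposition lines into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: the value is the last space-separated field
        # (no trailing timestamp on the sample line).
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def fetch_node_metrics(host, port=9100, timeout=5):
    """Fetch and parse /metrics from a node-exporter instance."""
    with urlopen(f"http://{host}:{port}/metrics", timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Calling `fetch_node_metrics` with the DNS name of any box running node-exporter should return its current samples, for example the `node_load1` gauge.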

Prometheus

Prometheus is configured to scrape metrics from various sources. Prometheus is currently deployed with static scrape configs pointing to DNS entries of servers.

All servers need to be added here to ensure they are scraped by prometheus for metrics.

By default, Prometheus scrapes every 15s; this scrape frequency may need to be reduced (to an interval of 30s or 1m) later on. All data is retained for 15 days by default. Redbrick currently has no use case for long-term data, but if one arises, an InfluxDB or Graphite database should be used as a remote_write target for Prometheus.
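
To get a feel for the trade-off between scrape interval and stored data, the following sketch counts how many samples a single time series accumulates over the 15-day retention window at each interval mentioned above. The function name is hypothetical and the figures ignore missed scrapes and compression.

```python
def samples_per_series(scrape_interval_s, retention_days=15):
    """Number of samples one series holds across the retention window."""
    retention_s = retention_days * 24 * 60 * 60
    return retention_s // scrape_interval_s


for interval in (15, 30, 60):
    print(f"{interval}s interval: {samples_per_series(interval)} samples per series")
# 15s interval: 86400 samples per series
# 30s interval: 43200 samples per series
# 60s interval: 21600 samples per series
```

Doubling the interval halves the number of stored samples per series, at the cost of coarser resolution in dashboards and alerts.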

fluentd

Fluentd is used as a syslog endpoint; logs are sent to log.internal:514. Fluentd can be configured to parse and tag logs. Manual parsing of logs should be avoided in fluentd in favour of Loki and fluentd plugins.

Fluentd is configured to send logs to Loki on the same host it is running on.
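
A service can forward its logs to the syslog endpoint along these lines. This is a minimal sketch using Python's standard `logging.handlers.SysLogHandler`; the function name and log format are hypothetical, and UDP transport is an assumption, since the transport fluentd listens on is not stated here.

```python
import logging
import logging.handlers
import socket


def make_syslog_logger(name, host="log.internal", port=514):
    """Return a logger that forwards records to a syslog endpoint."""
    handler = logging.handlers.SysLogHandler(
        address=(host, port),
        # Assumption: UDP transport; use socket.SOCK_STREAM for TCP.
        socktype=socket.SOCK_DGRAM,
    )
    # Prefix each message with the logger name so the receiver can tag it.
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

For example, `make_syslog_logger("example-service").info("started")` would emit one syslog message tagged with the hypothetical name `example-service`.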

Loki

Loki is Grafana's logging solution and is queryable in Grafana. All logs should be configured to send to it. Loki supports multiple ways to receive logs; Redbrick uses fluentd and the Docker logging driver.

To send logs to Loki using a Loki client, point the client at log.internal:3100.
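
For cases where no ready-made client is available, logs can also be pushed directly to Loki's HTTP push API at `/loki/api/v1/push`. The sketch below builds the JSON payload and posts it; the function names and label values are hypothetical, and plain HTTP on port 3100 is an assumption.

```python
import json
import time
from urllib.request import Request, urlopen


def build_push_payload(labels, lines, now_ns=None):
    """Build a Loki push payload: one stream, one entry per log line."""
    if now_ns is None:
        now_ns = time.time_ns()
    return {
        "streams": [
            {
                "stream": dict(labels),
                # Loki expects [timestamp in nanoseconds as a string, line].
                "values": [[str(now_ns), line] for line in lines],
            }
        ]
    }


def push_to_loki(labels, lines, base_url="http://log.internal:3100"):
    """POST log lines to Loki and return the HTTP status code."""
    body = json.dumps(build_push_payload(labels, lines)).encode("utf-8")
    req = Request(
        base_url + "/loki/api/v1/push",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req, timeout=5) as resp:
        return resp.status
```

Keeping the label set small, for example `{"job": "example-service"}`, avoids creating a large number of streams in Loki.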

Grafana

Grafana is a graphing front end. It has a large number of dashboards for reviewing metrics and logs from every node. Alerts should be configured in Grafana to notify admins and root holders when events occur, based on metrics or log events.