Concepts
We assume a cluster of nodes here, each running one or more containers. At any point in time you want insight into what is going on on the nodes and in your containers and services.
A good starting point is James Turnbull's wonderful The Art of Monitoring book or, if you've only got 10 minutes, Netsil's Making Sense of the Application Monitoring Landscape.
Tooling
On each node you can, for example, use collectd and cAdvisor to collect data locally. For the other functions there are multiple options:
- event router:
  - fluentd http://www.fluentd.org
  - Flume https://flume.apache.org
  - Kafka https://kafka.apache.org
  - logstash https://www.elastic.co/products/logstash
  - Riemann http://riemann.io
- storage:
  - Elasticsearch https://www.elastic.co/products/elasticsearch
  - Graphite https://graphiteapp.org
  - InfluxDB https://influxdata.com/time-series-platform/influxdb
  - KairosDB (on top of Cassandra) https://kairosdb.github.io
  - OpenTSDB (on top of HBase) http://opentsdb.net
  - (others, such as a local filesystem, Ceph FS, HDFS, etc.)
- dashboard:
  - D3 https://d3js.org
  - Grafana https://grafana.net
  - SignalFx https://signalfx.com
- alerting:
  - BigPanda https://bigpanda.io
  - PagerDuty https://www.pagerduty.com
  - SignalFx https://signalfx.com
  - VictorOps https://victorops.com
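To make the division of labor between these roles concrete, here is a minimal in-process sketch of how events might flow from a node-local collector through an event router into storage and alerting. It does not use any of the tools listed above: the `Event`, `Router`, `Storage` and `Alerter` classes, the `cpu.percent` metric name and the threshold are all hypothetical stand-ins, and the samples are hard-coded rather than scraped from collectd or cAdvisor. The dashboard role is left out; it would simply read from the storage component.

```python
import time
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Event:
    """One measurement taken on a node (all field names are illustrative)."""
    host: str
    metric: str
    value: float
    timestamp: float = field(default_factory=time.time)


class Storage:
    """Stand-in for a time-series store: keeps samples in memory per series."""

    def __init__(self) -> None:
        self.series: Dict[Tuple[str, str], List[Tuple[float, float]]] = defaultdict(list)

    def write(self, event: Event) -> None:
        self.series[(event.host, event.metric)].append((event.timestamp, event.value))


class Alerter:
    """Stand-in for an alerting service: records events above a threshold."""

    def __init__(self, metric: str, threshold: float) -> None:
        self.metric = metric
        self.threshold = threshold
        self.fired: List[Event] = []

    def check(self, event: Event) -> None:
        if event.metric == self.metric and event.value > self.threshold:
            self.fired.append(event)


class Router:
    """Stand-in for an event router: fans each event out to every subscriber."""

    def __init__(self) -> None:
        self.subscribers: List[Callable[[Event], None]] = []

    def subscribe(self, handler: Callable[[Event], None]) -> None:
        self.subscribers.append(handler)

    def publish(self, event: Event) -> None:
        for handler in self.subscribers:
            handler(event)


storage = Storage()
alerter = Alerter(metric="cpu.percent", threshold=90.0)
router = Router()
router.subscribe(storage.write)
router.subscribe(alerter.check)

# Hard-coded samples standing in for what a node-local agent would report.
for host, value in [("node-1", 35.0), ("node-2", 97.5), ("node-1", 40.0)]:
    router.publish(Event(host=host, metric="cpu.percent", value=value))

print(len(storage.series[("node-1", "cpu.percent")]))  # samples stored for node-1
print([event.host for event in alerter.fired])         # hosts that tripped the alert
```

The point of the router in the middle is that collectors only need to know one destination, while storage, alerting and any other consumer can be added or swapped without touching the nodes.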
There are also a number of integrated or end-to-end solutions, as well as fully managed offerings:
- Amazon CloudWatch https://aws.amazon.com/cloudwatch
- AppDynamics https://www.appdynamics.com
- Azure Monitor https://azure.microsoft.com/services/application-insights
- Circonus https://www.circonus.com
- DataDog https://www.datadoghq.com
- dcos/metrics
- Ganglia http://ganglia.info
- Google Stackdriver https://cloud.google.com/monitoring
- Hawkular http://www.hawkular.org/
- Icinga https://www.icinga.com
- Librato https://www.librato.com
- Nagios https://www.nagios.org
- New Relic https://newrelic.com
- OpsGenie https://www.opsgenie.com
- Pingdom https://www.pingdom.com
- Prometheus https://prometheus.io
- Ruxit http://www.dynatrace.com/en/ruxit
- Sensu https://sensuapp.org
- Sysdig https://sysdig.com
- Zabbix http://www.zabbix.com
For more hints on how to select tools and apply good practices, check out the following resources:
- Comparing Seven Monitoring Options for Docker via rancher.com
- Why Percentiles Don’t Work the Way you Think via vividcortex.com
- The Art of Testing Alerts via signifai.io
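One percentile pitfall that is easy to reproduce yourself is averaging per-node percentiles and treating the result as a cluster-wide percentile. The sketch below uses made-up latency numbers for two hypothetical nodes and a simple nearest-rank percentile helper written for this example (it is not taken from any of the resources above); treat it as an illustration under those assumptions.

```python
import math
from typing import List


def percentile(values: List[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up request latencies in milliseconds for two hypothetical nodes.
node_a = [10.0] * 95 + [20.0] * 5   # 100 requests, nearly all fast
node_b = [10.0] * 5 + [500.0] * 5   # 10 requests, half of them slow

p95_a = percentile(node_a, 95)
p95_b = percentile(node_b, 95)

# Averaging the two per-node values ignores how many requests each node served.
average_of_p95 = (p95_a + p95_b) / 2

# Computing the percentile over all raw samples gives the cluster-wide value.
true_p95 = percentile(node_a + node_b, 95)

print(p95_a, p95_b, average_of_p95, true_p95)
```

With these numbers the averaged value is far above the percentile computed over the combined samples, because the lightly loaded node contributes as much to the average as the busy one. If you need cluster-wide percentiles, keep the raw samples or mergeable histograms rather than pre-computed per-node percentiles.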
There's an excellent conference on this topic, Monitorama, which you should totally keep an eye on.