Observability is crucial for running a robust system in production. You need to know how the components interact and how the infrastructure is behaving. It's fundamental to know as early as possible when something is not operating as expected.
A robust monitoring system allows gathering metrics from services and infrastructure and using those metrics to gain insight into the operation of the system. A monitoring system should provide a way to collect data and to store, display, and analyze it. Monitoring does not only allow you to react to issues; it can also be used to predict system behavior or to provide data for business analytics (using the collected metrics).
A monitoring stack is constructed out of:
Each of these components feeds aggregated data from multiple services into its own dashboard. Monitoring an individual service in isolation is of little to no use: although services provide isolation, they act as an organism. They depend on each other and on the infrastructure layer (databases, caches, queues, network…).
A monitoring system should allow you to know what is broken and why. With it, it should be easy to quickly spot any symptom and use the available monitors to determine its causes. The symptoms and causes you find vary depending on the observation point, so you must make sure that you are looking from the correct place.
There are four signals to focus on when collecting metrics from any system:
- latency: the time that passes between a request being made to a service and the service completing it
- errors: the number of requests that don't result in a successful outcome
- traffic: the amount of demand placed on the system
- saturation: a measure of how full the service's capacity is (usually CPU, memory, and network)
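The first three signals can be sketched with a thin wrapper around a request handler. This is a toy illustration, not a real client library; the `Signals` class and handler are hypothetical, and saturation would in practice come from host-level metrics (CPU, memory, network) rather than application code.

```python
import time

# Minimal sketch: record latency, errors, and traffic around a
# hypothetical request handler. All names here are illustrative.
class Signals:
    def __init__(self):
        self.latencies = []   # latency: time to complete each request
        self.errors = 0       # errors: requests with no successful outcome
        self.requests = 0     # traffic: total demand placed on the system

    def observe(self, handler, request):
        self.requests += 1
        start = time.perf_counter()
        try:
            return handler(request)
        except Exception:
            self.errors += 1
            raise
        finally:
            self.latencies.append(time.perf_counter() - start)

signals = Signals()
signals.observe(lambda r: r.upper(), "ping")   # successful request
try:
    signals.observe(lambda r: 1 / 0, "boom")   # failing request
except ZeroDivisionError:
    pass
```

After these two calls the wrapper has counted two requests, one error, and two latency samples.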
It is important to determine which metric type is best suited for a given observed resource. There are several different types:
- counters: a single numerical value representing a cumulative metric (e.g. number of requests, number of errors, bytes transmitted…)
- gauges: a single numerical value that can go up or down on a given scale (e.g. number of connections to a database, memory used, CPU used, average load…)
- histograms: sample observations categorized into buckets (by type, time…) (e.g. I/O latency, request latency…)
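The three metric types can be sketched as toy classes. In a real system you would use a metrics client library rather than rolling your own; these classes and names are purely illustrative.

```python
import bisect

class Counter:
    """Cumulative value that only ever goes up (e.g. requests served)."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        self.value += amount

class Gauge:
    """Value that can go up or down (e.g. open database connections)."""
    def __init__(self):
        self.value = 0
    def set(self, value):
        self.value = value

class Histogram:
    """Observations counted into buckets (e.g. request latency)."""
    def __init__(self, buckets):
        self.buckets = sorted(buckets)          # bucket upper bounds
        self.counts = [0] * (len(buckets) + 1)  # last slot catches the rest
    def observe(self, value):
        self.counts[bisect.bisect_left(self.buckets, value)] += 1

requests = Counter()
requests.inc()
connections = Gauge()
connections.set(12)
latency = Histogram(buckets=[0.1, 0.5, 1.0])
latency.observe(0.3)  # falls into the 0.5-second bucket
```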
Once a monitoring system is in place you should be able to determine when somebody needs to be alerted, i.e. when certain conditions in the system are met. Moreover, alerts need to be prioritized and categorized, because the services and infrastructure will most likely trigger multiple alerts.
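A simple way to picture prioritized alerting is a set of threshold rules, each with a severity, evaluated against the latest metric values. This is a hypothetical sketch; the rule shape, severities, and thresholds are assumptions, not a real alerting product's API.

```python
# Hypothetical threshold-based alert rules with severities.
def evaluate_alerts(metrics, rules):
    """Return the (severity, message) pairs whose condition is met."""
    return [(r["severity"], r["message"])
            for r in rules if r["condition"](metrics)]

rules = [
    {"severity": "page",   "message": "error rate above 5%",
     "condition": lambda m: m["error_rate"] > 0.05},
    {"severity": "ticket", "message": "CPU saturation above 80%",
     "condition": lambda m: m["cpu"] > 0.80},
]

# Only the error-rate rule fires for these values.
fired = evaluate_alerts({"error_rate": 0.09, "cpu": 0.40}, rules)
```

Attaching a severity to each rule is what lets you decide which alerts page someone immediately and which merely open a ticket.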
As mentioned earlier, multiple services are involved in providing functionality to a user. Without a central access point to the data, it is hard to understand what is going on inside the system. To achieve observability, data must be collected from multiple sources: not only from the running services but also from the infrastructure.
To store log data effectively and make it searchable, there must be a consistent format (agreed upon by the engineering team) which guarantees that it can be stored and processed efficiently. There may be data from multiple sources that you collect and use (e.g. application logs, database logs, network logs…), and for some of them you won't be able to control the format, so you need to cope with their specifics.
In order to understand the behavior of the system, logs need to include certain information:
Each log entry needs to be human-readable, but at the same time easily parseable by a machine. One of the most widely used formats is JSON.
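With the standard library alone, a JSON log format can be sketched as a custom `logging.Formatter`. The field names below are illustrative, not a standard; they stand in for whatever structure the team agrees on.

```python
import json
import logging
import time

# Sketch of a machine-parseable, human-readable JSON log line built on
# the standard logging module; field names are assumptions.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S",
                                       time.gmtime(record.created)),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("order created")
```

Each emitted line is a single JSON object, so a human can read it directly while a log pipeline can parse it field by field.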
Once the structure is in place you need to decide carefully which information will be included in the logs. Things like passwords, credit card numbers, medical IDs, and other potentially sensitive personal data need to be excluded in order to comply with the General Data Protection Regulation (GDPR). A good general rule is to log as little information as possible, avoid any personal data in logs, and take extra care when reviewing the data that will be sent out of the service.
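One way to enforce this rule is to redact sensitive fields before a log entry leaves the service. The blocklist below is an illustrative assumption; in practice the team agrees on which fields count as sensitive.

```python
# Illustrative blocklist of field names that must never reach the logs.
SENSITIVE_FIELDS = {"password", "credit_card", "medical_id", "email"}

def redact(entry):
    """Replace sensitive values so they never leave the service."""
    return {key: "[REDACTED]" if key in SENSITIVE_FIELDS else value
            for key, value in entry.items()}

safe = redact({"user_id": "42", "password": "hunter2", "action": "login"})
```

Redacting at the point of emission is safer than filtering later in the pipeline, because the sensitive values never leave the service at all.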
All services should propagate an ID field which allows you to follow the execution path of a request through the system. With this set up, logs can be grouped under the same context.
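Within a single Python service, such an ID can be carried through a request's execution with `contextvars`, so every log line emitted while handling that request is tagged with the same value. The function names and log shape below are hypothetical.

```python
import contextvars
import uuid

# Holds the current request's correlation ID for this execution context.
request_id = contextvars.ContextVar("request_id", default=None)

def handle_request():
    # In a real service you would reuse an incoming ID (e.g. from a
    # request header) when present, and mint a new one otherwise.
    request_id.set(str(uuid.uuid4()))
    return log("payment authorized")

def log(message):
    # Every entry emitted within this request carries the same ID,
    # so the logs can be grouped under one context.
    return {"request_id": request_id.get(), "message": message}

entry = handle_request()
```

When the ID is also forwarded on outgoing calls, downstream services can tag their own logs with it, which is what makes cross-service grouping possible.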
To better reconstruct the journey of a request through the system you can set up distributed tracing, which allows you to visualize the flow of execution between services. This provides insight into how long each operation takes and how the services relate to each other while serving the request.
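The core idea can be sketched with toy spans: each span records an operation name, its parent, and its duration, which is enough to reconstruct the tree and timing of a request. This is a minimal illustration, not a real tracing library; production systems would use an instrumentation framework.

```python
import time

# Toy span: records its operation name, parent span, and duration.
class Span:
    def __init__(self, name, trace, parent=None):
        self.name, self.trace, self.parent = name, trace, parent
    def __enter__(self):
        self.start = time.perf_counter()
        self.trace.append(self)
        return self
    def __exit__(self, *exc):
        self.duration = time.perf_counter() - self.start

trace = []
with Span("HTTP GET /orders", trace) as root:
    with Span("SELECT orders", trace, parent=root):
        time.sleep(0.01)  # stand-in for the database call
```

The parent links reconstruct the call tree, and comparing a parent's duration with its children's shows where the time went.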