More Amazon CloudWatch Monitoring

In this lab we will look at the more numerous metrics available directly via the CloudWatch console.

  1. Log in to the Console of the account where your cluster is running and go to the CloudWatch console

  2. Select Metrics in the left-hand navigation pane

This screen shows you all the different services emitting metrics - we want AWS/Kafka for the Amazon MSK cluster.

  1. Select AWS/Kafka

Notice that there are 3 categories of metrics: **Broker ID, Cluster Name, Topic**; **Broker ID, Cluster Name**; and **Cluster Name**. These are all different metrics, at different levels of granularity or scope.

Broker ID, Cluster Name, Topic - These are at the per topic level, on each broker, for each cluster. They will include many individual metrics per topic and are useful if you need to investigate performance or balance issues for particular topics.

An example of this would be looking at the MessagesInPerSec for a topic (ExampleTopic) in a particular cluster, broken down by broker:

Note that you can use the search bar to filter - you can type directly into the bar, or you can select the down arrow beside any field and filter from that menu.
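The same per-topic, per-broker view can also be requested from the CloudWatch API. The following is a minimal sketch, assuming boto3 is installed and credentials are configured; the cluster name (MSKCluster), topic (ExampleTopic), and broker IDs are placeholders, and `build_topic_queries` and `fetch_last_hour` are hypothetical helper names, not part of any AWS library.

```python
from datetime import datetime, timedelta, timezone


def build_topic_queries(cluster, topic, broker_ids, stat="Maximum", period=60):
    """Build one GetMetricData query per broker for a topic's MessagesInPerSec."""
    queries = []
    for broker_id in broker_ids:
        queries.append({
            "Id": f"broker{broker_id}",  # query Ids must start with a lowercase letter
            "Label": f"Broker {broker_id}",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Kafka",
                    "MetricName": "MessagesInPerSec",
                    # Dimension names match the category shown in the console
                    "Dimensions": [
                        {"Name": "Cluster Name", "Value": cluster},
                        {"Name": "Broker ID", "Value": str(broker_id)},
                        {"Name": "Topic", "Value": topic},
                    ],
                },
                "Period": period,  # seconds per data point
                "Stat": stat,
            },
        })
    return queries


def fetch_last_hour(queries):
    """Run the queries against CloudWatch for the most recent hour."""
    import boto3  # imported here so the query builder works without boto3

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_data(
        MetricDataQueries=queries,
        StartTime=end - timedelta(hours=1),
        EndTime=end,
    )
    return response["MetricDataResults"]
```

Calling `fetch_last_hour(build_topic_queries("MSKCluster", "ExampleTopic", [1, 2, 3]))` should return one result per broker, each with its own list of timestamps and values, roughly what the console graphs as one line per broker.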

Broker ID, Cluster Name - These metrics are more focused on the broker and topics, looking at things like memory, CPU, networking, and higher level monitoring such as counts of UnderReplicatedPartitions.

Note here that the Statistic field is Maximum - I’ve switched to the Graphed Metrics tab, and set this to show the maximum value for each poll. It is generally a bad idea to use ‘average’ as it can give skewed results, hide spikes/dips, and make identifying problems harder. Evaluate the metric and your goal, then select the most suitable of the many statistics available.
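To see why the choice of statistic matters, here is a small worked example. The readings are made up for illustration (they are not from a real cluster): five one-minute CPU samples that fall into a single five-minute period, with one brief spike.

```python
from statistics import mean

# Hypothetical CPU readings (percent), one per minute, with a short spike
samples = [12.0, 11.0, 95.0, 13.0, 12.0]

average = mean(samples)  # the spike is diluted by the quiet minutes
maximum = max(samples)   # the spike is preserved

print(f"Average: {average:.1f}")  # Average: 28.6
print(f"Maximum: {maximum:.1f}")  # Maximum: 95.0
```

An average of 28.6% looks healthy, while the maximum shows that the broker briefly hit 95%. Which one you want depends on the question you are asking of the metric.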

Cluster Name - the highest level metrics, looking at the overall state and health of a cluster.

In this graph we can see 2 important metrics, but the scale of their values is vastly different - one is large and growing integers (GlobalTopicCount), the other is measuring ms of latency (ZooKeeperRequestLatencyMsMean). To accommodate these differing metrics and values, I’ve moved the ZooKeeperRequestLatencyMsMean to the 2nd Y Axis, allowing us to have the different scales graphed on the same graph easily.
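If you want to keep a graph like this, the same two-axis layout can be expressed as a CloudWatch dashboard widget. This is a sketch under assumptions: the cluster name and region are placeholders, `build_cluster_widget` is a hypothetical helper, and the dimensions follow the Cluster Name grouping described above, so confirm them against what your cluster actually lists in the console.

```python
import json


def build_cluster_widget(cluster, region="us-east-1"):
    """Describe a graph widget with the latency metric on the right Y axis."""
    return {
        "type": "metric",
        "width": 12,
        "height": 6,
        "properties": {
            "title": "Cluster overview",
            "region": region,
            "view": "timeSeries",
            "stat": "Maximum",
            "period": 60,
            "metrics": [
                # Left axis (default): a count
                ["AWS/Kafka", "GlobalTopicCount", "Cluster Name", cluster],
                # Right axis: latency in ms, on its own scale
                ["AWS/Kafka", "ZooKeeperRequestLatencyMsMean",
                 "Cluster Name", cluster, {"yAxis": "right"}],
            ],
        },
    }


dashboard_body = json.dumps(
    {"widgets": [build_cluster_widget("MSKCluster")]}, indent=2
)
print(dashboard_body)
```

The resulting JSON string is the kind of document that boto3's `put_dashboard(DashboardName=..., DashboardBody=...)` accepts; the dashboard name is left for you to choose.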

For this exercise, let's select an important metric to watch - MessagesInPerSec in Broker ID, Cluster Name. This will be the total messages into Amazon MSK per broker.

Note - If you want MessagesInPerSec at a topic level, then you need to use the **Broker ID, Cluster Name, Topic** metric source.

  1. In the search bar, enter your Amazon MSK Cluster name (e.g. MSKCluster), then enter the metric you want to graph - MessagesInPerSec

You will see the messages coming in to each broker.
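The distinction between the broker-level and topic-level sources also shows up if you list metrics through the API. Filtering on the cluster name returns every MessagesInPerSec metric that carries that dimension, including the per-topic ones, so a second filter on the exact set of dimension names is needed. A minimal sketch, assuming boto3 and a placeholder cluster name; both function names are hypothetical:

```python
def broker_level_only(metrics):
    """Keep metrics whose dimension names are exactly Broker ID and Cluster Name."""
    wanted = {"Broker ID", "Cluster Name"}
    return [
        m for m in metrics
        if {d["Name"] for d in m["Dimensions"]} == wanted
    ]


def list_messages_in_metrics(cluster):
    """List every MessagesInPerSec metric that has the given cluster name."""
    import boto3  # imported here so the filter above works without boto3

    cloudwatch = boto3.client("cloudwatch")
    paginator = cloudwatch.get_paginator("list_metrics")
    metrics = []
    for page in paginator.paginate(
        Namespace="AWS/Kafka",
        MetricName="MessagesInPerSec",
        Dimensions=[{"Name": "Cluster Name", "Value": cluster}],
    ):
        metrics.extend(page["Metrics"])
    return metrics
```

`broker_level_only(list_messages_in_metrics("MSKCluster"))` should leave one entry per broker, matching the lines in the graph above.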

If you were investigating a problem you might start here - look at the messages coming in to the brokers and compare them to normal values. To do this, you would use the time controls in the top right corner.

Other options would be:

  • Set your auto-refresh - handy for keeping the graph current while you’re investigating
  • Share your graph - click on Actions to get a link to share this graph with someone else
  • Change your graph type - switch to Stacked, Line, or Number (show only the latest value)

We can zoom out to 1w to look at historical trends, or zoom in to look at current data with greater granularity.
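When you pull the same data through the API, zooming in or out means choosing a time window and a period (seconds per data point). The sketch below is one possible approach, not how the console itself picks a period: the `max_points` target of 1440 is an arbitrary assumption, both helpers are hypothetical, and it is intended for windows within the last 15 days, where periods in multiples of 60 seconds are available.

```python
import math
from datetime import datetime, timedelta, timezone


def choose_period(window, max_points=1440):
    """Return a period in seconds (a multiple of 60) giving at most max_points."""
    raw = window.total_seconds() / max_points
    return max(60, math.ceil(raw / 60) * 60)


def time_range(window, now=None):
    """Return (start, end) for a window ending now, in UTC."""
    end = now or datetime.now(timezone.utc)
    return end - window, end


print(choose_period(timedelta(hours=1)))  # 60: full one-minute granularity
print(choose_period(timedelta(weeks=1)))  # 420: coarser points for a 1w view
```

The start and end values from `time_range` and the period from `choose_period` map onto the StartTime, EndTime, and Period parameters of a GetMetricData request.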