Enterprise-level Redis monitoring, detailed to each project instance

This article also touches on several questions:

  • How to distinguish between businesses (projects)
  • How to detect anomalies
  • How to cut off anomalous traffic
  • How to scale gracefully

Today's article focuses on how to detect anomalies, and for good reason: detection is the foundation of everything else. Only after we have found a problem can we act on it.

Redis monitoring

In our company, Redis monitoring covers the entire Redis cluster: memory usage, client connections, number of keys, and so on. This monitors Redis usage from the overall cluster dimension, and of course this kind of monitoring is very important.

The conventional solution today is to collect the data with Prometheus and display it in Grafana, as shown in the figure below.

But this is monitoring only at the overall dimension. Many projects in our company share a public Redis cluster, so how do we monitor the requests that each project, and each project instance, makes against that shared cluster?

Public Redis cluster monitoring requirements

Let's look at the following picture first:

Many companies now run a microservice architecture involving many service instances. The figure above shows multiple projects, each with multiple service instances, all requesting a shared public Redis cluster.

In this situation we need a bottom line: the public Redis cluster must stay stable, because once it becomes unstable, every project service built on it becomes unstable too.

But this architecture carries R&D risk, because in real development we cannot guarantee that no developer ever makes a mistake. **Suppose a developer, writing some business logic against the Redis client, introduces an infinite loop that makes Redis requests spike and destabilizes Redis; or abuses Redis by putting all business data into it until memory is exhausted. In such cases we have no way of knowing which project or which instance caused the problem.** Our operations team would be thrown into chaotic troubleshooting. Can we defend ourselves in advance?

Let's sort out the requirements:

1) Can we allocate in advance the amount of Redis memory a project may occupy, the maximum number of requests per instance per unit time, and the maximum number of exceptions tolerated?

2) Monitor in real time the Redis memory each project occupies.

3) Monitor in real time the request status of each instance in a project.

4) Alarm once a project occupies more Redis memory than it was allocated; optionally blacklist the project automatically.

5) Alarm once a project's request or exception count exceeds its allocation; optionally blacklist the project automatically.


Now that the requirements are in, we as developers need to design a solution. Since there are quite a few requirements, let's first sort out their business processes:

The project allocation rules are fairly simple: design the MySQL tables, build a web service, and manage the rules through its interface. It is basic CRUD, so I won't cover it here.

Monitoring is the focus of this article; the post-detection strategies mentioned above will be covered in a follow-up article.

Monitor project memory usage

Monitoring the memory usage of Redis as a whole is relatively simple; the Prometheus + Grafana solution was introduced above, and you can also add alerting rules for follow-up actions.

The basic principle is to use Redis's own INFO command (and subcommands such as INFO stats) to export the server's performance indicators.

So how do we monitor memory usage at the finer granularity of each project?

Remember that Redis has persistence: it persists cached data to dump.rdb. This file stores every key and value, as well as slot occupancy. However, the file is binary, so you cannot open it directly.

Our solution is to parse the dump.rdb file and derive each project's memory usage. This works because we stipulate that every project has a unique project name and that every key follows a fixed format: [project name:key name].


That is, every key is prefixed with its project name. After parsing dump.rdb, this lets us total the memory used by each project by grouping on the prefix.
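As a minimal sketch of this convention (the class and method names here are illustrative, not from the article), composing and splitting the `[project name:key name]` format looks like:

```java
// Sketch of the assumed key naming convention: "<project>:<key>".
public class KeyNaming {
    // Compose a Redis key under the project namespace.
    static String buildKey(String project, String key) {
        return project + ":" + key;
    }

    // Recover the project name from a namespaced key:
    // everything before the first colon.
    static String projectOf(String redisKey) {
        int i = redisKey.indexOf(':');
        return i < 0 ? "" : redisKey.substring(0, i);
    }

    public static void main(String[] args) {
        String k = buildKey("order-service", "user:1001");
        System.out.println(k);            // order-service:user:1001
        System.out.println(projectOf(k)); // order-service
    }
}
```

Note that only the first colon delimits the project, so business keys may themselves contain colons.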

The principle should now be clear, but parsing dump.rdb by hand is hard. Fortunately, thanks to the open source community, we don't need to reinvent the wheel: there is an existing Go tool for parsing dump.rdb.

The git address is:

github.com/sripathikri... gitee.com/gujiachun/r...

How do we get the dump.rdb file? There is a corresponding setting in the Redis configuration file. Of course, the analysis of dump.rdb should be done on a Redis slave node, which is safer and more reliable, and can still cover all business Redis clusters.

The overall process is as follows:

This open source tool requires some Go knowledge, but a prebuilt binary is packaged with it and can be used directly.

Run the command:

./macos_start dump dump.rdb >> 111.txt

In the exported data, Key is the key name, Bytes is the number of bytes occupied by the value stored under that key, and Type is the type of the key-value pair.

We can schedule a timed task to perform the dump and analysis. If you know Go, you can also add your own business logic to the project. Either way, we obtain the memory usage of every key, then group by each key's project prefix to compute per-project memory usage.
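The grouping step can be sketched as follows. This assumes each parsed record is a line of the form `key,bytes,type` (an assumption for illustration; adapt the split to whatever the rdb-parsing tool actually emits):

```java
import java.util.*;

// Sketch: aggregate per-project memory from parsed dump.rdb records.
public class ProjectMemory {
    static Map<String, Long> aggregate(List<String> lines) {
        Map<String, Long> byProject = new HashMap<>();
        for (String line : lines) {
            String[] cols = line.split(",");
            if (cols.length < 2) continue;           // skip malformed lines
            String key = cols[0];
            long bytes = Long.parseLong(cols[1].trim());
            int i = key.indexOf(':');
            String project = i < 0 ? "(no-prefix)" : key.substring(0, i);
            byProject.merge(project, bytes, Long::sum); // sum bytes per project
        }
        return byProject;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "order-service:user:1,128,string",
                "order-service:cart:1,256,hash",
                "pay-service:txn:9,512,string");
        System.out.println(aggregate(lines));
    }
}
```

Keys with no project prefix land in a catch-all bucket, which itself is a useful signal: it surfaces keys that violate the naming convention.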

You might ask: the dump.rdb file is large, so how long does parsing take? I tried it with about 800,000 keys and a dump file of roughly 300 MB; the tool finished in about 5 seconds. Quite impressive for a Go implementation.

Instance request monitoring

We have solved per-project memory usage above; now let's solve how to monitor the request status of each service instance. We need the number of requests, the number of successes, the number of exceptions, and the maximum, minimum, and average time consumed.

For this part I referred to the source code of Sentinel, the flow-control and circuit-breaking framework, which collects request metrics in real time.

Sentinel records metrics in units of a Bucket: the total number of requests, total number of exceptions, and total time consumed over a period of time. A Bucket might record one second of data, or 10 milliseconds; this interval, the statistical unit of a Bucket, is configured by the user:

A Bucket stores the counts for its time window in a LongAdder array; LongAdder guarantees the atomicity of updates. Each element of the array represents one metric for the window: total requests, exceptions, or total time consumed.

The ordinal of the enum MetricEvent is used as the array index. (In my implementation I replaced LongAdder with the atomic classes from the java.util.concurrent package.)

When you need the total number of successful requests, exceptions, or total processing time recorded in a bucket, fetch the corresponding LongAdder from the array via MetricEvent and call its sum method.

When you need to record a request, an exception, or processing time into the bucket, fetch the corresponding LongAdder via MetricEvent and call its add method.
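The Bucket described above can be sketched like this (a simplified version of the Sentinel idea, not Sentinel's actual classes; the enum constants are illustrative):

```java
import java.util.concurrent.atomic.LongAdder;

// Sketch of a Sentinel-style Bucket: one LongAdder per metric,
// indexed by the ordinal of the MetricEvent enum.
public class Bucket {
    enum MetricEvent { PASS, SUCCESS, EXCEPTION, RT }

    private final LongAdder[] counters;

    Bucket() {
        counters = new LongAdder[MetricEvent.values().length];
        for (int i = 0; i < counters.length; i++) counters[i] = new LongAdder();
    }

    // Record an event: one request/success/exception, or N ms of response time.
    void add(MetricEvent event, long n) {
        counters[event.ordinal()].add(n);
    }

    // Read the accumulated total for an event in this window.
    long get(MetricEvent event) {
        return counters[event.ordinal()].sum();
    }

    public static void main(String[] args) {
        Bucket b = new Bucket();
        b.add(MetricEvent.SUCCESS, 1);
        b.add(MetricEvent.SUCCESS, 1);
        b.add(MetricEvent.RT, 35);
        System.out.println(b.get(MetricEvent.SUCCESS)); // 2
        System.out.println(b.get(MetricEvent.RT));      // 35
    }
}
```

With per-window totals like these, avg RT is simply `get(RT) / get(SUCCESS)` for a window.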

With the bucket in place, suppose each bucket stores one second of data. Then we can compute the successful requests per second (success QPS), failed requests per second (failure QPS), and the average time per successful request (avg RT). But how do we guarantee a bucket holds exactly one second of data? The crudest way is a scheduled task that creates a new bucket every second, but its statistical error would be large.

Here is how Sentinel does it. It defines a Bucket array and locates the array index from the timestamp. Suppose we want per-second statistics and only need to keep the most recent minute of data; then the bucket array size can be set to 60, with each bucket's window length (windowLengthInMs) set to 1000 milliseconds (1 second).

Since each bucket stores one second of data, dropping the millisecond part of the current timestamp yields the current second. If the bucket array were infinitely large, that second would itself be the array index of the current bucket.

**But we cannot store buckets indefinitely; think how much memory one bucket per second would need to hold a whole day of data.** So when we only need to keep one minute of data, the bucket array size is 60, and taking the current second modulo the array length gives the index of the current bucket. The array is reused cyclically and always holds only the most recent minute of data.

Taking the remainder is what makes the array circular. To read a continuous minute of bucket data you cannot simply traverse the array from the start; instead, specify a start time and an end time, compute the array index from the start timestamp, and then loop, adding one second to the timestamp each iteration, until it reaches the end time.
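The index arithmetic above can be sketched in a few lines (a simplified illustration of the circular-array idea, not Sentinel's actual LeapArray code):

```java
// Sketch of the circular array indexing described above:
// windowLengthInMs = 1000 and 60 buckets keep the last minute of data.
public class LeapArrayIndex {
    static final int WINDOW_MS = 1000;
    static final int ARRAY_SIZE = 60;

    // Which slot of the circular array a timestamp falls into.
    static int indexFor(long timeMillis) {
        long windowId = timeMillis / WINDOW_MS; // drop the millisecond part
        return (int) (windowId % ARRAY_SIZE);   // wrap around the fixed array
    }

    // Start timestamp of the window containing timeMillis. Comparing this
    // with the start stored in the slot's bucket reveals whether the slot
    // holds stale data from a previous lap and must be reset before reuse.
    static long windowStart(long timeMillis) {
        return timeMillis - timeMillis % WINDOW_MS;
    }

    public static void main(String[] args) {
        long t = 1_700_000_123_456L;
        System.out.println(indexFor(t));
        System.out.println(windowStart(t));
        // Two timestamps exactly one minute apart map to the same slot:
        System.out.println(indexFor(t) == indexFor(t + 60_000)); // true
    }
}
```

The `windowStart` check is the crucial detail: the modulo alone cannot tell this minute's second 5 from last minute's second 5, so each bucket must remember its own window start.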

That is the overall principle, though it is rather abstract; let's go straight to example code to see how it works.

Request example

The code above buries the monitoring points in the business logic; below we print the collected metrics directly.

You will have noticed the FlowHelper class, which comes from the open source project;

Git address: gitee.com/gujiachun/r...

In this way, we can transform our Redis client methods simply by burying these monitoring points in them.

One question remains: how do we deliver the collected data to the monitoring platform?

How to report monitoring data

At first I considered using a scheduled task to push the data to the monitoring platform, but then: what is the monitoring platform's address? Too much coupling. Then I remembered that Spring Boot already exposes monitoring metrics.

Visiting http://localhost:1100/actuator/metrics lists the available metrics, and http://localhost:1100/actuator/prometheus shows the metric data directly.

We can follow the same approach: each instance simply exposes its metrics and lets Prometheus scrape them itself, which is the standard process. Implementation:

Introduce the dependency

Implement the MeterBinder interface

Use the Gauge meter type to expose the values
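The steps above can be sketched as follows. This is a minimal illustration using Micrometer (the library behind Spring Boot actuator metrics); the metric name and the counter field are assumptions for the example, not names from the article's project:

```java
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.MeterBinder;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: expose a collected request counter through Micrometer so that
// Spring Boot's /actuator/prometheus endpoint picks it up automatically.
public class RedisMetricsBinder implements MeterBinder {

    // Stand-in for the real per-instance statistics fed by the bucket counters.
    private final AtomicLong successCount = new AtomicLong();

    public void recordSuccess() {
        successCount.incrementAndGet();
    }

    @Override
    public void bindTo(MeterRegistry registry) {
        // A Gauge re-reads the current value on every Prometheus scrape.
        Gauge.builder("redis.client.success.count", successCount, AtomicLong::get)
             .description("Successful Redis requests from this instance")
             .register(registry);
    }
}
```

In a Spring Boot application, registering such a MeterBinder as a bean is enough; the actuator wires it into the shared registry.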

In this way, our Redis instance's request metrics appear at http://localhost:1100/actuator/prometheus.

Isn't it cool?


This article covers many knowledge points. To fulfil the requirements stated at the beginning, we need to combine them all to achieve the monitoring goal; of course, we also need to build a dashboard interface to view the monitoring data.

On that dashboard we can see each project's memory usage and requests, whether thresholds are reached, and which keys occupy the most memory, ranked; we can also monitor each instance's request status, with rankings by request count, exception count, and so on.

This greatly helps our operations staff and developers understand our Redis usage. The article only introduces the core ideas; everything has been implemented, and friends who need the source code can contact me.
