
Go Log Collection System-1

Author: Xiao Yu
  • Every system produces logs, and those logs are what you turn to when something goes wrong
  • When there are only a few machines, you can simply log in to each server and read the logs
  • When the fleet of machines grows large, logging in to each one to read logs becomes practically impossible

Even when the machine count is modest, a system usually involves several languages. Take our company: the bottom layer is developed in C++ and the business application layer in Python, and each side is further split into multiple applications. So the question arises: every time the system has a problem, how do we troubleshoot it quickly? In a good case, the Python application layer checks its logs, finds that the error comes from the underlying layer, and asks the C++ colleagues to investigate; if the C++ side can quickly confirm whether the error is theirs, the Python side knows where to look next. In a harder case, developers from every layer end up digging together to find where the problem originated. This is a rough description, but it raises several questions:

  • When the system has a problem, how do we quickly locate, from the logs, which application layer it is in?
  • In day-to-day work, how do we use logs to analyze a request and find out which application layer it spends the most time in?
  • In day-to-day work, how do we collect and summarize, across all layers, the logs produced by a single request?

The solution we want for these problems:

  • Collect the logs on each machine in real time and store them centrally in one system
  • Index those logs so that the relevant entries can be found by searching
  • Provide a friendly web interface so that log searches can be done from the browser

Problems we may face when implementing this system:

  • The volume of real-time logs is very large, on the order of billions of entries per day (although our company's system has not reached that level yet)
  • Logs must be collected in near real time, with latency kept to minutes
  • The system must be able to scale horizontally

For log collection, the industry's standard solution is ELK (Elasticsearch, Logstash, Kibana).

[Figure: ELK architecture]

ELK is a general-purpose solution, so the following problems inevitably arise:

  • O&M costs are high: every additional log source requires manually modifying the configuration
  • Monitoring is missing: there is no accurate way to obtain the status of Logstash
  • Custom development and maintenance are not possible

In response, what we actually want is a system in which the agent can dynamically find out which logs it should collect on a given server, and in which collection can be stopped dynamically when a server that is being collected from goes offline. All of this is ultimately managed and presented through the web interface.


Log collection system design

The main architecture diagram:

[Architecture diagram: Log Agent, Kafka, ES, Hadoop]

Notes on the individual components:

  • Log Agent: a log collection client that collects logs on each server
  • Kafka: a high-throughput distributed queue, developed by LinkedIn, now a top-level Apache open source project
  • ES: Elasticsearch, an open source search engine that provides a RESTful HTTP web interface
  • Hadoop: a distributed computing framework, a platform capable of distributed processing of large amounts of data


An introduction to Kafka

Apache Kafka is a distributed publish-subscribe messaging system and a powerful queue that can handle large volumes of data and lets you pass messages from one endpoint to another. Kafka is suitable for both offline and online message consumption. Kafka messages are persisted on disk and replicated within the cluster to prevent data loss. Kafka is built on top of the ZooKeeper synchronization service. It integrates very well with Apache Storm and Spark for real-time streaming data analysis.

Note: Kafka will not be covered in depth here, only the basic concepts and application scenarios; it is a deep topic in its own right.

There are a few basic messaging terms to understand in Kafka (a Go producer sketch follows this list):

  • Kafka organizes messages into categories called topics.
  • Programs that publish messages to a Kafka topic are called producers.
  • Programs that subscribe to topics and consume the messages are called consumers.
  • Kafka runs as a cluster consisting of one or more servers, each of which is called a broker.
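To make these terms concrete, here is a minimal sketch of a producer in Go. It assumes the community sarama client (github.com/Shopify/sarama); the broker address and topic name are placeholders, not values taken from this article.

```go
package main

import (
	"fmt"

	"github.com/Shopify/sarama"
)

func main() {
	// Producer configuration: wait for all in-sync replicas to ack,
	// and report successes back (required by the sync producer).
	config := sarama.NewConfig()
	config.Producer.RequiredAcks = sarama.WaitForAll
	config.Producer.Partitioner = sarama.NewRandomPartitioner
	config.Producer.Return.Successes = true

	// Connect to the Kafka cluster (broker address is an assumption).
	producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, config)
	if err != nil {
		fmt.Println("failed to create producer:", err)
		return
	}
	defer producer.Close()

	// Publish one message to a topic (topic name is an assumption).
	msg := &sarama.ProducerMessage{
		Topic: "nginx_log",
		Value: sarama.StringEncoder("a test log line"),
	}
	partition, offset, err := producer.SendMessage(msg)
	if err != nil {
		fmt.Println("failed to send message:", err)
		return
	}
	fmt.Printf("sent to partition %d at offset %d\n", partition, offset)
}
```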

Advantages of Kafka:

  • Reliability: Kafka is distributed, partitioned, replicated, and fault-tolerant.
  • Scalability: the Kafka messaging system scales easily with no downtime.
  • Durability: Kafka uses a distributed commit log, which means messages are persisted to disk as fast as possible, so they are durable.
  • Performance: Kafka has high throughput for both publishing and subscribing, and it maintains stable performance even when storing many terabytes of messages.

Kafka is very fast and guarantees zero downtime and zero data loss.

Kafka application scenarios:

  • Asynchronous processing: non-critical steps are made asynchronous, improving the system's response time and robustness
  • Application decoupling: applications communicate through the message queue instead of calling each other directly
  • Traffic peak shaving: the queue absorbs bursts of traffic so that consumers can process messages at their own pace (see the consumer sketch after this list)
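To show the consuming side, which is what makes peak shaving work, here is a minimal consumer sketch using the same sarama client; as above, the broker address and topic name are assumptions.

```go
package main

import (
	"fmt"
	"sync"

	"github.com/Shopify/sarama"
)

func main() {
	// Connect to the cluster (broker address is an assumption).
	consumer, err := sarama.NewConsumer([]string{"localhost:9092"}, nil)
	if err != nil {
		fmt.Println("failed to create consumer:", err)
		return
	}
	defer consumer.Close()

	// A topic may have several partitions; consume each of them.
	partitions, err := consumer.Partitions("nginx_log")
	if err != nil {
		fmt.Println("failed to list partitions:", err)
		return
	}

	var wg sync.WaitGroup
	for _, p := range partitions {
		pc, err := consumer.ConsumePartition("nginx_log", p, sarama.OffsetNewest)
		if err != nil {
			fmt.Println("failed to consume partition:", err)
			continue
		}
		wg.Add(1)
		go func(pc sarama.PartitionConsumer) {
			defer wg.Done()
			defer pc.Close()
			// Messages are pulled as fast as this loop can read them,
			// regardless of how fast producers push into the queue.
			for msg := range pc.Messages() {
				fmt.Printf("partition:%d offset:%d value:%s\n",
					msg.Partition, msg.Offset, string(msg.Value))
			}
		}(pc)
	}
	wg.Wait()
}
```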


About ZooKeeper

ZooKeeper is a distributed coordination service for managing large sets of hosts. Coordinating and managing services in a distributed environment is a complicated process; ZooKeeper solves this with its simple architecture and API, letting developers focus on core application logic without worrying about the distributed nature of the application.

Apache ZooKeeper is a service used by clusters (groups of nodes) to coordinate between themselves and maintain shared data through robust synchronization techniques. ZooKeeper itself is a distributed application that provides services for writing distributed applications.

ZooKeeper consists of several main components (a connection sketch follows this list):

  • Client: a node in our distributed application cluster that accesses information from a server. At regular intervals, each client sends a heartbeat to its server to let the server know it is alive; likewise, the server sends an acknowledgment when a client connects. If the connected server does not respond, the client automatically redirects its requests to another server.
  • Server: a node in our ZooKeeper ensemble that provides all services to clients. It sends an acknowledgment to the client to indicate that the server is alive.
  • Ensemble: a group of ZooKeeper servers. The minimum number of nodes required to form an ensemble is 3.
  • Leader: the server node that performs automatic recovery if any connected node fails. The leader is elected when the service starts.
  • Follower: a server node that follows the leader's instructions.
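As a sketch of the client/server session just described, the snippet below connects to an ensemble and watches session events; heartbeats and failover to another server are handled by the client library. It assumes the community go-zookeeper package (github.com/samuel/go-zookeeper/zk), and the server address is a placeholder.

```go
package main

import (
	"fmt"
	"time"

	"github.com/samuel/go-zookeeper/zk"
)

func main() {
	// Connect to the ensemble; the library keeps the session alive with
	// heartbeats and reconnects to another server if this one stops responding.
	conn, events, err := zk.Connect([]string{"127.0.0.1:2181"}, 5*time.Second)
	if err != nil {
		fmt.Println("failed to connect:", err)
		return
	}
	defer conn.Close()

	// Session state changes (connecting, connected, expired...) arrive here.
	go func() {
		for ev := range events {
			fmt.Printf("zk event: state=%v type=%v\n", ev.State, ev.Type)
		}
	}()

	// Read a znode to confirm the session works ("/" always exists).
	children, _, err := conn.Children("/")
	if err != nil {
		fmt.Println("failed to list children:", err)
		return
	}
	fmt.Println("znodes under /:", children)
}
```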

ZooKeeper application scenarios:

  • Service Registration & Service Discovery
  • Configuration center
  • Distributed locks

ZooKeeper is strongly consistent: if multiple clients create the same znode on ZooKeeper at the same time, only one of the creations succeeds, which is exactly the property a distributed lock needs (a minimal sketch follows).
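Here is a minimal sketch of how that strong consistency yields a lock: every client tries to create the same ephemeral znode, and only the one whose create succeeds holds the lock; the znode disappears automatically when its owner's session ends. This again assumes go-zookeeper, and the server address and lock path are placeholders.

```go
package main

import (
	"fmt"
	"time"

	"github.com/samuel/go-zookeeper/zk"
)

// tryLock attempts to take a lock by creating an ephemeral znode.
// Create is atomic, so only one client can succeed for the same path;
// the znode vanishes automatically when the owner's session ends.
func tryLock(conn *zk.Conn, path string) (bool, error) {
	_, err := conn.Create(path, []byte{}, zk.FlagEphemeral, zk.WorldACL(zk.PermAll))
	if err == zk.ErrNodeExists {
		return false, nil // another client already holds the lock
	}
	if err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	conn, _, err := zk.Connect([]string{"127.0.0.1:2181"}, 5*time.Second)
	if err != nil {
		fmt.Println("failed to connect:", err)
		return
	}
	defer conn.Close()

	got, err := tryLock(conn, "/demo-lock") // lock path is a placeholder
	if err != nil {
		fmt.Println("lock error:", err)
		return
	}
	fmt.Println("acquired lock:", got)
}
```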


About Log Agent

This is what we will implement step by step in code later. Its main functions are:

Much like reading a log file with tail under Linux, the agent reads new log content as it is appended and sends it to Kafka.

What matters here is that the set of tailed files can change dynamically: when the configuration changes, the agent is notified and automatically starts the additional tail tasks it needs, collecting the corresponding logs and sending them to the Kafka producer (a tail sketch follows below).
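As a sketch of the tail part, assuming the hpcloud/tail package (github.com/hpcloud/tail) and a placeholder log path; in the real agent, each new line would be handed to the Kafka producer shown earlier instead of printed.

```go
package main

import (
	"fmt"

	"github.com/hpcloud/tail"
)

func main() {
	// Follow the file like `tail -f`, starting at the end, and reopen it
	// if it is rotated away (log path is a placeholder).
	t, err := tail.TailFile("/var/log/app.log", tail.Config{
		ReOpen:    true,
		Follow:    true,
		Location:  &tail.SeekInfo{Offset: 0, Whence: 2}, // seek to end of file
		MustExist: false,
		Poll:      true,
	})
	if err != nil {
		fmt.Println("failed to tail file:", err)
		return
	}

	// Each appended line arrives on t.Lines; this is where the agent
	// would send the line to the Kafka producer.
	for line := range t.Lines {
		fmt.Println("line:", line.Text)
	}
}
```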

It is mainly composed of the following directories:

  • Kafka
  • tailf
  • configlog