1、前面文章中我們主要示範了eureka的使用,今天我們來學習eureka的自我保護機制。
2、什麼是eureka的自我保護機制?
預設情況下,如果Eureka Server在一定時間内(預設60秒)剔除90秒内沒有續約的節點。但是當網絡分區故障發生時,微服務與Eureka Server之間無法正常通信,而微服 務本身是正常運作的,此時不應該移除這個微服務,是以引入了自我保護機制。
spring官方的定義是:自我保護模式正是一種針對網絡異常波動的安全保護措施,使用自我保護模式能使Eureka叢集更加的健壯、穩定的運作。
自我機制的表現形式是啥?
如果在15分鐘内超過85%的用戶端節點都沒有正常的心跳,那麼Eureka就認為用戶端與注冊中心出現了網絡故障,Eureka Server自動進入自我保護機制,此時會出現以下幾種情況:
1、Eureka Server不再從注冊清單中移除因為長時間沒收到心跳而應該過期的服務。
2、Eureka Server仍然能夠接受新服務的注冊和查詢請求,但是不會被同步到其它節點上,保證目前節點依然可用。
3、當網絡穩定時,目前Eureka Server新的注冊資訊會被同步到其它節點中。是以Eureka Server可以很好的應對因網絡故障導緻部分節點失聯的情況,而不會像ZK那樣如果有一半不可用的情況會導緻整個叢集不可用而變成癱瘓。
3、自我保護機制是針對eurekaServer端的,自我保護機制預設是開啟的,我們可以進行配置,使用配置eureka.server.enable-self-preservation=false進行關閉。
4、自我保護機制的實作原理:
表現形式如上圖描述,那麼這個自我保護機制的實作原理是啥呢?
這個自我保護機制的實作原理在于,如何統計服務續約的門檻值threshold、最後一分鐘的實際續約數量來完成判斷。
我們在源碼中找到了這兩個值的定義所在:在類 AbstractInstanceRegistry 中:
#最後一分鐘續約數量處理器
private final MeasuredRate renewsLastMin;
#服務續約門檻值
protected volatile int numberOfRenewsPerMinThreshold;
#注冊的服務數量,由于注冊的服務數量是會不停變化的美是以這個值會在服務注冊registry、cancel的時候進行改動。
protected volatile int expectedNumberOfClientsSendingRenews;
我們先來看看服務續約的門檻值是如何計算的:
protected void updateRenewsPerMinThreshold() {
先使用服務數量 expectedNumberOfClientsSendingRenews
* 續約的頻率(預設1min 2次)
* 失敗率(預設0.85)
例如3個服務那麼門檻值就等于 3 * 2 * 0.85 = 5
this.numberOfRenewsPerMinThreshold = (int) (this.expectedNumberOfClientsSendingRenews
* (60.0 / serverConfig.getExpectedClientRenewalIntervalSeconds())
* serverConfig.getRenewalPercentThreshold());
}
我們來看看什麼時候服務數量 expectedNumberOfClientsSendingRenews 會發生改變呢?
1、服務上線的時候肯定會。
2、服務主動下線的時候肯定也會。
3、剔除不良的節點的時候也會。
我們依次來看看源碼中的實作:
1、服務上線的時候源碼:
public void register(InstanceInfo registrant, int leaseDuration, boolean isReplication) {
try {
read.lock();
Map<String, Lease<InstanceInfo>> gMap = registry.get(registrant.getAppName());
REGISTER.increment(isReplication);
if (gMap == null) {
final ConcurrentHashMap<String, Lease<InstanceInfo>> gNewMap = new ConcurrentHashMap<String, Lease<InstanceInfo>>();
gMap = registry.putIfAbsent(registrant.getAppName(), gNewMap);
if (gMap == null) {
gMap = gNewMap;
}
}
Lease<InstanceInfo> existingLease = gMap.get(registrant.getId());
// Retain the last dirty timestamp without overwriting it, if there is already a lease
if (existingLease != null && (existingLease.getHolder() != null)) {
Long existingLastDirtyTimestamp = existingLease.getHolder().getLastDirtyTimestamp();
Long registrationLastDirtyTimestamp = registrant.getLastDirtyTimestamp();
logger.debug("Existing lease found (existing={}, provided={}", existingLastDirtyTimestamp, registrationLastDirtyTimestamp);
// this is a > instead of a >= because if the timestamps are equal, we still take the remote transmitted
// InstanceInfo instead of the server local copy.
if (existingLastDirtyTimestamp > registrationLastDirtyTimestamp) {
logger.warn("There is an existing lease and the existing lease's dirty timestamp {} is greater" +
" than the one that is being registered {}", existingLastDirtyTimestamp, registrationLastDirtyTimestamp);
logger.warn("Using the existing instanceInfo instead of the new instanceInfo as the registrant");
registrant = existingLease.getHolder();
}
} else {
// The lease does not exist and hence it is a new registration
synchronized (lock) {
if (this.expectedNumberOfClientsSendingRenews > 0) {
// Since the client wants to register it, increase the number of clients sending renews
将服務數量進行 + 1 然後更新續約門檻值。
this.expectedNumberOfClientsSendingRenews = this.expectedNumberOfClientsSendingRenews + 1;
updateRenewsPerMinThreshold();
}
}
logger.debug("No previous lease information found; it is new registration");
}
Lease<InstanceInfo> lease = new Lease<InstanceInfo>(registrant, leaseDuration);
if (existingLease != null) {
lease.setServiceUpTimestamp(existingLease.getServiceUpTimestamp());
}
gMap.put(registrant.getId(), lease);
recentRegisteredQueue.add(new Pair<Long, String>(
System.currentTimeMillis(),
registrant.getAppName() + "(" + registrant.getId() + ")"));
// This is where the initial state transfer of overridden status happens
if (!InstanceStatus.UNKNOWN.equals(registrant.getOverriddenStatus())) {
logger.debug("Found overridden status {} for instance {}. Checking to see if needs to be add to the "
+ "overrides", registrant.getOverriddenStatus(), registrant.getId());
if (!overriddenInstanceStatusMap.containsKey(registrant.getId())) {
logger.info("Not found overridden id {} and hence adding it", registrant.getId());
overriddenInstanceStatusMap.put(registrant.getId(), registrant.getOverriddenStatus());
}
}
InstanceStatus overriddenStatusFromMap = overriddenInstanceStatusMap.get(registrant.getId());
if (overriddenStatusFromMap != null) {
logger.info("Storing overridden status {} from map", overriddenStatusFromMap);
registrant.setOverriddenStatus(overriddenStatusFromMap);
}
// Set the status based on the overridden status rules
InstanceStatus overriddenInstanceStatus = getOverriddenInstanceStatus(registrant, existingLease, isReplication);
registrant.setStatusWithoutDirty(overriddenInstanceStatus);
// If the lease is registered with UP status, set lease service up timestamp
if (InstanceStatus.UP.equals(registrant.getStatus())) {
lease.serviceUp();
}
registrant.setActionType(ActionType.ADDED);
recentlyChangedQueue.add(new RecentlyChangedItem(lease));
registrant.setLastUpdatedTimestamp();
invalidateCache(registrant.getAppName(), registrant.getVIPAddress(), registrant.getSecureVipAddress());
logger.info("Registered instance {}/{} with status {} (replication={})",
registrant.getAppName(), registrant.getId(), registrant.getStatus(), isReplication);
} finally {
read.unlock();
}
}
2、服務主動下線的時候源碼分析:
protected boolean internalCancel(String appName, String id, boolean isReplication) {
try {
read.lock();
CANCEL.increment(isReplication);
Map<String, Lease<InstanceInfo>> gMap = registry.get(appName);
Lease<InstanceInfo> leaseToCancel = null;
if (gMap != null) {
leaseToCancel = gMap.remove(id);
}
recentCanceledQueue.add(new Pair<Long, String>(System.currentTimeMillis(), appName + "(" + id + ")"));
InstanceStatus instanceStatus = overriddenInstanceStatusMap.remove(id);
if (instanceStatus != null) {
logger.debug("Removed instance id {} from the overridden map which has value {}", id, instanceStatus.name());
}
if (leaseToCancel == null) {
CANCEL_NOT_FOUND.increment(isReplication);
logger.warn("DS: Registry: cancel failed because Lease is not registered for: {}/{}", appName, id);
return false;
} else {
leaseToCancel.cancel();
InstanceInfo instanceInfo = leaseToCancel.getHolder();
String vip = null;
String svip = null;
if (instanceInfo != null) {
instanceInfo.setActionType(ActionType.DELETED);
recentlyChangedQueue.add(new RecentlyChangedItem(leaseToCancel));
instanceInfo.setLastUpdatedTimestamp();
vip = instanceInfo.getVIPAddress();
svip = instanceInfo.getSecureVipAddress();
}
invalidateCache(appName, vip, svip);
logger.info("Cancelled instance {}/{} (replication={})", appName, id, isReplication);
}
} finally {
read.unlock();
}
synchronized (lock) {
if (this.expectedNumberOfClientsSendingRenews > 0) {
// Since the client wants to cancel it, reduce the number of clients to send renews.
将服務數量進行 -1 , 然後跟行續約數量
this.expectedNumberOfClientsSendingRenews = this.expectedNumberOfClientsSendingRenews - 1;
updateRenewsPerMinThreshold();
}
}
return true;
}
3、剔除不良節點的時候,在eureka服務端提出不良節點的時候是采用定時任務來實作的,源碼如下:
protected void postInit() {
renewsLastMin.start();
if (evictionTaskRef.get() != null) {
evictionTaskRef.get().cancel();
}
evictionTaskRef.set(new EvictionTask());
剔除的定時任務,預設時間間隔是1分鐘
evictionTimer.schedule(evictionTaskRef.get(),
serverConfig.getEvictionIntervalTimerInMs(),
serverConfig.getEvictionIntervalTimerInMs());
}
EvictionTask源碼如下:
class EvictionTask extends TimerTask {
private final AtomicLong lastExecutionNanosRef = new AtomicLong(0l);
@Override
public void run() {
try {
long compensationTimeMs = getCompensationTimeMs();
logger.info("Running the evict task with compensationTime {}ms", compensationTimeMs);
剔除不良節點
evict(compensationTimeMs);
} catch (Throwable e) {
logger.error("Could not run the evict task", e);
}
}
/**
* compute a compensation time defined as the actual time this task was executed since the prev iteration,
* vs the configured amount of time for execution. This is useful for cases where changes in time (due to
* clock skew or gc for example) causes the actual eviction task to execute later than the desired time
* according to the configured cycle.
*/
long getCompensationTimeMs() {
long currNanos = getCurrentTimeNano();
long lastNanos = lastExecutionNanosRef.getAndSet(currNanos);
if (lastNanos == 0l) {
return 0l;
}
long elapsedMs = TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos);
long compensationTime = elapsedMs - serverConfig.getEvictionIntervalTimerInMs();
return compensationTime <= 0l ? 0l : compensationTime;
}
long getCurrentTimeNano() { // for testing
return System.nanoTime();
}
}
剔除不良服務源碼如下:
public void evict(long additionalLeaseMs) {
logger.debug("Running the evict task");
如果目前已經處于開啟了自我保護狀态,那就不剔除節點。
if (!isLeaseExpirationEnabled()) {
logger.debug("DS: lease expiration is currently disabled.");
return;
}
// We collect first all expired items, to evict them in random order. For large eviction sets,
// if we do not that, we might wipe out whole apps before self preservation kicks in. By randomizing it,
// the impact should be evenly distributed across all applications.
List<Lease<InstanceInfo>> expiredLeases = new ArrayList<>();
for (Entry<String, Map<String, Lease<InstanceInfo>>> groupEntry : registry.entrySet()) {
Map<String, Lease<InstanceInfo>> leaseMap = groupEntry.getValue();
if (leaseMap != null) {
for (Entry<String, Lease<InstanceInfo>> leaseEntry : leaseMap.entrySet()) {
Lease<InstanceInfo> lease = leaseEntry.getValue();
if (lease.isExpired(additionalLeaseMs) && lease.getHolder() != null) {
expiredLeases.add(lease);
}
}
}
}
// To compensate for GC pauses or drifting local time, we need to use current registry size as a base for
// triggering self-preservation. Without that we would wipe out full registry.
int registrySize = (int) getLocalRegistrySize();
int registrySizeThreshold = (int) (registrySize * serverConfig.getRenewalPercentThreshold());
int evictionLimit = registrySize - registrySizeThreshold;
int toEvict = Math.min(expiredLeases.size(), evictionLimit);
if (toEvict > 0) {
logger.info("Evicting {} items (expired={}, evictionLimit={})", toEvict, expiredLeases.size(), evictionLimit);
Random random = new Random(System.currentTimeMillis());
for (int i = 0; i < toEvict; i++) {
// Pick a random item (Knuth shuffle algorithm)
int next = i + random.nextInt(expiredLeases.size() - i);
Collections.swap(expiredLeases, i, next);
Lease<InstanceInfo> lease = expiredLeases.get(i);
String appName = lease.getHolder().getAppName();
String id = lease.getHolder().getId();
EXPIRED.increment();
logger.warn("DS: Registry: expired lease for {}/{}", appName, id);
最終還是調用取消服務的方法。
internalCancel(appName, id, false);
}
}
}
服務數量的動态值就是通過上面三個地方進行動态更新的。
接下來我們來看看eureka是如何計算最後一分鐘的服務實際續約數量的?????????
還是我們在前面講的三個核心屬性中的:這個叫做續約最新一分鐘任務。
private final MeasuredRate renewsLastMin;
protected AbstractInstanceRegistry(EurekaServerConfig serverConfig, EurekaClientConfig clientConfig, ServerCodecs serverCodecs) {
this.serverConfig = serverConfig;
this.clientConfig = clientConfig;
this.serverCodecs = serverCodecs;
this.recentCanceledQueue = new CircularQueue<Pair<Long, String>>(1000);
this.recentRegisteredQueue = new CircularQueue<Pair<Long, String>>(1000);
指派這個任務,設定間隔時間是1分鐘,注意這裡是寫死的。
this.renewsLastMin = new MeasuredRate(1000 * 60 * 1);
this.deltaRetentionTimer.schedule(getDeltaRetentionTask(),
serverConfig.getDeltaRetentionTimerIntervalInMs(),
serverConfig.getDeltaRetentionTimerIntervalInMs());
}
然後在初始化方法中進行啟動:
protected void postInit() {
啟動最新一分鐘續約數量的計算任務
renewsLastMin.start();
if (evictionTaskRef.get() != null) {
evictionTaskRef.get().cancel();
}
evictionTaskRef.set(new EvictionTask());
evictionTimer.schedule(evictionTaskRef.get(),
serverConfig.getEvictionIntervalTimerInMs(),
serverConfig.getEvictionIntervalTimerInMs());
}
啟動任務方法源碼:
public synchronized void start() {
if (!isActive) {
timer.schedule(new TimerTask() {
@Override
public void run() {
try {
// Zero out the current bucket.
lastBucket.set(currentBucket.getAndSet(0));
} catch (Throwable e) {
logger.error("Cannot reset the Measured Rate", e);
}
}
此處的sampleInterval 就是構造函數中定義的1000 * 60 * 1
}, sampleInterval, sampleInterval);
isActive = true;
}
}
MeasuredRate 源碼:
public class MeasuredRate {
private static final Logger logger = LoggerFactory.getLogger(MeasuredRate.class);
所有節點的最後續約數
private final AtomicLong lastBucket = new AtomicLong(0);
目前節點的目前本次需約數,例如續約一次+1 , 也可以續約一次+2,就看increment怎麼實作。
private final AtomicLong currentBucket = new AtomicLong(0);
private final long sampleInterval;
private final Timer timer;
private volatile boolean isActive;
/**
* @param sampleInterval in milliseconds
*/
public MeasuredRate(long sampleInterval) {
this.sampleInterval = sampleInterval;
this.timer = new Timer("Eureka-MeasureRateTimer", true);
this.isActive = false;
}
public synchronized void start() {
if (!isActive) {
timer.schedule(new TimerTask() {
@Override
public void run() {
try {
// Zero out the current bucket.
把目前續約的數量放到最後續約數量中
lastBucket.set(currentBucket.getAndSet(0));
} catch (Throwable e) {
logger.error("Cannot reset the Measured Rate", e);
}
}
}, sampleInterval, sampleInterval);
isActive = true;
}
}
public synchronized void stop() {
if (isActive) {
timer.cancel();
isActive = false;
}
}
/**
* Returns the count in the last sample interval.
*/
擷取續約數量。
public long getCount() {
return lastBucket.get();
}
/**
* Increments the count in the current sample interval.
*/
在調用續約的方法的時候會調用次方法,進行需約數 +1
public void increment() {
currentBucket.incrementAndGet();
}
}
4、上面我們講解了續約門檻值以及服務最後一分鐘續約數量,那麼如果處于自我保護狀态了,有是如何恢複正常的呢?????
經過翻閱源碼我們發現了,其實還是使用定時任務,我們來看一看源碼:來到 PeerAwareInstanceRegistryImpl 類中的init(...)方法:
@Override
public void init(PeerEurekaNodes peerEurekaNodes) throws Exception {
this.numberOfReplicationsLastMin.start();
this.peerEurekaNodes = peerEurekaNodes;
initializedResponseCache();
啟動續約門檻值更新任務
scheduleRenewalThresholdUpdateTask();
initRemoteRegionRegistry();
try {
Monitors.registerObject(this);
} catch (Throwable e) {
logger.warn("Cannot register the JMX monitor for the InstanceRegistry :", e);
}
}
啟動的方法:
private void scheduleRenewalThresholdUpdateTask() {
timer.schedule(new TimerTask() {
@Override
public void run() {
更新續約門檻值
updateRenewalThreshold();
}
此處的serverConfig.getRenewalThresholdUpdateIntervalMs()就是預設的15分鐘。
}, serverConfig.getRenewalThresholdUpdateIntervalMs(),
serverConfig.getRenewalThresholdUpdateIntervalMs());
}
@Override
public int getRenewalThresholdUpdateIntervalMs() {
return configInstance.getIntProperty(
namespace + "renewalThresholdUpdateIntervalMs",
(15 * 60 * 1000)).get();
}
更新續約門檻值的實作:
private void updateRenewalThreshold() {
try {
擷取所有的應用
Applications apps = eurekaClient.getApplications();
int count = 0;
for (Application app : apps.getRegisteredApplications()) {
擷取應用的節點,然後計算節點數量。
for (InstanceInfo instance : app.getInstances()) {
if (this.isRegisterable(instance)) {
++count;
}
}
}
synchronized (lock) {
// Update threshold only if the threshold is greater than the
// current expected threshold or if self preservation is disabled.
如果節點數量>續約門檻值 或者 配置了沒有啟用自我保護機制,那就更新續約門檻值。15分鐘做一次,也就是說沒15分鐘做一次矯正,雙重保障。
if ((count) > (serverConfig.getRenewalPercentThreshold() * expectedNumberOfClientsSendingRenews)
|| (!this.isSelfPreservationModeEnabled())) {
this.expectedNumberOfClientsSendingRenews = count;
updateRenewsPerMinThreshold();
}
}
logger.info("Current renewal threshold is : {}", numberOfRenewsPerMinThreshold);
} catch (Throwable e) {
logger.error("Cannot update renewal threshold", e);
}
}
5 、eureka的監控頁面上的資料來源
我們來到spring-cloud-netflix-eureka-server 包下面,打開navbar.ftlh:
<h1>System Status</h1>
<div class="row">
<div class="col-md-6">
<table id='instances' class="table table-condensed table-striped table-hover">
<#if amazonInfo??>
<tr>
<td>EUREKA SERVER</td>
<td>AMI: ${amiId!}</td>
</tr>
<tr>
<td>Zone</td>
<td>${availabilityZone!}</td>
</tr>
<tr>
<td>instance-id</td>
<td>${instanceId!}</td>
</tr>
</#if>
<tr>
<td>Environment</td>
<td>${environment!}</td>
</tr>
<tr>
<td>Data center</td>
<td>${datacenter!}</td>
</tr>
</table>
</div>
<div class="col-md-6">
<table id='instances' class="table table-condensed table-striped table-hover">
<tr>
<td>Current time</td>
<td>${currentTime}</td>
</tr>
<tr>
<td>Uptime</td>
<td>${upTime}</td>
</tr>
<tr>
<td>Lease expiration enabled</td>
<td>${registry.leaseExpirationEnabled?c}</td>
</tr>
<tr>
這裡是續約的門檻值。
<td>Renews threshold</td>
<td>${registry.numOfRenewsPerMinThreshold}</td>
</tr>
<tr>
這裡就是擷取到的所有節點最後一分鐘續約的數量
<td>Renews (last min)</td>
<td>${registry.numOfRenewsInLastMin}</td>
</tr>
</table>
</div>
</div>
<#if isBelowRenewThresold>
<#if !registry.selfPreservationModeEnabled>
<h4 id="uptime"><font size="+1" color="red"><b>RENEWALS ARE LESSER THAN THE THRESHOLD. THE SELF PRESERVATION MODE IS TURNED OFF. THIS MAY NOT PROTECT INSTANCE EXPIRY IN CASE OF NETWORK/OTHER PROBLEMS.</b></font></h4>
<#else>
<h4 id="uptime"><font size="+1" color="red"><b>EMERGENCY! EUREKA MAY BE INCORRECTLY CLAIMING INSTANCES ARE UP WHEN THEY'RE NOT. RENEWALS ARE LESSER THAN THRESHOLD AND HENCE THE INSTANCES ARE NOT BEING EXPIRED JUST TO BE SAFE.</b></font></h4>
</#if>
<#elseif !registry.selfPreservationModeEnabled>
<h4 id="uptime"><font size="+1" color="red"><b>THE SELF PRESERVATION MODE IS TURNED OFF. THIS MAY NOT PROTECT INSTANCE EXPIRY IN CASE OF NETWORK/OTHER PROBLEMS.</b></font></h4>
</#if>
<h1>DS Replicas</h1>
<ul class="list-group">
<#list replicas as replica>
<li class="list-group-item"><a href="${replica.value}" target="_blank" rel="external nofollow" >${replica.key}</a></li>
</#list>
</ul>
這個頁面中的資料是通過接口 EurekaController 中的status 接口擷取的:
@RequestMapping(method = RequestMethod.GET)
public String status(HttpServletRequest request, Map<String, Object> model) {
populateBase(request, model);
populateApps(model);
StatusInfo statusInfo;
try {
statusInfo = new StatusResource().getStatusInfo();
}
catch (Exception e) {
statusInfo = StatusInfo.Builder.newBuilder().isHealthy(false).build();
}
model.put("statusInfo", statusInfo);
populateInstanceInfo(model, statusInfo);
filterReplicas(model, statusInfo);
return "eureka/status";
}
在model裡面存在registry資料模型:
這就是eureka服務端自我保護機制的原理。