Cassandra LCS Compaction Explained in Detail

Cassandra's background compaction is driven by a task that the daemon class CassandraDaemon schedules at a fixed interval in its setup() method.

The scheduled task in CassandraDaemon.setup():

ScheduledExecutors.optionalTasks.scheduleWithFixedDelay(ColumnFamilyStore.getBackgroundCompactionTaskSubmitter(), 5, 1, TimeUnit.MINUTES);

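As the code shows, the first round of background compaction is submitted 5 minutes after startup, and each later round starts 1 minute after the previous one finishes.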
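To make the two numbers in that call concrete: the first is the initial delay, the second is the pause measured from the end of one run to the start of the next. Here is a minimal sketch that uses the JDK's ScheduledExecutorService directly. The class name FixedDelayDemo, the method awaitRuns and the millisecond delays are made up for illustration; this is not Cassandra code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class FixedDelayDemo {
    /** Returns true if the task ran `runs` times before a 5 second timeout. */
    static boolean awaitRuns(int runs, long initialDelayMs, long delayMs) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch latch = new CountDownLatch(runs);
        try {
            // First run after initialDelayMs; every later run starts delayMs
            // after the previous run has finished.
            scheduler.scheduleWithFixedDelay(latch::countDown, initialDelayMs, delayMs, TimeUnit.MILLISECONDS);
            return latch.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(awaitRuns(3, 50, 20));
    }
}
```

Because the delay is counted from the end of a run, a slow round of submissions pushes the next round back instead of piling up.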

public static Runnable getBackgroundCompactionTaskSubmitter()

{

return new Runnable()
{
    public void run()
    {
        for (Keyspace keyspace : Keyspace.all())
            for (ColumnFamilyStore cfs : keyspace.getColumnFamilyStores())
                CompactionManager.instance.submitBackground(cfs);
    }
};

}

As the code above shows, the task walks every keyspace and submits a background compaction check for each table (ColumnFamilyStore) in it.

/**

Call this whenever a compaction might be needed on the given columnfamily.
It's okay to over-call (within reason) if a call is unnecessary, it will
turn into a no-op in the bucketing/candidate-scan phase.

*/

public List<Future<?>> submitBackground(final ColumnFamilyStore cfs)

if (cfs.isAutoCompactionDisabled()) // skip tables whose automatic compaction is disabled
{
    logger.trace("Autocompaction is disabled");
    return Collections.emptyList();
}

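/**
 * If this CF is already being compacted and the executor has no idle threads, do nothing here;
 * the CF is resubmitted later, once threads are available again.
 * Otherwise submit at least one task, so that busier CFs cannot starve this one.
 */
int count = compactingCF.count(cfs);
if (count > 0 && executor.getActiveCount() >= executor.getMaximumPoolSize())
{ // this CF is already compacting and the pool has no idle thread, so return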
    logger.trace("Background compaction is still running for {}.{} ({} remaining). Skipping",
                 cfs.keyspace.getName(), cfs.name, count);
    return Collections.emptyList();
}

logger.trace("Scheduling a background task check for {}.{} with {}",
             cfs.keyspace.getName(),
             cfs.name,
             cfs.getCompactionStrategyManager().getName());

List<Future<?>> futures = new ArrayList<>(1);
Future<?> fut = executor.submitIfRunning(new BackgroundCompactionCandidate(cfs), "background task");
// the CF was not skipped above, so submit one task to keep it from starving
if (!fut.isCancelled())
    futures.add(fut);
else
    compactingCF.remove(cfs);
return futures;
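The skip condition in submitBackground comes down to two facts taken together. A minimal sketch, with the class and parameter names invented for illustration:

```java
final class BackgroundSubmitSketch {
    /**
     * Skip only when this table already has a compaction in flight AND the
     * executor has no idle thread left. In every other case a task is submitted.
     */
    static boolean shouldSkip(int inFlightForTable, int activeThreads, int maxPoolSize) {
        return inFlightForTable > 0 && activeThreads >= maxPoolSize;
    }

    public static void main(String[] args) {
        System.out.println(shouldSkip(1, 4, 4)); // busy table, saturated pool
        System.out.println(shouldSkip(0, 4, 4)); // idle table still gets a task queued
    }
}
```

A table with nothing in flight is never skipped, even when the pool is saturated. That is the guard against starvation by busier tables.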

public void run()

try
{
    logger.trace("Checking {}.{}", cfs.keyspace.getName(), cfs.name);
    if (!cfs.isValid()) // a table that has been dropped must not be compacted any more
    {
        logger.trace("Aborting compaction for dropped CF");
        return;
    }
    
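    // first get the compaction strategy manager configured for this table
    CompactionStrategyManager strategy = cfs.getCompactionStrategyManager();
    // ask the strategy for a task; this needs the GC cutoff time, where "GC" means tombstone removal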
    AbstractCompactionTask task = strategy.getNextBackgroundTask(getDefaultGcBefore(cfs, FBUtilities.nowInSeconds()));
    if (task == null)
    {
        logger.trace("No tasks available");
        return;
    }
    task.execute(metrics);
}
finally
{
    compactingCF.remove(cfs);
}
submitBackground(cfs);
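The gcBefore value passed to getNextBackgroundTask is a point in time, in seconds. The sketch below assumes the cutoff is the current time minus the table's gc_grace_seconds, which is how I understand getDefaultGcBefore for an ordinary (non-index) table. The class and method names are hypothetical. The real purge check also considers other sstables that may still hold the deleted data, which is left out here.

```java
final class GcBeforeSketch {
    /** Cutoff in seconds since the epoch (assumed: now minus gc_grace_seconds). */
    static int gcBefore(int nowInSeconds, int gcGraceSeconds) {
        return nowInSeconds - gcGraceSeconds;
    }

    /** A tombstone is old enough to purge when it was written before the cutoff. */
    static boolean isOldEnough(int localDeletionTime, int gcBefore) {
        return localDeletionTime < gcBefore;
    }

    public static void main(String[] args) {
        int cutoff = gcBefore(1_000_000, 864_000); // 864000 s = 10 days
        System.out.println(cutoff);
        System.out.println(isOldEnough(100_000, cutoff));
    }
}
```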

/**
 * Return the next background task
 *
 * Returns a task for the compaction strategy that needs it the most (most estimated remaining tasks)
 */

public synchronized AbstractCompactionTask getNextBackgroundTask(int gcBefore)

if (!isEnabled())
    return null;

maybeReload(cfs.metadata);

// The sstables are split into a repaired set and an unrepaired set.
// Whichever set has more estimated remaining tasks gets to go first.
if (repaired.getEstimatedRemainingTasks() > unrepaired.getEstimatedRemainingTasks())
{
    AbstractCompactionTask repairedTask = repaired.getNextBackgroundTask(gcBefore);
    if (repairedTask != null)
        return repairedTask;
    return unrepaired.getNextBackgroundTask(gcBefore);
}
else
{
    AbstractCompactionTask unrepairedTask = unrepaired.getNextBackgroundTask(gcBefore);
    if (unrepairedTask != null)
        return unrepairedTask;
    return repaired.getNextBackgroundTask(gcBefore);
}
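The choice between the repaired and unrepaired sets can be reduced to "ask the busier one first, fall back to the other". A minimal sketch, where the Strategy interface and the fixed() helper are hypothetical stand-ins and a task is represented by a plain string:

```java
import java.util.Optional;

final class RepairedFirstSketch {
    /** Hypothetical stand-in for one strategy instance (repaired or unrepaired). */
    interface Strategy {
        int estimatedRemainingTasks();
        Optional<String> nextTask();
    }

    /** Builds a Strategy with a fixed backlog; a null task name means "no task". */
    static Strategy fixed(final int remaining, final String taskOrNull) {
        return new Strategy() {
            public int estimatedRemainingTasks() { return remaining; }
            public Optional<String> nextTask() { return Optional.ofNullable(taskOrNull); }
        };
    }

    static Optional<String> nextBackgroundTask(Strategy repaired, Strategy unrepaired) {
        // Strictly greater, as in the original: on a tie the unrepaired set goes first.
        boolean repairedFirst = repaired.estimatedRemainingTasks() > unrepaired.estimatedRemainingTasks();
        Strategy first = repairedFirst ? repaired : unrepaired;
        Strategy second = repairedFirst ? unrepaired : repaired;
        Optional<String> task = first.nextTask();
        return task.isPresent() ? task : second.nextTask();
    }

    public static void main(String[] args) {
        System.out.println(nextBackgroundTask(fixed(5, "repaired"), fixed(2, "unrepaired")));
    }
}
```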

the only difference between background and maximal in LCS is that maximal is still allowed
(by explicit user request) even when compaction is disabled.

@SuppressWarnings("resource")

while (true)
{
    OperationType op;
    // get the compaction candidates
    LeveledManifest.CompactionCandidate candidate = manifest.getCompactionCandidates();
    if (candidate == null)
    {  // there is no regular compaction candidate
        // in that case, look for an sstable whose deleted data (tombstones) is worth cleaning up
        SSTableReader sstable = findDroppableSSTable(gcBefore);
        if (sstable == null)
        {
            logger.trace("No compaction necessary for {}", this);
            return null;
        }
        candidate = new LeveledManifest.CompactionCandidate(Collections.singleton(sstable),
                                                            sstable.getSSTableLevel(),
                                                            getMaxSSTableBytes());
        op = OperationType.TOMBSTONE_COMPACTION;
    }
    else
    {
        op = OperationType.COMPACTION;
    }

    LifecycleTransaction txn = cfs.getTracker().tryModify(candidate.sstables, OperationType.COMPACTION);
    if (txn != null)
    {
        // return a leveled compaction task
        LeveledCompactionTask newTask = new LeveledCompactionTask(cfs, txn, candidate.level, gcBefore, candidate.maxSSTableBytes, false);
        newTask.setCompactionType(op);
        return newTask;
    }
}

@return highest-priority sstables to compact, and level to compact them to
If no compactions are necessary, will return null

public synchronized CompactionCandidate getCompactionCandidates()

// during bootstrap we only do size tiering in L0 to make sure
// the streamed files can be placed in their original levels
if (StorageService.instance.isBootstrapMode())
{
    List<SSTableReader> mostInteresting = getSSTablesForSTCS(getLevel(0));
    if (!mostInteresting.isEmpty())
    {
        logger.info("Bootstrapping - doing STCS in L0");
        return new CompactionCandidate(mostInteresting, 0, Long.MAX_VALUE);
    }
    return null;
}
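// LevelDB gives each level a score (how much data it holds relative to its ideal amount) and
// compacts the level with the highest score. That falls apart badly once compaction gets behind.
// For example, suppose the levels look like this:
// L0: 988 sstables (ideal: 4)
// L1: 117 sstables (ideal: 10)
// L2: 12 sstables (ideal: 100)
// L0's score (988 / 4 = 247) is far higher than L1's (117 / 10 = 11.7), so we would compact a batch of
// MAX_COMPACTING_L0 L0 sstables with all 117 L1 sstables and put the result (say, 120 sstables) in L1.
// The next L0 batch then has to be compacted with those 120 L1 sstables again, and so on.
// L1 is rewritten over and over, so most of the I/O goes into L1 alone.
// LevelDB handles this by blocking writes once L0 compaction falls behind, which is not an option here.
// So we use a different strategy:
// 1. compact the higher levels first, which minimizes the I/O needed
// 2. once L0 falls far enough behind, compact it with size-tiering to reduce read overhead
//    until the higher levels have caught up
// This is no cure-all: under sustained heavy writes LCS still cannot keep up, but for occasional
// write bursts it works well.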
for (int i = generations.length - 1; i > 0; i--)
{
    List<SSTableReader> sstables = getLevel(i);
    if (sstables.isEmpty())
        continue; // mostly this just avoids polluting the debug log with zero scores
    // we want to calculate score excluding compacting ones
    Set<SSTableReader> sstablesInLevel = Sets.newHashSet(sstables);
    Set<SSTableReader> remaining = Sets.difference(sstablesInLevel, cfs.getTracker().getCompacting());
    // score = total bytes of the sstables not currently compacting / maximum bytes allowed in this level
    double score = (double) SSTableReader.getTotalBytes(remaining) / (double)maxBytesForLevel(i, maxSSTableSizeInBytes);
    logger.trace("Compaction score for level {} is {}", i, score);

    if (score > 1.001) // a score above 1 means the level holds more data than it is allowed to
    {
        // before proceeding with a higher level, let's see if L0 is far enough behind to warrant STCS
        CompactionCandidate l0Compaction = getSTCSInL0CompactionCandidate();
        if (l0Compaction != null) // L0 has fallen too far behind, so run STCS there first
            return l0Compaction;

        // L0 is fine, proceed with this level
        Collection<SSTableReader> candidates = getCandidatesFor(i);
        if (!candidates.isEmpty())
        {
            int nextLevel = getNextLevel(candidates);
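            // Reset the compaction counter of the target level (nextLevel) and check for starved levels. If a level
            // is starved, its sstables that overlap the candidates and are not being compacted are added to the
            // candidates as well, because a level holding very little data might otherwise never get compacted.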
            candidates = getOverlappingStarvedSSTables(nextLevel, candidates);
            if (logger.isTraceEnabled())
                logger.trace("Compaction candidates for L{} are {}", i, toString(candidates));
            return new CompactionCandidate(candidates, nextLevel, cfs.getCompactionStrategyManager().getMaxSSTableBytes());
        }
        else
        {
            logger.trace("No compaction candidates for L{}", i);
        }
    }
}

// Higher levels are happy, time for a standard, non-STCS L0 compaction
if (getLevel(0).isEmpty())
    return null;
Collection<SSTableReader> candidates = getCandidatesFor(0);
if (candidates.isEmpty())  // no L0 candidates were found, so go straight to an STCS compaction in L0
{
    // Since we don't have any other compactions to do, see if there is a STCS compaction to perform in L0; if
    // there is a long running compaction, we want to make sure that we continue to keep the number of SSTables
    // small in L0.
    return getSTCSInL0CompactionCandidate();
}
return new CompactionCandidate(candidates, getNextLevel(candidates), cfs.getCompactionStrategyManager().getMaxSSTableBytes());
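The example numbers in the comment above can be worked through with a small sketch. It scores levels by sstable count rather than by bytes, which matches the real byte-based score only if all sstables are the same size. The ideal counts (4 for L0, 10 to the power of the level above that) come from the comment's example, the value 32 for MAX_COMPACTING_L0 is an assumption, and the class and method names are invented.

```java
final class LevelScoreSketch {
    static final int MAX_COMPACTING_L0 = 32; // assumed value of the constant

    /** Ideal sstable count per level, following the example: 4 for L0, 10^level above it. */
    static long idealSSTables(int level) {
        if (level == 0)
            return 4;
        long ideal = 1;
        for (int i = 0; i < level; i++)
            ideal *= 10;
        return ideal;
    }

    static double score(int level, long sstableCount) {
        return (double) sstableCount / idealSSTables(level);
    }

    /** Scans from the highest level down to L1; returns the first level over its limit, or 0 if none. */
    static int levelToCompact(long[] countsPerLevel) {
        for (int i = countsPerLevel.length - 1; i > 0; i--) {
            if (countsPerLevel[i] > 0 && score(i, countsPerLevel[i]) > 1.001)
                return i;
        }
        return 0;
    }

    /** L0 is considered too far behind once it holds more than MAX_COMPACTING_L0 sstables. */
    static boolean l0NeedsStcs(long[] countsPerLevel) {
        return countsPerLevel[0] > MAX_COMPACTING_L0;
    }

    public static void main(String[] args) {
        long[] counts = {988, 117, 12};
        System.out.println(score(0, counts[0]) + " " + score(1, counts[1]) + " " + score(2, counts[2]));
        System.out.println(levelToCompact(counts) + " " + l0NeedsStcs(counts));
    }
}
```

With 988, 117 and 12 sstables the scan stops at L1, because L2 is well under its limit. L0 is far past the threshold, so the STCS candidate in L0 is returned before the L1 compaction gets its turn.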

@return highest-priority sstables to compact for the given level.
If no compactions are possible (because of concurrent compactions or because some sstables are blacklisted
for prior failure), will return an empty list. Never returns null.

private Collection<SSTableReader> getCandidatesFor(int level)

assert !getLevel(level).isEmpty();
logger.trace("Choosing candidates for L{}", level);

final Set<SSTableReader> compacting = cfs.getTracker().getCompacting();

if (level == 0) // level 0 has its own selection logic
{
    // first collect the L0 sstables that are currently being compacted
    Set<SSTableReader> compactingL0 = getCompacting(0);

    // then find the smallest first key and the largest last key
    // among those compacting L0 sstables
    PartitionPosition lastCompactingKey = null;
    PartitionPosition firstCompactingKey = null;
    for (SSTableReader candidate : compactingL0)
    {
        if (firstCompactingKey == null || candidate.first.compareTo(firstCompactingKey) < 0)
            firstCompactingKey = candidate.first;
        if (lastCompactingKey == null || candidate.last.compareTo(lastCompactingKey) > 0)
            lastCompactingKey = candidate.last;
    }

    // L0 is the dumping ground for new sstables which thus may overlap each other.
    //
    // We treat L0 compactions specially:
    // 1a. add sstables to the candidate set until we have at least maxSSTableSizeInMB
    // 1b. prefer choosing older sstables as candidates, to newer ones
    // 1c. any L0 sstables that overlap a candidate, will also become candidates
    // 2. At most MAX_COMPACTING_L0 sstables from L0 will be compacted at once
    // 3. If total candidate size is less than maxSSTableSizeInMB, we won't bother compacting with L1,
    //    and the result of the compaction will stay in L0 instead of being promoted (see promote())
    //
    // Note that we ignore suspect-ness of L1 sstables here, since if an L1 sstable is suspect we're
    // basically screwed, since we expect all or most L0 sstables to overlap with each L1 sstable.
    // So if an L1 sstable is suspect we can't do much besides try anyway and hope for the best.
    Set<SSTableReader> candidates = new HashSet<>();
    Set<SSTableReader> remaining = new HashSet<>();
    // start with every L0 sstable that is not marked suspect
    Iterables.addAll(remaining, Iterables.filter(getLevel(0), Predicates.not(suspectP)));
    // walk the remaining (non-suspect) sstables in order of age, oldest first
    for (SSTableReader sstable : ageSortedSSTables(remaining))
    {
        // skip sstables that are already candidates
        if (candidates.contains(sstable))
            continue;

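        // Remaining sstables that overlap the current sstable become candidates too.
        // "Overlap" refers to the sstables' smallest and largest tokens: any sstable whose
        // token range intersects the current sstable's range counts as overlapping.
        // Presumably overlapping token ranges are treated as possibly overlapping content.
        Sets.SetView<SSTableReader> overlappedL0 = Sets.union(Collections.singleton(sstable), overlapping(sstable, remaining));
        if (!Sets.intersection(overlappedL0, compactingL0).isEmpty())
            continue;  // the current sstable or one of its overlapping sstables is already compacting, so skip it
        // None of overlappedL0 is compacting, so next check whether each of its sstables
        // falls inside the token range spanned by the compacting L0 sstables.
        // Only those outside that range are added to the candidates.
        //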
        for (SSTableReader newCandidate : overlappedL0)
        {
            if (firstCompactingKey == null || lastCompactingKey == null || overlapping(firstCompactingKey.getToken(), lastCompactingKey.getToken(), Arrays.asList(newCandidate)).size() == 0)
                candidates.add(newCandidate);
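            remaining.remove(newCandidate); // an sstable that has been examined is not considered again
            // it either overlaps the compacting range or is now a candidate, so it can be dropped from remaining
        }

        // once there are more than MAX_COMPACTING_L0 candidates, keep only the MAX_COMPACTING_L0 oldest ones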
        if (candidates.size() > MAX_COMPACTING_L0)
        {
            // limit to only the MAX_COMPACTING_L0 oldest candidates
            candidates = new HashSet<>(ageSortedSSTables(candidates).subList(0, MAX_COMPACTING_L0));
            break;
        }
    }
    
    // if the candidates together are larger than the maximum sstable size, the overlapping L1 sstables must join the compaction
    // leave everything in L0 if we didn't end up with a full sstable's worth of data
    if (SSTableReader.getTotalBytes(candidates) > maxSSTableSizeInBytes)
    {
        // add sstables from L1 that overlap candidates
        // if the overlapping ones are already busy in a compaction, leave it out.
        // TODO try to find a set of L0 sstables that only overlaps with non-busy L1 sstables
        // the L1 sstables that overlap the candidates' token range
        Set<SSTableReader> l1overlapping = overlapping(candidates, getLevel(1));
        // if any overlapping L1 sstable is already being compacted, give up on this L0 compaction
        if (Sets.intersection(l1overlapping, compacting).size() > 0)
            return Collections.emptyList();
        // if the candidates overlap the compacting L0 sstables by token range, give up as well
        if (!overlapping(candidates, compactingL0).isEmpty())
            return Collections.emptyList();
        candidates = Sets.union(candidates, l1overlapping);
    }
    if (candidates.size() < 2)
        return Collections.emptyList();
    else
        return candidates;
}

// for non-L0 compactions, pick up where we left off last time
Collections.sort(getLevel(level), SSTableReader.sstableComparator);
int start = 0; // handles case where the prior compaction touched the very last range
for (int i = 0; i < getLevel(level).size(); i++)
{
    SSTableReader sstable = getLevel(level).get(i);
    if (sstable.first.compareTo(lastCompactedKeys[level]) > 0)
    {
        start = i;
        break;
    }
}

// look for a non-suspect keyspace to compact with, starting with where we left off last time,
// and wrapping back to the beginning of the generation if necessary
for (int i = 0; i < getLevel(level).size(); i++)
{
    SSTableReader sstable = getLevel(level).get((start + i) % getLevel(level).size());
    Set<SSTableReader> candidates = Sets.union(Collections.singleton(sstable), overlapping(sstable, getLevel(level + 1)));
    if (Iterables.any(candidates, suspectP))
        continue;
    if (Sets.intersection(candidates, compacting).isEmpty())
        return candidates;
}

// all the sstables were suspect or overlapped with something suspect
return Collections.emptyList();
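Two ideas carry most of getCandidatesFor: the range-overlap test, and "pick up where we left off" for the levels above L0. The sketch below models an sstable as a name plus a first and last token held as plain long values. The Range class and all method names are hypothetical, and closed (inclusive) ranges are an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class OverlapSketch {
    /** Hypothetical stand-in for an sstable: a name plus its first and last token. */
    static final class Range {
        final String name;
        final long first;
        final long last;

        Range(String name, long first, long last) {
            this.name = name;
            this.first = first;
            this.last = last;
        }
    }

    /** Two closed ranges overlap when neither one ends before the other begins. */
    static boolean overlaps(Range a, Range b) {
        return a.first <= b.last && b.first <= a.last;
    }

    /** Names of the ranges in `others` that overlap `target`. */
    static List<String> overlapping(Range target, List<Range> others) {
        List<String> names = new ArrayList<>();
        for (Range other : others)
            if (overlaps(target, other))
                names.add(other.name);
        return names;
    }

    /** Index of the first sstable that starts after lastCompactedKey, or 0 to wrap around. */
    static int startIndex(List<Range> sortedLevel, long lastCompactedKey) {
        for (int i = 0; i < sortedLevel.size(); i++)
            if (sortedLevel.get(i).first > lastCompactedKey)
                return i;
        return 0;
    }
}
```

Starting each round after the last compacted key spreads compactions across the whole token range of a level, instead of hitting the same sstables every time.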

private CompactionCandidate getSTCSInL0CompactionCandidate()

// If STCS in L0 has not been disabled and L0 holds more than MAX_COMPACTING_L0 sstables, run STCS in L0
if (!DatabaseDescriptor.getDisableSTCSInL0() && getLevel(0).size() > MAX_COMPACTING_L0)
{
    List<SSTableReader> mostInteresting = getSSTablesForSTCS(getLevel(0));
    if (!mostInteresting.isEmpty())
    {
        logger.debug("L0 is too far behind, performing size-tiering there first");
        return new CompactionCandidate(mostInteresting, 0, Long.MAX_VALUE);
    }
}

return null;

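STCS groups sstables by size. The average size of the sstables already in a group serves as the reference value, and an sstable whose size lies within about 0.5 times that average on either side joins the group. The average is then recomputed, the lower and upper bounds are recalculated from it, and the next sstable is tested against the new bounds. The result is a set of groups (buckets), each holding sstables of similar size.

The buckets are then compared by the hotness of their sstables, and the hottest bucket is returned for compaction. The more an sstable is read, the sooner it gets compacted, which helps read performance.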

private List<SSTableReader> getSSTablesForSTCS(Collection<SSTableReader> sstables)

Iterable<SSTableReader> candidates = cfs.getTracker().getUncompacting(sstables);
List<Pair<SSTableReader,Long>> pairs = SizeTieredCompactionStrategy.createSSTableAndLengthPairs(AbstractCompactionStrategy.filterSuspectSSTables(candidates));
List<List<SSTableReader>> buckets = SizeTieredCompactionStrategy.getBuckets(pairs,
                                                                            options.bucketHigh,
                                                                            options.bucketLow,
                                                                            options.minSSTableSize);
return SizeTieredCompactionStrategy.mostInterestingBucket(buckets, 4, 32);
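The size grouping described above can be sketched in a few lines. This is a simplified illustration, not the code of SizeTieredCompactionStrategy.getBuckets: sstables are reduced to their sizes, averages are kept as doubles, and the hotness ranking is left out. The parameter names follow the options in the call above. The rule that sstables smaller than minSSTableSize share one bucket is my understanding of the option, stated here as an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BucketSketch {
    static List<List<Long>> buckets(long[] sizes, double bucketLow, double bucketHigh, long minSSTableSize) {
        long[] sorted = sizes.clone();
        Arrays.sort(sorted); // smallest first, so each bucket grows from its small end
        List<List<Long>> buckets = new ArrayList<>();
        List<Double> averages = new ArrayList<>(); // averages.get(i) belongs to buckets.get(i)
        for (long size : sorted) {
            boolean placed = false;
            for (int i = 0; i < buckets.size() && !placed; i++) {
                double avg = averages.get(i);
                boolean similar = size > avg * bucketLow && size < avg * bucketHigh;
                boolean bothSmall = size < minSSTableSize && avg < minSSTableSize; // assumed rule
                if (similar || bothSmall) {
                    List<Long> bucket = buckets.get(i);
                    // Recompute the running average before adding the new member.
                    averages.set(i, (avg * bucket.size() + size) / (bucket.size() + 1));
                    bucket.add(size);
                    placed = true;
                }
            }
            if (!placed) {
                List<Long> fresh = new ArrayList<>();
                fresh.add(size);
                buckets.add(fresh);
                averages.add((double) size);
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        System.out.println(buckets(new long[]{1000, 100, 120, 110, 1100}, 0.5, 1.5, 0));
    }
}
```

With bounds of 0.5 and 1.5, the sizes 100, 110 and 120 land in one bucket and 1000 and 1100 in another, since 1000 is far outside 1.5 times the first bucket's average.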
