
[Solr Production Issue] Index document count exceeds the limit

Terminology

What exactly does "Too many documents" refer to: the number of indexes, or the number of documents in an index? As the analysis below shows, it is the latter: the total document count (maxDoc) of a single Lucene index, i.e. a single shard / Solr core.

Exception

During a splitShard operation, the following error is thrown:

Caused by: org.apache.lucene.index.CorruptIndexException: Too many documents: an index cannot exceed 2147483519 but readers have total maxDoc=2147483529 
	at org.apache.lucene.index.BaseCompositeReader.<init>(BaseCompositeReader.java:83)
	at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:342)
	at org.apache.lucene.index.StandardDirectoryReader.<init>(StandardDirectoryReader.java:45)
	at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:120)
	at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:460)
	at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:291)
	at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:276)
	at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:235)
	at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1759)
	... 19 more
 | org.apache.solr.common.SolrException.log(SolrException.java:159)
           

Root Cause Analysis

Although a Solr implementation can scale into the billions of documents by using a number of shards, each individual shard or Solr core is limited by the Lucene limit for an index which is approximately 2.14 billion documents (2147483519 to be exact) in the current implementation of Lucene.
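The exact limit in the error message matches Lucene's definition of the per-index ceiling, Integer.MAX_VALUE - 128. The numbers from the stack trace can be sanity-checked directly (a quick illustration, not Lucene code):

```java
public class MaxDocsCheck {
    public static void main(String[] args) {
        // Lucene defines the per-index ceiling as Integer.MAX_VALUE - 128
        int limit = Integer.MAX_VALUE - 128;
        long observedMaxDoc = 2147483529L; // from the CorruptIndexException above

        System.out.println("limit = " + limit);                    // 2147483519
        System.out.println("over by " + (observedMaxDoc - limit)); // 10 documents past the limit
    }
}
```

So the shard in this incident overshot the hard limit by only 10 documents, which is why the reader refused to open.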

https://issues.apache.org/jira/browse/SOLR-3504

https://issues.apache.org/jira/browse/SOLR-3505

https://issues.apache.org/jira/browse/SOLR-6065

https://issues.apache.org/jira/browse/LUCENE-5843

https://issues.apache.org/jira/browse/LUCENE-6299

Solution

  1. Inspecting the Lucene source shows that BaseCompositeReader validates the total document count of the index:
if (maxDoc > IndexWriter.getActualMaxDocs()) {
  if (this instanceof DirectoryReader) {
    // A single index has too many documents and it is corrupt (IndexWriter prevents this as of LUCENE-6299)
    throw new CorruptIndexException("Too many documents: an index cannot exceed " + IndexWriter.getActualMaxDocs() + " but readers have total maxDoc=" + maxDoc, Arrays.toString(subReaders));
  } else {
    // Caller is building a MultiReader and it has too many documents; this case is just illegal arguments:
    throw new IllegalArgumentException("Too many documents: composite IndexReaders cannot exceed " + IndexWriter.getActualMaxDocs() + " but readers have total maxDoc=" + maxDoc);
  }
}
  2. Modify IndexWriter.java in the Lucene source and rebuild lucene-core-6.2.0.jar. The stock limit is defined as:

public static final int MAX_DOCS = Integer.MAX_VALUE - 128;

Note that since maxDoc is an int, the ceiling cannot actually be raised to 3 billion: Integer.valueOf("3000000000") overflows int and throws NumberFormatException. The most the check can be relaxed is Integer.MAX_VALUE (2147483647), which still leaves enough headroom above the observed maxDoc=2147483529 to bypass the validation temporarily:

public static final int MAX_DOCS = Integer.valueOf(System.getProperty("solr.docs.max.limit", String.valueOf(Integer.MAX_VALUE)));
  3. Replace lucene-core-6.2.0.jar on the affected instance and restart the Solr instance.
  4. Once the instance is running stably, perform the split shard operation.
  5. Finally, roll the jar back to the original version. Problem solved.
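The effect of the workaround can be sketched in plain Java (an illustration, not Lucene code; the helper names resolveLimit and withinLimit are invented here). With the default ceiling the summed maxDoc fails the check; with a defensively parsed override, an oversized property value such as "3000000000" is clamped to Integer.MAX_VALUE and the check passes:

```java
public class DocLimitWorkaround {
    // Default ceiling, mirroring Lucene's IndexWriter (Integer.MAX_VALUE - 128).
    static final int DEFAULT_MAX_DOCS = Integer.MAX_VALUE - 128;

    // Parse the override property defensively: values above Integer.MAX_VALUE
    // (like "3000000000") cannot fit in an int, so clamp instead of crashing.
    static int resolveLimit(String raw) {
        if (raw == null) return DEFAULT_MAX_DOCS;
        long v = Long.parseLong(raw);
        return (int) Math.min(v, (long) Integer.MAX_VALUE);
    }

    // The same condition BaseCompositeReader enforces on the summed maxDoc.
    static boolean withinLimit(long totalMaxDoc, int limit) {
        return totalMaxDoc <= limit;
    }

    public static void main(String[] args) {
        long observed = 2147483529L; // from the exception in this incident

        System.out.println(withinLimit(observed, DEFAULT_MAX_DOCS));           // false: reader refuses to open
        System.out.println(withinLimit(observed, resolveLimit("3000000000"))); // true: clamped limit lets split proceed
    }
}
```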

Caveats

  1. When an index is this large, schedule the split for off-peak hours. A split shard operation performs heavy disk I/O, since splitting and copying the index involves large reads and writes. Moreover, when the original shard has multiple replicas, the split creates the same number of replicas on each sub-shard, all of which must sync their data from the leader. With a very large index or many replicas, this consumes substantial network bandwidth.
  2. Be extremely cautious when replacing jars in a customer's production cluster. Because historical versions are involved, compare the code changes carefully; ideally, fetch the deployed jar, decompile it, and diff IndexWriter.java to confirm the exact scope of the change and avoid causing a second incident.

Recommendations

  1. Monitor index status and alert early, so that shards are split in time and no single index accumulates too many documents.
  2. Index status can be inspected with Lucene's CheckIndex utility; see the references below.
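A lightweight alternative for ongoing monitoring is Solr's Luke request handler (/admin/luke), whose JSON response reports numDocs and maxDoc per core. A minimal sketch, assuming that response shape (the sample body, alert ratio, and class name are illustrative, not a fixed API):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaxDocMonitor {
    static final long LUCENE_LIMIT = Integer.MAX_VALUE - 128; // 2147483519
    static final double ALERT_RATIO = 0.9; // alert threshold; tune to taste

    // Pull "maxDoc" out of a /admin/luke JSON response body.
    static long extractMaxDoc(String lukeJson) {
        Matcher m = Pattern.compile("\"maxDoc\"\\s*:\\s*(\\d+)").matcher(lukeJson);
        if (!m.find()) throw new IllegalArgumentException("no maxDoc in response");
        return Long.parseLong(m.group(1));
    }

    // True once the core has crossed the alert threshold and should be split.
    static boolean needsSplit(long maxDoc) {
        return maxDoc >= (long) (LUCENE_LIMIT * ALERT_RATIO);
    }

    public static void main(String[] args) {
        // Sample body; in practice fetch http://host:8983/solr/<core>/admin/luke?numTerms=0&wt=json
        String sample = "{\"index\":{\"numDocs\":2100000000,\"maxDoc\":2100000000,\"deletedDocs\":0}}";
        long maxDoc = extractMaxDoc(sample);
        System.out.println("maxDoc=" + maxDoc + " needsSplit=" + needsSplit(maxDoc));
    }
}
```

Splitting at around 90% of the limit leaves a safety margin, so the cluster never needs the jar-replacement workaround described above.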

References

https://blog.csdn.net/shirdrn/article/details/9770829

http://www.it610.com/article/2134736.htm

https://blog.csdn.net/jayson1001/article/details/78228699
