Changes in CDH 5.2.0

1. New features in CDH 5.2.0
2. Incompatible changes
3. Performance improvements
4. Known issues
5. Summary

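CDH 5.2.0 was released recently. This post looks at what it improves, which incompatibilities it introduces, and whether it is worth upgrading an existing Hadoop cluster.

Avro is now at version 1.7.6. Some of the important changes:

New features:

HDFS data-at-rest encryption. In 5.2.0 this feature still has some limitations and is not yet ready for production use.

Authentication improvements when using an HTTP proxy server

A new metrics sink that writes monitoring data directly to Graphite

Specification for the Hadoop-compatible filesystem effort

OfflineImageViewer can now browse an fsimage through the WebHDFS API

Improvements to NFS support

Web UI improvements for the HDFS daemons

CDH 5.2 provides an optimized implementation of the map-side shuffle. Using it requires changing the map output collector class; it is disabled by default.

To enable it, set <code>mapreduce.job.map.output.collector.class</code> to <code>org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator</code>.

The feature cannot be used with custom Writable types or custom comparators.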
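As a rough illustration of the setting above, the following Python sketch builds a Hadoop-style `<configuration>` fragment that switches the map output collector to the native implementation. The helper names (`hadoop_property`, `native_collector_config`) are hypothetical, and which file the property belongs in (for example mapred-site.xml or a per-job configuration) is an assumption to verify for your deployment.

```python
import xml.etree.ElementTree as ET


def hadoop_property(name, value):
    """Build one Hadoop-style <property> element (hypothetical helper)."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return prop


def native_collector_config():
    """Return a <configuration> fragment that enables the native collector."""
    root = ET.Element("configuration")
    root.append(hadoop_property(
        "mapreduce.job.map.output.collector.class",
        "org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator",
    ))
    return ET.tostring(root, encoding="unicode")


print(native_collector_config())
```

Jobs that rely on custom Writable types or comparators should keep the default collector, as noted above.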

Fair Scheduler new features:

Each queue can now set a <code>fairSharePreemptionThreshold</code> property in fair-scheduler.xml; the default is 0.5.

Each queue can now set a <code>fairSharePreemptionTimeout</code> property in fair-scheduler.xml.

The web UI can display the steady fair share.

Fair Scheduler improvements:

The Fair Scheduler uses instantaneous fair share (fair share that considers only active queues) for scheduling decisions, to improve the time to achieve steady state (fair share).

<code>maxAMShare</code> now defaults to 0.5, meaning that only half of the cluster's resources can be used by ApplicationMasters. It can be set in fair-scheduler.xml.
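To make the difference between steady and instantaneous fair share concrete, here is a simplified Python sketch. It only splits resources by queue weight; it ignores minimum and maximum shares and queue hierarchy, so it is an illustration of the idea rather than the scheduler's actual algorithm. The queue names and weights are made up.

```python
def fair_shares(total, weights, active):
    """Split `total` resources across queues by weight.

    Returns (steady, instantaneous):
    - steady counts every configured queue;
    - instantaneous counts only the queues listed in `active`.
    Simplification: min/max shares and nested queues are ignored.
    """
    all_weight = sum(weights.values())
    steady = {q: total * w / all_weight for q, w in weights.items()}

    active_weight = sum(weights[q] for q in active)
    instantaneous = {
        q: (total * weights[q] / active_weight if q in active else 0.0)
        for q in weights
    }
    return steady, instantaneous


# Hypothetical queues: "reports" is configured but currently idle.
weights = {"etl": 1.0, "adhoc": 1.0, "reports": 2.0}
steady, instant = fair_shares(100.0, weights, active={"etl", "adhoc"})
print(steady)
print(instant)
```

With the idle queue excluded, the two active queues are each entitled to half of the cluster instead of a quarter, which is why scheduling on instantaneous fair share lets busy queues ramp up faster.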

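The YARN REST API supports submitting and killing applications.

The YARN timeline store integrates with Kerberos.

New join API

Added a new module, crunch-hive, for reading and writing ORC files with Crunch.

The Kite sink can write data to Hive and HBase.

Flume agents can be configured through ZooKeeper (experimental).

The embedded agent supports interceptors.

The syslog source supports configuring which fields are kept.

File channel replay is faster.

Added a new regex search-and-replace interceptor.

Backup checkpoints can optionally be compressed.

Added a new application for editing Sentry roles and privileges on databases and tables.

Search app

Added heatmap, tree, and Leaflet widgets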

micro-analysis of fields

exclusion facets

oozie dashboard: bulk actions, faster display

file browser: drag-and-drop upload, history, acls edition

hive and impala: ldap pass-through, query expiration, ssl (hive), new graphs

job browser: yarn kill application button

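HBase is updated to 0.98.6.

Hive is updated to 0.13, adding the following features:

Subqueries in the WHERE clause

Common table expressions

Parquet supports TIMESTAMP

HiveServer2 can be configured with a hiverc file whose contents are run automatically when a client connects

Permanent UDFs

HiveServer2 adds session and operation timeouts

Beeline accepts a <code>-i</code> option that runs an initialization SQL file

New join syntax (implicit joins)

CREATE TABLE supports the Avro storage format

Hive supports additional data types:

Hive can read CHAR and VARCHAR data created by Hive and Impala

Impala can read CHAR and VARCHAR data created by Hive and Impala

The DESCRIBE DATABASE command adds two new attributes: owner_name and owner_type.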

Impala is updated to 2.0. Improvements include:

Subquery improvements:

Subqueries are supported in the <code>where</code> clause and can be used with <code>in</code>

The <code>exists</code> and <code>not exists</code> operators are supported

<code>in</code> and <code>not in</code> can be used with subqueries

The where clause can take forms such as <code>where column = (select max(some_other_column) from table)</code> or <code>where column in (select some_other_column from table where conditions)</code>

correlated subqueries let you cross-reference values from the outer query block and the subquery.

scalar subqueries let you substitute the result of single-value aggregate functions such as max(), min(), count(), or avg(), where you would normally use a numeric value in a where clause.
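The subquery forms listed above can be tried out with the following Python sketch. It uses SQLite through the standard `sqlite3` module as a stand-in, not Impala, so it only illustrates the SQL semantics; the tables `orders` and `vip` and their contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL);
    CREATE TABLE vip (customer TEXT);
    INSERT INTO orders VALUES (1, 'ann', 10.0), (2, 'bob', 25.0), (3, 'cat', 25.0);
    INSERT INTO vip VALUES ('ann'), ('cat');
""")

# Scalar subquery: compare a column with a single aggregated value.
largest = conn.execute(
    "SELECT id FROM orders "
    "WHERE amount = (SELECT max(amount) FROM orders) ORDER BY id"
).fetchall()

# IN subquery: keep rows whose customer appears in another table.
vip_orders = conn.execute(
    "SELECT id FROM orders "
    "WHERE customer IN (SELECT customer FROM vip) ORDER BY id"
).fetchall()

# Correlated subquery with NOT EXISTS: the inner query refers to the outer row.
non_vip = conn.execute(
    "SELECT id FROM orders o WHERE NOT EXISTS "
    "(SELECT 1 FROM vip v WHERE v.customer = o.customer) ORDER BY id"
).fetchall()

print(largest, vip_orders, non_vip)
```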

Added several analytic (window) functions: <code>rank()</code>, <code>lag()</code>, <code>lead()</code>, <code>first_value()</code>

New data types:

VARCHAR

CHAR

Security improvements:

<a href="http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/impala_mixed_security.html#mixed_security" target="_blank">using multiple authentication methods with impala</a>

grant

revoke

create role

drop role

show roles

--disk_spill_encryption

Impala can read data compressed with gzip, bzip2, or Snappy

<code>query_timeout_s</code> sets the query timeout.

Added <code>var_samp()</code> and <code>var_pop()</code> as aliases for <code>variance_samp()</code> and <code>variance_pop()</code> respectively

Added a new date and time function: date_part()

stddev(), stddev_pop(), stddev_samp(), variance(), variance_pop(), variance_samp(), and ndv() now return DOUBLE

The default Parquet block size changes from 1 GB to 256 MB; it can also be set through the <code>parquet_file_size</code> option

Anti-joins are supported through the <code>left anti join</code> and <code>right anti join</code> clauses

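Version update: CDH 5.2 Parquet is rebased on Parquet 1.5 and parquet-format 2.1.0.

Apache Spark/Streaming is at version 1.1

Stability and performance improvements

New sort-based shuffle implementation, disabled by default.

Spark UI improvements for better performance monitoring

PySpark supports Hadoop InputFormats

Improved YARN support, along with a number of bug fixes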

cdh 5.2 sqoop 1 is rebased on sqoop 1.4.5

mainframe connector added.

parquet support added.

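The getSnapshottableDirListing() method returns null when there are no snapshottable directories

The NameNode <code>-finalize</code> startup option has been removed; to finalize a cluster upgrade, use the <code>hdfs dfsadmin -finalizeUpgrade</code> command instead

libhdfs functions now return correct error codes

The hdfs balancer command now returns exit code 0 on success instead of 1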

disable symlinks temporarily

files named <code>.snapshot</code> or <code>.reserved</code> must not exist within hdfs.

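Change in high-availability support:

The only HA implementation in CDH 5 is quorum-based storage; shared storage over NFS is no longer supported.

The CATALINA_BASE variable is no longer used to determine whether a component is configured for YARN or MRv1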

yarn fair scheduler acl change. root queue defaults to everybody, and other queues default to nobody.

The key names of the YARN high-availability configuration parameters have changed

<code>YARN_HOME</code> is now <code>HADOOP_YARN_HOME</code>

The following parameters in yarn-site.xml have been renamed:

<code>mapreduce.shuffle</code> is now <code>mapreduce_shuffle</code>

<code>yarn.nodemanager.aux-services.mapreduce.shuffle.class</code> is now <code>yarn.nodemanager.aux-services.mapreduce_shuffle.class</code>

<code>yarn.resourcemanager.resourcemanager.connect.max.wait.secs</code> is now <code>yarn.resourcemanager.connect.max-wait.secs</code>

<code>yarn.resourcemanager.resourcemanager.connect.retry_interval.secs</code> is now <code>yarn.resourcemanager.connect.retry-interval.secs</code>

<code>yarn.resourcemanager.am.max-retries</code> is now <code>yarn.resourcemanager.am.max-attempts</code>
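When migrating an existing yarn-site.xml, the renames above can be applied mechanically. The Python sketch below works on a plain dict of property names to values; the `migrate` helper is hypothetical. It treats `mapreduce.shuffle` as a service name inside the comma-separated `yarn.nodemanager.aux-services` value, which is how that setting is normally used but is not spelled out in the list above.

```python
# Old key -> new key, taken from the rename list above.
RENAMED_KEYS = {
    "yarn.nodemanager.aux-services.mapreduce.shuffle.class":
        "yarn.nodemanager.aux-services.mapreduce_shuffle.class",
    "yarn.resourcemanager.resourcemanager.connect.max.wait.secs":
        "yarn.resourcemanager.connect.max-wait.secs",
    "yarn.resourcemanager.resourcemanager.connect.retry_interval.secs":
        "yarn.resourcemanager.connect.retry-interval.secs",
    "yarn.resourcemanager.am.max-retries":
        "yarn.resourcemanager.am.max-attempts",
}

# Assumption: this rename applies to entries of yarn.nodemanager.aux-services.
RENAMED_SERVICES = {"mapreduce.shuffle": "mapreduce_shuffle"}


def migrate(props):
    """Return a copy of `props` with old YARN names replaced by new ones."""
    migrated = {}
    for key, value in props.items():
        new_key = RENAMED_KEYS.get(key, key)
        if new_key == "yarn.nodemanager.aux-services":
            services = [s.strip() for s in value.split(",")]
            value = ",".join(RENAMED_SERVICES.get(s, s) for s in services)
        migrated[new_key] = value
    return migrated


old = {
    "yarn.nodemanager.aux-services": "mapreduce.shuffle",
    "yarn.resourcemanager.am.max-retries": "2",
}
print(migrate(old))
```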

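HBase has changed too much between versions to cover here.

CDH 5 provides a new offline command for upgrading the metadata.

Incompatibilities between CDH 4.x and CDH 5:

The CDH 4 JDBC client is not compatible with the CDH 5 HiveServer2

Connecting to HiveServer2 requires the CDH 5 JARs

Because of permission and concurrency problems, the Hive CLI and HiveServer1 are deprecated and will be removed; HiveServer2 and Beeline are recommended instead

Hue from CDH 5 cannot be used with the CDH 4 HiveServer2

The npath function has been removed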

cloudera recommends that custom objectinspectors created for use with custom serdes have a no-argument constructor in addition to their normal constructors, for serialization purposes. see hive-5380 for more details.

the serde interface has been changed which requires the custom serde modules to be reworked.

the decimal data type format has changed as of cdh 5 beta 2 and is not compatible with cdh 4.

Incompatibilities between CDH 5 and CDH 5.2.x:

the cdh 5.2 hive jdbc driver is not wire-compatible with the cdh 5.1

1. Disabling transparent hugepage compaction

Check whether it is enabled.

Disable the feature, and add the command to /etc/rc.local so that the setting survives a reboot.
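The kernel reports the transparent hugepage defrag setting as a list of choices with the active one in square brackets, for example `[always] madvise never`. The Python sketch below parses such a string. The function names are hypothetical, and the exact sysfs file to read differs between distributions (commonly /sys/kernel/mm/transparent_hugepage/defrag), so treat that path as an assumption.

```python
def selected_option(sysfs_text):
    """Return the bracketed (active) choice, e.g. 'always' for '[always] never'."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no bracketed option found in: %r" % sysfs_text)


def compaction_enabled(sysfs_text):
    """Treat every active choice other than 'never' as enabled."""
    return selected_option(sysfs_text) != "never"


# Sample strings in the format the kernel uses.
print(compaction_enabled("[always] madvise never"))
print(compaction_enabled("always madvise [never]"))
```

On a real node, the string would come from reading the defrag file on that host.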

2. Configuring swapping

Check the current setting.

On most systems, vm.swappiness is set to 60 by default. This is not suitable for Hadoop cluster nodes, because it can cause processes to get swapped out even when there is free memory available. This can affect stability and performance, and may cause problems such as lengthy garbage collection pauses for important system daemons.

Setting it to 0 is recommended.
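A small Python sketch of the check described above: given the text read from /proc/sys/vm/swappiness, it returns the line to add to /etc/sysctl.conf when the value differs from the recommendation. The function name is hypothetical, and persisting the value through /etc/sysctl.conf is an assumption about how the node is managed.

```python
def sysctl_line_if_needed(raw_value, recommended=0):
    """Return a sysctl.conf line if swappiness differs from `recommended`.

    `raw_value` is the text read from /proc/sys/vm/swappiness.
    Returns None when the current value already matches.
    """
    current = int(raw_value.strip())
    if current == recommended:
        return None
    return "vm.swappiness = %d" % recommended


print(sysctl_line_if_needed("60\n"))
print(sysctl_line_if_needed("0\n"))
```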

3. Improving performance in the shuffle handler and IFile reader

Shuffle handler: enable readahead.

For YARN, set <code>mapreduce.shuffle.readahead.bytes</code>; the default is 4 MB

For MRv1, set <code>mapred.tasktracker.shuffle.readahead.bytes</code>; the default is 4 MB

IFile reader: reading IFiles ahead of time can improve merge performance. To enable it, set the <code>mapreduce.ifile.readahead</code> property to true (this is already the default). To tune it further, set <code>mapreduce.ifile.readahead.bytes</code>, which defaults to 4 MB

4. MapReduce configuration best practices

Set <code>mapreduce.tasktracker.outofband.heartbeat</code> to true; the default is false

On a small cluster, lower the JobTracker heartbeat interval through <code>mapreduce.jobtracker.heartbeat.interval.min</code>; the default is 10

5. Starting MapReduce JVMs immediately

For small jobs, set <code>mapred.reduce.slowstart.completed.maps</code> to 0; for larger jobs, set it to at most 50%

6. Adjusting the MRv1 log level

<code>mapreduce.map.log.level</code>

<code>mapreduce.reduce.log.level</code>

flume does not provide a native sink that stores the data that can be directly consumed by hive.

fast replay does not work with encrypted file channel

distcp between unencrypted and encrypted locations fails

namenode - kms communication fails after long periods of inactivity

spark fails when the kms is configured to use ssl

files inside encryption zones cannot be read in hue

cannot move encrypted files to trash

no error when changing permission to 777 on .snapshot directory

snapshots do not retain directories’ quotas settings

namenode cannot use wildcard address in a secure cluster

permissions for dfs.namenode.name.dir incorrectly set.

hadoop fsck -move does not work in a cluster with host-based kerberos

httpfs cannot get delegation token without prior authenticated request.

distcp does not work between a secure cluster and an insecure cluster in some cases

using distcp with hftp on a secure cluster using spnego requires that the dfs.https.port property be configured

offline image viewer (oiv) tool regression: missing delimited outputs.

snapshot operations are not supported by viewfilesystem

starting an unmanaged applicationmaster may fail

no jobtracker becomes active if both jobtrackers are migrated to other hosts

hadoop pipes may not be usable in an mrv1 hadoop installation done through tarballs

task-completed percentage may be reported as slightly under 100% in the web ui, even when all of a job’s tasks have successfully completed.

encrypted shuffle in mrv2 does not work if used with linuxcontainerexecutor and encrypted web uis.

link from resourcemanager to application master does not work when the web ui over https feature is enabled.

hadoop client jars don’t provide all the classes needed for clean compilation of client code

the ulimits setting in /etc/security/limits.conf is applied to the wrong user if security is enabled.

hive’s timestamp type cannot be stored in parquet

hive’s decimal type cannot be stored in parquet and avro

hive creates an invalid table if you specify more than one partition with alter table

postgresql 9.0+ requires additional configuration: <code>standard_conforming_strings</code> must be set to off

setting hive.optimize.skewjoin to true causes long running queries to fail

jdbc - executeupdate does not return the number of rows modified

hive auth (grant/revoke/show grant) statements do not support fully qualified table names (default.tab1)

parquet file writes run out of memory if (number of partitions) times (block size) exceeds available memory

hive cannot read arrays in parquet written by parquet-avro or parquet-thrift
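The Parquet write issue in the list above is a matter of simple arithmetic: each partition being written holds roughly one block in memory, so the requirement grows as partitions times block size. The Python sketch below works through one case; the function names, the 128 MiB block size, and the 4 GiB budget are all made-up illustration values, not figures from the release notes.

```python
MIB = 1024 * 1024


def parquet_write_buffer_bytes(open_partitions, block_size_bytes):
    """Rough memory needed: one block buffered per partition being written."""
    return open_partitions * block_size_bytes


def fits_in_memory(open_partitions, block_size_bytes, available_bytes):
    """True when the estimated buffers fit in the available memory."""
    needed = parquet_write_buffer_bytes(open_partitions, block_size_bytes)
    return needed <= available_bytes


# Hypothetical job: 40 partitions, 128 MiB blocks, 4 GiB available.
needed = parquet_write_buffer_bytes(40, 128 * MIB)
print(needed // MIB, "MiB needed")
print(fits_in_memory(40, 128 * MIB, 4096 * MIB))
```

In this made-up case the write would need about 5 GiB and would not fit, so either the number of partitions written at once or the block size would have to come down.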

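This post is mainly a translation of the documentation on the Cloudera website covering CDH 5.2's new features, incompatible changes, performance improvements, and known issues. It is meant to give a clear picture of what each Hadoop component now offers and to support the decision on whether to upgrade a Hadoop cluster.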
