Hadoop Environment Setup: Fully Distributed Mode

As the title says.

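1. Hardware configuration

2. Software versions

3. Preparation

3.1 Create the virtual machines, with networking set to bridged mode

3.2 Change the hostnames

3.3 Bind hostnames to IPs so the hosts can reach each other

3.4 Disable the firewall

3.5 Configure the hosts file on the host machine

3.6 Configure SSH for passwordless login between nodes

4. Install the JDK

5. Install Hadoop

6. Format

7. Start

8. Test wordcount

9. Notes

10. Configuration files

10.1 Explanation of some configuration options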

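1. Hardware configuration

Three virtual machines are used:

| Node  | IP address  | Memory | Disk | Roles |
|-------|-------------|--------|------|-------|
| node1 | 192.168.1.6 | 2GB | 10GB | NameNode, ResourceManager |
| node2 | 192.168.1.7 | 2GB | 10GB | DataNode, NodeManager, SecondaryNameNode |
| node3 | 192.168.1.8 | 2GB | 10GB | DataNode, NodeManager |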

2. Software versions

| Software | Version |
|----------|---------|
| JDK | jdk-8u271 |
| Hadoop | hadoop-3.2.1 |

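3. Preparation

On node1, perform the following steps:

3.5 Configure the hosts file on the host machine

This lets the host machine and the virtual machines ping each other.

Add the following entries to the <code>C:\Windows\System32\drivers\etc\hosts</code> file: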
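The entries themselves did not survive in the text above. A minimal sketch, assuming the node names and IP addresses from the hardware table: the same three lines would go into the Windows hosts file, and presumably into <code>/etc/hosts</code> on each node when binding hostnames to IPs. <code>HOSTS_FILE</code> is a hypothetical variable that points at a scratch file, so the snippet can be tried without touching a real hosts file.

```shell
# HOSTS_FILE is a hypothetical scratch path; on a Linux node the real target
# would be /etc/hosts (the Windows hosts file is edited by hand as administrator).
HOSTS_FILE="${HOSTS_FILE:-./hosts.example}"

# Append one "IP hostname" mapping per node (values from the hardware table).
cat >> "$HOSTS_FILE" <<'EOF'
192.168.1.6 node1
192.168.1.7 node2
192.168.1.8 node3
EOF
```

After the real file is updated, <code>ping node1</code> from the host machine is a quick way to confirm the mapping works.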

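3.6 Configure SSH for passwordless login between nodes

Passwordless login means that from node1 you can log in to the other machines with <code>ssh node2</code> or <code>ssh node3</code> without entering a password.

On each of the three virtual machines, run the following in the <code>/root</code> directory:

Generate the SSH key pair and choose where it is stored. The path is <code>~/.ssh</code>.

Go into the <code>.ssh</code> directory and run the following command to append the public key to <code>authorized_keys</code>:

Copy the <code>authorized_keys</code> file from node1 into the <code>~/.ssh</code> directory on the other virtual machines: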
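The commands for these three steps are missing from the text, so here is one way they might look. This is a sketch, not necessarily what the author ran: the function names are hypothetical helpers, the key type (RSA with an empty passphrase) is an assumption, and the <code>chmod</code> is an added precaution because sshd typically rejects an <code>authorized_keys</code> file with loose permissions.

```shell
# Step 1 (every node): create an RSA key pair with an empty passphrase.
# $1 is the key directory, e.g. ~/.ssh
generate_key() {
  ssh-keygen -t rsa -N "" -f "$1/id_rsa"
}

# Step 2 (every node): authorize the node's own public key.
authorize_own_key() {
  cat "$1/id_rsa.pub" >> "$1/authorized_keys"
  chmod 600 "$1/authorized_keys"   # precaution, not mentioned in the post
}

# Step 3 (node1 only): push node1's authorized_keys to the other nodes.
# The password is still requested for this first copy.
push_authorized_keys() {
  for host in node2 node3; do
    scp "$1/authorized_keys" "root@${host}:~/.ssh/"
  done
}

# Typical use as root:
#   generate_key ~/.ssh && authorize_own_key ~/.ssh
#   push_authorized_keys ~/.ssh        # on node1
```

Once the file is in place, <code>ssh node2</code> from node1 should log in without a password prompt.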

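4. Install the JDK

On node1, download and extract the JDK, then configure the environment variables:

Copy <code>jdk1.8.0_271</code> to node2 and node3.

Copy <code>/etc/profile</code> to node2 and node3.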
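As a rough sketch of these steps: the post names the <code>jdk1.8.0_271</code> directory but not where it was unpacked, so <code>INSTALL_DIR</code> (defaulting to <code>/usr/local</code>) is an assumption, and the helper function names are hypothetical.

```shell
# INSTALL_DIR is an assumption; the post does not say where the JDK was unpacked.
INSTALL_DIR="${INSTALL_DIR:-/usr/local}"
JAVA_HOME="$INSTALL_DIR/jdk1.8.0_271"

# Print the lines to append to /etc/profile on node1.
profile_lines() {
  printf 'export JAVA_HOME=%s\n' "$JAVA_HOME"
  printf 'export PATH=$PATH:$JAVA_HOME/bin\n'
}

# Copy the JDK directory and /etc/profile to the other two nodes.
copy_to_workers() {
  for host in node2 node3; do
    scp -r "$JAVA_HOME" "root@${host}:$INSTALL_DIR/"
    scp /etc/profile "root@${host}:/etc/profile"
  done
}

# On node1:
#   profile_lines >> /etc/profile && . /etc/profile
#   copy_to_workers
```

Running <code>java -version</code> on each node afterwards confirms that the variables took effect.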

5. Install Hadoop

After editing the configuration files, copy <code>hadoop-3.2.1</code> to node2 and node3.

6. Format

On node1:

Before formatting again, first delete the <code>dfs/namenode</code> and <code>dfs/datanode</code> directories on the NameNode and the DataNodes.
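A sketch of the format step and the cleanup before a re-format. <code>hdfs namenode -format</code> is the standard Hadoop command; <code>DFS_BASE</code> is a hypothetical variable for the parent of the <code>dfs/</code> directory, which depends on how <code>hdfs-site.xml</code> was configured, and the function names are hypothetical too.

```shell
# DFS_BASE is hypothetical: the parent of the dfs/ directory configured through
# dfs.namenode.name.dir and dfs.datanode.data.dir in hdfs-site.xml.
DFS_BASE="${DFS_BASE:-/path/to/hadoop/data}"

# Remove the old storage directories (every node, only before a re-format).
clean_dfs_dirs() {
  rm -rf "$DFS_BASE/dfs/namenode" "$DFS_BASE/dfs/datanode"
}

# Format HDFS (once, on node1 only).
format_namenode() {
  hdfs namenode -format
}
```

Skipping the cleanup leaves DataNodes holding the old cluster ID, which is a likely reason for a DataNode failing to start after a second format.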

7. Start

The daemons can be started all at once or separately.
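Both options can be sketched with the scripts that ship in Hadoop's <code>sbin</code> directory. <code>HADOOP_HOME</code> is assumed to point at the unpacked <code>hadoop-3.2.1</code> directory; the function names are hypothetical, and the expected daemon list simply restates the role table.

```shell
# Option 1: start HDFS and YARN in one go (on node1).
start_everything() {
  "$HADOOP_HOME/sbin/start-all.sh"
}

# Option 2: start the two layers separately.
start_separately() {
  "$HADOOP_HOME/sbin/start-dfs.sh"
  "$HADOOP_HOME/sbin/start-yarn.sh"
}

# Daemons that `jps` should list on each node, per the role table.
expected_daemons() {
  case "$1" in
    node1) echo "NameNode ResourceManager" ;;
    node2) echo "DataNode NodeManager SecondaryNameNode" ;;
    node3) echo "DataNode NodeManager" ;;
    *)     return 1 ;;
  esac
}
```

Comparing the output of <code>jps</code> on a node with <code>expected_daemons</code> for that node shows quickly whether a daemon failed to start.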

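8. Test wordcount

The following problems came up:

(1) Submitting a job through YARN produced <code>Failed while trying to construct the redirect url to the log server. Log Server url may not be configured</code>.

The cause is that the historyserver service was not configured. Configure the following properties: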
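The property list is missing from the text, so the following is only a guess at a typical fix, not the author's exact configuration. It assumes the JobHistory server runs on node1 with its default web port 19888; <code>yarn.log-aggregation-enable</code> and <code>yarn.log.server.url</code> are real <code>yarn-site.xml</code> properties, while the function names are hypothetical.

```shell
# Assumption: the JobHistory server runs on node1, default web port 19888.
HISTORY_HOST="${HISTORY_HOST:-node1}"

# Print properties to place inside <configuration> in yarn-site.xml.
log_server_properties() {
  cat <<EOF
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.log.server.url</name>
  <value>http://${HISTORY_HOST}:19888/jobhistory/logs</value>
</property>
EOF
}

# Start the JobHistory server on that host (Hadoop 3.x syntax).
start_historyserver() {
  mapred --daemon start historyserver
}
```

The changed <code>yarn-site.xml</code> has to reach every node, and YARN needs a restart before the redirect works.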

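(2) While running a job, this error appeared: <code>错误: 找不到或无法加载主类 org.apache.hadoop.mapreduce.v2.app.MRAppMaster</code> (in English: "Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster").

Add the value above to the following property in <code>yarn-site.xml</code>: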
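Neither the value nor the property name survived in the text. A common fix for this error, offered here as an assumption, is to take the output of <code>hadoop classpath</code> and set it as <code>yarn.application.classpath</code>. The helper name below is hypothetical.

```shell
# Wrap a classpath string in the yarn.application.classpath property.
classpath_property() {
  printf '<property>\n  <name>yarn.application.classpath</name>\n  <value>%s</value>\n</property>\n' "$1"
}

# On a node with Hadoop on the PATH:
#   classpath_property "$(hadoop classpath)"
```

The printed element goes inside <code>&lt;configuration&gt;</code> in <code>yarn-site.xml</code> on every node.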

(3) While running a job, the error <code>The auxService:mapreduce_shuffle does not exist</code> appeared.

This happened because the <code>yarn.nodemanager.aux-services</code> property was left out when copying <code>yarn-site.xml</code>.
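For reference, the property looks like this, together with a small check that can be run against the <code>yarn-site.xml</code> on each node. The two function names are hypothetical helpers.

```shell
# Print the property that registers the MapReduce shuffle service.
aux_service_property() {
  cat <<'EOF'
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
EOF
}

# Succeed if the given yarn-site.xml contains the property name.
has_shuffle_service() {
  grep -q '<name>yarn.nodemanager.aux-services</name>' "$1"
}

# Example: has_shuffle_service "$HADOOP_HOME/etc/hadoop/yarn-site.xml" || echo "missing"
```

Because the NodeManagers run on node2 and node3, those are the copies of the file that matter most.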

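(4) The first time a job was run, the log output stayed stuck at <code>INFO mapreduce.Job: Running job: job_1605371813670_0001</code>. For this problem, first check that the configuration files are correct, then look at YARN's resource allocation.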

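9. Notes

(1) If a process fails to start, check whether a configuration file is wrong, or whether the previous cluster's ID was left in place when formatting.

(2) If startup fails with <code>ERROR: but there is no HDFS_NAMENODE_USER defined. Aborting operation.</code>, this item has not been set in <code>hadoop-env.sh</code>. See the configuration files section for the exact settings.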
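A sketch of the lines this usually needs in <code>etc/hadoop/hadoop-env.sh</code>. It assumes every daemon runs as root, which matches the <code>/root</code> working directory used earlier; the other variables are included because the start scripts perform the same check for each daemon they launch.

```shell
# Lines for etc/hadoop/hadoop-env.sh, assuming every daemon runs as root.
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
```

Running the daemons as a dedicated non-root user is the more cautious choice outside a test cluster.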

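(3) In Hadoop 3.x, the NameNode web port changed to 9870 (it was 50070 in Hadoop 2.x).

(4) For the configuration files, also consult the official documentation: the cluster setup guide and the reference pages for core-site.xml, hdfs-site.xml, yarn-site.xml and mapred-site.xml.

(5) When running jobs, pay attention to resource allocation.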

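10.1 Explanation of some configuration options

Administrators should use <code>etc/hadoop/hadoop-env.sh</code>, and optionally <code>etc/hadoop/mapred-env.sh</code> and <code>etc/hadoop/yarn-env.sh</code>, to customize the environment of the Hadoop daemons, for example how much heap memory the NameNode uses.

At a minimum, JAVA_HOME must be specified on every remote node.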
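As an illustration, two such lines in <code>etc/hadoop/hadoop-env.sh</code> might look like this. The JDK path is an assumption (the post names the <code>jdk1.8.0_271</code> directory but not its parent), and the 1g heap is only an illustrative value for nodes with 2GB of memory.

```shell
# Lines for etc/hadoop/hadoop-env.sh.
# Assumed JDK location; adjust to wherever jdk1.8.0_271 was unpacked.
export JAVA_HOME=/usr/local/jdk1.8.0_271

# Illustrative NameNode heap limit; HDFS_NAMENODE_OPTS is the Hadoop 3.x variable.
export HDFS_NAMENODE_OPTS="-Xmx1g"
```

Setting JAVA_HOME here as well as in <code>/etc/profile</code> matters because daemons started over SSH may not read the login profile.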

<code>dfs.datanode.data.dir</code>: the default is <code>file://${hadoop.tmp.dir}/dfs/data</code>.

Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. The directories should be tagged with corresponding storage types ([SSD]/[DISK]/[ARCHIVE]/[RAM_DISK]) for HDFS storage policies. The default storage type will be DISK if the directory does not have a storage type tagged explicitly. Directories that do not exist will be created if local filesystem permission allows.
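To make the tagged, comma-delimited format concrete, here is a small sketch that builds such a value. The helper name and the mount paths are hypothetical; only the <code>[TYPE]file://path</code> shape comes from the description above.

```shell
# Join "TYPE:path" arguments into a dfs.datanode.data.dir value.
data_dir_value() {
  value=""
  for spec in "$@"; do
    kind="${spec%%:*}"     # storage type before the first colon
    path="${spec#*:}"      # local path after it
    value="${value:+$value,}[${kind}]file://${path}"
  done
  printf '%s\n' "$value"
}

data_dir_value SSD:/mnt/ssd0/dfs/data DISK:/mnt/disk0/dfs/data
# prints [SSD]file:///mnt/ssd0/dfs/data,[DISK]file:///mnt/disk0/dfs/data
```

The resulting string is what would go in the <code>&lt;value&gt;</code> element of the property in <code>hdfs-site.xml</code>.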