Prerequisite
Hadoop 2.2 has been installed (the installation steps below should be applied on every Hadoop node).
Step 1. Install R (via yum)
[hadoop@c0046220 yum.repos.d]$ sudo yum update
yum.repos.d]$ yum search r-project
yum.repos.d]$ sudo yum install R
...
Installed:
R.x86_64 0:3.0.2-1.el6
Dependency Installed:
R-core.x86_64 0:3.0.2-1.el6
R-core-devel.x86_64 0:3.0.2-1.el6
R-devel.x86_64 0:3.0.2-1.el6
R-java.x86_64 0:3.0.2-1.el6
R-java-devel.x86_64 0:3.0.2-1.el6
bzip2-devel.x86_64 0:1.0.5-7.el6_0
fontconfig-devel.x86_64 0:2.8.0-3.el6
freetype-devel.x86_64 0:2.3.11-14.el6_3.1
java-1.6.0-openjdk-devel.x86_64 1:1.6.0.0-1.62.1.11.11.90.el6_4
kpathsea.x86_64 0:2007-57.el6_2
libRmath.x86_64 0:3.0.2-1.el6
libRmath-devel.x86_64 0:3.0.2-1.el6
libXft-devel.x86_64 0:2.3.1-2.el6
libXmu.x86_64 0:1.1.1-2.el6
libXrender-devel.x86_64 0:0.9.7-2.el6
libicu.x86_64 0:4.2.1-9.1.el6_2
netpbm.x86_64 0:10.47.05-11.el6
netpbm-progs.x86_64 0:10.47.05-11.el6
pcre-devel.x86_64 0:7.8-6.el6
psutils.x86_64 0:1.17-34.el6
tcl.x86_64 1:8.5.7-6.el6
tcl-devel.x86_64 1:8.5.7-6.el6
tex-preview.noarch 0:11.85-10.el6
texinfo.x86_64 0:4.13a-8.el6
texinfo-tex.x86_64 0:4.13a-8.el6
texlive.x86_64 0:2007-57.el6_2
texlive-dvips.x86_64 0:2007-57.el6_2
texlive-latex.x86_64 0:2007-57.el6_2
texlive-texmf.noarch 0:2007-38.el6
texlive-texmf-dvips.noarch 0:2007-38.el6
texlive-texmf-errata.noarch 0:2007-7.1.el6
texlive-texmf-errata-dvips.noarch 0:2007-7.1.el6
texlive-texmf-errata-fonts.noarch 0:2007-7.1.el6
texlive-texmf-errata-latex.noarch 0:2007-7.1.el6
texlive-texmf-fonts.noarch 0:2007-38.el6
texlive-texmf-latex.noarch 0:2007-38.el6
texlive-utils.x86_64 0:2007-57.el6_2
tk.x86_64 1:8.5.7-5.el6
tk-devel.x86_64 1:8.5.7-5.el6
zlib-devel.x86_64 0:1.2.3-29.el6
Complete!
Validation:
yum.repos.d]$ R

R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>
Step 2. Install RHadoop
2.1 Getting the RHadoop Packages
Download the rhdfs, rmr2 and rhbase packages and then run the commands below.
~]$ cd /tmp
tmp]$ mkdir RHadoop
tmp]$ cd RHadoop
RHadoop]$ wget \
  https://raw.githubusercontent.com/RevolutionAnalytics/rhdfs/master/build/rhdfs_1.0.8.tar.gz \
  https://raw.githubusercontent.com/RevolutionAnalytics/rmr2/3.1.0/build/rmr2_3.1.0.tar.gz \
  https://raw.githubusercontent.com/RevolutionAnalytics/rhbase/master/build/rhbase_1.2.0.tar.gz
2.2 Install the R packages that RHadoop depends on.
java]$ echo $JAVA_HOME
/usr/java/jdk1.8.0_05
java]$ sudo -i
[root@c0046220 ~]# export JAVA_HOME=/usr/java/jdk1.8.0_05
~]# R CMD javareconf
[root@c0046220 ~]# R
> .libPaths()
[1] "/usr/lib64/R/library" "/usr/share/R/library"
install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional",
"stringr", "plyr", "reshape2", "caTools"))
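Before moving on, it may help to verify that these dependencies actually installed; a small base-R sketch (the package names are the ones passed to install.packages() above):

```r
# Check which of the required packages are present in the library paths.
deps <- c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional",
          "stringr", "plyr", "reshape2", "caTools")
missing <- setdiff(deps, rownames(installed.packages()))
missing  # character(0) means everything is in place; otherwise re-run install.packages(missing)
```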
2.3 Install the RHadoop packages
Set the environment variables:
~]$ vi ~/.bashrc
# set HADOOP locations for RHadoop
export HADOOP_CMD=$HADOOP_HOME/bin/hadoop
export HADOOP_STREAMING=/opt/hadoop/hadoop-2.2.0/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar
~]$ source .bashrc
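Note that `sudo -i` in the next step starts a fresh root shell that does not inherit these variables, which is why the transcript sets them again with Sys.setenv() inside R. A quick way to confirm they are visible in any R session (a sketch; the paths are the ones used in this tutorial, adjust to your installation):

```r
# Set (or inherit) the Hadoop locations, then fail early if they are missing.
Sys.setenv(HADOOP_CMD = "/opt/hadoop/hadoop-2.2.0/bin/hadoop",
           HADOOP_STREAMING = "/opt/hadoop/hadoop-2.2.0/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar")
stopifnot(nzchar(Sys.getenv("HADOOP_CMD")),
          nzchar(Sys.getenv("HADOOP_STREAMING")))
```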
[hadoop@c0040084 R]$ sudo -i
[root@c0040084 ~]# R
Sys.setenv(HADOOP_HOME="/opt/hadoop/hadoop-2.2.0");
Sys.setenv(HADOOP_CMD="/opt/hadoop/hadoop-2.2.0/bin/hadoop");
Sys.setenv(HADOOP_STREAMING="/opt/hadoop/hadoop-2.2.0/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar");
install.packages(pkgs="/tmp/RHadoop/rhdfs_1.0.8.tar.gz",repos=NULL);
install.packages(pkgs="/tmp/RHadoop/rmr2_3.1.0.tar.gz",repos=NULL);
Step 3. Validation
Load and initialize the rhdfs package, and execute some simple commands as below:
library(rhdfs)
hdfs.init()
hdfs.ls("/")
[hadoop@c0046220 ~]$ R
> library(rhdfs)
Loading required package: rJava
Be sure to run hdfs.init()
> hdfs.init()
14/05/15 10:02:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> hdfs.ls("/")
  permission owner  group      size modtime          file
1 drwxr-xr-x hadoop supergroup    0 2014-05-14 03:05 /apps
2 drwxr-xr-x hadoop supergroup    0 2014-05-12 09:40 /data
3 drwxr-xr-x hadoop supergroup    0 2014-05-12 09:45 /output
4 drwxrwx--- hadoop supergroup    0 2014-05-15 10:02 /tmp
5 drwxr-xr-x hadoop supergroup    0 2014-05-14 05:48 /user
6 drwxr-xr-x hadoop supergroup    0 2014-05-13 06:43 /usr
Load the rmr2 package, and execute some simple commands as below:
library(rmr2)
from.dfs(to.dfs(1:100))
from.dfs(mapreduce(to.dfs(1:100)))
~]$ R
> library(rmr2)
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: bitops
Loading required package: digest
Loading required package: functional
Loading required package: reshape2
Loading required package: stringr
Loading required package: plyr
Loading required package: caTools
> from.dfs(to.dfs(1:100))
$key
NULL
$val
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
[91] 91 92 93 94 95 96 97 98 99 100
input <- "/user/hadoop/tmp.txt"
wordcount = function(input, output = NULL, pattern = " "){
  wc.map = function(., lines) {
    keyval(unlist(strsplit(x = lines, split = pattern)), 1)
  }
  wc.reduce = function(word, counts) {
    keyval(word, sum(counts))
  }
  mapreduce(input = input, output = output, input.format = "text",
            map = wc.map, reduce = wc.reduce, combine = T)
}
wordcount(input)
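The logic of this job can be checked in plain R without touching the cluster; in the sketch below, strsplit() plays the map step (one record per word) and table() plays the shuffle-and-reduce (one summed count per word). The two input lines are a stand-in for tmp.txt. rmr2 also offers rmr.options(backend = "local") for running the real mapreduce() call without Hadoop.

```r
# Plain-R sketch of the same word count, no Hadoop required.
lines  <- c("hello hadoop world", "hello world")   # stand-in for tmp.txt
words  <- unlist(strsplit(lines, split = " "))     # map: one entry per word
counts <- table(words)                             # reduce: count per word
counts[["hello"]]  # 2
counts[["world"]]  # 2
```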
10:18:40 INFO mapreduce.Job: Job job_1399887026053_0013 completed successfully
10:18:40 INFO mapreduce.Job: Counters: 45
File System Counters
FILE: Number of bytes read=11018
FILE: Number of bytes written=278566
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2004
HDFS: Number of bytes written=11583
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Failed reduce tasks=1
Launched map tasks=2
Launched reduce tasks=2
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=23412
Total time spent by all reduces in occupied slots (ms)=13859
Map-Reduce Framework
Map input records=24
Map output records=112
Map output bytes=10522
Map output materialized bytes=11024
Input split bytes=208
Combine input records=112
Combine output records=114
Reduce input groups=105
Reduce shuffle bytes=11024
Reduce input records=114
Reduce output records=112
Spilled Records=228
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=569
CPU time spent (ms)=3700
Physical memory (bytes) snapshot=574214144
Virtual memory (bytes) snapshot=6258499584
Total committed heap usage (bytes)=365953024
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1796
File Output Format Counters
Bytes Written=11583
rmr
reduce calls=110
10:18:40 INFO streaming.StreamJob: Output directory: /tmp/file612355aa2e35
function ()
{
    fname
}
<environment: 0x37d70d0>
from.dfs("/tmp/file612355aa2e35")
[1] "-"
[2] "of"
[3] "Hong"
[4] "Paul‘s"
[5] "School"
[6] "College"
[7] "Graduate"