Download Rhdfs Package For R
Big Data Simplified
8.4 INTEGRATING HADOOP WITH R
To begin, open the R console in an Ubuntu terminal using the following command.
amit@amit-Lenovo-Z51-70:~$ R
Once the R console is open, check the current working directory using the R command 'getwd()'.
To integrate R with the Hadoop ecosystem, the RHadoop package can be used. RHadoop is a
collection of five R packages that allows users to manage and analyse data with Hadoop. The
packages have been tested on recent releases of the Cloudera and Hortonworks Hadoop distributions.
A brief description of the five packages under RHadoop is given as follows.
• rhdfs : It provides basic connectivity to the Hadoop Distributed File System. R programmers
can browse, read, write and modify files stored in HDFS from within R.
• rhbase : It provides basic connectivity to the HBASE distributed database using the Thrift
server. R programmers can browse, read, write and modify tables stored in HBASE from
within R.
• rmr2 : It allows users to perform statistical analysis in R through Hadoop MapReduce
functionality on a Hadoop cluster.
• plyrmr : It enables users to perform common data manipulation operations, as found in popular
packages such as plyr and reshape2, on very large data sets stored on Hadoop. Like rmr2, it
relies on Hadoop MapReduce to perform its tasks, but it provides a familiar plyr-like interface
while hiding many of the MapReduce details.
• ravro : It provides the ability to read and write Avro files from the local and HDFS file
systems and adds an Avro input format for rmr2.
First, download all the packages listed below (or their latest versions) from
https://github.com/RevolutionAnalytics/Rhadoop/wiki/Downloads.
• For rhdfs package: rhdfs_1.0.8.tar.gz
• For rhbase package: rhbase_1.2.1.tar.gz
• For rmr2 package: rmr2_3.3.1.tar.gz
• For plyrmr package: plyrmr_0.6.0.tar.gz
• For ravro package: ravro_1.0.4.tar.gz
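Before installing anything, it may be worth confirming that all five archives were actually downloaded. The following is only a minimal sketch: the helper name missing_archives and the DOWNLOAD_DIR variable are made up for illustration, and it assumes the exact versions listed above were saved to the Downloads folder in the home directory.

```shell
# Assumption: the archives were saved to ~/Downloads with the versions listed above.
DOWNLOAD_DIR="${DOWNLOAD_DIR:-${HOME:-.}/Downloads}"

# Hypothetical helper: print each expected archive that is absent from
# the given directory; return 1 if at least one is missing, 0 otherwise.
missing_archives() {
  dir=$1
  rc=0
  for f in rhdfs_1.0.8.tar.gz rhbase_1.2.1.tar.gz rmr2_3.3.1.tar.gz \
           plyrmr_0.6.0.tar.gz ravro_1.0.4.tar.gz; do
    if [ ! -f "$dir/$f" ]; then
      echo "$f"
      rc=1
    fi
  done
  return $rc
}

missing_archives "$DOWNLOAD_DIR" || echo "Some archives are missing from $DOWNLOAD_DIR" >&2
```

If newer versions were downloaded instead, the file names in the loop need to be adjusted to match.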
The files are stored in the Downloads folder (/home/<username>/Downloads). Before installing
each of the above packages, all the other packages on which they depend
need to be installed. Following is a quick step-by-step guide on what to install and how.
A. Let's start with the rmr2 package. It has a dependency on the caTools package. So, here is the
sequence of installation steps.
1. Install the caTools package from within the R console (or RStudio) using the following
command.
> install.packages("caTools")
If there is an error, try the extended version of the command.
> install.packages("caTools", repos = "https://cran.rstudio.com",
dependencies = TRUE)
2. Then exit the R console to return to the Ubuntu prompt and run the installation for rmr2.
amit@amit-Lenovo-Z51-70:~$ sudo HADOOP_CMD=/usr/bin/hadoop R CMD INSTALL /home/amit/Downloads/rmr2_3.3.1.tar.gz
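The install command fails partway through if the Hadoop executable or the archive is not where the command expects it. As a rough sketch only, the same step can be wrapped in a small function that checks both paths first. The function name install_rhadoop_pkg is hypothetical, and the paths in the example call are the ones used in this section; they may differ on another machine.

```shell
# Hypothetical helper: install one RHadoop archive after basic sanity checks.
# $1 = path to the hadoop executable, $2 = path to the package archive.
install_rhadoop_pkg() {
  hadoop_cmd=$1
  archive=$2
  if [ ! -x "$hadoop_cmd" ]; then
    echo "hadoop executable not found at $hadoop_cmd" >&2
    return 1
  fi
  if [ ! -f "$archive" ]; then
    echo "archive not found: $archive" >&2
    return 1
  fi
  # Same command as in the text, with the two paths parameterised.
  sudo HADOOP_CMD="$hadoop_cmd" R CMD INSTALL "$archive"
}

# Example call mirroring the rmr2 step (uncomment to run):
# install_rhadoop_pkg /usr/bin/hadoop "$HOME/Downloads/rmr2_3.3.1.tar.gz"
```

Running `which hadoop` is one way to find the correct value for the first argument if Hadoop is not installed under /usr/bin.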
B. Next, let's install the plyrmr package. Its dependencies are rmr2 (which is already
installed), R.methodsS3, Hmisc and rjson. In turn, the package Hmisc has a dependency on
acepack, which can be installed only if gfortran is installed. Hence, we need to start with gfortran and
install it using the following set of commands from the Ubuntu prompt.
$ sudo -i
$ apt-get update
$ apt-get install gfortran
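To confirm that the compiler is now available before moving on, a quick check along the following lines can be used. This is just a sketch; the helper name have_tool is made up for illustration.

```shell
# Hypothetical helper: succeed if the named program is on the PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

if have_tool gfortran; then
  # Show the first line of the version banner as confirmation.
  gfortran --version | head -n 1
else
  echo "gfortran is not on the PATH; acepack is unlikely to build" >&2
fi
```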
Next, we should install acepack using the following command from R console (or RStudio).
> install.packages("acepack", repos = "https://cran.rstudio.com",
dependencies = TRUE)
Similarly, we shall install the packages Hmisc and R.methodsS3.
> install.packages("Hmisc", repos = "https://cran.rstudio.com",
dependencies = TRUE)
> install.packages("R.methodsS3", repos = "https://cran.rstudio.com",
dependencies = TRUE)
Finally, we install plyrmr from the Ubuntu prompt.
$ sudo HADOOP_CMD=/usr/bin/hadoop R CMD INSTALL /home/amit/Downloads/plyrmr_0.6.0.tar.gz
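Once the steps in A and B are done, it can be useful to check from the Ubuntu prompt which of the packages R can actually find. The sketch below is one possible way to do that, assuming Rscript is on the PATH; the helper name r_pkg_installed is hypothetical. It relies on R's requireNamespace() function, which reports whether a package can be loaded. The list includes rjson because it is named above as a plyrmr dependency; if it is reported as missing, it can be installed with install.packages() in the same way as the other dependencies.

```shell
# Hypothetical helper: return 0 if R can load the named package,
# 1 if it cannot, and 2 if Rscript itself is not available.
r_pkg_installed() {
  command -v Rscript >/dev/null 2>&1 || { echo "Rscript not found" >&2; return 2; }
  Rscript -e "quit(status = if (requireNamespace('$1', quietly = TRUE)) 0 else 1)" \
    >/dev/null 2>&1
}

# Packages installed in steps A and B, plus rjson (listed as a plyrmr dependency).
for pkg in caTools rmr2 acepack Hmisc R.methodsS3 rjson plyrmr; do
  if r_pkg_installed "$pkg"; then
    echo "installed: $pkg"
  else
    echo "missing:   $pkg"
  fi
done
```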
Posted by: autolightmirror.blogspot.com
Source: https://www.oreilly.com/library/view/big-data-simplified/9789353941505/chapter-122.html