Moving Data with the Hadoop Distributed File System
To analyze big data you need big data. That is, you need to have data stored on disk. The de facto standard for storing big data in a resilient, distributed manner is Apache’s Hadoop Distributed File System (HDFS). This post walks through different methods of storing data in HDFS on the ACCRE BigData Cluster, and along the way, we’ll introduce some basic Hadoop File System shell commands.
Intra-HDFS
Local to HDFS
Probably the most common workflow for new users is to scp some data to bigdata.accre.vanderbilt.edu and then move that data to HDFS. The command for doing that is:
hadoop fs -copyFromLocal \
file:///scratch/$USER/some/data hdfs:///user/$USER/some/dir
or, equivalently:
hadoop fs -copyFromLocal some/data some/dir
The second option highlights the use of paths relative to the user’s home directory in both the local and the Hadoop file systems.
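To check that the data landed where you expect, you can list the destination with -ls (here some/dir is just the hypothetical target directory from the examples above):
hadoop fs -ls hdfs:///user/$USER/some/dir
hadoop fs -ls some/dir
Both forms refer to the same HDFS directory, the second via a path relative to your HDFS home directory.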
We also have the option to use -moveFromLocal, which will delete the local source file once it has been copied to HDFS. This command is useful if you have many large files that you don’t want hanging around on the native file system on the cluster. One solution is to combine an scp command with a remote ssh command (here bigdata is shorthand for bigdata.accre.vanderbilt.edu, e.g. via an alias in your ~/.ssh/config):
for f in *.txt; do
scp $f bigdata:$f;
ssh bigdata "hadoop fs -moveFromLocal $f $f";
done
HDFS to Local
Copying from HDFS to a local drive works in much the same way, using the analogous hadoop fs commands -copyToLocal and -moveToLocal.
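For instance, a minimal sketch of pulling a directory from HDFS back to local scratch space (the paths are just the hypothetical ones used above):
hadoop fs -copyToLocal \
hdfs:///user/$USER/some/dir file:///scratch/$USER/some/dir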
Moving data on HDFS
The hadoop fs commands also have analogues for the *nix commands mv, cp, mkdir, rm, rmdir, ls, chmod, chown, and many others, whose use is very similar to the *nix versions.
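For example, a few of these in action (the paths are placeholders):
hadoop fs -mkdir -p some/new/dir
hadoop fs -mv some/data some/new/dir/
hadoop fs -ls some/new/dir
hadoop fs -chmod 750 some/new/dir
hadoop fs -rm -r some/new/dir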
Inter-HDFS
In the intra-HDFS transfers above, all of the distributed files have to pass through a single node at some point along the way, a many-to-one or one-to-many model if you will. But moving data between HDFS clusters can be greatly accelerated, since HDFS file blocks typically reside on only 3 different nodes within a cluster; thus, this model is “few-to-few”, and Hadoop provides the DistCp (“distributed copy”) utility for just such applications.
HDFS to HDFS
Passing data from one HDFS cluster to the next is fairly vanilla:
hadoop distcp hdfs://another-hdfs-host:8020/foo/bar \
hdfs://abd740:8020/bar/foo
This could be useful if you have collaborators running a Hadoop cluster who’d like to share their data with you.
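DistCp also accepts options such as -update (copy only files that are missing or differ at the destination) and -p (preserve file attributes); a sketch reusing the hypothetical hosts from above:
hadoop distcp -update -p \
hdfs://another-hdfs-host:8020/foo/bar \
hdfs://abd740:8020/bar/foo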
AWS S3 to HDFS
Copying to and from Amazon’s S3 (Simple Storage Service) storage is also supported by distcp.
To use AWS (Amazon Web Services), a user needs to have credentials. Getting credentialed is a slightly tedious but well-documented process that warrants no further explanation here. Instead, I assume that you have credentials stored in the file ~/.aws/credentials on node abd740.
Your AWS credentials need to be passed as command-line arguments to distcp, and I’ve found that a convenient and somewhat conventional way is to simply set the credentials as environment variables. I’ve factored out setting these credentials into its own script, since setting these environment variables comes up fairly often:
#!/bin/bash
# ~/.aws/set_credentials.sh
export $(cat ~/.aws/credentials | grep -v "^\[" | awk '{print toupper($1)$2$3 }')
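For reference, this assumes the standard INI-style layout of ~/.aws/credentials: the grep drops the [default] section header and the awk uppercases each key name, so sourcing the script exports variables along these lines (the values shown are placeholders):
[default]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# after sourcing ~/.aws/set_credentials.sh:
# AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
# AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx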
I also store my distcp command in a script:
#!/bin/bash
. ~/.aws/set_credentials.sh
hadoop distcp \
-Dfs.s3n.awsAccessKeyId=$AWS_ACCESS_KEY_ID \
-Dfs.s3n.awsSecretAccessKey=$AWS_SECRET_ACCESS_KEY \
s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/ \
hdfs:///user/$USER/eng-us-all
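Once the job finishes, a quick sanity check on how much data arrived (using the destination path from the script) is:
hadoop fs -du -s -h hdfs:///user/$USER/eng-us-all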
It’s really that simple; however, note that s3, s3n, and s3a are all distinct specifications, and you should modify your -D options (Java system properties) according to the data source.
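For example, if your source uses the newer s3a connector instead, the credential properties are named fs.s3a.access.key and fs.s3a.secret.key; a minimal sketch, with my-bucket standing in for a hypothetical bucket of your own:
hadoop distcp \
-Dfs.s3a.access.key=$AWS_ACCESS_KEY_ID \
-Dfs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY \
s3a://my-bucket/some/data \
hdfs:///user/$USER/some/data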