Creating a Hadoop archive – the small files problem

HDFS is designed to store and process large data sets (terabytes). Storing a large number of small files in HDFS is inefficient.
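
The inefficiency comes mostly from the NameNode, which keeps every file, directory and block of the namespace in memory. As a rough back-of-the-envelope sketch, the snippet below estimates that cost. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object; that figure, the file count and the function name estimate_heap_mb are illustrative assumptions, not measurements from this cluster.

```shell
#!/bin/sh
# Rough NameNode heap estimate for a large number of small files.
# ASSUMPTION: ~150 bytes of heap per namespace object (rule of thumb, not measured).
BYTES_PER_OBJECT=150

# usage: estimate_heap_mb <number_of_small_files>
estimate_heap_mb() {
    files=$1
    # Each small file costs one file object plus one block object.
    objects=$((files * 2))
    echo $((objects * BYTES_PER_OBJECT / 1024 / 1024))
}

echo "10 million small files: ~$(estimate_heap_mb 10000000) MB of NameNode heap"
```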

Hadoop Archives (HAR) can be used to address the namespace limitations associated with storing many small files. With HAR we can pack a number of small files into larger files while the original files remain transparently accessible.

You can use the following command to create a Hadoop archive:

hadoop archive -archiveName name -p <parent> <src>* <dest>

Example:

[gpadmin@]$ hadoop archive -archiveName ovi.har -p /user/ovidiu  /user/ovi

[gpadmin@sphdmst02 ~]$ hadoop fs -ls /user/ovi
Found 1 items
drwxr-xr-x   - gpadmin hadoop          0 2015-07-23 16:20 /user/ovi/ovi.har

The following example creates an archive using /user/ovidiu as the parent directory; the source directories are given relative to it.

The directories

/user/ovidiu/SIT1

/user/ovidiu/SIT2

/user/ovidiu/SIT3

will be archived in the /user/ovi/ovi2.har archive:

$ hadoop archive -archiveName ovi2.har -p /user/ovidiu/ SIT1 SIT2 SIT3 /user/ovi

[gpadmin@cmtolsphdmst02 ~]$ hadoop fs -ls /user/ovi
Found 2 items
drwxr-xr-x   - gpadmin hadoop          0 2015-07-23 16:20 /user/ovi/ovi.har
drwxr-xr-x   - gpadmin hadoop          0 2015-07-24 11:59 /user/ovi/ovi2.har
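
The two invocations above can be wrapped in a small helper. This is only a minimal sketch: the function name make_har and the DRY_RUN switch are hypothetical conveniences, not part of Hadoop. With DRY_RUN=1 (the default here) the helper only prints the command it would run, so it can be tried without a cluster.

```shell
#!/bin/sh
# Hypothetical wrapper around `hadoop archive`; make_har and DRY_RUN are illustrative names.
DRY_RUN=${DRY_RUN:-1}

# usage: make_har <name.har> <parent> <dest> [src ...]
make_har() {
    if [ "$#" -lt 3 ]; then
        echo "usage: make_har <name.har> <parent> <dest> [src ...]" >&2
        return 1
    fi
    name=$1; parent=$2; dest=$3
    shift 3
    # Remaining arguments are source paths relative to <parent>.
    # With none given, the parent directory itself is archived (as in the ovi.har example).
    if [ "$DRY_RUN" = "1" ]; then
        echo hadoop archive -archiveName "$name" -p "$parent" "$@" "$dest"
    else
        hadoop archive -archiveName "$name" -p "$parent" "$@" "$dest"
    fi
}

make_har ovi.har /user/ovidiu /user/ovi
make_har ovi2.har /user/ovidiu /user/ovi SIT1 SIT2 SIT3
```

Run it with DRY_RUN=0 to actually submit the archive job.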

Looking up files in Hadoop archives

To a client using the HAR filesystem nothing has changed: the original files are still visible and accessible, albeit through a har:// URL.

[gpadmin@sphdmst02 ~]$ hdfs dfs -ls har:///user/ovi/ovi.har/
Found 3 items
-rw-r--r--   2 gpadmin hadoop        125 2015-07-23 16:19 har:///user/ovi/ovi.har/ranking.txt
-rw-r--r--   2 gpadmin hadoop         66 2015-01-14 16:06 har:///user/ovi/ovi.har/test.txt
-rw-r--r--   2 gpadmin hadoop         13 2015-07-23 16:18 har:///user/ovi/ovi.har/test2.txt

 

[gpadmin@sphdmst02 ~]$  hdfs dfs -ls har:///user/ovi/ovi2.har/
Found 3 items
drwxr-xr-x   - gpadmin hadoop          0 2015-07-24 11:54 har:///user/ovi/ovi2.har/SIT1
drwxr-xr-x   - gpadmin hadoop          0 2015-07-24 11:55 har:///user/ovi/ovi2.har/SIT2
drwxr-xr-x   - gpadmin hadoop          0 2015-07-24 11:55 har:///user/ovi/ovi2.har/SIT3
[gpadmin@cmtolsphdmst02 ~]$  hdfs dfs -ls har:///user/ovi/ovi2.har/SIT1
Found 2 items
-rw-r--r--   2 gpadmin hadoop        125 2015-07-24 11:52 har:///user/ovi/ovi2.har/SIT1/ranking.txt
-rw-r--r--   2 gpadmin hadoop         13 2015-07-24 11:54 har:///user/ovi/ovi2.har/SIT1/test2.txt
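
Because the archive is exposed as a filesystem, the usual hdfs dfs commands should work on har:// paths as well. The sketch below reads a single file and copies a directory back out of the archive, which is effectively how an archive is "unarchived". The target directory /user/ovi/restored, the helper names run and har_url, and the DRY_RUN switch are assumptions made for illustration; with DRY_RUN=1 the commands are only printed.

```shell
#!/bin/sh
# DRY_RUN, run, har_url and the /user/ovi/restored target are illustrative assumptions.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Build a har:// URL on the default filesystem from an archive path
# and a path inside the archive.
har_url() {
    echo "har://$1/$2"
}

# Read a single file straight from the archive.
run hdfs dfs -cat "$(har_url /user/ovi/ovi.har test.txt)"

# "Unarchive" a directory by copying it back into plain HDFS.
run hdfs dfs -cp "$(har_url /user/ovi/ovi2.har SIT1)" /user/ovi/restored
```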

 
