HAWQ

August 18, 2015 (updated September 18, 2015), techhadoop

HAWQ is the new benchmark for SQL on Hadoop.

HAWQ is a parallel SQL query engine. It has been designed from the ground up as a massively parallel SQL processing engine, optimized specifically for analytics, with full transaction support.

Pivotal HAWQ is a Massively Parallel Processing (MPP) database that uses several Postgres database instances and HDFS storage.

HAWQ Physical Architecture

HAWQ master server
HDFS NameNode
Segment server
Interconnect Switch

How to locate the logs:

The HAWQ master logs are written under the master data directory (in its pg_log subdirectory), so first find that directory:

[gpadmin@uphdmst02 gpseg-1]$ psql
psql (8.4.20, server 8.2.15)
WARNING: psql version 8.4, server version 8.2.
Some psql features might not work.
Type "help" for help.

gpadmin=# show data_directory;
data_directory
----------------------
/data/master/gpseg-1
(1 row)

Check the HAWQ version:

[gpadmin@uphdmst02 ~]$ psql
psql (8.4.20, server 8.2.15)
WARNING: psql version 8.4, server version 8.2.
Some psql features might not work.
Type "help" for help.

gpadmin=# select version();
version
-----------------------------------------------------------------
PostgreSQL 8.2.15 (Greenplum Database 4.2.0 build 1) (HAWQ 1.2.1.0 build 10335) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Aug 8 2014 16:31:48
(1 row)

Create an external table in UAT:

gpadmin=# CREATE EXTERNAL TABLE person ( id int, name text)
gpadmin-# location('gpfdist://phdmst03.uat.mydev.com:8000/Test/person.txt') FORMAT 'text' (delimiter '|')
gpadmin-# ENCODING 'UTF8';
CREATE EXTERNAL TABLE

gpadmin=# select count(*) from person;
count
---------
1000000
(1 row)

-- External Table: ext_sim_result_value_f

-- DROP EXTERNAL TABLE ext_sim_result_value_f;

CREATE EXTERNAL TABLE ext_sim_result_value_f
(
sim_result_id ,
md_point_id ,
path_num ,
value
)
LOCATION (
'gpfdist://uphdmst03.uat.mydev.com:8000/500041_Sim_Result_Value_F.csv.gz'
)
FORMAT 'text' (delimiter '|' null '\\N' escape '\\')
ENCODING 'UTF8';
ALTER TABLE ext_sim_result_value_f
OWNER TO gpadmin;

Shutdown

Stop the database with gpstop; the end of its output confirms that every segment instance went down cleanly:

20150731:16:56:16:570576 gpstop:uphdmst02:gpadmin-[INFO]:-Commencing parallel segment instance shutdown, please wait...
................
20150731:16:56:32:570576 gpstop:uphdmst02:gpadmin-[INFO]:-----------------------------------------------------
20150731:16:56:32:570576 gpstop:uphdmst02:gpadmin-[INFO]:- Segments stopped successfully = 24
20150731:16:56:32:570576 gpstop:uphdmst02:gpadmin-[INFO]:- Segments with errors during stop = 0
20150731:16:56:32:570576 gpstop:uphdmst02:gpadmin-[INFO]:-----------------------------------------------------
20150731:16:56:32:570576 gpstop:uphdmst02:gpadmin-[INFO]:-Successfully shutdown 24 of 24 segment instances
20150731:16:56:32:570576 gpstop:uphdmst02:gpadmin-[INFO]:-Database successfully shutdown with no errors reported

HAWQ - pg_hba.conf

Client authentication rules live in pg_hba.conf in the master data directory:

vim /data/master/gpseg-1/pg_hba.conf

Before running the management utilities, load the HAWQ environment and point it at the master data directory:

source /usr/local/hawq/greenplum_path.sh

export MASTER_DATA_DIRECTORY=/data/master/gpseg-1

Reload the configuration. gpstop -u makes the running postmaster processes re-read pg_hba.conf without stopping the database:

[gpadmin@uphdmst02 ~]$ gpstop -u
20150805:11:05:40:233745 gpstop:uphdmst02:gpadmin-[INFO]:-Starting gpstop with args: -u
20150805:11:05:40:233745 gpstop:uphdmst02:gpadmin-[INFO]:-Gathering information and validating the environment...
20150805:11:05:40:233745 gpstop:uphdmst02:gpadmin-[INFO]:-Obtaining Greenplum Master catalog information
20150805:11:05:40:233745 gpstop:uphdmst02:gpadmin-[INFO]:-Obtaining Segment details from master...
20150805:11:05:41:233745 gpstop:uphdmst02:gpadmin-[INFO]:-Greenplum Version: 'postgres (HAWQ) 4.2.0 build 1'
20150805:11:05:41:233745 gpstop:uphdmst02:gpadmin-[INFO]:-Signalling all postmaster processes to reload

pg_hba.conf

host    all     gpadmin 192.168.68.135/32        trust
host    all     gpadmin 192.168.68.135/32        trust
host    all     all     192.193.68.132/32        trust
host     all         gpadmin         10.115.yyy.0/24    trust
host     all         gpadmin         10.115.yyy.0/24    trust
host     all         aafsh02          10.115.zzz.0/24   trust
#host     all         all              10.115.xx.xx/32   trust
local    all         gpadmin         ident
host     all         gpadmin         127.0.0.1/28    trust
host     all         gpadmin         192.193.68.134/32       trust
host     all         gpadmin         ::1/128       trust
host     all         gpadmin         fe80::3aea:a7ff:fe35:c0c/128       trust
host     all         gpadmin         192.168.68.136/32       trust
host     all         gpadmin         10.110.xxx.0/24    trust
#host     all         gpadmin         10.115.xxx.0/24    trust
#host     all         gpadmin         10.115.xxx.0/24    trust
host     all         all             10.115.xxx.0/24    ldap ldapserver=xxx.225.227.15 ldapprefix="OFFICE\"
host     all         all             10.110.xxx.0/24     ldap ldapserver=xxx.225.227.15 ldapprefix="OFFICE\"
host     all         all             10.202.xxx.0/24     ldap ldapserver=xxx.225.227.15 ldapprefix="OFFICE\"
host     benchmark   fsh02         10.115.xxx.20/32 ldap ldapserver=xxx.225.227.yyy ldapprefix="OFFICE\"
host     benchmark   fsh02         10.193.xxx.yyy/32 ldap ldapserver=xxx.225.227.yyy ldapprefix="OFFICE\"
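A pg_hba.conf that has been edited by hand over time tends to collect repeated rules. Because the first matching rule wins, repeats are harmless but make the file harder to read. The helper below is a minimal sketch, not part of HAWQ: the function name duplicate_hba_rules is made up for this example. It reads the file on stdin, ignores comments and blank lines, and prints each rule that occurs more than once, after collapsing whitespace.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with HAWQ): print rules that appear more
# than once in a pg_hba.conf-style file read from stdin.
duplicate_hba_rules() {
  awk '
    /^[ \t]*(#|$)/ { next }          # skip comments and blank lines
    {
      $1 = $1                        # rebuild the line with single spaces
      if (++count[$0] == 2) print    # report each repeated rule once
    }
  '
}
```

Usage: duplicate_hba_rules < /data/master/gpseg-1/pg_hba.conf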

Create a user in HAWQ

gpadmin=# CREATE USER ong01 WITH LOGIN ;

gpadmin=# CREATE USER user1 WITH LOGIN ;

gpadmin=# \du
List of roles
 Role name |            Attributes             | Member of
-----------+-----------------------------------+------------
 CM_Admin  | Cannot login                      |
 RD_Admin  | Cannot login                      |
 RD_Users  | Cannot login                      |
 ovi02     | Superuser, Create DB              |
 ong01     |                                   |
 gpadmin   | Superuser, Create role, Create DB |
 user2     |                                   |
 ovi       |                                   |
 rd_user   |                                   | {RD_Users}
 user1     |                                   |

Grant access to the tables:

GRANT ALL PRIVILEGES
ON TABLE sim_result_d2, sim_result_value_f
TO PUBLIC;

With this grant, every role can access the tables I created.

Configure the Capacity Scheduler

August 14, 2015 (updated August 19, 2015), techhadoop

The CapacityScheduler is designed to run Hadoop applications as a shared, multi-tenant cluster in an operator-friendly manner while maximizing the throughput and the utilization of the cluster.

It allows several organizations to share one large cluster while giving each of them a minimum capacity guarantee.

To configure the ResourceManager to use the CapacityScheduler, set the following property in conf/yarn-site.xml:

<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

Each child queue is tied to its parent queue with the yarn.scheduler.capacity.<queue-path>.queues configuration property in the capacity-scheduler.xml file:

<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>default</value>
<description>
The queues at this level (root is the root queue).
</description>
</property>

The CapacityScheduler reads this file at startup. After you modify capacity-scheduler.xml, reload the settings by running the following command:

yarn rmadmin -refreshQueues
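When a parent queue gets more than one child, each child also needs a yarn.scheduler.capacity.<queue-path>.capacity value, and the capacities of the children under one parent must add up to 100. The sketch below generates the property elements for one parent and refuses to emit anything if the sum is wrong. It is only an illustration: the function name emit_queue_properties and the queue names prod and dev are assumptions, and it handles whole-number percentages only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit capacity-scheduler.xml properties for the children
# of one parent queue. Integer capacities only (bash arithmetic).
# Usage: emit_queue_properties <parent-path> <name:capacity>...
emit_queue_properties() {
  local parent="$1"; shift
  local names="" total=0 pair name cap
  for pair in "$@"; do
    name="${pair%%:*}"                 # text before the colon
    cap="${pair##*:}"                  # text after the colon
    names="${names:+$names,}$name"     # comma-separated list of children
    total=$(( total + cap ))
  done
  # Child capacities under one parent must sum to 100 percent.
  if [ "$total" -ne 100 ]; then
    echo "capacities under $parent sum to $total, expected 100" >&2
    return 1
  fi
  printf '<property>\n<name>yarn.scheduler.capacity.%s.queues</name>\n<value>%s</value>\n</property>\n' \
    "$parent" "$names"
  for pair in "$@"; do
    name="${pair%%:*}"
    cap="${pair##*:}"
    printf '<property>\n<name>yarn.scheduler.capacity.%s.%s.capacity</name>\n<value>%s</value>\n</property>\n' \
      "$parent" "$name" "$cap"
  done
}
```

For example, emit_queue_properties root prod:70 dev:30 prints three property elements that can be pasted into capacity-scheduler.xml before running the refresh command.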

After the command completes successfully, you can verify that the queues are set up with the command below (the hadoop queue form still works but is deprecated in favor of mapred queue, as the output notes):

-bash-4.1$ hadoop queue -list
DEPRECATED: Use of this script to execute mapred command is deprecated.
Instead use the mapred command for it.

15/08/14 16:31:20 INFO client.RMProxy: Connecting to ResourceManager at sphdmst03.dev.bmocm.com/192.168.68.131:8032
======================
Queue Name : default
Queue State : running
Scheduling Info : Capacity: 100.0, MaximumCapacity: 100.0, CurrentCapacity: 0.0

Use the command below to identify the queues to which you can submit jobs:

-bash-4.1$ hadoop queue -showacls
DEPRECATED: Use of this script to execute mapred command is deprecated.
Instead use the mapred command for it.

15/08/14 16:35:11 INFO client.RMProxy: Connecting to ResourceManager at sphdmst03.dev.bmocm.com/192.168.68.131:8032
Queue acls for user : gpadmin

Queue Operations
=====================
root ADMINISTER_QUEUE,SUBMIT_APPLICATIONS
default ADMINISTER_QUEUE,SUBMIT_APPLICATIONS

Hadoop Certifications

August 12, 2015 (updated January 20, 2016), techhadoop

Hadoop system administrator certifications

1. Cloudera: http://www.cloudera.com/content/cloudera/en/training/certification/ccah/prep.html

2. Hortonworks

Installation

Configure a local HDP repository
Install ambari-server and ambari-agent
Install HDP using the Ambari install wizard
Add a new node to an existing cluster
Decommission a node
Add an HDP service to a cluster using Ambari

Configuration

Troubleshooting

Restart an HDP service

View an application’s log file

Configure and manage alerts

Troubleshoot a failed job

High Availability

Configure NameNode HA

Configure ResourceManager HA

Copy data between two clusters using distcp

Create a snapshot of an HDFS directory

Recover a snapshot

Configure HiveServer2 HA

Security

Install and configure Knox

Install and configure Ranger

Configure HDFS ACLs

Configure Hadoop for Kerberos

Reference

Hortonworks http://hortonworks.com/training/class/hdp-certified-administrator-hdpca-exam/

HDFS Snapshots

August 11, 2015 (updated September 21, 2015), techhadoop

Make your HDFS directory snapshottable, in our case /test4:

[gpadmin@sphdmst01 tmp]$ hdfs dfsadmin -allowSnapshot /test4
Allowing snaphot on /test4 succeeded

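To disable snapshots on the directory again (all of its snapshots must be deleted first):

[gpadmin@sphdmst01 tmp]$ hdfs dfsadmin -disallowSnapshot /test4
Disallowing snaphot on /test4 succeeded

Create a snapshot (the directory must be snapshottable, so run -allowSnapshot again if you disabled it):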

[gpadmin@sphdmst01 ]$ hdfs dfs -createSnapshot /test4 first-snapshot
Created snapshot /test4/.snapshot/first-snapshot

[gpadmin@sphdmst01 tmp]$ hdfs dfs -ls -R /test4/.snapshot
drwxr-xr-x   - gpadmin hadoop          0 2015-08-11 15:28 /test4/.snapshot/first-snapshot
-rw-r--r--   2 gpadmin hadoop      14515 2015-01-12 10:05 /test4/.snapshot/first-snapshot/Hadoop Servers.xlsx
-rw-r--r--   2 gpadmin hadoop          0 2015-01-12 10:04 /test4/.snapshot/first-snapshot/Hadoop_prod.xlsx
-rw-r--r--   2 gpadmin hadoop       4322 2015-01-12 10:08 /test4/.snapshot/first-snapshot/check_hadoop-dfs.sh
-rw-r--r--   2 gpadmin hadoop          0 2015-01-12 10:04 /test4/.snapshot/first-snapshot/pgadmin.log

You can read the content of the file

[gpadmin@sphdmst01 tmp]$ hdfs dfs -cat /test4/.snapshot/first-snapshot/check_hadoop-dfs.sh

Recover the file from the snapshot

[gpadmin@sphdmst01 /]$ hdfs dfs -cp /test4/.snapshot/first-snapshot/check_hadoop-dfs.sh /ovitest

[gpadmin@cmtolsphdmst01 /]$ hdfs dfs -ls /ovitest
Found 6 items
-rw-r--r--   2 gpadmin hadoop       4322 2015-08-11 15:37 /ovitest/check_hadoop-dfs.sh
-rw-r--r--   2 gpadmin hadoop         66 2015-01-13 16:31 /ovitest/test.txt
-rw-r--r--   2 gpadmin hadoop         66 2015-01-13 17:09 /ovitest/test2.txt
-rw-r--r--   2 gpadmin hadoop         66 2015-01-13 17:10 /ovitest/test3.txt
-rw-r--r--   2 gpadmin hadoop         66 2015-01-14 10:52 /ovitest/test4.txt
-rw-r--r--   2 gpadmin hadoop         66 2015-01-14 10:53 /ovitest/test5.txt

Another example:

[gpadmin@sphdmst01 ~]$ hdfs dfs -mkdir /test_snapshot

[gpadmin@sphdmst01 ~]$ hdfs dfs -put dfs-old-lsr-1.log /test_snapshot
[gpadmin@sphdmst01 ~]$ hdfs dfs -put dfs-old-fsck-1.log /test_snapshot

[gpadmin@sphdmst01 ~]$ hdfs dfs -ls /test_snapshot
Found 2 items
-rw-r--r-- 2 gpadmin hadoop 45341 2015-09-14 09:39 /test_snapshot/dfs-old-fsck-1.log
-rw-r--r-- 2 gpadmin hadoop 83862 2015-09-14 09:38 /test_snapshot/dfs-old-lsr-1.log

[gpadmin@sphdmst01 ~]$ hdfs dfs -createSnapshot /test_snapshot snapshot_dir
createSnapshot: Directory is not a snapshottable directory: /test_snapshot

[gpadmin@sphdmst01 ~]$ hdfs dfsadmin -allowSnapshot /test_snapshot
Allowing snaphot on /test_snapshot succeeded

[gpadmin@sphdmst01 ~]$ hdfs dfs -createSnapshot /test_snapshot snapshot_dir
Created snapshot /test_snapshot/.snapshot/snapshot_dir

A snapshot is read-only: HDFS protects its contents against deletion or modification by users and applications.
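Recovering a file always follows the same pattern: build the path under .snapshot and copy it back with hdfs dfs -cp. The sketch below wraps that pattern. The function names snapshot_path and restore_from_snapshot and the DRY_RUN variable are made up for this example; with DRY_RUN=1 the copy command is only printed, which is handy for checking the path before touching the cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of Hadoop.

# Build the read-only path of a file inside a snapshot.
# Usage: snapshot_path <snapshottable-dir> <snapshot-name> <relative-file>
snapshot_path() {
  local dir="${1%/}" snap="$2" rel="${3#/}"   # trim a trailing/leading slash
  printf '%s/.snapshot/%s/%s\n' "$dir" "$snap" "$rel"
}

# Copy one file out of a snapshot to a destination in HDFS.
# With DRY_RUN=1 the hdfs command is printed instead of executed.
# Usage: restore_from_snapshot <dir> <snapshot-name> <relative-file> <dest>
restore_from_snapshot() {
  local src
  src="$(snapshot_path "$1" "$2" "$3")"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "hdfs dfs -cp $src $4"
  else
    hdfs dfs -cp "$src" "$4"
  fi
}
```

Usage: DRY_RUN=1 restore_from_snapshot /test4 first-snapshot check_hadoop-dfs.sh /ovitest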

Creating a Hadoop archive – the small files problem

July 23, 2015 (updated August 12, 2015), techhadoop

HDFS is designed to store and process large data sets (terabytes and more). Storing a large number of small files in HDFS is inefficient, because the NameNode keeps an in-memory entry for every file and block regardless of its size.

Hadoop Archives (HAR) can be used to address the namespace limitations associated with storing many small files. With HAR we can pack a number of small files into large files while the original files remain transparently accessible.
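Before archiving, it helps to know how many small files a directory tree actually holds. The sketch below counts them from a recursive listing. It is an illustration only: the function name count_small_files is made up, and the 1 MiB default threshold is an arbitrary assumption, not a Hadoop setting.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count regular files below a size threshold in
# `hdfs dfs -ls -R` output read from stdin.
# Usage: hdfs dfs -ls -R /user/ovidiu | count_small_files [limit-in-bytes]
count_small_files() {
  local limit="${1:-1048576}"        # assumed default: 1 MiB
  awk -v limit="$limit" '
    # Regular files start with "-" in the permissions column;
    # field 5 of the listing is the size in bytes.
    $1 ~ /^-/ { total++ }
    $1 ~ /^-/ && $5 + 0 < limit { small++ }
    END { printf "%d of %d files are smaller than %d bytes\n", small, total, limit }
  '
}
```

Directories and the "Found N items" header are ignored because they do not start with "-".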

You can use the following command to create a Hadoop archive:

hadoop archive -archiveName name -p <parent> <src>* <dest>

Example:

[gpadmin@]$ hadoop archive -archiveName ovi.har -p /user/ovidiu /user/ovi

[gpadmin@sphdmst02 ~]$ hadoop fs -ls /user/ovi
Found 1 items
drwxr-xr-x - gpadmin hadoop 0 2015-07-23 16:20 /user/ovi/ovi.har

The following example creates an archive using /user/ovidiu as the parent directory to which the source paths are relative. The directories

/user/ovidiu/SIT1

/user/ovidiu/SIT2

/user/ovidiu/SIT3

will be archived in the /user/ovi/ovi2.har archive:

$ hadoop archive -archiveName ovi2.har -p /user/ovidiu/ SIT1 SIT2 SIT3 /user/ovi

[gpadmin@cmtolsphdmst02 ~]$ hadoop fs -ls /user/ovi
Found 2 items
drwxr-xr-x - gpadmin hadoop 0 2015-07-23 16:20 /user/ovi/ovi.har
drwxr-xr-x - gpadmin hadoop 0 2015-07-24 11:59 /user/ovi/ovi2.har

Looking up files in Hadoop archives

To a client using the HAR filesystem nothing has changed: the original files are accessible and visible (albeit using a har:// URL).

[gpadmin@sphdmst02 ~]$ hdfs dfs -ls har:///user/ovi/ovi.har/
Found 3 items
-rw-r--r--   2 gpadmin hadoop        125 2015-07-23 16:19 har:///user/ovi/ovi.har/ranking.txt
-rw-r--r--   2 gpadmin hadoop         66 2015-01-14 16:06 har:///user/ovi/ovi.har/test.txt
-rw-r--r--   2 gpadmin hadoop         13 2015-07-23 16:18 har:///user/ovi/ovi.har/test2.txt

[gpadmin@sphdmst02 ~]$ hdfs dfs -ls har:///user/ovi/ovi2.har/
Found 3 items
drwxr-xr-x   - gpadmin hadoop          0 2015-07-24 11:54 har:///user/ovi/ovi2.har/SIT1
drwxr-xr-x   - gpadmin hadoop          0 2015-07-24 11:55 har:///user/ovi/ovi2.har/SIT2
drwxr-xr-x   - gpadmin hadoop          0 2015-07-24 11:55 har:///user/ovi/ovi2.har/SIT3

[gpadmin@cmtolsphdmst02 ~]$ hdfs dfs -ls har:///user/ovi/ovi2.har/SIT1
Found 2 items
-rw-r--r--   2 gpadmin hadoop        125 2015-07-24 11:52 har:///user/ovi/ovi2.har/SIT1/ranking.txt
-rw-r--r--   2 gpadmin hadoop         13 2015-07-24 11:54 har:///user/ovi/ovi2.har/SIT1/test2.txt

Rack awareness

June 23, 2015 (updated September 15, 2015), techhadoop, cluster nodes, rack awareness

Hadoop divides the data into multiple file blocks and stores them on different machines. If rack awareness is not configured, Hadoop may place all the copies of a block in the same rack, which results in data loss when that rack fails.

Below are the steps to configure the rack awareness policy manually:

** Stop the cluster.

** Copy the two files rack_topology.sh (the rack topology script) and topology.data to the /etc/gphd/hadoop/conf directory on all cluster NameNodes (phdmst01 and phdmst02).

** Add the following property to core-site.xml:

<property>
<name>net.topology.script.file.name</name>
<value>/etc/gphd/hadoop/conf/rack_topology.sh</value>
</property>

[root@phdmst01 conf]# pwd

/etc/gphd/hadoop/conf

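Rack topology script

[root@phdmst01 conf]# more rack_topology.sh
#!/bin/bash
# Print the rack of every node passed as an argument, looked up in topology.data.
HADOOP_CONF=/etc/gphd/hadoop/conf

while [ $# -gt 0 ] ; do
  nodeArg=$1
  # Re-open the data file on stdin for each node.
  exec< ${HADOOP_CONF}/topology.data
  result=""
  while read line ; do
    ar=( $line )
    if [ "${ar[0]}" = "$nodeArg" ] ; then
      result="${ar[1]}"
    fi
  done
  shift
  # Nodes that are not listed fall back to the default rack.
  if [ -z "$result" ] ; then
    echo -n "/default/rack "
  else
    echo -n "$result "
  fi
done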

[root@phdmst01 conf]# more topology.data
192.168.129.56 /bcc/rack1
192.168.129.57 /bcc/rack1
192.168.129.58 /bcc/rack1
192.168.129.59 /bcc/rack2
192.168.129.60 /bcc/rack2
192.168.129.61 /bcc/rack2
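A mistyped address in topology.data does not raise an error: the topology script simply finds no match and reports the default rack for that node. A quick check of the file before restarting the cluster can catch this. The sketch below assumes the file is keyed by IPv4 address, as in the listing above; the function name check_topology_data is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a topology.data-style file on stdin and print every
# line that is not "<valid IPv4 address> <rack starting with />".
# Exit status is 1 if any bad line was found, 0 otherwise.
check_topology_data() {
  awk '
    NF == 0 { next }                                   # skip blank lines
    {
      ok = (NF == 2 && $2 ~ /^\// && split($1, o, "[.]") == 4)
      for (i = 1; ok && i <= 4; i++)                   # each octet must be 0..255
        if (o[i] !~ /^[0-9]+$/ || o[i] + 0 > 255) ok = 0
      if (!ok) { print "line " NR ": " $0; bad = 1 }
    }
    END { exit bad }
  '
}
```

Usage: check_topology_data < /etc/gphd/hadoop/conf/topology.data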

Verify rack awareness

The hdfs dfsadmin -printTopology command shows the topology:

-bash-4.1$ hdfs dfsadmin -printTopology
Rack: /bcc/rack1
192.168.129.56:50010 (phddna01.mydev.com)
192.168.129.57:50010 (phddna02.mydev.com)
192.168.129.58:50010 (phddna03.mydev.com)

Rack: /bcc/rack2
192.168.129.59:50010 (phddnb01.mydev.com)
192.168.129.60:50010 (phddnb02.mydev.com)
192.168.129.61:50010 (phddnb03.mydev.com)

You can also check with the following commands:

- hadoop fsck / (the summary reports the number of racks)

- hdfs dfsadmin -report

2) Configure rack awareness with Ambari

Setting up the HDFS NFS Gateway

June 19, 2015 (updated August 21, 2015), techhadoop

The NFS Gateway supports NFSv3 and allows HDFS to be mounted as part of the client's local file system.

Set up NFS Gateway to access HDFS data

Install the HDFS NFS packages:

# yum install hadoop-hdfs-nfs3.x86_64

# yum install hadoop-hdfs-portmap

Start the portmap and NFS gateway daemons, portmap first, using either the service command:

$ sudo service hadoop-hdfs-portmap start

$ sudo service hadoop-hdfs-nfs3 start

or the init scripts directly:

$ sudo /etc/init.d/hadoop-hdfs-portmap start

$ sudo /etc/init.d/hadoop-hdfs-nfs3 start

Verify that the NFS-related services are registered and that the export is visible:

[root@phdmst04 ~]# rpcinfo -p phdmst04
program vers proto port service
100005 1 tcp 4242 mountd
100000 2 udp 111 portmapper
100005 3 tcp 4242 mountd
100005 2 udp 4242 mountd
100003 3 tcp 2049 nfs
100000 2 tcp 111 portmapper
100005 3 udp 4242 mountd
100005 1 udp 4242 mountd
100005 2 tcp 4242 mountd

[root@phdmst04 ~]# showmount -e phdmst04
Export list for phdmst04:
/ 192.168.129.55/255.255.255.0
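The rpcinfo listing is easy to misread by eye, so the check can be scripted. The sketch below reads rpcinfo -p output on stdin and names any of the three services the gateway relies on (portmapper, mountd, nfs) that are not registered. The function name missing_nfs_services is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `rpcinfo -p <host>` output on stdin and print the
# required services that are missing. Exit status is 1 if any is missing.
missing_nfs_services() {
  awk '
    { seen[$5] = 1 }                 # column 5 is the service name
    END {
      n = split("portmapper mountd nfs", need, " ")
      for (i = 1; i <= n; i++)
        if (!(need[i] in seen)) { print need[i]; missing = 1 }
      exit missing
    }
  '
}
```

Usage: rpcinfo -p phdmst04 | missing_nfs_services && echo "all NFS services registered"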

Mount the export:

# mount -t nfs -o vers=3,proto=tcp,nolock,noatime phdmst04:/ /data/hdfs_mnt

[root@phdmst04 ~]# df -h
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/vg_cmri-lv_root  202G   11G  182G   6% /
tmpfs                         95G     0   95G   0% /dev/shm
/dev/sda1                    485M   66M  394M  15% /boot
/dev/mapper/vg_cmri-lv_home  9.9G  2.1G  7.3G  23% /home
/dev/mapper/vg_data-lv_data  493G  243M  467G   1% /data
phdmst04:/                   323T  3.3T  320T   2% /data/hdfs_mnt

Troubleshooting

check nfs3 and portmap status

[root@blpphdmst04 init.d]# ./hadoop-hdfs-portmap status
portmap is stopped
[root@blpphdmst04 init.d]# ./hadoop-hdfs-portmap start
starting portmap, logging to /var/log/gphd/hadoop-hdfs/hadoop-hdfs-portmap-blpphdmst04.mydev.com.out
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

[ OK ]

[root@blpphdmst04 init.d]# ./hadoop-hdfs-nfs3 status
nfs3 is stopped

[root@blpphdmst04 init.d]# ./hadoop-hdfs-nfs3 start
starting nfs3, logging to /var/log/gphd/hadoop-hdfs/hadoop-hdfs-nfs3-blpphdmst04.mydev.com.out
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

[ OK ]
[root@blpphdmst04 init.d]# mount -t nfs -o vers=3,proto=tcp,nolock,noatime blpphdmst04:/ /data/hdfs_mnt
[root@blpphdmst04 init.d]# df -h
Filesystem            Size Used Avail Use% Mounted on
/dev/mapper/vg_cmri-lv_root
202G   16G 176G   9% /
tmpfs                  95G     0   95G   0% /dev/shm
/dev/sda1             477M   89M 363M 20% /boot
/dev/mapper/vg_cmri-lv_home
9.8G 2.0G 7.3G 22% /home
/dev/mapper/vg_data-lv_data
493G 117M 467G   1% /data
blpphdmst04:/      269T 2.9T 266T   2% /data/hdfs_mnt

Sensitive Data Redaction

June 18, 2015 techhadoop

Hadoop Commands

June 4, 2015 (updated September 18, 2015), techhadoop

Create a directory in HDFS

$ hdfs dfs -mkdir /user/mike

-bash-4.1$ hadoop fs -mkdir hdfs://sphdmst01.dev.com/user/ovi/test
-bash-4.1$ hadoop fs -ls hdfs://sphdmst01.dev.com/user/ovi/
Found 3 items
drwxr-xr-x   - gpadmin hadoop          0 2015-07-23 16:20 hdfs://sphdmst01.dev.com/user/ovi/ovi.har
drwxr-xr-x   - gpadmin hadoop          0 2015-07-24 11:59 hdfs://sphdmst01.dev.com/user/ovi/ovi2.har
drwxr-xr-x   - gpadmin hadoop          0 2015-09-18 16:43 hdfs://sphdmst01.dev.com/user/ovi/test

Copy files from the local file system to HDFS

$ hadoop fs -put test.txt /user/mike/

Download a file from HDFS to the local file system

$ hadoop fs -get /user/mike/test.txt /home

List the contents of a directory

$ hdfs dfs -ls /user/mike
Found 1 items
-rw-r--r-- 3 gpadmin hadoop 15 2015-06-04 11:04 /user/mike/test.txt

Display the contents of a file

$ hdfs dfs -cat /user/mike/test.txt
just a test

Delete a file (it is moved to the trash)

$ hdfs dfs -rm /user/mike/test.txt
15/06/04 11:40:00 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 86400000 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://dev/user/mike/test.txt' to trash at: hdfs://dev/user/gpadmin/.Trash/Current

Merge the files in an HDFS source directory into a single file on the local file system

$ hadoop fs -put test1.txt /user/mike
$ hadoop fs -put test2.txt /user/mike

$ hadoop fs -ls /user/mike
Found 2 items
-rw-r--r-- 3 gpadmin hadoop 26 2015-06-09 11:10 /user/mike/test1.txt
-rw-r--r-- 3 gpadmin hadoop 28 2015-06-09 11:10 /user/mike/test2.txt

$ hadoop fs -getmerge /user/mike /tmp/output.txt

$ more /tmp/output.txt
just a test
just a test
just a test2
just a test2

Check the file system

$ hadoop fsck /

.......................................................................

Total size:    1252660561619 B
Total dirs:    784
Total files:   43391
Total symlinks:                0 (Files currently being written: 6)
Total blocks (validated):      23155 (avg. block size 54098922 B) (Total open file blocks (not validated): 1)
Minimally replicated blocks:   23155 (100.0 %)
Over-replicated blocks:        0 (0.0 %)
Under-replicated blocks:       0 (0.0 %)
Mis-replicated blocks:         0 (0.0 %)
Default replication factor:    3
Average block replication:     3.0
Corrupt blocks:                0
Missing replicas:              0 (0.0 %)
Number of data-nodes:          4
Number of racks:               1
FSCK ended at Thu Jun 04 10:54:29 EDT 2015 in 1544 milliseconds

The filesystem under path '/' is HEALTHY

To view a list of all the blocks and their locations, use:

$ hadoop fsck / -files -blocks -locations
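For a scheduled health check, the fsck summary can be reduced to a single pass/fail result. The sketch below reads fsck output on stdin and succeeds only if the report says HEALTHY and shows no corrupt or under-replicated blocks. This is stricter than fsck itself, which still reports HEALTHY when blocks are merely under-replicated. The function name fsck_is_clean is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize `hadoop fsck <path>` output read from stdin.
# Exit status 0 means HEALTHY with no corrupt and no under-replicated blocks.
fsck_is_clean() {
  awk '
    $1 == "Corrupt" && $2 == "blocks:"          { corrupt = $3 + 0 }
    $1 == "Under-replicated" && $2 == "blocks:" { under = $3 + 0 }
    /is HEALTHY/                                { healthy = 1 }
    END {
      printf "healthy=%d corrupt=%d under_replicated=%d\n", healthy, corrupt, under
      exit !(healthy && corrupt == 0 && under == 0)
    }
  '
}
```

Usage: hadoop fsck / | fsck_is_clean || echo "HDFS needs attention"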

$ hdfs dfs
Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] <localsrc> ... <dst>]
[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] <path> ...]
[-cp [-f] [-p] <src> ... <dst>]
[-createSnapshot <snapshotDir> [<snapshotName>]]
[-deleteSnapshot <snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] <path> ...]
[-expunge]
[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getmerge [-nl] <src> <localdst>]
[-help [cmd ...]]
[-ls [-d] [-h] [-R] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] <localsrc> ... <dst>]
[-renameSnapshot <snapshotDir> <oldName> <newName>]
[-rm [-f] [-r|-R] [-skipTrash] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touchz <path> ...]
[-usage [cmd ...]]

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

Hadoop dfsadmin Command Options

$ hdfs dfsadmin
Usage: java DFSAdmin
Note: Administrative commands can only be run as the HDFS superuser.
[-report]
[-safemode enter | leave | get | wait]
[-allowSnapshot <snapshotDir>]
[-disallowSnapshot <snapshotDir>]
[-saveNamespace]
[-rollEdits]
[-restoreFailedStorage true|false|check]
[-refreshNodes]
[-finalizeUpgrade]
[-metasave filename]
[-refreshServiceAcl]
[-refreshUserToGroupsMappings]
[-refreshSuperUserGroupsConfiguration]
[-printTopology]
[-refreshNamenodes datanodehost:port]
[-deleteBlockPool datanode-host:port blockpoolId [force]]
[-setQuota <quota> <dirname>...<dirname>]
[-clrQuota <dirname>...<dirname>]
[-setSpaceQuota <quota> <dirname>...<dirname>]
[-clrSpaceQuota <dirname>...<dirname>]
[-setBalancerBandwidth <bandwidth in bytes per second>]
[-fetchImage <local directory>]
[-help [cmd]]

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

$ hdfs dfsadmin -safemode get
Safe mode is OFF

Hadoop haadmin Command Options

$ hdfs haadmin
Usage: DFSHAAdmin [-ns <nameserviceId>]
[-transitionToActive <serviceId>]
[-transitionToStandby <serviceId>]
[-failover [--forcefence] [--forceactive] <serviceId> <serviceId>]
[-getServiceState <serviceId>]
[-checkHealth <serviceId>]
[-help <command>]

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

Example:

$ hdfs haadmin -getServiceState nn1
active

$ hdfs haadmin -getServiceState nn2
standby

$ hdfs haadmin -checkHealth nn1
$ hdfs haadmin -checkHealth nn2

[gpadmin@phdmst01 ~]$ hdfs getconf
hdfs getconf is utility for getting configuration information from the config file.

hadoop getconf
[-namenodes]                    gets list of namenodes in the cluster.
[-secondaryNameNodes]                   gets list of secondary namenodes in the cluster.
[-backupNodes]                  gets list of backup nodes in the cluster.
[-includeFile]                  gets the include file path that defines the datanodes that can join the cluster.
[-excludeFile]                  gets the exclude file path that defines the datanodes that need to decommissioned.
[-nnRpcAddresses]                       gets the namenode rpc addresses
[-confKey [key]]                        gets a specific key from the configuration

Example:

[gpadmin@phdmst01 ~]$ hdfs getconf -namenodes
phdmst01.mydev.com phdmst02.mydev.com

[gpadmin@phdmst01 ~]$ hdfs getconf -nnRpcAddresses
phdmst01.mydev.com:8020
phdmst02.mydev.com:8020

Yarn

$ yarn node -list
15/06/05 14:26:11 INFO client.RMProxy: Connecting to ResourceManager at phdmst03.mydev.com/192.168.68.131:8032
Total Nodes:2
Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
phddnb01.mydev.com:16638              RUNNING        phddnb01.mydev.com:8042                                  0
phddna01.mydev.com:58002              RUNNING        phddna01.mydev.com:8042                                  0

$ yarn node -status phddnb01.mydev.com:16638
15/06/05 14:31:03 INFO client.RMProxy: Connecting to ResourceManager at phdmst03.mydev.com/10.193.68.131:8032
Node Report :
Node-Id : phddnb01.mydev.com:16638
Rack : /default-rack
Node-State : RUNNING
Node-Http-Address : phddnb01.mydev.com:8042
Last-Health-Update : Fri 05/Jun/15 02:29:06:575EDT
Health-Report :
Containers : 0
Memory-Used : 0MB
Memory-Capacity : 8192MB
CPU-Used : 0 vcores
CPU-Capacity : 8 vcores

$ yarn
Usage: yarn [--config confdir] COMMAND
where COMMAND is one of:
resourcemanager      run the ResourceManager
nodemanager          run a nodemanager on each slave
rmadmin              admin tools
version              print the version
jar <jar>            run a jar file
application          prints application(s) report/kill application
node                 prints node report(s)
logs                 dump container logs
classpath            prints the class path needed to get the Hadoop jar and the required libraries
daemonlog            get/set the log level for each daemon
or
CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

$ yarn version
Hadoop 2.2.0-gphd-3.0.1.0
Source code repository: ssh://git@stash.greenplum.com:2222/phd/hadoop.git -r 3055df0b53cf992665913380a1651345c477a0d2
Compiled by pivotal on 2014-04-14T03:38Z
Compiled with protoc 2.5.0
From source with checksum 93b8d74f534acdc126e8575bba69fc70
This command was run using /usr/lib/gphd/hadoop/hadoop-common-2.2.0-gphd-3.0.1.0.jar

$ yarn rmadmin
Usage: java RMAdmin
[-refreshQueues]
[-refreshNodes]
[-refreshUserToGroupsMappings]
[-refreshSuperUserGroupsConfiguration]
[-refreshAdminAcls]
[-refreshServiceAcl]
[-getGroups [username]]
[-updateNodeResource [NodeID][MemSize][Cores]]
[-help [cmd]]

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

$ yarn rmadmin -getGroups gpadmin

15/06/18 11:37:27 INFO client.RMProxy: Connecting to ResourceManager at phdmst03.mydev.com/192.168.68.135:8033
gpadmin : gpadmin hadoop

Hadoop world !

June 3, 2015 (updated August 13, 2015), techhadoop

Hadoop world !

Infra Cloud Solutions

Year: 2015
