HAWQ

HAWQ is the new benchmark for SQL on Hadoop.

HAWQ is a parallel SQL query engine, designed from the ground up as a massively parallel SQL processing engine optimized for analytics, with full transaction support.

Pivotal HAWQ is a Massively Parallel Processing (MPP) database that combines several PostgreSQL database instances with HDFS storage.

HAWQ Physical Architecture

  1. Hawq Master server
  2. HDFS NameNode
  3. Segment server
  4. Interconnect Switch

How to locate the logs:

HAWQ master logs:

[gpadmin@uphdmst02 gpseg-1]$ psql
psql (8.4.20, server 8.2.15)
WARNING: psql version 8.4, server version 8.2.
Some psql features might not work.
Type "help" for help.

gpadmin=# show data_directory;
data_directory
----------------------
/data/master/gpseg-1
(1 row)
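The master's server log files live under this data directory, in its pg_log subdirectory, rotated daily with a date-stamped name. A small sketch of building today's log path (the directory comes from the session above; the gpdb-YYYY-MM-DD naming is the usual Greenplum convention and is an assumption here):

```shell
# Build the expected path of today's HAWQ master log.
# /data/master/gpseg-1 is the data_directory reported above;
# the gpdb-<date> file naming follows Greenplum convention (assumed).
MASTER_DATA_DIRECTORY=/data/master/gpseg-1
TODAY=$(date +%Y-%m-%d)
echo "${MASTER_DATA_DIRECTORY}/pg_log/gpdb-${TODAY}_*.csv"
```

You can then tail the matching file for startup errors or failed connections.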

[gpadmin@uphdmst02 ~]$ psql
psql (8.4.20, server 8.2.15)
WARNING: psql version 8.4, server version 8.2.
Some psql features might not work.
Type "help" for help.

gpadmin=# select version();
version

-----------------------------------------------------------------------------------------------------------------
 PostgreSQL 8.2.15 (Greenplum Database 4.2.0 build 1) (HAWQ 1.2.1.0 build 10335) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Aug  8 2014 16:31:48
(1 row)

Create an external table in UAT

gpadmin=# CREATE EXTERNAL TABLE person (id int, name text)

gpadmin-# LOCATION ('gpfdist://phdmst03.uat.mydev.com:8000/Test/person.txt') FORMAT 'text' (delimiter '|')

gpadmin-# ENCODING 'UTF8';

CREATE EXTERNAL TABLE

gpadmin=# select count(*) from person;

count

———

1000000

(1 row)

gpadmin=#
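For reference, the person.txt file behind that external table is just a pipe-delimited text file served by gpfdist. A hypothetical way to produce a small version of it (the /tmp path and the row values are invented for illustration; the real file held the 1,000,000 rows counted above):

```shell
# Create a tiny pipe-delimited load file matching (id int, name text).
# Path and contents are illustrative, not the original data set.
mkdir -p /tmp/Test
for i in 1 2 3; do
  printf '%s|name_%s\n' "$i" "$i"
done > /tmp/Test/person.txt
cat /tmp/Test/person.txt
# gpfdist would then serve the directory, e.g.:
#   gpfdist -d /tmp/Test -p 8000 -l /tmp/gpfdist.log &
```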

-- External Table: ext_sim_result_value_f

-- DROP EXTERNAL TABLE ext_sim_result_value_f;

CREATE EXTERNAL TABLE ext_sim_result_value_f
(
  sim_result_id int,    -- column types were missing in the original notes;
  md_point_id   int,    -- the types shown here are assumed for illustration
  path_num      int,
  value         float8
)
LOCATION (
  'gpfdist://uphdmst03.uat.mydev.com:8000/500041_Sim_Result_Value_F.csv.gz'
)
FORMAT 'text' (delimiter '|' null '\\N' escape '\\')
ENCODING 'UTF8';
ALTER TABLE ext_sim_result_value_f
  OWNER TO gpadmin;
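Note that the LOCATION above points at a .csv.gz file: gpfdist decompresses gzip files transparently, so load files can be shipped compressed. A sketch of producing such a file (the /tmp path and column values are invented; columns mirror sim_result_id|md_point_id|path_num|value):

```shell
# Build a small gzip-compressed, pipe-delimited load file.
# Values are made up for illustration only.
printf '1|42|1|0.953\n1|42|2|1.104\n' > /tmp/Sim_Result_Value_F.csv
gzip -f /tmp/Sim_Result_Value_F.csv
# Confirm what gpfdist would stream out of the compressed file:
gunzip -c /tmp/Sim_Result_Value_F.csv.gz
```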

Shutdown

20150731:16:56:16:570576 gpstop:uphdmst02:gpadmin-[INFO]:-Commencing parallel segment instance shutdown, please wait…
…………….
20150731:16:56:32:570576 gpstop:uphdmst02:gpadmin-[INFO]:-----------------------------------------------------
20150731:16:56:32:570576 gpstop:uphdmst02:gpadmin-[INFO]:-   Segments stopped successfully      = 24
20150731:16:56:32:570576 gpstop:uphdmst02:gpadmin-[INFO]:-   Segments with errors during stop   = 0
20150731:16:56:32:570576 gpstop:uphdmst02:gpadmin-[INFO]:-----------------------------------------------------
20150731:16:56:32:570576 gpstop:uphdmst02:gpadmin-[INFO]:-Successfully shutdown 24 of 24 segment instances
20150731:16:56:32:570576 gpstop:uphdmst02:gpadmin-[INFO]:-Database successfully shutdown with no errors reported

HAWQ – pg_hba.conf

vim /data/master/gpseg-1/pg_hba.conf

source /usr/local/hawq/greenplum_path.sh

export MASTER_DATA_DIRECTORY=/data/master/gpseg-1

Reload the configuration:

[gpadmin@uphdmst02 ~]$ gpstop -u
20150805:11:05:40:233745 gpstop:uphdmst02:gpadmin-[INFO]:-Starting gpstop with args: -u
20150805:11:05:40:233745 gpstop:uphdmst02:gpadmin-[INFO]:-Gathering information and validating the environment…
20150805:11:05:40:233745 gpstop:uphdmst02:gpadmin-[INFO]:-Obtaining Greenplum Master catalog information
20150805:11:05:40:233745 gpstop:uphdmst02:gpadmin-[INFO]:-Obtaining Segment details from master…
20150805:11:05:41:233745 gpstop:uphdmst02:gpadmin-[INFO]:-Greenplum Version: 'postgres (HAWQ) 4.2.0 build 1'
20150805:11:05:41:233745 gpstop:uphdmst02:gpadmin-[INFO]:-Signalling all postmaster processes to reload

pg_hba.conf

host    all        gpadmin    192.168.68.135/32   trust
host    all        all        192.193.68.132/32   trust
host    all        gpadmin    10.115.yyy.0/24     trust
host    all        aafsh02    10.115.zzz.0/24     trust
#host   all        all        10.115.xx.xx/32     trust
local   all        gpadmin    ident
host    all        gpadmin    127.0.0.1/28        trust
host    all        gpadmin    192.193.68.134/32   trust
host    all        gpadmin    ::1/128             trust
host    all        gpadmin    fe80::3aea:a7ff:fe35:c0c/128   trust
host    all        gpadmin    192.168.68.136/32   trust
host    all        gpadmin    10.110.xxx.0/24     trust
#host   all        gpadmin    10.115.xxx.0/24     trust
host    all        all        10.115.xxx.0/24     ldap ldapserver=xxx.225.227.15 ldapprefix="OFFICE\"
host    all        all        10.110.xxx.0/24     ldap ldapserver=xxx.225.227.15 ldapprefix="OFFICE\"
host    all        all        10.202.xxx.0/24     ldap ldapserver=xxx.225.227.15 ldapprefix="OFFICE\"
host    benchmark  fsh02      10.115.xxx.20/32    ldap ldapserver=xxx.225.227.yyy ldapprefix="OFFICE\"
host    benchmark  fsh02      10.193.xxx.yyy/32   ldap ldapserver=xxx.225.227.yyy ldapprefix="OFFICE\"
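Keep in mind that pg_hba.conf is evaluated top-down and the first matching line wins, so specific host entries must sit above broad subnet or LDAP rules. A minimal self-contained illustration using a throwaway copy (the file path and addresses below are examples, not the real file):

```shell
# pg_hba.conf matching: first line whose type/database/user/address all
# match decides the auth method. Write a tiny sample file:
cat > /tmp/pg_hba_sample.conf <<'EOF'
# comments and blank lines are ignored

host  all  gpadmin  127.0.0.1/28     trust
host  all  all      10.115.0.0/24    ldap
EOF
# Show the effective rules, in matching order:
grep -Ev '^[[:space:]]*(#|$)' /tmp/pg_hba_sample.conf
```

After editing the real file, `gpstop -u` (shown above) reloads it without a restart.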

Create users in HAWQ

gpadmin=# CREATE USER ong01 WITH LOGIN;

gpadmin=# CREATE USER user1 WITH LOGIN;

gpadmin=# \du
                       List of roles
 Role name |            Attributes             |  Member of
-----------+-----------------------------------+-------------
 CM_Admin  | Cannot login                      |
 RD_Admin  | Cannot login                      |
 RD_Users  | Cannot login                      |
 ovi02     | Superuser, Create DB              |
 ong01     |                                   |
 gpadmin   | Superuser, Create role, Create DB |
 user2     |                                   |
 ovi       |                                   |
 rd_user   |                                   | {RD_Users}
 user1     |                                   |

 

gpadmin=# \l
                    List of databases
   Name    |  Owner   | Encoding |   Access privileges
-----------+----------+----------+------------------------
 APMS      | gpadmin  | UTF8     |
 B_POC     | gpadmin  | UTF8     | =Tc/gpadmin
                                 : gpadmin=CTc/gpadmin
                                 : gorodn=CTc/gpadmin
                                 : asi=CTc/gpadmin
 CCD       | gpadmin  | UTF8     |
 CI_DB     | usertest | UTF8     | usertest=CTc/usertest
                                 : ovi=CTc/usertest
 Mike_Test | gpadmin  | UTF8     | =Tc/gpadmin
                                 : gpadmin=CTc/gpadmin
                                 : ovi=CTc/gpadmin

 

GRANT ALL PRIVILEGES
ON TABLE sim_result_d2, sim_result_value_f
TO PUBLIC;

So now these users can access the tables I created.

 

Configure the Capacity Scheduler

The CapacityScheduler is designed to run Hadoop applications as a shared, multi-tenant cluster in an operator-friendly manner while maximizing the throughput and the utilization of the cluster.

The CapacityScheduler is designed to allow sharing a large cluster while giving each organization a minimum capacity guarantee.

To configure the ResourceManager to use the CapacityScheduler, set the following property in the conf/yarn-site.xml:

<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

Each child queue is tied to its parent queue with the yarn.scheduler.capacity.<queue-path>.queues configuration property in the capacity-scheduler.xml file

<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>default</value>
<description>
The queues at this level (root is the root queue).
</description>
</property>
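As a sketch, adding a second child queue under root would look like the fragment below in capacity-scheduler.xml. The queue name "analytics" and the 70/30 split are invented for illustration; the capacities of sibling queues at each level must sum to 100.

<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>default,analytics</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.capacity</name>
<value>70</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.analytics.capacity</name>
<value>30</value>
</property>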

The Capacity Scheduler reads this file at startup. When you modify capacity-scheduler.xml, reload the settings by running the following command:

yarn rmadmin -refreshQueues

After the command completes successfully, you can verify that the queues are set up with the following command:

-bash-4.1$ hadoop queue -list
DEPRECATED: Use of this script to execute mapred command is deprecated.
Instead use the mapred command for it.

15/08/14 16:31:20 INFO client.RMProxy: Connecting to ResourceManager at sphdmst03.dev.bmocm.com/192.168.68.131:8032
======================
Queue Name : default
Queue State : running
Scheduling Info : Capacity: 100.0, MaximumCapacity: 100.0, CurrentCapacity: 0.0

Use the command below to identify the queue names to which you can submit jobs.

-bash-4.1$ hadoop queue -showacls
DEPRECATED: Use of this script to execute mapred command is deprecated.
Instead use the mapred command for it.

15/08/14 16:35:11 INFO client.RMProxy: Connecting to ResourceManager at sphdmst03.dev.bmocm.com/192.168.68.131:8032
Queue acls for user :  gpadmin

Queue  Operations
=====================
root  ADMINISTER_QUEUE,SUBMIT_APPLICATIONS
default  ADMINISTER_QUEUE,SUBMIT_APPLICATIONS

Hadoop Certifications

Hadoop system administrator certifications:

1. Cloudera: http://www.cloudera.com/content/cloudera/en/training/certification/ccah/prep.html

2. Hortonworks (HDPCA exam objectives below)

Installation

Configure a local HDP repository
Install ambari-server and ambari-agent
Install HDP using the Ambari install wizard
Add a new node to an existing cluster
Decommission a node
Add an HDP service to a cluster using Ambari

Configuration

Troubleshooting 

Restart an HDP service

View an application’s log file

Configure and manage alerts

Troubleshoot a failed job

High Availability

Configure NameNode HA

Configure ResourceManager HA

Copy data between two clusters using distcp

Create a snapshot of an HDFS directory

Recover a snapshot

Configure HiveServer2 HA

Security

Install and configure Knox

Install and configure Ranger

Configure HDFS ACLS

Configure Hadoop for Kerberos

 

 

Reference

 Hortonworks http://hortonworks.com/training/class/hdp-certified-administrator-hdpca-exam/

 

HDFS Snapshots

Make your HDFS directory snapshottable; in our case, /test4:

[gpadmin@sphdmst01 tmp]$ hdfs dfsadmin -allowSnapshot /test4
Allowing snaphot on /test4 succeeded

[gpadmin@sphdmst01 tmp]$ hdfs dfsadmin -disallowSnapshot /test4
Disallowing snaphot on /test4 succeeded

Create a snapshot

[gpadmin@sphdmst01 ]$ hdfs dfs -createSnapshot /test4 first-snapshot
Created snapshot /test4/.snapshot/first-snapshot

[gpadmin@sphdmst01 tmp]$ hdfs dfs -ls -R /test4/.snapshot
drwxr-xr-x   - gpadmin hadoop          0 2015-08-11 15:28 /test4/.snapshot/first-snapshot
-rw-r--r--   2 gpadmin hadoop      14515 2015-01-12 10:05 /test4/.snapshot/first-snapshot/Hadoop Servers.xlsx
-rw-r--r--   2 gpadmin hadoop          0 2015-01-12 10:04 /test4/.snapshot/first-snapshot/Hadoop_prod.xlsx
-rw-r--r--   2 gpadmin hadoop       4322 2015-01-12 10:08 /test4/.snapshot/first-snapshot/check_hadoop-dfs.sh
-rw-r--r--   2 gpadmin hadoop          0 2015-01-12 10:04 /test4/.snapshot/first-snapshot/pgadmin.log

You can read the content of the file

[gpadmin@sphdmst01 tmp]$ hdfs dfs -cat /test4/.snapshot/first-snapshot/check_hadoop-dfs.sh

Recover the file from the snapshot

[gpadmin@sphdmst01 /]$ hdfs dfs -cp /test4/.snapshot/first-snapshot/check_hadoop-dfs.sh /ovitest

[gpadmin@cmtolsphdmst01 /]$ hdfs dfs -ls /ovitest
Found 6 items
-rw-r--r--   2 gpadmin hadoop       4322 2015-08-11 15:37 /ovitest/check_hadoop-dfs.sh
-rw-r--r--   2 gpadmin hadoop         66 2015-01-13 16:31 /ovitest/test.txt
-rw-r--r--   2 gpadmin hadoop         66 2015-01-13 17:09 /ovitest/test2.txt
-rw-r--r--   2 gpadmin hadoop         66 2015-01-13 17:10 /ovitest/test3.txt
-rw-r--r--   2 gpadmin hadoop         66 2015-01-14 10:52 /ovitest/test4.txt
-rw-r--r--   2 gpadmin hadoop         66 2015-01-14 10:53 /ovitest/test5.txt

Another example:

[gpadmin@sphdmst01 ~]$ hdfs dfs -mkdir /test_snapshot

[gpadmin@sphdmst01 ~]$ hdfs dfs -put dfs-old-lsr-1.log /test_snapshot
[gpadmin@sphdmst01 ~]$ hdfs dfs -put dfs-old-fsck-1.log /test_snapshot

[gpadmin@sphdmst01 ~]$ hdfs dfs -ls /test_snapshot
Found 2 items
-rw-r--r--   2 gpadmin hadoop      45341 2015-09-14 09:39 /test_snapshot/dfs-old-fsck-1.log
-rw-r--r--   2 gpadmin hadoop      83862 2015-09-14 09:38 /test_snapshot/dfs-old-lsr-1.log

[gpadmin@sphdmst01 ~]$ hdfs dfs -createSnapshot /test_snapshot snapshot_dir
createSnapshot: Directory is not a snapshottable directory: /test_snapshot

[gpadmin@sphdmst01 ~]$ hdfs dfsadmin  -allowSnapshot /test_snapshot
Allowing snaphot on /test_snapshot succeeded

[gpadmin@sphdmst01 ~]$ hdfs dfs -createSnapshot /test_snapshot snapshot_dir
Created snapshot /test_snapshot/.snapshot/snapshot_dir

Snapshots are read-only; HDFS protects them against deletion or modification of their contents by users and applications.
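For example, trying to remove a file through the .snapshot path is rejected (illustrative transcript, not run in this cluster; the exact error wording may vary by Hadoop version):

[gpadmin@sphdmst01 ~]$ hdfs dfs -rm /test_snapshot/.snapshot/snapshot_dir/dfs-old-lsr-1.log
rm: Modification on a read-only snapshot is disallowed

The snapshot itself can only be removed with hdfs dfs -deleteSnapshot /test_snapshot snapshot_dir, and the directory must have all its snapshots deleted before -disallowSnapshot will succeed.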