AWS – CloudFront

Amazon CloudFront is a global content delivery network (CDN) service that accelerates delivery of your websites, APIs, video content, or other web assets. It integrates with other Amazon Web Services products to give developers and businesses an easy way to accelerate content to end users with no minimum usage commitments.

Edge Locations are used in conjunction with the AWS CloudFront service, which is a content delivery network (CDN) service. Edge Locations are deployed across the world in multiple locations to reduce latency for traffic served over CloudFront and, as a result, are usually located in highly populated areas.

Amazon CloudFront is optimized to work with other AWS services, like Amazon S3, Amazon EC2, Elastic Load Balancing (ELB), and Amazon Route 53.

– Copies of static content (e.g., images, CSS files, streaming of pre-recorded video) and dynamic content (e.g., HTML responses, live video) can be cached at Amazon CloudFront, a content delivery network (CDN) consisting of multiple edge locations around the world. Edge caching allows content to be served by infrastructure that is closer to viewers, lowering latency and giving you high data transfer rates.

CloudFront – 2 types of distribution: HTTP/HTTPS (web) and RTMP (streaming)

AWS – DynamoDB

DynamoDB – a NoSQL database service from AWS designed for fast processing of small data sets that grow and change dynamically

Usage

  • Gaming: high-scores, world changes, player status and statistics
  • Advertising services
  • Messaging and blogging
  • Data blocks systematization and processing

Your data is automatically replicated across three Availability Zones (AZs) within the selected region

  • There is no limit to the amount of data you can store in an Amazon DynamoDB table. As the size of your data set grows, Amazon DynamoDB will automatically spread your data over sufficient machine resources to meet your storage requirements.
  • To achieve high uptime and durability, Amazon DynamoDB synchronously replicates data across three facilities within an AWS Region.

 

Amazon DynamoDB supports two types of secondary indexes:

  • Local secondary index — an index that has the same partition key as the table, but a different sort key. A local secondary index is “local” in the sense that every partition of a local secondary index is scoped to a table partition that has the same partition key.
  • Global secondary index — an index with a partition or a partition-and-sort key that can be different from those on the table. A global secondary index is considered “global” because queries on the index can span all items in a table, across all partitions.


DynamoDB cross-region replication allows you to maintain identical copies (called replicas) of a DynamoDB table (called master table) in one or more AWS regions. After you enable cross-region replication for a table, identical copies of the table are created in other AWS regions. Writes to the table will be automatically propagated to all replicas.

If you wish to exceed throughput rates of 10,000 writes/second or 10,000 reads/second, you must first contact Amazon Web Services.

DynamoDB data is automatically replicated across multiple AZs

DynamoDB allows for the storage of large text and binary objects, but item size is limited

 


-Strong Consistency

Atomic counter

DynamoDB supports atomic counters, where you use the updateItem method to increment or decrement the value of an existing attribute without interfering with other write requests. (All write requests are applied in the order in which they were received.)
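As a minimal sketch of the atomic-counter semantics (an in-memory stand-in, not the real DynamoDB API; the table and attribute names are illustrative):

```python
# In-memory sketch of DynamoDB atomic-counter semantics: increments are
# applied one at a time, in arrival order, so concurrent writers never
# lose updates. Names here are illustrative, not real API calls.
def apply_updates(table, updates):
    """Apply (key, attribute, delta) updates strictly in arrival order."""
    for key, attr, delta in updates:
        item = table.setdefault(key, {})
        # Comparable to an UpdateExpression like "ADD attr :delta"
        item[attr] = item.get(attr, 0) + delta
    return table

table = {}
# Three writers increment the same counter; arrival order decides application order.
apply_updates(table, [("page#1", "views", 1), ("page#1", "views", 1), ("page#1", "views", -1)])
print(table["page#1"]["views"])  # 1
```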

Conditional writes

PutItem, DeleteItem, UpdateItem

Conditional writes are idempotent – that means you can send the same conditional write request multiple times, but it will have no further effect on the item after the first time DynamoDB performs the specified update.
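The idempotency of conditional writes can be sketched with a toy in-memory update (illustrative only, not the boto3 API):

```python
# Sketch of conditional-write idempotency: a conditional update succeeds only
# while the condition still holds, so resending the same request is a no-op
# after the first success.
def conditional_update(item, expected_price, new_price):
    """Update price only if the current price matches the expected value."""
    if item.get("price") == expected_price:
        item["price"] = new_price
        return True   # write applied
    return False      # like ConditionalCheckFailedException: no change

item = {"id": "widget", "price": 10}
first = conditional_update(item, expected_price=10, new_price=8)   # applied, price -> 8
second = conditional_update(item, expected_price=10, new_price=8)  # condition no longer holds
print(first, second, item["price"])  # True False 8
```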

Batch Operations

If your application needs to read multiple items, you can use BatchGetItem. A single BatchGetItem request can retrieve up to 16 MB of data, which can contain as many as 100 items.
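Because of the 100-item cap, a larger key set has to be split into multiple requests. A hypothetical helper for the batching step (it only builds the chunks, it does not call AWS):

```python
# BatchGetItem accepts at most 100 keys per request, so larger key sets must
# be split into chunks before issuing the calls.
def chunk_keys(keys, batch_size=100):
    """Split a list of primary keys into BatchGetItem-sized chunks."""
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]

keys = [{"id": str(n)} for n in range(250)]
batches = chunk_keys(keys)
print([len(b) for b in batches])  # [100, 100, 50]
```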

DynamoDB supports eventually consistent and strongly consistent reads.

Eventually consistent reads 

When you read data from a DynamoDB table, the response might not reflect the results of recently completed write operations. The response might include some stale data. If you repeat your read request after a short time, the response should return the latest data.

Strongly consistent reads

When you request a strongly consistent read, DynamoDB returns a response with the most up-to-date data, reflecting the updates from all prior write operations that were successful. A strongly consistent read might not be available in the case of a network delay or outage.

DynamoDB uses eventually consistent reads unless you specify otherwise. Read operations (such as GetItem, Query, and Scan) provide a ConsistentRead parameter: if you set this parameter to true, DynamoDB will use strongly consistent reads during the operation.

 

 

Units of capacity required for writes = number of item writes per second x item size in 1 KB blocks (rounded up)

Units of capacity required for reads* = number of item reads per second x item size in 4 KB blocks (rounded up)

* If you use eventually consistent reads, you'll get twice the throughput in terms of reads per second.
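The two formulas above, as runnable arithmetic (a sketch of the provisioned-throughput math; function names are mine, not an AWS API):

```python
import math

# One write unit covers a 1 KB item per second; one strongly consistent read
# unit covers a 4 KB item per second; eventually consistent reads need half
# as many units (twice the throughput per unit).
def write_units(writes_per_sec, item_size_kb):
    return writes_per_sec * math.ceil(item_size_kb / 1)

def read_units(reads_per_sec, item_size_kb, eventually_consistent=False):
    units = reads_per_sec * math.ceil(item_size_kb / 4)
    return math.ceil(units / 2) if eventually_consistent else units

print(write_units(10, 1.5))                           # 20 (1.5 KB rounds up to 2 blocks)
print(read_units(10, 3))                              # 10 (strongly consistent)
print(read_units(10, 3, eventually_consistent=True))  # 5
```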

Error components:

An HTTP 200 code – success

An HTTP 400 code – indicates a problem with your request (client error),

e.g. authentication failure, missing required parameters, or exceeding a table's provisioned throughput

An HTTP 5xx code – indicates a problem that must be resolved by Amazon Web Services

Optimistic locking is a strategy to ensure that the client-side item that you are updating (or deleting) is the same as the item in DynamoDB. If you use this strategy, then your database writes are protected from being overwritten by the writes of others — and vice-versa.
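A common way to implement optimistic locking is a version-number attribute: each writer reads the item, then writes back only if the version is unchanged, incrementing it on success. A minimal in-memory model of that pattern (illustrative; not the DynamoDBMapper API):

```python
# Optimistic locking via a version attribute: a stale writer (one whose
# expected version no longer matches) is rejected instead of silently
# overwriting a newer write.
def save(store, key, new_fields, expected_version):
    item = store.get(key, {"version": 0})
    if item["version"] != expected_version:
        return False  # someone else wrote first; caller should re-read and retry
    store[key] = {**item, **new_fields, "version": expected_version + 1}
    return True

store = {"doc": {"version": 1, "body": "draft"}}
ok = save(store, "doc", {"body": "final"}, expected_version=1)     # wins
stale = save(store, "doc", {"body": "other"}, expected_version=1)  # loses: version is now 2
print(ok, stale, store["doc"]["version"], store["doc"]["body"])  # True False 2 final
```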

-DynamoDB supports nested attributes up to 32 levels deep.

 

Reference

http://aws.amazon.com/faqs

AWS – Route 53

Amazon Route 53 is a highly available and scalable cloud Domain Name System (DNS) web service.

Amazon Route 53 effectively connects user requests to infrastructure running in AWS – such as Amazon EC2 instances, Elastic Load Balancing load balancers, or Amazon S3 buckets – and can also be used to route users to infrastructure outside of AWS. You can use Amazon Route 53 to configure DNS health checks to route traffic to healthy endpoints or to independently monitor the health of your application and its endpoints.

Amazon Route 53 Traffic Flow makes it easy for you to manage traffic globally through a variety of routing types, including Latency Based Routing, Geo DNS, and Weighted Round Robin—all of which can be combined with DNS Failover in order to enable a variety of low-latency, fault-tolerant architectures. Using Amazon Route 53 Traffic Flow’s simple visual editor, you can easily manage how your end-users are routed to your application’s endpoints—whether in a single AWS region or distributed around the globe. Amazon Route 53 also offers Domain Name Registration – you can purchase and manage domain names such as example.com and Amazon Route 53 will automatically configure DNS settings for your domains.

Amazon Route 53 currently supports the following DNS record types:

  • TXT (text record)
  • SRV (service locator)
  • SPF (sender policy framework)
  • SOA (start of authority record)
  • PTR (pointer record)
  • NS (name server record)
  • MX (mail exchange record)
  • CNAME (canonical name record)
  • AAAA (IPv6 address record)
  • A (address record)
  • Additionally, Amazon Route 53 offers ‘Alias’ records (an Amazon Route 53-specific virtual record). Alias records are used to map resource record sets in your hosted zone to Amazon Elastic Load Balancing load balancers, Amazon CloudFront distributions, AWS Elastic Beanstalk environments, or Amazon S3 buckets that are configured as websites. Alias records work like a CNAME record in that you can map one DNS name (example.com) to another ‘target’ DNS name (elb1234.elb.amazonaws.com). They differ from a CNAME record in that they are not visible to resolvers. Resolvers only see the A record and the resulting IP address of the target record.

 

Amazon Route 53 does not support DNSSEC at this time.

Amazon Route 53 offers a special type of record called an ‘Alias’ record that lets you map your zone apex (example.com) DNS name to your ELB DNS name (e.g., elb1234.elb.amazonaws.com). IP addresses associated with Amazon Elastic Load Balancers can change at any time due to scaling up, scaling down, or software updates. Route 53 responds to each request for an Alias record with one or more IP addresses for the load balancer. Queries to Alias records that are mapped to ELB load balancers are free. These queries are listed as “Intra-AWS-DNS-Queries” on the Amazon Route 53 usage report.

Route 53 has a security feature that prevents internal DNS records from being read by external sources. The workaround is to create an EC2-hosted DNS instance that does zone transfers from the internal DNS and allows itself to be queried by external servers.

DNS Routing Policy 

  • Weighted Round Robin (WRR)
  • Latency Based Routing (LBR)
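The weighted policy can be sketched deterministically: over one full cycle, each endpoint is returned in proportion to its weight. The endpoint names and weights below are made up for illustration.

```python
# Deterministic sketch of Weighted Round Robin: build one full cycle in which
# each endpoint appears `weight` times, so traffic splits in weight proportion.
from collections import Counter

def wrr_cycle(weights):
    """Yield endpoints for one cycle, each repeated `weight` times."""
    return [endpoint for endpoint, w in weights.items() for _ in range(w)]

cycle = wrr_cycle({"us-east": 3, "eu-west": 1})
print(Counter(cycle))  # Counter({'us-east': 3, 'eu-west': 1})
```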

Amazon – Kinesis

Amazon Kinesis Streams enables you to build custom applications that process or analyze streaming data for specialized needs. Amazon Kinesis Streams can continuously capture and store terabytes of data per hour from hundreds of thousands of sources such as website clickstreams, financial transactions, social media feeds, IT logs, and location-tracking events. With Amazon Kinesis Client Library (KCL), you can build Amazon Kinesis Applications and use streaming data to power real-time dashboards, generate alerts, implement dynamic pricing and advertising, and more. You can also emit data from Amazon Kinesis Streams to other AWS services such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Elastic Map Reduce (Amazon EMR), and AWS Lambda.

AWS – CloudTrail

AWS CloudTrail is a web service that records AWS API calls for your account and delivers log files to you. The recorded information includes the identity of the API caller, the time of the API call, the source IP address of the API caller, the request parameters, and the response elements returned by the AWS service.

AWS CloudTrail provides a record of your AWS API calls. You can use this data to gain visibility into user activity, troubleshoot
operational and security incidents, or to help demonstrate compliance with internal policies or regulatory standards.

This information is collected and written to log files that are stored in an Amazon S3 bucket that you specify.

– Once you have enabled CloudTrail, event logs are delivered every 5 minutes. You can configure CloudTrail so that it aggregates log files from multiple regions into a single Amazon S3 bucket.
– In addition to CloudTrail’s user activity logs, you can use the Amazon CloudWatch Logs feature to collect and monitor system, application, and custom log files from your EC2 instances and other sources in near real time.
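The fields CloudTrail captures (caller identity, time, source IP, request parameters, response elements) can be pulled out of a delivered log record with ordinary JSON parsing. The event body below is a hypothetical, minimal record, not a real log:

```python
import json

# Parse a minimal, made-up CloudTrail-style record and extract the fields
# the service records about each API call.
event = json.loads("""{
  "eventTime": "2016-11-01T12:00:00Z",
  "eventName": "RunInstances",
  "userIdentity": {"userName": "alice"},
  "sourceIPAddress": "203.0.113.10",
  "requestParameters": {"instanceType": "t2.micro"},
  "responseElements": {"instancesSet": []}
}""")

summary = (event["userIdentity"]["userName"], event["eventName"], event["sourceIPAddress"])
print(summary)  # ('alice', 'RunInstances', '203.0.113.10')
```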

Use Cases

  • Security analysis
  • Track changes to AWS Resources
  • Compliance Aid
  • Troubleshoot Operational issues

– By default, CloudTrail log files are encrypted using S3 Server-Side Encryption (SSE) and placed into your S3 bucket.

– You can turn on Amazon SNS notifications so that you can take immediate action on delivery of new logs.

AWS – CloudWatch

Amazon CloudWatch is a monitoring service for AWS cloud resources and the applications you run on AWS. You can use Amazon CloudWatch to collect and track metrics, collect and monitor log files, set alarms, and automatically react to changes in your AWS resources. Amazon CloudWatch can monitor AWS resources such as Amazon EC2 instances, Amazon DynamoDB tables, and Amazon RDS DB instances, as well as custom metrics generated by your applications and services, and any log files your applications generate. You can use Amazon CloudWatch to gain system-wide visibility into resource utilization, application performance, and operational health. You can use these insights to react and keep your application running smoothly.

 

– Many metrics are received and aggregated at 1-minute intervals; some are at 3-minute or 5-minute intervals.

  • Metric data is available for 2 weeks
  • Metrics cannot be deleted, but they automatically expire after 2 weeks

Metrics Retention

CloudWatch now stores all metrics for 15 months at no extra charge (as of November 2016). To keep the overall volume of data reasonable, historical data is stored at a lower level of granularity, as follows:

  • One minute data points are available for 15 days.
  • Five minute data points are available for 63 days.
  • One hour data points are available for 455 days (15 months).
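The retention tiers above can be expressed as a lookup: given a data point's age in days, return the finest granularity still available. This is a sketch of the policy, not a CloudWatch API call.

```python
# Map a metric data point's age (in days) to the finest granularity still
# retained under the tiers listed above.
def finest_granularity(age_days):
    if age_days <= 15:
        return "1 minute"
    if age_days <= 63:
        return "5 minutes"
    if age_days <= 455:
        return "1 hour"
    return None  # older than 15 months: expired

print(finest_granularity(10), finest_granularity(30), finest_granularity(400))
# 1 minute 5 minutes 1 hour
```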

The following CloudWatch metrics require a custom monitoring script to populate them:

  • Swap Usage
  • Available Disk Space

Aggregation : 

  • CloudWatch does not aggregate data across regions
  • Aggregated statistics are only available when using detailed monitoring.

CloudWatch

– does not provide detailed monitoring for EMR

  • by default, detailed monitoring is enabled for Auto Scaling

– provides free detailed monitoring for:

  • AWS Route 53
  • AWS RDS
  • AWS ELB
  • AWS OpsWorks

 

  • To upload custom metrics, you can use the AWS CLI or the API


HDP – Data workflow

Sqoop

Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores.

Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB

Flume

A service for streaming logs into Hadoop

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows; and is robust and fault tolerant with tunable reliability mechanisms for failover and recovery.

YARN coordinates data ingest from Apache Flume and other services that deliver raw data into an Enterprise Hadoop cluster

Use Flume if you have non-relational data sources, such as log files, that you want to stream into Hadoop.

Use Kafka if you need a highly reliable and scalable enterprise messaging system to connect multiple systems, one of which is Hadoop.

Kafka

NFS

WebHDFS

 

 

 

AWS – SQS

Amazon Simple Queue Service (SQS) and Amazon SNS are both messaging services within AWS, which provide different benefits for developers. Amazon SNS allows applications to send time-critical messages to multiple subscribers through a “push” mechanism, eliminating the need to periodically check or “poll” for updates.

Amazon SQS is a message queue service used by distributed applications to exchange messages through a polling model, and can be used to decouple sending and receiving components. Amazon SQS provides flexibility for distributed components of applications to send and receive messages without requiring each component to be concurrently available.

Amazon Simple Queue Service (SQS) is a fast, reliable, scalable, fully managed message queuing service.

You can use SQS to transmit any volume of data, at any level of throughput, without losing messages or requiring other services to be always available.

Each queue starts with a default visibility timeout of 30 seconds.

You can change that setting for the entire queue, or specify a new timeout value for an individual message using the ChangeMessageVisibility action.

  • Messages can be retained in queues for up to 14 days.
  • The maximum visibility timeout of an SQS message in a queue is 12 hours (the default is 30 seconds).
  • A message can contain up to 256 KB of text, billed in 64 KB chunks.
  • The maximum long polling timeout is 20 seconds.
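The visibility timeout behavior can be sketched with a toy queue and a fake clock (an in-memory model of the semantics, not the SQS API): a received message becomes invisible for the timeout window, and reappears if it is not deleted in time.

```python
# Toy model of the SQS visibility timeout: receiving a message hides it for
# `visibility_timeout` fake-clock seconds; if the consumer never deletes it,
# it becomes receivable again after the window expires.
class ToyQueue:
    def __init__(self, visibility_timeout=30):
        self.visibility_timeout = visibility_timeout
        self.messages = {}          # message id -> invisible_until (seconds)

    def send(self, msg_id):
        self.messages[msg_id] = 0   # visible immediately

    def receive(self, now):
        for msg_id, invisible_until in self.messages.items():
            if now >= invisible_until:
                self.messages[msg_id] = now + self.visibility_timeout
                return msg_id
        return None

    def delete(self, msg_id):
        self.messages.pop(msg_id, None)

q = ToyQueue(visibility_timeout=30)
q.send("m1")
first = q.receive(now=0)    # 'm1', now invisible until t=30
hidden = q.receive(now=10)  # None: still inside the visibility timeout
again = q.receive(now=31)   # 'm1' reappears because it was never deleted
print(first, hidden, again)  # m1 None m1
```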

The first 1 million requests are free, then $0.50 per every million requests.

No ordering – standard SQS messages can be delivered multiple times and in any order

Amazon SQS uses short polling by default, querying only a subset of the servers to determine whether any messages are available for inclusion in the response.

Long polling is configured via the Receive Message Wait Time setting – up to 20 s (values from 1 s to 20 s).

Benefit of Long polling

Long polling helps reduce your cost of using Amazon SQS by reducing the number of empty responses and eliminating false empty responses.

  • Long polling reduces the number of empty responses by allowing SQS to wait until a message is available in the queue before sending a response
  • Long polling eliminates false empty responses by querying all of the servers
  • Long polling returns messages as soon as a message becomes available

 

FIFO queues are designed to enhance messaging between applications when the order of operations and events is critical, for example:

  • Ensure that user-entered commands are executed in the right order.
  • Display the correct product price by sending price modifications in the right order.
  • Prevent a student from enrolling in a course before registering for an account.

Note

The name of a FIFO queue must end with the .fifo suffix. The suffix counts towards the 80-character queue name limit. To determine whether a queue is FIFO, you can check whether the queue name ends with the suffix.
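A small validation sketch for the naming rule above (the helper function is mine, not an SQS API):

```python
# FIFO queue names must end with ".fifo", and the full name, suffix
# included, must fit the 80-character queue-name limit.
def is_valid_fifo_name(name):
    return name.endswith(".fifo") and len(name) <= 80

print(is_valid_fifo_name("orders.fifo"))       # True
print(is_valid_fifo_name("orders"))            # False (missing suffix)
print(is_valid_fifo_name("q" * 80 + ".fifo"))  # False (exceeds 80 characters)
```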

Reference

http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/FIFO-queues.html

AWS – SNS

Amazon Simple Notification Service (SNS) is a simple, fully-managed “push” messaging service that allows users to push texts, alerts or notifications, like an auto-reply message, or a notification that a package has shipped.

Amazon Simple Notification Service (Amazon SNS) is a web service that coordinates and manages the delivery or sending of messages to subscribing endpoints or clients. In Amazon SNS, there are two types of clients – publishers and subscribers – also referred to as producers and consumers.