In this lab, we will store ingested data (data we’re ingesting in Lab 1) into Amazon S3 using Amazon Kinesis Data Firehose. We will then crawl this stored data using AWS Glue to discover schema and query this data using Amazon Athena.
If you’re coming here from Lab 1, you can continue where your browser already is.
If you didn’t close your browser tab in Lab 1, you can skip this section and jump to Step A below. But if you did, follow the next few steps to get back to where you were.
Open a new tab and point it to https://console.aws.amazon.com/kinesis/home.
Then click on ‘RetailDataAnalytics’, the Kinesis Data Analytics application that we created in Lab 1.
Click on ‘Go to SQL results’
Now you can continue from Step A #1 below.
Click on the ‘Destination’ tab, and then click on ‘Connect to a destination’. We will create a ‘Firehose’ destination.
For ‘Destination’, choose ‘Kinesis Firehose delivery stream’
And for ‘Kinesis Firehose delivery stream’, click on ‘Create new’. This will open up a new tab in which we will create and configure a new Kinesis Firehose delivery stream.
Create a Kinesis Data Firehose delivery stream: enter a descriptive name for ‘Delivery stream name’.
In the ‘New delivery stream’ page, leave the rest of the options as-is. Click on ‘Next’
In the subsequent ‘Process records’ page, scroll down to the ‘Transform source records with AWS Lambda’ section.
For ‘Data transformation’ choose ‘Enabled’.
Click on ‘Create new’
Click on ‘General Firehose Processing’ (the first link)
This should open up a new tab to create a Lambda function to process records
Scroll down further to where you see the Lambda source code. Leave it as-is (the source code is NOT editable in this screen) and click on ‘Create function’
In this next screen, you can edit the Lambda source code. Replace it with the contents of AnnotateRetailDataAnalytics.js.
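For reference, a Firehose transformation Lambda generally follows the shape sketched below. This is only a minimal illustration of the record-processing contract that Firehose expects (decode each Base64 record, annotate the JSON, re-encode it, and return a per-record result); it is not the actual code in AnnotateRetailDataAnalytics.js, and the is_anomaly field is a hypothetical annotation that assumes the ANOMALY_SCORE column produced by the Lab 1 SQL.

// Minimal sketch of a Firehose transformation Lambda (Node.js) -- illustrative only.
exports.handler = async (event) => {
    // Firehose invokes the function with a batch of records; each record's
    // 'data' field is Base64-encoded.
    const output = event.records.map((record) => {
        try {
            const payload = JSON.parse(Buffer.from(record.data, 'base64').toString('utf8'));
            // Hypothetical annotation: flag records with a high anomaly score.
            payload.is_anomaly = payload.ANOMALY_SCORE > 1;
            return {
                recordId: record.recordId,
                result: 'Ok',
                data: Buffer.from(JSON.stringify(payload) + '\n').toString('base64')
            };
        } catch (err) {
            // Mark unparseable records as failed; Firehose delivers them to the
            // S3 error prefix that we configure later in this lab.
            return { recordId: record.recordId, result: 'ProcessingFailed', data: record.data };
        }
    });
    return { records: output };
};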
Scroll down further until you reach the ‘Basic Settings’ section. Increase the Lambda function’s ‘Timeout’ value to ‘1 min’
Scroll back up to the very top of the page and click on ‘Save’ (the button is in the top right-hand corner).
OPTIONAL: You can also test this Lambda function by sending it a mock Firehose record to see if it is processed successfully. To do so, click on the ‘Test’ button and configure a test event like the one below:
{
"invocationId": "invocationIdExample",
"deliveryStreamArn": "arn:aws:kinesis:EXAMPLE",
"region": "us-west-2",
"records": [
{
"recordId": "49546986683135544286507457936321625675700192471156785154",
"approximateArrivalTimestamp": 1495072949453,
"data": "eyJDT0xfdGltZXN0YW1wIjoiMjAxOS0wOS0xOCAxMTo0OTozNi4wMDAiLCJzdG9yZV9pZCI6InN0b3JlXzQ5Iiwid29ya3N0YXRpb25faWQiOiJwb3NfMiIsIm9wZXJhdG9yX2lkIjoiY2FzaGllcl8xNDkiLCJpdGVtX2lkIjoiaXRlbV84NzU4IiwicXVhbnRpdHkiOjQsInJlZ3VsYXJfc2FsZXNfdW5pdF9wcmljZSI6MzcuMzQsInJldGFpbF9wcmljZV9tb2RpZmllciI6Mi4yMSwicmV0YWlsX2twaV9tZXRyaWMiOjc5LCJBTk9NQUxZX1NDT1JFIjowLjcyODk2NzM5NDUxNzg5MDV9Cg=="
}
]
}
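If you’re curious what that mock record contains, the ‘data’ field is simply a Base64-encoded JSON POS event of the kind Lab 1 ingests. Assuming Node.js is available (for example, in the Cloud9 environment from Lab 1), you can decode it locally:

node -e "console.log(Buffer.from('<paste the data value here>', 'base64').toString('utf8'))"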
You can now either close this browser tab where you configured the Lambda function to annotate Firehose records, or keep it open and switch back to the previous tab.
In the ‘Choose a Lambda blueprint’ dialog, click ‘Close’
Now, in the ‘Transform source records with AWS Lambda’ section, click on the ‘Lambda function’ drop-down and choose the Lambda function that we just created to annotate Firehose records.
In case you don’t see the function, click on the ‘Refresh’ button next to it to reload available Lambda functions.
Click on ‘Next’ (scroll down a bit, if you have to)
In the ‘Select a destination’ page, the ‘Amazon S3’ option should already be selected by default. If not, choose that as the option.
Scroll down to the ‘S3 destination’ section and click on the ‘Create new’ button to create a new S3 bucket to store analytics data.
All S3 bucket names, regardless of who or which account created them, need to be globally unique. Appending or prepending a string that is unique to you to retail-analytics should help. For example, [COMPANY-NAME]-[SOME-UNIQUE-IDENTIFIER]-retail-analytics has a higher likelihood of being unique.
Click on ‘Create S3 Bucket’.
Configure the S3 destination.
For the ‘S3 bucket’ field, the bucket that you just created should have been pre-selected.
For ‘S3 prefix’, enter
prod-retail-data/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
For ‘S3 error prefix’, enter
prod-retail-data-errors/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/!{firehose:error-output-type}
The above two prefix configurations customize the paths at which incoming data will be stored, using both static and dynamic (such as date-based) components. With this configuration, we’re storing the data in a Hive-compatible partitioning format, which enables efficient querying with Amazon Athena and Amazon EMR.
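For example, a record delivered at 11:49 UTC on September 18, 2019 would be written under a key that starts roughly like the one below (Firehose generates the object name itself from the delivery stream name, a timestamp, and a random suffix):

prod-retail-data/year=2019/month=09/day=18/hour=11/[OBJECT-NAME-GENERATED-BY-FIREHOSE]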
Click on ‘Next’. This should auto-close the tab you’re on and lead you right back to the ‘Kinesis Firehose - Create delivery stream’ page.
Under the ‘S3 buffer conditions’ section, for ‘Buffer size’ enter 1 (MB) and for ‘Buffer interval’ enter 60 (seconds).
Scroll down to the ‘Permissions’ section and for ‘IAM role’ click ‘Create new or choose’
This will open up a new tab with a pre-created IAM Role and policy which you can just authorize by clicking on ‘Allow’.
Clicking on ‘Allow’ will automatically close the IAM tab and take you right back to your original tab, but now with the ‘IAM role’ pre-selected with the role that we just created.
Now click on ‘Next’. You may need to scroll down a bit.
This is the final screen. Leave everything as-is, scroll down, and click on ‘Create delivery stream’.
You will first see an in-progress flash message…
This is followed by a success flash message once the delivery stream has been created.
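Optionally, if you’d rather verify from a terminal (for example, your Cloud9 terminal), the AWS CLI can report the status of the delivery stream you just created; substitute the name you chose:

aws firehose describe-delivery-stream --delivery-stream-name [YOUR-DELIVERY-STREAM-NAME] --query 'DeliveryStreamDescription.DeliveryStreamStatus'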
You can now close this browser tab, or switch to the previous one (where you have the Kinesis Data Analytics application configuration open – the browser tab that looks like the screenshot below in Step B #1)
We have successfully configured a Kinesis Data Firehose delivery stream, and we’re now going to configure our Kinesis Data Analytics application to use it as its destination.
For ‘Kinesis Firehose delivery stream’, ensure that the delivery stream you just created (e.g. ‘Retail Analytics Delivery Stream’) is selected.
Scroll down to the ‘In-application Stream’ section and for ‘In-application stream name’, choose ‘DESTINATION_STREAM’ (the stream we created via Streaming SQL in Step E #1 and Step F #1 of Lab 1)
Click on ‘Save and continue’. This might take a few tens of seconds.
Click on ‘Go to SQL results’
CONDITIONAL STEP: Check that your data generator script from Lab 1 is still running in Cloud9. If not, run it now:
cd ~/environment/retail/lab1/src
ruby gen_pos_log_stream.rb
Wait for the script to start…
Open another browser tab, point it to https://console.aws.amazon.com/s3, and navigate to the bucket that you created. Click into the bucket, wait at least a minute (we configured Kinesis Firehose’s buffer as 60 seconds or 1 MB – whichever is hit first), and refresh. You should now see this bucket start to fill up with data.
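You can also check from the command line. Assuming the AWS CLI is configured (as it is in the Cloud9 environment), a recursive listing of the prefix – with YOUR_S3_BUCKET_NAME replaced by the bucket you created – should start showing objects under the Hive-style partition paths after the first buffer flush:

aws s3 ls s3://YOUR_S3_BUCKET_NAME/prod-retail-data/ --recursive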
Open a new tab in your browser and point it to https://console.aws.amazon.com/glue
If you see a splash screen, just click ‘Get Started’.
Click on ‘Add tables using a crawler’.
Give this crawler a descriptive name and click ‘Next’.
For crawler source, choose ‘Data stores’ (selected by default) and click ‘Next’.
Add a data store by specifying the S3 bucket name into which data is being ingested and click ‘Next’.
On the ‘Add another data store’ screen, leave the default value as-is and click ‘Next’.
For ‘IAM role’, enter a unique identifier for the new role (it will be created as ‘AWSGlueServiceRole-[SOME_UNIQUE_IDENTIFIER]’) and click ‘Next’.
For ‘Frequency’, leave the choice as ‘Run on demand’ and click on ‘Next’.
Under ‘Configure the crawler’s output’, click on ‘Add database’
In the ‘Add database’ dialog box, for ‘Database name’ enter ‘retail_analytics_db’ (or any name you like, but just remember to use this value when querying) and click on ‘Create’
Click on ‘Next’
Click ‘Finish’
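As an aside, the wizard you just completed is roughly equivalent to the AWS CLI call sketched below. The crawler name is hypothetical, and the role, database, and bucket placeholders stand for the values you chose above:

aws glue create-crawler \
  --name retail-analytics-crawler \
  --role AWSGlueServiceRole-[SOME_UNIQUE_IDENTIFIER] \
  --database-name retail_analytics_db \
  --targets '{"S3Targets": [{"Path": "s3://YOUR_S3_BUCKET_NAME"}]}'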
You will see a success flash message with an option to run the newly created crawler right away. DO NOT RUN IT YET! (but, if you already clicked on it, well, no harm)
Open up a new browser tab and point it at https://console.aws.amazon.com/iam.
Click on ‘Roles’ in the left-hand pane (see screenshot below #3)
And then type ‘Glue’ in the Search box, which should bring up a role named ‘AWSGlueServiceRole-[SOME_UNIQUE_IDENTIFIER]’ that you configured in Step E #6.
Click on it.
In the subsequent screen, under ‘Permissions’, click on the attached policy named ‘AWSGlueServiceRole-[SOME_UNIQUE_IDENTIFIER]’
Click on ‘Edit Policy’
Click on ‘JSON’.
Copy and paste the permissions policy below.
NOTE: Replace YOUR_S3_BUCKET_NAME with the name of the bucket you created so that Glue can access your S3 bucket.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::YOUR_S3_BUCKET_NAME/prod-retail-data/*"
            ]
        }
    ]
}
Click on ‘Review policy’
Click on ‘Save changes’
Close this browser tab.
You should now be back on the previous tab, where you had just finished creating the Glue crawler. Click on ‘Run it now?’ to crawl your S3 data.
This will take at least 1-2 mins from start to finish. Wait for the crawler to complete running (click on the refresh icon in the top-right a few times).
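If you prefer watching from a terminal instead of refreshing the console, the AWS CLI can report the crawler’s state (RUNNING, then STOPPING, then READY); substitute the crawler name you chose:

aws glue get-crawler --name [YOUR-CRAWLER-NAME] --query 'Crawler.State'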
To verify that the Glue crawler has crawled your S3 data, has automatically discovered the underlying schema, and has created a table on your behalf, click on ‘Tables’ on the left-hand pane.
Enter ‘retail_analytics_db’ in the ‘Search’ field to narrow results down (if necessary)
You should see a new table created in retail_analytics_db.
We will now query this table from Athena to verify.
Open Amazon Athena by pointing your browser tab at https://console.aws.amazon.com/athena
If you see a splash screen, click ‘Get started’
Click on the ‘Databases’ drop-down and choose ‘retail_analytics_db’
In it, you should see the table ‘prod_retail_data’. Click on the three vertical dots beside it to open up more options.
Click on ‘Preview table’ to query its contents.
If this is a new AWS Account or if you’ve never used Amazon Athena before, this will result in an error like so:
Click on ‘set up a query result location in Amazon S3’
For ‘Query result location’, enter a unique S3 bucket (as an s3:// path) where Athena can store query results.
Click ‘Save’.
Now run the query again and it should succeed.
Feel free to experiment by writing any Hive compatible query.
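For instance, a query like the one below surfaces the highest-scoring records. The column names are an assumption based on the POS events generated in Lab 1 (store_id, item_id, quantity, anomaly_score, and so on); adjust them to whatever schema the crawler actually discovered, keeping in mind that Athena lower-cases column names.

SELECT store_id, item_id, quantity, anomaly_score
FROM "retail_analytics_db"."prod_retail_data"
ORDER BY anomaly_score DESC
LIMIT 10;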
To recap: we stored the data being ingested in Lab 1 into Amazon S3 via a Kinesis Data Firehose delivery stream (using a Lambda function to annotate records along the way), crawled the stored data with an AWS Glue crawler to discover its schema, and queried the resulting table with Amazon Athena.
You can now jump to Lab 3 (Step F) to look up a forecast.