Lab 2 - Store & Analyze Ingested Data

In this lab, we will store ingested data (data we’re ingesting in Lab 1) into Amazon S3 using Amazon Kinesis Data Firehose. We will then crawl this stored data using AWS Glue to discover schema and query this data using Amazon Athena.

Console / GUI

If you’re coming here from Lab 1, you can continue where your browser already is.

CONDITIONAL/OPTIONAL STEPS:

If you didn’t close your browser tab in Lab 1, you can skip this section and jump to Step A below. But if you did, follow the next few steps to get back to where you were.

  1. Open a new tab and point it to https://console.aws.amazon.com/kinesis/home.

    Then click on ‘RetailDataAnalytics’, the Kinesis Data Analytics application that we created in Lab 1.

    Open Kinesis Data Stream

  2. Click on ‘Go to SQL results’

    Go to SQL Results

    Now you can continue from Step A #1 below.

Step A

  1. Click on the ‘Destination’ tab, and then click on ‘Connect to a destination’. We will create a ‘Firehose’ destination.

    Analytics Destination

  2. For ‘Destination’, choose ‘Kinesis Firehose delivery stream’

  3. And for ‘Kinesis Firehose delivery stream’, click on ‘Create new’. This will open up a new tab in which we will create and configure a new Kinesis Firehose delivery stream.

    Connect to Destination

  4. Create a Kinesis Firehose stream. Enter a descriptive name for ‘Delivery stream name’ and click ‘Create’

  5. In the ‘New delivery stream’ page, leave the rest of the options as-is. Click on ‘Next’

    Create Firehose

  6. In the subsequent ‘Process records’ page, scroll down to the ‘Transform source records with AWS Lambda’ section.

    For ‘Data transformation’ choose ‘Enabled’.

  7. Click on ‘Create new’

    Create Firehose Page 2

  8. Click on ‘General Firehose Processing’ (the first link)

    Choose a Lambda Blueprint

  9. This should open up a new tab to create a Lambda function to process records

    Configure Lambda Firehose Processor - Part 1

  10. Scroll down further to where you see the Lambda source code. Leave it as-is (the source code is NOT editable in this screen) and click on ‘Create function’

    Configure Lambda Firehose Processor - Part 2

  11. In this next screen, you can edit the Lambda source code.

    Configure Lambda Firehose Processor - Source Code
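
    For reference, below is a minimal Python sketch of what such a Firehose record processor does, modelled on the general processing blueprint: it decodes each record’s base64 payload, gives you a place to add your own annotation logic, and returns the record with a result of ‘Ok’. The code you see in the console may differ in its details.

    import base64

    def lambda_handler(event, context):
        output = []
        for record in event['records']:
            # Each incoming Firehose record carries its payload base64-encoded.
            payload = base64.b64decode(record['data'])

            # ... annotate / transform the payload here if needed ...

            output.append({
                'recordId': record['recordId'],
                'result': 'Ok',  # or 'Dropped' / 'ProcessingFailed'
                'data': base64.b64encode(payload).decode('utf-8')
            })
        return {'records': output}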

  12. Scroll down further until you reach the ‘Basic Settings’ section. Increase the Lambda function’s ‘Timeout’ value to ‘1 min’

    Increase Lambda Timeout
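
    If you prefer to make this change programmatically, here is a boto3 sketch (the function name below is hypothetical – use the one you just created):

    import boto3

    lambda_client = boto3.client('lambda')
    # Raise the timeout to 60 seconds, matching the '1 min' console setting.
    lambda_client.update_function_configuration(
        FunctionName='firehose-retail-processor',  # hypothetical name
        Timeout=60
    )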

  13. Scroll back up to the very top of the page and click on ‘Save’ (the button is in the top right-hand corner, so you’ll have to scroll all the way up to see it).

    Save Lambda

  14. OPTIONAL: You can also test out this Lambda function by sending it a mock Firehose record to see if it processes it successfully. You can do this by clicking on the ‘Test’ button and configuring a test event like below:

    {
      "invocationId": "invocationIdExample",
      "deliveryStreamArn": "arn:aws:kinesis:EXAMPLE",
      "region": "us-west-2",
      "records": [
        {
          "recordId": "49546986683135544286507457936321625675700192471156785154",
          "approximateArrivalTimestamp": 1495072949453,
          "data": "eyJDT0xfdGltZXN0YW1wIjoiMjAxOS0wOS0xOCAxMTo0OTozNi4wMDAiLCJzdG9yZV9pZCI6InN0b3JlXzQ5Iiwid29ya3N0YXRpb25faWQiOiJwb3NfMiIsIm9wZXJhdG9yX2lkIjoiY2FzaGllcl8xNDkiLCJpdGVtX2lkIjoiaXRlbV84NzU4IiwicXVhbnRpdHkiOjQsInJlZ3VsYXJfc2FsZXNfdW5pdF9wcmljZSI6MzcuMzQsInJldGFpbF9wcmljZV9tb2RpZmllciI6Mi4yMSwicmV0YWlsX2twaV9tZXRyaWMiOjc5LCJBTk9NQUxZX1NDT1JFIjowLjcyODk2NzM5NDUxNzg5MDV9Cg=="
        }
      ]
    }
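
    The ‘data’ field above is base64-encoded. If you’re curious what the mock record contains, you can decode it locally – a small Python sketch:

    import base64, json

    # Paste the full 'data' value from the test event above.
    encoded = "eyJDT0xfdGltZXN0YW1w..."
    print(json.dumps(json.loads(base64.b64decode(encoded)), indent=2))
    # You should see a POS record like the ones the Lab 1 generator emits
    # (store_id, item_id, quantity, price fields, ANOMALY_SCORE, and so on).
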
  15. You can now either close this browser tab where you configured the Lambda function to annotate Firehose records, or keep it open and switch back to the previous tab.

  16. In the ‘Choose a Lambda blueprint’ dialog, click ‘Close’

  17. Now, in the ‘Transform source records with AWS Lambda’ section, click on the ‘Lambda function’ drop-down and choose the Lambda function that we just created to annotate Firehose records.

    In case you don’t see the function, click on the ‘Refresh’ button next to it to reload the available Lambda functions.

  18. Click on ‘Next’ (scroll down a bit, if you have to)

  19. In the ‘Select a destination’ page, the ‘Amazon S3’ option should already be selected by default. If not, choose that as the option.

    Scroll down to the ‘S3 destination’ section and click on the ‘Create new’ button to create a new S3 bucket to store analytics data.

    Create S3 Destination

  20. All S3 bucket names, regardless of who or which account created them, need to be globally unique. Appending or prepending some string that is unique to you to ‘retail-analytics’ should help. For example, [COMPANY-NAME]-[SOME-UNIQUE-IDENTIFIER]-retail-analytics has a higher likelihood of being unique.

    Create S3 Bucket

  21. Click on ‘Create S3 Bucket’.
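
    For reference, the equivalent SDK call looks roughly like this (the bucket name is made up and the region is assumed to be us-west-2 – adjust both to your own setup):

    import boto3

    s3 = boto3.client('s3', region_name='us-west-2')
    # Bucket names are global, so include something unique to you.
    s3.create_bucket(
        Bucket='examplecorp-a1b2c3-retail-analytics',  # hypothetical name
        CreateBucketConfiguration={'LocationConstraint': 'us-west-2'}
    )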

  22. Configure the S3 destination.

    Configure S3 Destination

  23. Click on ‘Next’. This should auto-close the tab you’re on and lead you right back to the ‘Kinesis Firehose - Create delivery stream’ page.

  24. Under the ‘S3 buffer conditions’ section, for ‘Buffer size’ enter 1 (MB) and for ‘Buffer interval’ enter 60 (seconds).

    S3 Buffer Configuration
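
    These two values are Firehose’s buffering hints: a batch is flushed to S3 once 1 MB has accumulated or 60 seconds have elapsed, whichever comes first. In API terms (a sketch only – the console wizard sets this for you) they correspond to:

    # The buffering-related portion of a Firehose S3 destination configuration.
    buffering_hints = {
        'SizeInMBs': 1,           # flush once 1 MB has accumulated ...
        'IntervalInSeconds': 60,  # ... or once 60 seconds have passed
    }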

  25. Scroll down to the ‘Permissions’ section and for ‘IAM role’ click ‘Create new or choose’

    Create Firehose IAM Role

  26. This will open up a new tab with a pre-created IAM Role and policy which you can just authorize by clicking on ‘Allow’.

    Authorize Firehose IAM Role Creation

    Clicking on ‘Allow’ will automatically close the IAM tab and take you right back to your original tab, but now with the ‘IAM role’ pre-selected with the role that we just created.

  27. Now click on ‘Next’. You may need to scroll down a bit.

    Click Next

  28. This is the final screen. Leave everything as-is, scroll down, and click on ‘Create delivery stream’.

    Create Delivery Stream

  29. You will first see an in-progress flash message…

    Firehose Create InProgress

  30. This is followed by a success flash message, if all goes well.

    Firehose Create Success

You can now close this browser tab, or switch to the previous one (where you have the Kinesis Data Analytics application configuration open – the browser tab that looks like the screenshot below in Step B #1).

Step B

We have successfully configured a Kinesis Data Firehose Delivery Stream and we’re now going to configure our Kinesis Data Analytics to use this as its destination.

  1. For ‘Kinesis Firehose delivery stream’, ensure that the delivery stream you just created (e.g., ‘Retail Analytics Delivery Stream’) is selected.

    Choose Firehose Delivery Stream

  2. Scroll down to the ‘In-application Stream’ section and for ‘In-application stream name’, choose ‘DESTINATION_STREAM’ (the stream we created via Streaming SQL in Step E #1 and Step F #1 of Lab 1)

    Choose In-Application Stream

  3. Click on ‘Save and continue’. This might take a few tens of seconds.

  4. Click on ‘Go to SQL results’

    Go to SQL Results

  5. CONDITIONAL STEP: Check that your script from Lab 1 is still running in Cloud9. If not, run it now:

    cd ~/environment/retail/lab1/src
    ruby gen_pos_log_stream.rb

    Wait for the script to start…

  6. Open up another browser tab and point it to https://console.aws.amazon.com/s3 and navigate to the bucket that you created. Click into this bucket and wait at least a minute (since we configured Kinesis Firehose’s buffer as 60 seconds or 1 MB – whichever is hit first) and refresh again. You should now see this bucket start to fill up with data.

    Data in S3
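
    If you’d rather check from code than from the console, a quick boto3 sketch (substitute your bucket name):

    import boto3

    s3 = boto3.client('s3')
    resp = s3.list_objects_v2(Bucket='YOUR_S3_BUCKET_NAME', MaxKeys=10)
    for obj in resp.get('Contents', []):
        print(obj['Key'], obj['Size'])
    # Firehose writes objects under a date-based prefix such as 2019/09/18/11/...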

Step C - Crawl S3 Data with AWS Glue

Open a new tab in your browser and point it to https://console.aws.amazon.com/glue

  1. If you see a Splash Screen, just click ‘Get Started’

    Get Started with Glue

  2. Click on ‘Add tables using a crawler’.

    Add database

  3. Give this crawler a descriptive name and click ‘Next’.

    Name the Crawler

  4. For crawler source, choose ‘Data stores’ (selected by default) and click ‘Next’.

    Choose Crawler Source

  5. Add a data store by specifying the S3 bucket name into which data is being ingested and click ‘Next’.

    Add Data Store

  6. On the ‘Add another data store’ screen, leave the default values as-is and click ‘Next’.

    Add Another Data Store

  7. For ‘IAM role’, enter a unique suffix (the wizard will create a role named ‘AWSGlueServiceRole-[YOUR-SUFFIX]’) and click ‘Next’.

    Choose IAM Role

  8. For ‘Frequency’, leave the choice as ‘Run on demand’ and click on ‘Next’.

    Schedule Crawl Frequency

  9. Under ‘Configure the crawler’s output’, click on ‘Add database’

    Configure Crawler Output

  10. In the ‘Add database’ dialog box, for ‘Database name’ enter ‘retail_analytics_db’ (or any name you like, but just remember to use this value when querying) and click on ‘Create’

    Add Database

  11. Click on ‘Next’

  12. Click ‘Finish’

    Finish

  13. You will see a success flash message with an option to run the newly created crawler right away. DO NOT RUN IT YET! (If you already clicked it, no harm done.)
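
    For reference, the crawler wizard you just completed corresponds roughly to a single Glue API call. Here is a boto3 sketch with illustrative names (it only creates the crawler, it does not run it):

    import boto3

    glue = boto3.client('glue')
    glue.create_crawler(
        Name='retail-analytics-crawler',           # hypothetical crawler name
        Role='AWSGlueServiceRole-YourIdentifier',  # the role from step 7 above
        DatabaseName='retail_analytics_db',
        Targets={'S3Targets': [{'Path': 's3://YOUR_S3_BUCKET_NAME/'}]}
        # No Schedule argument, so the crawler runs on demand as configured above.
    )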

Step D - Update Glue IAM Role with Permissions to Access S3

  1. Open up a new browser tab and point it at https://console.aws.amazon.com/iam.

  2. Click on ‘Roles’ in the left-hand pane (see the screenshot below step #3)

  3. Then type ‘Glue’ in the Search box, which should bring up a role named ‘AWSGlueServiceRole-[SOME_UNIQUE_IDENTIFIER]’ – the role created for the crawler in Step C #7.

    Click on it.

    Update Glue Service Linked Role

  4. In the subsequent screen, click again on the Glue service linked role named ‘AWSGlueServiceRole-[SOME_UNIQUE_IDENTIFIER]’

    Click Glue Service Linked Role

  5. Click on ‘Edit Policy’

    Edit Glue Service Linked Role Policy

  6. Click on ‘JSON’.

  7. Copy and paste the permissions policy below.

    NOTE: Replace YOUR_S3_BUCKET_NAME with what you used so that Glue may access YOUR S3 bucket.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject"
                ],
                "Resource": [
                    "arn:aws:s3:::YOUR_S3_BUCKET_NAME/prod-retail-data/*"
                ]
            }
        ]
    }   
  8. Click on ‘Review policy’

  9. Click on ‘Save changes’

  10. Close this browser tab.

  11. You should now be back on the previous tab, where you had just finished creating the Glue crawler. Click on ‘Run it now?’ to crawl your S3 data.

    This will take at least 1-2 mins from start to finish. Wait for the crawler to complete running (click on the refresh icon in the top-right a few times).

    Run Crawler
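
    You can also poll the crawler’s state from code while you wait – a sketch, assuming the crawler name you chose in Step C:

    import boto3, time

    glue = boto3.client('glue')
    while True:
        state = glue.get_crawler(Name='retail-analytics-crawler')['Crawler']['State']
        print(state)  # e.g. RUNNING, STOPPING, READY
        if state == 'READY':
            break
        time.sleep(15)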

  12. To verify that the Glue crawler has crawled your S3 data, has automatically discovered the underlying schema, and has created a table on your behalf, click on ‘Tables’ on the left-hand pane.

  13. Enter ‘retail_analytics_db’ in the ‘Search’ field to narrow results down (if necessary)

    You should see a new table created in retail_analytics_db.
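
    The same check can be made with the SDK, using the database name created earlier:

    import boto3

    glue = boto3.client('glue')
    tables = glue.get_tables(DatabaseName='retail_analytics_db')['TableList']
    print([t['Name'] for t in tables])  # should include the newly crawled table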

Step E - Query the Table

We will now query this table from Athena to verify.

  1. Open Amazon Athena by pointing your browser tab at https://console.aws.amazon.com/athena

  2. If you see a splash screen, click ‘Get started’

    Athena Splash Screen

  3. Click on the ‘Databases’ drop-down and choose ‘retail_analytics_db’

  4. In it, you should see the table ‘prod_retail_data’. Click on the dotted link beside it to open up multiple options.

  5. Click on ‘Preview table’ to query its contents.

  6. If this is a new AWS Account or if you’ve never used Amazon Athena before, this will result in an error like so:

    Click on ‘set up a query result location in Amazon S3’

  7. For ‘Query result location’, enter an S3 location where Athena can store query results (a globally unique bucket name, optionally with a prefix).

  8. Click ‘Save’.

  9. Now run the query again and it should succeed.

  10. Feel free to experiment by writing your own queries against this table (Athena supports standard SQL); a hedged example is sketched below.
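
    As a starting point, here is a boto3 sketch that runs a simple aggregation against the crawled table. The column names are assumed from the Lab 1 record format and may differ in your table; the output location is the bucket you configured in step 7:

    import boto3

    athena = boto3.client('athena')
    athena.start_query_execution(
        QueryString="""
            SELECT store_id, item_id, SUM(quantity) AS total_quantity
            FROM prod_retail_data
            GROUP BY store_id, item_id
            ORDER BY total_quantity DESC
            LIMIT 10
        """,
        QueryExecutionContext={'Database': 'retail_analytics_db'},
        ResultConfiguration={'OutputLocation': 's3://YOUR_ATHENA_RESULTS_BUCKET/'}
    )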

Recap

To recap, we connected the Kinesis Data Analytics application from Lab 1 to a new Kinesis Data Firehose delivery stream (with a Lambda function annotating each record along the way), delivered the output into an Amazon S3 bucket, crawled that bucket with an AWS Glue crawler to discover its schema, and queried the resulting table with Amazon Athena.

You can now jump to Lab 3 (Step F) to look up a forecast.

