Summit 2019 - TechFest - Building Serverless Data Lakes on AWS

Author: Unni Pillai, Amazon Web Services

Architecture - Diagram



Part 1: Ingest and Storage

Create S3 Bucket

In this step, we will navigate to the S3 Console and create the S3 bucket used throughout this demo.

Log in to the AWS Console.

Navigate to the S3 Console and create a new bucket in the us-east-1 region.
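
For readers who prefer scripting over the console, the bucket creation can be sketched with boto3. This is a minimal sketch, not part of the original lab: the helper name `create_bucket` and the bucket name in the usage comment are placeholders, and the function expects an already constructed boto3 S3 client.

```python
def create_bucket(s3_client, bucket_name, region="us-east-1"):
    """Create a bucket; `s3_client` is assumed to be boto3.client("s3", region_name=region)."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be passed as a LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return s3_client.create_bucket(**kwargs)


# Example usage (requires boto3 and AWS credentials; the bucket name is a placeholder):
#   import boto3
#   create_bucket(boto3.client("s3", region_name="us-east-1"), "my-datalake-demo-bucket")
```

Bucket names are globally unique, so the placeholder name has to be replaced with one of your own.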

Create Kinesis Firehose

In this step we will navigate to the Kinesis Console and create a Kinesis Firehose delivery stream that ingests data and stores it in S3.
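
As a rough scripted equivalent, the delivery stream settings can be expressed as the arguments to boto3's `create_delivery_stream`. The helper name, the `raw/` prefix, and the buffering values below are assumptions made for illustration; the lab itself configures these in the console.

```python
def firehose_stream_config(stream_name, bucket_name, role_arn, prefix="raw/"):
    """Build keyword arguments for firehose.create_delivery_stream (Direct PUT source, S3 destination)."""
    return {
        "DeliveryStreamName": stream_name,
        "DeliveryStreamType": "DirectPut",  # producers write straight to Firehose
        "ExtendedS3DestinationConfiguration": {
            "RoleARN": role_arn,  # role that Firehose assumes to write to S3
            "BucketARN": f"arn:aws:s3:::{bucket_name}",
            "Prefix": prefix,  # assumed folder for raw data
            # Small buffers so objects show up in S3 quickly during a demo.
            "BufferingHints": {"SizeInMBs": 1, "IntervalInSeconds": 60},
        },
    }


# Example usage (requires boto3 and AWS credentials; all names are placeholders):
#   import boto3
#   cfg = firehose_stream_config("demo-stream", "my-datalake-demo-bucket", "arn:aws:iam::123456789012:role/demo-firehose-role")
#   boto3.client("firehose", region_name="us-east-1").create_delivery_stream(**cfg)
```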

Generate Dummy Data

In this step we will configure Kinesis Data Generator to produce fake data and ingest it into Kinesis Firehose.

Once the tool has sent ~100,000 messages, click "Stop sending data to Kinesis".

Record template:

{
  "uuid": "{{random.uuid}}",
  "device_ts": "{{date.utc("YYYY-MM-DD HH:mm:ss.SSS")}}",
  "device_id": {{random.number(50)}},
  "device_temp": {{random.weightedArrayElement(
    {"weights":[0.30, 0.30, 0.20, 0.20],"data":[32, 34, 28, 40]}
  )}},
  "track_id": {{random.number(30)}},
  "activity_type": {{random.weightedArrayElement(
    {
      "weights": [0.1, 0.2, 0.2, 0.3, 0.2],
      "data": ["\"Running\"", "\"Working\"", "\"Walking\"", "\"Traveling\"", "\"Sitting\""]
    }
  )}}
}

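To make the shape of the generated records concrete, here is a small Python sketch that produces one record with the same fields and weights as the template. It is an approximation, not the Kinesis Data Generator itself: the function name `fake_record` is made up, and the integer ranges assume that `random.number(n)` yields a value between 0 and n.

```python
import json
import random
import uuid
from datetime import datetime, timezone


def fake_record(rng=random):
    """Return one record shaped like the Kinesis Data Generator template."""
    now = datetime.now(timezone.utc)
    return {
        "uuid": str(uuid.uuid4()),
        # "YYYY-MM-DD HH:mm:ss.SSS": trim microseconds down to milliseconds.
        "device_ts": now.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3],
        "device_id": rng.randint(0, 50),  # assumed range for random.number(50)
        "device_temp": rng.choices([32, 34, 28, 40], weights=[0.30, 0.30, 0.20, 0.20])[0],
        "track_id": rng.randint(0, 30),  # assumed range for random.number(30)
        "activity_type": rng.choices(
            ["Running", "Working", "Walking", "Traveling", "Sitting"],
            weights=[0.1, 0.2, 0.2, 0.3, 0.2],
        )[0],
    }


if __name__ == "__main__":
    print(json.dumps(fake_record()))
```

Passing a seeded `random.Random` instance as `rng` makes the numeric fields reproducible, which helps when testing downstream steps.
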
Validate that data has arrived in S3

After a few moments, go to the S3 console and confirm that new objects have been delivered to the bucket.
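
The same check can be sketched in code by counting the objects under a prefix. The helper name `summarize_prefix` and the `raw/` prefix in the usage comment are assumptions; the function expects a boto3 S3 client.

```python
def summarize_prefix(s3_client, bucket_name, prefix=""):
    """Return (object_count, total_bytes) for keys under `prefix`."""
    count, total = 0, 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches the prefix.
        for obj in page.get("Contents", []):
            count += 1
            total += obj["Size"]
    return count, total


# Example usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   print(summarize_prefix(boto3.client("s3"), "my-datalake-demo-bucket", "raw/"))
```

A count of zero shortly after sending data usually just means that Firehose is still buffering.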

Part 2: Catalog Data

Create IAM Role

In this step we will navigate to the IAM Console and create a new Glue service role. This role allows AWS Glue to access the data in S3 and to create the necessary entities in the Glue Data Catalog.
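
What makes a role a "Glue service role" is its trust policy, which lets the Glue service assume it. The sketch below shows that policy and attaches the AWS managed `AWSGlueServiceRole` policy. The helper name and role name are placeholders, and the S3 permissions the lab needs would have to be passed in through `extra_policy_arns`.

```python
import json

# Trust policy: allows the AWS Glue service to assume the role.
GLUE_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# AWS managed policy with the baseline permissions Glue needs.
GLUE_SERVICE_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"


def create_glue_role(iam_client, role_name, extra_policy_arns=()):
    """Create the role and attach policies; `iam_client` is assumed to be boto3.client("iam")."""
    iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(GLUE_TRUST_POLICY),
    )
    for arn in (GLUE_SERVICE_POLICY_ARN, *extra_policy_arns):
        iam_client.attach_role_policy(RoleName=role_name, PolicyArn=arn)


# Example usage (requires boto3 and AWS credentials; the role name is a placeholder):
#   import boto3
#   create_glue_role(boto3.client("iam"), "AWSGlueServiceRoleDemo",
#                    ["arn:aws:iam::aws:policy/AmazonS3FullAccess"])
```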

Create AWS Glue Crawlers

In this step, we will navigate to the AWS Glue Console and create Glue crawlers to discover the newly ingested data in S3.
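
A crawler needs a name, the role it runs as, a target database, and an S3 path to scan. A minimal boto3 sketch, with every name in the usage comment being a placeholder rather than a value from the lab:

```python
def create_and_start_crawler(glue_client, name, role_arn, database, s3_path):
    """Create a crawler over one S3 path and start it; `glue_client` is assumed to be boto3.client("glue")."""
    glue_client.create_crawler(
        Name=name,
        Role=role_arn,
        DatabaseName=database,  # catalog database that receives the discovered tables
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue_client.start_crawler(Name=name)


# Example usage (requires boto3 and AWS credentials; all names are placeholders):
#   import boto3
#   create_and_start_crawler(boto3.client("glue"), "demo-crawler",
#                            "arn:aws:iam::123456789012:role/AWSGlueServiceRoleDemo",
#                            "demo_db", "s3://my-datalake-demo-bucket/raw/")
```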

Verify newly created tables in catalog

Navigate to the Glue Data Catalog and explore the crawled data.
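
The catalog can also be inspected programmatically. The sketch below lists each table in a database together with its columns; the helper name and the database name in the usage comment are placeholders.

```python
def list_tables(glue_client, database):
    """Return {table_name: [(column_name, column_type), ...]} for one catalog database."""
    tables = {}
    paginator = glue_client.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database):
        for table in page["TableList"]:
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            tables[table["Name"]] = [(c["Name"], c["Type"]) for c in columns]
    return tables


# Example usage (requires boto3 and AWS credentials; the database name is a placeholder):
#   import boto3
#   print(list_tables(boto3.client("glue"), "demo_db"))
```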

Query ingested data using Amazon Athena

Let's query the newly ingested data using Amazon Athena:

SELECT activity_type,
         count(activity_type)
FROM raw
GROUP BY  activity_type
ORDER BY  activity_type
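
Outside the console, an Athena query is asynchronous: you start it, poll until it reaches a final state, then fetch the results. A minimal boto3-based sketch of that flow follows; the helper name, database name, and output location are placeholders.

```python
import time


def run_athena_query(athena_client, sql, database, output_location,
                     poll_seconds=1.0, sleep=time.sleep):
    """Run one query and return the first page of result rows (the first row is the header)."""
    qid = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=qid)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        sleep(poll_seconds)  # still QUEUED or RUNNING
    if state != "SUCCEEDED":
        raise RuntimeError(f"Query {qid} ended in state {state}")
    return athena_client.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]


# Example usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   rows = run_athena_query(boto3.client("athena"), "SELECT * FROM raw LIMIT 10",
#                           "demo_db", "s3://my-datalake-demo-bucket/athena-results/")
```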

Part 3: Transform Data

Create Glue Development Endpoint

In this step you will create a Glue development endpoint to interactively develop Glue ETL scripts using PySpark.

It will take close to 10 minutes for the new Glue development endpoint to spin up.

You have to wait for this step to complete before moving to the next step.
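
Instead of refreshing the console, the wait can be scripted by polling the endpoint status. This is a sketch under assumptions: the helper name is made up, and the status strings `READY` and `FAILED` are assumed to match what the Glue API reports for development endpoints.

```python
import time


def wait_for_dev_endpoint(glue_client, endpoint_name, poll_seconds=30,
                          timeout_seconds=1200, sleep=time.sleep):
    """Poll until the endpoint is READY; raise if it fails or the timeout passes."""
    waited = 0
    while True:
        response = glue_client.get_dev_endpoint(EndpointName=endpoint_name)
        status = response["DevEndpoint"]["Status"]
        if status == "READY":
            return status
        if status == "FAILED":
            raise RuntimeError(f"Dev endpoint {endpoint_name} failed to provision")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Dev endpoint {endpoint_name} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Example usage (requires boto3 and AWS credentials; the endpoint name is a placeholder):
#   import boto3
#   wait_for_dev_endpoint(boto3.client("glue"), "demo-dev-endpoint")
```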

Create SageMaker Notebooks (Jupyter) for Glue Dev Endpoints

This will take a few minutes. Wait for it to finish before continuing.

Launch Jupyter Notebook

Follow the instructions in the notebook.
- Read and understand the instructions; they explain important Glue concepts.

Validate - Transformed / Processed data has arrived in S3

Once the ETL script has run successfully, go to the S3 console and confirm that the processed data has arrived.
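
The queries in the Analyze part use `artist_name` and `track_name`, which the raw records do not contain, so the processed data evidently carries extra track details. The sketch below illustrates one way such an enrichment could work, as a join on `track_id` in plain Python. It is an assumption for illustration only: the reference table and the function name are hypothetical, and the actual transformation is the PySpark script in the notebook.

```python
def enrich(raw_records, tracks):
    """Inner-join raw records with a (hypothetical) track reference table on track_id."""
    by_id = {t["track_id"]: t for t in tracks}
    enriched = []
    for record in raw_records:
        track = by_id.get(record["track_id"])
        if track is None:
            continue  # drop records without a matching track (inner-join semantics)
        enriched.append({
            **record,
            "track_name": track["track_name"],
            "artist_name": track["artist_name"],
        })
    return enriched


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the output.
    tracks = [{"track_id": 1, "track_name": "Song A", "artist_name": "Artist X"}]
    raw = [{"device_id": 7, "track_id": 1}, {"device_id": 8, "track_id": 99}]
    print(enrich(raw, tracks))
```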

Part 4: Analyze

Explore transformed data using Athena

In this step we will analyze the transformed data using Athena

Log in to the Amazon Athena Console.

SELECT artist_name,
         count(artist_name) AS count
FROM processed_data
GROUP BY  artist_name
ORDER BY  count desc

SELECT device_id,
         track_name,
         count(track_name) AS count
FROM processed_data
GROUP BY  device_id, track_name
ORDER BY  count desc
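
To see what these two aggregations compute, here is a small Python sketch that performs the same grouping on in-memory rows. The function names and sample rows are made up for illustration.

```python
from collections import Counter


def top_artists(rows):
    """Plays per artist, most played first (like GROUP BY artist_name ORDER BY count DESC)."""
    return Counter(
        r["artist_name"] for r in rows if r.get("artist_name") is not None
    ).most_common()


def plays_per_device_track(rows):
    """Plays per (device_id, track_name) pair, most played first."""
    return Counter(
        (r["device_id"], r["track_name"]) for r in rows if r.get("track_name") is not None
    ).most_common()


if __name__ == "__main__":
    # Made-up sample rows, only to show the shape of the results.
    rows = [
        {"device_id": 1, "track_name": "Song A", "artist_name": "Artist X"},
        {"device_id": 1, "track_name": "Song A", "artist_name": "Artist X"},
        {"device_id": 2, "track_name": "Song B", "artist_name": "Artist Y"},
    ]
    print(top_artists(rows))
    print(plays_per_device_track(rows))
```

Rows with a missing name are skipped, mirroring how `count(column)` ignores NULL values in SQL.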

Part 5: Visualize

Setting Up QuickSight

In this step we will visualize the processed data using QuickSight.

Log in to the Amazon QuickSight Console and complete the registration and sign-up.

Setting QuickSight Permissions

Adding a New Dataset

Using Amazon QuickSight to Visualize Our Processed Data

Visualization 1: Heat map of users and tracks they are listening to

In this step, we will create a visualization that shows which users are listening to the same tracks repeatedly.

If you hover over the dark blue patches on the heat map, you will see that those particular users are listening to the same track repeatedly.

Visualization 2: Tree map of most played Artist Names

In this step we will create a visualization that shows the most played artists.

Play around and explore Amazon QuickSight Console. Try out filters, other visualization types, etc.

Clean Up

Make sure you bring down / delete all resources created as part of this lab.

Failing to do this will result in incurring AWS usage charges.

Resources to delete
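
Part of the cleanup can be scripted. The sketch below covers only the Firehose delivery stream and the Glue crawler, development endpoint, and database created in the steps above. The S3 bucket, SageMaker notebook, IAM role, and QuickSight subscription are not handled here and still need to be removed separately. The function name and all resource names are placeholders.

```python
def cleanup(firehose_client, glue_client, stream_name, crawler_name,
            endpoint_name, database_name):
    """Try each deletion in turn; return a list of (resource, error) for anything that failed."""
    steps = [
        ("delivery stream",
         lambda: firehose_client.delete_delivery_stream(DeliveryStreamName=stream_name)),
        ("crawler", lambda: glue_client.delete_crawler(Name=crawler_name)),
        ("dev endpoint", lambda: glue_client.delete_dev_endpoint(EndpointName=endpoint_name)),
        ("database", lambda: glue_client.delete_database(Name=database_name)),
    ]
    failures = []
    for label, action in steps:
        try:
            action()
        except Exception as exc:  # keep going so one failure does not block the rest
            failures.append((label, str(exc)))
    return failures


# Example usage (requires boto3 and AWS credentials; all names are placeholders):
#   import boto3
#   print(cleanup(boto3.client("firehose"), boto3.client("glue"),
#                 "demo-stream", "demo-crawler", "demo-dev-endpoint", "demo_db"))
```

An empty list means every call succeeded; anything else should be checked by hand in the console.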