Introduction

 

AWS provides a suite of cloud services for creating "serverless" deployments: managed services that handle file storage, function execution, database storage, APIs, authentication, and more. Leveraging these services can lead to increased development velocity, built-in scalability, and potential cost savings. This exploration into building a serverless system involves two parts: creating an Extract, Transform, and Load (ETL) workflow and then deploying an authenticated REST API.

Requirements
AWS Services

To build out the first part's ETL pipeline, we will leverage S3, Lambda, and DynamoDB. S3 will store the raw data files and broadcast events when changes occur within the S3 bucket (create, update, delete, and so on). DynamoDB is a fully managed NoSQL database that offers flexible, performant database scalability. Finally, Lambda is an event-driven compute service with pay-per-use pricing, scalability, and performance baked into the product. When the term "serverless" is used to describe an architecture, AWS Lambda and other Functions as a Service (FaaS) offerings are the technologies the term refers to. Google Cloud, Microsoft Azure, Iron.io, Webtask.io, and IBM Bluemix are some of the FaaS providers that compete with AWS Lambda.

SDKs

AWS Lambda supports code written in Python, Node.js, C#, or Java. Depending on your use case, these languages and their libraries are available to help complete the task at hand. For this first part, we will use Python and the Boto3, Pyexcel, and PynamoDB modules to parse, clean, and load the sample Excel sheets into DynamoDB. To make this guide more universally applicable, we will use virtualenv to create the Python 2.7 environment needed to develop the Lambda function.

Create a Role

The Lambda function will need access to S3 and DynamoDB to execute the ETL process. For simplicity, we will create a single power-user role. Go to the AWS console's IAM service, click the Roles tab, and click Create New Role. Name the role serverless-power-user, click Next Step, and select AWS Lambda. From the policy list, select Power User and click Review. Review the role settings and click Create Role.


Exercise Dataset

For the following guide, we will use the City of San Francisco Department of Building Inspection's issued building permit reports. These reports are posted monthly to the department's website with a predictable table schema and are downloadable as Excel files.

ETL Planning

The goal of part one is to make these monthly reports more accessible and intelligible for future services and applications to leverage. As the data grows, Excel spreadsheets become unsustainable, poorly distributed, and prone to error. This ETL pipeline will be able to ingest the city's past and future reports to feed a scalable, comprehensive permits database.

Dataset Review

The raw monthly reports will be uploaded to an S3 location and stored in their original Excel format (.xlsx). Each report consists of a single spreadsheet with a straightforward tabular design. The first row holds the table's column names, and each subsequent row is a record. The columns contain information about each issued permit, such as application number, estimated cost, existing use, proposed use, street number, street name, and description.


Database Tables

DynamoDB is a NoSQL database, and its schema-less design allows record attributes to vary between items within a single table. To create a new table, a partition key (primary key) and read/write capacity units (the number of reads/writes on the table per second) are required. The partition key must be a string, number, or binary datatype and must be unique across items. If the desired partition key is NOT unique, a sort key can be added so that the combination of partition and sort keys is unique. Global secondary indexes and local secondary indexes are also available for additional query patterns and improved query performance. For more information on table index design and capacity, refer to the AWS docs.

For the example dataset, we will create a permits table to store the records extracted from the monthly reports. The example dataset's APPLICATION # column would seem to be a logical partition key, but we will rename it application_number to make the key more semantic. To store the permit records, every record loaded must have an application_number associated with it as the partition key. Since application_number is NOT unique, we will also add a sort key using the record's index number to create a unique partition and sort key combination.

 

// permits table schema

{
  "AttributeDefinitions": [
    {"AttributeName": "application_number", "AttributeType": "S"},
    {"AttributeName": "record_id", "AttributeType": "N"}
  ],
  "KeySchema": [
    {"AttributeName": "application_number", "KeyType": "HASH"},
    {"AttributeName": "record_id", "KeyType": "RANGE"}
  ],
  "ProvisionedThroughput": {
    "ReadCapacityUnits": 5,
    "WriteCapacityUnits": 5
  }
}
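
If you would rather script the table creation than use the console steps later in this guide, the same schema can be created with the AWS CLI. This is a sketch; it assumes the AWS CLI is installed and configured for your target region.

// create the permits table with the schema above
$> aws dynamodb create-table \
 --table-name permits \
 --attribute-definitions \
 AttributeName=application_number,AttributeType=S \
 AttributeName=record_id,AttributeType=N \
 --key-schema \
 AttributeName=application_number,KeyType=HASH \
 AttributeName=record_id,KeyType=RANGE \
 --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5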

Data Transformations

The Lambda function will be invoked when a new report is uploaded to the S3 bucket. The function will use Python to read the Excel spreadsheet, transform the records to fit the table schema, and load them into the permits table created in DynamoDB.

Creating S3 Bucket

Go to the S3 service in the AWS console and click Create Bucket. Choose a unique bucket name and pick the same region you will use to deploy the Lambda function and DynamoDB table. If you already have a bucket you would like to use to store the permit reports, you can skip creating a new one.

Once the bucket is deployed, create a folder called reports to hold all of the uploaded building permit reports. When we deploy the Lambda function to process the reports, we will create the bucket event for the reports folder to invoke the Lambda function.
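
The bucket and the empty reports/ prefix can also be created from the command line. This is a sketch; it assumes the example bucket name used later in this guide and the us-west-2 region.

// create the bucket (bucket names must be globally unique)
$> aws s3 mb s3://serverless-example-record-storage --region us-west-2

// create the empty reports/ folder
$> aws s3api put-object --bucket serverless-example-record-storage --key reports/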

Creating DynamoDB Table

Go to the DynamoDB service in the AWS console and click Create Table. Create a table named permits with the partition key application_number as a string and the sort key record_id as a number. Then click Create. For this example, we will keep the default settings and not create any additional global or local secondary indexes.

Once the permits table is created, our Lambda function will be able to access the table and write records.

Creating Lambda Functions

We will create an initial blank function to visually walk through the creation process and later update it with our code. Go to the Lambda service in the AWS console and click Create a Lambda function, then click Blank Function.

Next, configure the Lambda function trigger to be invoked by the S3 Object Created (PUT) event on the bucket you created: set the prefix to the reports/ folder, set the suffix to .xlsx, and check Enable trigger. Once the trigger is configured, click Next.
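
The same trigger can also be scripted with the S3 notification API once the function exists. This is a sketch; the account ID in the Lambda function ARN is a placeholder you would replace with your own. Note that when the trigger is configured outside the console, S3 must also be granted permission to invoke the function (for example with aws lambda add-permission).

// notification.json
{
  "LambdaFunctionConfigurations": [
    {
      "LambdaFunctionArn": "arn:aws:lambda:us-west-2:123456789012:function:load-permit-records",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {
        "Key": {
          "FilterRules": [
            {"Name": "prefix", "Value": "reports/"},
            {"Name": "suffix", "Value": ".xlsx"}
          ]
        }
      }
    }
  ]
}

// apply the notification configuration to the bucket
$> aws s3api put-bucket-notification-configuration \
 --bucket serverless-example-record-storage \
 --notification-configuration file://notification.json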

First, we will configure the function's Name, Description, and Runtime.

Scroll down past the Lambda function code section because we will update the code later on. Next, we will set the function handler, role, and advanced settings. Name the handler lambda_permits.lambda_handler, choose our previously created, existing role serverless-power-user, use the default 128 MB of memory, and set the Timeout to the maximum of 5 minutes.

Scroll to the bottom of the page and click Next. Review the function settings and click Create Function.
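
If you prefer to script this step, an equivalent function can be created with the AWS CLI once the deployment package built later in this guide exists. This is a sketch; the account ID in the role ARN is a placeholder.

// create the function from a local package.zip
$> aws lambda create-function \
 --function-name load-permit-records \
 --runtime python2.7 \
 --role arn:aws:iam::123456789012:role/serverless-power-user \
 --handler lambda_permits.lambda_handler \
 --memory-size 128 \
 --timeout 300 \
 --zip-file fileb://package.zip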

Writing the Function

Lambda functions can execute code in a few different languages; we will use Python to transform and load the Excel reports. The function's code will read the Excel file, parse the records, clean the data, and load it into the DynamoDB permits table. To do so, we will rely on two outside Python packages: pyexcel to parse the Excel files and pynamodb to load the data into DynamoDB. Lambda functions that rely on third-party Python packages must be uploaded as a zipped deployment package containing both the function code and the packages it references.

*Note: The Boto3 AWS SDK for Python is included in the Python Lambda runtime, so it does not need to be installed in the deployment package; install it locally only if you want to test the function outside of Lambda.

 

To start developing the function, we will create a new virtualenv.

// install virtualenv if necessary
$> pip install virtualenv

$> mkdir lambda_project && cd lambda_project

// create virtualenv for project
$> virtualenv venv

// start virtualenv
$> source venv/bin/activate

// virtualenv is now running
(venv)$>

// To stop running virtualenv when finished
(venv)$> deactivate

 

Now that the virtualenv is started, we can load the third party packages necessary to create the Lambda function.

// install pyexcel and pyexcel-xls plugin to read xlsx
(venv)$> pip install pyexcel
(venv)$> pip install pyexcel-xls

// install pynamodb
(venv)$> pip install pynamodb

// optionally install boto3
(venv)$> pip install boto3

 

The third-party packages are now installed, and we can start writing the code that will be executed in our Lambda function. Using pynamodb, we will create the model for the records in the permits table. Within the project directory, create a file named models.py.

## models.py

from pynamodb.models import Model
from pynamodb.attributes import (
    UnicodeAttribute, NumberAttribute, UTCDateTimeAttribute
)

## This function creates a new class used to create permits
## to load into the DynamoDB permits table

def PermitsModel(db_region='', db_host='http://localhost:8000'):

    ## See the docs on creating a model
    ## http://pynamodb.readthedocs.io/en/latest/tutorial.html

    class PermitClass(Model):
        class Meta:
            table_name = 'permits'
            read_capacity_units = 5
            write_capacity_units = 5
            region = db_region
            host = db_host

        ## set the partition key
        application_number = UnicodeAttribute(hash_key=True)

        ## set the sort key
        record_id = NumberAttribute(range_key=True)

        ## define other expected attributes
        status = UnicodeAttribute(default='issued')
        status_date = UTCDateTimeAttribute(null=True)
        file_date = UTCDateTimeAttribute(null=True)
        expiration_date = UTCDateTimeAttribute(null=True)
        estimated_cost = NumberAttribute(default=0)
        revised_cost = NumberAttribute(default=0)
        existing_use = UnicodeAttribute(default='')
        proposed_use = UnicodeAttribute(default='')
        description = UnicodeAttribute(default='')
        address = UnicodeAttribute(default='')
        load_date = UTCDateTimeAttribute(null=True)

    return PermitClass
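
Because the factory function takes a region and a host, the same model can point at DynamoDB Local during development or at the live regional endpoint inside Lambda. A quick usage sketch:

## usage sketch of the PermitsModel factory
from models import PermitsModel

## local development against DynamoDB Local (assumed to be running on port 8000)
Permits = PermitsModel('us-west-2', 'http://localhost:8000')

## in Lambda we will instead point at the live endpoint:
## Permits = PermitsModel('us-west-2', 'https://dynamodb.us-west-2.amazonaws.com')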

 

Next, we will create utility functions to filter, format, and write our data. Create a utils.py file.

## utils.py

import datetime
from time import sleep

## Function to adjust table capacity to handle high-capacity loads
## and reset capacity after records have been loaded
def adjustCapacity(conn, table, read_capacity=5, write_capacity=5):
    conn.update_table(table, read_capacity_units=read_capacity, write_capacity_units=write_capacity)

## Handle unicode characters in the DESCRIPTION field
def clean_unicode(unic):
    return unic.encode('utf-8').strip()

## Set number fields to 0 if blank
def clean_int(num):
    if num is None:
        return int(0)
    else:
        return int(num)

## Merge multiple fields to create an address column
def clean_address(rec):
    st_num = str(rec['STREET_NUMBER']).capitalize()
    st_name = str(rec['AVS_STREET_NAME']).capitalize()
    st_sfx = str(rec['AVS_STREET_SFX']).capitalize()
    unit = str(rec['UNIT']).capitalize()
    inputs = [st_num, st_name, st_sfx, unit, 'San Francisco, CA']
    return ' '.join([x for x in inputs if x != 'None'])

## Create a new dict from the excel record with the desired fields
def clean_record(rec):
    update = {}
    update['application_number'] = str(rec['APPLICATION #'])
    update['status'] = str(rec['STATUS']).lower()
    update['status_date'] = rec['STATUS_DATE']
    update['file_date'] = rec['FILE_DATE']
    update['expiration_date'] = rec['EXPIRATION_DATE']
    update['estimated_cost'] = clean_int(rec['ESTIMATED COST'])
    update['revised_cost'] = clean_int(rec['REVISED COST'])
    update['existing_use'] = str(rec['EXISTING USE']).lower()
    update['proposed_use'] = str(rec['PROPOSED USE']).lower()
    update['description'] = clean_unicode(rec['DESCRIPTION'])
    update['address'] = clean_address(rec)

    return update

## Clean the excel record and create a Permit model with the record data
def create_permit(record, record_id, model, load_date):
    rec = clean_record(record)

    permit = model(rec['application_number'], record_id)
    permit.status = rec['status']
    permit.status_date = rec['status_date']
    permit.file_date = rec['file_date']
    permit.expiration_date = rec['expiration_date']
    permit.estimated_cost = rec['estimated_cost']
    permit.revised_cost = rec['revised_cost']
    permit.existing_use = rec['existing_use']
    permit.proposed_use = rec['proposed_use']
    permit.description = rec['description']
    permit.address = rec['address']
    permit.load_date = load_date

    return permit

## Iterate over the excel records and write them to DynamoDB in batches
def write_records(records, create_record, model):
    load_date = datetime.datetime.now()

    with model.batch_write() as batch:
        for index, record in enumerate(records):
            try:
                model_record = create_record(record, index, model, load_date)
                batch.save(model_record)
                sleep(0.025)

            except Exception:
                continue

 

Now that our utils and models are created, we can write the Lambda handler function that will be triggered by our S3 event. The handler specified in our Lambda function was lambda_permits.lambda_handler, where lambda_permits refers to the file name and lambda_handler refers to the function to be executed. Create a new file named lambda_permits.py.

## lambda_permits.py

import boto3
import urllib
import pyexcel as pe
from pynamodb.connection import Connection

## Our created model
from models import PermitsModel

## Our created utils
from utils import (write_records, create_permit, adjustCapacity)

s3 = boto3.client('s3')
Permits = PermitsModel('us-west-2', 'https://dynamodb.us-west-2.amazonaws.com')
conn = Connection(region='us-west-2', host='https://dynamodb.us-west-2.amazonaws.com')

def lambda_handler(event, context):
    ## set temp file
    ## '/tmp' is the writable directory in a Lambda function
    report_file = '/tmp/report.xlsx'

    ## Get bucket and key from the PUT event
    bucket = event['Records'][0]['s3']['bucket']['name'].encode('utf8')
    key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key'].encode('utf8'))

    ## Create the DynamoDB table if it does not exist
    if not Permits.exists():
        Permits.create_table(wait=True)

    try:
        ## Download the excel file of permits (binary write mode)
        with open(report_file, 'wb') as report:
            s3.download_fileobj(bucket, key, report)

        ## expand table capacity during the load
        adjustCapacity(conn, 'permits', 50, 50)

        ## parse records
        records = pe.iget_records(file_name=report_file)

        ## write records to the "permits" table
        write_records(records, create_permit, Permits)

        ## decrease table capacity after the load
        adjustCapacity(conn, 'permits', 5, 5)

    except Exception as e:
        adjustCapacity(conn, 'permits', 5, 5)
        print(e)

    finally:
        return "load complete"

 

Our code is now ready to be deployed to our function.
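
Before packaging, the handler can optionally be smoke-tested locally with a hand-built S3-style event. This sketch assumes AWS credentials are configured locally and that the sample report already exists at the given bucket and key; the file name test_local.py is just an example.

## test_local.py (optional local smoke test)
import lambda_permits

## minimal S3 event containing only the fields the handler reads
event = {
    'Records': [{
        's3': {
            'bucket': {'name': 'serverless-example-record-storage'},
            'object': {'key': 'reports/December2016Issued.xlsx'}
        }
    }]
}

print(lambda_permits.lambda_handler(event, None))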

Packaging and Updating Function

We will zip our code and its third-party package dependencies to create a deployment package for Lambda to execute.

Create a new package.zip and add our code files.

(venv)$> zip -9 package.zip lambda_permits.py utils.py models.py

 

Next, we will navigate to the virtualenv's site-packages directory and add its contents to package.zip.

// cd into virtualenv site-packages directory
(venv)$> cd $VIRTUAL_ENV/lib/python2.7/site-packages

// recursively add all site packages
(venv)$> zip -r9 ../../../../package.zip *

// cd back into the first level of our project directory
(venv)$> cd ../../../../


 

Finally, we will upload the package to S3 and update the Lambda function's code.

// Copy package to S3
(venv)$> aws s3 cp package.zip \
 s3://serverless-example-record-storage/load-permit-records/package.zip
// Update the lambda function with new package code
(venv)$> aws lambda update-function-code \
 --function-name load-permit-records \
 --s3-bucket serverless-example-record-storage \
 --s3-key load-permit-records/package.zip

The function is now up to date and ready to be run.
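
You can also invoke the deployed function manually to confirm the update before wiring it to real data. This is a sketch, where event.json is a hypothetical file holding a Records/s3 payload shaped like the one shown in the local test above.

// invoke the function with a saved test event and inspect the response
(venv)$> aws lambda invoke \
 --function-name load-permit-records \
 --payload file://event.json \
 output.json

(venv)$> cat output.json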

Invoking the Pipeline

We have just deployed the storage service, the Lambda function, and the database table. The ETL pipeline is now ready to ingest and load data. We have configured the load to be triggered by S3 PUT events on the reports/ prefix in our bucket. Let's upload the raw data to S3 and let the loading begin.

// Download the sample dataset from the City of SF site
// December 2016 Issued Permits http://sfdbi.org/file/10166
$> aws s3 cp December2016Issued.xlsx \
 s3://serverless-example-record-storage/reports/December2016Issued.xlsx

 

The function will be triggered and the data loaded.


To add other records from the city's data, copy the Excel reports to the same S3 bucket location and they will be loaded into our permits table.
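
To spot-check what was loaded, the table can be read back with the same PynamoDB model from the project virtualenv. A sketch, assuming AWS credentials are configured:

## print a few loaded permits to verify the load
from models import PermitsModel

Permits = PermitsModel('us-west-2', 'https://dynamodb.us-west-2.amazonaws.com')

for permit in Permits.scan(limit=5):
    print('%s  %s' % (permit.application_number, permit.address))
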
Up Next

Now that the pipeline has been created and data is successfully being extracted, transformed, and loaded, the next part will focus on accessing the data by building a REST API. The goal will be to deploy an authenticated service to query the data, create subset data extracts, and monitor the service.

Additional Resources
Linux Academy has a course on using AWS Lambda called Lambda Deep Dive. The course gives you a comprehensive look at how to get started with Lambda, and how to build more advanced functions.
If you’re interested in learning more about Serverless, Linux Academy also has a course on Serverless Concepts.

Python AWS SDK (Boto3)

Pyexcel

City of San Francisco Building Permits Data Source
