AWS Lambda for Beginners: Overcoming the Most Common Challenges
Learn how to solve three common problems that you’ll likely encounter when using AWS Lambda: adding dependencies to your function, overcoming the 250MB function size limitation for application code and dependencies, and building a continuous deployment pipeline.
Developing data pipelines - automated processes that acquire, transform, and store data - is a common need for most application developers in today’s always-on, 24/7 economy. Serverless architectures provide the perfect solution for managing these pipelines in a lightweight way, with elastic pricing and almost no operational overhead.
Whether you are recording IoT metrics, building stock or crypto applications, or powering machine learning models, serverless infrastructure is the go-to solution to move time-series data around. (If you’re new to time-series data and want to learn more, check out our “what is time-series data?” blog post.)
In the world of serverless service providers, AWS Lambda is one of the most popular choices. This article covers three common challenges you’ll likely face while building data pipelines with AWS Lambda, along with a step-by-step solution to each.
You will see how to add external dependencies to your AWS Lambda function, overcome the 250MB package limit, and set up continuous deployment.
What is AWS Lambda?
Before we get into the nitty-gritty of solving challenges, let’s first understand what AWS Lambda is. AWS Lambda, or just Lambda, is a serverless computing service that enables you to run code without provisioning or managing servers. Lambda makes it easy to call or automatically trigger Java, Go, PowerShell, Node.js, C#, Python, and Ruby code. You can also connect AWS Lambda with AWS API Gateway to expose your function as an API endpoint or automatically run the function periodically using AWS EventBridge or AWS SNS/SQS.
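To give you a feel for the programming model, here is a minimal sketch of a Python handler (the function name, event fields, and return shape are illustrative, not from any specific tutorial). Lambda invokes the handler with the triggering event and a context object that carries runtime metadata:

def handler(event, context):
    # 'event' holds the payload from whatever triggered the function
    # (an API Gateway request, an EventBridge schedule, an SQS message, etc.)
    # 'context' exposes runtime information such as the remaining execution time
    name = event.get("name", "world")
    return {"statusCode": 200, "body": f"Hello, {name}!"}

When you create the function, you point Lambda at this handler (for example, function.handler if the code lives in function.py).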
Let’s cover some of the issues and considerations for Lambda!
Size matters!
When it comes to Lambda, you can be sure of one thing: size absolutely matters! There are hard limits for a Lambda package size which can be a deal-breaker depending on your use case. For example, suppose your Python function depends on Pandas, NLTK, and a database connector library. In this scenario, the whole unzipped package size is already over the limit of 250MB, so you won’t be able to upload your function, let alone run it.
Historically, Lambda has been an excellent choice only for small-to-medium-sized functions with small dependencies. The moment you needed to include a larger dependency, it became questionable whether Lambda was the best tool for your process. Although you could have used another serverless option like AWS Fargate or bootstrapped a specialized AWS EC2 instance to process your data, they generally required more operational experience and overhead.
Fortunately, during the AWS re:Invent 2020 event, Amazon introduced a new feature that made it possible to deploy larger functions within Lambda and use more dependencies than before: container support.
That’s right! You can now upload Docker images as Lambda functions with an upper limit of 10GB! This feature is a massive upgrade from the previous 250MB hard limit: you containerize your function code, upload it to ECR, and then deploy it directly to Lambda.
Pros and cons of AWS Lambda for data pipelines
Before deploying your data pipeline in the cloud, it’s important to look at your options and decide which serverless provider and service best fit your needs.
Here’s a non-exhaustive list of major pros and cons of Lambda:
AWS Lambda pros
- Cheaper than regular hosting (like EC2)
Data pipelines usually get triggered by an event or are scheduled to run at specific time intervals; they don’t need to be up and running all the time. With Lambda, you only pay for the compute your function actually uses (billed in milliseconds), with no minimum execution time.
- No server management
Lambda is also sometimes referred to as FaaS (function as a service): the only thing the user needs to develop and manage is the function itself. There is no provisioning or infrastructure management.
- Seamless scalability
You don’t need to do anything manually to handle scaling; Lambda automatically adjusts capacity to accommodate the rate of incoming invocations. Does your function suddenly need more resources because there’s more data to process? No problem, Lambda can handle scaling up and down.
AWS Lambda cons
- 50MB zipped and 250MB unzipped deployment package limit
Lambda functions are supposed to be small and only do one specific task. Still, if your function needs to include multiple dependencies, it’s easy to go over the limits.
- Maximum 15-minute execution time
You can set the function’s timeout, but it cannot run longer than 15 minutes. This limitation makes Lambda unsuitable for long-running processes.
- Cumbersome deployment
Whether you opt to create a ZIP archive or a container image, deploying your function is not easy, especially with dependencies. Every time you want to deploy a new version of your function, you need to package your files and dependencies again before uploading them. Hence, continuous deployment automation is essential to effectively manage your functions, saving you a lot of time and effort.
Of these three common issues, the package size limitation and managing continuous deployments are often the most troublesome. For the remainder of this article, we’d like to explore workarounds to help mitigate these specific issues.
Using TimescaleDB with AWS Lambda
If you are not yet familiar with TimescaleDB, it is a time-series database packaged as a PostgreSQL extension. It has features and optimizations to ingest, store and analyze time-series data efficiently. TimescaleDB works well with tools and services that support PostgreSQL, including Lambda.
TimescaleDB users often use AWS Lambda to load time-series data into their tables. Furthermore, Lambda can be used for transforming, fetching, and performing other data operations on TimescaleDB tables. Let’s cover some examples and sample Lambda function code to give you inspiration!
Example use cases for AWS Lambda with TimescaleDB:
- Insert data from a third-party API
- Insert data from another database
- Fetch data from TimescaleDB
- Create a data API for TimescaleDB
- Perform ETL operations
Since TimescaleDB is a PostgreSQL extension, you can access your TimescaleDB instance using your programming language’s database connector library (like JDBC in Java or Psycopg2 in Python).
Here’s a Python example using Psycopg2 that shows how to connect to a TimescaleDB hypertable and query it inside a Lambda function:
import json
import os

import psycopg2
import psycopg2.extras


def lambda_handler(event, context):
    db_name = os.environ['DB_NAME']
    db_user = os.environ['DB_USER']
    db_host = os.environ['DB_HOST']
    db_port = os.environ['DB_PORT']
    db_pass = os.environ['DB_PASS']

    conn = psycopg2.connect(user=db_user, database=db_name, host=db_host,
                            password=db_pass, port=db_port)

    # Query all records from the past two weeks
    sql = "SELECT * FROM hypertable WHERE time > NOW() - INTERVAL '2 weeks'"

    cursor = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
    cursor.execute(sql)
    result = cursor.fetchall()

    return {
        'statusCode': 200,
        'body': json.dumps(result, default=str),
        'headers': {
            "Content-Type": "application/json"
        }
    }
In this Lambda function code snippet, after creating a connection object, we query the hypertable to return all records from the past two weeks.
At the end of the function, we return this result set in JSON format.
For a simple example like this, a straightforward Lambda function can be coded, uploaded, and triggered without much hassle. However, because TimescaleDB is a time-series database that often deals with tens of thousands of rows (or more) in a single transaction, we find that users often struggle with the problems we discussed earlier.
Let’s explore how to overcome these obstacles for larger data pipelines using AWS Lambda!
Problem #1: how to add external dependencies
Suppose you want to interact with TimescaleDB (or any other external database) in your function to build a data pipeline or a data API. In that case, you need at least one external dependency: a database connector library. With Lambda, you can use Lambda Layers to include the dependencies of the function.
A Lambda Layer is an archive containing additional code, such as libraries. You can also use these libraries across multiple functions, which comes in handy when you build data pipelines, because each function will have to use the connector library. Instead of adding common libraries to each function’s ZIP package, you can refer to the layers within the function, which reduces complexity and makes it easier to manage dependency versions across all of your functions!
Since we are using Python, let’s see how you can upload the Psycopg2 library as a Lambda Layer! It’s a little tricky as you need to use the compiled version of the library.
1) Download and unzip the compiled Psycopg2 library:
wget https://github.com/jkehler/awslambda-psycopg2/archive/refs/heads/master.zip
unzip master.zip
2) In the directory where you downloaded the library, copy the Psycopg2 files into a new directory called python/psycopg2/:
cd awslambda-psycopg2-master/
mkdir -p python/psycopg2/
cp -r psycopg2-3.8/* python/psycopg2/
3) Zip the python directory and upload the zipped file as a Lambda layer:
zip -r psycopg2_layer.zip python/
aws lambda publish-layer-version --layer-name psycopg2 \
--description "psycopg2 for Python3.8" --zip-file fileb://psycopg2_layer.zip \
--compatible-runtimes python3.8
4) In the AWS Lambda console, check that psycopg2 has been uploaded as a Lambda Layer.
And there you have it, Psycopg2 uploaded as a Layer! You can now build out your data pipelines in Lambda that connect to any PostgreSQL database, like TimescaleDB. 🎉
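Once the layer is published, you attach it to a function by its ARN. As a rough example with the AWS CLI (the function name and layer ARN below are placeholders; use the ARN returned by the publish-layer-version call above):

aws lambda update-function-configuration \
  --function-name my-timescale-function \
  --layers arn:aws:lambda:us-east-1:123456789012:layer:psycopg2:1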
If you need a more detailed tutorial on working with Lambda Layers, read our Create a data API for TimescaleDB tutorial to learn how to use Lambda Layers and API Gateway to create a time-series data API.
With Lambda Layers, you can add and share dependencies between your functions.
But it’s important to note that the final package size includes your function code and any layers you associate with the function. This can be crucial if your function is close to the 250MB limit.
Let’s see how to overcome this limit next!
Problem #2: how to overcome the 250MB package limit
What if your function’s dependencies include something bigger than a simple connector library? Or suppose you have multiple shared dependencies and using them together in a function puts you over the 250MB limit. Fear not, there’s an elegant solution to this problem: use a Docker container as your Lambda function.
As mentioned earlier, AWS announced container support for Lambda, which allows the package size to be up to 10GB. This gives you much more flexibility regarding which external libraries and dependencies you use in your Lambda function. The process involves containerizing your function and all its dependencies with Docker, uploading the image to AWS ECR, and finally connecting that image on ECR to Lambda.
What is AWS ECR (Elastic Container Registry)? AWS ECR is a Docker container registry system where you can store, manage, share, and deploy your Docker containers. ECR works well with other AWS services, including Lambda. It allows you to deploy your containers directly from this registry to Lambda.
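If you haven’t created a repository yet, you can do so in the ECR console or with the AWS CLI; for example (the repository name and region are placeholders for your own values):

aws ecr create-repository --repository-name my-lambda-fn --region us-east-1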
Let’s look at a quick example of how to use a Docker image with Lambda in Python:
1) Create your Lambda function
2) Add a requirements file containing your dependencies
# requirements.txt:
requests
pandas
...
3) Create the Dockerfile, which is based on an AWS-provided Lambda image
FROM public.ecr.aws/lambda/python:3.8
# Copy your function code and dependencies into the image
COPY function.py .
COPY requirements.txt .
RUN pip install -r requirements.txt
CMD ["function.handler"]
4) Build the image and upload it to ECR
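From the command line, this step might look roughly like the following (assuming the ECR repository my-lambda-fn from earlier, and with <account-id> and us-east-1 standing in for your own account ID and region):

# Authenticate Docker with your ECR registry
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-east-1.amazonaws.com

# Build, tag, and push the image
docker build -t my-lambda-fn .
docker tag my-lambda-fn:latest <account-id>.dkr.ecr.us-east-1.amazonaws.com/my-lambda-fn:latest
docker push <account-id>.dkr.ecr.us-east-1.amazonaws.com/my-lambda-fn:latest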
5) Create the Lambda function using the image
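You can do this in the Lambda console by choosing the “Container image” option, or with the AWS CLI; here is a sketch of the CLI version (the execution role ARN and names are placeholders):

aws lambda create-function \
  --function-name my-lambda-fn \
  --package-type Image \
  --code ImageUri=<account-id>.dkr.ecr.us-east-1.amazonaws.com/my-lambda-fn:latest \
  --role arn:aws:iam::<account-id>:role/my-lambda-execution-role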
After completing these steps, you have set up a Docker container as your Lambda function. For a detailed step-by-step tutorial on using Docker containers with Lambda, head over to our Pull and ingest data from a third-party API docs to learn how to use Docker to build a third-party API ingestion function with Lambda and TimescaleDB.
Being able to exceed the 250MB limit gives you more freedom and enables you to use Lambda for more use cases than before. For example, if you couldn’t import your favorite data processing library before because it was too big for Lambda, with containers you now can.
As with most things, however, the ability to run a larger, more complex Lambda function also brings a new layer of operational overhead: managing and deploying a Docker image in addition to your function code.
To make that process easier, let’s look at how we can automate the management of the Docker image and continuously deploy changes to AWS Lambda!
Problem #3: how to set up continuous deployment
CI/CD is part of every software development process. In the Lambda ecosystem, there are multiple ways to build a CI/CD pipeline; one popular choice is GitHub Actions. Whether you’re using a serverless framework like SAM or zipping your dependencies and uploading them to Lambda, automating this process can save a lot of time packaging and deploying functions.
With GitHub Actions, you manage the process through a YAML file in your function’s source repository, which instructs GitHub how to package and deploy your function any time something happens, like pushing a commit to the `master` branch.
The process for setting up continuous deployment between a GitHub repository and Lambda using GitHub Actions generally looks like this:
1) Create a new function in the AWS console
2) Push your function and its dependencies to a new GitHub repository
3) Add your AWS credentials to the repository using GitHub secrets
4) Set up GitHub Actions to auto-deploy when there are new changes on the master branch
Here’s an example YAML workflow file:
name: deploy to lambda
on:
  # Trigger the workflow on pushes to the master branch
  push:
    branches:
      - master
jobs:
  deploy_source:
    name: deploy lambda from source
    runs-on: ubuntu-latest
    steps:
      - name: checkout source code
        uses: actions/checkout@v1
      - name: default deploy
        uses: appleboy/lambda-action@master
        with:
          aws_access_key_id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws_secret_access_key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws_region: ${{ secrets.AWS_REGION }}
          function_name: lambda-cd
          source: function.py
In this overview example, GitHub will deploy changes to AWS Lambda any time you push a commit to the `master` branch, using the credentials you specified as part of the setup. Depending on your use case, the deployment strategy and process could be more complex.
For details on building a continuous deployment pipeline between GitHub and Lambda, check out our Lambda continuous deployment with GitHub Actions tutorial.
Wrapping up
You’ve learned about AWS Lambda and how to solve three common challenges that you’ll likely run into as you use it to build data pipelines. We’ve provided a sample process for adding external dependencies to your function using Lambda Layers, and you’ve seen how to work around Lambda’s 250MB package size limit (Docker containers to the rescue!). And finally, you know how to use GitHub Actions to set up continuous deployment with Lambda.
If you want to read more detailed Lambda tutorials, go to the TimescaleDB with AWS Lambda documentation section to dig deeper. If you don’t already have TimescaleDB installed, it’s easiest to sign up for a free Timescale account, or you can always install it on your own machine. AWS Lambda also offers a free usage tier to get started.
Have questions? Join 7000+ TimescaleDB users and engineers in the TimescaleDB community Slack to learn more about working with time-series data!
Finally, here are some inspirational tutorials to get you started with building time-series data pipelines :)
- Analyze cryptocurrency market data
- Analyze historical intraday stock data
- Analyze data using TimescaleDB continuous aggregates and hyperfunctions
- Getting started with Grafana and TimescaleDB