Build and deploy your first Serverless Chat Assistant
By Jerry D Boonstra
From Zero to Hero: Cheaply build a robust multiuser chat assistant 🤖

“Little by little, one travels far.” - J.R.R. Tolkien
In this series of articles, we’ll delve into the nuts and bolts of creating a multi-user chat assistant with AWS serverless and OpenAI, providing you with a clear roadmap for building and continuously improving your solution.
The series
- From Zero to Hero: Want to cheaply build a robust multiuser chat assistant?
- (this article) 👉 Build and deploy a multi-user Chat Assistant, using AWS Lambda and OpenAI Assistant API and TypeScript
- Add Logging Traces and Ratings to your Application
- Add Unit Tests and Evals to your Application
- Fine-tune your LLM Application to balance Accuracy, Robustness and Cost
- …?
Part 2: Build and deploy a multi-user Chat Assistant, using AWS Lambda and OpenAI Assistant API and TypeScript
The journey building a robust LLM Application
As discussed in our previous post, a fully implemented continuous improvement process for LLM-based applications looks like this:

It takes a lot of effort to build a full pipeline, but you have to start somewhere.
In the previous post in the series, we created an Assistant instance at OpenAI and used the playground to do prompt engineering until we had something that works most of the time. That covered Steps 0, 1, and 2 of the multi-step process.
Step 3: Building our prototype application
To continue our journey, we want multiple users to be able to interact with our assistant. It's time to do some architecting of this multi-user prototype.
Why OpenAI?
OpenAI’s Assistant API is a robust tool for creating sophisticated chatbots. Since its release, it has introduced features like:
- State Management: Maintain conversation context across multiple interactions.
- Knowledge Augmentation: Access up to 10,000 reference documents to enrich your chatbot’s responses.
- Code Execution: Execute code snippets generated during conversations, adding a layer of dynamic interaction.
- Function Calling: Enables real-time calls to external functions or APIs.
- Streaming Output: Streams partial results for immediate feedback.
- Ability to use fine-tuned models: Allows tailoring responses through model fine-tuning.
These features enable us to build a powerful, flexible chat assistant with a wide range of applications.
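To make these features concrete, here is a minimal sketch of a single conversational turn using the OpenAI Node.js SDK (beta Assistants API). The assistant ID, thread reuse, and event handling shown here are illustrative assumptions, not the exact code used by this project's backend.

```typescript
// Minimal sketch: one conversational turn with the Assistants API (OpenAI Node.js SDK, beta).
// Assumes OPENAI_API_KEY is set and an assistant already exists; not the project's exact backend code.
import OpenAI from "openai";

const openai = new OpenAI();

async function chatTurn(assistantId: string, userMessage: string, threadId?: string) {
  // State management: reuse the same thread ID to keep conversation context across turns.
  const thread = threadId ? { id: threadId } : await openai.beta.threads.create();

  // Append the user's message to the thread.
  await openai.beta.threads.messages.create(thread.id, { role: "user", content: userMessage });

  // Streaming output: run the assistant and print partial text as it arrives.
  const stream = openai.beta.threads.runs.stream(thread.id, { assistant_id: assistantId });
  stream.on("textDelta", (delta) => process.stdout.write(delta.value ?? ""));
  await stream.finalRun();

  return thread.id; // persist this to continue the conversation on the next turn
}
```

Knowledge augmentation, code execution, and function calling are enabled per assistant through its tools configuration when the assistant instance is created.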
Why Serverless?
Choosing a serverless architecture offers several advantages for our chat assistant:
- Cost Efficiency: You only pay for what you use, making it cost-effective for applications with variable traffic.
- Scalability: Serverless platforms automatically scale to handle demand, ensuring your application remains responsive under load.
- Maintenance-Free: You don’t have to worry about managing servers, allowing you to focus on development.
However, it’s essential to be aware of the limitations, such as execution timeouts and cold start delays, which we’ll address in detail.
Can we build it?
When building our assistant on serverless, the limitations that need to be addressed or worked around include:
| Aspect | Limitation | Observation/Workaround |
|---|---|---|
| Execution Timeout | API Gateway has a maximum execution timeout of around 30 seconds for synchronous connections. | The vast majority of single-turn responses will execute in under 30 seconds. For the rest, we depend on WebSocket auto-reconnect and the stateful, idempotent nature of the Assistant API. |
| Execution Context Persistence | State, data, etc. are not preserved across invocations, except for the /tmp directory and in-memory variables within a single container. | Leverage the stateful nature of the Assistant API. Use DynamoDB to track the assistant thread used per user across stateless Lambda invocations (see the sketch after this table). |
| Library and Dependency Management | Some libraries, especially those requiring native dependencies, may be difficult to install and use in Lambda, and there is a limit of 5 layers. | The OpenAI Node.js SDK has no native dependencies. We use a Lambda layer to provide this library and its dependencies to our application backend. |
| Memory and CPU Limitations | Lambda functions can allocate at most 10 GB of memory, and CPU performance is proportional to the memory allocated. | A Lambda function for our purposes is unlikely to need more than 1 GB of memory, and in practice will use less. Inference can be memory- and GPU-intensive, but it happens in OpenAI's infrastructure. |
| Deployment Package Size | The deployment package cannot exceed 50 MB when uploaded directly, or 250 MB when uploaded via an Amazon S3 bucket. | The OpenAI library and its dependencies weigh in at ~17 MB uncompressed, which leaves 233 MB of headroom for other libraries in your Lambda application. |
| Cold Start Delays | Cold start delay is the latency experienced when a serverless function (such as AWS Lambda) is invoked for the first time after being idle. | Usually not a deal breaker for low-traffic applications. Can be mitigated by keeping functions warm, increasing function memory allocation, or using Provisioned Concurrency. |
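To make the persistence workaround concrete, the sketch below tracks one Assistant thread per user in DynamoDB so that stateless Lambda invocations can pick a conversation back up. The table name, key schema, and attribute names are illustrative assumptions, not necessarily the repo's actual schema.

```typescript
// Sketch: look up (or create) the OpenAI thread ID for a user across Lambda invocations.
// Table name ("AssistantThreads") and key schema ("userId") are illustrative assumptions.
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand, PutCommand } from "@aws-sdk/lib-dynamodb";
import OpenAI from "openai";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const openai = new OpenAI();
const TABLE = "AssistantThreads";

export async function getOrCreateThreadId(userId: string): Promise<string> {
  // Lambda containers are stateless between invocations, so the user -> thread mapping lives in DynamoDB.
  const existing = await ddb.send(new GetCommand({ TableName: TABLE, Key: { userId } }));
  if (existing.Item?.threadId) {
    return existing.Item.threadId as string;
  }

  // First message from this user: create a new Assistant thread and remember it.
  const thread = await openai.beta.threads.create();
  await ddb.send(new PutCommand({ TableName: TABLE, Item: { userId, threadId: thread.id } }));
  return thread.id;
}
```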
Costs
OpenAI
The Assistants API can be used with a selection of models that vary in cost and quality.
By default, this project uses the `gpt-4o` model to demo its humorous example prompt.
`gpt-3.5-turbo` gives a roughly 10x cost savings; whether you can get away with the ultra-low-cost option will come down to your use case.
Since an assistant instance is easy to spin up or modify using our codebase, you can easily change models and compare results to quickly determine which direction to go in.
The details:
- You can get great results using `gpt-4o`, but at the time of writing you'll be paying $5 per 1M tokens for input and $15 per 1M tokens for output at inference.
- For an ultra-low-cost solution you can choose `gpt-3.5-turbo`, which isn't free but is getting close: inference is $0.50 per 1M tokens for input and $1.50 per 1M tokens for output.
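As a back-of-the-envelope comparison, here is a tiny cost estimate using the prices quoted above; the token counts per turn are illustrative assumptions.

```typescript
// Back-of-the-envelope cost estimate per 1,000 chat turns (token counts per turn are illustrative).
const PRICES = {
  "gpt-4o":        { inputPer1M: 5.0, outputPer1M: 15.0 }, // USD, at time of writing
  "gpt-3.5-turbo": { inputPer1M: 0.5, outputPer1M: 1.5 },
};

function costPer1kTurns(model: keyof typeof PRICES, inputTokens = 1500, outputTokens = 300) {
  const p = PRICES[model];
  const perTurn =
    (inputTokens / 1_000_000) * p.inputPer1M +
    (outputTokens / 1_000_000) * p.outputPer1M;
  return 1000 * perTurn;
}

console.log(costPer1kTurns("gpt-4o"));        // ≈ $12.00 per 1,000 turns
console.log(costPer1kTurns("gpt-3.5-turbo")); // ≈ $1.20 per 1,000 turns
```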
AWS
For a lower-traffic application, it's unlikely you will exceed the free tier.
Architecture

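While the diagram above shows the full stack, the sketch below illustrates the core data path suggested by the limitations table: an API Gateway WebSocket route invokes a Lambda handler, which streams the assistant's reply back to the connected client. The route, payload shape, and event fields here are illustrative assumptions rather than the repo's exact handler.

```typescript
// Sketch: stream Assistant output back over an API Gateway WebSocket connection.
// Assumes a WebSocket route wired to this Lambda; payload shape and fields are illustrative.
import {
  ApiGatewayManagementApiClient,
  PostToConnectionCommand,
} from "@aws-sdk/client-apigatewaymanagementapi";
import OpenAI from "openai";

const openai = new OpenAI();

export async function handler(event: any) {
  const { domainName, stage, connectionId } = event.requestContext;
  const apigw = new ApiGatewayManagementApiClient({
    endpoint: `https://${domainName}/${stage}`,
  });
  const { threadId, assistantId, message } = JSON.parse(event.body ?? "{}");

  // Append the user's message to their thread, then run the assistant with streaming.
  await openai.beta.threads.messages.create(threadId, { role: "user", content: message });
  const stream = openai.beta.threads.runs.stream(threadId, { assistant_id: assistantId });

  // Forward each streamed message delta to the connected client as it arrives.
  for await (const chunk of stream) {
    if (chunk.event === "thread.message.delta") {
      await apigw.send(
        new PostToConnectionCommand({
          ConnectionId: connectionId,
          Data: Buffer.from(JSON.stringify(chunk.data.delta)),
        })
      );
    }
  }
  return { statusCode: 200 };
}
```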
Requirements
You’ll need:
- an AWS account with wide permissions, to be able to deploy entire CloudFormation stacks with serverless components.
- an OpenAI account and API key, for storing your Assistant instance and for billing.
Code can be found at https://github.com/jerrydboonstra/serverless-assistant-chat/tree/v1
After cloning the codebase, we’ll need to
- create our local environment
- create our Assistant instance (sketched below)
- create our backend deployment bucket
before running our CloudFormation template to create our stack.
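For orientation, creating the Assistant instance can look roughly like the sketch below; the name, instructions, and tool selection are illustrative assumptions, and the repo's README describes the actual creation step for this project.

```typescript
// Sketch: one-time creation of an Assistant instance. Name, instructions, and tools
// below are illustrative; follow the README for the project's real setup.
import OpenAI from "openai";

const openai = new OpenAI();

async function createAssistant() {
  const assistant = await openai.beta.assistants.create({
    name: "Serverless Chat Assistant",
    model: "gpt-4o",
    instructions: "You are a helpful, slightly humorous assistant.",
    tools: [{ type: "file_search" }], // enables knowledge augmentation over uploaded documents
  });
  console.log("Assistant ID:", assistant.id); // store this for the backend, e.g. as an env var
  return assistant;
}

createAssistant();
```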
Setup and Deployment
- Clone the repo and switch to the `v1` branch:
  `git clone https://github.com/jerrydboonstra/serverless-assistant-chat.git`
  `cd serverless-assistant-chat && git checkout v1`
- Follow the instructions in README.md to create a local environment and deploy the entire Assistant stack in your AWS environment.
Try it out

Making changes after stack deployment
A process for making changes after deployment is provided, allowing for iterative development.
Step 4: Add Logging Traces and Ratings to your Application
In the next article, we will add logging traces and ratings to our multi-user application. We can use these later to evaluate response quality.
Part 3: Add Logging Traces and Ratings to your Application (coming soon!)