Build and deploy your first Serverless Chat Assistant
By Jerry D Boonstra
From Zero to Hero: Cheaply build a robust multiuser chat assistant 🤖

“Little by little, one travels far.” - J.R.R. Tolkien
In this series of articles, we’ll delve into the nuts and bolts of creating a multi-user chat assistant with AWS serverless and OpenAI, providing you with a clear roadmap for building and continuously improving your solution.
The series
- From Zero to Hero: Want to cheaply build a robust multiuser chat assistant?
- (this article) 👉 Build and deploy a multi-user Chat Assistant, using AWS Lambda and OpenAI Assistant API and TypeScript
- Add Logging Traces and Ratings to your Application
- Add Unit Tests and Evals to your Application
- Fine-tune your LLM Application to balance Accuracy, Robustness and Cost
- …?
Part 2: Build and deploy a multi-user Chat Assistant, using AWS Lambda and OpenAI Assistant API and TypeScript
The journey building a robust LLM Application
As discussed in our previous post, a fully implemented continuous improvement process for LLM-based applications looks like this:

It takes a lot of effort to build a full pipeline, but you have to start somewhere.
In the previous post in the series, we created an Assistant instance at OpenAI and used the playground to do prompt engineering until we had something that works most of the time. That covered Steps 0, 1, and 2 of the multi-step process.
Step 3: Building our prototype application
To continue our journey, we want multiple users to be able to interact with our assistant. It's time to do some architecting of this multi-user prototype.
Why OpenAI?
OpenAI’s Assistant API is a robust tool for creating sophisticated chatbots. Since its release, it has introduced features like:
- State Management: Maintain conversation context across multiple interactions.
- Knowledge Augmentation: Access up to 10,000 reference documents to enrich your chatbot’s responses.
- Code Execution: Execute code snippets generated during conversations, adding a layer of dynamic interaction.
- Function Calling: Enables real-time calls to external functions or APIs.
- Streaming Output: Streams partial results for immediate feedback.
- Ability to use fine-tuned models: Allows tailoring responses through model fine-tuning.
These features enable us to build a powerful, flexible chat assistant with a wide range of applications.
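To make these features concrete, here is a minimal sketch of a single conversational turn using the OpenAI Node.js SDK (beta Assistants API). The assistant ID, thread reuse, and event handling shown here are illustrative assumptions, not the exact code used by this project's backend.

```typescript
// Minimal sketch: one conversational turn with the Assistants API (OpenAI Node.js SDK, beta).
// Assumes OPENAI_API_KEY is set and an assistant already exists; not the project's exact backend code.
import OpenAI from "openai";

const openai = new OpenAI();

async function chatTurn(assistantId: string, userMessage: string, threadId?: string) {
  // State management: reuse the same thread ID to keep conversation context across turns.
  const thread = threadId ? { id: threadId } : await openai.beta.threads.create();

  // Append the user's message to the thread.
  await openai.beta.threads.messages.create(thread.id, { role: "user", content: userMessage });

  // Streaming output: run the assistant and print partial text as it arrives.
  const stream = openai.beta.threads.runs.stream(thread.id, { assistant_id: assistantId });
  stream.on("textDelta", (delta) => process.stdout.write(delta.value ?? ""));
  await stream.finalRun();

  return thread.id; // persist this to continue the conversation on the next turn
}
```

Knowledge augmentation, code execution, and function calling are enabled per assistant through its tools configuration when the assistant instance is created.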
Why Serverless?
Choosing a serverless architecture offers several advantages for our chat assistant:
- Cost Efficiency: You only pay for what you use, making it cost-effective for applications with variable traffic.
- Scalability: Serverless platforms automatically scale to handle demand, ensuring your application remains responsive under load.
- Maintenance-Free: You don’t have to worry about managing servers, allowing you to focus on development.
However, it’s essential to be aware of the limitations, such as execution timeouts and cold start delays, which we’ll address in detail.
Can we build it?
When building our assistant on serverless, the limitations that need to be addressed or worked around include:
| Aspect | Limitation | Observation/Workaround |
|---|---|---|
| Execution Timeout | API Gateway has a maximum execution timeout of around 30 seconds for synchronous connections. | The vast majority of single-turn responses will execute in under 30 seconds. For the rest, we depend on WebSocket auto-reconnect and the stateful, idempotent nature of the Assistant API. |
| Execution Context Persistence | State, data, etc. are not preserved across invocations, except for the /tmp directory and in-memory variables within a single container. | Leverage the stateful nature of the Assistant API. Use DynamoDB to track the assistant thread used per user across stateless Lambda invocations (see the sketch after this table). |
| Library and Dependency Management | Some libraries, especially those requiring native dependencies, may be difficult to install and use in Lambda, and there is a limit of 5 layers. | The OpenAI Node.js SDK has no native dependencies. We use a Lambda layer to provide this library and its dependencies to our application backend. |
| Memory and CPU Limitations | Lambda functions can allocate at most 10 GB of memory, and CPU performance is proportional to the memory allocated. | A Lambda function for our purposes is unlikely to need more than 1 GB of memory, and in practice will use less. Inference can be memory- and GPU-intensive, but it happens in OpenAI's infrastructure. |
| Deployment Package Size | The deployment package cannot exceed 50 MB when uploaded directly, or 250 MB when uploaded via an Amazon S3 bucket. | The OpenAI library and its dependencies weigh in at ~17 MB uncompressed, which leaves 233 MB of headroom for other libraries in your Lambda application. |
| Cold Start Delays | Cold start delay is the latency experienced when a serverless function (such as AWS Lambda) is invoked for the first time after being idle. | Usually not a deal breaker for low-traffic applications. Can be mitigated by keeping functions warm, increasing function memory allocation, or using Provisioned Concurrency. |
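To make the persistence workaround concrete, the sketch below tracks one Assistant thread per user in DynamoDB so that stateless Lambda invocations can pick a conversation back up. The table name, key schema, and attribute names are illustrative assumptions, not necessarily the repo's actual schema.

```typescript
// Sketch: look up (or create) the OpenAI thread ID for a user across Lambda invocations.
// Table name ("AssistantThreads") and key schema ("userId") are illustrative assumptions.
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand, PutCommand } from "@aws-sdk/lib-dynamodb";
import OpenAI from "openai";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const openai = new OpenAI();
const TABLE = "AssistantThreads";

export async function getOrCreateThreadId(userId: string): Promise<string> {
  // Lambda containers are stateless between invocations, so the user -> thread mapping lives in DynamoDB.
  const existing = await ddb.send(new GetCommand({ TableName: TABLE, Key: { userId } }));
  if (existing.Item?.threadId) {
    return existing.Item.threadId as string;
  }

  // First message from this user: create a new Assistant thread and remember it.
  const thread = await openai.beta.threads.create();
  await ddb.send(new PutCommand({ TableName: TABLE, Item: { userId, threadId: thread.id } }));
  return thread.id;
}
```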
Costs
OpenAI
The Assistants API can be used with a selection of models that vary in cost and quality.
By default, this project uses the `gpt-4o` model to demo its humorous example prompt.
`gpt-3.5-turbo` gives a roughly 10x cost savings; whether you can get away with the ultra-low-cost option will come down to your use case.
Since an assistant instance is easy to spin up or modify using our codebase, you can easily change models and compare results to quickly determine which direction to go in.
The details:
- You can get great results using `gpt-4o`, but at the time of writing you'll be paying $5 per 1M tokens for input and $15 per 1M tokens for output at inference.
- For an ultra-low-cost solution you can choose `gpt-3.5-turbo`, which isn't free but is getting close: inference is $0.50 per 1M tokens for input and $1.50 per 1M tokens for output.
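As a back-of-the-envelope comparison, here is a tiny cost estimate using the prices quoted above; the token counts per turn are illustrative assumptions.

```typescript
// Back-of-the-envelope cost estimate per 1,000 chat turns (token counts per turn are illustrative).
const PRICES = {
  "gpt-4o":        { inputPer1M: 5.0, outputPer1M: 15.0 }, // USD, at time of writing
  "gpt-3.5-turbo": { inputPer1M: 0.5, outputPer1M: 1.5 },
};

function costPer1kTurns(model: keyof typeof PRICES, inputTokens = 1500, outputTokens = 300) {
  const p = PRICES[model];
  const perTurn =
    (inputTokens / 1_000_000) * p.inputPer1M +
    (outputTokens / 1_000_000) * p.outputPer1M;
  return 1000 * perTurn;
}

console.log(costPer1kTurns("gpt-4o"));        // ≈ $12.00 per 1,000 turns
console.log(costPer1kTurns("gpt-3.5-turbo")); // ≈ $1.20 per 1,000 turns
```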
AWS
For a lower-traffic application, it's unlikely you will exceed the free tier.
Architecture

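While the diagram above shows the full stack, the sketch below illustrates the core data path suggested by the limitations table: an API Gateway WebSocket route invokes a Lambda handler, which streams the assistant's reply back to the connected client. The route, payload shape, and event fields here are illustrative assumptions rather than the repo's exact handler.

```typescript
// Sketch: stream Assistant output back over an API Gateway WebSocket connection.
// Assumes a WebSocket route wired to this Lambda; payload shape and fields are illustrative.
import {
  ApiGatewayManagementApiClient,
  PostToConnectionCommand,
} from "@aws-sdk/client-apigatewaymanagementapi";
import OpenAI from "openai";

const openai = new OpenAI();

export async function handler(event: any) {
  const { domainName, stage, connectionId } = event.requestContext;
  const apigw = new ApiGatewayManagementApiClient({
    endpoint: `https://${domainName}/${stage}`,
  });
  const { threadId, assistantId, message } = JSON.parse(event.body ?? "{}");

  // Append the user's message to their thread, then run the assistant with streaming.
  await openai.beta.threads.messages.create(threadId, { role: "user", content: message });
  const stream = openai.beta.threads.runs.stream(threadId, { assistant_id: assistantId });

  // Forward each streamed message delta to the connected client as it arrives.
  for await (const chunk of stream) {
    if (chunk.event === "thread.message.delta") {
      await apigw.send(
        new PostToConnectionCommand({
          ConnectionId: connectionId,
          Data: Buffer.from(JSON.stringify(chunk.data.delta)),
        })
      );
    }
  }
  return { statusCode: 200 };
}
```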
Requirements
You’ll need:
- an AWS account with wide permissions, to be able to deploy entire CloudFormation stacks with serverless components.
- an OpenAI account and API key, for storing your Assistant instance and for billing.
Code can be found at https://github.com/jerrydboonstra/serverless-assistant-chat/tree/v1
After cloning the codebase, we’ll need to
- create our local environment
- create our Assistant instance (sketched below)
- create our backend deployment bucket
before running our CloudFormation template to create our stack.
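For orientation, creating the Assistant instance can look roughly like the sketch below; the name, instructions, and tool selection are illustrative assumptions, and the repo's README describes the actual creation step for this project.

```typescript
// Sketch: one-time creation of an Assistant instance. Name, instructions, and tools
// below are illustrative; follow the README for the project's real setup.
import OpenAI from "openai";

const openai = new OpenAI();

async function createAssistant() {
  const assistant = await openai.beta.assistants.create({
    name: "Serverless Chat Assistant",
    model: "gpt-4o",
    instructions: "You are a helpful, slightly humorous assistant.",
    tools: [{ type: "file_search" }], // enables knowledge augmentation over uploaded documents
  });
  console.log("Assistant ID:", assistant.id); // store this for the backend, e.g. as an env var
  return assistant;
}

createAssistant();
```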
Setup and Deployment
- Clone the repo and switch to the `v1` branch:
  `git clone https://github.com/jerrydboonstra/serverless-assistant-chat.git`
  `cd serverless-assistant-chat && git checkout v1`
- Follow the instructions in README.md to create a local environment and deploy the entire Assistant stack in your AWS environment.
Try it out

Making changes after stack deployment
A process for making changes after deployment is provided, allowing for iterative development.
Step 4: Add Logging Traces and Ratings to your Application
In the next article, we will add logging traces and ratings to our multi-user application. We can use these later to evaluate response quality.
Part 3: Add Logging Traces and Ratings to your Application (coming soon!)