Handling AI Scaling and Reducing Costs

When it comes to the programming adage “Speed, Quality, Price: Choose Two”, I tend to prefer low price. Perhaps it’s the influence that the Zen of Python has had on me, or perhaps it’s simply the tendency of software to outpace hardware when it comes to improving efficiency.
Now is better than never. Although never is often better than right now.
~ Zen of Python
Regardless, the case I’ll make here is for a method of handling simple AI tasks at scale that favors queued results which complete eventually over immediate results that require adding resources each time demand grows.
Scaling using a queue
The idea here is to handle tasks submitted by users by adding them to a queue, from which a task executor takes items one at a time. With this design, the primary limit on the number of tasks that can be handled is the capacity of the queue itself, and it is far less expensive to add storage to a queue than it is to add task executors.
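To make the shape of this design concrete, here is a minimal sketch using only Python’s standard library: a single executor thread draining an in-process queue one task at a time. The real example later in this post swaps the in-process queue for RabbitMQ and the thread for a Celery worker.
# queue_sketch.py -- illustrative only; not one of the project files below
import queue
import threading

task_queue = queue.Queue()

def executor():
    # A single executor processes one task at a time, in submission order.
    while True:
        task = task_queue.get()
        print(f'Processing: {task}')
        task_queue.task_done()

threading.Thread(target=executor, daemon=True).start()

# Users can keep submitting tasks without waiting for earlier ones to finish;
# the only limit is how many items the queue can hold.
for message in ['What is 50 divided by 2?', 'What is 7 plus 3?']:
    task_queue.put(message)

task_queue.join()  # block until every queued task has been processed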
This highlights the contrast between the queued design and physically scaling by adding an additional task executor for each additional user. Of course, users in the scaled-out design benefit from getting immediate results, as they don’t have to wait their turn in a queue. One way we can begin to remedy the wait is to scale our queueing infrastructure as well.
An example using RabbitMQ and Celery
Diving into a practical example, we can create a queue using RabbitMQ as the message broker, and a task executor using Celery.
The following files will be needed:
# Dockerfile
FROM python:3.11-slim-bookworm
COPY ./requirements.txt /code/requirements.txt
RUN pip install -r /code/requirements.txt
RUN python -m spacy download en_core_web_sm
COPY ./tasks.py /code
WORKDIR /code
CMD ["celery", "-A", "tasks", "worker", "--loglevel=INFO"]
# docker-compose.yml
services:
  rabbitmq:
    image: rabbitmq:4.1
    ports:
      - "5672:5672"
  celery:
    build: .
    volumes:
      - ./data:/data
# requirements.txt
celery==5.5.2
chatterbot[dev]==1.2.6
# tasks.py
from celery import Celery
from chatterbot import ChatBot

# The broker hostname matches the rabbitmq service name in docker-compose.yml
app = Celery('tasks', broker='pyamqp://guest@rabbitmq//')

@app.task
def solve(message_id: int, message: str):
    """
    Create a simple chatbot that can solve math problems.
    """
    chatbot = ChatBot(
        'Example Bot',
        logic_adapters=[
            'chatterbot.logic.MathematicalEvaluation'
        ]
    )
    response = chatbot.get_response(message)

    # Save the response to a file in the data directory
    with open(f'/data/{message_id}-response.txt', 'w') as f:
        f.write(response.text)

    # Return the text so the result stays serializable if a result backend is added
    return response.text
Running the example
Next, use Docker to start the services.
sudo docker compose up -d
You can check to make sure the expected services are running using the ps subcommand. You should see output similar to the following:
sudo docker compose ps
NAME                           IMAGE                      COMMAND                  SERVICE    CREATED              STATUS              PORTS
chatterbot-celery-celery-1     chatterbot-celery-celery   "celery -A tasks wor…"   celery     7 seconds ago        Up 6 seconds
chatterbot-celery-rabbitmq-1   rabbitmq:4.1               "docker-entrypoint.s…"   rabbitmq   About a minute ago   Up About a minute   4369/tcp, 5671/tcp, 15691-15692/tcp, 25672/tcp, 0.0.0.0:5672->5672/tcp, [::]:5672->5672/tcp
Next, we can simulate a user submitting a task to the queue. Celery will then detect this task and start running it.
sudo docker compose exec celery python
Python 3.11.12 (main, Apr 28 2025, 22:10:55) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tasks import solve
>>> solve.delay(1, 'What is 50 divided by 2?')
<AsyncResult: 984463f9-603b-4242-b3f5-8d626dc4f7a6>
Now if we inspect the contents of the data directory, we should see a new file named 1-response.txt containing the solved equation:
50 divided by 2 = 25
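Submitting tasks one at a time from a REPL works for a demonstration, but the same pattern extends to batches. The script below is a hypothetical helper (not one of the files listed above; it would need to be added to the image alongside tasks.py) that enqueues several messages at once. Note that because no result backend is configured, calling .get() on the returned AsyncResult would fail; the results land in the /data directory instead.
# submit_tasks.py -- hypothetical helper, added alongside tasks.py
from tasks import solve

messages = [
    'What is 100 divided by 4?',
    'What is 7 times 6?',
]

for message_id, message in enumerate(messages, start=2):
    # .delay() returns immediately; a worker picks the task up later.
    result = solve.delay(message_id, message)
    print(f'Queued task {result.id} as message {message_id}')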
Scaling horizontally
We can horizontally scale the task executor (Celery) layer of our infrastructure. The following command will create a new instance of our celery container, giving us two running in total.
sudo docker compose up -d --scale celery=2
[+] Running 3/3
✔ Container chatterbot-celery-rabbitmq-1 Running 0.0s
✔ Container chatterbot-celery-celery-1 Running 0.0s
✔ Container chatterbot-celery-celery-2 Started
Once the scaling is complete, both Celery workers will run in parallel, each able to take a task out of the queue for processing.
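How evenly tasks spread across the two workers depends in part on prefetching: by default each Celery worker reserves several tasks from the queue ahead of time, which can leave one worker busy while the other sits idle. For long-running tasks, a couple of standard Celery settings, shown here as an optional sketch to add to tasks.py, make distribution fairer:
# Optional additions to tasks.py for fairer task distribution
app.conf.worker_prefetch_multiplier = 1  # reserve one task at a time instead of a batch
app.conf.worker_concurrency = 1          # processes per container; scale containers instead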
Final notes and musings
There is a time and a place for the kind of deferred processing described here. Actions such as conversing with an AI likely require more real-time strategies for returning responses promptly. Tasks better suited to the strategy described in this post include long-running work such as AI image generation or analysis, or generating summaries of relatively static data.
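As a sketch of what such a long-running task might look like, the example below follows the same pattern as tasks.py; the summarize_text function is a hypothetical stand-in for whatever model or library would actually do the work.
# summarize.py -- a sketch following the same pattern as tasks.py
from celery import Celery

app = Celery('tasks', broker='pyamqp://guest@rabbitmq//')

def summarize_text(text: str) -> str:
    # Hypothetical placeholder: substitute a real summarization model here.
    return text[:200]

@app.task
def summarize(document_id: int, text: str):
    summary = summarize_text(text)
    # Write the result to the shared data volume, as in the solve task
    with open(f'/data/{document_id}-summary.txt', 'w') as f:
        f.write(summary)
    return summary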
As noted in my Earth Day post from a few weeks back, AI models can consume a large amount of power to generate output. An interesting experiment to consider might be to set up a queue and use a Raspberry Pi connected to a solar panel to process tasks. Perhaps in some use cases the initial investment in hardware could provide some level of ROI compared to purchasing cloud services.
If you do happen to be using a Raspberry Pi running on solar power, one relevant note is the use of a rabbitmq container over an alternative such as redis: per the Celery docs, Redis is more susceptible to data loss in the event of abrupt power failures.
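If power loss is a real concern, a couple of standard Celery settings can also help. This is a sketch of options to add to tasks.py, with the caveat that late acknowledgment assumes a task is safe to run more than once:
# Optional additions to tasks.py for tolerating abrupt shutdowns
app.conf.task_acks_late = True              # remove a task from the queue only after
                                            # it finishes, so interrupted tasks are redelivered
app.conf.task_reject_on_worker_lost = True  # requeue tasks whose worker process dies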
"blog post topic"
tag when doing so.