Handy tips for LLM application developers

Somnath Banerjee
5 min read · Jul 1, 2023

The world has been captivated by the capabilities of Large Language Models (LLMs) like ChatGPT, and many application developers are now integrating GPT or other LLMs into their applications. In this article, we present a few simple tips to help you save time and reduce costs when working on LLM applications. We provide concise code examples that demonstrate retry logic, caching, and parallelization when making calls to OpenAI APIs. These code snippets are easy to implement and will increase development velocity while decreasing the cost and latency of your LLM application.

Tip 1: Create a wrapper function

To start off, we suggest creating a wrapper function that provides a convenient way to call openai.ChatCompletion.create.

import os

# Uses the pre-1.0 openai Python library interface (openai.ChatCompletion)
import openai

class GPTUtil(object):
    def __init__(self):
        # Read the API key from the environment rather than hard-coding it
        openai.api_key = os.getenv("OPENAI_API_KEY")

    def gpt_response(self, user_message, **kwargs):
        messages = [{"role": "user", "content": user_message}]

        request = dict(
            model=kwargs.get("model", "gpt-3.5-turbo"),
            temperature=kwargs.get("temperature", 0.7),
            messages=messages
        )

        response = openai.ChatCompletion.create(**request)
        return response["choices"][0]["message"]["content"]

This wrapper makes API calls to GPT more convenient:

gpt = GPTUtil()
print(gpt.gpt_response("Hi there!"))

Tip 2: Add retry logic

The OpenAI server is often busy and returns a RateLimitError or a ServiceUnavailableError. Such failures can be handled easily with retry logic from the tenacity library.

from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)

class GPTUtil(object):

    @retry(wait=wait_random_exponential(min=5, max=60), stop=stop_after_attempt(3))
    def gpt_response(self, user_message, **kwargs):
        messages = [{"role": "user", "content": user_message}]

        request = dict(
            model=kwargs.get("model", "gpt-3.5-turbo"),
            temperature=kwargs.get("temperature", 0.7),
            messages=messages
        )

        response = openai.ChatCompletion.create(**request)
        return response["choices"][0]["message"]["content"]

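In the event of an error, this code makes up to three attempts in total (the initial call plus two retries) using randomized exponential backoff, which means the upper bound on the wait time grows with each retry, here between 5 and 60 seconds. If the last attempt also fails, tenacity raises a RetryError.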
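To make the backoff behaviour concrete, here is a minimal hand-rolled sketch of the same idea using only the standard library. It is not how tenacity is implemented, and the names retry_with_backoff, flaky_call, and the injectable sleep parameter are hypothetical helpers for illustration. The sleep function is replaced with a list append so the example runs instantly and the chosen waits can be inspected.

```python
import random
import time


def retry_with_backoff(max_attempts=3, min_wait=5.0, max_wait=60.0, sleep=time.sleep):
    """Retry a function with randomized exponential backoff (illustrative only)."""

    def decorator(func):
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    # The upper bound doubles after each failure, capped at max_wait
                    upper = min(max_wait, min_wait * 2 ** attempt)
                    sleep(random.uniform(min_wait, upper))

        return wrapper

    return decorator


waits = []  # record the waits instead of actually sleeping
calls = {"count": 0}


@retry_with_backoff(max_attempts=3, sleep=waits.append)
def flaky_call():
    """Hypothetical stand-in for an API call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("server busy")
    return "ok"


result = flaky_call()
print(result, calls["count"], [round(w, 1) for w in waits])
```

Note that max_attempts counts total calls, so a value of 3 allows at most two waits. In real code you would probably retry only on specific error types rather than on every exception.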

Tip 3: Fall back to GPT-4 for lengthier prompts

Using ChatGPT (gpt-3.5) as the default option instead of gpt-4 offers several advantages. At the time of writing, gpt-3.5 is about 20 times cheaper, significantly faster, and has much higher rate limits. However, it has a context limit of 4K tokens. If your prompt exceeds this limit, gpt-3.5 raises an InvalidRequestError. To address this, you can implement logic to fall back to gpt-4 in the following manner.

    @retry(wait=wait_random_exponential(min=5, max=60), stop=stop_after_attempt(3))
    def gpt_response(self, user_message, **kwargs):
        messages = [{"role": "user", "content": user_message}]

        request = dict(
            model=kwargs.get("model", "gpt-3.5-turbo"),
            temperature=kwargs.get("temperature", 0.7),
            messages=messages
        )

        try:
            response = openai.ChatCompletion.create(**request)
        except openai.InvalidRequestError as e:
            # InvalidRequestError covers any invalid request, not only over-long prompts
            print("Trying with gpt-4...")
            request["model"] = "gpt-4"
            try:
                response = openai.ChatCompletion.create(**request)
            except openai.InvalidRequestError as e:
                print("Invalid request", e)
                return None

        return response["choices"][0]["message"]["content"]

message = "AI is going to change the world! " * 1000
print(gpt.gpt_response("How many times AI is mentioned in the following text " + message))

The prompt above is longer than 4K tokens, so the call falls back to gpt-4.
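The fallback above is reactive: it first sends a request that is bound to fail. A complementary approach is to estimate the prompt size up front and pick the model before calling the API. The sketch below is an assumption-laden illustration, not part of the original code: estimate_tokens and pick_model are hypothetical helpers, the 4-characters-per-token figure is only a rough rule of thumb for English text, and the context limits are the values assumed for these models at the time of writing. A real tokenizer would give more accurate counts.

```python
# Assumed context windows (in tokens) for the models discussed in this article
CONTEXT_LIMITS = {"gpt-3.5-turbo": 4096, "gpt-4": 8192}
CHARS_PER_TOKEN = 4  # rough rule of thumb for English text, not an exact count


def estimate_tokens(text):
    """Very rough token estimate; a real tokenizer will give different numbers."""
    return len(text) // CHARS_PER_TOKEN + 1


def pick_model(prompt, reply_budget=500, preferred=("gpt-3.5-turbo", "gpt-4")):
    """Return the first model whose assumed window fits prompt + reply, else None."""
    needed = estimate_tokens(prompt) + reply_budget
    for model in preferred:
        if needed <= CONTEXT_LIMITS[model]:
            return model
    return None


short_choice = pick_model("Hi there!")
long_choice = pick_model("AI is going to change the world! " * 500)
huge_choice = pick_model("x" * 100_000)
print(short_choice, long_choice, huge_choice)
```

Because the estimate is approximate, it makes sense to keep the try/except fallback as a safety net and use this check only to avoid the obviously oversized requests.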

Tip 4: Incorporate caching

During development, debugging, or even in production, it is common to make repeated GPT calls for the same prompt. Caching the GPT responses can save significant time and cost, and creating a SQLite cache for this purpose is relatively straightforward.

The following code creates a SQLite cache:

import hashlib
import sqlite3

CACHE_TABLE = "full_cache"

class SQLiteCache(object):
    def __init__(self, database_path=".llm.db"):
        self.database_path = database_path
        self.cache_table = CACHE_TABLE
        self.conn = sqlite3.connect(self.database_path)
        self.cur = self.conn.cursor()

        # The table name is a module constant, so formatting it into the SQL is safe
        self.cur.execute("""
            CREATE TABLE IF NOT EXISTS {CACHE_TABLE} (
                key TEXT PRIMARY KEY,
                response TEXT,
                created_at timestamp NOT NULL DEFAULT current_timestamp)
            """.format(CACHE_TABLE=CACHE_TABLE)
        )

    def _cache_key(self, messages):
        # Hash the whole conversation so the key has a fixed length
        message_str = ""
        for message in messages:
            message_str += message["role"] + ": " + message["content"] + "\n"

        md5_key = hashlib.md5(message_str.encode("utf-8")).hexdigest()
        return md5_key

    def set(self, messages, response):
        key = self._cache_key(messages)
        # OR REPLACE avoids an IntegrityError when the key is already cached
        insert_stmt = f"""
            INSERT OR REPLACE INTO {CACHE_TABLE} (key, response)
            VALUES (?, ?)
            """

        try:
            self.cur.execute(insert_stmt, (key, response))
            self.conn.commit()
            return True
        except sqlite3.Error as e:
            print("Failed to insert into cache", str(e))
            print(insert_stmt)
            return False

    def get(self, messages):
        key = self._cache_key(messages)
        # Bind the key as a parameter instead of formatting it into the SQL
        select_stmt = f"SELECT response FROM {CACHE_TABLE} WHERE key=?"

        res = self.cur.execute(select_stmt, (key,))
        row = res.fetchone()
        if not row:
            return None

        return row[0]

Now let’s update the wrapper function to use the SQLite cache:

class GPTUtil(object):

    def __init__(self):
        openai.api_key = os.getenv("OPENAI_API_KEY")
        self.cache = SQLiteCache()

    @retry(wait=wait_random_exponential(min=5, max=60), stop=stop_after_attempt(3))
    def gpt_response(self, user_message, **kwargs):
        messages = [{"role": "user", "content": user_message}]

        cached_response = self.cache.get(messages)
        if cached_response:
            return cached_response

        request = dict(
            model=kwargs.get("model", "gpt-3.5-turbo"),
            temperature=kwargs.get("temperature", 0.7),
            messages=messages
        )

        try:
            response = openai.ChatCompletion.create(**request)
        except openai.InvalidRequestError as e:
            print("Trying with gpt-4...")
            request["model"] = "gpt-4"
            try:
                response = openai.ChatCompletion.create(**request)
            except openai.InvalidRequestError as e:
                print("Invalid request", e)
                return None

        content = response["choices"][0]["message"]["content"]
        self.cache.set(messages, content)
        return content

Now a subsequent call with the same prompt returns the response from the cache:

import time

message = "AI is going to change the world! " * 1000

# First call goes to the API
s = time.time()
print(gpt.gpt_response("How many times AI is mentioned in the following text " + message))
e = time.time()
print("Time taken", e - s)

# Second call with the same prompt is served from the cache
s = time.time()
print(gpt.gpt_response("How many times AI is mentioned in the following text " + message))
e = time.time()
print("Time taken", e - s)

Alternatively, you can use langchain or GPTCache, which offer additional functionality but are slightly more complex to use. Note that at the time of writing (June 2023), langchain does not support caching for chat models such as gpt-3.5 and gpt-4.
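One limitation worth knowing: the cache key in this tip is derived from the messages only, so the same prompt sent with a different model or temperature returns the previously cached response. If that matters for your application, the request settings can be folded into the key. The following is a minimal sketch under that assumption; the standalone cache_key function and its parameters are hypothetical and not part of the original SQLiteCache class.

```python
import hashlib
import json


def cache_key(messages, model="gpt-3.5-turbo", temperature=0.7):
    """Build a cache key from the messages plus the request settings."""
    payload = json.dumps(
        {"model": model, "temperature": temperature, "messages": messages},
        sort_keys=True,  # stable key order so equal requests hash identically
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


messages = [{"role": "user", "content": "Hi there!"}]
key_default = cache_key(messages)
key_gpt4 = cache_key(messages, model="gpt-4")
print(key_default != key_gpt4)
```

With the gpt-4 fallback from Tip 3, the model that actually answers can differ from the one requested, so you would need to decide whether to key on the requested model or the one that produced the response.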

Tip 5: Utilize parallel calls

At the time of writing, the rate limits for gpt-3.5 allow up to 3,500 requests per minute and 90,000 tokens per minute. By issuing calls in parallel, you can significantly decrease the total response time for a batch of prompts. To do so, use Python's asyncio library together with the openai.ChatCompletion.acreate function.

class GPTUtil(object):

    @retry(wait=wait_random_exponential(min=5, max=60), stop=stop_after_attempt(3))
    async def gpt_response_async(self, user_message, **kwargs):
        messages = [{"role": "user", "content": user_message}]

        if self.cache:
            cached_response = self.cache.get(messages)
            if cached_response:
                return cached_response

        request = dict(
            model=kwargs.get("model", "gpt-3.5-turbo"),
            temperature=kwargs.get("temperature", 0.7),
            messages=messages,
        )

        try:
            response = await openai.ChatCompletion.acreate(**request)
        except openai.InvalidRequestError as e:
            print("Trying with gpt-4...")
            request["model"] = "gpt-4"
            try:
                response = await openai.ChatCompletion.acreate(**request)
            except openai.InvalidRequestError as e:
                print("Invalid request", e)
                return None

        content = response["choices"][0]["message"]["content"]
        if self.cache:
            self.cache.set(messages, content)
        return content

    async def parallel_gpt_response(self, message_list, **kwargs):
        tasks = [self.gpt_response_async(message, **kwargs) for message in message_list]
        responses = await asyncio.gather(*tasks)
        return responses

We can test it as follows. Please ensure that caching is disabled before running the test; otherwise the repeated prompts are served from the cache and the timings are not comparable.

import asyncio
import time

message_list = ["What is 2^10? Respond with an integer only:"] * 10

# Sequential calls: one request at a time
s = time.time()
responses = [gpt.gpt_response(message) for message in message_list]
t = time.time()
print("Time taken", t - s)

# Parallel calls: all requests are in flight together
s = time.time()
responses = asyncio.run(gpt.parallel_gpt_response(message_list))
t = time.time()
print("Time taken", t - s)
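For larger batches, launching every request at once can run into the rate limits mentioned above. One common way to cap the number of in-flight requests is an asyncio.Semaphore. The sketch below illustrates the pattern with a stand-in coroutine instead of a real API call; fake_llm_call, bounded_gather, and max_concurrency are hypothetical names, and the peak counter exists only to show that the cap holds.

```python
import asyncio


async def fake_llm_call(prompt, delay=0.01):
    """Stand-in for an API call; it only sleeps and echoes the prompt."""
    await asyncio.sleep(delay)
    return f"response to: {prompt}"


async def bounded_gather(prompts, max_concurrency=3, call=fake_llm_call):
    """Run call(prompt) for every prompt with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def worker(prompt):
        nonlocal in_flight, peak
        async with semaphore:  # blocks while max_concurrency calls are running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await call(prompt)
            finally:
                in_flight -= 1

    # gather returns results in the same order as the input prompts
    results = await asyncio.gather(*(worker(p) for p in prompts))
    return results, peak


prompts = [f"prompt {i}" for i in range(10)]
results, peak = asyncio.run(bounded_gather(prompts))
print(len(results), "responses, peak concurrency", peak)
```

In the wrapper from this tip, the same idea would mean acquiring the semaphore around the call to gpt_response_async inside parallel_gpt_response.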

Additional tips

When deploying LLM applications in a production environment, it is important to log the prompts and corresponding responses. PromptLayer (https://promptlayer.com/) and log10 (http://log10.ai/) provide convenient libraries and infrastructure to log and manage the prompts and responses of your LLM application.
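If you only need a local record before adopting a dedicated service, appending each prompt and response to a JSON Lines file is a simple starting point. This is a minimal sketch using the standard library; log_interaction, read_log, and the record fields are hypothetical choices for illustration and are unrelated to the APIs of the services mentioned above.

```python
import json
import os
import tempfile
import time


def log_interaction(path, prompt, response, model="gpt-3.5-turbo"):
    """Append one prompt/response pair to a JSON Lines file."""
    record = {
        "timestamp": time.time(),
        "model": model,
        "prompt": prompt,
        "response": response,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_log(path):
    """Load every logged record back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write to a temporary directory so the example leaves no files behind in the project
log_path = os.path.join(tempfile.mkdtemp(), "llm_log.jsonl")
log_interaction(log_path, "Hi there!", "Hello! How can I help?")
log_interaction(log_path, "What is 2^10?", "1024", model="gpt-4")
records = read_log(log_path)
print(len(records), "records logged")
```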

The code snippets above demonstrate simple techniques to reduce both the time and cost involved in developing and running GPT/LLM applications. You can find all the aforementioned code on GitHub at https://github.com/sombandy/llm-util. Feel free to utilize this code in your own applications.

Written by Somnath Banerjee

Partner and Chief Data Scientist at Clear Ventures
