LLMs can automate a wide variety of tasks. Since the launch of ChatGPT in 2022, an increasing number of LLM-powered AI products have come onto the market. However, there is still a lot of room for improvement in how LLMs are used. For example, improving prompts with a prompt optimizer and leveraging cached tokens are two simple techniques you can use to significantly improve the performance of your LLM applications.
This article covers specific techniques for creating and structuring prompts that reduce latency and cost while improving response quality. The goal is to introduce these techniques so you can quickly apply them to your own LLM applications.
Why you should optimize your prompts
In many cases, you have prompts that work with a given LLM and yield acceptable results. However, we often don't spend much time optimizing those prompts, which leaves a lot of potential on the table.
I argue that, using the right techniques presented in this article, you can easily improve the quality of your responses and reduce costs without much effort. Just because a prompt works with an LLM doesn't mean it performs optimally. In many cases, you can see significant improvements with very little work.
Specific techniques for optimization
This section describes specific techniques you can use to optimize your prompts.
Always put static content early
The first technique we'll discuss is to always put static content early in the prompt. Static content is content that stays the same across multiple API calls.
The reason you should keep static content early is that all major LLM providers, such as Anthropic, Google, and OpenAI, use cached tokens. Cached tokens are tokens that have already been processed in a previous API request and are cheaper and faster to process. Depending on the provider, cached input tokens typically cost around 10% of regular input tokens.
A cached token is a token that has already been processed in a previous API request and can therefore be processed more cheaply and quickly than a regular token.
This means that if you send the same prompt twice in a row, the input tokens of the second request cost only about one tenth as much as those of the first. This works because the LLM provider caches the processing of those input tokens, making new requests cheaper and faster to handle.
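To make the savings concrete, here is a quick back-of-the-envelope calculation. The prices below are hypothetical placeholders, not any provider's actual rates:
# hypothetical prices, for illustration only (check your provider's pricing page)
price_per_million_input_tokens = 2.50  # regular input tokens, in USD
cached_discount = 0.10                 # cached tokens cost ~10% of regular ones

prompt_tokens = 50_000                 # tokens in a long, static prompt prefix

first_request_cost = prompt_tokens / 1_000_000 * price_per_million_input_tokens
repeat_request_cost = first_request_cost * cached_discount

print(f"first request:  ${first_request_cost:.4f}")   # $0.1250
print(f"repeat request: ${repeat_request_cost:.4f}")  # $0.0125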
In practice, you take advantage of input-token caching by keeping variable content at the end of the prompt.
For example, if you have a long system prompt and a different question for each request, you should structure the prompt like this:
prompt = f"""
{long_static_system_prompt}
{user_prompt}
"""
For example:
prompt = f"""
You are a document expert ...
You should always answer in this format ...
If a user asks about ... you should answer ...
{user_question}
"""
Here we put the static content of the prompt first and the variable content (the user's question) last.
In some scenarios, you may also want to feed in the content of a document. If you are working with many different documents, you should leave the document content toward the end of the prompt:
# if processing different documents
prompt = f"""
{static_system_prompt}
{variable_prompt_instruction_1}
{document_content}
{variable_prompt_instruction_2}
{user_question}
"""
However, suppose you process the same document multiple times. In that case, you can make sure the document's tokens are cached by ensuring that no variable content appears in the prompt before it.
# if processing the same document multiple times,
# keep the document before any variable instructions
prompt = f"""
{static_system_prompt}
{document_content}
{variable_prompt_instruction_1}
{variable_prompt_instruction_2}
{user_question}
"""
Note that cached tokens are typically only activated if the first 1024 tokens are identical across two requests. For example, if the static system prompt in the example above is shorter than 1024 tokens, cached tokens will not be used.
# do NOT do this: putting variable content first disables
# cached tokens for everything that follows it
prompt = f"""
{variable_content}
{static_system_prompt}
{document_content}
{variable_prompt_instruction_1}
{variable_prompt_instruction_2}
{user_question}
"""
Prompts should always be constructed with the most static content first (the content that changes the least between requests), followed by the most dynamic content (the content that changes the most between requests).
- If you have long system and user prompts without variables, keep them at the beginning and add the variables at the end of the prompt.
- If you fetch text from a document and process the same document several times, place the document content (like any other long, unchanging section) before any variable content, so its tokens are cached.
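If you want to check whether caching actually kicks in, most providers report cached tokens in the usage metadata of the response. Below is a minimal sketch assuming the OpenAI Python SDK; field names may differ for other providers and SDK versions:
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

static_system_prompt = "You are a document expert ..."  # should exceed 1024 tokens in practice
user_question = "What does section 3 of the document say?"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": static_system_prompt},
        {"role": "user", "content": user_question},
    ],
)

# on a repeated request with the same prefix, cached_tokens should be > 0
usage = response.usage
print("prompt tokens:", usage.prompt_tokens)
print("cached tokens:", usage.prompt_tokens_details.cached_tokens)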
Place the question last
Another technique to improve LLM performance is to always place the user's question at the end of the prompt. Ideally, the system prompt should contain all general instructions, and the user prompt should contain only the user's question, like this:
system_prompt = "<general instructions>"
user_prompt = f"{user_question}"
Anthropic's prompt engineering documentation states that placing the user's question at the end of the prompt can improve performance by up to 30%, especially when using long contexts. Putting the question at the end makes it clearer what task the model is supposed to accomplish and often leads to better results.
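For example, with a long document in the context, that split could look like the sketch below; the XML-style tags and variable contents are just placeholders:
document_content = "..."  # the (possibly long) document you are working with
user_question = "What is the total amount on this invoice?"

# general instructions and the long document go first
system_prompt = (
    "You are a document expert. Answer questions using only the document below.\n\n"
    f"<document>\n{document_content}\n</document>"
)

# the user message contains only the question, at the very end
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_question},
]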
Use a prompt optimizer
Prompts written by humans are often messy, inconsistent, redundant, or poorly structured. Therefore, you should always run your prompts through a prompt optimizer.
The simplest prompt optimizer is the LLM itself: prompt it with "Please improve this prompt: {prompt}". This usually yields a more structured prompt with less redundant content.
However, a better approach is to use a dedicated prompt optimizer, such as the ones available in the OpenAI and Anthropic consoles. These are LLMs specifically prompted and tuned to optimize prompts, and they usually give better results. In addition, make sure to include:
- Details of the task you are trying to accomplish
- Example inputs and outputs where the current prompt succeeded
- Example inputs and outputs where the current prompt failed
Providing this additional information usually leads to better results and, ultimately, better prompts. The whole process takes about 10 to 15 minutes and gives you a more performant prompt, which makes the prompt optimizer one of the lowest-effort ways to improve LLM performance.
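If you prefer doing this in code rather than in a provider console, you can build the meta-prompt yourself. The sketch below is one possible way to bundle the task description and example inputs and outputs; the function name and wording are my own, not a standard API:
def build_optimizer_prompt(prompt, task_description, good_examples, bad_examples):
    """Bundle everything a prompt optimizer needs into a single meta-prompt."""
    good = "\n".join(f"- input: {i}\n  output: {o}" for i, o in good_examples)
    bad = "\n".join(f"- input: {i}\n  output: {o}" for i, o in bad_examples)
    return f"""Please improve the following prompt.

Task: {task_description}

Current prompt:
{prompt}

Examples where the current prompt worked well:
{good}

Examples where the current prompt failed:
{bad}

Return only the improved prompt."""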
Benchmark LLMs
The LLM you use also has a significant impact on the performance of your application. Different LLMs are good at different tasks, so you should try several LLMs for your specific use case. At a minimum, we recommend setting up access to the largest LLM providers, such as Google Gemini, OpenAI, and Anthropic. This setup is very easy, and once your credentials are in place, switching LLM providers takes only a few minutes. You can also consider testing open-source LLMs, although they usually require more effort.
Next, you should define specific benchmarks for the task you are trying to accomplish and see which LLM performs best. Keep in mind that major LLM providers sometimes upgrade their models without releasing new versions, so you should regularly re-check the performance of the models you use. Of course, you should also be ready to test new models as the major providers release them.
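A benchmark can start out as a small script that runs the same test cases against every candidate model and counts correct answers. The sketch below assumes a hypothetical ask(model, prompt) helper that routes each call to the right provider SDK; the model names, test cases, and exact-match scoring are placeholders for your own task:
test_cases = [
    # (prompt, expected answer) pairs for your specific task
    ("Extract the invoice number from: 'Invoice #4821, due May 1'", "4821"),
    ("Extract the invoice number from: 'Ref INV-0099, paid'", "INV-0099"),
]

models = ["gpt-4o-mini", "claude-3-5-sonnet-latest", "gemini-1.5-pro"]

def ask(model: str, prompt: str) -> str:
    """Hypothetical helper: send the prompt to whichever provider serves `model`."""
    raise NotImplementedError  # plug in the OpenAI / Anthropic / Gemini SDK here

for model in models:
    correct = 0
    for prompt, expected in test_cases:
        answer = ask(model, prompt).strip()
        correct += int(answer == expected)
    print(f"{model}: {correct}/{len(test_cases)} correct")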
Conclusion
This article described four different techniques that you can use to improve the performance of your LLM applications: leveraging cached tokens, placing the question at the end of the prompt, using a prompt optimizer, and creating task-specific LLM benchmarks. All of these are relatively easy to set up and can lead to significant performance improvements. I'm sure there are many similar simple techniques out there, so always keep an eye out for them. These topics are regularly covered in various blog posts, and Anthropic's blog is one of those that has contributed the most to improving LLM performance.
👉 Find me on social:
🧑💻 contact
✍️ Medium
You can also read my other articles.

