Saturday, April 18, 2026

Based on the original post by Dr. Hemant Joshi, CTO, FloTorch.ai

A recent evaluation conducted by FloTorch compared the performance of Amazon Nova models with OpenAI's GPT-4o.

Amazon Nova is a new generation of state-of-the-art foundation models (FMs) that deliver frontier intelligence and industry-leading price performance. The Amazon Nova family of models includes Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro, which support text, image, and video inputs while generating text-based outputs. These models offer enterprises a range of capabilities, balancing accuracy, speed, and cost-efficiency.

Using its enterprise software, FloTorch conducted an extensive comparison between Amazon Nova models and OpenAI's GPT-4o models with the Comprehensive Retrieval Augmented Generation (CRAG) benchmark dataset. FloTorch's evaluation focused on three critical factors (latency, accuracy, and cost) across five diverse topics.

Key findings from the benchmark study:

  • GPT-4o demonstrated a slight edge in accuracy over Amazon Nova Pro
  • Amazon Nova Pro outperformed GPT-4o in efficiency, running 21.97% faster while being 65.26% cheaper
  • Amazon Nova Micro and Amazon Nova Lite outperformed GPT-4o mini by 4 and 2 percentage points in accuracy, respectively
  • In terms of affordability, Amazon Nova Micro and Amazon Nova Lite were 73.10% and 56.59% cheaper than GPT-4o mini, respectively
  • Amazon Nova Micro and Amazon Nova Lite also demonstrated faster response times, with 20.48% and 26.60% improvements, respectively

In this post, we discuss the findings from this benchmarking in more detail.

The growing need for cost-effective AI models

The landscape of generative AI is rapidly evolving. OpenAI launched GPT-4o in May 2024, and Amazon released the Amazon Nova models at AWS re:Invent in December 2024. Although GPT-4o has gained traction in the AI community, enterprises are showing increased interest in Amazon Nova due to its lower latency and cost-effectiveness.

Large language models (LLMs) are generally proficient at responding to user queries, but they sometimes generate overly broad or inaccurate responses. Additionally, LLMs might provide answers that extend beyond the company-specific context, making them unsuitable for certain enterprise use cases.

One of the most critical applications for LLMs today is Retrieval Augmented Generation (RAG), which enables AI models to ground responses in enterprise knowledge bases such as PDFs, internal documents, and structured data. This is an essential requirement for enterprises that want their AI systems to provide responses strictly within a defined scope.

To better serve enterprise customers, the evaluation aimed to answer three key questions:

  • How does Amazon Nova Pro compare to GPT-4o in terms of latency, cost, and accuracy?
  • How do Amazon Nova Micro and Amazon Nova Lite perform against GPT-4o mini on these same metrics?
  • How well do these models handle RAG use cases across different industry domains?

By addressing these questions, the evaluation provides enterprises with actionable insights into selecting the right AI models for their specific needs, whether optimizing for speed, accuracy, or cost-efficiency.

Overview of the CRAG benchmark dataset

The CRAG dataset was released by Meta for testing with factual queries across five domains with eight question types and a large number of question-answer pairs. The five domains in the CRAG dataset are Finance, Sports, Music, Movie, and Open (miscellaneous). The eight question types are simple, simple_w_condition, comparison, aggregation, set, false_premise, post-processing, and multi-hop. The following table provides example questions with their domain and question type.

Domain | Question | Question Type
Sports | Can you carry less than the maximum number of clubs during a round of golf? | simple
Music | Can you tell me how many grammies were won by arlo guthrie till 60th grammy (2017)? | simple_w_condition
Open | Can i make cookies in an air fryer? | simple
Finance | Did meta have any mergers or acquisitions in 2022? | simple_w_condition
Movie | In 2016, which movie was distinguished for its visual effects at the oscars? | simple_w_condition

The evaluation considered 200 queries from this dataset representing five domains and two question types, simple and simple_w_condition. Both types of questions are common from users, and a typical Google search for a query such as "Can you tell me how many grammies were won by arlo guthrie till 60th grammy (2017)?" might not offer the correct answer (one Grammy). FloTorch used these queries and their ground truth answers to create a subset benchmark dataset. The CRAG dataset also provides the top 5 search result pages for each query. These five webpages act as a knowledge base (source data) to constrain the RAG model's response. The goal is to index these five webpages dynamically using a standard embedding algorithm and then use a retrieval (and reranking) strategy to retrieve chunks of data from the indexed knowledge base to infer the final answer.

Evaluation setup

The RAG evaluation pipeline consists of several key components, as illustrated in the following diagram.

In this section, we explore each component in more detail.

Knowledge base

FloTorch used the top 5 HTML webpages provided with the CRAG dataset for each query as the knowledge base source data. The HTML pages were parsed to extract text for the embedding stage.

Chunking strategy

FloTorch used a fixed chunking strategy with a chunk size of 512 tokens (four characters is typically around one token) and a 10% overlap between chunks. Further experiments with different chunking strategies, chunk sizes, and percent overlaps will be conducted in the coming weeks and will update this post.
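The fixed-size strategy described above can be sketched in a few lines. This is a hypothetical illustration (not FloTorch's actual implementation) that approximates tokens by whitespace-separated words; a production pipeline would count tokens with the embedding model's tokenizer:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap_pct: float = 0.10) -> list[str]:
    """Split text into fixed-size token chunks with a percentage overlap.

    Tokens are approximated by whitespace splitting for illustration only.
    """
    tokens = text.split()
    # Each chunk starts (1 - overlap) of a chunk after the previous one,
    # so consecutive chunks share overlap_pct of their tokens.
    step = max(1, int(chunk_size * (1 - overlap_pct)))
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With the post's parameters (512 tokens, 10% overlap), each chunk starts roughly 460 tokens after the previous one.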

Embedding strategy

FloTorch used the Amazon Titan Text Embeddings V2 model on Amazon Bedrock with an output vector size of 1,024. With a maximum input token limit of 8,192 for the model, the system successfully and efficiently embedded chunks from the knowledge base source data as well as short queries from the CRAG dataset. Amazon Bedrock APIs make it straightforward to use Amazon Titan Text Embeddings V2 for embedding data.
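The embedding call can be sketched with the Bedrock runtime API. The `inputText`, `dimensions`, and `normalize` fields follow the documented request format for `amazon.titan-embed-text-v2:0`; this stand-alone sketch is not FloTorch's `create_embeddings_with_titan_bedrock` helper shown later:

```python
import json


def build_titan_request(text: str, dimensions: int = 1024, normalize: bool = True) -> str:
    """Build the JSON payload expected by Titan Text Embeddings V2."""
    return json.dumps({"inputText": text, "dimensions": dimensions, "normalize": normalize})


def titan_embed(text: str) -> list[float]:
    """Embed a chunk (or query) with Titan Text Embeddings V2 on Amazon Bedrock."""
    import boto3  # deferred so the payload helper stays dependency-free
    client = boto3.client("bedrock-runtime")
    response = client.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=build_titan_request(text),
        contentType="application/json",
        accept="application/json",
    )
    # The response body is JSON with the vector under the "embedding" key.
    return json.loads(response["body"].read())["embedding"]
```

Calling `titan_embed` requires AWS credentials with Bedrock access in the environment.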

Vector database

FloTorch selected Amazon OpenSearch Service as the vector database for its high-performance metrics. The implementation included a provisioned three-node sharded OpenSearch Service cluster. Each provisioned node was r7g.4xlarge, selected for its availability and sufficient capacity to meet the performance requirements. FloTorch used HNSW indexing in OpenSearch Service.
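For illustration, a k-NN index holding 1,024-dimension Titan vectors with an HNSW method could be defined with a mapping like the following. The engine choice and the HNSW parameters (`ef_construction`, `m`) are common defaults assumed here, not FloTorch's reported configuration:

```python
# Sketch of an OpenSearch k-NN index body for 1,024-dim embeddings with HNSW.
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "text": {"type": "text"},  # raw chunk text returned with each hit
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,  # matches Titan Text Embeddings V2 output size
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {"ef_construction": 128, "m": 16},
                },
            },
        }
    },
}
```

With the opensearch-py client, this body would be passed to `client.indices.create(index=..., body=index_body)`.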

Retrieval (and reranking) strategy

FloTorch used a retrieval strategy with k-nearest neighbors (k-NN) of 5 for retrieved chunks. The experiments excluded reranking algorithms to make sure retrieved chunks remained consistent for both models when inferring the answer to the provided query. The following code snippet embeds the given query and passes the embeddings to the search function:

import os
import logging
from typing import List

logger = logging.getLogger(__name__)

# create_embeddings_with_titan_bedrock and search are helper functions
# defined elsewhere in the pipeline.

def search_results(interaction_ids: List[str], queries: List[str], k: int):
    """Retrieve search results for queries."""
    results = []
    embedding_max_length = int(os.getenv("EMBEDDING_MAX_LENGTH", 1024))
    normalize_embeddings = os.getenv("NORMALIZE_EMBEDDINGS", "True").lower() == "true"

    for interaction_id, query in zip(interaction_ids, queries):
        try:
            # Embed the query with Titan, then run a k-NN search against the index.
            _, _, embedding = create_embeddings_with_titan_bedrock(query, embedding_max_length, normalize_embeddings)
            results.append(search(interaction_id + '_titan', embedding, k))
        except Exception as e:
            logger.error(f"Error processing query {query}: {e}")
            results.append(None)
    return results

Inferencing

FloTorch used the GPT-4o model from OpenAI using the available API key and used the Amazon Nova Pro model with the Converse API. GPT-4o supports a context window of 128,000 tokens compared to Amazon Nova Pro's context window of 300,000 tokens. The maximum output token limit of GPT-4o is 16,384 vs. the Amazon Nova Pro maximum output token limit of 5,000. The benchmarking experiments were conducted without Amazon Bedrock Guardrails functionality. The implementation used the universal gateway provided by the FloTorch enterprise version to enable consistent API calls using the same function and to track token count and latency metrics uniformly. The inference function code is as follows:

from tqdm import tqdm  # progress bar over batches

def generate_responses(dataset_path: str, model_name: str, batch_size: int, api_endpoint: str, auth_header: str,
                       max_tokens: int, search_k: int, system_prompt: str):
    """Generate responses for queries."""
    results = []

    for batch in tqdm(load_data_in_batches(dataset_path, batch_size), desc="Generating responses"):
        interaction_ids = [item["interaction_id"] for item in batch]
        queries = [item["query"] for item in batch]
        search_results_list = search_results(interaction_ids, queries, search_k)

        # Attach the retrieved chunks to each batch item before inference.
        for i, item in enumerate(batch):
            item["search_results"] = search_results_list[i]

        responses = send_batch_request(batch, model_name, api_endpoint, auth_header, max_tokens, system_prompt)

        for i, response in enumerate(responses):
            results.append({
                "interaction_id": interaction_ids[i],
                "query": queries[i],
                "prediction": response.get("choices", [{}])[0].get("message", {}).get("content") if response else None,
                "response_time": response.get("response_time") if response else None,
                "response": response,
            })

    return results

Evaluation

Both models were evaluated by running batch queries. A batch size of eight was chosen to comply with Amazon Bedrock quota limits as well as GPT-4o rate limits. The query function code is as follows:

import time
import logging
from typing import Dict, List

import requests

logger = logging.getLogger(__name__)

def send_batch_request(batch: List[Dict], model_name: str, api_endpoint: str, auth_header: str, max_tokens: int,
                       system_prompt: str):
    """Send batch queries to the API."""
    headers = {"Authorization": auth_header, "Content-Type": "application/json"}
    responses = []

    for item in batch:
        query = item["query"]
        query_time = item["query_time"]
        retrieval_results = item.get("search_results", [])

        # Build the grounding context from the retrieved chunks.
        references = "# References \n" + "\n".join(
            [f"Reference {_idx + 1}:\n{res['text']}\n" for _idx, res in enumerate(retrieval_results)])
        user_message = f"{references}\n------\n\nUsing only the references listed above, answer the following question:\nQuestion: {query}\n"

        payload = {
            "model": model_name,
            "messages": [{"role": "system", "content": system_prompt},
                         {"role": "user", "content": user_message}],
            "max_tokens": max_tokens,
        }

        try:
            start_time = time.time()
            response = requests.post(api_endpoint, headers=headers, json=payload, timeout=25000)
            response.raise_for_status()
            response_json = response.json()
            response_json['response_time'] = time.time() - start_time
            responses.append(response_json)
        except requests.RequestException as e:
            logger.error(f"API request failed for query: {query}. Error: {e}")
            responses.append(None)

    return responses

Benchmarking on the CRAG dataset

In this section, we discuss the latency, accuracy, and cost measurements of benchmarking on the CRAG dataset.

Latency

Latency measurements for each query response were calculated as the difference between two timestamps: the timestamp when the API call is made to the inference LLM, and a second timestamp when the entire response is received from the inference endpoint. The difference between these two timestamps determines the latency. A lower latency indicates a faster-performing LLM, making it suitable for applications requiring quick response times. The study indicates that latency can be further reduced for both models through optimizations and caching techniques; however, the evaluation focused on measuring out-of-the-box latency performance for both models.

Accuracy

FloTorch used a modified version of the local_evaluation.py script provided with the CRAG benchmark for accuracy evaluations. The script was enhanced to provide proper categorization of correct, incorrect, and missing responses. The default GPT-4o evaluation LLM in the evaluation script was replaced with the mixtral-8x7b-instruct-v0:1 model API. Additional modifications to the script enabled tracking of input and output tokens and latency as described earlier.

Cost

Cost calculations were straightforward because both Amazon Nova Pro and GPT-4o have published prices per million input and output tokens separately. The calculation process involved multiplying input tokens by the corresponding rate and applying the same process for output tokens. The total cost for running 200 queries was determined by combining input token and output token costs. OpenSearch Service provisioned cluster costs were excluded from this analysis because the cost comparison focused solely on the inference stage between the Amazon Nova Pro and GPT-4o LLMs.
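That methodology reduces to a small helper. The rates passed in would be each provider's published per-million-token prices; the values in the example below are placeholders, not actual pricing:

```python
def inference_cost(input_tokens: int, output_tokens: int,
                   input_price_per_million: float, output_price_per_million: float) -> float:
    """Total cost = input tokens x input rate + output tokens x output rate.

    Rates are expressed in dollars per million tokens.
    """
    return (input_tokens / 1_000_000) * input_price_per_million \
        + (output_tokens / 1_000_000) * output_price_per_million

# Example with placeholder rates of $1.00/M input and $2.00/M output tokens:
cost = inference_cost(300_000, 50_000, 1.00, 2.00)
```

Summing this helper's result over the 200 benchmark queries (with real published rates) yields the totals reported below.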

Results

The following table summarizes the results.

 | Amazon Nova Pro | GPT-4o | Observation
Accuracy on subset of the CRAG dataset | 51.50% (103 correct responses out of 200) | 53.00% (106 correct responses out of 200) | GPT-4o outperforms Amazon Nova Pro by 1.5 percentage points on accuracy
Cost for running inference for 200 queries | $0.00030205 | $0.000869537 | Amazon Nova Pro saves 65.26% in costs compared to GPT-4o
Average latency (seconds) | 1.682539835 | 2.15615045 | Amazon Nova Pro is 21.97% faster than GPT-4o
Average of input and output tokens | 1946.621359 | 1782.707547 | Typical GPT-4o responses are shorter than Amazon Nova responses

For simple queries, Amazon Nova Pro and GPT-4o have similar accuracies (55 and 56 correct responses, respectively), but for simple queries with conditions, GPT-4o performs slightly better than Amazon Nova Pro (50 vs. 48 correct answers). Imagine you are part of an organization running an AI assistant service that handles 1,000 questions per month from each of 10,000 users (10,000,000 queries per month). Amazon Nova Pro will save your organization $5,674.88 per month ($68,098 per year) compared to GPT-4o.
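As a back-of-the-envelope check (not additional benchmark data), the stated savings follow from reading the table's cost figures as average per-query costs, which is the interpretation consistent with the monthly total:

```python
# Average per-query inference costs taken from the results table above.
NOVA_PRO_COST_PER_QUERY = 0.00030205
GPT_4O_COST_PER_QUERY = 0.000869537

queries_per_month = 10_000 * 1_000  # 10,000 users x 1,000 questions each
monthly_savings = queries_per_month * (GPT_4O_COST_PER_QUERY - NOVA_PRO_COST_PER_QUERY)
annual_savings = monthly_savings * 12
print(f"${monthly_savings:,.2f}/month, ${annual_savings:,.2f}/year")
```

This reproduces the post's figures to within a cent of rounding.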

Let's look at similar results for the Amazon Nova Micro, Amazon Nova Lite, and GPT-4o mini models on the same dataset.
 | Amazon Nova Lite | Amazon Nova Micro | GPT-4o mini | Observation
Accuracy on subset of the CRAG dataset | 52.00% (104 correct responses out of 200) | 54.00% (108 correct responses out of 200) | 50.00% (100 correct responses out of 200) | Amazon Nova Lite and Amazon Nova Micro outperform GPT-4o mini by 2 and 4 percentage points, respectively
Cost for running inference for 200 queries | $0.00002247 (56.59% cheaper than GPT-4o mini) | $0.000013924 (73.10% cheaper than GPT-4o mini) | $0.000051768 | Amazon Nova Lite and Amazon Nova Micro are cheaper than GPT-4o mini by 56.59% and 73.10%, respectively
Average latency (seconds) | 1.553371465 (26.60% faster than GPT-4o mini) | 1.6828564 (20.48% faster than GPT-4o mini) | 2.116291895 | Amazon Nova models are at least 20% faster than GPT-4o mini
Average of input and output tokens | 1930.980769 | 1940.166667 | 1789.54 | GPT-4o mini returns shorter answers

Amazon Nova Micro is significantly faster and cheaper than GPT-4o mini while providing more accurate answers. If you are running a service that handles about 10 million queries each month, it will save you on average 73% of what you would otherwise pay for the slightly less accurate results of the GPT-4o mini model.

Conclusion

Based on these tests for RAG cases, Amazon Nova models produce comparable or higher accuracy at significantly lower cost and latency than the GPT-4o and GPT-4o mini models. FloTorch is continuing further experimentation with other relevant LLMs for comparison. Future analysis will include additional experiments with the other question types, such as comparison, aggregation, set, false_premise, post-processing, and multi-hop queries.

Get started with Amazon Nova on the Amazon Bedrock console. Learn more at the Amazon Nova product page.

About FloTorch

FloTorch.ai helps enterprise customers design and manage agentic workflows in a secure and scalable manner. FloTorch's mission is to help enterprises make data-driven decisions across the end-to-end generative AI pipeline, including but not limited to model selection, vector database selection, and evaluation strategies. FloTorch offers an open source version for customers with scalable experimentation with different chunking, embedding, retrieval, and inference strategies. The open source version works in a customer's AWS account, so you can experiment in your own AWS account with your proprietary data. Interested users are invited to try out FloTorch from AWS Marketplace or from GitHub. FloTorch also offers an enterprise version of this product for scalable experimentation with LLM models and vector databases on cloud platforms. The enterprise version also includes a universal gateway with a model registry to custom define new LLMs, and a recommendation engine to suggest new LLMs and agent workflows. For more information, contact us at info@flotorch.ai.


About the authors

Prasanna Sridharan is a Principal Gen AI/ML Architect at AWS, specializing in designing and implementing AI/ML and Generative AI solutions for enterprise customers. With a passion for helping AWS customers build innovative Gen AI applications, he focuses on creating scalable, cutting-edge AI solutions that drive business transformation. You can connect with Prasanna on LinkedIn.

Dr. Hemant Joshi has over 20 years of industry experience building products and services with AI/ML technologies. As CTO of FloTorch, Hemant is engaged with customers to implement state-of-the-art GenAI solutions and agentic workflows for enterprises.
