Metadata filtering for tabular data using Knowledge Bases for Amazon Bedrock

by root July 20, 2024

Amazon Bedrock is a fully managed service that lets organizations choose high-performing foundation models (FMs) from leading artificial intelligence (AI) companies, such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon, through a single API. To give FMs up-to-date, proprietary information, organizations use Retrieval Augmented Generation (RAG), a technique that retrieves data from company data sources and enriches the prompt to produce more relevant and accurate responses. Knowledge Bases for Amazon Bedrock is a fully managed capability that helps you implement the entire RAG workflow, from ingestion to retrieval to prompt augmentation. However, information about one piece of data is often held in another piece of data, called metadata. Without metadata, the retrieval process can return irrelevant results, which lowers the accuracy of the FM and increases the cost of FM prompt tokens.

On March 27, 2024, Amazon Bedrock announced an important new feature called metadata filtering and also changed the default engine. This change lets you use metadata fields during the retrieval process. However, the metadata fields must be configured during the knowledge base ingestion process. Often, you have tabular data where details about one field are available in another field. You might also need to cite the exact text document or text field to prevent hallucinations. In this post, we show you how to use the new metadata filtering feature in Knowledge Bases for Amazon Bedrock for such tabular data.

Solution overview

The solution consists of the following high-level steps:

Prepare data for metadata filtering.
Create and populate a knowledge base with data and metadata.
Retrieve data from the knowledge base using metadata filtering.

Prepare data for metadata filtering

At the time of writing, Knowledge Bases for Amazon Bedrock supports Amazon OpenSearch Serverless, Amazon Aurora, Pinecone, Redis Enterprise, and MongoDB Atlas as underlying vector store providers. In this post, we use the Amazon Bedrock Boto3 SDK to create and access an OpenSearch Serverless vector store. For more information, see Set up a vector index for your knowledge base in a supported vector store.

In this post, we use a public dataset, Food.com – Recipes and Reviews, to create a knowledge base. The following screenshot shows an example of the dataset.

The TotalTime column is in ISO 8601 format. You can convert it to minutes using the following logic:

import re

# Function to convert an ISO 8601 duration (for example, PT1H30M) to minutes
def convert_to_minutes(duration):
    hours = 0
    minutes = 0

    # Find hours and minutes using regex
    match = re.match(r'PT(?:(\d+)H)?(?:(\d+)M)?', duration)

    if match:
        if match.group(1):
            hours = int(match.group(1))
        if match.group(2):
            minutes = int(match.group(2))

    # Convert total time to minutes
    total_minutes = hours * 60 + minutes
    return total_minutes

# df is the pandas data frame loaded from the dataset
df['TotalTimeInMinutes'] = df['TotalTime'].apply(convert_to_minutes)
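As a quick sanity check of the conversion above, the following sketch runs the same regular expression on a few hand-written durations. The sample strings are illustrative and are not rows from the dataset, and the sketch assumes durations contain only hour and minute parts.

```python
import re

def convert_to_minutes(duration):
    """Convert an ISO 8601 duration such as 'PT1H30M' to whole minutes."""
    match = re.match(r'PT(?:(\d+)H)?(?:(\d+)M)?', duration)
    hours = int(match.group(1)) if match and match.group(1) else 0
    minutes = int(match.group(2)) if match and match.group(2) else 0
    return hours * 60 + minutes

# Hand-written sample values (assumed, not taken from the dataset)
samples = ["PT30M", "PT1H", "PT24H45M"]
converted = {s: convert_to_minutes(s) for s in samples}
print(converted)  # {'PT30M': 30, 'PT1H': 60, 'PT24H45M': 1485}
```

Durations that include days or seconds would need additional groups in the pattern; this sketch would ignore those parts.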

After converting some of the features, such as CholesterolContent, SugarContent, and RecipeInstructions, your data frame should look like the following screenshot:

To let the FM point to a specific menu with a link (that is, cite the document), we split each row of the tabular data into its own text file. Each file contains RecipeInstructions as the data field, and TotalTimeInMinutes, CholesterolContent, and SugarContent are saved as metadata. The metadata is stored in a separate JSON file with the same name as the data file, with .metadata.json appended to the file name. For example, if the data file name is 100.txt, the metadata file name is 100.txt.metadata.json. For more information, see Add metadata to your files to allow for filtering. The content of the metadata file must be in the following format:

{
"metadataAttributes": {
"${attribute1}": "${value1}",
"${attribute2}": "${value2}",
...
}
}
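To make the naming rule and file layout concrete, here is a minimal sketch that writes one data file and its metadata sidecar into a temporary folder. The helper name write_record and the sample values are hypothetical. Numeric values are written as JSON numbers on the assumption that numeric filter operators compare numbers.

```python
import json
import os
import tempfile

def write_record(folder, stem, text, attributes):
    """Write <stem>.txt and its <stem>.txt.metadata.json sidecar; return both paths."""
    data_path = os.path.join(folder, f"{stem}.txt")
    meta_path = data_path + ".metadata.json"
    with open(data_path, "w") as f:
        f.write(text)
    with open(meta_path, "w") as f:
        json.dump({"metadataAttributes": attributes}, f)
    return data_path, meta_path

# Illustrative values only; real values come from the data frame
folder = tempfile.mkdtemp()
data_path, meta_path = write_record(
    folder, "100", "Mix the ingredients and bake.",
    {"TotalTimeInMinutes": 25, "CholesterolContent": 0.0},
)
print(os.path.basename(meta_path))  # 100.txt.metadata.json
```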

For simplicity, we process only the top 2,000 rows to create the knowledge base.

After you import the required libraries, create a local directory using the following Python code:

import pandas as pd
import os, json, tqdm, boto3

metafolder = "multi_file_recipe_data"
os.makedirs(metafolder, exist_ok=True)

Iterate over the top 2,000 rows and create a data file and a metadata file for each row in the local folder:

for i in tqdm.trange(2000):
    desc = str(df['RecipeInstructions'][i])
    # Numeric attributes are written as JSON numbers so that numeric filter
    # operators such as lessThan can compare them
    meta = {
        "metadataAttributes": {
            "Title": str(df['Name'][i]),
            "TotalTimeInMinutes": int(df['TotalTimeInMinutes'][i]),
            "CholesterolContent": float(df['CholesterolContent'][i]),
            "SugarContent": float(df['SugarContent'][i]),
        }
    }
    # Data file: <row number>.txt
    filename = os.path.join(metafolder, str(i + 1) + '.txt')
    with open(filename, 'w') as f:
        f.write(desc)
    # Metadata file: <row number>.txt.metadata.json
    metafilename = filename + '.metadata.json'
    with open(metafilename, 'w') as f:
        json.dump(meta, f)

Create an Amazon Simple Storage Service (Amazon S3) bucket named recipe-kb and upload the files:

# Upload data to S3
s3_client = boto3.client("s3")
bucket_name = "recipe-kb"
data_root = metafolder + '/'

def uploadDirectory(path, bucket_name):
    # Walk the local folder and upload every file, using the file name as the object key
    for root, dirs, files in os.walk(path):
        for file in tqdm.tqdm(files):
            s3_client.upload_file(os.path.join(root, file), bucket_name, file)

uploadDirectory(data_root, bucket_name)
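Because every data file needs a metadata file with a matching name, it can be worth checking the local folder before uploading. The following is a minimal sketch of such a check; find_unpaired is a hypothetical helper, not part of any AWS SDK, and the demonstration folder is created only for this example.

```python
import os
import tempfile

def find_unpaired(folder):
    """Return data files without a sidecar and sidecars without a data file."""
    names = set(os.listdir(folder))
    suffix = ".metadata.json"
    data_files = {n for n in names if not n.endswith(suffix)}
    meta_files = {n for n in names if n.endswith(suffix)}
    missing_meta = sorted(n for n in data_files if n + suffix not in meta_files)
    orphan_meta = sorted(n for n in meta_files if n[:-len(suffix)] not in data_files)
    return missing_meta, orphan_meta

# Demonstration folder: one complete pair and one data file lacking metadata
demo = tempfile.mkdtemp()
for name in ["1.txt", "1.txt.metadata.json", "2.txt"]:
    open(os.path.join(demo, name), "w").close()

missing_meta, orphan_meta = find_unpaired(demo)
print(missing_meta, orphan_meta)  # ['2.txt'] []
```

Running the same function on the recipe folder and getting two empty lists would suggest the upload contains only complete pairs.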

Create and populate a knowledge base with data and metadata

When the S3 folder is ready, you can follow this sample notebook to create a knowledge base using the SDK.

Retrieve data from the knowledge base using metadata filtering

Now, let's retrieve data from the knowledge base. In this post, we use Anthropic Claude Sonnet on Amazon Bedrock as the FM, but you can choose from a variety of Amazon Bedrock models. First, set the following variables, where kb_id is the ID of your knowledge base. You can find the knowledge base ID programmatically, as shown in the sample notebook, or on the Amazon Bedrock console by navigating to the individual knowledge base, as shown in the following screenshot.

Use the following code to set the required Amazon Bedrock parameters:

import boto3
import pprint
from botocore.client import Config
import json

pp = pprint.PrettyPrinter(indent=2)
session = boto3.session.Session()
region = session.region_name
bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime', region_name=region)
bedrock_agent_client = boto3.client("bedrock-agent-runtime",
                              config=bedrock_config, region_name=region)
kb_id = "EIBBXVFDQP"  # replace with your knowledge base ID
model_id = 'anthropic.claude-3-sonnet-20240229-v1:0'

# Retrieve API for fetching only the relevant context.

query = "Tell me a recipe that I can make under 30 minutes and has cholesterol less than 10"

relevant_documents = bedrock_agent_client.retrieve(
    retrievalQuery={
        'text': query
    },
    knowledgeBaseId=kb_id,
    retrievalConfiguration={
        'vectorSearchConfiguration': {
            'numberOfResults': 2
        }
    }
)
pp.pprint(relevant_documents["retrievalResults"])

The following is the output retrieved from the knowledge base without any metadata filtering for the query "Tell me a recipe that I can make under 30 minutes and has cholesterol less than 10". As you can see, the preparation times of the two recipes are 30 and 480 minutes, respectively, and the cholesterol contents are 86 and 112.4, respectively. Neither recipe meets both constraints, so the retrieval does not follow the query accurately.
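A mismatch like this can also be detected in code rather than by reading the output. The sketch below compares the metadata returned with each result against the limits stated in the query. It assumes each retrieval result carries a metadata dictionary with the ingested attributes; the sample results are a hand-built stand-in using the values quoted above, and the S3 URIs are hypothetical.

```python
def find_violations(results, limits):
    """Return (uri, key, value) for each metadata value that is not below its limit."""
    violations = []
    for result in results:
        metadata = result.get("metadata", {})
        uri = result.get("location", {}).get("s3Location", {}).get("uri")
        for key, limit in limits.items():
            value = metadata.get(key)
            if value is not None and float(value) >= limit:
                violations.append((uri, key, float(value)))
    return violations

# Hand-built stand-in for response["retrievalResults"]; URIs are hypothetical
sample_results = [
    {"location": {"s3Location": {"uri": "s3://recipe-kb/a.txt"}},
     "metadata": {"TotalTimeInMinutes": "30", "CholesterolContent": "86.0"}},
    {"location": {"s3Location": {"uri": "s3://recipe-kb/b.txt"}},
     "metadata": {"TotalTimeInMinutes": "480", "CholesterolContent": "112.4"}},
]
limits = {"TotalTimeInMinutes": 30, "CholesterolContent": 10}
violations = find_violations(sample_results, limits)
print(len(violations))  # 4
```

A recipe that takes exactly 30 minutes counts as a violation here because the query asks for under 30 minutes.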

The following code shows how to use the Retrieve API for the same query with metadata filters set to a cholesterol content of less than 10 and a preparation time of less than 30 minutes:

def retrieve(query, kbId, numberOfResults=5):
    return bedrock_agent_client.retrieve(
        retrievalQuery={
            'text': query
        },
        knowledgeBaseId=kbId,
        retrievalConfiguration={
            'vectorSearchConfiguration': {
                'numberOfResults': numberOfResults,
                # Return only documents that satisfy both conditions
                'filter': {
                    'andAll': [
                        {
                            'lessThan': {
                                'key': 'CholesterolContent',
                                'value': 10
                            }
                        },
                        {
                            'lessThan': {
                                'key': 'TotalTimeInMinutes',
                                'value': 30
                            }
                        }
                    ]
                }
            }
        }
    )

query = "Tell me a recipe that I can make under 30 minutes and has cholesterol less than 10"
response = retrieve(query, kb_id, 2)
retrievalResults = response['retrievalResults']
pp.pprint(retrievalResults)

As you can see from the following results, the preparation times of the two recipes are 27 and 20 minutes, respectively, and the cholesterol contents are 0 and 0, respectively. Both recipes satisfy the constraints in the query, so metadata filtering gives more accurate results.
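When the limits change from query to query, the filter dictionary can be built from a mapping instead of being written out by hand. The following is a minimal sketch; build_less_than_filter is a hypothetical helper, and it returns a single condition unwrapped because the andAll operator expects at least two conditions.

```python
def build_less_than_filter(limits):
    """Build a Retrieve API filter requiring every key to be below its limit."""
    conditions = [
        {"lessThan": {"key": key, "value": value}}
        for key, value in limits.items()
    ]
    if not conditions:
        raise ValueError("at least one limit is required")
    # andAll expects two or more conditions, so a lone condition is returned as-is
    return conditions[0] if len(conditions) == 1 else {"andAll": conditions}

metadata_filter = build_less_than_filter(
    {"CholesterolContent": 10, "TotalTimeInMinutes": 30}
)
print(metadata_filter)
```

The resulting dictionary can be passed as the filter value inside vectorSearchConfiguration.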

The following code shows how to use the same metadata filtering with the retrieve_and_generate API to get a generated answer. First configure the prompt, then configure the API with metadata filtering:

prompt = """
Human: You have great knowledge about food, so provide answers to questions by using fact.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

A:"""

def retrieve_and_generate(query, kb_id, modelId, numberOfResults=10):
    return bedrock_agent_client.retrieve_and_generate(
        input={
            'text': query,
        },
        retrieveAndGenerateConfiguration={
            'knowledgeBaseConfiguration': {
                'generationConfiguration': {
                    'promptTemplate': {
                        'textPromptTemplate': f"{prompt} $search_results$"
                    }
                },
                'knowledgeBaseId': kb_id,
                'modelArn': modelId,
                'retrievalConfiguration': {
                    'vectorSearchConfiguration': {
                        'numberOfResults': numberOfResults,
                        'overrideSearchType': 'HYBRID',
                        # Same metadata filter as in the Retrieve API call
                        'filter': {
                            'andAll': [
                                {
                                    'lessThan': {
                                        'key': 'CholesterolContent',
                                        'value': 10
                                    }
                                },
                                {
                                    'lessThan': {
                                        'key': 'TotalTimeInMinutes',
                                        'value': 30
                                    }
                                }
                            ]
                        },
                    }
                }
            },
            'type': 'KNOWLEDGE_BASE'
        }
    )

query = "Tell me a recipe that I can make under 30 minutes and has cholesterol less than 10"
response = retrieve_and_generate(query, kb_id, model_id, numberOfResults=10)
pp.pprint(response['output']['text'])

As you can see in the following output, the model returns a detailed recipe that follows the specified metadata filtering, with a preparation time of less than 30 minutes and a cholesterol content of less than 10.
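Because each recipe was ingested as its own file, the generated answer can be traced back to specific recipes. The sketch below collects the S3 URIs of the documents cited in a retrieve_and_generate response. It assumes the response contains a citations list whose entries hold retrievedReferences with an S3 location; the sample response is a hand-built stand-in and its URIs are hypothetical.

```python
def cited_uris(response):
    """Collect the distinct S3 URIs referenced in a RetrieveAndGenerate response."""
    uris = []
    for citation in response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            uri = ref.get("location", {}).get("s3Location", {}).get("uri")
            if uri and uri not in uris:
                uris.append(uri)
    return uris

# Hand-built stand-in for a response; the URIs are hypothetical
sample_response = {
    "output": {"text": "..."},
    "citations": [
        {"retrievedReferences": [
            {"location": {"type": "S3", "s3Location": {"uri": "s3://recipe-kb/12.txt"}}},
            {"location": {"type": "S3", "s3Location": {"uri": "s3://recipe-kb/40.txt"}}},
        ]},
        {"retrievedReferences": [
            {"location": {"type": "S3", "s3Location": {"uri": "s3://recipe-kb/12.txt"}}},
        ]},
    ],
}
print(cited_uris(sample_response))  # ['s3://recipe-kb/12.txt', 's3://recipe-kb/40.txt']
```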

Clean up

If you plan to use the knowledge base you created to build a RAG application, make sure to comment out the following section. If you only wanted to try creating a knowledge base using the SDK, make sure to delete all the resources you created, because storing documents in an OpenSearch Serverless index incurs costs. See the following code:

# The objects used here (ds, kb, oss_client, aoss_client, iam_client, and the
# policy variables) are the ones created in the knowledge base notebook.
# delete_data_source and delete_knowledge_base belong to the 'bedrock-agent'
# client, not to the 'bedrock-agent-runtime' client used for retrieval.
bedrock_agent_client.delete_data_source(dataSourceId=ds["dataSourceId"], knowledgeBaseId=kb['knowledgeBaseId'])
bedrock_agent_client.delete_knowledge_base(knowledgeBaseId=kb['knowledgeBaseId'])
oss_client.indices.delete(index=index_name)
aoss_client.delete_collection(id=collection_id)
aoss_client.delete_access_policy(type="data", name=access_policy['accessPolicyDetail']['name'])
aoss_client.delete_security_policy(type="network", name=network_policy['securityPolicyDetail']['name'])
aoss_client.delete_security_policy(type="encryption", name=encryption_policy['securityPolicyDetail']['name'])
# Delete roles and policies
iam_client.delete_role(RoleName=bedrock_kb_execution_role)
iam_client.delete_policy(PolicyArn=policy_arn)

Conclusion

In this post, we showed how to split a large tabular dataset into rows, set up a knowledge base with metadata for each record, and use metadata filtering to retrieve results. We also showed that retrieving results with metadata filtering is more accurate than retrieving results without it. Finally, we showed how to pass the filtered results to an FM to get accurate answers.

To further explore the capabilities of Knowledge Bases for Amazon Bedrock, see the following resources:

About the Author

Tanay Choudhury is a Data Scientist in the Generative AI Innovation Center at Amazon Web Services, where he helps customers solve business problems using generative AI and machine learning.
