Build interactive PDF text extraction from Amazon S3

by root June 26, 2026

Picture this: a compliance officer needs a specific clause during an audit, an attorney needs contract terms while a client waits on the phone, or a finance analyst needs numbers from last quarter’s report before a meeting that starts in 10 minutes. In each case, waiting for a scheduled job to finish isn’t practical. You need on-demand access to the text inside your PDFs.

In this post, you’ll build a server that extracts text from PDF files in Amazon S3 in real time. This protocol-based approach provides programmatic document access. You’ll walk through the architecture, set up the server, and run interactive document queries. Along the way, you’ll compare this approach with Amazon Textract so you can decide which tool fits your workload.

We built this solution after working with several teams who shared the same frustration: their documents lived in Amazon S3, but getting text out of them on demand meant either writing custom scripts or waiting on batch pipelines. This MCP server approach sits in between, giving you interactive access with minimal setup. Interactive PDF text extraction from Amazon S3 gives you real-time answers from your documents without batch pipelines or heavy infrastructure.

This MCP-based option works well for text-based PDFs in development and proof of concept settings. For complex document processing like optical character recognition (OCR), form extraction, and layout analysis, Amazon Textract remains the recommended choice.

Who benefits from this approach

This solution fits several common roles. If these scenarios sound like your day-to-day, read on.

Compliance and legal teams: During a time-sensitive review, you need to locate a specific clause buried in a 200-page policy document or contract. Searching manually takes too long. With this solution, you ask a question in natural language and get the relevant passage back in seconds.

Financial services teams: During an audit session, you need immediate access to the exact wording of an internal risk policy or regulatory filing. This solution lets you pull that information directly from your Amazon S3 document repository without leaving your terminal.

Executive teams: During strategic planning meetings, you can query a PDF on the spot when someone asks about a data point from last quarter’s earnings report. No flipping through printed copies or waiting for someone to look it up after the meeting.

These scenarios share a few common characteristics: they involve real-time information needs where batch processing is too slow, text-based PDF documents with standard formatting, cost sensitivity in development and proof of concept environments, and integration requirements with existing AWS workflows and tooling.

Where this approach fits alongside Amazon Textract

Amazon Textract is a fully managed AWS AI service purpose-built for document processing at scale. It handles scanned pages, handwriting, and multi-column layouts. Choose Amazon Textract when you need OCR for scanned documents, advanced form and table extraction, complex layout analysis, production-scale batch processing with service level agreement (SLA) requirements, or compliance features and enterprise support.

The MCP-based approach addresses a complementary scenario: giving an AI assistant interactive, on-demand access to text already encoded inside PDFs. Choose this pattern when your documents are text-based PDFs (no OCR required), your workflow is interactive rather than batch, you are working in development or proof of concept environments, and you want minimal infrastructure between the AI assistant and the source document. For everything else, including any document processing that benefits from OCR or structured extraction, route the work to Amazon Textract.
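The selection criteria above can be read as a small decision rule. The following is a minimal sketch of that rule, not part of the solution itself; the `Workload` fields and the `choose_tool` function are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a document workload (illustrative only)."""
    needs_ocr: bool = False              # scanned pages, photos, handwriting
    needs_forms_or_tables: bool = False  # structured extraction
    needs_layout_analysis: bool = False  # multi-column or complex layouts
    production_batch: bool = False       # batch processing with SLA requirements
    interactive: bool = True             # on-demand questions from an AI assistant


def choose_tool(workload: Workload) -> str:
    """Return "textract" or "mcp-server" following the criteria in the text."""
    if (
        workload.needs_ocr
        or workload.needs_forms_or_tables
        or workload.needs_layout_analysis
        or workload.production_batch
    ):
        return "textract"
    if workload.interactive:
        return "mcp-server"
    # "For everything else", the text routes the work to Amazon Textract.
    return "textract"


if __name__ == "__main__":
    print(choose_tool(Workload()))                # text-based, interactive
    print(choose_tool(Workload(needs_ocr=True)))  # scanned documents
```

Treat the flags as a checklist rather than a scoring model: any single Textract criterion is enough to route the workload there.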

How the solution works

With this solution, you connect your AI assistant directly to your PDF documents in Amazon S3 and can get answers quickly. Under the hood, the solution uses the Model Context Protocol (MCP), an open standard that provides a structured way to access external data sources. MCP acts as a communication layer between your application and your data. The architecture has four components: a command-line interface as the user interface, the MCP layer for communication, a custom MCP server for PDF processing, and Amazon S3 for document storage, secured by AWS Identity and Access Management (AWS IAM).
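To make the communication layer concrete, here is a sketch of the kind of message an MCP client sends when it invokes a tool. MCP messages are JSON-RPC 2.0, and tool invocations use the `tools/call` method. The tool name `extract_s3_pdf_text` comes from the server built later in this post; the `build_tool_call` helper, the bucket, and the key are illustrative assumptions. In practice the MCP client builds this message for you.

```python
import json


def build_tool_call(request_id: int, bucket: str, key: str) -> str:
    """Serialize a JSON-RPC 2.0 "tools/call" request for the PDF extraction tool."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "extract_s3_pdf_text",
            "arguments": {"bucket": bucket, "key": key},
        },
    }
    return json.dumps(message)


if __name__ == "__main__":
    # Placeholder bucket and key, used only to show the message shape.
    print(build_tool_call(1, "your-bucket-name", "sample.pdf"))
```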

Cost comparison

Choose the approach that matches your budget and requirements. For about 10,000 text-based PDF pages per month in a proof of concept environment, here is how the two approaches compare:

Amazon Textract scope: page-level processing, OCR-ready, form and table extraction, layout understanding, enterprise SLAs

Indicative monthly cost: Amazon Textract processing approximately $15, Amazon S3 storage $2, AWS Lambda compute $1, and large language model (LLM) token processing approximately $5 to $10, for a total of approximately $23 to $28.

MCP server scope: direct text extraction from PDFs whose text is already encoded; no managed processing service involved

Indicative monthly cost: Amazon S3 storage $2 and data transfer $0.50, for a total of approximately $2.50.

These two figures are price points for different feature sets and shouldn’t be read as a head-to-head price comparison. Use them to pick the right tool for the workload, not to optimize purely on dollars. If your workload involves scanned documents, forms, tables, complex layouts, or production SLAs, Amazon Textract is the appropriate choice and the additional capabilities are reflected in its price.

All cost figures are illustrative and may change. Refer to the official AWS pricing pages for current rates.
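As a quick check on the arithmetic, the totals can be reproduced from the line items quoted in this section. The dictionary names are introduced here for illustration, and the dollar amounts are the illustrative figures from this post, not current AWS prices.

```python
# Illustrative monthly line items (USD) for about 10,000 pages, as quoted above.
TEXTRACT_LOW = {"textract": 15.00, "s3_storage": 2.00, "lambda": 1.00, "llm_tokens": 5.00}
TEXTRACT_HIGH = {**TEXTRACT_LOW, "llm_tokens": 10.00}
MCP_SERVER = {"s3_storage": 2.00, "data_transfer": 0.50}


def monthly_total(line_items: dict) -> float:
    """Sum the line items and round to cents."""
    return round(sum(line_items.values()), 2)


if __name__ == "__main__":
    print("Textract path:", monthly_total(TEXTRACT_LOW), "to", monthly_total(TEXTRACT_HIGH))
    print("MCP server path:", monthly_total(MCP_SERVER))
```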

Architecture overview

Component diagram showing the S3 PDF MCP Server architecture with Client Environment (User/Client, Kiro CLI, MCP Client) connecting to S3 PDF MCP Server containing StdioServer Transport, S3PdfMcpServer, Tool Handler with Extract s3_pdf_text function, AWS SDK S3 Client, and PDF Parser, all connecting to AWS S3 for PDF document storage.

The following sequence diagram illustrates the end-to-end workflow for extracting text from a PDF stored in Amazon S3. The process begins when the AI client initiates a request for PDF extraction through the CLI. The system forwards this request to the MCP server, which retrieves the PDF file from Amazon S3 using the provided bucket and object key.

After the MCP server fetches the PDF, it passes the file to a PDF parsing component. The component processes the document and extracts the text content. The MCP server then returns the extracted text to the client, and the client displays it to the user.

Sequence diagram showing the PDF text extraction flow: AI Client requests PDF extraction from Kiro CLI, which calls extract_s3_pdf_text on MCP Server, MCP Server retrieves PDF from Amazon S3 using GetObject, PDF Parser processes the content and returns extracted text back through the chain to display to the user
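The sequence can also be expressed as a small function with the two external steps injected. This is a sketch of the control flow only: `handle_extraction_request`, `fake_fetch`, and `fake_parse` are hypothetical names, and the stand-ins replace the real Amazon S3 download and PDF parser so the example runs without AWS access.

```python
from typing import Callable


def handle_extraction_request(
    bucket: str,
    key: str,
    fetch_pdf: Callable[[str, str], bytes],
    parse_pdf: Callable[[bytes], str],
) -> str:
    """Mirror the sequence: fetch the object, parse it, return the text."""
    pdf_bytes = fetch_pdf(bucket, key)  # stands in for the S3 GetObject step
    return parse_pdf(pdf_bytes)         # stands in for the PDF parser step


def fake_fetch(bucket: str, key: str) -> bytes:
    """Stand-in for the S3 download; returns predictable bytes."""
    return f"contents of s3://{bucket}/{key}".encode("utf-8")


def fake_parse(pdf_bytes: bytes) -> str:
    """Stand-in for the parser; upper-cases the decoded bytes."""
    return pdf_bytes.decode("utf-8").upper()


if __name__ == "__main__":
    print(handle_extraction_request("example-bucket", "reports/q3.pdf", fake_fetch, fake_parse))
```

Keeping the fetch and parse steps behind callables is one way to unit test the flow before wiring in boto3 and a PDF library.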

Step-by-step implementation

Follow these steps to set up and configure the PDF text extraction solution. Begin by confirming you have the required prerequisites in place.

Prerequisites

Before you begin, confirm that you have the following items ready. You’ll also need basic familiarity with Python programming and AWS services.

An AWS account with Amazon S3 read permissions.
Python 3.10 or later installed.
AWS Command Line Interface (AWS CLI) configured with valid credentials.
Kiro CLI installed.
```
pip install boto3 PyPDF2 mcp
```

Installation

This section guides you through installing the MCP server and its dependencies. The process involves creating a Python virtual environment, installing the required packages, and creating the server file. Follow these steps in order. Run each command in your terminal.

Before you start, you need:

Python 3.10 or newer installed on your machine.
The Kiro CLI installed and logged in.
AWS credentials set up on your machine (run aws configure if you haven’t).
An S3 bucket that contains at least one PDF file.

Step 1: Create a folder for the project

Run these two commands in your terminal:

Step 2: Navigate to the project folder

Run this command:

Step 3: Create a Python virtual environment

Run this command:

Step 4: Activate the virtual environment

Run this command:

After this, your terminal prompt will show (venv) at the start. Keep this terminal open. You need to stay in this virtual environment for the next steps.
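Steps 1 to 4 amount to creating a project folder and a virtual environment inside it. Below is a minimal Python sketch of the same setup. It assumes the folder is ~/s3-pdf-extractor (the path used later in this post) and that the environment is named venv (matching the (venv) prompt). On macOS or Linux, the shell equivalents would typically be mkdir, cd, python3 -m venv venv, and source venv/bin/activate. The `create_project` helper is a hypothetical name.

```python
import venv
from pathlib import Path


def create_project(project_dir: Path, with_pip: bool = True) -> Path:
    """Create the project folder and a virtual environment named "venv" inside it."""
    project_dir.mkdir(parents=True, exist_ok=True)
    env_dir = project_dir / "venv"
    # EnvBuilder is the standard-library API behind "python -m venv".
    venv.EnvBuilder(with_pip=with_pip).create(env_dir)
    return env_dir
```

For example, `create_project(Path.home() / "s3-pdf-extractor")` would create the folder used in this walkthrough. Activation still happens in your shell, because a Python script cannot change the environment of the terminal that launched it.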

Step 5: Install the required Python packages

Run this one command:

pip install mcp boto3 PyPDF2

Wait for it to finish. It should end with “Successfully installed…”.

Step 6: Create the server file

Inside the ~/s3-pdf-extractor folder, create a new file named exactly s3_pdf_extractor.py.

Paste the following code into that file and save it:

import asyncio
import logging
import os
import tempfile

import boto3
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent
from PyPDF2 import PdfReader

# Configure logging for production use (basicConfig writes to stderr,
# which keeps stdout free for MCP protocol traffic)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

server = Server("s3-pdf-extractor")

@server.list_tools()
async def list_tools():
    return [
        Tool(
            name="extract_s3_pdf_text",
            description="Extract text content from a PDF stored in Amazon S3",
            inputSchema={
                "type": "object",
                "properties": {
                    "bucket": {"type": "string", "description": "S3 bucket name"},
                    "key": {"type": "string", "description": "S3 object key"}
                },
                "required": ["bucket", "key"]
            }
        )
    ]

@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name != "extract_s3_pdf_text":
        raise ValueError(f"Unknown tool: {name}")

    bucket = arguments["bucket"]
    key = arguments["key"]
    tmp_path = None

    try:
        # Use existing AWS credentials and IAM permissions
        s3_client = boto3.client("s3")

        # Reserve a temporary file path, then download the object to it
        with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp_file:
            tmp_path = tmp_file.name
        s3_client.download_file(bucket, key, tmp_path)

        # Extract text using PyPDF2; fall back to an empty string
        # if a page yields no text
        reader = PdfReader(tmp_path)
        text = ""
        for page in reader.pages:
            text += (page.extract_text() or "") + "\n"

        logger.info(f"Successfully extracted text from {bucket}/{key}")
        return [TextContent(type="text", text=text)]

    except Exception as e:
        logger.error(f"Error processing {bucket}/{key}: {str(e)}")
        raise
    finally:
        # Ensure cleanup of temporary files, even if the download failed
        if tmp_path and os.path.exists(tmp_path):
            os.unlink(tmp_path)

async def main():
    # Serve over stdio, the transport the MCP client uses when it launches this script
    async with stdio_server() as (read_stream, write_stream):
        await server.run(read_stream, write_stream, server.create_initialization_options())

if __name__ == "__main__":
    asyncio.run(main())

Step 7: Test that the server starts

In your terminal (still inside the s3-pdf-extractor folder with the venv active), run:

python s3_pdf_extractor.py

The terminal will appear to “pause” with no output. That’s correct. It means the server is running and waiting for requests. Press Ctrl+C to stop it.

If you see an error instead, re-check Steps 2 and 3.

Step 8: Locate or create the Kiro CLI configuration file

Kiro CLI uses a JSON configuration file to know which MCP servers are available. You need to add your server to this file.

The Kiro CLI MCP configuration file is located at:

~/.kiro/settings/tools/mcp.json

If this file doesn’t exist, create it by running these commands in your terminal:

mkdir -p ~/.kiro/settings/tools
nano ~/.kiro/settings/tools/mcp.json

Step 9: Add the MCP server configuration

Paste the following JSON into the file. Replace /path/to/s3_pdf_extractor.py with the actual path from Step 1 (for example, ~/s3-pdf-extractor/s3_pdf_extractor.py):

{
    "mcpServers": {
        "s3-pdf-extractor": {
            "command": "python",
            "args": ["/path/to/s3_pdf_extractor.py"]
        }
    }
}

To get the full absolute path, run echo ~/s3-pdf-extractor/s3_pdf_extractor.py in your terminal and use that output in the args field.
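If you prefer to generate the configuration instead of editing it by hand, a sketch like the following builds the same JSON structure with an absolute script path. The `build_mcp_config` helper is a hypothetical name. Defaulting the command to `sys.executable` is an assumption made here: when run from the activated virtual environment, it points at the interpreter that has the packages from Step 5 installed, whereas a bare python may resolve to a different interpreter.

```python
import json
import sys
from pathlib import Path


def build_mcp_config(server_script: Path, python_command: str = sys.executable) -> dict:
    """Build the mcpServers entry with an absolute path to the server script."""
    return {
        "mcpServers": {
            "s3-pdf-extractor": {
                "command": python_command,
                # expanduser() resolves "~", which JSON consumers do not expand.
                "args": [str(server_script.expanduser().resolve())],
            }
        }
    }


if __name__ == "__main__":
    config = build_mcp_config(Path("~/s3-pdf-extractor/s3_pdf_extractor.py"))
    print(json.dumps(config, indent=4))
```

The sketch prints the JSON rather than writing it, so you can review the output before pasting it into the configuration file.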

Step 10: Save the configuration file

Press Ctrl+O, then press Enter to save the file.

Step 11: Close the file editor

Press Ctrl+X to exit nano.

Step 12: Restart Kiro CLI

Restart Kiro CLI to load the new configuration. Close and reopen Kiro CLI, or run:

Step 13: Verify the MCP server connection

Verify the connection by running a test extraction in Kiro CLI:

extract text from s3://your-bucket-name/sample.pdf
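The prompt above uses an s3:// URI, while the tool takes separate bucket and key arguments. The AI assistant normally performs that split when it calls the tool. If you script against the server yourself, a helper along these lines does the same thing; `split_s3_uri` is a hypothetical name, not part of the server code.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split "s3://bucket/path/to/file.pdf" into (bucket, key).

    Keys containing "?" or "#" are not handled by this simple sketch.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    key = parsed.path.lstrip("/")
    if not key:
        raise ValueError(f"S3 URI has no object key: {uri!r}")
    return parsed.netloc, key


if __name__ == "__main__":
    print(split_s3_uri("s3://your-bucket-name/sample.pdf"))
```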

Security considerations

Security is built in from the beginning, not added as an afterthought. Here is how the solution handles it:

IAM integration: The solution uses your existing AWS credentials. You don’t need to create or manage separate API keys.
Least privilege access: You grant only Amazon S3 read permissions, scoped to the specific buckets that contain your PDF documents. Nothing more.
Temporary storage: The server deletes downloaded files automatically after it completes processing. No PDF data lingers on the local file system.
No data persistence: Text extraction occurs on demand without storing results.
Audit trail: AWS CloudTrail logs Amazon S3 access requests to your account.
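As an illustration of the least privilege point, the following sketch builds a read-only IAM policy document scoped to one bucket. The `read_only_policy` helper, the statement ID, and the bucket name are assumptions for illustration. The policy grants only s3:GetObject, which is the permission the download in the server code relies on; review it against your own requirements before attaching it to a role or user.

```python
import json


def read_only_policy(bucket: str, prefix: str = "") -> dict:
    """Build an IAM policy allowing object reads under one bucket (and optional prefix)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadPdfObjects",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            }
        ],
    }


if __name__ == "__main__":
    # "your-bucket-name" is a placeholder, as elsewhere in this post.
    print(json.dumps(read_only_policy("your-bucket-name"), indent=2))
```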

Performance and limitations

Here is what to expect in terms of performance:

The server processes documents in real time. For a typical 50-page text-based PDF, results are generally available in a few seconds, making it practical for interactive workflows where you ask follow-up questions.
Processing time scales linearly with document size. A 10-page document processes roughly five times faster than a 50-page one.
Memory usage is proportional to document size. For most text-based PDFs under 100 pages, memory consumption stays well within typical development machine limits.

This approach has clear limits. Know them before you commit:

Text-based PDFs only. If your documents are scanned images or photos of paper, the server can’t read them. Amazon Textract handles these cases natively with OCR.
No OCR capability. The server reads embedded text from the PDF file format. It can’t interpret pixels in an image.
Limited layout understanding. The server performs simple text extraction. It doesn’t reconstruct tables, columns, or complex page layouts. Amazon Textract handles this natively.
No form processing. If your PDFs contain fillable form fields or structured data, the server doesn’t extract those elements. Amazon Textract handles this natively.
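One practical consequence of the first two limits is that a scanned PDF comes back with little or no text. A heuristic like the following can flag such documents after extraction. It is a sketch under stated assumptions: the `needs_ocr` name, the 20-character threshold, and the majority rule are illustrative choices, not values taken from the solution.

```python
def needs_ocr(page_texts: list[str], min_chars_per_page: int = 20) -> bool:
    """Guess whether a PDF is image-only from its per-page extracted text.

    Returns True when more than half of the pages yield almost no text,
    or when there are no pages at all.
    """
    if not page_texts:
        return True
    sparse_pages = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    return sparse_pages / len(page_texts) > 0.5


if __name__ == "__main__":
    print(needs_ocr(["", "  ", "\n"]))                 # likely scanned
    print(needs_ocr(["Clause 12. Indemnification of the parties ..." * 3]))
```

A result of True is a hint to send the document to Amazon Textract rather than proof that the file is scanned.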

Real-world use cases

These capabilities translate directly into measurable outcomes across industries. Whether it’s legal teams retrieving contract clauses mid-call, compliance officers locating policy language during audits, or executives pulling earnings data in real time, the solution removes the friction of manual document search. The following examples show how different teams put it to work.

Legal services firm

A mid-sized legal firm adopted this solution for contract review. Their attorneys used to spend 15 to 20 minutes searching through PDF contracts to find specific indemnification clauses during client calls. That meant putting the client on hold or promising to call back later. Now they type a question into Kiro CLI and get the relevant passage in seconds. The firm reports that research time during client calls was significantly reduced.

Financial services compliance

A regional bank deployed the solution for regulatory examinations. During audits, compliance officers need to locate specific policy language quickly. Previously, they bookmarked key sections manually across dozens of PDF files, which was error-prone and hard to maintain as policies changed. With the MCP server connected to their S3 document repository, they now pull up the exact paragraph an examiner asks about in real time.

Corporate strategy team

An enterprise leadership team uses the solution during quarterly strategy meetings. When a board member asks about a specific metric from the previous quarter’s earnings report, the team queries the PDF on the spot instead of flipping through printed copies. This keeps discussions moving and grounded in actual data.

Scaling and enhancement options

This solution is a starting point. As your needs grow, you can extend it. Start with caching if your team accesses the same documents repeatedly. Consider batch processing when you need to handle hundreds of documents at once. Add vector search when keyword matching is no longer sufficient.

Specifically, you can extend the solution in these ways:

Add caching with Amazon DynamoDB for frequently accessed documents.
Implement batch processing with Amazon Simple Queue Service (Amazon SQS) for bulk operations.
Integrate vector search with Amazon OpenSearch Service for semantic document discovery.
Create hybrid workflows that route complex documents to Amazon Textract automatically.
Add monitoring with Amazon CloudWatch to track usage patterns and error rates.
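To show the shape of the caching option, here is an in-memory sketch that keys extracted text by bucket, key, and object ETag. It is a stand-in for illustration only: a DynamoDB-backed version would replace the dictionary with table reads and writes. The `ExtractionCache` class is a hypothetical name, and passing in the ETag so that an updated object misses the cache is an assumption about how you would detect changed documents.

```python
from typing import Callable


class ExtractionCache:
    """Cache extracted text per (bucket, key, etag); in-memory stand-in for a table."""

    def __init__(self, extract: Callable[[str, str], str]):
        self._extract = extract
        self._store: dict[tuple[str, str, str], str] = {}
        self.misses = 0

    def get_text(self, bucket: str, key: str, etag: str) -> str:
        cache_key = (bucket, key, etag)
        if cache_key not in self._store:
            # Cache miss: run the real extraction once and remember the result.
            self.misses += 1
            self._store[cache_key] = self._extract(bucket, key)
        return self._store[cache_key]


if __name__ == "__main__":
    cache = ExtractionCache(lambda bucket, key: f"text of {key}")
    cache.get_text("example-bucket", "policy.pdf", "etag-1")
    cache.get_text("example-bucket", "policy.pdf", "etag-1")
    print("extractions performed:", cache.misses)
```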

Cleanup

When you’re done testing or want to remove the solution, follow these steps to avoid unnecessary costs.

Stop the MCP server: Press Ctrl+C in the terminal where the server is running.
Remove the MCP configuration: Open your Kiro CLI MCP configuration file (~/.kiro/settings/tools/mcp.json) and delete the s3-pdf-extractor entry. Save and close the file.
Delete the project files: Remove the project directory and all its contents:
```
rm -rf ~/s3-pdf-extractor
```
Warning: This command permanently deletes all files in the directory without confirmation. Make sure you have saved any changes before proceeding.
Clean up S3 resources (optional): If you created test PDFs in Amazon S3 specifically for this walkthrough, delete the test files or the test bucket using the Amazon S3 console or the AWS CLI:
```
aws s3 rm s3://your-bucket-name/test-file.pdf
```
Only delete resources you created for testing.
Review IAM permissions (optional): Navigate to the IAM console and remove any S3 read permissions added specifically for this solution. Keep permissions that other workflows depend on.
Verify cleanup: Confirm the directory no longer exists:
Expected output: No such file or directory

After cleanup, you’ll no longer incur S3 storage and data transfer costs for the resources you deleted. For detailed pricing information, see Amazon S3 Pricing. If you want to redeploy later, repeat the installation steps. All code and configuration examples remain in this document.

Conclusion

In this post, you built an MCP server that extracts text from PDF files in Amazon S3 in real time. You walked through the architecture, compared costs with Amazon Textract, and saw how three different teams put this approach to work. The pattern follows a clear approach: connect your AI assistant to your documents, keep the infrastructure minimal, and scale up only when the workload demands it.

In summary, the MCP server pattern is a focused, interactive complement to Amazon Textract. Use it when an AI assistant needs to read text-based PDFs in real time. When your needs include OCR, forms, tables, or production-scale processing, Amazon Textract is the AWS service designed for that work, and the two approaches fit cleanly together. This is exactly the pattern shown in the hybrid workflow option earlier in this post.

Next steps:

Evaluate your use case against the criteria in the “Where this approach fits alongside Amazon Textract” section.
Deploy the solution in your development environment by following the Installation section in this post. Test with 5 to 10 representative documents to establish baseline performance.
Explore Amazon Textract for OCR capabilities, or learn more about Kiro CLI integration as your requirements evolve.
If you try this solution or adapt it to your own use case, we’d love to hear about it in the comments.
