This publish is co-written by Fan Zhang, Sr Principal Engineer / Architect from Palo Alto Networks.
Palo Alto Networks’ System Safety staff wished to detect early warning indicators of potential manufacturing points to supply extra time to SMEs to react to those rising issues. The first problem they confronted was that reactively processing over 200 million each day service and software log entries resulted in delayed response occasions to those vital points, leaving them in danger for potential service degradation.
To deal with this problem, they partnered with the AWS Generative AI Innovation Middle (GenAIIC) to develop an automatic log classification pipeline powered by Amazon Bedrock. The answer achieved 95% precision in detecting manufacturing points whereas lowering incident response occasions by 83%.
On this publish, we discover the right way to construct a scalable and cost-effective log evaluation system utilizing Amazon Bedrock to remodel reactive log monitoring into proactive subject detection. We focus on how Amazon Bedrock, via Anthropic’ s Claude Haiku mannequin, and Amazon Titan Textual content Embeddings work collectively to robotically classify and analyze log information. We discover how this automated pipeline detects vital points, look at the answer structure, and share implementation insights which have delivered measurable operational enhancements.
Palo Alto Networks gives Cloud-Delivered Security Services (CDSS) to deal with machine safety dangers. Their resolution makes use of machine studying and automatic discovery to supply visibility into linked units, imposing Zero Belief ideas. Groups dealing with related log evaluation challenges can discover sensible insights on this implementation.
Answer overview
Palo Alto Networks’ automated log classification system helps their System Safety staff detect and reply to potential service failures forward of time. The answer processes over 200 million service and software logs each day, robotically figuring out vital points earlier than they escalate into service outages that affect prospects.
The system makes use of Amazon Bedrock with Anthropic’s Claude Haiku mannequin to know log patterns and classify severity ranges, and Amazon Titan Textual content Embeddings permits clever similarity matching. Amazon Aurora gives a caching layer that makes processing large log volumes possible in actual time. The answer integrates seamlessly with Palo Alto Networks’ present infrastructure, serving to the System Safety staff give attention to stopping outages as a substitute of managing complicated log evaluation processes.
Palo Alto Networks and the AWS GenAIIC collaborated to construct an answer with the next capabilities:
- Clever deduplication and caching – The system scales by intelligently figuring out duplicate log entries for a similar code occasion. Relatively than utilizing a big language mannequin (LLM) to categorise each log individually, the system first identifies duplicates via precise matching, then makes use of overlap similarity, and at last employs semantic similarity provided that no earlier match is discovered. This method cost-effectively reduces the 200 million each day logs by over 99%, to logs solely representing distinctive occasions. The caching layer permits real-time processing by lowering the necessity for redundant LLM invocations.
- Context retrieval for distinctive logs – For distinctive logs, Anthropic’s Claude Haiku mannequin utilizing Amazon Bedrock classifies every log’s severity. The mannequin processes the incoming log together with related labeled historic examples. The examples are dynamically retrieved at inference time via vector similarity search. Over time, labeled examples are added to supply wealthy context to the LLM for classification. This context-aware method improves accuracy for Palo Alto Networks’ inner logs and methods and evolving log patterns that conventional rule-based methods wrestle to deal with.
- Classification with Amazon Bedrock – The answer gives structured predictions, together with severity classification (Precedence 1 (P1), Precedence 2 (P2), Precedence 3 (P3)) and detailed reasoning for every choice. This complete output helps Palo Alto Networks’ SMEs shortly prioritize responses and take preventive motion earlier than potential outages happen.
- Integration with present pipelines for motion – Outcomes combine with their present FluentD and Kafka pipeline, with information flowing to Amazon Easy Storage Service (Amazon S3) and Amazon Redshift for additional evaluation and reporting.
The next diagram (Determine 1) illustrates how the three-stage pipeline processes Palo Alto Networks’ 200 million each day log quantity whereas balancing scale, accuracy, and cost-efficiency. The structure consists of the next key parts:
- Knowledge ingestion layer – FluentD and Kafka pipeline and incoming logs
- Processing pipeline – Consisting of the next levels:
- Stage 1: Good caching and deduplication – Aurora for precise matching and Amazon Titan Textual content Embeddings for semantic matching
- Stage 2: Context retrieval – Amazon Titan Textual content Embeddings to allow historic labeled examples, and vector similarity search
- Stage 3: Classification – Anthropic’s Claude Haiku mannequin for severity classification (P1/P2/P3)
- Output layer – Aurora, Amazon S3, Amazon Redshift, and SME assessment interface
The processing workflow strikes via the next levels:
- Stage 1: Good caching and deduplication – Incoming logs from Palo Alto Networks’ FluentD and Kafka pipeline are instantly processed via an Aurora primarily based caching layer. The system first applies precise matching, then falls again to overlap similarity, and at last makes use of semantic similarity via Amazon Titan Textual content Embeddings if no earlier match is discovered. Throughout testing, this method recognized that greater than 99% of logs corresponded to duplicate occasions, though they contained completely different time stamps, log ranges, and phrasing. The caching system decreased response occasions for cached outcomes and decreased pointless LLM processing.
- Stage 2: Context retrieval for distinctive logs – The remaining lower than 1% of really distinctive logs require classification. For these entries, the system makes use of Amazon Titan Textual content Embeddings to determine probably the most related historic examples from Palo Alto Networks’ labeled dataset. Relatively than utilizing static examples, this dynamic retrieval makes certain every log receives contextually applicable steerage for classification.
- Stage 3: Classification with Amazon Bedrock – Distinctive logs and their chosen examples are processed by Amazon Bedrock utilizing Anthropic’s Claude Haiku mannequin. The mannequin analyzes the log content material alongside related historic examples to supply severity classifications (P1, P2, P3) and detailed explanations. Outcomes are saved in Aurora and the cache and built-in into Palo Alto Networks’ present information pipeline for SME assessment and motion.
This structure permits cost-effective processing of large log volumes whereas sustaining 95% precision for vital P1 severity detection. The system makes use of fastidiously crafted prompts that mix area experience with dynamically chosen examples:
system_prompt = """
<Activity>
You might be an skilled log evaluation system answerable for classifying manufacturing system logs primarily based on severity. Your evaluation helps engineering groups prioritize their response to system points and preserve service reliability.
</Activity>
<Severity_Definitions>
P1 (Essential): Requires instant motion - system-wide outages, repeated software crashes
P2 (Excessive): Warrants consideration throughout enterprise hours - efficiency points, partial service disruption
P3 (Low): May be addressed when sources out there - minor bugs, authorization failures, intermittent community points
</Severity_Definitions>
<Examples>
<log_snippet>
2024-08-17 01:15:00.00 [warn] failed (104: Connection reset by peer) whereas studying response header from upstream
</log_snippet>
severity: P3
class: Class A
<log_snippet>
2024-08-18 17:40:00.00 <warn> Error: Request failed with standing code 500 at settle
</log_snippet>
severity: P2
class: Class B
</Examples>
<Target_Log>
Log: {incoming_log_snippet}
Location: {system_location}
</Target_Log>"""
Present severity classification (P1/P2/P3) and detailed reasoning.
Implementation insights
The core worth of Palo Alto Networks’ resolution lies in making an insurmountable problem manageable: AI helps their staff analyze 200 million of each day volumes effectively, whereas the system’s dynamic adaptability makes it doable to increase the answer into the long run by including extra labeled examples. Palo Alto Networks’ profitable implementation of their automated log classification system yielded key insights that may assist organizations constructing production-scale AI options:
- Steady studying methods ship compounding worth – Palo Alto Networks designed their system to enhance robotically as SMEs validate classifications and label new examples. Every validated classification turns into a part of the dynamic few-shot retrieval dataset, enhancing accuracy for related future logs whereas growing cache hit charges. This method creates a cycle the place operational use enhances system efficiency and reduces prices.
- Clever caching permits AI at manufacturing scale – The multi-layered caching structure processes greater than 99% of logs via cache hits, remodeling costly per-log LLM operations into a cheap system able to dealing with 200 million each day volumes. This basis makes AI processing economically viable at enterprise scale whereas sustaining response occasions.
- Adaptive methods deal with evolving necessities with out code modifications – The answer accommodates new log classes and patterns with out requiring system modifications. When efficiency wants enchancment for novel log varieties, SMEs can label extra examples, and the dynamic few-shot retrieval robotically incorporates this data into future classifications. This adaptability permits the system to scale with enterprise wants.
- Explainable classifications drive operational confidence – SMEs responding to vital alerts require confidence in AI suggestions, significantly for P1 severity classifications. By offering detailed reasoning alongside every classification, Palo Alto Networks permits SMEs to shortly validate selections and take applicable motion. Clear explanations remodel AI outputs from predictions into actionable intelligence.
These insights reveal how AI methods designed for steady studying and explainability change into more and more worthwhile operational belongings.
Conclusion
Palo Alto Networks’ automated log classification system demonstrates how generative AI powered by AWS helps operational groups handle huge volumes in actual time. On this publish, we explored how an structure combining Amazon Bedrock, Amazon Titan Textual content Embeddings, and Aurora processes 200 million of each day logs via clever caching and dynamic few-shot studying, enabling proactive detection of vital points with 95% precision. Palo Alto Networks’ automated log classification system delivered concrete operational enhancements:
- 95% precision, 90% recall for P1 severity logs – Essential alerts are correct and actionable, minimizing false alarms whereas catching 9 out of 10 pressing points, leaving the remaining alerts to be captured by present monitoring methods
- 83% discount in debugging time – SMEs spend much less time on routine log evaluation and extra time on strategic enhancements
- Over 99% cache hit price – The clever caching layer processes 20 million each day quantity cost-effectively via subsecond responses
- Proactive subject detection – The system identifies potential issues earlier than they affect prospects, stopping the multi-week outages that beforehand disrupted service
- Steady enchancment – Every SME validation robotically improves future classifications and will increase cache effectivity, leading to decreased prices
For organizations evaluating AI initiatives for log evaluation and operational monitoring, Palo Alto Networks’ implementation gives a blueprint for constructing production-scale methods that ship measurable enhancements in operational effectivity and price discount. To construct your personal generative AI options, discover Amazon Bedrock for managed entry to basis fashions. For added steerage, try the AWS Machine Studying sources and browse implementation examples within the AWS Synthetic Intelligence Weblog.
The collaboration between Palo Alto Networks and the AWS GenAIIC demonstrates how considerate AI implementation can remodel reactive operations into proactive, scalable methods that ship sustained enterprise worth.
To get began with Amazon Bedrock, see Construct generative AI options with Amazon Bedrock.
Concerning the authors





