Speaker diarization is an essential process in speech analysis that segments an audio file based on speaker identity. This post details how to integrate Hugging Face's PyAnnote with Amazon SageMaker asynchronous endpoints for speaker diarization.
We provide a comprehensive guide on how to deploy speaker segmentation and clustering solutions using SageMaker on the AWS Cloud. This solution can be used for applications that handle multi-speaker (more than 100) audio recordings.
Solution overview
Amazon Transcribe is the go-to service for speaker diarization on AWS. However, for unsupported languages, you can use other models (in this case, PyAnnote) that are deployed to SageMaker for inference. For short audio files where inference takes up to 60 seconds, you can use real-time inference. For anything longer than 60 seconds, you should use asynchronous inference. An additional benefit of asynchronous inference is cost savings, because it can automatically scale the instance count to zero when there are no requests to process.
Hugging Face is a popular open source hub for machine learning (ML) models. AWS and Hugging Face have a partnership that enables seamless integration through SageMaker, with a set of AWS Deep Learning Containers (DLCs) for training and inference in PyTorch or TensorFlow, and Hugging Face estimators and predictors in the SageMaker Python SDK. These SageMaker features help developers and data scientists get started with natural language processing (NLP) on AWS more easily.
This solution uses Hugging Face's pre-trained speaker diarization model from the PyAnnote library. PyAnnote is an open source toolkit written in Python for speaker diarization. The model, trained on a sample audio dataset, enables effective speaker partitioning within audio files. The model is deployed to SageMaker as an asynchronous endpoint, providing efficient and scalable processing of diarization tasks.
The following diagram shows the solution architecture.
This post uses the following audio files:
Stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels. Audio files sampled at a different rate are automatically resampled to 16 kHz upon loading.
Prerequisites
Complete the following prerequisites:
- Create a SageMaker domain.
- Make sure your AWS Identity and Access Management (IAM) user has the permissions required to create the SageMaker role.
- Make sure your AWS account has a service quota for hosting SageMaker endpoints on ml.g5.2xlarge instances.
Create a model function to access PyAnnote speaker diarization from Hugging Face
Use the Hugging Face Hub to access the pre-trained PyAnnote speaker diarization model you need. You use the same script to download the model file when creating the SageMaker endpoint.
See the following code.
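The following is a minimal sketch of such a model-loading function, assuming the pyannote.audio Pipeline API; the model ID and token placeholder are illustrative, not values from this post:

```python
# Minimal sketch of a model-loading function (illustrative model ID and token).
from pyannote.audio import Pipeline

def model_fn(model_dir):
    # Load the pre-trained speaker diarization pipeline from the Hugging Face Hub
    model = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token="<your-hugging-face-access-token>",
    )
    return model
```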
Package the model code
Prepare the necessary files, such as inference.py, which contains your inference code.
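The exact handlers depend on your payload format; the following sketch shows what inference.py could contain, assuming the SageMaker inference toolkit's model_fn/input_fn/predict_fn/output_fn convention and a raw-audio request body:

```python
# Illustrative inference.py handlers (assumed payload format); pairs with the
# model_fn shown earlier. SageMaker calls these functions for each request.
import io
import json

def input_fn(request_body, request_content_type):
    # Assume the request body is raw audio bytes (for example, a WAV file)
    if request_content_type in ("audio/wav", "audio/x-audio"):
        return io.BytesIO(request_body)
    raise ValueError(f"Unsupported content type: {request_content_type}")

def predict_fn(audio_stream, model):
    # Run the PyAnnote diarization pipeline on the in-memory audio
    diarization = model(audio_stream)
    return [
        {"start": round(turn.start, 3), "end": round(turn.end, 3), "speaker": speaker}
        for turn, _, speaker in diarization.itertracks(yield_label=True)
    ]

def output_fn(prediction, response_content_type):
    # Serialize the list of segments as JSON
    return json.dumps(prediction)
```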
Prepare a requirements.txt file that contains the Python libraries needed to perform inference.
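An illustrative requirements.txt might list packages along these lines (the exact packages and versions depend on your tested environment):

```
# Illustrative only; pin the versions you have tested
pyannote.audio
torch
torchaudio
```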
Finally, compress the inference.py and requirements.txt files and save them as model.tar.gz:
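One way to package the archive, sketched in Python (placing the files under a code/ prefix is an assumption based on the Hugging Face DLC convention for custom inference code):

```python
# Sketch: package the inference code into model.tar.gz.
# The code/ prefix is an assumed layout for custom inference code in the archive.
import tarfile

with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("inference.py", arcname="code/inference.py")
    tar.add("requirements.txt", arcname="code/requirements.txt")
```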
Configure the SageMaker model
Define a SageMaker model resource by specifying the image URI, the location of the model data in Amazon Simple Storage Service (Amazon S3), and the SageMaker role.
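A sketch of this step with the SageMaker Python SDK follows; the image URI, S3 path, and role lookup are placeholders:

```python
# Sketch: define the SageMaker model resource (placeholder values).
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

pyannote_model = HuggingFaceModel(
    model_data="s3://<your-bucket>/pyannote/model.tar.gz",       # packaged model archive
    image_uri="<hugging-face-pytorch-inference-dlc-image-uri>",  # DLC image for your Region
    role=role,
)
```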
Upload the model to Amazon S3
Upload the compressed PyAnnote Hugging Face model file to your S3 bucket.
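For example, with the SageMaker SDK's S3Uploader (the bucket name and prefix are placeholders):

```python
# Sketch: upload the packaged model archive to S3 (placeholder bucket/prefix).
from sagemaker.s3 import S3Uploader

model_s3_uri = S3Uploader.upload(
    local_path="model.tar.gz",
    desired_s3_uri="s3://<your-bucket>/pyannote",
)
print(model_s3_uri)  # s3://<your-bucket>/pyannote/model.tar.gz
```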
Create a SageMaker asynchronous endpoint
Configure an asynchronous endpoint to deploy the model to SageMaker using the provided asynchronous inference configuration.
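A sketch of the deployment follows; the output path, instance type, and endpoint name are placeholders, and pyannote_model is the model resource defined earlier:

```python
# Sketch: deploy the model behind an asynchronous endpoint (placeholder values).
from sagemaker.async_inference import AsyncInferenceConfig

async_config = AsyncInferenceConfig(
    output_path="s3://<your-bucket>/async_inference/output/",  # where results are written
    max_concurrent_invocations_per_instance=2,
)

predictor = pyannote_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name="pyannote-diarization-async",
    async_inference_config=async_config,
)
```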
Test the endpoint
Evaluate the endpoint functionality by sending an audio file for diarization and retrieving the JSON output stored in the specified S3 output path.
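For example, with the SageMaker Runtime API (the endpoint name and input location are placeholders); the response contains the S3 location where the JSON result will be written:

```python
# Sketch: invoke the asynchronous endpoint with an audio file staged in S3
# (placeholder values), then read the result from the returned output location.
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint_async(
    EndpointName="pyannote-diarization-async",
    InputLocation="s3://<your-bucket>/async_inference/input/sample.wav",
    ContentType="audio/wav",
)
print(response["OutputLocation"])  # poll this S3 key for the JSON result
```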
To deploy this solution at scale, we recommend using AWS Lambda, Amazon Simple Notification Service (Amazon SNS), or Amazon Simple Queue Service (Amazon SQS). These services are designed for scalability, event-driven architectures, and efficient resource utilization. They help decouple the asynchronous inference process from result processing, allowing each component to scale independently and handle bursts of inference requests more effectively.
Results
The model output is saved at s3://sagemaker-xxxx/async_inference/output/.
The output shows the audio recording divided into three columns:
- Start (start time in seconds)
- End (end time in seconds)
- Speaker (speaker label)
The following code shows an example result.
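The actual timings and labels depend on the recording; an output following the structure described above might look like this (values are illustrative only):

```python
# Illustrative shape only; actual timings and speaker labels come from your recording
example_output = [
    {"start": 0.53, "end": 4.82, "speaker": "SPEAKER_00"},
    {"start": 5.01, "end": 9.37, "speaker": "SPEAKER_01"},
]
```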
Clean up
You can set the scaling policy to zero by setting MinCapacity to 0. Asynchronous inference lets you automatically scale in to zero when there are no requests, so there is no need to delete the endpoint. It scales out again when you need it, reducing costs when it's not in use. See the following code.
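A sketch of such a policy using Application Auto Scaling follows; the endpoint name, variant name, and target value are placeholders:

```python
# Sketch: allow the asynchronous endpoint to scale in to zero instances
# when there is no request backlog (placeholder endpoint/variant names).
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/pyannote-diarization-async/variant/AllTraffic"

# Register the endpoint variant as a scalable target with MinCapacity=0
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=1,
)

# Scale based on the approximate backlog of queued requests per instance
autoscaling.put_scaling_policy(
    PolicyName="pyannote-backlog-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": "pyannote-diarization-async"}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 300,
    },
)
```

With this configuration, the endpoint scales in to zero instances when the request backlog is empty and scales out again as new requests arrive.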