[Python] Using pyannote + whisper

2024. 6. 4. 16:46 · Python

[Why I looked into this]

 

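1. I was already running STT (Speech To Text) with whisper on mp3 files uploaded from the web.

2. I wondered whether it would be nice to identify the speakers as well, so I looked into it.

3. The models are not that large and the code is fairly short, so I decided to combine the two.

4. Luckily, a library for this already existed.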

 

 


[Installing the libraries and setup]

 

Step 1. pip install pyannote.audio

 

 

Step 2. Open the two links below and accept the access conditions on each

https://huggingface.co/pyannote/segmentation-3.0

 


 

https://huggingface.co/pyannote/speaker-diarization-3.1

 


 

 

 

Step 3. Create a personal access token (New Token -> give it any name you like)

hf.co/settings/tokens
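
Rather than pasting the token straight into the source file, it can be read from an environment variable. This is only a minimal sketch: the variable name `HF_TOKEN` and the helper `load_hf_token` are assumptions made for illustration, not part of pyannote or Hugging Face.

```python
import os


def load_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token stored in an environment variable.

    `env_var` is an assumed name; use whatever name the token was exported under.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face token"
        )
    return token
```

The returned string can then be passed as the `use_auth_token` argument of `Pipeline.from_pretrained(...)` in place of a hard-coded key.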

 

Hugging Face – The AI community building the future.

 

huggingface.co

 

Step 4. Clone the repository with git

git clone https://github.com/yinruiqing/pyannote-whisper.git 
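
Once the packages are in place, it may be worth checking that the modules the script imports are actually visible to the interpreter. The sketch below assumes the cloned `pyannote-whisper` folder has either been installed or is on the import path (the steps here do not say which); `missing_modules` is a hypothetical helper name.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of `names` that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package of a dotted name is itself missing.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Module names taken from the imports used in the code section.
    print(missing_modules(["whisper", "pyannote.audio", "pyannote_whisper"]))
```

An empty list means all three imports should resolve; any name that is printed still needs to be installed or added to the path.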

 

 


 

 

 


 

 

[Code]

import whisper
from pyannote.audio import Pipeline
from pyannote_whisper.utils import diarize_text
from datetime import datetime

def log_time(label):
    return f'{label}: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}'

# pyannote: load the speaker diarization pipeline (replace the placeholder with your Hugging Face token)
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token="huggingface_primary_key")

# Load the whisper model
model = whisper.load_model("small")
model_load_time = log_time('MODEL_LOAD_TIME')

# Run whisper (speech to text)
model_start_time = log_time('MODEL_START')
asr_result = model.transcribe("ko_s2.wav")
model_end_time = log_time('MODEL_END & PIPELINE_START')

# Run pyannote (speaker diarization)
diarization_result = pipeline("ko_s2.wav")
pipeline_end_time = log_time('PIPELINE_END')


# Merge the two results
final_result = diarize_text(asr_result, diarization_result)
merge_time = log_time('MERGE')


# Save to a txt file
with open("Result_test.txt", "w", encoding="utf-8") as f:
    # Write the timing information
    f.write(f'[INFO]\n{model_load_time}\n{model_start_time}\n{model_end_time}\n{pipeline_end_time}\n{merge_time}\n')
    # STT + speaker diarization result
    for seg, spk, sent in final_result:
        f.write(f'[{seg.start:.2f} --> {seg.end:.2f}] {spk} : {sent}\n')
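
To make the merge step less of a black box, here is a small self-contained sketch of one common way to combine the two outputs: give each transcribed segment the speaker whose turns overlap it the most. This is an illustration under that assumption, not a copy of how `diarize_text` is implemented, and the function names and sample data are made up.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(asr_segments, speaker_turns):
    """Label each ASR segment with the speaker whose turns overlap it the most.

    asr_segments:  list of (start, end, text)
    speaker_turns: list of (start, end, speaker)
    Returns a list of (start, end, speaker, text); speaker is None when nothing overlaps.
    """
    labeled = []
    for seg_start, seg_end, text in asr_segments:
        totals = {}
        for turn_start, turn_end, speaker in speaker_turns:
            shared = overlap(seg_start, seg_end, turn_start, turn_end)
            if shared > 0:
                totals[speaker] = totals.get(speaker, 0.0) + shared
        best = max(totals, key=totals.get) if totals else None
        labeled.append((seg_start, seg_end, best, text))
    return labeled


# Made-up data, only to show the shape of the inputs and outputs.
asr = [(0.0, 2.0, "hello"), (2.0, 5.0, "hi there"), (9.0, 10.0, "bye")]
turns = [(0.0, 2.5, "SPEAKER_00"), (2.5, 6.0, "SPEAKER_01")]
for start, end, speaker, text in assign_speakers(asr, turns):
    print(f"[{start:.2f} --> {end:.2f}] {speaker} : {text}")
```

A segment that falls in a stretch where no speaker was detected ends up with `None`, which is worth handling explicitly before writing the result to a file.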

 

 

 

Result

 

 

[Things to think about]

 

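With whisper alone, processing took about half the playback length of the mp3 file.

Adding speaker diarization (pyannote) increases the processing time by roughly that much again.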
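
To put numbers on this instead of comparing timestamp strings by eye, each step can be timed and divided by the audio length. A minimal sketch; `timed` and `real_time_factor` are hypothetical helper names, and the audio length has to be supplied by the caller.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def real_time_factor(elapsed_seconds, audio_seconds):
    """Processing time divided by audio length; 0.5 means half the audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return elapsed_seconds / audio_seconds
```

For example, `asr_result, asr_seconds = timed(model.transcribe, "ko_s2.wav")` would capture the whisper step, and the same wrapper can be put around the `pipeline(...)` call to see how much diarization adds.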

 

The time could be cut down, or the work could be handled with threads or a queue, but if the number of users grows the AWS server probably won't be able to keep up...
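
As a rough idea of what the queue approach could look like, the sketch below pushes uploaded file names onto a bounded queue and lets a single background thread process them one at a time. `fake_process` is a stand-in for the whisper + pyannote steps and every name here is hypothetical; it limits how much runs at once, but it does not make the server any faster.

```python
import queue
import threading


def start_worker(process, jobs, results):
    """Start one background thread that runs `process` on each queued job in order."""
    def run():
        while True:
            job = jobs.get()
            if job is None:           # sentinel: stop the worker
                jobs.task_done()
                break
            try:
                results[job] = process(job)
            except Exception as exc:  # keep the worker alive if one file fails
                results[job] = exc
            finally:
                jobs.task_done()

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


def fake_process(path):
    # Stand-in for the whisper + pyannote steps; replace with the real calls.
    return f"processed {path}"


jobs = queue.Queue(maxsize=10)  # bounded, so a burst of uploads cannot pile up without limit
results = {}
worker = start_worker(fake_process, jobs, results)
for path in ["a.mp3", "b.mp3"]:
    jobs.put(path)
jobs.put(None)
worker.join()
print(results)
```

With a bounded queue, `jobs.put` blocks once ten files are waiting, so the web handler would need a timeout or a "try again later" response when the server is saturated.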
