Audio transcription + diarization pipeline.
- Whisper Large v3 Turbo (CTranslate 2 version, `faster-whisper==1.1.1`)
- Pyannote audio 3.3.1
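
Under the hood these two models run back to back: Whisper produces word-timestamped text and pyannote assigns speaker turns. A minimal sketch of that combination, assuming a CUDA GPU and a valid HuggingFace token (illustrative only, not the repo's exact `predict.py`):

```python
# Illustrative sketch, not the repo's predict.py: transcribe, then diarize.
from faster_whisper import WhisperModel
from pyannote.audio import Pipeline

whisper = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")
segments, info = whisper.transcribe("input.wav", word_timestamps=True)
transcript = list(segments)  # transcribe() returns a lazy generator

diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",  # your HuggingFace token (see deploy steps below)
)
diarization = diarizer("input.wav")

# Each diarization turn has a speaker label and a time span that can be
# matched against Whisper's word timestamps to label the transcript.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.2f}s - {turn.end:.2f}s")
```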
- Used at Audiogest and Spectropic
- Or try at Replicate
- Or deploy yourself on Replicate or any machine with a GPU
- Make sure you have cog installed
- Accept pyannote/segmentation-3.0 user conditions
- Accept pyannote/speaker-diarization-3.1 user conditions
- Create a HuggingFace token at hf.co/settings/tokens.
- Insert your own HuggingFace token in `predict.py` in the `setup` function (be careful not to commit this token! See the sketch after this list for one way to keep it out of the repo.)
- Run `cog build`
- Run `cog predict -i file=@input.wav`
- Or push to Replicate with `cog push r8.im/<username>/<name>`
- Please follow instructions on cog.run if you run into issues
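
To keep the token out of version control, one option is to read it from the environment in the `setup` function instead of hardcoding it. A sketch, using an environment variable name of our choosing (the repo's actual code may differ):

```python
import os

from pyannote.audio import Pipeline


def load_diarization_pipeline() -> Pipeline:
    # HUGGINGFACE_TOKEN is an illustrative name, not one the repo defines;
    # export it before running `cog build` / `cog predict`.
    token = os.environ["HUGGINGFACE_TOKEN"]
    return Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=token,
    )
```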
Inputs:

- `file_string: str`: Either provide a Base64 encoded audio file.
- `file_url: str`: Or provide a direct audio file URL.
- `file: Path`: Or provide an audio file.
- `num_speakers: int`: Number of speakers. Leave empty to autodetect. Must be between 1 and 50.
- `translate: bool`: Translate the speech into English.
- `language: str`: Language of the spoken words as a language code like 'en'. Leave empty to auto-detect the language.
- `prompt: str`: Vocabulary: provide names, acronyms, and loanwords in a list. Use punctuation for best accuracy. Also used as the 'hotwords' parameter during transcription.
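
As an example of how these inputs fit together, here is a hypothetical call through the Replicate Python client once the model is pushed; the model ref is a placeholder for your own `<username>/<name>` (ideally pinned to a version):

```python
import base64

import replicate

# Either pass `file_url` directly, or Base64-encode a local file for `file_string`.
with open("input.wav", "rb") as f:
    file_string = base64.b64encode(f.read()).decode("utf-8")

output = replicate.run(
    "<username>/<name>",  # placeholder from the `cog push` step
    input={
        "file_string": file_string,
        "num_speakers": 2,                     # omit to autodetect (1-50)
        "language": "en",                      # omit to auto-detect
        "prompt": "Acme Corp, Jane Doe, RAG",  # vocabulary/hotwords hint
    },
)
```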
Output:

- `segments: List[Dict]`: List of segments with speaker, start and end time.
  - Includes `avg_logprob` for each segment and `probability` for each word-level segment.
- `num_speakers: int`: Number of speakers (detected, unless specified in input).
- `language: str`: Language of the spoken words as a language code like 'en' (detected, unless specified in input).
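
A short sketch of consuming that output to print a speaker-labelled transcript. It assumes each segment also carries a `text` field and a `words` list whose entries have `word` and `probability` keys; verify against your deployment's actual output before relying on these names:

```python
def print_transcript(output: dict) -> None:
    for seg in output["segments"]:
        print(f"{seg['speaker']} [{seg['start']} - {seg['end']}]: {seg['text']}")
        # Flag words the model was unsure about (0.5 is an arbitrary threshold).
        shaky = [w["word"] for w in seg.get("words", []) if w.get("probability", 1.0) < 0.5]
        if shaky:
            print("  low-confidence words:", ", ".join(shaky))
```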
Thanks to:

- pyannote - Speaker diarization model
- whisper - Speech recognition model
- faster-whisper - Reimplementation of Whisper model for faster inference
- cog - ML containerization framework