Whisper is a general-purpose speech recognition model. Trained on a large dataset of diverse audio, it is a multitask model that can perform multilingual speech recognition, speech translation, and language identification.
Automatic speech recognition responses return a single text property, a string containing the audio transcription, and, if the model supports it, an optional array of words with start and end timestamps.
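
As a minimal sketch of how a client might consume such a response, the Python snippet below reads the transcription and, when present, the timestamped words. The key names text, words, word, start, and end, the unit of seconds, and the sample payload are assumptions made for illustration; check them against the actual response of the model you call.

```python
from __future__ import annotations

from typing import Any


def summarize_transcription(response: dict[str, Any]) -> list[str]:
    """Return the transcript followed by one line per timestamped word, if any."""
    # Assumed key: "text" holds the full transcription as a single string.
    lines = [response["text"]]
    # Assumed key: "words" is optional, so fall back to an empty list when absent.
    for entry in response.get("words") or []:
        # Assumed per-word keys: "word", "start", "end" (timestamps in seconds).
        lines.append(f'{entry["start"]:.2f}-{entry["end"]:.2f}s {entry["word"]}')
    return lines


# Hypothetical response, made up for illustration only.
sample = {
    "text": "Hello world",
    "words": [
        {"word": "Hello", "start": 0.0, "end": 0.4},
        {"word": "world", "start": 0.5, "end": 0.9},
    ],
}

for line in summarize_transcription(sample):
    print(line)
```

Treating the words array as optional keeps the same code path usable with models that return only the transcription text.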