YAMNet

pip install vectorhub[encoders-audio-tfhub]

Details

Release date: 2020-03-11

Vector length: 1024 (default)

Repo: https://tfhub.dev/google/yamnet/1

Paper:

Example

#pip install vectorhub[encoders-audio-tfhub]
from vectorhub.encoders.audio.tfhub import Yamnet2Vec
model = Yamnet2Vec()
# Download and decode the audio file into a waveform
sample = model.read('https://vecsearch-bucket.s3.us-east-2.amazonaws.com/voices/common_voice_en_2.wav')
# Encode the waveform into a single embedding vector
model.encode(sample)
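
The encoder should return a vector with the default length listed under Details; a quick sanity check, assuming encode returns a flat Python list:

vector = model.encode(sample)
print(len(vector))  # expected: 1024, the default vector length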

Index and search vectors

Index and search your vectors easily on the cloud using 1 line of code!

username = '<your username>'
email = '<your email>'
# Request an api_key by entering your username and email.
api_key = model.request_api_key(username, email)

# Index in 1 line of code
items = [
    'https://vecsearch-bucket.s3.us-east-2.amazonaws.com/voices/common_voice_en_69.wav',
    'https://vecsearch-bucket.s3.us-east-2.amazonaws.com/voices/common_voice_en_99.wav',
    'https://vecsearch-bucket.s3.us-east-2.amazonaws.com/voices/common_voice_en_10.wav',
    'https://vecsearch-bucket.s3.us-east-2.amazonaws.com/voices/common_voice_en_5.wav'
]
model.add_documents(username, api_key, items)

# Search in 1 line of code and get the most similar results.
model.search('https://vecsearch-bucket.s3.us-east-2.amazonaws.com/voices/common_voice_en_69.wav')

# Add metadata to your search
metadata = None  # replace with per-item metadata; see the sketch below
model.add_documents(username, api_key, items, metadata=metadata)
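
A minimal sketch of a populated call, assuming metadata is a list of dicts aligned one-to-one with items (the field names here are hypothetical, not confirmed by the source):

# Hypothetical per-item metadata, aligned by index with items (assumed shape)
metadata = [
    {'filename': 'common_voice_en_69.wav'},
    {'filename': 'common_voice_en_99.wav'},
    {'filename': 'common_voice_en_10.wav'},
    {'filename': 'common_voice_en_5.wav'}
]
model.add_documents(username, api_key, items, metadata=metadata)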

Description

YAMNet is an audio event classifier that takes an audio waveform as input and makes independent predictions for each of 521 audio events from the AudioSet ontology. The model uses the MobileNet v1 architecture and was trained on the AudioSet corpus. It was originally released in the TensorFlow Model Garden, which hosts the model source code, the original model checkpoint, and more detailed documentation. This model can be used:

  • as a stand-alone audio event classifier that provides a reasonable baseline across a wide variety of audio events.
  • as a high-level feature extractor: the 1024-D embedding output of YAMNet can be used as the input features of another shallow model, which can then be trained on a small amount of data for a particular task. This makes it possible to quickly create specialized audio classifiers without requiring a lot of labeled data and without having to train a large model end-to-end (see the sketch after this list).
  • as a warm start: the YAMNet model parameters can be used to initialize part of a larger model which allows faster fine-tuning and model exploration.
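
A minimal sketch of the feature-extractor use case, assuming encode returns a flat 1024-d list; the file names, labels, and choice of logistic regression are illustrative assumptions, not part of the source:

from sklearn.linear_model import LogisticRegression
from vectorhub.encoders.audio.tfhub import Yamnet2Vec

model = Yamnet2Vec()

# Hypothetical labeled clips: the paths and labels are placeholders
urls = ['speech_1.wav', 'speech_2.wav', 'music_1.wav', 'music_2.wav']
labels = ['speech', 'speech', 'music', 'music']

# Encode each clip into its 1024-d YAMNet embedding
embeddings = [model.encode(model.read(url)) for url in urls]

# Train a shallow classifier on top of the frozen embeddings
clf = LogisticRegression(max_iter=1000).fit(embeddings, labels)
print(clf.predict([model.encode(model.read('new_clip.wav'))]))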

Working in Colab

If you are running this in Colab and want to cache the model so you don't have to re-download it every session, set:

import os 
os.environ['TFHUB_CACHE_DIR'] = "drive/MyDrive/"
os.environ["TFHUB_MODEL_LOAD_FORMAT"] = "COMPRESSED"

Limitations

YAMNet's classifier outputs have not been calibrated across classes, so you cannot directly treat the outputs as probabilities. For any given task, you will very likely need to perform calibration with task-specific data, which lets you assign proper per-class score thresholds and scaling. YAMNet has been trained on millions of YouTube videos, and although these are very diverse, there can still be a domain mismatch between the average YouTube video and the audio inputs expected for any given task. You should expect to do some amount of fine-tuning and calibration to make YAMNet usable in any system you build.
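
As a hedged sketch of what per-class threshold calibration could look like (the validation scores, labels, and the F1-based selection below are illustrative assumptions, not a prescribed procedure):

import numpy as np

# Hypothetical validation data: raw per-clip scores (n_clips x 521 classes)
# and binary ground-truth labels for each class (placeholders)
scores = np.random.rand(100, 521)
truth = np.random.rand(100, 521) > 0.95

def best_threshold(y_true, y_score):
    """Pick the score threshold that maximizes F1 for one class."""
    best_t, best_f1 = 0.5, 0.0
    for t in np.linspace(0.05, 0.95, 19):
        pred = y_score >= t
        tp = np.sum(pred & y_true)
        if tp == 0:
            continue  # also skips classes with no positive labels
        precision = tp / pred.sum()
        recall = tp / y_true.sum()
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

# One calibrated threshold per AudioSet class
thresholds = np.array([best_threshold(truth[:, c], scores[:, c])
                       for c in range(scores.shape[1])])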