LaBSE - Language-agnostic BERT Sentence Embedding

pip install vectorhub[encoders-text-tfhub]

Details

Release date: 2020-07-03

Vector length: 768 (default)

Repo: https://tfhub.dev/google/LaBSE/1

Paper: https://arxiv.org/pdf/2007.01852v1.pdf

Example

# pip install vectorhub[encoders-text-tfhub]
# FOR WINDOWS: pip install vectorhub[encoders-text-tfhub-windows]
from vectorhub.encoders.text.tfhub import LaBSE2Vec
model = LaBSE2Vec()
model.encode("I enjoy taking long walks along the beach with my dog.")
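The returned embedding should have 768 dimensions, matching the vector length listed under Details. A quick sanity check, reusing the model instance from the example above and assuming encode returns a single flat vector:

# Confirm the embedding dimensionality matches the documented vector length (768).
vector = model.encode("I enjoy taking long walks along the beach with my dog.")
print(len(vector))  # expected: 768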

Index and search vectors

Index and search your vectors easily on the cloud using 1 line of code!

username = '<your username>'
email = '<your email>'
# Request an api_key with request_api_key by passing in your username and email.
api_key = model.request_api_key(username, email)

# Index in 1 line of code
items = ['chicken', 'toilet', 'paper', 'enjoy walking']
model.add_documents(username, api_key, items)

# Search in 1 line of code and get the most similar results.
model.search('basin')

# Add metadata to your search
metadata = [
    {'num_of_letters': 7, 'type': 'animal'},
    {'num_of_letters': 6, 'type': 'household_items'},
    {'num_of_letters': 5, 'type': 'household_items'},
    {'num_of_letters': 12, 'type': 'emotion'}
]
model.add_documents(username, api_key, items, metadata=metadata)

Description

The language-agnostic BERT sentence embedding model (LaBSE) encodes text into high-dimensional vectors. The model is trained and optimized to produce similar representations exclusively for bilingual sentence pairs that are translations of each other, so it can be used for mining translations of a sentence in a larger corpus. In "Language-agnostic BERT Sentence Embedding", the authors present a multilingual BERT embedding model, called LaBSE, that produces language-agnostic cross-lingual sentence embeddings for 109 languages. The model is trained on 17 billion monolingual sentences and 6 billion bilingual sentence pairs using MLM and TLM pre-training, resulting in a model that is effective even on low-resource languages for which no data was available during training. Further, the model establishes a new state of the art on multiple parallel text (a.k.a. bitext) retrieval tasks. The pre-trained model has been released to the community through TF Hub, and includes modules that can be used as-is or fine-tuned on domain-specific data.
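To make the cross-lingual property concrete, the sketch below encodes an English sentence and its German translation with the wrapper from the example above and compares them with cosine similarity; the German sentence and the use of numpy are illustrative additions, not part of the original documentation.

import numpy as np
from vectorhub.encoders.text.tfhub import LaBSE2Vec

model = LaBSE2Vec()

# Translations of the same sentence should map to nearby points in embedding space.
en = np.array(model.encode("I enjoy taking long walks along the beach with my dog."))
de = np.array(model.encode("Ich gehe gerne mit meinem Hund am Strand spazieren."))

# Cosine similarity: translation pairs should score well above unrelated pairs.
similarity = float(en @ de / (np.linalg.norm(en) * np.linalg.norm(de)))
print(similarity)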

Working in Colab

If you are using this in Colab and want to cache the model so you don't have to re-download it every session, use:

import os
# Set these before loading the model so the download is cached in Google Drive
# and persists across Colab sessions.
os.environ['TFHUB_CACHE_DIR'] = "drive/MyDrive/"
os.environ["TFHUB_MODEL_LOAD_FORMAT"] = "COMPRESSED"

Training Corpora

LaBSE is trained on two types of data:

  • Monolingual data (CommonCrawl and Wikipedia)
  • Bilingual translation pairs (translation corpus is constructed from webpages using a bitext mining system)

The extracted sentence pairs are filtered by a pre-trained contrastive data-selection (CDS) scoring model. Human annotators then manually evaluated a small subset of the harvested pairs, marking each pair as either a "GOOD" or a "BAD" translation; roughly 80% of the pairs retained after filtering were rated "GOOD".
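Conceptually, this filtering step scores each mined pair and keeps only pairs above a threshold. A minimal sketch of that idea is below; cds_score and the threshold value are hypothetical placeholders standing in for the paper's scoring model, not a released API.

from typing import Callable, List, Tuple

def filter_pairs(pairs: List[Tuple[str, str]],
                 cds_score: Callable[[str, str], float],
                 threshold: float = 0.5) -> List[Tuple[str, str]]:
    # Keep only mined source/target pairs whose CDS score clears the threshold.
    return [(src, tgt) for src, tgt in pairs if cds_score(src, tgt) >= threshold]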

Training Setup

Lines shorter than 10 characters and lines longer than 5000 characters are removed. Wikipedia data was extracted from the 05-21-2020 dump using WikiExtractor.
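As a concrete reading of the length filter described above (the helper name is illustrative):

def keep_line(line: str) -> bool:
    # Drop lines shorter than 10 characters or longer than 5000 characters.
    return 10 <= len(line) <= 5000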