Universal Sentence Encoder Multilingual Question Answering

pip install vectorhub[encoders-text-tfhub]

Details

Release date: 2019-07-01

Vector length: 512 (default)

Repo: https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3

Paper: https://arxiv.org/abs/1907.04307 (Multilingual Universal Sentence Encoder for Semantic Retrieval)

Example

#pip install vectorhub[encoders-text-tfhub]
from vectorhub.bi_encoders.text_text.tfhub import USEMultiQA2Vec
model = USEMultiQA2Vec()
# Questions and answers are encoded with separate encoders into the same vector space.
question_vector = model.encode_question('How is the weather today?')
answer_vector = model.encode_answer('The weather is great today.')
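
Because both encoders map text into the same 512-dimensional space, question-answer relevance can be scored with a simple dot product. A minimal sketch of that scoring step, assuming the encode methods return plain Python lists as above (numpy is an extra dependency here, not part of the vectorhub API):

import numpy as np
from vectorhub.bi_encoders.text_text.tfhub import USEMultiQA2Vec

model = USEMultiQA2Vec()
question_vec = np.array(model.encode_question('How is the weather today?'))
answer_vecs = np.array([
    model.encode_answer('The weather is great today.'),
    model.encode_answer('I had cereal for breakfast.'),
])
# Higher dot product = more relevant answer for this question.
scores = answer_vecs @ question_vec
print(scores.argmax())  # expected: 0, the weather sentence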

Index and search vectors

Index and search your vectors easily on the cloud using 1 line of code! If you don't want your metadata stored on the cloud, simply attach an ID to each item and keep the metadata locally for your own reference.

username = '<your username>'
email = '<your email>'
# Request an api_key by supplying your username and email.
api_key = model.request_api_key(username, email)

# Index in 1 line of code
items = ['chicken', 'toilet', 'paper', 'enjoy walking']
model.add_documents(username, api_key, items)

# Search in 1 line of code and get the most similar results.
model.search('basin')

# Add metadata to your search
metadata = [
    {'num_of_letters': 7, 'type': 'animal'},
    {'num_of_letters': 6, 'type': 'household_items'},
    {'num_of_letters': 5, 'type': 'household_items'},
    {'num_of_letters': 12, 'type': 'emotion'},
]
model.add_documents(username, api_key, items, metadata=metadata)

Description

  • Developed by researchers at Google in 2019 (v2) [1].
  • Covers 16 languages, with strong performance on cross-lingual question-answer retrieval.
  • It is trained on a variety of data sources and tasks, with the goal of learning text representations that are useful out of the box for retrieving an answer given a question, including when the question and answers are in different languages.
  • It can also be used for other applications, such as text classification and clustering (see the sketch after this list).
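
As a sketch of the clustering use case above: embed sentences with the answer encoder and group them with k-means. The pipeline below is illustrative, assumes scikit-learn is installed, and is not part of the vectorhub API:

import numpy as np
from sklearn.cluster import KMeans
from vectorhub.bi_encoders.text_text.tfhub import USEMultiQA2Vec

model = USEMultiQA2Vec()
texts = [
    'The weather is great today.',
    'It is sunny and warm outside.',
    'I had cereal for breakfast.',
    'Toast and eggs make a good breakfast.',
]
# Embed each sentence with the answer encoder.
vectors = np.array([model.encode_answer(t) for t in texts])
# Semantically similar sentences should land in the same cluster.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # e.g. [0 0 1 1]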

Supported Languages

Arabic, Chinese-simplified, Chinese-traditional, English, French, German, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish, Thai, Turkish, Russian
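
Because all 16 languages share one embedding space, a question in one language can retrieve an answer written in another. A minimal sketch of cross-lingual scoring (the German and Spanish sentences below are illustrative, and numpy is again an extra dependency):

import numpy as np
from vectorhub.bi_encoders.text_text.tfhub import USEMultiQA2Vec

model = USEMultiQA2Vec()
question_vec = np.array(model.encode_question('How is the weather today?'))
candidates = [
    'Das Wetter ist heute großartig.',  # German: "The weather is great today."
    'Desayuné cereales esta mañana.',   # Spanish: "I had cereal this morning."
]
answer_vecs = np.array([model.encode_answer(t) for t in candidates])
scores = answer_vecs @ question_vec
# The German weather sentence should score highest despite the language switch.
print(candidates[int(scores.argmax())])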

Training Corpora

Reddit, Wikipedia, Stanford Natural Language Inference (SNLI), and web-mined translation pairs.

Training Setup

The question-answering model was trained on four distinct task types: i) conversational response prediction, ii) quick thought, iii) natural language inference, and iv) translation ranking (as a bridge task).

Note: to learn cross-lingual representations, the authors used translation ranking tasks over parallel corpora of source-target pairs.

Multi-task training cycles through the tasks, performing an optimization step for a single task at a time. All models are trained for 30 million steps with a batch size of 100, using SGD with a learning rate of 0.008.
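
The ranking tasks above share a dual-encoder objective: score each source against every target in the batch with a dot product, then apply softmax cross-entropy so the true pair outranks the in-batch negatives. A minimal numpy sketch of that loss, with random unit vectors standing in for encoder outputs (the exact formulation in the paper may differ):

import numpy as np

def in_batch_ranking_loss(source_vecs, target_vecs):
    """Dot-product scores with in-batch negatives: target i is the positive
    for source i; softmax cross-entropy rewards the true pair."""
    scores = source_vecs @ target_vecs.T                       # (batch, batch)
    scores -= scores.max(axis=1, keepdims=True)                # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()                          # NLL of true pairs

# Random unit vectors stand in for question/answer encoder outputs.
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 512))                              # batch of 100, dim 512
tgt = src + 0.1 * rng.normal(size=src.shape)                   # noisy "translations"
src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
print(in_batch_ranking_loss(src, tgt))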