LAReQA: Language-agnostic answer retrieval from a multilingual pool

pip install vectorhub[encoders-text-tfhub]

Details

Release date: 2020-04-11

Vector length: 512 (default)

Repo: https://tfhub.dev/google/LAReQA/mBERT_En_En/1

Paper: https://arxiv.org/abs/2004.05484

Example

#pip install vectorhub[encoders-text-tfhub]
from vectorhub.bi_encoders.text_text.tfhub import LAReQA2Vec
model = LAReQA2Vec()
model.encode_question('How is the weather today?')
model.encode_answer('The weather is great today.')
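The question and answer encoders map text into a shared 512-dimensional space, so question-answer relevance can be scored with cosine similarity between the two vectors. A minimal sketch of that scoring, using short toy vectors in place of the real embeddings (`cosine_similarity` is a helper defined here, not a vectorhub API):

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dim stand-ins for the 512-dim vectors the encoders would return.
question_vec = [0.1, 0.3, 0.5, 0.2]
answer_vec = [0.1, 0.25, 0.55, 0.2]
print(round(cosine_similarity(question_vec, answer_vec), 3))  # 0.994
```

In practice you would pass the outputs of `encode_question` and `encode_answer` to the same function.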

Index and search vectors

Index and search your vectors on the cloud in one line of code. If you prefer that metadata not be stored on the cloud, simply attach an ID to each item and keep the metadata locally for your own reference.

username = '<your username>'
email = '<your email>'
# You can request an api_key by providing your username and email.
api_key = model.request_api_key(username, email)

# Index in 1 line of code
items = ['chicken', 'toilet', 'paper', 'enjoy walking']
model.add_documents(username, api_key, items)

# Search in 1 line of code and get the most similar results.
model.search('basin')

# Add metadata to your search
metadata = [
    {'num_of_letters': 7, 'type': 'animal'},
    {'num_of_letters': 6, 'type': 'household_items'},
    {'num_of_letters': 5, 'type': 'household_items'},
    {'num_of_letters': 12, 'type': 'emotion'},
]
model.add_documents(username, api_key, items, metadata=metadata)
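If metadata is kept locally rather than on the cloud, you can still filter retrieved items by it yourself. A minimal sketch, assuming a hypothetical results structure that pairs each item with its metadata (`filter_by_type` is illustrative, not a vectorhub API):

```python
def filter_by_type(results, wanted_type):
    # results: list of dicts, each holding an 'item' and its 'metadata'.
    return [r['item'] for r in results if r['metadata'].get('type') == wanted_type]

results = [
    {'item': 'chicken', 'metadata': {'num_of_letters': 7, 'type': 'animal'}},
    {'item': 'toilet', 'metadata': {'num_of_letters': 6, 'type': 'household_items'}},
    {'item': 'paper', 'metadata': {'num_of_letters': 5, 'type': 'household_items'}},
]
print(filter_by_type(results, 'household_items'))  # ['toilet', 'paper']
```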

Description

We present LAReQA, a challenging new benchmark for language-agnostic answer retrieval from a multilingual candidate pool. Unlike previous cross-lingual tasks, LAReQA tests for "strong" cross-lingual alignment, requiring semantically related cross-language pairs to be closer in representation space than unrelated same-language pairs. Building on multilingual BERT (mBERT), we study different strategies for achieving strong alignment. We find that augmenting training data via machine translation is effective, and improves significantly over using mBERT out-of-the-box. Interestingly, the embedding baseline that performs the best on LAReQA falls short of competing baselines on zero-shot variants of our task that only target "weak" alignment. This finding underscores our claim that language-agnostic retrieval is a substantively new kind of cross-lingual evaluation.
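The "strong" alignment criterion can be made concrete with toy vectors: a semantically related cross-language question-answer pair must score higher than an unrelated same-language pair. A hand-rolled sketch (the three vectors below are made up purely for illustration, not real mBERT embeddings):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy embeddings: an English question, the French version of its matching
# answer, and an unrelated English answer.
q_en = [0.9, 0.1, 0.0]
a_fr_related = [0.85, 0.15, 0.05]
a_en_unrelated = [0.1, 0.2, 0.9]

# Strong alignment holds when the related cross-language pair is closer
# than the unrelated same-language pair.
assert cosine(q_en, a_fr_related) > cosine(q_en, a_en_unrelated)
```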