Dense Passage Retrieval

pip install vectorhub[encoders-text-torch-transformers]

Details

Release date: 2020-10-04

Vector length: 768 (default)

Repo:

Paper: https://arxiv.org/abs/2004.04906

Example

# pip install vectorhub[encoders-text-torch-transformers]
from vectorhub.bi_encoders.text_text.torch_transformers import DPR2Vec
model = DPR2Vec()
# Questions and answers are encoded by separate encoders into
# the same 768-dimensional vector space
model.encode_question('How is the weather today?')
model.encode_answer('The weather is great today.')

Index and search vectors

Index and search your vectors on the cloud with a single line of code. If you do not want metadata stored on the cloud, simply attach an ID to each item and keep the metadata locally for your own reference.

username = '<your username>'
email = '<your email>'
# Request an api_key by supplying your username and email.
api_key = model.request_api_key(username, email)

# Index in 1 line of code
items = ['chicken', 'toilet', 'paper', 'enjoy walking']
model.add_documents(username, api_key, items)

# Search in 1 line of code and get the most similar results.
model.search('basin')

# Add metadata to your search
metadata = [
    {'num_of_letters': 7, 'type': 'animal'},
    {'num_of_letters': 6, 'type': 'household_items'},
    {'num_of_letters': 5, 'type': 'household_items'},
    {'num_of_letters': 12, 'type': 'emotion'},
]
model.add_documents(username, api_key, items, metadata=metadata)
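To keep metadata off the cloud entirely, one option is to upload only IDs and resolve metadata from a local lookup after searching. A minimal sketch (the local store and ID scheme here are illustrative choices, not part of the vectorhub API):

```python
items = ['chicken', 'toilet', 'paper', 'enjoy walking']
metadata = [
    {'num_of_letters': 7, 'type': 'animal'},
    {'num_of_letters': 6, 'type': 'household_items'},
    {'num_of_letters': 5, 'type': 'household_items'},
    {'num_of_letters': 12, 'type': 'emotion'},
]

# Local lookup: ID -> metadata. This dict never leaves your machine;
# only the item texts (or IDs) would be sent to the cloud index.
local_store = {f'item_{i}': meta for i, meta in enumerate(metadata)}

# Cloud side, as above (no metadata argument):
# model.add_documents(username, api_key, items)

# When a search result comes back, resolve its metadata locally.
result_id = 'item_0'
print(local_store[result_id])
```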

Description

Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system largely by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA benchmarks.
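The dual-encoder idea from the abstract can be sketched without the trained model: a question encoder and a passage encoder map text into the same vector space, and passages are ranked by the dot product of their embedding with the question embedding. Below is a toy illustration in which the learned BERT encoders are replaced by a stand-in random projection of a bag-of-words vector; only the ranking mechanism is faithful to DPR:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 768  # DPR's default embedding size
VOCAB = {w: i for i, w in enumerate(
    'how is the weather today great a passage about cooking'.split())}

# Stand-in "encoder": a fixed random projection of word counts.
# Real DPR uses two separately fine-tuned BERT networks instead.
PROJ = rng.standard_normal((len(VOCAB), DIM))

def embed(text):
    bow = np.zeros(len(VOCAB))
    for w in text.lower().split():
        if w in VOCAB:
            bow[VOCAB[w]] += 1.0
    return bow @ PROJ

question = embed('How is the weather today')
passages = ['The weather is great today', 'A passage about cooking']

# Rank candidate passages by dot product with the question vector,
# exactly as dense retrieval does at search time.
scores = [float(embed(p) @ question) for p in passages]
best = passages[int(np.argmax(scores))]
print(best)
```

Because the first passage shares several words with the question, its bag-of-words vector (and hence its projection) points in a similar direction, so it scores higher; in DPR the same geometry is learned from question-passage training pairs rather than induced by word overlap.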