Reference of `vektara`

For more details, please refer to the [GitHub repo](https://github.com/forrestbao/vektara) of vektara.

Summary of Functions

Administrative

Function	Vectara endpoint	Purpose
`vektara.acquire_jwt_token()`	N/A because it is through AWS	Acquire OAuth2 token
`vektara.list_jobs()`	`list-jobs`	List jobs, with filters if applicable

Corpus management

Function	Vectara endpoint	Purpose
`vektara.create_corpus()`	`create-corpus`	Create a new corpus
`vektara.reset_corpus()`	`reset-corpus`	Remove all documents in a corpus while keeping its metadata, if any
`vektara.list_documents()`	`list-documents`	List documents in a corpus
`vektara.set_corpus_filters()`	`replace-corpus-filter-attrs`	Set certain metadata fields to filterable
`vektara.delete_document()`	`delete-doc`	Delete a document by its ID

Adding content to a corpus

Function	Vectara endpoint	Purpose
`vektara.upload()`	`fileUpload`	Upload a single file, a list of files, or an entire folder. Supports adding metadata.
`vektara.create_document_from_sections()`	`index`	Create a document by adding texts with hierarchy, like a book consisting of chapters consisting of sections, etc. But you have no control over how texts are chunked.
`vektara.create_document_from_chunks()`	`core/index`	Create a document by adding text chunks without hierarchy. Each chunk becomes a unit in retrieval.

Querying a corpus

Function	Vectara endpoint	Purpose
`vektara.query()`	`query`	Query a corpus. Supports filtering.

The background classes

Here are the classes used by the Vectara class.

class vektara.Filter(*, name: str, type: Literal['str', 'float', 'int', 'bool'], level: Literal['doc', 'part'], description: str = '', indexed: bool = False)

A filter to be set on a corpus.

For level, 'part' means chunk-level and 'doc' means document-level.

The `Vectara` class

The Vectara class is the main offering of the vektara package. It allows you to establish a connection to the Vectara service, upload data, and make queries.

class vektara.Vectara(base_url: str = 'https://api.vectara.io', customer_id: str | None = None, api_key: str | None = None, client_id: str | None = None, client_secret: str | None = None, from_cli: bool = False, use_oauth2: bool = False)

__init__(base_url: str = 'https://api.vectara.io', customer_id: str | None = None, api_key: str | None = None, client_id: str | None = None, client_secret: str | None = None, from_cli: bool = False, use_oauth2: bool = False)

Initialize a Vectara-class object.

This function supports authentication with Vectara server using either OAuth2 or API Key. If using OAuth2, client_id and client_secret must be provided. If using API Key, api_key must be provided. When both OAuth2 and API credentials are provided, API Key will be used.

Following the convention set by OpenAI’s API in the GenAI era, the credentials will default to those in the environment variables of the operating system.

Examples

>>> from vektara import Vectara
>>> client = Vectara() # get default credentials from environment variables
>>> client = Vectara(api_key='abc', customer_id='123') # pass in credentials for using Personal API key
>>> client = Vectara(client_id='abc', client_secret='xyz', customer_id='123') # pass in credentials for using OAuth2

Parameters:

base_url (str) – The base URL of the Vectara API. Default is https://api.vectara.io. It can be a URL provided by an API proxy, such as one from LlamaKey.ai.
customer_id (str) – The customer ID of the Vectara account. Defaults to the environment variable VECTARA_CUSTOMER_ID. To get a Vectara customer ID, see here.
api_key (str) – The API key of the Vectara account. Defaults to the environment variable VECTARA_API_KEY. To get a Vectara API key, see here.
client_id (str) – The client ID for OAuth2 authentication. Defaults to the environment variable VECTARA_CLIENT_ID. To get an OAuth2 client ID, follow the instructions here or here.
client_secret (str) – The client secret for OAuth2 authentication. Defaults to the environment variable VECTARA_CLIENT_SECRET. To get an OAuth2 client secret, follow the instructions here or here.
from_cli (bool) – Whether the initialization is from a command line interface. If True, the initialization will be silent and the JWT token will be saved in a dotenv file. Default: False.
use_oauth2 (bool) – Whether to use OAuth2 for authentication. If True, the client_id and client_secret must be provided. If False, the api_key must be provided. Default: False.
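
The fallback described above (an explicit argument first, the environment variable otherwise) can be sketched as follows. This is only an illustration of the rule: resolve_credential is a hypothetical helper and not part of vektara, and the key values are made up. Only the environment variable name comes from this page.

```python
import os
from typing import Optional


def resolve_credential(explicit: Optional[str], env_name: str) -> Optional[str]:
    """Return the explicit value if given, else the environment variable (or None)."""
    if explicit is not None:
        return explicit
    return os.environ.get(env_name)


# Hypothetical value, set here only so the example is self-contained.
os.environ["VECTARA_API_KEY"] = "key-from-env"

# No argument passed: the environment variable is used.
print(resolve_credential(None, "VECTARA_API_KEY"))
# Argument passed: it takes precedence over the environment.
print(resolve_credential("key-passed-in", "VECTARA_API_KEY"))
```

In practice you would export the variables in your shell or a dotenv file rather than assign them in code.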

create_corpus(corpus_name: str, corpus_description: str = '', verbose=False) → int | dict

Create a corpus, given a corpus_name and an optional corpus_description.

Examples

>>> from vektara import Vectara
>>> client = Vectara() # get default credentials from environment variables
>>> corpus_id = client.create_corpus('America, the Beautiful') # create a new corpus called 'America, the Beautiful'

Parameters:

corpus_name (str) – The name of the corpus being created.
corpus_description (str) – (Optional) The description of the corpus.

Returns:

The ID of the newly created corpus. If the creation fails, return the response as a nested Python dict for further inspection.

Return type:

int | dict

reset_corpus(corpus_id: int) → int | dict

Reset a corpus specified by corpus_id.

Examples

>>> from vektara import Vectara
>>> client = Vectara() # get default credentials from environment variables
>>> client.reset_corpus(11) # reset the corpus with ID 11

Parameters:

corpus_id (int) – the ID of the corpus to reset

Returns:

1 if the reset is successful. Otherwise, the response as a nested Python dict for further inspection.

Return type:

int | dict

References

https://docs.vectara.com/docs/rest-api/reset-corpus
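
Because create_corpus and reset_corpus return an int on success and a dict on failure, callers may want to branch on the return type. The sketch below shows one way to do that. succeeded is a hypothetical helper, not part of vektara, and the dict content is made up.

```python
from typing import Union


def succeeded(result: Union[int, dict]) -> bool:
    """True when a call returned an int (success), False when it returned a response dict."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(result, int) and not isinstance(result, bool)


# Stand-in return values; no request is sent to Vectara here.
print(succeeded(1))                                   # True
print(succeeded({"status": "hypothetical failure"}))  # False
```

With a real client this could be used as `if not succeeded(client.reset_corpus(11)): ...` to inspect the failure response.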

list_documents(corpus_id: int, numResults: int = 10, pageKey: str | None = None) → dict

List documents in a corpus specified by corpus_id up to the number of the optional parameter numResults.

Examples

>>> from vektara import Vectara
>>> client = Vectara() # get default credentials from environment variables
>>> client.list_documents(11, numResults=5) # list the first 5 documents in the corpus with ID 11

Parameters:

corpus_id (int) – the ID of the corpus to list documents from
numResults (int) – the number of documents to list. The max value is 1,000. Default is 10.
pageKey (str) – (Optional) the page key to get the next page of results. Default is None.

Returns:

A nested Python dict containing the list of documents in the corpus.

Return type:

dict

References

https://docs.vectara.com/docs/rest-api/list-documents
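
To go beyond one page of results, the pageKey parameter has to be fed back from one response into the next call. A minimal sketch of such a loop is below. It runs against a stub instead of a live client, and the response field names ("document", "nextPageKey") are assumptions; check them against the Vectara reference above before relying on them.

```python
from typing import Callable, Dict, Iterator, List, Optional


def iter_pages(fetch: Callable[[Optional[str]], Dict], next_key_field: str) -> Iterator[Dict]:
    """Yield response pages until a response carries no further page key."""
    page_key: Optional[str] = None
    while True:
        page = fetch(page_key)
        yield page
        # Treat a missing or empty key as "no more pages".
        page_key = page.get(next_key_field) or None
        if page_key is None:
            break


# Stub standing in for client.list_documents; contents are made up.
_FAKE_PAGES = {
    None: {"document": [{"id": "a"}, {"id": "b"}], "nextPageKey": "k1"},
    "k1": {"document": [{"id": "c"}], "nextPageKey": ""},
}


def fake_fetch(page_key: Optional[str]) -> Dict:
    return _FAKE_PAGES[page_key]


doc_ids: List[str] = [
    doc["id"] for page in iter_pages(fake_fetch, "nextPageKey") for doc in page["document"]
]
print(doc_ids)  # ['a', 'b', 'c']
```

With a real client, fetch could be `lambda key: client.list_documents(11, numResults=100, pageKey=key)`.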

delete_document(corpus_id: int, doc_id: str) → dict

Delete a document specified by doc_id from a corpus specified by corpus_id.

Examples

>>> from vektara import Vectara
>>> client = Vectara() # get default credentials from environment variables
>>> client.delete_document(11, 'we the people') # delete the document with ID 'we the people' from the corpus with ID 11

upload(corpus_id: int, source: str | List[str], doc_id: str | List[str] | None = None, metadata: Dict | List[Dict] = {}, verbose: bool = False) → dict | List[dict]

Upload a file, a list of files, or files in a folder, to a corpus specified by corpus_id

Examples

>>> from vektara import Vectara
>>> client = Vectara() # get default credentials from environment variables
>>> client.upload(corpus_id, 'test_data/consitution_united_states.txt') # upload one file
>>> client.upload(corpus_id, ['test_data/consitution_united_states.txt', 'test_data/declaration_of_independence.txt'])  # upload a list of files
>>> client.upload(corpus_id, "test_data") # upload all files in a folder, no recursion
>>> client.upload(
        corpus_id = 11,
        source = 'test_data/consitution_united_states.txt',
        doc_id='we the people',
        metadata={
            'number of amendements': '27',
            'Author': 'Representatives from 13 states',
            'number of words': 4543
            },
        verbose=True
    )
>>> client.upload(
        corpus_id = 11,
        source = ['test_data/consitution_united_states.txt', 'test_data/declaration_of_independence.txt', 'test_data/gettysburg_address.txt'],
        doc_id=[
            'the rights',
            'the beginning',
            'the war'
        ],
        metadata=[
            {'Last update': 'May 5, 1992', 'Author': "U.S. Congress"},
            {'Location': 'Philadelphia, PA', 'Author': "Thomas Jefferson et al."}, # Declaration of Independence
            {'Location': 'Gettysburg, PA', 'Author': 'Abraham Lincoln', 'Date': 'November 19, 1863'} # Gettysburg Address
        ]
    )

Parameters:

corpus_id (int) – the corpus ID to upload to
source (str | List[str]) – the source to upload, a file path, a folder path, or a list of file paths
doc_id (str or list of str) – (Optional) alphanumeric ID(s) for referring to the document(s) later. If a string, then it only works when source is a single file. If a list of strings, then each element of doc_id is the document ID for each file in the request. Default is None.
metadata (dict or list of dict) – (Optional) metadata for file(s). If a dict, then the same metadata will be used for all files in the request. If a list of dict, then each element dict is the metadata for the corresponding file in the request – in this case, the documents are not required to have the same fields in their metadata. Default is an empty dict.
verbose (bool) – (Optional) whether to print the detailed information. Default is False.

Returns:

The response from the Vectara server. If a single file is uploaded, then a dict is returned. If multiple files are uploaded (when source is a list of filepaths or a folder), then a list of dict is returned.

Return type:

dict | List[dict]
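
The pairing rules for doc_id and metadata described in the parameters above can be illustrated with a small helper. pair_upload_args is hypothetical and not part of vektara; in particular, raising ValueError on a length mismatch is an assumption of this sketch, not documented behavior.

```python
from typing import Dict, List, Optional, Tuple, Union


def pair_upload_args(
    sources: List[str],
    doc_id: Optional[List[str]] = None,
    metadata: Union[Dict, List[Dict], None] = None,
) -> List[Tuple[str, Optional[str], Dict]]:
    """Pair each file with its document ID and its metadata."""
    ids = doc_id if doc_id is not None else [None] * len(sources)
    if metadata is None:
        metadata = {}
    # A single dict applies to every file; a list is matched element by element.
    metas = [metadata] * len(sources) if isinstance(metadata, dict) else metadata
    if len(ids) != len(sources) or len(metas) != len(sources):
        raise ValueError("doc_id and metadata lists must match the number of files")
    return list(zip(sources, ids, metas))


pairs = pair_upload_args(
    ["a.txt", "b.txt"],
    doc_id=["first", "second"],
    metadata={"Author": "U.S. Congress"},
)
for path, one_id, meta in pairs:
    print(path, one_id, meta)
```

Checking the pairing locally before calling upload can catch a list of IDs that is shorter than the list of files.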

query(corpus_id: int, query: str, top_k: int = 5, offset: int = 0, lang: str = 'auto', contextConfig: dict | None = None, do_generation: bool = True, LLM: Literal['GPT-4', 'GPT-3.5-Turbo', 'GPT-4-Turbo'] = 'GPT-3.5-Turbo', lambda_: float = 0.005, prompt_template_string: str = '', metadata_filter: str = '', print_format: Literal['json', 'markdown'] = '', jupyter_display: bool = False, print_curl: bool = False, **kwargs: Dict[str, str]) → Dict

Make a query to a corpus at Vectara

Examples

>>> from vektara import Vectara
>>> client = Vectara() # get default credentials from environment variables
>>> client.query(
        corpus_id,
        "What if the government fails to protect your rights?",
        metadata_filter="doc.id = 'we the people'",
        top_k=3,
        print_format='json',
        verbose=True
    )

Parameters:

corpus_id (int) – the corpus ID to send the query to
query (str) – the query (question, search terms) to ask
top_k (int) – the number of most matching results to return and to be used for generation
offset (int) – the number of top results to skip. For pagination. Default: 0
lang (str) – the ISO 639-1 or ISO 639-3 language code for the language in which a summary is generated. Default: ‘auto’, which lets the Vectara platform determine the language.
contextConfig (dict) – See https://docs.vectara.com/docs/rest-api/query for details
do_generation (bool) – whether to use the retrieved results for generation. Default: True
LLM (Literal['GPT-4', 'GPT-3.5-Turbo', 'GPT-4-Turbo']) – the language model to use for generation. Default: ‘GPT-3.5-Turbo’
lambda_ (float) – The weight for keyword search. The search ranking score is lambda_ * keyword_score + (1 - lambda_) * neural_score. When lambda_ is 0, the search is purely neural; when it is 1, purely keyword-based. Default: 0.005
prompt_messages (List[Message])
metadata_filter (str) – Vectara’s metadata filter to narrow down the search results. See https://docs.vectara.com/docs/learn/metadata-search-filtering/filter-overview for details.

Returns:

The response from the Vectara server.

Return type:

dict
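
The ranking formula given for lambda_ can be written out directly. The function below is only a local illustration of that arithmetic, not the code Vectara runs; the range check on lambda_ is an assumption of this sketch.

```python
def hybrid_score(keyword_score: float, neural_score: float, lambda_: float = 0.005) -> float:
    """Blend keyword and neural scores: lambda_ * keyword + (1 - lambda_) * neural."""
    if not 0.0 <= lambda_ <= 1.0:
        raise ValueError("lambda_ must be between 0 and 1")
    return lambda_ * keyword_score + (1.0 - lambda_) * neural_score


# Made-up scores for illustration.
print(hybrid_score(0.2, 0.8, lambda_=0.0))  # pure neural: 0.8
print(hybrid_score(0.2, 0.8, lambda_=1.0))  # pure keyword: 0.2
print(hybrid_score(0.2, 0.8))               # default weight, close to the neural score
```

With the default of 0.005 the keyword score contributes only half a percent of the blend, so results are ranked almost entirely by the neural score.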

create_document_from_sections(corpus_id: int, sections: List[str], section_ids: List[int] = [], section_metadata: List[Dict] = [], doc_id: str = '', doc_metadata: Dict = {}, verbose: bool = False) → Dict

Create a document in a corpus specified by corpus_id from a list of texts, each of which is a section of the document.

This is for experts. A document is a collection of sections, each of which is a collection of texts.

The difference between this method and create_document_from_chunks is that in this method, you cannot control the chunking of texts but you can hierarchically organize texts (although currently only one level of hierarchy is supported in this SDK).

Examples

>>> from vektara import Vectara
>>> client = Vectara() # get default credentials from environment variables
>>> client.create_document_from_sections(
        corpus_id = 11,
        sections = [
            "I have one TV. ",
            "Ich habe einen TV."
        ],
        section_ids = [100, 200],
        section_metadata = [
            {"language": "English"},
            {"language": "German"}
        ],
        doc_id = "my apartment",
        doc_metadata = {"genre": "life"},
        verbose = True
    )

Parameters:

corpus_id (int) – the corpus ID to upload to
sections (list of str) – A section is a concept in document retrieval system. It is the sub-unit of a document. Here multiple sections are being added into a document at once.
section_ids (list of int) – (Optional) The section IDs for each section. If not provided, the section IDs will be generated by Vectara.
section_metadata (list of dict) – (Optional) The metadata for each section. If provided, the metadata will be returned with the query result. If not provided, the metadata will be empty.
doc_id (str) – (Optional) The document ID for the document. If not provided, the document will have no ID. Note that the ID is a string, not a number.
doc_metadata (dict) – (Optional) The metadata for the document. If provided, the metadata will be returned with the query result. If not provided, the metadata will be empty.
verbose (bool) – (Optional) Whether to print the detailed information. Default is False.

Returns:

The response from the Vectara server.

Return type:

dict

Notes

Vectara does not allow updating a document. If you want to update a document, you need to delete the document first and then re-add content into it.
A section ID must be a positive integer. If it is 0, it will not show up in the metadata of the query result.
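
The delete-then-re-add workflow from the first note can be sketched as below. replace_document is a hypothetical helper, not part of vektara; it only chains the delete_document and create_document_from_sections methods documented on this page. The example runs against a stub client so that no request is sent.

```python
from typing import Any, Dict, List


def replace_document(client: Any, corpus_id: int, doc_id: str, sections: List[str]) -> Dict:
    """Delete doc_id from the corpus, then recreate it from the new sections."""
    client.delete_document(corpus_id, doc_id)
    return client.create_document_from_sections(
        corpus_id=corpus_id, sections=sections, doc_id=doc_id
    )


class RecordingClient:
    """Stub that records calls instead of contacting Vectara."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def delete_document(self, corpus_id: int, doc_id: str) -> Dict:
        self.calls.append("delete")
        return {}

    def create_document_from_sections(
        self, corpus_id: int, sections: List[str], doc_id: str = ""
    ) -> Dict:
        self.calls.append("create")
        return {"sections": len(sections)}


stub = RecordingClient()
replace_document(stub, 11, "my apartment", ["I have two TVs."])
print(stub.calls)  # ['delete', 'create']
```

A production version would also want to check the delete response before re-adding content, since this sketch ignores it.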

Limitations

Vectara supports hierarchical documents, so a section can recursively be a collection of sections. However, this method supports only one level of hierarchy. We will add support for more levels in the future.

References

https://docs.vectara.com/docs/rest-api/index

create_document_from_chunks(corpus_id: int, chunks: List[str], chunk_metadata: List[Dict] = [], doc_id: str = '', doc_metadata: Dict = {}, verbose: bool = False, print_curl: bool = False, silent: bool = False) → dict

Create a document in a corpus specified by corpus_id from a list of texts, each of which is a chunk of the document.

This is for experts. A document is a collection of chunks. Each chunk is a unit in retrieval.

The difference between this method and create_document_from_sections is that in this method, you can control the chunking of texts – a chunk you upload is the retrieval unit – and all chunks are at the same level, while in create_document_from_sections, you cannot control the chunking of texts and the sections can be hierarchical (although currently only one level of hierarchy is supported in this SDK).

Parameters:

corpus_id (int) – the corpus ID in which to create a document
chunks (list of str) – the chunks of the document
chunk_metadata (list of dict) – (Optional) the metadata for each chunk. If provided, the metadata will be returned with the query result. If not provided, the metadata will be empty.
doc_id (str) – (Optional) the document ID for the document. If not provided, the document will have no ID. Note that the ID is a string, not a number.
doc_metadata (dict) – (Optional) the metadata for the document. If provided, the metadata will be returned with the query result. If not provided, the metadata will be empty.
verbose (bool) – (Optional) whether to print the detailed information. Default is False.
print_curl (bool) – (Optional) whether to print the curl command. Default is False.
silent (bool) – (Optional) whether to suppress the printout. Default is False.

Returns:

the response from the Vectara server.

Return type:

dict

Examples

>>> from vektara import Vectara
>>> client = Vectara() # get default credentials from environment variables
>>> client.create_document_from_chunks(
        corpus_id = 11,
        chunks = [
            "I have one TV. ",
            "Ich habe einen TV."
        ],
        chunk_metadata = [
            {"language": "English"},
            {"language": "German"}
        ],
        doc_id = "my apartment",
        doc_metadata = {"genre": "life"},
        verbose = True
    )
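
Since each chunk passed to create_document_from_chunks becomes a retrieval unit, the caller decides how to split the text. One simple strategy, fixed-size word windows, is sketched below. chunk_by_words is a hypothetical helper and only one of many possible chunking schemes; vektara does not prescribe any.

```python
from typing import List


def chunk_by_words(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


chunks = chunk_by_words("We the People of the United States", max_words=3)
print(chunks)  # ['We the People', 'of the United', 'States']
```

The resulting list can be passed as the chunks argument, with a chunk_metadata list of the same length if per-chunk metadata is needed.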

list_jobs(jobID: int | None = None, corpus_ids: List[int] | None = None, elapsed_seconds: int | None = None, states: List[Literal['QUEUED', 'STARTED', 'COMPLETED', 'FAILED', 'ABORTED', 'UNKNOWN']] | None = None, numResults: int = 10, print_curl: bool = False, pageKey: str | None = None) → dict

List the statuses of jobs.

All parameters are optional and serve to narrow down the job listing. If no parameters are provided, the method returns the latest 100 jobs from the past 180 days.

An example of the response is as follows:

{
    "status": [],
    "job": [
        {
        "id": "SDIzYktHMzNHMlJpsbXC8p5IJaONNGgnpbsUViXXkOoqnA==",
        "type": "JOB__CORPUS_REPLACE_FILTER_ATTRS",
        "corpusId": [
            12
        ],
        "state": "JOB_STATE__COMPLETED",
        "tsCreate": "1714181371",
        "tsStart": "1714181400",
        "tsComplete": "1714181400",
        "userHandle": "forrest.bao@gmail.com"
        },
        {
        "id": "SDIzYktHMzNHMlJpsbXC8pRDZL2Iw1JV41cTcneieXc2CA==",
        "type": "JOB__CORPUS_REPLACE_FILTER_ATTRS",
        "corpusId": [
            12
        ],
        "state": "JOB_STATE__COMPLETED",
        "tsCreate": "1714285513",
        "tsStart": "1714285562",
        "tsComplete": "1714285562",
        "userHandle": "forrest.bao@gmail.com"
        }
    ],
    "pageKey": "e8jhrDQNZwagrBQmcuoGVHUCeaqHF+2TE4nUkm34HPWjm147U6223iT9bO/oa6NKohoQZTT2NuqFRuCqp143g3rIVseAPi0liPTvXfKNc0FGBjuB"
}

Parameters:

jobID (int) – (Optional) the ID of the job to check
corpus_ids (List[int]) – (Optional) the corpus IDs to list jobs for
elapsed_seconds (int) – (Optional) only return jobs from the past elapsed_seconds seconds. The maximum allowed value corresponds to 180 days.
states (List[Literal['QUEUED', 'STARTED', 'COMPLETED', 'FAILED', 'ABORTED', 'UNKNOWN']]) – (Optional) only return jobs matching these states
numResults (int) – (Optional) the number of jobs to return. Max is 100.
pageKey (str) – (Optional) return the jobs starting from this page

Returns:

A nested Python dict containing the list of jobs. The structure of the dict is described in this page https://docs.vectara.com/docs/rest-api/list-jobs

Return type:

dict
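
The example response above can be post-processed into something easier to read. The sketch below assumes that tsCreate and tsStart are Unix epoch seconds encoded as strings, which is what the sample values suggest but is not stated on this page. summarize_jobs is a hypothetical helper, not part of vektara.

```python
from datetime import datetime, timezone
from typing import Dict, List


def summarize_jobs(response: Dict) -> List[Dict]:
    """Convert the epoch-second strings in a list_jobs response into readable fields."""
    summary = []
    for job in response.get("job", []):
        created = datetime.fromtimestamp(int(job["tsCreate"]), tz=timezone.utc)
        summary.append({
            "id": job["id"],
            # Strip the enum prefix, e.g. JOB_STATE__COMPLETED -> COMPLETED.
            "state": job["state"].replace("JOB_STATE__", ""),
            "created": created,
            # Seconds between job creation and job start.
            "queued_seconds": int(job["tsStart"]) - int(job["tsCreate"]),
        })
    return summary


# Trimmed copy of the first job in the example response above.
sample = {
    "job": [
        {
            "id": "SDIzYktHMzNHMlJpsbXC8p5IJaONNGgnpbsUViXXkOoqnA==",
            "state": "JOB_STATE__COMPLETED",
            "tsCreate": "1714181371",
            "tsStart": "1714181400",
            "tsComplete": "1714181400",
        }
    ]
}

rows = summarize_jobs(sample)
for row in rows:
    print(row["state"], row["queued_seconds"], row["created"].isoformat())
```

With a real client the input would be the dict returned by `client.list_jobs(...)`.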

set_corpus_filters(corpus_id: int, filters: List[Filter], print_curl: bool = False) → Dict

Set the filters for a corpus.

Parameters:

corpus_id (int) – the corpus ID to set filters for
filters (List[Filter]) – a list of filters to set. Each filter is an instance of the Filter class.

Returns:

A job ID if the request is successful. Else, the response as a nested Python dict for further inspection.

Return type:

int | dict

Examples

>>> from vektara import Vectara, Filter
>>> client = Vectara() # get default credentials from environment variables
>>> filters = [
        Filter(name="country", type='str', level='doc', indexed=True),
        Filter(name="note", type='str', level='part', indexed=False)
    ]
>>> client.set_corpus_filters(2, filters)

References

https://docs.vectara.com/docs/rest-api/replace-corpus-filter-attrs
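
Once a field has been made filterable, it can be referenced in the metadata_filter string of query, in the same style as the "doc.id = 'we the people'" example on this page. The helper below builds such an equality expression. eq_filter is hypothetical and not part of vektara, and the level.name form for custom fields is an assumption based on that one example. Values containing a single quote are rejected because the escaping rule is not covered here.

```python
def eq_filter(level: str, name: str, value: str) -> str:
    """Build an equality filter expression such as doc.country = 'US'."""
    if level not in ("doc", "part"):
        raise ValueError("level must be 'doc' or 'part'")
    if "'" in value:
        raise ValueError("values containing a single quote are not handled in this sketch")
    return f"{level}.{name} = '{value}'"


print(eq_filter("doc", "country", "US"))  # doc.country = 'US'
```

The resulting string could then be passed as `metadata_filter=eq_filter("doc", "country", "US")` in a query call.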
