Getting Started with the Chat API

Overview

The Chat API exposes a set of agents. Each agent is configured with a specific behaviour and has access to the Search API, so it can follow the RAG (Retrieval Augmented Generation) pattern when needed.

The full reference can be found in the Chat API documentation.

Quickstart: Chat Streaming

The Chat API supports streaming via Server-Sent Events (SSE) through its streaming endpoint.

Assume the following packages are installed:

pip install requests==2.28.1
pip install sseclient-py==1.8.0

Assume the ZETA_ALPHA_API_KEY is stored as an environment variable:

export ZETA_ALPHA_API_KEY=my-api-key

Streaming a message

With each streamed event, the agent's message is extended by the next token, so every event carries the whole message generated so far, along with the evidences, if any.

import json
import os

import requests
import sseclient

TENANT = "zetaalpha"
CHAT_STREAMING_ENDPOINT = (
    f"https://api.zeta-alpha.com/v0/service/chat/stream?tenant={TENANT}"
)

headers = {
    "accept": "text/event-stream",
    "Content-Type": "application/json",
    "x-auth": os.getenv("ZETA_ALPHA_API_KEY"),
}

# Ask the agent a question about a fixed set of documents from the index.
response = requests.post(
    CHAT_STREAMING_ENDPOINT,
    headers=headers,
    json={
        "conversation_context": {
            "document_context": {
                "document_ids": [
                    "73effa5a188b69d32b5889a5ed564db5b66aeeb6_0",
                    "6277484c482c12e80166b3388ef8069b088dfcb6_0",
                    "6c42c17b131d886f0ccf4897d055e42580574240_0",
                    "df40f22694ea7515ef8cd321d877e54c30d336ca_0",
                    "bea6364917260019f43a72a3906d9a417029b9be_0",
                ],
                "retrieval_unit": "document",
            }
        },
        "conversation": [
            {
                "sender": "user",
                "content": "What is BERT?",
            },
        ],
        "agent_identifier": "verbose_qa_with_dynamic_retrieval",
    },
    stream=True,
)

response.raise_for_status()
client = sseclient.SSEClient(response)

# Each event contains the whole message generated so far.
for event in client.events():
    try:
        streamed_data = json.loads(event.data)
        print(f"Data stream: {streamed_data}")
    except Exception:
        print(f"Data stream error: {event.data}")
        streamed_data = None

# The last event holds the complete message.
if streamed_data:
    print("\n---------------- COMPLETE MESSAGE ----------------")
    print(f"Message:\n{streamed_data['content']}\n")
    print(f"Evidences:\n{streamed_data['evidences']}\n")
    print(f"Function Call:\n{streamed_data['function_call_request']}\n")
    print("--------------------------------------------------")

Sample final output:

...
---------------- COMPLETE MESSAGE ----------------
Message:
BERT stands for Bidirectional Encoder Representations from Transformers. It is a transformer-based language model developed by Google and released in late 2018. BERT represents a significant advancement in natural language processing (NLP) due to its ability to understand the context of words in relation to all other words in a sentence, rather than just the words that come before or after them. This bidirectional approach allows BERT to capture the full context of a word, making it particularly effective for various language tasks.

BERT is pre-trained on a large corpus of text and can be fine-tuned for specific tasks such as sentiment analysis, question answering, and named entity recognition. The model consists of multiple encoder layers and self-attention heads, which enable it to process and generate contextualized word embeddings. Fine-tuning BERT for specific tasks typically involves adding a small classification layer on top of the model, allowing it to adapt to the requirements of the task at hand <sup>4</sup><sup>5</sup>.

Evidences:
[{'document_hit_url': '/documents/document/list?tenant=zetaalpha&index_cluster=default:None&property_name=id&property_values=df40f22694ea7515ef8cd321d877e54c30d336ca_0', 'text_extract': 'What is BERT? <b>In 2018 Google developed a transformer-based NLP pretraining model called BERT or Bidirectional Encoder Representations from Transformers.</b> It is nothing but a Transformer language model with multiple encoder layers and self-attention heads.', 'anchor_text': '<sup>4</sup>'}, {'document_hit_url': '/documents/document/list?tenant=zetaalpha&index_cluster=default:None&property_name=id&property_values=bea6364917260019f43a72a3906d9a417029b9be_0', 'text_extract': 'Breaking BERT Down\nShreya Ghelani\nBreaking BERT Down BERT is short for Bidirectional Encoder Representations from Transformers. <b>It is a new type of language model developed and released by Google in late 2018.</b> Pre-trained language models like BERT play…\nAt the output, the token representations are fed into an output layer for token level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.', 'anchor_text': '<sup>5</sup>'}]

Function Call:
None

--------------------------------------------------
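Because each event repeats the full message generated so far, a client that wants a token-by-token display can print only the suffix that is new relative to the previous event. The helper below is a minimal sketch of this pattern, reusing the sseclient.SSEClient from the snippet above; the function name print_deltas is illustrative:

import json

def print_deltas(client):
    # Print only the newly streamed suffix of the agent's message,
    # assuming each event's "content" extends the previous event's content.
    previous = ""
    for event in client.events():
        try:
            streamed_data = json.loads(event.data)
        except Exception:
            continue  # skip malformed events
        content = streamed_data.get("content") or ""
        print(content[len(previous):], end="", flush=True)
        previous = content
    print()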

Streaming a function call

Function call requests are streamed in a single event, so that the streamed object never contains partial JSON parameters.

import json
import os

import requests
import sseclient

TENANT = "zetaalpha"
CHAT_STREAMING_ENDPOINT = (
    f"https://api.zeta-alpha.com/v0/service/chat/stream?tenant={TENANT}"
)

headers = {
    "accept": "text/event-stream",
    "Content-Type": "application/json",
    "x-auth": os.getenv("ZETA_ALPHA_API_KEY"),
}

response = requests.post(
    CHAT_STREAMING_ENDPOINT,
    headers=headers,
    json={
        "conversation_context": {
            "document_context": {
                "document_ids": [
                    "73effa5a188b69d32b5889a5ed564db5b66aeeb6_0",
                    "6277484c482c12e80166b3388ef8069b088dfcb6_0",
                    "6c42c17b131d886f0ccf4897d055e42580574240_0",
                    "df40f22694ea7515ef8cd321d877e54c30d336ca_0",
                    "bea6364917260019f43a72a3906d9a417029b9be_0",
                ],
                "retrieval_unit": "document",
            }
        },
        "conversation": [
            {
                "sender": "user",
                "content": "What is BERT?",
            },
            {
                "sender": "bot",
                "content": "BERT stands for Bidirectional Encoder Representations from Transformers. It is a transformer-based language model developed by Google and released in late 2018. BERT represents a significant advancement in natural language processing (NLP) due to its ability to understand the context of words in relation to all other words in a sentence, rather than just the words that come before or after them. This bidirectional approach allows BERT to capture the full context of a word, making it particularly effective for various language tasks.\n\nBERT is pre-trained on a large corpus of text and can be fine-tuned for specific tasks such as sentiment analysis, question answering, and named entity recognition. The model consists of multiple encoder layers and self-attention heads, which enable it to process and generate contextualized word embeddings. Fine-tuning BERT for specific tasks typically involves adding a small classification layer on top of the model, allowing it to adapt to the requirements of the task at hand <sup>4</sup><sup>5</sup>.",
                "evidences": [
                    {
                        "document_hit_url": "/documents/document/list?tenant=zetaalpha&index_cluster=default:None&property_name=id&property_values=df40f22694ea7515ef8cd321d877e54c30d336ca_0",
                        "text_extract": "What is BERT? <b>In 2018 Google developed a transformer-based NLP pretraining model called BERT or Bidirectional Encoder Representations from Transformers.</b> It is nothing but a Transformer language model with multiple encoder layers and self-attention heads.",
                        "anchor_text": "<sup>4</sup>",
                    },
                    {
                        "document_hit_url": "/documents/document/list?tenant=zetaalpha&index_cluster=default:None&property_name=id&property_values=bea6364917260019f43a72a3906d9a417029b9be_0",
                        "text_extract": "Breaking BERT Down\nShreya Ghelani\nBreaking BERT Down BERT is short for Bidirectional Encoder Representations from Transformers. <b>It is a new type of language model developed and released by Google in late 2018.</b> Pre-trained language models like BERT play…\nAt the output, the token representations are fed into an output layer for token level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.",
                        "anchor_text": "<sup>5</sup>",
                    },
                ],
            },
            {
                "sender": "user",
                "content": "What is CLIP?",
            },
        ],
        "agent_identifier": "verbose_qa_with_dynamic_retrieval",
    },
    stream=True,
)

response.raise_for_status()
client = sseclient.SSEClient(response)

for event in client.events():
    try:
        streamed_data = json.loads(event.data)
        print(f"Data stream: {streamed_data}")
    except Exception:
        print(f"Data stream error: {event.data}")
        streamed_data = None

if streamed_data:
    print("\n---------------- COMPLETE MESSAGE ----------------")
    print(f"Message:\n{streamed_data['content']}\n")
    print(f"Evidences:\n{streamed_data['evidences']}\n")
    print(f"Function Call:\n{streamed_data['function_call_request']}\n")
    print("--------------------------------------------------")

Sample final output:

...
---------------- COMPLETE MESSAGE ----------------
Message:


Evidences:
None

Function Call:
{'name': 'document_search', 'params': {'search_engine': 'zeta_alpha', 'retrieval_method': 'mixed', 'query_string': 'What is CLIP?', 'document_type': ['document']}}

--------------------------------------------------
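When the agent decides to call a function, the final payload carries the request in function_call_request (name plus params) and the content stays empty, as shown above. A client can route such a request to its own implementation. The dispatch below is a hypothetical sketch that continues from the final streamed_data of the snippet above; the HANDLERS registry and the handle_document_search stub are illustrative and not part of the Chat API:

def handle_document_search(params):
    # Illustrative stub: e.g., query the Search API with
    # params["query_string"] and return the resulting hits.
    raise NotImplementedError

HANDLERS = {"document_search": handle_document_search}

request = streamed_data["function_call_request"]
if request is not None:
    handler = HANDLERS.get(request["name"])
    if handler is None:
        raise ValueError(f"Unsupported function: {request['name']}")
    result = handler(request["params"])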

Chat with custom context

Besides chatting with documents ingested in the Zeta Alpha index, the Chat API also lets you chat with custom context: use the custom_context field of conversation_context and pass arbitrary content, either as a string or as JSON with a schema of your choice. Keep in mind that all fields of the JSON are passed to the LLM when generating the response.

In this case, the document_hit_url of the evidence items in the response has the format custom://{document_id}, where document_id is the identifier passed in the custom context of the request.

import json
import os

import requests
import sseclient

TENANT = "zetaalpha"
CHAT_STREAMING_ENDPOINT = (
    f"https://api.zeta-alpha.com/v0/service/chat/stream?tenant={TENANT}"
)

headers = {
    "accept": "text/event-stream",
    "Content-Type": "application/json",
    "x-auth": os.getenv("ZETA_ALPHA_API_KEY"),
}

response = requests.post(
    CHAT_STREAMING_ENDPOINT,
    headers=headers,
    json={
        "conversation_context": {
            "custom_context": {
                "items": [
                    {
                        "document_id": "myID_1",
                        "content": "The weather on Monday will be sunny",
                    },
                    {
                        "document_id": "myID_2",
                        "content": {
                            "title": "Forecast for Tuesday",
                            "prediction": "Cloudy",
                            "temperature": "17 degrees Celsius",
                        },
                    },
                    {
                        "document_id": "myID_3",
                        "content": "The weather on Wednesday will be windy",
                    },
                ],
            }
        },
        "conversation": [
            {
                "sender": "user",
                "content": "What's the weather on Tuesday?",
            },
        ],
        "agent_identifier": "verbose_qa_with_dynamic_retrieval",
    },
    stream=True,
)

response.raise_for_status()
client = sseclient.SSEClient(response)

for event in client.events():
    try:
        streamed_data = json.loads(event.data)
    except Exception:
        print(f"Data stream error: {event.data}")
        streamed_data = None

if streamed_data:
    print("\n---------------- COMPLETE MESSAGE ----------------")
    print(f"Message:\n{streamed_data['content']}\n")
    print(f"Evidences:\n{streamed_data['evidences']}\n")
    print(f"Function Call:\n{streamed_data['function_call_request']}\n")
    print("--------------------------------------------------")

Sample output:

---------------- COMPLETE MESSAGE ----------------
Message:
The weather on Tuesday is expected to be cloudy with 17 degrees Celsius<sup>2</sup>.

Evidences:
[{'document_hit_url': 'custom://myID_2', 'text_extract': 'Forecast for Tuesday\nCloudy\n17 degrees Celsius', 'anchor_text': '<sup>2</sup>'}]

Function Call:
None

--------------------------------------------------

Usage in a Frontend app

If you use the above functionality in a frontend app, you might need to display the context and the evidences to the user in a UI component. To do so, pass the fields needed for that component in the content field of each custom_context.items entry.

For example, to display a document card with a title, source, description, and so on, pass all of these fields as part of the content object of an item in the custom_context field. These fields are used by the LLM to answer the user's question, while the evidence contains a pointer to the item's document_id, so that the frontend can fetch the full document item and render it in the UI component of its choice.
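As a concrete example, the helper below maps an evidence item with a custom:// URL back to the custom_context item it points at, so the frontend can render the full item. It is a minimal sketch: resolve_custom_evidence is an illustrative name, and items is the list sent in the request's custom_context:

def resolve_custom_evidence(evidence, items):
    # Resolve a custom:// document_hit_url back to the original
    # custom_context item so the frontend can render it.
    prefix = "custom://"
    url = evidence["document_hit_url"]
    if not url.startswith(prefix):
        return None  # points at an indexed document, not custom context
    document_id = url[len(prefix):]
    return next(
        (item for item in items if item["document_id"] == document_id), None
    )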

Handling dynamic context added on the fly

Some agents may need to perform dynamic retrieval in the middle of the conversation to search for additional context for answering the user's question. These agents can report back to the API client which context was added.

One use case for this functionality is displaying the searching state and the dynamically retrieved context in the UI, so that the user is aware that new context was found.

For this reason, the content_parts field of the ChatMessage should be used instead of the content field. It contains a list of message parts of type text or context: a text part carries the bot message sent to the user, while a context part carries the new ConversationContext that was added during the conversation.

import json
import os

import requests
import sseclient

TENANT = "zetaalpha"
CHAT_STREAMING_ENDPOINT = (
    f"https://api.zeta-alpha.com/v0/service/chat/stream?tenant={TENANT}"
)

headers = {
    "accept": "text/event-stream",
    "Content-Type": "application/json",
    "x-auth": os.getenv("ZETA_ALPHA_API_KEY"),
}

response = requests.post(
    CHAT_STREAMING_ENDPOINT,
    headers=headers,
    json={
        "conversation_context": {},
        "conversation": [
            {
                "sender": "user",
                "content": "What is RAGElo?",
            },
        ],
        "agent_identifier": "chat_with_dynamic_retrieval",
    },
    stream=True,
)

response.raise_for_status()
client = sseclient.SSEClient(response)

new_context = None
for event in client.events():
    try:
        streamed_data = json.loads(event.data)
        content_parts = streamed_data.get("content_parts", [])
        # A context part signals dynamic retrieval: its "context" is empty
        # while the agent is still searching and populated once new context
        # has been found.
        context_part = next(
            (part for part in content_parts if part.get("type") == "context"), None
        )
        if context_part:
            if not context_part.get("context"):
                print("Searching for new context...\n")
            elif not new_context:
                new_context = context_part["context"]
                print(f"Found new context: {new_context}\n")
    except Exception:
        print(f"Data stream error: {event.data}")
        streamed_data = None

if streamed_data:
    # Concatenate the text parts; context parts have text set to None.
    text = " ".join([part["text"] for part in content_parts if part["text"]])
    print("\n---------------- COMPLETE MESSAGE ----------------")
    print(f"Full response:\n{streamed_data}\n")
    print(f"Message:\n{text}\n")
    print(f"Evidences:\n{streamed_data['evidences']}\n")
    print("--------------------------------------------------")

Sample output:

Searching for new context...

Found new context: {'document_context': {'document_ids': ['93b951a0b39ad32a8702a034716b5fbf1fddb24a_0', '419e39909b4026aa549f1dfbafb9f3958e464b6d_7', 'e7ef3919880d15363d70f854f3f1c61a22174414_4', '419e39909b4026aa549f1dfbafb9f3958e464b6d_10', '93b951a0b39ad32a8702a034716b5fbf1fddb24a_6', '704708f14b5e5167bc2819551fc0a61eaa2a6bcf_5', '93b951a0b39ad32a8702a034716b5fbf1fddb24a_2', 'a65c0ac8d43793998c5c4622cf61877ac76e1fed_0', '93b951a0b39ad32a8702a034716b5fbf1fddb24a_5', '9d1a992abbec6dfef57d2281b4bd7c734fd2dc09_10'], 'retrieval_unit': 'chunk'}, 'custom_context': None}


---------------- COMPLETE MESSAGE ----------------
Full response:
{'sender': 'bot', 'content': 'RAGElo is a toolkit designed to evaluate Retrieval Augmented Generation (RAG)-powered Large Language Models (LLMs) using the Elo rating system. It facilitates the comparison of different RAG pipelines and prompts by ranking their outputs through a tournament-style evaluation. This method helps in identifying the most effective configurations for LLM-based question-answering agents by comparing their performance across multiple questions and scenarios <sup>e4ac3bd9</sup><sup>a83188b8</sup>.', 'content_parts': [{'type': 'context', 'context': {'document_context': {'document_ids': ['93b951a0b39ad32a8702a034716b5fbf1fddb24a_6', '704708f14b5e5167bc2819551fc0a61eaa2a6bcf_5', '93b951a0b39ad32a8702a034716b5fbf1fddb24a_0', 'bf925bdeb58fdfffb124fd5c266f890423d1f890_7', '93b951a0b39ad32a8702a034716b5fbf1fddb24a_5', 'e0929619dd96c529a3b52a615d6445e5049b4930_45', 'e7ef3919880d15363d70f854f3f1c61a22174414_14', '72df85aa7e308aa322cc218c09dc1d7f2a1bcdff_2', 'e7ef3919880d15363d70f854f3f1c61a22174414_4', '48ac0c6a2174f8c2942cbefee9d4da1a2277c438_2'], 'retrieval_unit': 'chunk'}, 'custom_context': None}, 'text': None}, {'type': 'text', 'context': None, 'text': 'RAGElo is a toolkit designed to evaluate Retrieval Augmented Generation (RAG)-powered Large Language Models (LLMs) using the Elo rating system. It facilitates the comparison of different RAG pipelines and prompts by ranking their outputs through a tournament-style evaluation. This method helps in identifying the most effective configurations for LLM-based question-answering agents by comparing their performance across multiple questions and scenarios [e4ac3bd9][a83188b8].'}], 'image_uri': None, 'function_call_request': None, 'evidences': [{'document_hit_url': '/documents/chunk/list?tenant=zetaalpha&property_name=id&property_values=93b951a0b39ad32a8702a034716b5fbf1fddb24a_6', 'text_extract': " <b>RAGElo\n['zetaalphavector']\nRAGElo RAGElo is a set of tools that helps you selecting the best RAG-based LLM agents by using an Elo rank</b>", 'anchor_text': '<sup>e4ac3bd9</sup>'}, {'document_hit_url': '/documents/chunk/list?tenant=zetaalpha&property_name=id&property_values=93b951a0b39ad32a8702a034716b5fbf1fddb24a_0', 'text_extract': " <b>RAGElo\n['zetaalphavector']\nElo-based RAG Agent evaluator \n\nRAGElo[^1] is a streamlined toolkit for evaluating Retrieval Augmented Generation (RAG)-powered Large Language Models (LLMs) question answering agents using the Elo rating system.</b> While it has become easier to prototype and incorporate generative LLMs in production, evaluation is still the most challenging part of the solution.", 'anchor_text': '<sup>a83188b8</sup>'}], 'function_specs': None}

Message:
RAGElo is a toolkit designed to evaluate Retrieval Augmented Generation (RAG)-powered Large Language Models (LLMs) using the Elo rating system. It facilitates the comparison of different RAG pipelines and prompts by ranking their outputs in a tournament-style format. This helps in identifying the most effective configurations for question-answering agents without frequent expert intervention [a83188b8][d698dfd4].

Evidences:
[{'document_hit_url': '/documents/chunk/list?tenant=zetaalpha&property_name=id&property_values=93b951a0b39ad32a8702a034716b5fbf1fddb24a_0', 'text_extract': " <b>RAGElo\n['zetaalphavector']\nElo-based RAG Agent evaluator \n\nRAGElo[^1] is a streamlined toolkit for evaluating Retrieval Augmented Generation (RAG)-powered Large Language Models (LLMs) question answering agents using the Elo rating system.</b> While it has become easier to prototype and incorporate generative LLMs in production, evaluation is still the most challenging part of the solution.", 'anchor_text': '<sup>a83188b8</sup>'}, {'document_hit_url': '/documents/chunk/list?tenant=zetaalpha&property_name=id&property_values=e7ef3919880d15363d70f854f3f1c61a22174414_4', 'text_extract': 'Table 1: Sample of questions submitted by users to the Infi-\nneon RAG-Fusion system\nUser-submitted queries\nWhat is the country of origin of IM72D128, and how does geopolitical\nexposure affect the market and my SAM for the microphone? <b>What is the IP rating of mounted IM72D128?</b>', 'anchor_text': '<sup>d698dfd4</sup>'}]

--------------------------------------------------