How to Use Retrieval-Augmented Generation (RAG) with Vultr Serverless Inference
by Blog Admin
A collection of RAG (Retrieval-Augmented Generation) chat models that enhance AI responses with relevant information from your data sources.
Retrieval-Augmented Generation (RAG) enhances AI models by combining real-time information retrieval with generative capabilities. When using Vultr Serverless Inference, RAG enables chat models to incorporate external, domain-specific knowledge from vector store collections. This improves the contextual accuracy of responses, reduces hallucinations, and ensures that the output is based on the most relevant and up-to-date information.

The RAG endpoint also supports tool calling, allowing models to invoke defined functions during RAG-based interactions. This enables advanced use cases where the model can not only retrieve external knowledge but also act on it, for example by performing calculations, fetching live data, or calling APIs based on retrieved context.
The following models support RAG-based chat completion on Vultr Serverless Inference:
- deepseek-r1-distill-qwen-32b
- qwen2.5-32b-instruct
- qwen2.5-coder-32b-instruct
- llama-3.1-70b-instruct-fp8
- llama-3.3-70b-instruct-fp8
- deepseek-r1-distill-llama-70b
- deepseek-r1

Note that mistral-7B-v0.3 and mistral-nemo-instruct-2407 are not compatible with RAG.
Follow this guide to use RAG-based chat completion on Vultr Serverless Inference using the Vultr API.
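The commands in this guide read your API key from the INFERENCE_API_KEY environment variable. Export it once per shell session before running them; the value below is a placeholder for your actual key.

```console
$ export INFERENCE_API_KEY="your-inference-api-key"
```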
Generate RAG-Based Chat Completions
- Send a GET request to the List Collections endpoint and note the target collection's ID.

```console
$ curl "https://api.vultrinference.com/v1/vector_store" \
    -X GET \
    -H "Authorization: Bearer ${INFERENCE_API_KEY}"
```
- Send a GET request to the List Models endpoint and note the preferred inference model's ID.

```console
$ curl "https://api.vultrinference.com/v1/models" \
    -X GET \
    -H "Authorization: Bearer ${INFERENCE_API_KEY}"
```
- Send a POST request to the RAG Chat Completion endpoint to generate responses using Retrieval-Augmented Generation (RAG).

```console
$ curl "https://api.vultrinference.com/v1/chat/completions/RAG" \
    -X POST \
    -H "Authorization: Bearer ${INFERENCE_API_KEY}" \
    -H "Content-Type: application/json" \
    --data '{
        "collection": "{collection-id}",
        "model": "{model-id}",
        "messages": [
            {
                "role": "user",
                "content": "{user-input}"
            }
        ],
        "max_tokens": 512
    }'
```
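As a concrete illustration, the request below fills in the placeholders with example values. The collection ID product-docs and the question are hypothetical, and llama-3.3-70b-instruct-fp8 is one of the RAG-compatible models listed earlier; substitute the IDs you noted in steps 1 and 2.

```console
$ curl "https://api.vultrinference.com/v1/chat/completions/RAG" \
    -X POST \
    -H "Authorization: Bearer ${INFERENCE_API_KEY}" \
    -H "Content-Type: application/json" \
    --data '{
        "collection": "product-docs",
        "model": "llama-3.3-70b-instruct-fp8",
        "messages": [
            {
                "role": "user",
                "content": "How do I attach block storage to an instance?"
            }
        ],
        "max_tokens": 512
    }'
```

The model answers using context retrieved from the product-docs collection rather than relying only on its training data.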
Visit the RAG Chat Completion endpoint documentation to view additional attributes you can apply for greater control when interacting with your preferred inference model.
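If the JSON responses from the list endpoints above are hard to scan in the terminal, you can pipe them through jq to pretty-print them. This assumes jq is installed locally and makes no assumptions about the response's field names:

```console
$ curl -s "https://api.vultrinference.com/v1/models" \
    -X GET \
    -H "Authorization: Bearer ${INFERENCE_API_KEY}" | jq
```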
Use Tool Calling with the RAG Endpoint
Tool calling is currently supported only on the kimi-k2-instruct model.
- Define your tools using the “tools” parameter in the RAG chat request body.
- Set “tool_choice” to “auto”, “required”, or “none” to control when the model triggers a tool call.
- Send a POST request to the RAG Chat Completion endpoint to combine RAG retrieval and tool invocation.

```console
$ curl "https://api.vultrinference.com/v1/chat/completions/RAG" \
    -X POST \
    -H "Authorization: Bearer ${INFERENCE_API_KEY}" \
    -H "Content-Type: application/json" \
    --data '{
        "collection": "{collection-id}",
        "model": "{model-id}",
        "messages": [
            {
                "role": "user",
                "content": "Ask a question that requires external data retrieval."
            }
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "function_name",
                    "description": "Describe the purpose of the function.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "parameter_name": {
                                "type": "string",
                                "description": "Describe the expected input parameter."
                            }
                        },
                        "required": ["parameter_name"]
                    }
                }
            }
        ],
        "tool_choice": "auto",
        "max_tokens": 512
    }'
```
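For a concrete sketch, the request below defines a hypothetical get_server_status function and uses kimi-k2-instruct, currently the only model that supports tool calling on this endpoint. The function name, its parameters, and the question are illustrative placeholders, not part of any real API.

```console
$ curl "https://api.vultrinference.com/v1/chat/completions/RAG" \
    -X POST \
    -H "Authorization: Bearer ${INFERENCE_API_KEY}" \
    -H "Content-Type: application/json" \
    --data '{
        "collection": "{collection-id}",
        "model": "kimi-k2-instruct",
        "messages": [
            {
                "role": "user",
                "content": "Is the server named web-01 currently online?"
            }
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_server_status",
                    "description": "Return the current status of a server by name.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "server_name": {
                                "type": "string",
                                "description": "The name of the server to check."
                            }
                        },
                        "required": ["server_name"]
                    }
                }
            }
        ],
        "tool_choice": "auto",
        "max_tokens": 512
    }'
```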
The model responds with a structured tool call, for example:
{ "role": "assistant", "tool_calls": [ { "type": "function", "function": { "name": "function_name", "arguments": "{\"parameter_name\": \"example_value\"}" } } ] }You can execute this function locally or through an API and send the output back to the RAG endpoint in a follow-up request to generate a complete, context-aware response.
For detailed implementation steps and usage examples, refer to the Tool Calling with Vultr Serverless Inference Guide.