How to Use Retrieval-Augmented Generation (RAG) with Vultr Serverless Inference

A collection of RAG (Retrieval-Augmented Generation) chat models that enhance AI responses with relevant information from your data sources.


Retrieval-Augmented Generation (RAG) enhances AI models by combining real-time information retrieval with generative capabilities. When using Vultr Serverless Inference, RAG enables chat models to incorporate external, domain-specific knowledge from vector store collections. This improves the contextual accuracy of responses, reduces hallucinations, and ensures that the output is based on the most relevant and up-to-date information.
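Conceptually, the retrieval step pulls relevant chunks from your vector store collection, and those chunks are combined with the user's question before generation. The sketch below illustrates that prompt-assembly idea only; the function and variable names are illustrative and not part of the Vultr API, which performs this step server-side.

```python
# Minimal sketch of the RAG flow: retrieved chunks are combined with the
# user's question so the model grounds its answer in them. Names here are
# illustrative, not part of the Vultr API.

def build_augmented_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Prepend retrieved context to the user's question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Example: chunks a vector store search might return for a billing question
chunks = [
    "Vultr Serverless Inference bills per token generated.",
    "Collections store embedded documents for retrieval.",
]
prompt = build_augmented_prompt("How is inference billed?", chunks)
print(prompt)
```

Because the model sees the retrieved text directly in its context window, its answer is constrained by your data rather than by its training corpus alone.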

Note

The models that support RAG-based chat completion on Vultr Serverless Inference are: deepseek-r1-distill-qwen-32b, qwen2.5-32b-instruct, qwen2.5-coder-32b-instruct, llama-3.1-70b-instruct-fp8, llama-3.3-70b-instruct-fp8, deepseek-r1-distill-llama-70b, and deepseek-r1. mistral-7B-v0.3 and mistral-nemo-instruct-2407 are not compatible with RAG.

Follow this guide to use RAG-based chat completion on Vultr Serverless Inference using the Vultr API.

  1. Send a GET request to the List Collections endpoint and note the target collection’s ID.
    console
    $ curl "https://api.vultrinference.com/v1/vector_store" \
        -X GET \
        -H "Authorization: Bearer ${INFERENCE_API_KEY}"
  2. Send a GET request to the List Models endpoint and note the preferred inference model’s ID.
    console
    $ curl "https://api.vultrinference.com/v1/models" \
        -X GET \
        -H "Authorization: Bearer ${INFERENCE_API_KEY}"
  3. Send a POST request to the RAG Chat Completion endpoint to generate responses using Retrieval-Augmented Generation (RAG).
    console
    $ curl "https://api.vultrinference.com/v1/chat/completions/RAG" \
        -X POST \
        -H "Authorization: Bearer ${INFERENCE_API_KEY}" \
        -H "Content-Type: application/json" \
        --data '{
            "collection": "{collection-id}",
            "model": "{model-id}",
            "messages": [
                {
                    "role": "user",
                    "content": "{user-input}"
                }
            ],
            "max_tokens": 512
        }'

See the RAG Chat Completion endpoint reference to view additional attributes that give you greater control over the model's responses.

