How to Use Retrieval-Augmented Generation (RAG) with Vultr Serverless Inference

A collection of RAG (Retrieval-Augmented Generation) chat models that enhance AI responses with relevant information from your data sources.


Retrieval-Augmented Generation (RAG) enhances AI models by combining real-time information retrieval with generative capabilities. When using Vultr Serverless Inference, RAG enables chat models to incorporate external, domain-specific knowledge from vector store collections. This improves the contextual accuracy of responses, reduces hallucinations, and ensures that the output is based on the most relevant and up-to-date information.
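Conceptually, the retrieval step pulls relevant chunks from your vector store collection, and those chunks are combined with the user's question before generation. The sketch below illustrates that prompt-assembly idea only; the function and variable names are illustrative and not part of the Vultr API, which performs this step server-side.

```python
# Minimal sketch of the RAG flow: retrieved chunks are combined with the
# user's question so the model grounds its answer in them. Names here are
# illustrative, not part of the Vultr API.

def build_augmented_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Prepend retrieved context to the user's question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Example: chunks a vector store search might return for a billing question
chunks = [
    "Vultr Serverless Inference bills per token generated.",
    "Collections store embedded documents for retrieval.",
]
prompt = build_augmented_prompt("How is inference billed?", chunks)
print(prompt)
```

Because the model sees the retrieved text directly in its context window, its answer is constrained by your data rather than by its training corpus alone.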

Note

The models that support RAG-based chat completion on Vultr Serverless Inference are: deepseek-r1-distill-qwen-32b, qwen2.5-32b-instruct, qwen2.5-coder-32b-instruct, llama-3.1-70b-instruct-fp8, llama-3.3-70b-instruct-fp8, deepseek-r1-distill-llama-70b, and deepseek-r1. mistral-7B-v0.3 and mistral-nemo-instruct-2407 are not compatible with RAG.

Follow this guide to use RAG-based chat completion on Vultr Serverless Inference using the Vultr API.

  1. Send a GET request to the List Collections endpoint and note the target collection’s ID.
    console
    $ curl "https://api.vultrinference.com/v1/vector_store" \
        -X GET \
        -H "Authorization: Bearer ${INFERENCE_API_KEY}"
  2. Send a GET request to the List Models endpoint and note the preferred inference model’s ID.
    console
    $ curl "https://api.vultrinference.com/v1/models" \
        -X GET \
        -H "Authorization: Bearer ${INFERENCE_API_KEY}"
  3. Send a POST request to the RAG Chat Completion endpoint to generate responses using Retrieval-Augmented Generation (RAG).
    console
    $ curl "https://api.vultrinference.com/v1/chat/completions/RAG" \
        -X POST \
        -H "Authorization: Bearer ${INFERENCE_API_KEY}" \
        -H "Content-Type: application/json" \
        --data '{
            "collection": "{collection-id}",
            "model": "{model-id}",
            "messages": [
                {
                    "role": "user",
                    "content": "{user-input}"
                }
            ],
            "max_tokens": 512
        }'

See the RAG Chat Completion endpoint reference to view additional attributes that give you greater control over the model's responses.

