How to Use Retrieval-Augmented Generation (RAG) with Vultr Serverless Inference
by Blog Admin
A collection of RAG (Retrieval-Augmented Generation) chat models that enhance AI responses with relevant information from your data sources.
Retrieval-Augmented Generation (RAG) enhances AI models by combining real-time information retrieval with generative capabilities. When using Vultr Serverless Inference, RAG enables chat models to incorporate external, domain-specific knowledge from vector store collections. This improves the contextual accuracy of responses, reduces hallucinations, and ensures that the output is based on the most relevant and up-to-date information.
The models that support RAG-based chat completion on Vultr Serverless Inference are: deepseek-r1-distill-qwen-32b, qwen2.5-32b-instruct, qwen2.5-coder-32b-instruct, llama-3.1-70b-instruct-fp8, llama-3.3-70b-instruct-fp8, deepseek-r1-distill-llama-70b, and deepseek-r1. mistral-7B-v0.3 and mistral-nemo-instruct-2407 are not compatible with RAG.
Follow this guide to use RAG-based chat completion on Vultr Serverless Inference using the Vultr API.
- Send a GET request to the List Collections endpoint and note the target collection’s ID.

```console
$ curl "https://api.vultrinference.com/v1/vector_store" \
    -X GET \
    -H "Authorization: Bearer ${INFERENCE_API_KEY}"
```
- Send a GET request to the List Models endpoint and note the preferred inference model’s ID.

```console
$ curl "https://api.vultrinference.com/v1/models" \
    -X GET \
    -H "Authorization: Bearer ${INFERENCE_API_KEY}"
```
- Send a POST request to the RAG Chat Completion endpoint to generate responses using Retrieval-Augmented Generation (RAG).

```console
$ curl "https://api.vultrinference.com/v1/chat/completions/RAG" \
    -X POST \
    -H "Authorization: Bearer ${INFERENCE_API_KEY}" \
    -H "Content-Type: application/json" \
    --data '{
        "collection": "{collection-id}",
        "model": "{model-id}",
        "messages": [
            {
                "role": "user",
                "content": "{user-input}"
            }
        ],
        "max_tokens": 512
    }'
```
See the RAG Chat Completion endpoint documentation for additional request attributes that give you greater control when interacting with your preferred inference model.
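The curl steps above can also be scripted. Below is a minimal Python sketch of the same flow using only the standard library, assuming the API key is available in the `INFERENCE_API_KEY` environment variable and that the endpoints accept and return JSON as shown in the examples above; the `rag_chat` helper and its parameter names are illustrative, not part of the Vultr API.

```python
import json
import os
import urllib.request

API_BASE = "https://api.vultrinference.com/v1"


def build_rag_payload(collection_id, model_id, user_input, max_tokens=512):
    """Build the JSON body for the RAG Chat Completion endpoint,
    mirroring the curl --data example above."""
    return {
        "collection": collection_id,
        "model": model_id,
        "messages": [{"role": "user", "content": user_input}],
        "max_tokens": max_tokens,
    }


def rag_chat(collection_id, model_id, user_input):
    """Send a POST request to /chat/completions/RAG and return the
    parsed JSON response (assumed structure; check the API docs)."""
    api_key = os.environ["INFERENCE_API_KEY"]
    req = urllib.request.Request(
        f"{API_BASE}/chat/completions/RAG",
        data=json.dumps(build_rag_payload(collection_id, model_id, user_input)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Substitute the collection and model IDs noted in the earlier steps.
    print(rag_chat("{collection-id}", "llama-3.3-70b-instruct-fp8", "{user-input}"))
```

The same pattern works for the List Collections and List Models calls; only the URL and HTTP method change.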