- ai-rag-chat-evaluator: Repo with scripts to generate sample data based on an Azure AI Search index and to evaluate responses from a chat app using a GPT-4 model and custom metrics. Designed for use with azure-search-openai-demo, but it can be modified for use with other chat apps.
- azure-ai-generative: Python package used by the ai-rag-chat-evaluator tools, developed by Microsoft engineers and built on top of PromptFlow.
- Docs: Evaluation of generative AI applications with Azure AI Studio: A UI for evaluations that uses similar underlying technology and metrics.
(Non-Microsoft) Python packages for evaluation:
Improving RAG quality:
- Improve search results by using hybrid search (keyword + vector) with a semantic re-ranker. Azure AI Search has built-in support for both, but they can also be implemented on any vector-capable data store, such as Azure Database for PostgreSQL Flexible Server (see the hybrid-search sketch after this list).
- Don't send more search results than you need: LLMs tend to get "lost in the middle" of long contexts and miss the relevant passages.
- Pre-process user questions before sending them to your search engine, using function calling or a structured-output library (like Microsoft's TypeChat) to rewrite them into clean queries; see the query-rewriting sketch after this list.
- Consider fine-tuning in combination with RAG, as in the RAFT project.
- In my experience, the easiest way to improve results is to upgrade the model (e.g., from GPT-3.5 to GPT-4).
- Evaluate and see for yourself!
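For the hybrid-search tip, here is a minimal sketch of a hybrid (keyword + vector) query with the semantic re-ranker, assuming the `azure-search-documents` Python SDK (11.4+) and an index with a `content` field, a vector field named `embedding`, and a semantic configuration named `default`. The index name, field names, and embedding deployment are placeholders to adapt to your own setup:

```python
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI

# Embed the question with the same model used to embed the indexed documents.
openai_client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",
)
question = "What does a product manager do?"
question_vector = (
    openai_client.embeddings.create(model="text-embedding-ada-002", input=question)
    .data[0]
    .embedding
)

search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="my-index",  # placeholder index name
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

results = search_client.search(
    search_text=question,  # keyword (BM25) half of the hybrid query
    vector_queries=[
        VectorizedQuery(vector=question_vector, k_nearest_neighbors=50, fields="embedding")
    ],
    query_type="semantic",  # apply the semantic re-ranker to the fused results
    semantic_configuration_name="default",
    top=5,  # send only a handful of results to avoid "lost in the middle"
)

for doc in results:
    print(doc["@search.reranker_score"], doc["content"])
```

Note the `top=5`: capping the number of results you forward to the LLM is the simplest way to follow the second tip above.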
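For the query pre-processing tip, here is a minimal sketch using OpenAI function calling to distill a chat message into a clean search query. The `search_sources` tool name and its schema are hypothetical, and a structured-output library such as TypeChat could play the same role:

```python
import json
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Hypothetical tool schema: the model must return a keyword query as JSON.
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_sources",
            "description": "Search the knowledge base for relevant sources.",
            "parameters": {
                "type": "object",
                "properties": {
                    "search_query": {
                        "type": "string",
                        "description": "Keyword query distilled from the user's question",
                    }
                },
                "required": ["search_query"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Turn the user's question into a search query."},
        {"role": "user", "content": "hey!! can u tell me what a PM does at ur company??"},
    ],
    tools=tools,
    # Force the model to call the tool so we always get structured output.
    tool_choice={"type": "function", "function": {"name": "search_sources"}},
)

tool_call = response.choices[0].message.tool_calls[0]
search_query = json.loads(tool_call.function.arguments)["search_query"]
print(search_query)  # e.g., "product manager responsibilities"
```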