- ai-rag-chat-evaluator: Repo with scripts to generate sample data based on an Azure AI Search index and to evaluate responses from a chat app using a GPT-4 model and custom metrics. Designed for use with azure-search-openai-demo, but it can be modified for use with other chat apps.
- azure-ai-generative: Python package used by the ai-rag-chat-evaluator tools, developed by Microsoft engineers and built on top of PromptFlow.
- Docs: Evaluation of generative AI applications with Azure AI Studio: A UI for evaluations that uses similar underlying technology and metrics.
(Non-Microsoft) Python packages for evaluation:
Improving RAG quality:
- Improve search results by using hybrid search (keyword + vector) with a semantic re-ranker. Azure AI Search has built-in support for both, but they can also be implemented on any vector-capable data store, such as Azure Database for PostgreSQL Flexible Server (see the hybrid-search sketch after this list).
- Don't send more search results than you need: LLMs tend to get "lost in the middle" of long contexts and miss the relevant passages.
- Pre-process user questions before sending them to your search engine, using function calling or a structured-output library (like Microsoft's TypeChat) to rewrite them into clean queries; see the query-rewriting sketch after this list.
- Consider fine-tuning in combination with RAG, as in the RAFT project.
- In my experience, the easiest way to improve results is to upgrade the model (e.g., from GPT-3.5 to GPT-4).
- Evaluate and see for yourself!
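For the hybrid-search tip, here is a minimal sketch of a hybrid (keyword + vector) query with the semantic re-ranker, assuming the `azure-search-documents` Python SDK (11.4+) and an index with a `content` field, a vector field named `embedding`, and a semantic configuration named `default`. The index name, field names, and embedding deployment are placeholders to adapt to your own setup:

```python
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI

# Embed the question with the same model used to embed the indexed documents.
openai_client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",
)
question = "What does a product manager do?"
question_vector = (
    openai_client.embeddings.create(model="text-embedding-ada-002", input=question)
    .data[0]
    .embedding
)

search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="my-index",  # placeholder index name
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

results = search_client.search(
    search_text=question,  # keyword (BM25) half of the hybrid query
    vector_queries=[
        VectorizedQuery(vector=question_vector, k_nearest_neighbors=50, fields="embedding")
    ],
    query_type="semantic",  # apply the semantic re-ranker to the fused results
    semantic_configuration_name="default",
    top=5,  # send only a handful of results to avoid "lost in the middle"
)

for doc in results:
    print(doc["@search.reranker_score"], doc["content"])
```

Note the `top=5`: capping the number of results you forward to the LLM is the simplest way to follow the second tip above.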
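For the query pre-processing tip, here is a minimal sketch using OpenAI function calling to distill a chat message into a clean search query. The `search_sources` tool name and its schema are hypothetical, and a structured-output library such as TypeChat could play the same role:

```python
import json
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Hypothetical tool schema: the model must return a keyword query as JSON.
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_sources",
            "description": "Search the knowledge base for relevant sources.",
            "parameters": {
                "type": "object",
                "properties": {
                    "search_query": {
                        "type": "string",
                        "description": "Keyword query distilled from the user's question",
                    }
                },
                "required": ["search_query"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Turn the user's question into a search query."},
        {"role": "user", "content": "hey!! can u tell me what a PM does at ur company??"},
    ],
    tools=tools,
    # Force the model to call the tool so we always get structured output.
    tool_choice={"type": "function", "function": {"name": "search_sources"}},
)

tool_call = response.choices[0].message.tool_calls[0]
search_query = json.loads(tool_call.function.arguments)["search_query"]
print(search_query)  # e.g., "product manager responsibilities"
```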