kingabzpro posted an update Sep 7
How can I make my RAG application generate real-time responses? Up until now, I have been using Groq for fast LLM generation and the Gradio Live function. I am looking for a better solution that can help me build a real-time application without any delay. @abidlabs

kingabzpro/Real-Time-RAG

This all depends on your use case(s), but here are some options; rough sketches for several of them follow the list:

  • Profile your code to find where it is slowest, and troubleshoot those areas first
  • Speculative decoding (Qwen2-0.5B drafting for Qwen2-7B, for example)
  • Model preloading
  • Preloading and/or caching data
  • Caching query responses
  • Use smaller models for embedding/retrieval
  • Experiment with inference optimizations like torch.compile() and Unsloth/Liger/Marlin
  • Use fp8, bfloat16, or float16 torch dtype instead of float32 on GPU
  • Consider a smaller vector DB of summarized data for the first retrieval instead of searching the entire full-text DB up front
  • Use async code where appropriate
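
A quick way to act on the profiling point is to time each stage of the pipeline separately. This is a minimal sketch; `embed_query`, `retrieve`, and `generate` are hypothetical stand-ins for whatever your pipeline actually calls:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str):
    # Print how long a pipeline stage takes.
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{stage}: {time.perf_counter() - start:.3f}s")

with timed("embed"):
    query_vec = embed_query(question)    # your embedding call
with timed("retrieve"):
    docs = retrieve(query_vec, top_k=5)  # your vector DB search
with timed("generate"):
    answer = generate(question, docs)    # your Groq/LLM call
```

Whichever stage dominates is where the rest of this list pays off most.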
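
For speculative decoding, transformers exposes assisted generation through the `assistant_model` argument of `generate`; the draft model must share the main model's tokenizer, which the Qwen2 pairing above does. A sketch, assuming you run the models locally on a GPU:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct", torch_dtype=torch.bfloat16, device_map="cuda"
)
# The small model drafts tokens cheaply; the large model only verifies them.
draft = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-0.5B-Instruct", torch_dtype=torch.bfloat16, device_map="cuda"
)

inputs = tok("Answer from the retrieved context: ...", return_tensors="pt").to("cuda")
out = model.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```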
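
Caching query responses (and embeddings) is often the cheapest win, since repeated questions then skip the pipeline entirely. A minimal in-process sketch with `functools.lru_cache`; `embed_query`, `retrieve`, and `generate` are again your own functions:

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_embedding(text: str) -> tuple:
    # lru_cache needs hashable values, so store a tuple rather than an array.
    return tuple(embed_query(text))

@lru_cache(maxsize=1_000)
def cached_answer(question: str) -> str:
    docs = retrieve(cached_embedding(question), top_k=5)
    return generate(question, docs)
```

In a multi-worker deployment you would swap `lru_cache` for a shared store such as Redis, but the shape of the solution is the same.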
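
On the dtype point, the cheapest version is loading the model in half precision, and this applies to the embedding model as well as the LLM. A sketch with sentence-transformers (the model name is only an example); for a local generation model, compiling the forward pass with `torch.compile` is the analogous lever:

```python
from sentence_transformers import SentenceTransformer

# Example embedder; swap in whatever model you actually use.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda")
embedder.half()  # float16 weights instead of float32: less memory, faster matmuls

vec = embedder.encode("what is retrieval-augmented generation?")
```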
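
And for the smaller-vector-DB idea: do a coarse first pass over compact per-document summaries, then fetch full text only for the hits. A minimal FAISS sketch; `summary_vecs`, `full_texts`, and `embed_query` are hypothetical stand-ins for your own data and embedder:

```python
import numpy as np
import faiss

dim = 384                                # must match your embedding model
summary_index = faiss.IndexFlatIP(dim)   # small index over per-doc summaries
summary_index.add(summary_vecs)          # summary_vecs: (n_docs, dim) float32

q = np.asarray([embed_query(question)], dtype="float32")
_, ids = summary_index.search(q, 3)      # cheap first pass over summaries
contexts = [full_texts[i] for i in ids[0]]  # full text only for the hits
```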

Please note that RAG may not be the best choice for real-time use cases. The main thing to remember is to keep the data as close to the user as possible if you want to get it to them faster.


I'm having some issues with the RAG pipeline. It generally takes 0.2-2 seconds to respond, and most of the time the embedding model takes even longer. I can implement prompt caching, but I was considering a more hardware-oriented solution. What do you think about using Ray for distributed serving? And what do you think about GraphQL?