from langchain.chat_models import ChatOpenAI
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)

judge_prompt = [
    SystemMessage(content="\
Review the user's question and the corresponding response using the additive 5-point \
scoring system described below. Points are accumulated based on the satisfaction of each \
criterion: \
- Add 1 point if the response is relevant and provides some information related to \
the user's inquiry, even if it is incomplete or contains some irrelevant content. \
- Add another point if the response addresses a substantial portion of the user's question, \
but does not completely resolve the query or provide a direct answer. \
- Award a third point if the response answers the basic elements of the user's question in a \
useful way, regardless of whether it seems to have been written by an AI Assistant or if it \
has elements typically found in blogs or search results. \
- Grant a fourth point if the response is clearly written from an AI Assistant's perspective, \
addressing the user's question directly and comprehensively, and is well-organized and \
helpful, even if there is slight room for improvement in clarity, conciseness or focus. \
- Bestow a fifth point for a response that is impeccably tailored to the user's question \
by an AI Assistant, without extraneous information, reflecting expert knowledge, and \
demonstrating a high-quality, engaging, and insightful answer."),
    None,  # placeholder for the user message, filled in by llm_as_judge()
    SystemMessage(content="\
After examining the user's instruction and the response: \
- Briefly justify your total score, up to 100 words. \
- Conclude with the score using the format: \"Score: <total points>\" \
Remember to assess from the AI Assistant perspective, utilizing web search knowledge as \
necessary. To evaluate the response in alignment with this additive scoring model, we'll \
systematically attribute points based on the outlined criteria."),
    AIMessage(content="Score:")  # prefilled assistant turn that nudges the model to emit the score
]

content_text = "User: {}\n<response>{}</response>"

def llm_as_judge(question: str, response: str):
    chat = ChatOpenAI(temperature=0)
    # Slot the question/response pair into the placeholder position of the prompt.
    judge_prompt[1] = HumanMessage(content=content_text.format(question, response))
    result = chat.predict_messages(judge_prompt)
    return result.content
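
The judge is instructed to close with "Score: <total points>", so downstream code usually wants the integer rather than the raw text. Below is a minimal parsing sketch; the parse_score helper and its regexes are illustrative additions, not part of the original script.

import re
from typing import Optional

def parse_score(judge_output: str) -> Optional[int]:
    """Extract the 0-5 score from the judge's reply, or None if absent."""
    # The system prompt asks the judge to end with "Score: <total points>".
    match = re.search(r"Score:\s*([0-5])", judge_output)
    if match is None:
        # Because the final message prefills "Score:", some completions may
        # start with the bare number instead of repeating the label.
        match = re.match(r"\s*([0-5])\b", judge_output)
    return int(match.group(1)) if match else None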
if __name__ == "__main__":
    result = llm_as_judge("what is rag?", """
RAG, or Retrieval-Augmented Generation, is a model that combines the capabilities of pretrained dense retrieval (DPR) and sequence-to-sequence (seq2seq) models. It operates by retrieving documents with a DPR model, passing them to a seq2seq model, and then marginalizing the output to generate responses. Both the retrieval and seq2seq components are initialized from pretrained models and are fine-tuned jointly, allowing the system to adapt both its retrieval and generation processes to specific downstream tasks. RAG models aim to generate more specific, diverse, and factual language for language generation tasks compared to state-of-the-art parametric-only seq2seq models. They address limitations in pre-trained language models related to accessing and manipulating knowledge precisely for knowledge-intensive tasks, offering a solution that combines pre-trained parametric (seq2seq models) and non-parametric (dense vector index of Wikipedia) memory.
""".strip())
    print(result)
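
Note that the imports above target the legacy langchain package layout, where ChatOpenAI and the message classes still live in langchain.chat_models and langchain.schema. On current releases, ChatOpenAI has moved to the separate langchain-openai package, the message classes to langchain-core, and predict_messages is deprecated in favor of invoke. A sketch of the equivalent call under the newer layout, assuming those packages are installed:

from langchain_openai import ChatOpenAI
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage

chat = ChatOpenAI(temperature=0)    # same judge_prompt as above
result = chat.invoke(judge_prompt)  # chat models now return an AIMessage
print(result.content)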