陈俊杰 committed
Commit 5706545 • Parent(s): b91860a
fontSize
app.py
CHANGED
@@ -135,7 +135,7 @@ if page == "Introduction":
135     st.header("Introduction")
136     st.markdown("""
137     <div class='main-text'>
138 -   <p>The Automatic Evaluation of LLMs (AEOLLM) task is a new core task in <a href="http://research.nii.ac.jp/ntcir/ntcir-18">NTCIR-18</a> to support in-depth research on large language models (LLMs) evaluation. As LLMs grow popular in both fields of academia and industry, how to effectively evaluate the capacity of LLMs becomes an increasingly critical but still challenging issue. Existing methods can be divided into two types: manual evaluation, which is expensive, and automatic evaluation, which faces many limitations including the task format (the majority belong to multiple-choice questions) and evaluation criteria (occupied by reference-based metrics). To advance the innovation of automatic evaluation, we proposed the Automatic Evaluation of LLMs (AEOLLM) task which focuses on generative tasks and encourages reference-free methods. Besides, we set up diverse subtasks such as summary generation, non-factoid question answering, text expansion, and dialogue generation to comprehensively test different methods. We believe that the AEOLLM task will facilitate the development of the LLMs community.</p>
138 +   <p class='main-text'>The Automatic Evaluation of LLMs (AEOLLM) task is a new core task in <a href="http://research.nii.ac.jp/ntcir/ntcir-18">NTCIR-18</a> to support in-depth research on large language models (LLMs) evaluation. As LLMs grow popular in both fields of academia and industry, how to effectively evaluate the capacity of LLMs becomes an increasingly critical but still challenging issue. Existing methods can be divided into two types: manual evaluation, which is expensive, and automatic evaluation, which faces many limitations including the task format (the majority belong to multiple-choice questions) and evaluation criteria (occupied by reference-based metrics). To advance the innovation of automatic evaluation, we proposed the Automatic Evaluation of LLMs (AEOLLM) task which focuses on generative tasks and encourages reference-free methods. Besides, we set up diverse subtasks such as summary generation, non-factoid question answering, text expansion, and dialogue generation to comprehensively test different methods. We believe that the AEOLLM task will facilitate the development of the LLMs community.</p>
139     </div>
140     """, unsafe_allow_html=True)
141
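Note: the commit message "fontSize" suggests this change exists so the introduction paragraph picks up the app's existing `main-text` CSS styling (presumably a font-size rule) directly on the `<p>` element rather than only on the wrapping `<div>`. The following is a minimal sketch of that pattern in Streamlit; the CSS values and the exact place where the style is injected are assumptions, not taken from app.py.

# Minimal sketch (assumed, not the actual app.py): defining a 'main-text'
# class and applying it to HTML rendered through st.markdown.
import streamlit as st

# Inject the CSS once; unsafe_allow_html=True is required for raw HTML/CSS.
# The font-size and line-height values here are hypothetical.
st.markdown(
    """
    <style>
    .main-text { font-size: 18px; line-height: 1.6; }
    </style>
    """,
    unsafe_allow_html=True,
)

# After this commit the <p> carries the class itself, so the rule applies to
# the paragraph even if the wrapping <div>'s styling is not inherited.
st.header("Introduction")
st.markdown(
    """
    <div class='main-text'>
    <p class='main-text'>The Automatic Evaluation of LLMs (AEOLLM) task ...</p>
    </div>
    """,
    unsafe_allow_html=True,
)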