陈俊杰 committed
Commit 9f4f414
Parent: 9348641
Files changed (1)
  1. app.py +9 -3
app.py CHANGED
@@ -105,18 +105,24 @@ elif page == "Methodology":
     """,unsafe_allow_html=True)

 elif page == "Datasets":
-    st.header("Answer Generation and Human Annotation")
+    st.header("Answer Generation")
     st.markdown("""
     We randomly sampled **100 instances** from **each** dataset as the question set and selected **7 different LLMs** to generate answers, forming the answer set. As a result, each dataset produced 700 instances, totaling **2,800 instances across the four datasets**.
+    """)
+    st.header("Human Annotation")
+    st.markdown("""
+    - For each instance (question-answer pair), we employed human annotators to provide a score ranging from 1 to 5 and took the median of these scores as the final score.

-    For each instance (question-answer pair), we employed human annotators to provide a score ranging from 1 to 5 and took the median of these scores as the final score. Based on this score, we calculated the rankings of the 7 answers for each question. If scores were identical, the answers were assigned the same rank, with the lowest rank being used.
+    - Based on this score, we calculated the rankings of the 7 answers for each question. If scores were identical, the answers were assigned the same rank, with the lowest rank being used.
     """)
     st.header("Data Acquisition and Usage")
     st.markdown("""
     We divided the 2,800 instances into three parts:

     1️⃣ train set: 20% of the data (covering all four datasets) was designated as the training set (including human annotations) for participants to reference when designing their methods.
+
     2️⃣ test set: Another 20% of the data was set aside as the test set (excluding human annotations), used to evaluate the performance of participants' methods and to generate the **leaderboard**.
+
     3️⃣ reserved set: The remaining 60% of the data was reserved for **the final evaluation**.

     Both the training set and the test set can be downloaded from the provided link: [https://huggingface.co/datasets/THUIR/AEOLLM](https://huggingface.co/datasets/THUIR/AEOLLM).
@@ -162,7 +168,7 @@ elif page == "Important Dates":
     <span class='main-text'>Jun 10-13 2025</span><br />
     """,unsafe_allow_html=True)
     st.markdown("""
-    <p>During the Dry run (until Jan 15, 2025), we will use the <a href="https://huggingface.co/datasets/THUIR/AEOLLM">test set</a> to evaluate the performance of participants' methods and release the results on the Leaderboard.
+    <p>During the Dry run (until Jan 15, 2025), we will use the <a href="https://huggingface.co/datasets/THUIR/AEOLLM">test set (https://huggingface.co/datasets/THUIR/AEOLLM)</a> to evaluate the performance of participants' methods and release the results on the Leaderboard.
     <br />
     Before the Formal run begins (before Jan 15, 2025), we will release the reserved set. Participants need to submit their results for the reserved set before the Formal run ends (before Feb 1, 2025).</p>
     """,unsafe_allow_html=True)
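
The annotation scheme described in the changed text (median of the 1-5 annotator scores, tied answers sharing the lowest rank) can be sketched in Python. This is a minimal illustration, not code from the repository: `rank_answers` is a hypothetical helper, the sample scores are made up, and reading "the lowest rank being used" as competition-style (min) ranking is an assumption.

```python
from statistics import median

def rank_answers(annotator_scores):
    """Score and rank the 7 answers to one question (hypothetical helper).

    annotator_scores: one list of 1-5 annotator scores per answer.
    Returns (final_scores, ranks), where rank 1 is best and tied answers
    share the lowest (best) rank in their group.
    """
    finals = [median(scores) for scores in annotator_scores]
    # Competition ("min") ranking: an answer's rank is 1 plus the number of
    # answers with a strictly higher final score, so equal scores get equal,
    # lowest-possible ranks (e.g. two answers tied at the top both rank 1).
    ranks = [1 + sum(other > s for other in finals) for s in finals]
    return finals, ranks

# Made-up scores from 3 annotators for 7 answers to one question:
scores = [[5, 4, 5], [3, 3, 4], [4, 4, 4], [3, 3, 4], [2, 1, 2], [5, 5, 4], [1, 2, 2]]
finals, ranks = rank_answers(scores)
# finals -> [5, 3, 4, 3, 2, 5, 2]; ranks -> [1, 4, 3, 4, 6, 1, 6]
```

Note how the two answers with median 5 both receive rank 1 and the next answer ranks 3, which is the behavior the min-ranking reading implies.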