陈俊杰 committed
Commit 9f4f414
Parent: 9348641
Files changed (1)
  1. app.py +9 -3
app.py CHANGED
@@ -105,18 +105,24 @@ elif page == "Methodology":
     """,unsafe_allow_html=True)

 elif page == "Datasets":
-    st.header("Answer Generation and Human Annotation")
+    st.header("Answer Generation")
     st.markdown("""
     We randomly sampled **100 instances** from **each** dataset as the question set and selected **7 different LLMs** to generate answers, forming the answer set. As a result, each dataset produced 700 instances, totaling **2,800 instances across the four datasets**.
+    """)
+    st.header("Human Annotation")
+    st.markdown("""
+    - For each instance (question-answer pair), we employed human annotators to provide a score ranging from 1 to 5 and took the median of these scores as the final score.

-    For each instance (question-answer pair), we employed human annotators to provide a score ranging from 1 to 5 and took the median of these scores as the final score. Based on this score, we calculated the rankings of the 7 answers for each question. If scores were identical, the answers were assigned the same rank, with the lowest rank being used.
+    - Based on this score, we calculated the rankings of the 7 answers for each question. If scores were identical, the answers were assigned the same rank, with the lowest rank being used.
     """)
     st.header("Data Acquisition and Usage")
     st.markdown("""
     We divided the 2,800 instances into three parts:

     1️⃣ train set: 20% of the data (covering all four datasets) was designated as the training set (including human annotations) for participants to reference when designing their methods.
+
     2️⃣ test set: Another 20% of the data was set aside as the test set (excluding human annotations), used to evaluate the performance of participants' methods and to generate the **leaderboard**.
+
     3️⃣ reserved set: The remaining 60% of the data was reserved for **the final evaluation**.

     Both the training set and the test set can be downloaded from the provided link: [https://huggingface.co/datasets/THUIR/AEOLLM](https://huggingface.co/datasets/THUIR/AEOLLM).
@@ -162,7 +168,7 @@ elif page == "Important Dates":
     <span class='main-text'>Jun 10-13 2025</span><br />
     """,unsafe_allow_html=True)
     st.markdown("""
-    <p>During the Dry run (until Jan 15, 2025), we will use the <a href="https://huggingface.co/datasets/THUIR/AEOLLM">test set</a> to evaluate the performance of participants' methods and release the results on the Leaderboard.
+    <p>During the Dry run (until Jan 15, 2025), we will use the <a href="https://huggingface.co/datasets/THUIR/AEOLLM">test set (https://huggingface.co/datasets/THUIR/AEOLLM)</a> to evaluate the performance of participants' methods and release the results on the Leaderboard.
     <br />
     Before the Formal run begins (before Jan 15, 2025), we will release the reserved set. Participants need to submit their results for the reserved set before the Formal run ends (before Feb 1, 2025).</p>
     """,unsafe_allow_html=True)
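
The annotation scheme described in the changed text (median of the 1-5 annotator scores, tied answers sharing the lowest rank) can be sketched in Python. This is a minimal illustration, not code from the repository: `rank_answers` is a hypothetical helper, the sample scores are made up, and reading "the lowest rank being used" as competition-style (min) ranking is an assumption.

```python
from statistics import median

def rank_answers(annotator_scores):
    """Score and rank the 7 answers to one question (hypothetical helper).

    annotator_scores: one list of 1-5 annotator scores per answer.
    Returns (final_scores, ranks), where rank 1 is best and tied answers
    share the lowest (best) rank in their group.
    """
    finals = [median(scores) for scores in annotator_scores]
    # Competition ("min") ranking: an answer's rank is 1 plus the number of
    # answers with a strictly higher final score, so equal scores get equal,
    # lowest-possible ranks (e.g. two answers tied at the top both rank 1).
    ranks = [1 + sum(other > s for other in finals) for s in finals]
    return finals, ranks

# Made-up scores from 3 annotators for 7 answers to one question:
scores = [[5, 4, 5], [3, 3, 4], [4, 4, 4], [3, 3, 4], [2, 1, 2], [5, 5, 4], [1, 2, 2]]
finals, ranks = rank_answers(scores)
# finals -> [5, 3, 4, 3, 2, 5, 2]; ranks -> [1, 4, 3, 4, 6, 1, 6]
```

Note how the two answers with median 5 both receive rank 1 and the next answer ranks 3, which is the behavior the min-ranking reading implies.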