add mistral table
index.html +89 -13
index.html
CHANGED
@@ -599,7 +599,7 @@
|
|
599 |
<img src="./static/images/method_plot_v8.png"
|
600 |
class="method_overview"
|
601 |
alt="Methodology Overview of DPP"/>
|
602 |
-
<p>Overview of <strong>Defensive Prompt Patch</strong>. (a) showcases an example of jailbreak attacks.
|
603 |
(b) is the DPP training phase in which the algorithm takes in the refusal and helpful datasets and a prototype of the defense prompt.
|
604 |
Then, the algorithm forms the defense prompt population by revising the prototype using an LLM. For each of the defense prompts in the population,
|
605 |
the algorithm will evaluate the defense and utility scores. The algorithm keeps editing the defense prompts with low scores using the Hierarchical Genetic Search algorithm.
|
@@ -738,18 +738,18 @@
|
|
738 |
|
739 |
<h3>Numerical Results:</h3>
|
740 |
<table border="1" style="width:100%; text-align:center;">
|
741 |
-
<caption>Attack Success Rates (ASRs) and Win-Rates (utility) on LLAMA-2-7B-Chat model across six different jailbreak attacks. Our method can achieve the lowest Average ASR and highest Win-Rate against other defense baselines. The arrow's direction signals improvement, the same below.</caption>
|
742 |
<thead>
|
743 |
<tr>
|
744 |
<th>Methods</th>
|
745 |
-
<th>Base64 [
|
746 |
-
<th>ICA [
|
747 |
-
<th>AutoDAN [
|
748 |
-
<th>GCG [
|
749 |
-
<th>PAIR [
|
750 |
-
<th>TAP [
|
751 |
-
<th>Average ASR [
|
752 |
-
<th>Win-Rate [
|
753 |
</tr>
|
754 |
</thead>
|
755 |
<tbody>
|
@@ -765,7 +765,7 @@
|
|
765 |
<td>81.37</td>
|
766 |
</tr>
|
767 |
<tr>
|
768 |
-
<td>RPO
|
769 |
<td>0.000</td>
|
770 |
<td>0.420</td>
|
771 |
<td>0.280</td>
|
@@ -776,7 +776,7 @@
|
|
776 |
<td>79.23</td>
|
777 |
</tr>
|
778 |
<tr>
|
779 |
-
<td>Goal Prioritization
|
780 |
<td>0.000</td>
|
781 |
<td>0.020</td>
|
782 |
<td>0.520</td>
|
@@ -787,7 +787,7 @@
|
|
787 |
<td>34.29</td>
|
788 |
</tr>
|
789 |
<tr>
|
790 |
-
<td>Self-Reminder
|
791 |
<td>0.030</td>
|
792 |
<td>0.290</td>
|
793 |
<td>0.000</td>
|
@@ -810,6 +810,82 @@
|
|
810 |
</tr>
|
811 |
</tbody>
|
812 |
</table>
|
813 |
|
814 |
</div>
|
815 |
</div>
|
|
|
599 |
<img src="./static/images/method_plot_v8.png"
|
600 |
class="method_overview"
|
601 |
alt="Methodology Overview of DPP"/>
|
602 |
+
<p><strong>Figure 1.</strong> Overview of <strong>Defensive Prompt Patch</strong>. (a) showcases an example of jailbreak attacks.
|
603 |
(b) is the DPP training phase in which the algorithm takes in the refusal and helpful datasets and a prototype of the defense prompt.
|
604 |
Then, the algorithm forms the defense prompt population by revising the prototype using an LLM. For each of the defense prompts in the population,
|
605 |
the algorithm will evaluate the defense and utility scores. The algorithm keeps editing the defense prompts with low scores using the Hierarchical Genetic Search algorithm.
|
|
|
738 |
|
739 |
<h3>Numerical Results:</h3>
|
740 |
<table border="1" style="width:100%; text-align:center;">
|
741 |
+
<caption><strong>Table 1.</strong> Attack Success Rates (ASRs) and Win-Rates (utility) on the LLAMA-2-7B-Chat model across six different jailbreak attacks. Our method achieves the lowest Average ASR and the highest Win-Rate among the defense baselines. The direction of each arrow indicates improvement; the same convention applies to the tables below.</caption>
|
742 |
<thead>
|
743 |
<tr>
|
744 |
<th>Methods</th>
|
745 |
+
<th>Base64 [↓]</th>
|
746 |
+
<th>ICA [↓]</th>
|
747 |
+
<th>AutoDAN [↓]</th>
|
748 |
+
<th>GCG [↓]</th>
|
749 |
+
<th>PAIR [↓]</th>
|
750 |
+
<th>TAP [↓]</th>
|
751 |
+
<th>Average ASR [↓]</th>
|
752 |
+
<th>Win-Rate [↑]</th>
|
753 |
</tr>
|
754 |
</thead>
|
755 |
<tbody>
|
|
|
765 |
<td>81.37</td>
|
766 |
</tr>
|
767 |
<tr>
|
768 |
+
<td>RPO </td>
|
769 |
<td>0.000</td>
|
770 |
<td>0.420</td>
|
771 |
<td>0.280</td>
|
|
|
776 |
<td>79.23</td>
|
777 |
</tr>
|
778 |
<tr>
|
779 |
+
<td>Goal Prioritization</td>
|
780 |
<td>0.000</td>
|
781 |
<td>0.020</td>
|
782 |
<td>0.520</td>
|
|
|
787 |
<td>34.29</td>
|
788 |
</tr>
|
789 |
<tr>
|
790 |
+
<td>Self-Reminder</td>
|
791 |
<td>0.030</td>
|
792 |
<td>0.290</td>
|
793 |
<td>0.000</td>
|
|
|
810 |
</tr>
|
811 |
</tbody>
|
812 |
</table>
|
813 |
+
<table border="1" style="width:100%; text-align:center;">
|
814 |
+
<caption>Attack Success Rates (ASRs) and Win-Rates (utility) on the Mistral-7B-Instruct-v0.2 model across six different jailbreak attacks. Our method achieves the lowest Average ASR with a reasonable trade-off in Win-Rate compared with the other defense baselines.</caption>
|
815 |
+
<thead>
|
816 |
+
<tr>
|
817 |
+
<th>Methods</th>
|
818 |
+
<th>Base64 [↓]</th>
|
819 |
+
<th>ICA [↓]</th>
|
820 |
+
<th>GCG [↓]</th>
|
821 |
+
<th>AutoDAN [↓]</th>
|
822 |
+
<th>PAIR [↓]</th>
|
823 |
+
<th>TAP [↓]</th>
|
824 |
+
<th>Average ASR [↓]</th>
|
825 |
+
<th>Win-Rate [↑]</th>
|
826 |
+
</tr>
|
827 |
+
</thead>
|
828 |
+
<tbody>
|
829 |
+
<tr>
|
830 |
+
<td>w/o defense</td>
|
831 |
+
<td>0.990</td>
|
832 |
+
<td>0.960</td>
|
833 |
+
<td>0.990</td>
|
834 |
+
<td>0.970</td>
|
835 |
+
<td>1.000</td>
|
836 |
+
<td>1.000</td>
|
837 |
+
<td>0.985</td>
|
838 |
+
<td>90.31</td>
|
839 |
+
</tr>
|
840 |
+
<tr>
|
841 |
+
<td>Self-Reminder</td>
|
842 |
+
<td>0.550</td>
|
843 |
+
<td>0.270</td>
|
844 |
+
<td>0.510</td>
|
845 |
+
<td>0.880</td>
|
846 |
+
<td>0.420</td>
|
847 |
+
<td>0.260</td>
|
848 |
+
<td>0.482</td>
|
849 |
+
<td>88.82</td>
|
850 |
+
</tr>
|
851 |
+
<tr>
|
852 |
+
<td>System Prompt</td>
|
853 |
+
<td>0.740</td>
|
854 |
+
<td>0.470</td>
|
855 |
+
<td>0.300</td>
|
856 |
+
<td>0.970</td>
|
857 |
+
<td>0.500</td>
|
858 |
+
<td>0.180</td>
|
859 |
+
<td>0.527</td>
|
860 |
+
<td>84.97</td>
|
861 |
+
</tr>
|
862 |
+
<tr>
|
863 |
+
<td>Goal Prioritization</td>
|
864 |
+
<td>0.030</td>
|
865 |
+
<td>0.440</td>
|
866 |
+
<td>0.030</td>
|
867 |
+
<td>0.390</td>
|
868 |
+
<td>0.300</td>
|
869 |
+
<td>0.140</td>
|
870 |
+
<td>0.222</td>
|
871 |
+
<td>56.59</td>
|
872 |
+
</tr>
|
873 |
+
<tr>
|
874 |
+
<td>DPP (Ours)</td>
|
875 |
+
<td>0.000</td>
|
876 |
+
<td>0.010</td>
|
877 |
+
<td>0.020</td>
|
878 |
+
<td>0.030</td>
|
879 |
+
<td>0.040</td>
|
880 |
+
<td>0.020</td>
|
881 |
+
<td><strong>0.020</strong></td>
|
882 |
+
<td>75.06</td>
|
883 |
+
</tr>
|
884 |
+
</tbody>
|
885 |
+
</table>
|
886 |
+
|
887 |
+
|
888 |
+
|
889 |
|
890 |
</div>
|
891 |
</div>
|
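Review note: the Average ASR column of the added Mistral-7B-Instruct-v0.2 table appears to be the plain mean of the six per-attack ASRs, rounded to three decimals. The sketch below recomputes it as a sanity check. The numbers are copied from the added rows; the names `mistral_asr`, `reported_average`, `average_asr`, and `check_averages` are illustrative and do not come from the repository, and the unweighted mean is an assumption about how the column was produced.

```python
# Per-attack ASRs copied from the added Mistral table, in its column order:
# Base64, ICA, GCG, AutoDAN, PAIR, TAP.
mistral_asr = {
    "w/o defense":         [0.990, 0.960, 0.990, 0.970, 1.000, 1.000],
    "Self-Reminder":       [0.550, 0.270, 0.510, 0.880, 0.420, 0.260],
    "System Prompt":       [0.740, 0.470, 0.300, 0.970, 0.500, 0.180],
    "Goal Prioritization": [0.030, 0.440, 0.030, 0.390, 0.300, 0.140],
    "DPP (Ours)":          [0.000, 0.010, 0.020, 0.030, 0.040, 0.020],
}

# "Average ASR" values as reported in the same rows.
reported_average = {
    "w/o defense": 0.985,
    "Self-Reminder": 0.482,
    "System Prompt": 0.527,
    "Goal Prioritization": 0.222,
    "DPP (Ours)": 0.020,
}


def average_asr(rates):
    """Unweighted mean of the per-attack attack success rates (assumed)."""
    return sum(rates) / len(rates)


def check_averages(rows, reported, tol=0.0005 + 1e-9):
    """Map each method to True if its reported average matches the
    recomputed mean up to three-decimal rounding."""
    return {
        method: abs(average_asr(rates) - reported[method]) <= tol
        for method, rates in rows.items()
    }


if __name__ == "__main__":
    for method, ok in check_averages(mistral_asr, reported_average).items():
        status = "consistent" if ok else "MISMATCH"
        print(f"{method}: {average_asr(mistral_asr[method]):.3f} ({status})")
```

Running the script prints one line per method; a `MISMATCH` would indicate a transcription error in either a per-attack cell or the average cell of that row.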