XicoC committed
Commit 9d4fa34
1 Parent(s): 20b6b01

Add new SentenceTransformer model.

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
{
  "word_embedding_dimension": 768,
  "pooling_mode_cls_token": true,
  "pooling_mode_mean_tokens": false,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false,
  "pooling_mode_weightedmean_tokens": false,
  "pooling_mode_lasttoken": false,
  "include_prompt": true
}
README.md ADDED
@@ -0,0 +1,679 @@
---
base_model: Snowflake/snowflake-arctic-embed-m
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
- dot_accuracy@1
- dot_accuracy@3
- dot_accuracy@5
- dot_accuracy@10
- dot_precision@1
- dot_precision@3
- dot_precision@5
- dot_precision@10
- dot_recall@1
- dot_recall@3
- dot_recall@5
- dot_recall@10
- dot_ndcg@10
- dot_mrr@10
- dot_map@100
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:600
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
widget:
- source_sentence: How can high compute resource utilization in training GAI models
    affect ecosystems?
  sentences:
  - "should not be used in education, work, housing, or in other contexts where the\
    \ use of such surveillance \ntechnologies is likely to limit rights, opportunities,\
    \ or access. Whenever possible, you should have access to \nreporting that confirms\
    \ your data decisions have been respected and provides an assessment of the \n\
    potential impact of surveillance technologies on your rights, opportunities, or\
    \ access. \nNOTICE AND EXPLANATION"
  - "Legal Disclaimer \nThe Blueprint for an AI Bill of Rights: Making Automated Systems\
    \ Work for the American People is a white paper \npublished by the White House\
    \ Office of Science and Technology Policy. It is intended to support the \ndevelopment\
    \ of policies and practices that protect civil rights and promote democratic values\
    \ in the building, \ndeployment, and governance of automated systems. \nThe Blueprint\
    \ for an AI Bill of Rights is non-binding and does not constitute U.S. government\
    \ policy. It \ndoes not supersede, modify, or direct an interpretation of any\
    \ existing statute, regulation, policy, or \ninternational instrument. It does\
    \ not constitute binding guidance for the public or Federal agencies and"
  - "or stereotyping content . \n4. Data Privacy: Impacts due to l eakage and unauthorized\
    \ use, disclosure , or de -anonymization of \nbiometric, health, location , or\
    \ other personally identifiable information or sensitive data .7 \n5. Environmental\
    \ Impacts: Impacts due to high compute resource utilization in training or \n\
    operating GAI models, and related outcomes that may adversely impact ecosystems.\
    \ \n6. Harmful Bias or Homogenization: Amplification and exacerbation of historical,\
    \ societal, and \nsystemic biases ; performance disparities8 between sub- groups\
    \ or languages , possibly due to \nnon- representative training data , that result\
    \ in discrimination, amplification of biases, or"
- source_sentence: What are the potential risks associated with human-AI configuration
    in GAI systems?
  sentences:
  - "establish approved GAI technology and service provider lists. Value Chain and\
    \ Component \nIntegration \nGV-6.1-0 08 Maintain records of changes to content\
    \ made by third parties to promote content \nprovenance, including sources, timestamps,\
    \ metadata . Information Integrity ; Value Chain \nand Component Integration;\
    \ Intellectual Property \nGV-6.1-0 09 Update and integrate due diligence processes\
    \ for GAI acquisition and \nprocurement vendor assessments to include intellectual\
    \ property, data privacy, security, and other risks. For example, update p rocesses\
    \ \nto: Address solutions that \nmay rely on embedded GAI technologies; Address\
    \ ongoing monitoring , \nassessments, and alerting, dynamic risk assessments,\
    \ and real -time reporting"
  - "could lead to homogenized outputs, including by amplifying any homogenization\
    \ from the model used to \ngenerate the synthetic training data . \nTrustworthy\
    \ AI Characteristics: Fair with Harmful Bias Managed, Valid and Reliable \n\
    2.7. Human -AI Configuration \nGAI system use can involve varying risks of misconfigurations\
    \ and poor interactions between a system \nand a human who is interacti ng with\
    \ it. Humans bring their unique perspectives , experiences , or domain -\nspecific\
    \ expertise to interactions with AI systems but may not have detailed knowledge\
    \ of AI systems and \nhow they work. As a result, h uman experts may be unnecessarily\
    \ “averse ” to GAI systems , and thus \ndeprive themselves or others of GAI’s\
    \ beneficial uses ."
  - "requests image features that are inconsistent with the stereotypes. Harmful\
    \ b ias in GAI models , which \nmay stem from their training data , can also \
    \ cause representational harm s or perpetuate or exacerbate \nbias based on\
    \ race, gender, disability, or other protected classes . \nHarmful b ias in GAI\
    \ systems can also lead to harms via disparities between how a model performs\
    \ for \ndifferent subgroups or languages (e.g., an LLM may perform less well\
    \ for non- English languages or \ncertain dialects ). Such disparities can contribute\
    \ to discriminatory decision -making or amplification of \nexisting societal biases.\
    \ In addition, GAI systems may be inappropriately trusted to perform similarly"
- source_sentence: What types of content are considered harmful biases in the context
    of information security?
  sentences:
  - "MS-2.5-0 05 Verify GAI system training data and TEVV data provenance, and that\
    \ fine -tuning \nor retrieval- augmented generation data is grounded. Information\
    \ Integrity \nMS-2.5-0 06 Regularly review security and safety guardrails, especially\
    \ if the GAI system is \nbeing operated in novel circumstances. This includes\
    \ reviewing reasons why the \nGAI system was initially assessed as being safe\
    \ to deploy. Information Security ; Dangerous , \nViolent, or Hateful Content\
    \ \nAI Actor Tasks: Domain Experts, TEVV"
  - "to diminished transparency or accountability for downstream users. While this\
    \ is a risk for traditional AI \nsystems and some other digital technologies\
    \ , the risk is exacerbated for GAI due to the scale of the \ntraining data, which\
    \ may be too large for humans to vet; the difficulty of training foundation models,\
    \ \nwhich leads to extensive reuse of limited numbers of models; an d the extent\
    \ to which GAI may be \nintegrat ed into other devices and services. As GAI\
    \ systems often involve many distinct third -party \ncomponents and data sources\
    \ , it may be difficult to attribute issues in a system’s behavior to any one of\
    \ \nthese sources. \nErrors in t hird-party GAI components can also have downstream\
    \ impacts on accuracy and robustness ."
  - "biases in the generated content. Information Security ; Harmful Bias \nand Homogenization\
    \ \nMG-2.2-005 Engage in due diligence to analyze GAI output for harmful content,\
    \ potential \nmisinformation , and CBRN -related or NCII content . CBRN Information\
    \ or Capabilities ; \nObscene, Degrading, and/or \nAbusive Content ; Harmful Bias\
    \ and \nHomogenization ; Dangerous , \nViolent, or Hateful Content"
- source_sentence: What is the focus of the paper by Padmakumar et al (2024) regarding
    language models and content diversity?
  sentences:
  - "Content \nMS-2.12- 002 Document anticipated environmental impacts of model development,\
    \ \nmaintenance, and deployment in product design decisions. Environmental \n\
    MS-2.12- 003 Measure or estimate environmental impacts (e.g., energy and water\
    \ \nconsumption) for training, fine tuning, and deploying models: Verify tradeoffs\
    \ \nbetween resources used at inference time versus additional resources required\
    \ at training time. Environmental \nMS-2.12- 004 Verify effectiveness of carbon\
    \ capture or offset programs for GAI training and \napplications , and address\
    \ green -washing concerns . Environmental \nAI Actor Tasks: AI Deployment, AI\
    \ Impact Assessment, Domain Experts, Operation and Monitoring, TEVV"
  - "opportunities, undermine their privac y, or pervasively track their activity—often\
    \ without their knowledge or \nconsent. \nThese outcomes are deeply harmful—but\
    \ they are not inevitable. Automated systems have brought about extraor-\ndinary\
    \ benefits, from technology that helps farmers grow food more efficiently and\
    \ computers that predict storm \npaths, to algorithms that can identify diseases\
    \ in patients. These tools now drive important decisions across \nsectors, while\
    \ data is helping to revolutionize global industries. Fueled by the power of American\
    \ innovation, \nthese tools hold the potential to redefine every part of our society\
    \ and make life better for everyone."
  - "Publishing, Paris . https://doi.org/10.1787/d1a8d965- en \nOpenAI (2023) GPT-4\
    \ System Card . https://cdn.openai.com/papers/gpt -4-system -card.pdf \nOpenAI\
    \ (2024) GPT-4 Technical Report. https://arxiv.org/pdf/2303.08774 \nPadmakumar,\
    \ V. et al. (2024) Does writing with language models reduce content diversity?\
    \ ICLR . \nhttps://arxiv.org/pdf/2309.05196 \nPark, P. et. al. (2024) AI\
    \ deception: A survey of examples, risks, and potential solutions. Patterns,\
    \ 5(5). \narXiv . https://arxiv.org/pdf/2308.14752 \nPartnership on AI (2023)\
    \ Building a Glossary for Synthetic Media Transparency Methods, Part 1: Indirect\
    \ \nDisclosure . https://partnershiponai.org/glossary -for-synthetic -media- transparency\
    \ -methods -part-1-\nindirect -disclosure/"
- source_sentence: What are the key components involved in ensuring data quality and
    ethical considerations in AI systems?
  sentences:
  - "(such as where significant negative impacts are imminent, severe harms are actually\
    \ occurring, or large -scale risks could occur); and broad GAI negative risks,\
    \ \nincluding: Immature safety or risk cultures related to AI and GAI design,\
    \ development and deployment, public information integrity risks, including impacts\
    \ on democratic processes, unknown long -term performance characteristics of GAI.\
    \ Information Integrity ; Dangerous , \nViolent, or Hateful Content ; CBRN \n\
    Information or Capabilities \nGV-1.3-007 Devise a plan to halt development or\
    \ deployment of a GAI system that poses unacceptable negative risk. CBRN Information\
    \ and Capability ; \nInformation Security ; Information \nIntegrity \nAI Actor\
    \ Tasks: Governance and Oversight"
  - "30 MEASURE 2.2: Evaluations involving human subjects meet applicable requirements\
    \ (including human subject protection) and are \nrepresentative of the relevant\
    \ population. \nAction ID Suggested Action GAI Risks \nMS-2.2-001 Assess and\
    \ manage statistical biases related to GAI content provenance through \ntechniques\
    \ such as re -sampling, re -weighting, or adversarial training. Information Integrity\
    \ ; Information \nSecurity ; Harmful Bias and \nHomogenization \nMS-2.2-002 Document\
    \ how content provenance data is tracked and how that data interact s \nwith\
    \ privacy and security . Consider : Anonymiz ing data to protect the privacy\
    \ of \nhuman subjects; Leverag ing privacy output filters; Remov ing any personally"
  - "Data quality; Model architecture (e.g., convolutional neural network, transformers,\
    \ etc.); Optimizatio n objectives; Training algorithms; RLHF \napproaches; Fine\
    \ -tuning or retrieval- augmented generation approaches; \nEvaluation data; Ethical\
    \ considerations; Legal and regulatory requirements. Information Integrity ;\
    \ Harmful Bias \nand Homogenization \nAI Actor Tasks: AI Deployment, AI Impact\
    \ Assessment, Domain Experts, End -Users, Operation and Monitoring, TEVV \n \n\
    MEASURE 2.10: Privacy risk of the AI system – as identified in the MAP function\
    \ – is examined and documented. \nAction ID Suggested Action GAI Risks \n\
    MS-2.10- 001 Conduct AI red -teaming to assess issues such as: Outputting of\
    \ training data"
model-index:
- name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-m
  results:
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: Unknown
      type: unknown
    metrics:
    - type: cosine_accuracy@1
      value: 0.8
      name: Cosine Accuracy@1
    - type: cosine_accuracy@3
      value: 0.99
      name: Cosine Accuracy@3
    - type: cosine_accuracy@5
      value: 0.99
      name: Cosine Accuracy@5
    - type: cosine_accuracy@10
      value: 1.0
      name: Cosine Accuracy@10
    - type: cosine_precision@1
      value: 0.8
      name: Cosine Precision@1
    - type: cosine_precision@3
      value: 0.33000000000000007
      name: Cosine Precision@3
    - type: cosine_precision@5
      value: 0.19799999999999998
      name: Cosine Precision@5
    - type: cosine_precision@10
      value: 0.09999999999999998
      name: Cosine Precision@10
    - type: cosine_recall@1
      value: 0.8
      name: Cosine Recall@1
    - type: cosine_recall@3
      value: 0.99
      name: Cosine Recall@3
    - type: cosine_recall@5
      value: 0.99
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 1.0
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.9195108324425135
      name: Cosine Ndcg@10
    - type: cosine_mrr@10
      value: 0.8916666666666667
      name: Cosine Mrr@10
    - type: cosine_map@100
      value: 0.8916666666666666
      name: Cosine Map@100
    - type: dot_accuracy@1
      value: 0.8
      name: Dot Accuracy@1
    - type: dot_accuracy@3
      value: 0.99
      name: Dot Accuracy@3
    - type: dot_accuracy@5
      value: 0.99
      name: Dot Accuracy@5
    - type: dot_accuracy@10
      value: 1.0
      name: Dot Accuracy@10
    - type: dot_precision@1
      value: 0.8
      name: Dot Precision@1
    - type: dot_precision@3
      value: 0.33000000000000007
      name: Dot Precision@3
    - type: dot_precision@5
      value: 0.19799999999999998
      name: Dot Precision@5
    - type: dot_precision@10
      value: 0.09999999999999998
      name: Dot Precision@10
    - type: dot_recall@1
      value: 0.8
      name: Dot Recall@1
    - type: dot_recall@3
      value: 0.99
      name: Dot Recall@3
    - type: dot_recall@5
      value: 0.99
      name: Dot Recall@5
    - type: dot_recall@10
      value: 1.0
      name: Dot Recall@10
    - type: dot_ndcg@10
      value: 0.9195108324425135
      name: Dot Ndcg@10
    - type: dot_mrr@10
      value: 0.8916666666666667
      name: Dot Mrr@10
    - type: dot_map@100
      value: 0.8916666666666666
      name: Dot Map@100
---

# SentenceTransformer based on Snowflake/snowflake-arctic-embed-m

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Snowflake/snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [Snowflake/snowflake-arctic-embed-m](https://huggingface.co/Snowflake/snowflake-arctic-embed-m) <!-- at revision e2b128b9fa60c82b4585512b33e1544224ffff42 -->
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
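
The same three-module stack can also be assembled by hand from the `models` submodule. This is a sketch of an equivalent architecture built from the base checkpoint, not a substitute for loading this finetuned repository (use the Usage section below for that):

```python
from sentence_transformers import SentenceTransformer, models

# BERT encoder -> CLS-token pooling -> L2 normalization, mirroring the
# module stack printed above (and modules.json in this repository).
transformer = models.Transformer("Snowflake/snowflake-arctic-embed-m", max_seq_length=512)
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode="cls")

model = SentenceTransformer(modules=[transformer, pooling, models.Normalize()])
```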

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("XicoC/midterm-finetuned-arctic")
# Run inference
sentences = [
    'What are the key components involved in ensuring data quality and ethical considerations in AI systems?',
    'Data quality; Model architecture (e.g., convolutional neural network, transformers, etc.); Optimizatio n objectives; Training algorithms; RLHF \napproaches; Fine -tuning or retrieval- augmented generation approaches; \nEvaluation data; Ethical considerations; Legal and regulatory requirements. Information Integrity ; Harmful Bias \nand Homogenization \nAI Actor Tasks: AI Deployment, AI Impact Assessment, Domain Experts, End -Users, Operation and Monitoring, TEVV \n \nMEASURE 2.10: Privacy risk of the AI system – as identified in the MAP function – is examined and documented. \nAction ID Suggested Action GAI Risks \nMS-2.10- 001 Conduct AI red -teaming to assess issues such as: Outputting of training data',
    '30 MEASURE 2.2: Evaluations involving human subjects meet applicable requirements (including human subject protection) and are \nrepresentative of the relevant population. \nAction ID Suggested Action GAI Risks \nMS-2.2-001 Assess and manage statistical biases related to GAI content provenance through \ntechniques such as re -sampling, re -weighting, or adversarial training. Information Integrity ; Information \nSecurity ; Harmful Bias and \nHomogenization \nMS-2.2-002 Document how content provenance data is tracked and how that data interact s \nwith privacy and security . Consider : Anonymiz ing data to protect the privacy of \nhuman subjects; Leverag ing privacy output filters; Remov ing any personally',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
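
The checkpoint also ships with a query prompt ("Represent this sentence for searching relevant passages: ", defined in `config_sentence_transformers.json` in this repository). For retrieval, a minimal sketch with a hypothetical passage list applies that prompt to queries only:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("XicoC/midterm-finetuned-arctic")

# prompt_name="query" prepends the query prompt from
# config_sentence_transformers.json; passages are encoded as-is.
passages = [
    "This publication is available free of charge from: https://doi.org/10.6028/NIST.AI.600-1",
    "The Blueprint for an AI Bill of Rights is non-binding and does not constitute U.S. government policy.",
]
query_embeddings = model.encode(["Where can the NIST AI 600-1 publication be accessed for free?"], prompt_name="query")
passage_embeddings = model.encode(passages)

scores = model.similarity(query_embeddings, passage_embeddings)
print(scores)  # shape [1, 2]; the first passage should score higher
```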

<!--
### Direct Usage (Transformers)

<details><summary>Click to see the direct usage in Transformers</summary>

</details>
-->

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

## Evaluation

### Metrics

#### Information Retrieval

* Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)

| Metric              | Value      |
|:--------------------|:-----------|
| cosine_accuracy@1   | 0.8        |
| cosine_accuracy@3   | 0.99       |
| cosine_accuracy@5   | 0.99       |
| cosine_accuracy@10  | 1.0        |
| cosine_precision@1  | 0.8        |
| cosine_precision@3  | 0.33       |
| cosine_precision@5  | 0.198      |
| cosine_precision@10 | 0.1        |
| cosine_recall@1     | 0.8        |
| cosine_recall@3     | 0.99       |
| cosine_recall@5     | 0.99       |
| cosine_recall@10    | 1.0        |
| cosine_ndcg@10      | 0.9195     |
| cosine_mrr@10       | 0.8917     |
| **cosine_map@100**  | **0.8917** |
| dot_accuracy@1      | 0.8        |
| dot_accuracy@3      | 0.99       |
| dot_accuracy@5      | 0.99       |
| dot_accuracy@10     | 1.0        |
| dot_precision@1     | 0.8        |
| dot_precision@3     | 0.33       |
| dot_precision@5     | 0.198      |
| dot_precision@10    | 0.1        |
| dot_recall@1        | 0.8        |
| dot_recall@3        | 0.99       |
| dot_recall@5        | 0.99       |
| dot_recall@10       | 1.0        |
| dot_ndcg@10         | 0.9195     |
| dot_mrr@10          | 0.8917     |
| dot_map@100         | 0.8917     |

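The table above was produced by `InformationRetrievalEvaluator` on a held-out query/passage set that is not published with the model. A minimal sketch of running the same evaluator on your own data follows; the queries, corpus, and relevance labels here are hypothetical toy stand-ins:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("XicoC/midterm-finetuned-arctic")

# Hypothetical evaluation data: query id -> text, doc id -> text,
# query id -> set of relevant doc ids.
queries = {"q1": "Where can the NIST AI 600-1 publication be accessed for free?"}
corpus = {
    "d1": "This publication is available free of charge from: https://doi.org/10.6028/NIST.AI.600-1",
    "d2": "The Blueprint for an AI Bill of Rights is non-binding.",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="toy-eval")
results = evaluator(model)
print(results)  # includes keys like "toy-eval_cosine_ndcg@10" and "toy-eval_cosine_map@100"
```
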
<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Training Dataset

#### Unnamed Dataset

* Size: 600 training samples
* Columns: <code>sentence_0</code> and <code>sentence_1</code>
* Approximate statistics based on the first 600 samples:
  |         | sentence_0                                                                         | sentence_1                                                                          |
  |:--------|:-----------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------|
  | type    | string                                                                             | string                                                                              |
  | details | <ul><li>min: 13 tokens</li><li>mean: 21.67 tokens</li><li>max: 34 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 132.86 tokens</li><li>max: 512 tokens</li></ul> |
* Samples:
  | sentence_0                                                                                                           | sentence_1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
  |:-------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
  | <code>What is the title of the NIST publication related to Artificial Intelligence Risk Management?</code>          | <code>NIST Trustworthy and Responsible AI <br>NIST AI 600 -1 <br>Artificial Intelligence Risk Management <br>Framework: Generative Artificial <br>Intelligence Profile <br> <br> <br>This publication is available free of charge from: <br>https://doi.org/10.6028/NIST.AI.600 -1</code>                                                                                                                                                                                                                                                                |
  | <code>Where can the NIST AI 600 -1 publication be accessed for free?</code>                                         | <code>NIST Trustworthy and Responsible AI <br>NIST AI 600 -1 <br>Artificial Intelligence Risk Management <br>Framework: Generative Artificial <br>Intelligence Profile <br> <br> <br>This publication is available free of charge from: <br>https://doi.org/10.6028/NIST.AI.600 -1</code>                                                                                                                                                                                                                                                                |
  | <code>What is the title of the publication released by NIST in July 2024 regarding artificial intelligence?</code>  | <code>NIST Trustworthy and Responsible AI <br>NIST AI 600 -1 <br>Artificial Intelligence Risk Management <br>Framework: Generative Artificial <br>Intelligence Profile <br> <br> <br>This publication is available free of charge from: <br>https://doi.org/10.6028/NIST.AI.600 -1 <br> <br>July 2024 <br> <br> <br> <br> <br>U.S. Department of Commerce <br>Gina M. Raimondo, Secretary <br>National Institute of Standards and Technology <br>Laurie E. Locascio, NIST Director and Under Secretary of Commerce for Standards and Technology</code>   |
* Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters (a sketch of using the matryoshka dimensions at inference follows this list):
  ```json
  {
      "loss": "MultipleNegativesRankingLoss",
      "matryoshka_dims": [
          768,
          512,
          256,
          128,
          64
      ],
      "matryoshka_weights": [
          1,
          1,
          1,
          1,
          1
      ],
      "n_dims_per_step": -1
  }
  ```
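
Because MatryoshkaLoss optimizes the first 768, 512, 256, 128, and 64 dimensions simultaneously, embeddings can be truncated to any of those sizes with limited quality loss. A minimal sketch using the standard `truncate_dim` option of Sentence Transformers (not something this repository configures explicitly):

```python
from sentence_transformers import SentenceTransformer

# Keep only the first 256 dimensions; 256 is one of the matryoshka_dims listed
# above. Truncated vectors are no longer exactly unit-norm, but cosine
# similarity still applies.
model = SentenceTransformer("XicoC/midterm-finetuned-arctic", truncate_dim=256)

embeddings = model.encode(["How can GAI training affect ecosystems?"])
print(embeddings.shape)  # (1, 256)
```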

### Training Hyperparameters
#### Non-Default Hyperparameters

- `eval_strategy`: steps
- `per_device_train_batch_size`: 20
- `per_device_eval_batch_size`: 20
- `num_train_epochs`: 5
- `multi_dataset_batch_sampler`: round_robin

#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 20
- `per_device_eval_batch_size`: 20
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1
- `num_train_epochs`: 5
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.0
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: False
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`: 
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `eval_use_gather_object`: False
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: round_robin

</details>
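
A minimal sketch of how a comparable fine-tuning run could be launched with the non-default hyperparameters above. The 600-pair training set itself is not published, so the dataset below is a hypothetical stand-in:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m")

# Hypothetical stand-in for the unpublished (question, passage) pairs.
train_dataset = Dataset.from_dict({
    "sentence_0": ["Where can the NIST AI 600-1 publication be accessed for free?"],
    "sentence_1": ["This publication is available free of charge from: https://doi.org/10.6028/NIST.AI.600-1"],
})

# MultipleNegativesRankingLoss wrapped in MatryoshkaLoss, as in the card above.
loss = MatryoshkaLoss(
    model,
    MultipleNegativesRankingLoss(model),
    matryoshka_dims=[768, 512, 256, 128, 64],
)

args = SentenceTransformerTrainingArguments(
    output_dir="midterm-finetuned-arctic",
    num_train_epochs=5,
    per_device_train_batch_size=20,
    per_device_eval_batch_size=20,
    multi_dataset_batch_sampler="round_robin",
    # The original run also used eval_strategy="steps", which additionally
    # requires an eval dataset or evaluator.
)

trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=train_dataset, loss=loss)
trainer.train()
```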

### Training Logs
| Epoch  | Step | cosine_map@100 |
|:------:|:----:|:--------------:|
| 1.0    | 30   | 0.8722         |
| 1.6667 | 50   | 0.8817         |
| 2.0    | 60   | 0.8867         |
| 3.0    | 90   | 0.8867         |
| 3.3333 | 100  | 0.8917         |


### Framework Versions
- Python: 3.10.12
- Sentence Transformers: 3.1.0
- Transformers: 4.44.2
- PyTorch: 2.4.0+cu121
- Accelerate: 0.34.2
- Datasets: 2.19.2
- Tokenizers: 0.19.1

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MatryoshkaLoss
```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

<!--
## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

<!--
## Model Card Authors

*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->

<!--
## Model Card Contact

*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->
config.json ADDED
@@ -0,0 +1,26 @@
{
  "_name_or_path": "midterm-finetuned_arctic",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.44.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
config_sentence_transformers.json ADDED
@@ -0,0 +1,12 @@
{
  "__version__": {
    "sentence_transformers": "3.1.0",
    "transformers": "4.44.2",
    "pytorch": "2.4.0+cu121"
  },
  "prompts": {
    "query": "Represent this sentence for searching relevant passages: "
  },
  "default_prompt_name": null,
  "similarity_fn_name": null
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e0023811e92eb44d32fdb9fe8bd88a6fd762711e7b567617175a836f424f844a
size 437951328
modules.json ADDED
@@ -0,0 +1,20 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  },
  {
    "idx": 2,
    "name": "2",
    "path": "2_Normalize",
    "type": "sentence_transformers.models.Normalize"
  }
]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
{
  "max_seq_length": 512,
  "do_lower_case": false
}
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,62 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "mask_token": "[MASK]",
  "max_length": 512,
  "model_max_length": 512,
  "pad_to_multiple_of": null,
  "pad_token": "[PAD]",
  "pad_token_type_id": 0,
  "padding_side": "right",
  "sep_token": "[SEP]",
  "stride": 0,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "truncation_side": "right",
  "truncation_strategy": "longest_first",
  "unk_token": "[UNK]"
}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff