Zuchen committed on
Commit b024045
1 Parent(s): 7f2739c

Add new SentenceTransformer model.

Files changed (2)
  1. README.md +100 -145
  2. config_sentence_transformers.json +1 -1
README.md CHANGED
@@ -6,180 +6,135 @@ tags:
  - sentence-similarity
  - feature-extraction
  ---
- <div align="center">
- <img src="https://raw.githubusercontent.com/Anditty/OASIS/refs/heads/main/Group.svg" width="60%" alt="Kwaipilot" />
- </div>
- <hr>

- # Kwaipilot OASIS-1.3B

  ## Model Details
- **Model Name**: OASIS (Optimized Augmentation Strategy for Improved code Search)

- **Introduction**

- OASIS is a state-of-the-art code embedding model developed by Kwaipilot. It incorporates unique, proprietary methods, including **repository-level program analysis**, the **OASIS-instruct data synthesis** algorithm, and a **specialized fusion loss function**, setting new benchmarks in code search efficiency and accuracy.

- **Intended Use**

- This model is ideal for developers and researchers working to enhance **code retrieval systems**. OASIS excels in scenarios that require semantic understanding and retrieval of code snippets across varied programming contexts.

- **Training and Performance**

- OASIS was trained on a synthetic dataset created through repository-level analysis, giving it broad coverage of different coding styles and languages. It has demonstrated state-of-the-art performance on the latest code search benchmarks.

- ## Future Directions
- Kwaipilot's upcoming initiatives include:

- - Open sourcing improved models.
- - Releasing technical reports.
- - Releasing natural language processing models.
- - ...

- ## Performance

- | Model | Size | CoSQA | AdvTest | CSN-Py | CSN-Ja | CSN-JS | CSN-PHP | CSN-Go | CSN-Ruby |
- |------|:----:|:-----:|:-------:|:------:|:------:|:------:|:-------:|:------:|:--------:|
- | OpenAI-Embedding-Ada-002 | Unknown | 0.4423 | 0.3808 | 0.6802 | 0.7149 | 0.6750 | 0.6062 | 0.8563 | 0.7472 |
- | jina-embeddings-v2-base-code | 161M | 0.6837 | 0.3850 | 0.6634 | 0.6803 | 0.6304 | 0.5701 | 0.8595 | 0.7095 |
- | CodeSage-large | 1.3B | 0.4753 | 0.5267 | 0.7077 | 0.7021 | 0.6950 | 0.6133 | 0.8371 | 0.7192 |
- | CodeFuse-CGE-Small | 3.8B | 0.5619 | 0.4639 | 0.6958 | 0.6863 | 0.6564 | 0.6133 | 0.8637 | 0.7341 |
- | OASIS-1.3B | 1.3B | 0.5532 | 0.4861 | 0.7010 | 0.7199 | 0.6727 | 0.6217 | 0.8732 | 0.7333 |
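These scores follow the standard retrieval protocol for such benchmarks: embed every query and every candidate snippet, rank candidates by cosine similarity, and score the resulting ranking. The card does not state the exact metric, so the sketch below assumes an MRR-style score; it is illustrative only and not the official evaluation script.

```python
import numpy as np

def mean_reciprocal_rank(query_embs: np.ndarray, code_embs: np.ndarray, gold_idx: list[int]) -> float:
    """Illustrative MRR: embeddings are L2-normalized, gold_idx[i] is the correct snippet for query i."""
    sims = query_embs @ code_embs.T             # cosine similarity matrix (queries x candidates)
    order = (-sims).argsort(axis=1)             # candidate indices, best match first
    reciprocal_ranks = []
    for i, gold in enumerate(gold_idx):
        rank = int(np.where(order[i] == gold)[0][0]) + 1   # 1-based rank of the correct snippet
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))
```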
- ## Usage

- ### Direct Usage

- ```bash
- pip install -U torch
- pip install -U transformers
- ```

- Avoid PyTorch 2.5.0 when loading the model with torch_dtype=torch.bfloat16. For optimal performance and stability, use PyTorch 2.4.1 or earlier, or upgrade to 2.5.1 or later.
 
- ```python
- import torch
- import torch.nn.functional as F
-
- from torch import Tensor
- from transformers import AutoModel, AutoTokenizer
-
- def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
-     left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
-     if left_padding:
-         return last_hidden_states[:, -1]
-     else:
-         sequence_lengths = attention_mask.sum(dim=1) - 1
-         batch_size = last_hidden_states.shape[0]
-         return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
-
- # Add query prompt
- def get_query_prompt(query: str):
-     query_description = 'Given a code search query, retrieve relevant code snippet that answer the query'
-     prompt = f'Instruct: {query_description}\nQuery: {query}'
-     return prompt
-
- query = "How to do quicksort in python?"
-
- code1 = """def bubble_sort(arr):
-     n = len(arr)
-     for i in range(n):
-         swapped = False
-         for j in range(1, n - i):
-             if arr[j - 1] > arr[j]:
-                 arr[j - 1], arr[j] = arr[j], arr[j - 1]
-                 swapped = True
-         if not swapped:
-             break
-     return arr"""
-
- code2 = """def quick_sort(arr):
-     if len(arr) <= 1:
-         return arr
-     else:
-         pivot = arr[0]
-         less = [x for x in arr[1:] if x <= pivot]
-         greater = [x for x in arr[1:] if x > pivot]
-         return quick_sort(less) + [pivot] + quick_sort(greater)"""
-
- model = AutoModel.from_pretrained("Kwaipilot/OASIS-code-1.3B", output_hidden_states=True)
- tokenizer = AutoTokenizer.from_pretrained("Kwaipilot/OASIS-code-1.3B")
-
- # Tokenize and inference
- inputs = tokenizer([get_query_prompt(query), code1, code2], max_length=8192, padding=True, truncation=True, return_tensors='pt')
- outputs = model(**inputs)
-
- # Last token pooling
- embeddings = last_token_pool(outputs.hidden_states[-1], inputs['attention_mask'])
- print(embeddings.shape)
- # torch.Size([3, 2048])
-
- embeddings = F.normalize(embeddings, dim=1, p=2)
- similarity = embeddings @ embeddings.T
- print(similarity[0, 1:])
- # tensor([0.6495, 0.8036])
- ```
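If you prefer to load in bfloat16 for lower memory use (see the PyTorch version note above), only the loading call changes; a minimal sketch:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Same checkpoint as above, loaded in bfloat16; requires a PyTorch version other than 2.5.0.
model = AutoModel.from_pretrained(
    "Kwaipilot/OASIS-code-1.3B",
    output_hidden_states=True,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("Kwaipilot/OASIS-code-1.3B")
```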
- ### Sentence Transformers

- First install the Sentence Transformers library:

- ```bash
- pip install -U sentence-transformers
- ```

- Then you can load this model and run inference.
- ```python
- from sentence_transformers import SentenceTransformer
-
- # Download from the 🤗 Hub
- model = SentenceTransformer("Kwaipilot/OASIS-code-1.3B")#, model_kwargs={"torch_dtype": torch.bfloat16})
-
- query = "How to do quicksort in python?"
-
- code1 = """def bubble_sort(arr):
-     n = len(arr)
-     for i in range(n):
-         swapped = False
-         for j in range(1, n - i):
-             if arr[j - 1] > arr[j]:
-                 arr[j - 1], arr[j] = arr[j], arr[j - 1]
-                 swapped = True
-         if not swapped:
-             break
-     return arr"""
-
- code2 = """def quick_sort(arr):
-     if len(arr) <= 1:
-         return arr
-     else:
-         pivot = arr[0]
-         less = [x for x in arr[1:] if x <= pivot]
-         greater = [x for x in arr[1:] if x > pivot]
-         return quick_sort(less) + [pivot] + quick_sort(greater)"""
-
- # Run inference
- query_embedding = model.encode([query], prompt_name="query")
- code_embeddings = model.encode([code1, code2])
-
- print(code_embeddings.shape)
- # (2, 2048)
-
- # Get the similarity scores for the embeddings
- print(model.similarity(query_embedding[0], code_embeddings[0]))
- print(model.similarity(query_embedding[0], code_embeddings[1]))
- # tensor([[0.6495]])
- # tensor([[0.8036]])
- ```
 
 
 
 
  ### BibTeX
- ```bibtex
- @misc{kwaipilotoasis,
-     title = {Optimized Augmentation Strategy for Improved code Search},
-     author = {Kwaipilot team},
-     year = {2024},
- }
- ```
  - sentence-similarity
  - feature-extraction
  ---

+ # SentenceTransformer
+
+ This is a [sentence-transformers](https://www.SBERT.net) model. It maps sentences & paragraphs to a 2048-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
  ## Model Details

+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
+ - **Maximum Sequence Length:** 8192 tokens
+ - **Output Dimensionality:** 2048 dimensions
+ - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
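The sequence length, embedding dimensionality, and similarity function listed above can be confirmed directly from the loaded model; a small sketch:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Kwaipilot/OASIS-code-1.3B")
print(model.max_seq_length)                      # 8192
print(model.get_sentence_embedding_dimension())  # 2048
print(model.similarity_fn_name)                  # similarity function used by model.similarity()
```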
+ ### Model Sources

+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

+ ### Full Model Architecture

+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: LlamaModel
+   (1): Pooling({'word_embedding_dimension': 2048, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': True})
+ )
+ ```
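Loading the checkpoint with `SentenceTransformer("Kwaipilot/OASIS-code-1.3B")` assembles these two modules automatically. For illustration only, an equivalent manual assembly would look roughly like this (the module names and arguments below come from the standard Sentence Transformers API, not from OASIS-specific code):

```python
from sentence_transformers import SentenceTransformer, models

# Transformer backbone with the 8192-token limit, followed by last-token pooling to 2048 dims.
transformer = models.Transformer("Kwaipilot/OASIS-code-1.3B", max_seq_length=8192)
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode="lasttoken")
model = SentenceTransformer(modules=[transformer, pooling])
```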
+ ## Usage

+ ### Direct Usage (Sentence Transformers)

+ First install the Sentence Transformers library:

+ ```bash
+ pip install -U sentence-transformers
+ ```

+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("Kwaipilot/OASIS-code-1.3B")
+ # Run inference
+ sentences = [
+     'The weather is lovely today.',
+     "It's so sunny outside!",
+     'He drove to the stadium.',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # (3, 2048)
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # torch.Size([3, 3])
+ ```
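For code search specifically, this checkpoint ships with a "query" prompt (see config_sentence_transformers.json below), so queries should be encoded with prompt_name="query" while code snippets are encoded as-is; for example:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Kwaipilot/OASIS-code-1.3B")

# The "query" prompt prepends the retrieval instruction defined in config_sentence_transformers.json.
query_embeddings = model.encode(["How to do quicksort in python?"], prompt_name="query")
code_embeddings = model.encode([
    "def quick_sort(arr): ...",   # placeholder snippets; pass real code here
    "def bubble_sort(arr): ...",
])
print(model.similarity(query_embeddings, code_embeddings))  # higher score = closer match
```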
+ <!--
+ ### Direct Usage (Transformers)

+ <details><summary>Click to see the direct usage in Transformers</summary>

+ </details>
+ -->
+ <!--
+ ### Downstream Usage (Sentence Transformers)

+ You can finetune this model on your own dataset.

+ <details><summary>Click to expand</summary>

+ </details>
+ -->
+ <!--
+ ### Out-of-Scope Use

+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->

+ <!--
+ ## Bias, Risks and Limitations

+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
 
+ <!--
+ ### Recommendations

+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
 
+ ## Training Details

+ ### Framework Versions
+ - Python: 3.9.20
+ - Sentence Transformers: 3.1.1
+ - Transformers: 4.45.2
+ - PyTorch: 2.4.1+cu121
+ - Accelerate: 1.0.0
+ - Datasets: 3.0.1
+ - Tokenizers: 0.20.1
+
+ ## Citation

  ### BibTeX
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config_sentence_transformers.json CHANGED
@@ -8,5 +8,5 @@
  "query": "Instruct: Given a code search query, retrieve relevant code snippet that answer the query\nQuery: "
  },
  "default_prompt_name": null,
- "similarity_fn_name": "cosine"
  }

  "query": "Instruct: Given a code search query, retrieve relevant code snippet that answer the query\nQuery: "
  },
  "default_prompt_name": null,
+ "similarity_fn_name": null
  }