---
license: mit
---

# Python clone detection

This is a CodeBERT model for detecting Python code clones, fine-tuned on the dataset shared by [PoolC](https://github.com/PoolC) on [Hugging Face Hub](https://huggingface.co/datasets/PoolC/1-fold-clone-detection-600k-5fold). The original source code for using the model can be found at https://github.com/sangHa0411/CloneDetection/blob/main/inference.py.

# How to use

To use the model efficiently, refer to this repository: https://github.com/RepoAnalysis/PythonCloneDetection, which contains a class that integrates data preprocessing, input tokenization, and model inference.

You can also follow the original inference source code at https://github.com/sangHa0411/CloneDetection/blob/main/inference.py.

More conveniently, a pipeline has been implemented for this model, and you can initialize it with only two lines of code:

```python
from transformers import pipeline

pipe = pipeline(model="Lazyhope/python-clone-detection", trust_remote_code=True)
```

To use it, pass a pair of code snippets as a tuple:

```python
# Two functionally equivalent snippets, passed to the model as plain strings.
code1 = """def token_to_inputs(feature):
    inputs = {}
    for k, v in feature.items():
        inputs[k] = torch.tensor(v).unsqueeze(0)

    return inputs"""
code2 = """def f(feature):
    return {k: torch.tensor(v).unsqueeze(0) for k, v in feature.items()}"""

# `pipe` is the pipeline created above.
is_clone = pipe((code1, code2))
print(is_clone)
# {False: 1.3705984201806132e-05, True: 0.9999862909317017}
```
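
The result shown in the example output is a dictionary that maps `False` and `True` to probabilities. The following is a minimal sketch of turning such a dictionary into a yes/no decision, assuming the output format shown above. The helper name `is_clone_pair` and the default threshold of 0.5 are illustrative choices, not part of the model or pipeline API.

```python
def is_clone_pair(scores, threshold=0.5):
    """Return True when the probability of the `True` label meets the threshold.

    `scores` is assumed to have the shape of the pipeline output shown above:
    {False: p_not_clone, True: p_clone}.
    """
    return scores[True] >= threshold


# Scores copied from the example output above, so no model download is needed.
example_scores = {False: 1.3705984201806132e-05, True: 0.9999862909317017}

print(is_clone_pair(example_scores))       # True
print(is_clone_pair(example_scores, 1.0))  # False
```

A stricter threshold reduces false positives at the cost of missing some clones; a suitable value depends on your data and is best chosen on a labeled validation set.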

# Credits

We would like to thank the original team and authors of the model and the fine-tuning dataset:

- [PoolC](https://github.com/PoolC)
- [sangHa0411](https://github.com/sangHa0411)
- [snoop2head](https://github.com/snoop2head)

# License

This model is released under the MIT license.