CuBERT: Learning and Evaluating Contextual Embedding of Source Code

Overview

This model is an unofficial Hugging Face version of "CuBERT". This particular version comes from gs://cubert/20210711_Python/pre_trained_model_epochs_2__length_512. It was trained on 2021-07-11 for 2 epochs with a 512-token context window on the Python BigQuery dataset. I manually converted the TensorFlow checkpoint to PyTorch and uploaded it here. The tokenizer has not been converted yet. All credit goes to Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi.

The other versions are available here:

cubert-20210711-Python-512

cubert-20210711-Python-1024

cubert-20210711-Python-2048

cubert-20210711-Java-512

cubert-20210711-Java-1024

cubert-20210711-Java-2048
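
The six variants above follow one naming pattern (release date, language, context length), so a small helper can build the name and load the converted PyTorch weights. This is a minimal sketch, not an official loader: `checkpoint_name` and `load_cubert` are hypothetical helpers, and the use of `AutoModel` assumes each repository ships a BERT-style `config.json` next to the weights.

```python
# Languages and context lengths taken from the list of published variants.
LANGUAGES = ("Python", "Java")
CONTEXT_LENGTHS = (512, 1024, 2048)
RELEASE_DATE = "20210711"


def checkpoint_name(language: str, context_length: int) -> str:
    """Return the model name for one of the six listed variants."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
    if context_length not in CONTEXT_LENGTHS:
        raise ValueError(f"unknown context length: {context_length!r}")
    return f"cubert-{RELEASE_DATE}-{language}-{context_length}"


def load_cubert(repo_id: str):
    """Load the converted weights (assumption: the repo has a BERT-style config)."""
    # Imported lazily so the naming helper works without transformers installed.
    from transformers import AutoModel

    return AutoModel.from_pretrained(repo_id)


# Enumerate every variant name in the same order as the list above.
all_names = [checkpoint_name(lang, n) for lang in LANGUAGES for n in CONTEXT_LENGTHS]
```

A call would look like `load_cubert("<namespace>/" + checkpoint_name("Python", 512))`, where `<namespace>` is a placeholder for the hosting account, which is not stated here. Because the tokenizer has not been converted, input token IDs would still have to come from the original CuBERT tokenizer.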

Citation:

@inproceedings{cubert,
author    = {Aditya Kanade and
             Petros Maniatis and
             Gogul Balakrishnan and
             Kensen Shi},
title     = {Learning and evaluating contextual embedding of source code},
booktitle = {Proceedings of the 37th International Conference on Machine Learning,
               {ICML} 2020, 12-18 July 2020},
series    = {Proceedings of Machine Learning Research},
publisher = {{PMLR}},
year      = {2020},
}
Model size: 356M params (Safetensors, tensor type F32)