---
extra_gated_fields:
  Institution: text
  Country: text
  I agree to use this model for non-commercial use ONLY: checkbox
  I agree not to use the model to conduct experiments that cause harm to human subjects: checkbox
widget:
- text: <s>
  example_title: Example 1
- text: <s>1234
  example_title: Example 2
- text: <s>ilov
  example_title: Example 3
- text: <s>admin
  example_title: Example 4
pipeline_tag: text-generation
tags:
- passwords
- cybersecurity
---
# PassGPT

PassGPT is a causal language model trained on password leaks. It was first introduced in [this paper](https://arxiv.org/abs/2306.01545). This version of the model was trained on passwords from the RockYou leak, keeping only those that were at most 10 characters long. If you need access to PassGPT trained on passwords of up to 16 characters, you can apply [here](https://huggingface.co/javirandor/passgpt-16characters).

**This is a curated version of the model reported in the paper**. The vocabulary was reduced to the most meaningful characters and training was slightly optimized, so results with this version are slightly better than those reported in the paper.

### Usage and License Notices
[![License](https://img.shields.io/badge/License-CC%20By%20NC%204.0-yellow)](https://github.com/javirandor/passbert/blob/main/LICENSE)
PassGPT is intended and licensed for research use only. The model and code are released under CC BY-NC 4.0 (non-commercial use only) and should not be used outside of research purposes. This model should never be used to attack real systems.

### Model description

The model inherits the [GPT2LMHeadModel](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2LMHeadModel) architecture and implements a custom [BertTokenizer](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer) that encodes each character in a password as a single token, avoiding merges. It was trained from a random initialization, and the code for training can be found in the [official repository](https://github.com/javirandor/passgpt/).
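For intuition, the snippet below (a minimal sketch, assuming the tokenizer hosted in this repository loads directly with `RobertaTokenizerFast`) shows how a password is split into one token per character, wrapped between the start-of-password and end-of-password tokens.

```
from transformers import RobertaTokenizerFast

# Load the character-level tokenizer shipped with this repository
tokenizer = RobertaTokenizerFast.from_pretrained("javirandor/passgpt-10characters")

# Each character becomes its own token; no BPE merges are applied
ids = tokenizer("password1")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# Expected output, roughly: ['<s>', 'p', 'a', 's', 's', 'w', 'o', 'r', 'd', '1', '</s>']
```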

### Password Generation

Passwords can be sampled from the model with the [built-in generation methods](https://huggingface.co/docs/transformers/v4.30.0/en/main_classes/text_generation#transformers.GenerationMixin.generate) provided by HuggingFace, using the start-of-password token (`<s>`) as the seed. The following code generates one password with PassGPT.

```
import torch
from transformers import GPT2LMHeadModel, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("javirandor/passgpt-10characters",
                                                  max_len=12,
                                                  padding="max_length", 
                                                  truncation=True,
                                                  do_lower_case=False,
                                                  strip_accents=False,
                                                  mask_token="<mask>",
                                                  unk_token="<unk>",
                                                  pad_token="<pad>",
                                                  truncation_side="right")

model = GPT2LMHeadModel.from_pretrained("javirandor/passgpt-10characters").eval()

NUM_GENERATIONS = 1

# Generate passwords sampling from the beginning of password token
g = model.generate(torch.tensor([[tokenizer.bos_token_id]]),
                  do_sample=True,
                  num_return_sequences=NUM_GENERATIONS,
                  max_length=12,
                  pad_token_id=tokenizer.pad_token_id,
                  bad_words_ids=[[tokenizer.bos_token_id]])

# Remove start of sentence token
g = g[:, 1:]

decoded = tokenizer.batch_decode(g.tolist())
decoded_clean = [i.split("</s>")[0] for i in decoded] # Get content before end of password token

# Print your sampled passwords!
print(decoded_clean)
```

You can find a more flexible script for sampling [here](https://github.com/javirandor/passgpt/blob/main/src/generate_passwords.py).
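If you only need a larger sample without the full script, one option is to raise `num_return_sequences` and loop over batches. This is a rough sketch rather than the official sampling pipeline; the batch sizes are illustrative, and `model` and `tokenizer` are assumed to be loaded as in the snippet above.

```
import torch
from collections import Counter

BATCH_SIZE = 256   # passwords sampled per generate() call (illustrative)
NUM_BATCHES = 4    # total samples = BATCH_SIZE * NUM_BATCHES

generated = []
with torch.no_grad():
    for _ in range(NUM_BATCHES):
        g = model.generate(torch.tensor([[tokenizer.bos_token_id]]),
                           do_sample=True,
                           num_return_sequences=BATCH_SIZE,
                           max_length=12,
                           pad_token_id=tokenizer.pad_token_id,
                           bad_words_ids=[[tokenizer.bos_token_id]])
        # Drop the start-of-password token and keep content before </s>
        decoded = tokenizer.batch_decode(g[:, 1:].tolist())
        generated += [p.split("</s>")[0] for p in decoded]

# The most frequent samples approximate the most probable passwords under the model
print(Counter(generated).most_common(10))
```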

### Cite our work

```
@article{rando2023passgpt,
  title={PassGPT: Password Modeling and (Guided) Generation with Large Language Models},
  author={Rando, Javier and Perez-Cruz, Fernando and Hitaj, Briland},
  journal={arXiv preprint arXiv:2306.01545},
  year={2023}
}
```