---
language: zh
license: apache-2.0
---


# G2PTL-1

## Introduction

G2PTL-1 is a geography-graph pre-trained model for address text.  
This release is the first version of G2PTL (v1.0).


## Model description
G2PTL is a Transformer model pretrained on a large corpus of Chinese addresses in a self-supervised manner. It has three pretraining objectives:

- Masked language modeling (MLM): given an address, the model randomly masks some words in the input text and predicts them. Geographical entities in the address are masked with the Whole Word Masking (WWM) approach, so each entity is masked as a unit and the model learns the co-occurrence relationships among entities.

- Hierarchical text classification (HTC): an address is a text with a hierarchical structure of province, city, district, and street. The HTC task models the hierarchical relationships among these levels.
![HTC.jpg](./Images/HTC.jpg)

- Geocoding (GC): an address corresponds to a real-world point with a latitude and longitude. The GC task learns the mapping between address text and geographical location.
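
As an illustration of the whole-word masking idea in the MLM objective, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `whole_word_mask`, the `(start, end)` span format, and the default masking probability are assumptions made for this example. The point it shows is that all characters of a geographic entity are masked together or not at all.

```python
import random

def whole_word_mask(chars, entity_spans, mask_token="[MASK]", mask_prob=0.15, seed=None):
    """Mask whole geographic entities (hypothetical helper, for illustration only).

    chars        : list of single-character tokens for one address
    entity_spans : list of (start, end) index pairs, end exclusive (assumed format)
    """
    rng = random.Random(seed)
    masked = list(chars)
    labels = [None] * len(chars)  # original tokens at masked positions
    for start, end in entity_spans:
        # One random draw per entity: every character in the span shares its fate.
        if rng.random() < mask_prob:
            for i in range(start, end):
                labels[i] = chars[i]
                masked[i] = mask_token
    return masked, labels

chars = list("浙江省杭州市余杭区五常街道")
spans = [(0, 3), (3, 6), (6, 9), (9, 13)]  # province, city, district, street
masked, labels = whole_word_mask(chars, spans, mask_prob=1.0)
```

With `mask_prob=1.0` every entity is masked, which makes the behaviour easy to inspect; a real pretraining setup would use a smaller probability.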

For more details, see https://arxiv.org/abs/2304.01559
![Model.jpg](./Images/Model.jpg)


## Intended uses & limitations


This model is designed for decision tasks based on address text, including address text understanding tasks and spatial-temporal downstream tasks that rely on address text representations.

1. Address text understanding tasks:
- Geocoding
- Named Entity Recognition
- Geographic Entity Alignment
- Address Text Similarity
- Address Text Classification
- ...
2. Spatial-temporal downstream tasks:
- Estimated Time of Arrival (ETA) Prediction
- Pick-up & Delivery Route Prediction
- Express Volume Prediction
- ...

The model currently supports only Chinese addresses. It is an encoder-only model and is not suitable for text generation scenarios such as question answering. If you need dialogue capabilities based on address text, look out for the second version of G2PTL (v2.0).


## How to use
You can use this model directly with a pipeline for masked language modeling:

```Python
>>> from transformers import pipeline, AutoModel, AutoTokenizer
>>> model = AutoModel.from_pretrained('Cainiao-AI/G2PTL', trust_remote_code=True)
>>> tokenizer = AutoTokenizer.from_pretrained('Cainiao-AI/G2PTL', trust_remote_code=True)

>>> mask_filler = pipeline(task='fill-mask', model=model, tokenizer=tokenizer)
>>> mask_filler("浙江省杭州市[MASK]杭区五常街道阿里巴巴西溪园区")
```
```json
[{'score': 1.0,
  'token': 562,
  'token_str': '余',
  'sequence': '浙 江 省 杭 州 市 余 杭 区 五 常 街 道 阿 里 巴 巴 西 溪 园 区'},
 {'score': 7.49648343401077e-09,
  'token': 1852,
  'token_str': '杭',
  'sequence': '浙 江 省 杭 州 市 杭 杭 区 五 常 街 道 阿 里 巴 巴 西 溪 园 区'},
 {'score': 5.823675763849678e-09,
  'token': 213,
  'token_str': '西',
  'sequence': '浙 江 省 杭 州 市 西 杭 区 五 常 街 道 阿 里 巴 巴 西 溪 园 区'},
 {'score': 3.383779922927488e-09,
  'token': 346,
  'token_str': '五',
  'sequence': '浙 江 省 杭 州 市 五 杭 区 五 常 街 道 阿 里 巴 巴 西 溪 园 区'},
 {'score': 2.9116642430437878e-09,
  'token': 2268,
  'token_str': '荆',
  'sequence': '浙 江 省 杭 州 市 荆 杭 区 五 常 街 道 阿 里 巴 巴 西 溪 园 区'}]
```
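
The pipeline returns one dictionary per candidate, and each `sequence` is space-separated by the tokenizer. A small post-processing step can pick the best candidate and restore the address string. The helper name `best_fill` is hypothetical; the two sample entries are copied from the output above.

```python
def best_fill(predictions):
    """Return (address, token, score) for the highest-scoring candidate."""
    top = max(predictions, key=lambda p: p["score"])
    # Remove the spaces the tokenizer inserts between characters.
    return top["sequence"].replace(" ", ""), top["token_str"], top["score"]

# Two entries copied from the fill-mask output shown above.
predictions = [
    {"score": 1.0, "token": 562, "token_str": "余",
     "sequence": "浙 江 省 杭 州 市 余 杭 区 五 常 街 道 阿 里 巴 巴 西 溪 园 区"},
    {"score": 7.49648343401077e-09, "token": 1852, "token_str": "杭",
     "sequence": "浙 江 省 杭 州 市 杭 杭 区 五 常 街 道 阿 里 巴 巴 西 溪 园 区"},
]

address, token, score = best_fill(predictions)
print(address)  # 浙江省杭州市余杭区五常街道阿里巴巴西溪园区
```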

You can also fill multiple [MASK] tokens at once in PyTorch:
```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('Cainiao-AI/G2PTL', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('Cainiao-AI/G2PTL', trust_remote_code=True)
model.eval()

text = ['浙江省杭州市[MASK][MASK][MASK]五常街道阿里巴巴西溪园区']
encoded_input = tokenizer(text, return_tensors='pt')
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**encoded_input)

# Pick the highest-scoring token at every position.
predicted_ids = torch.argmax(outputs.logits, dim=-1)
input_ids = encoded_input['input_ids']
# Keep positions whose input id is > 0, then drop the first and last (special) tokens.
print('G2PTL:', tokenizer.decode(predicted_ids[input_ids > 0][1:-1]))
```

```json
G2PTL: 浙 江 省 杭 州 市 余 杭 区 五 常 街 道 阿 里 巴 巴 西 溪 园 区
```

Here is how to use this model to get the HTC output of a given text in PyTorch:

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('Cainiao-AI/G2PTL', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('Cainiao-AI/G2PTL', trust_remote_code=True)
model.eval()

text = "浙江省杭州市五常街道阿里巴巴西溪园区"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# Output of the HTC head for the input address.
htc_layer_out = output.htc_layer_out
# Convert the head output to HTC codes, then decode the codes to Chinese text.
htc_pred = model.get_htc_code(htc_layer_out)
print('HTC Result: ', model.decode_htc_code_2_chn(htc_pred))
```
```json
HTC Result:  ['浙江省杭州市余杭区五常街道', '浙江省杭州市五常街道']
```

Here is how to use this model to get the features/embeddings of a given text in PyTorch:

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('Cainiao-AI/G2PTL', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('Cainiao-AI/G2PTL', trust_remote_code=True)
model.eval()

text = "浙江省杭州市余杭区五常街道阿里巴巴西溪园区"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# Final hidden state of the model, usable as address features.
final_hidden_state = output.final_hidden_state
```

Here is how to use this model to get cosine similarity between two address texts in PyTorch:

```python
from transformers import pipeline, AutoModel, AutoTokenizer
import torch
model = AutoModel.from_pretrained('Cainiao-AI/G2PTL', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('Cainiao-AI/G2PTL', trust_remote_code=True)
model.eval()
text = ["浙江省杭州市余杭区五常街道阿里巴巴西溪园区", "浙江省杭州市阿里巴巴西溪园区"]
encoded_input = tokenizer(text, return_tensors='pt', padding=True)
output = model(**encoded_input)
final_pooler_output = output.final_pooler_output
cos_sim = torch.cosine_similarity(final_pooler_output[0], final_pooler_output[1])
print('Cosine Similarity: ', cos_sim[0].detach().numpy())
```
```json
Cosine Similarity:  0.8974346
```
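
For an address text similarity task, the same cosine score can be used to rank several candidate addresses against a query. The sketch below uses plain Python and toy 3-dimensional vectors as stand-ins for the pooled embeddings; the helper names and the vectors are assumptions for illustration, not model outputs.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, candidates):
    """Sort candidates (text -> embedding) by similarity to the query, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for pooled address embeddings (hypothetical values).
query = [1.0, 0.0, 1.0]
candidates = {
    "address_a": [1.0, 0.0, 1.0],
    "address_b": [0.0, 1.0, 0.0],
    "address_c": [1.0, 1.0, 1.0],
}
ranking = rank_by_similarity(query, candidates)
print([text for text, _ in ranking])  # ['address_a', 'address_c', 'address_b']
```

In practice the vectors would come from the model's pooled output, converted to lists or kept as tensors and compared with `torch.cosine_similarity` as in the example above.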
## Requirements
python>=3.8
```shell
tqdm==4.65.0
torch==1.13.1
transformers==4.27.4
datasets==2.11.0
fairseq==0.12.2
```

## Citation
```bibtex
@misc{wu2023g2ptl,
      title={G2PTL: A Pre-trained Model for Delivery Address and its Applications in Logistics System}, 
      author={Lixia Wu and Jianlin Liu and Junhong Lou and Haoyuan Hu and Jianbin Zheng and Haomin Wen and Chao Song and Shu He},
      year={2023},
      eprint={2304.01559},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
```