File size: 7,778 Bytes
8b513d0
 
 
abba49f
 
8b513d0
abba49f
 
 
 
 
 
8b513d0
87f6346
8b513d0
 
 
 
 
 
abba49f
 
 
8b513d0
 
abba49f
8b513d0
abba49f
8b513d0
abba49f
 
8b513d0
abba49f
d6504ae
8b513d0
abba49f
8b513d0
abba49f
 
 
8b513d0
 
abba49f
 
f51d9c9
 
24a93ab
 
8b513d0
24a93ab
abba49f
 
 
 
24a93ab
 
8b513d0
24a93ab
 
 
abba49f
 
 
 
 
 
 
8b513d0
abba49f
24a93ab
abba49f
24a93ab
 
 
 
 
 
 
 
abba49f
 
 
 
8b513d0
 
abba49f
 
 
24a93ab
 
 
 
 
abba49f
24a93ab
abba49f
f51d9c9
abba49f
 
24a93ab
abba49f
24a93ab
abba49f
f51d9c9
abba49f
 
24a93ab
 
 
abba49f
 
 
 
 
 
 
 
 
 
 
 
 
24a93ab
 
8b513d0
abba49f
 
 
 
8b513d0
 
abba49f
8b513d0
 
 
abba49f
 
 
 
8b513d0
 
 
 
 
abba49f
8b513d0
d6504ae
8b513d0
 
 
abba49f
 
 
 
 
 
 
 
 
 
274016f
f671616
274016f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
abba49f
 
 
 
 
8b513d0
abba49f
 
8b513d0
87f6346
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
# Spacy Entity Linker

## Introduction

Spacy Entity Linker is a pipeline for spaCy that performs Linked Entity Extraction with Wikidata on a given Document.
The Entity Linking System operates by matching potential candidates from each sentence
(subject, object, prepositional phrase, compounds, etc.) to aliases from Wikidata. The package allows to easily find the
category behind each entity (e.g. "banana" is type "food" OR "Microsoft" is type "company"). It can is therefore useful
for information extraction tasks and labeling tasks.

The package was written before a working Linked Entity Solution existed inside spaCy. In comparison to spaCy's linked
entity system, it has the following advantages:

- no extensive training required (entity-matching via database)
- knowledge base can be dynamically updated without retraining
- entity categories can be easily resolved
- grouping entities by category

It also comes along with a number of disadvantages:

- it is slower than the spaCy implementation due to the use of a database for finding entities
- no context sensitivity due to the implementation of the "max-prior method" for entitiy disambiguation (an improved
  method for this is in progress)

## Use

```python
import spacy  # version 3.0.6'

# initialize language model
nlp = spacy.load("en_core_web_md")

# add pipeline (declared through entry_points in setup.py)
nlp.add_pipe("entityLinker", last=True)

doc = nlp("I watched the Pirates of the Caribbean last silvester")

# returns all entities in the whole document
all_linked_entities = doc._.linkedEntities
# iterates over sentences and prints linked entities
for sent in doc.sents:
    sent._.linkedEntities.pretty_print()

# OUTPUT:
# https://www.wikidata.org/wiki/Q194318     Pirates of the Caribbean        Series of fantasy adventure films                                                                   
# https://www.wikidata.org/wiki/Q12525597   Silvester                       the day celebrated on 31 December (Roman Catholic Church) or 2 January (Eastern Orthodox Churches)  

```

### EntityCollection

contains an array of entity elements. It can be accessed like an array but also implements the following helper
functions:

- <code>pretty_print()</code> prints out information about all contained entities
- <code>print_super_classes()</code> groups and prints all entites by their super class

```python
doc = nlp("Elon Musk was born in South Africa. Bill Gates and Steve Jobs come from the United States")
doc._.linkedEntities.print_super_entities()
# OUTPUT:
# human (3) : Elon Musk,Bill Gates,Steve Jobs
# country (2) : South Africa,United States of America
# sovereign state (2) : South Africa,United States of America
# federal state (1) : United States of America
# constitutional republic (1) : United States of America
# democratic republic (1) : United States of America
```

### EntityElement

each linked Entity is an object of type <code>EntityElement</code>. Each entity contains the methods

- <code>get_description()</code> returns description from Wikidata
- <code>get_id()</code> returns Wikidata ID
- <code>get_label()</code> returns Wikidata label
- <code>get_span()</code> returns the span from the spacy document that contains the linked entity
- <code>get_url()</code> returns the url to the corresponding Wikidata item
- <code>pretty_print()</code> prints out information about the entity element
- <code>get_sub_entities(limit=10)</code> returns EntityCollection of all entities that derive from the current
  entityElement (e.g. fruit -> apple, banana, etc.)
- <code>get_super_entities(limit=10)</code> returns EntityCollection of all entities that the current entityElement
  derives from (e.g. New England Patriots -> Football Team))

## Example

In the following example we will use SpacyEntityLinker to find find the mentioned Football Team in our text and explore
other football teams of the same type

```python

doc = nlp("I follow the New England Patriots")

patriots_entity = doc._.linkedEntities[0]
patriots_entity.pretty_print()
# OUTPUT:
# https://www.wikidata.org/wiki/Q193390     
# New England Patriots            
# National Football League franchise in Foxborough, Massachusetts                    

football_team_entity = patriots_entity.get_super_entities()[0]
football_team_entity.pretty_print()
# OUTPUT:
# https://www.wikidata.org/wiki/Q17156793 
# American football team          
# organization, in which a group of players are organized to compete as a team in American football   


for child in football_team_entity.get_sub_entities(limit=32):
    print(child)
    # OUTPUT:
    # New Orleans Saints
    # New York Giants
    # Pittsburgh Steelers
    # New England Patriots
    # Indianapolis Colts
    # Miami Seahawks
    # Dallas Cowboys
    # Chicago Bears
    # Washington Redskins
    # Green Bay Packers
    # ...
```

### Entity Linking Policy

Currently the only method for choosing an entity given different possible matches (e.g. Paris - city vs Paris -
firstname) is max-prior. This method achieves around 70% accuracy on predicting the correct entities behind link
descriptions on wikipedia.

## Note

The Entity Linker at the current state is still experimental and should not be used in production mode.

## Performance

The current implementation supports only Sqlite. This is advantageous for development because it does not requirement
any special setup and configuration. However, for more performance critical usecases, a different database with
in-memory access (e.g. Redis) should be used. This may be implemented in the future.

## Installation

To install the package run: <code>pip install spacy-entity-linker</code>

Afterwards, the knowledge base (Wikidata) must be downloaded. This can be done by calling

<code>python -m spacy_entity_linker "download_knowledge_base"</code>

This will download and extract a ~500mb file that contains a preprocessed version of Wikidata

## Data
the knowledge base was derived from this dataset: https://www.kaggle.com/kenshoresearch/kensho-derived-wikimedia-data

It was cleaned and post-procesed, including filtering out entities of "overrepresented" categories such as
  * village in China
  * train stations
  * stars in the Galaxy
  * etc.
  
The purpose behind the knowledge base cleaning was to reduce the knowledge base size, while keeping the most useful entities for general purpose applications.

Currently, the only way to change the knowledge base is a bit hacky and requires to replace or modify the underlying sqlite database. You will find it under <code>site_packages/data_spacy_entity_linker/wikidb_filtered.db</code>. The database contains 3 tables:
* <b>aliases</b>
  * en_alias (english alias)
  * en_alias_lowercase (english alias lowercased)
* <b>joined</b>
  * en_label (label of the wikidata item)
  * views (number of views of the corresponding wikipedia page (in a given period of time))
  * inlinks (number of inlinks to the corresponding wikipedia page)
  * item_id (wikidata id)
  * description (description of the wikidata item)
* <b>statements</b>
  * source_item_id (references item_id)
  * target_item_id (references item_id)
  * edge_property_id
      * 279=subclass of (https://www.wikidata.org/wiki/Property:P279)
      * 31=instance of (https://www.wikidata.org/wiki/Property:P31)
      * 361=part of (https://www.wikidata.org/wiki/Property:P361)


## Versions:

- <code>spacy_entity_linker>=0.0</code> (requires <code>spacy>=2.2,<3.0</code>)
- <code>spacy_entity_linker>=1.0</code> (requires <code>spacy>=3.0</code>)

## TODO

- [ ] implement Entity Classifier based on sentence embeddings for improved accuracy
- [ ] implement get_picture_urls() on EntityElement
- [ ] retrieve statements for each EntityElement (inlinks + outlinks)