#################################################################
#                                                               #
#      README file for evalb                                    #
#                                                               #
#                                         Satoshi Sekine (NYU)  #
#                                         Mike Collins (UPenn)  #
#                                                               #
#                                         October 1997           #
#################################################################

Contents of this README:

   [0] COPYRIGHT
   [1] INTRODUCTION
   [2] INSTALLATION AND RUN
   [3] OPTIONS
   [4] OUTPUT FORMAT FROM THE SCORER
   [5] HOW TO CREATE A GOLDFILE FROM THE PENN TREEBANK
   [6] THE PARAMETER FILE
   [7] MORE DETAILS ABOUT THE SCORING ALGORITHM


[0] COPYRIGHT

The authors abandon the copyright of this program. Everyone is
permitted to copy and distribute the program, or any portion of it,
free of charge and without restriction, provided it is not used to
harm anyone.

However, the authors would appreciate proper use of the program, and
would be grateful to be told about any bugs or problems.

This software is provided "AS IS", and the authors make no warranties,
express or implied.


[1] INTRODUCTION

Evaluation of bracketing looks simple, but in fact there are minor
differences from system to system. This program parameterizes such
minor differences and gives an informative result.

"evalb" evaluates bracketing accuracy in a test file against a gold
file. It reports recall, precision, and tagging accuracy, using the
same algorithm as (Collins ACL97).


[2] INSTALLATION AND RUN

To compile the scorer, type 

> make


To run the scorer:

> evalb -p Parameter_file Gold_file Test_file

 
For example, to use the sample files:

> evalb -p sample.prm sample.gld sample.tst



[3] OPTIONS

You can specify some system parameters as command-line options.
Other options, concerning the evaluation metrics, should be specified
in the parameter file, described later.

        -p param_file  parameter file
        -d             debug mode
        -e n           number of errors to tolerate before the scorer
                       gives up (default=10)
        -h             help



[4] OUTPUT FORMAT FROM THE SCORER

The scorer gives individual scores for each sentence, for
example:

  Sent.                        Matched  Bracket   Cross        Correct Tag
 ID  Len.  Stat. Recall Prec.  Bracket gold test Bracket Words  Tags Accuracy
============================================================================
   1    8    0  100.00 100.00     5      5    5      0      6     5    83.33
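
The per-sentence lines are whitespace-separated, which makes them easy
to post-process. A minimal Python sketch (the field names below are
ours, chosen to match the column headers; evalb itself does not define
them):

  FIELDS = ["id", "length", "status", "recall", "precision",
            "matched_brackets", "gold_brackets", "test_brackets",
            "crossing_brackets", "words", "correct_tags", "tag_accuracy"]

  def parse_sentence_line(line):
      """Split one per-sentence evalb output line into named fields."""
      return dict(zip(FIELDS, line.split()))

  print(parse_sentence_line(
      "1    8    0  100.00 100.00     5      5    5      0      6     5    83.33"))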

At the end of the output the === Summary === section gives statistics 
for all sentences, and for sentences <=40 words in length. The summary
contains the following information:

i)   Number of sentences -- total number of sentences.

ii)  Number of Error/Skip sentences -- should both be 0 if there is no
    problem with the parsed/gold files.

iii) Number of valid sentences = Number of sentences - Number of Error/Skip
    sentences 

iv)  Bracketing recall =     (number of correct constituents)
                         ----------------------------------------
                         (number of constituents in the goldfile)

v)   Bracketing precision = (number of correct constituents)
                         ----------------------------------------
                         (number of constituents in the parsed file)

vi)  Complete match = percentage of sentences where recall and precision
    are both 100%.

vii) Average crossing = (number of constituents crossing a goldfile constituent)
                        ---------------------------------------------------------
                                         (number of sentences)

viii) No crossing = percentage of sentences which have 0 crossing brackets.

ix)   2 or less crossing = percentage of sentences which have <=2 crossing brackets.

x)    Tagging accuracy = percentage of correct POS tags (but see [7].3 for
     exact details of what is counted).
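
The bracketing formulas in iv) and v) above amount to the following
arithmetic; this is an illustrative reimplementation, not the code of
the scorer itself:

  def bracketing_scores(matched, gold, test):
      """matched: correct constituents; gold/test: total constituent
      counts in the goldfile and the parsed file respectively."""
      recall    = 100.0 * matched / gold   # formula iv)
      precision = 100.0 * matched / test   # formula v)
      return recall, precision

  print(bracketing_scores(5, 5, 5))        # sentence 1 above -> (100.0, 100.0)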



[5] HOW TO CREATE A GOLDFILE FROM THE PENN TREEBANK


The gold and parsed files are in a format similar to this:

(TOP (S (INTJ (RB No)) (, ,) (NP (PRP it)) (VP (VBD was) (RB n't) (NP (NNP Black) (NNP Monday))) (. .)))

To create a gold file from the treebank:

tgrep -wn '/.*/' | tgrep_proc.prl

will produce a goldfile in the required format.  ("tgrep -wn '/.*/'" prints
parse trees, "tgrep_proc.prl" just skips blank lines).

For example, to produce a goldfile for section 23 of the treebank:

tgrep -wn '/.*/' | tail +90895 | tgrep_proc.prl | sed 2416q > sec23.gold
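
If tgrep_proc.prl is not at hand, its role as described above (skipping
blank lines) is small enough to approximate with a few lines of Python;
this stand-in assumes each tree is already printed on a single line:

  import sys

  # Drop blank lines so the output has one parse tree per line,
  # mimicking what tgrep_proc.prl does according to the note above.
  for line in sys.stdin:
      if line.strip():
          sys.stdout.write(line)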



[6] THE PARAMETER (.prm) FILE


The .prm file sets options regarding the scoring method. COLLINS.prm gives
the same scoring behaviour as the scorer used in (Collins ACL97). The
options chosen were:

1) LABELED 1

to give labelled precision/recall figures, i.e. a constituent must have the
same span *and* label as a constituent in the goldfile.

2) DELETE_LABEL TOP   

Don't count the "TOP" label (which is always given in the output of tgrep) 
when scoring. 

3) DELETE_LABEL -NONE-  

Remove traces (and all constituents which dominate nothing but traces) when
scoring. For example

.... (VP (VBD reported) (SBAR (-NONE- 0) (S (-NONE- *T*-1)))) (. .)))

would be processed to give

.... (VP (VBD reported)) (. .)))


4) DELETE_LABEL ,    -- for the purposes of scoring, remove punctuation
   DELETE_LABEL :
   DELETE_LABEL ``
   DELETE_LABEL ''
   DELETE_LABEL .

5) DELETE_LABEL_FOR_LENGTH -NONE-   -- don't include traces when calculating
                                       the length of a sentence (important
                                       when classifying a sentence as <=40
                                       words or >40 words)

6) EQ_LABEL ADVP PRT

Count ADVP and PRT as being the same label when scoring.
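
The directives above are one-per-line keyword/value entries, so a
reader for them is straightforward. The sketch below is illustrative
only (the scorer has its own built-in parameter reader); it simply
ignores any line that does not start with one of the keywords:

  def read_prm(path):
      opts = {"LABELED": False,
              "DELETE_LABEL": set(),
              "DELETE_LABEL_FOR_LENGTH": set(),
              "EQ_LABEL": []}
      with open(path) as f:
          for line in f:
              fields = line.split()
              if not fields or fields[0] not in opts:
                  continue                          # blank or comment line
              key = fields[0]
              if key == "LABELED":
                  opts[key] = fields[1] == "1"
              elif key == "EQ_LABEL":
                  opts[key].append(tuple(fields[1:3]))  # e.g. ("ADVP", "PRT")
              else:
                  opts[key].add(fields[1])
      return opts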




[7] MORE DETAILS ABOUT THE SCORING ALGORITHM


1) The scorer initially processes the files to remove all nodes specified
by DELETE_LABEL in the .prm file. It also recursively removes nodes which
dominate nothing due to all their children being removed. For example, if
-NONE- is specified as a label to be deleted, 

.... (VP (VBD reported) (SBAR (-NONE- 0) (S (-NONE- *T*-1)))) (. .)))

would be processed to give

.... (VP (VBD reported)) (. .)))
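
A sketch of this pruning step, assuming trees are stored as
(label, children) tuples with bare strings as the words (the scorer
itself uses its own internal representation):

  def delete_labels(tree, deleted):
      """Remove nodes whose label is in `deleted`, then recursively
      remove nodes left dominating nothing. Returns None if the
      whole subtree disappears."""
      if isinstance(tree, str):            # a word is always kept
          return tree
      label, children = tree
      if label in deleted:
          return None
      kept = [c for c in (delete_labels(c, deleted) for c in children)
              if c is not None]
      if not kept:                         # now dominates nothing
          return None
      return (label, kept)

In the example above, deleting "-NONE-" removes both trace leaves,
which in turn empties the S node and then the SBAR node.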

2) The scorer also removes all functional tags attached to non-terminals
(functional tags are prefixed with "-" or "=" in the treebank). For example,
"NP-SBJ" is processed to give "NP", and "NP=2" is changed to "NP".

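In code, the stripping rule is just "cut the label at the first '-' or
'='"; a minimal sketch (note that "-NONE-" begins with '-', but it has
already been removed in step 1 before this point):

  def strip_function_tags(label):
      """'NP-SBJ' -> 'NP', 'NP=2' -> 'NP', 'NP-SBJ=2' -> 'NP'."""
      if label.startswith("-"):            # e.g. '-NONE-', left alone
          return label
      for sep in "-=":
          label = label.split(sep, 1)[0]
      return label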

3) Tagging accuracy counts tags for all words *except* any tags which are
deleted by a DELETE_LABEL specification in the .prm file. (For example, for
COLLINS.prm, punctuation tagged as "," ":" etc. would not be included).

4) When calculating the length of a sentence, all words with POS tags not 
included in the "DELETE_LABEL_FOR_LENGTH" list in the .prm file are
counted. (For COLLINS.prm, only "-NONE-" is specified in this list, so
traces are removed before calculating the length of the sentence).

5) There are some subtleties in scoring when either the goldfile or parsed
file contains multiple constituents with the same span and the same
non-terminal label, e.g. (NP (NP the man)). If the goldfile contains n such
constituents for a span, and the parsed file contains m constituents with
that non-terminal, the scorer works as follows:

i) If m>n, then the precision is n/m, and recall is 100%.

ii) If n>m, then the precision is 100%, and recall is m/n.

iii) If n==m, recall and precision are both 100%.
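
These three cases are what a multiset intersection gives: for each
(span, label) combination the number of matches is min(n, m). A small
illustrative check, representing constituents as (start, end, label)
triples:

  from collections import Counter

  def match_counts(gold_spans, test_spans):
      gold, test = Counter(gold_spans), Counter(test_spans)
      matched = sum((gold & test).values())    # min(n, m) per triple
      return matched, sum(gold.values()), sum(test.values())

  # One gold NP vs. two identical test NPs over the same span (m>n):
  matched, n_gold, n_test = match_counts([(0, 2, "NP")],
                                         [(0, 2, "NP"), (0, 2, "NP")])
  print(matched / n_test, matched / n_gold)    # -> 0.5 1.0  (case i)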