#################################################################
#                                                               #
#                    README file for evalb                      #
#                                                               #
#                     Satoshi Sekine (NYU)                      #
#                     Mike Collins (UPenn)                      #
#                                                               #
#                        October 1997                           #
#################################################################

Contents of this README:

   [0] COPYRIGHT
   [1] INTRODUCTION
   [2] INSTALLATION AND RUN
   [3] OPTIONS
   [4] OUTPUT FORMAT FROM THE SCORER
   [5] HOW TO CREATE A GOLDFILE FROM THE PENN TREEBANK
   [6] THE PARAMETER FILE
   [7] MORE DETAILS ABOUT THE SCORING ALGORITHM

[0] COPYRIGHT

The authors abandon the copyright of this program. Everyone is
permitted to copy and distribute the program or a portion of the
program with no charge and no restrictions, unless it is harmful to
someone. However, the authors would appreciate it if users apply the
program properly and report any bugs or problems to them.

This software is provided "AS IS", and the authors make no warranties,
express or implied.

[1] INTRODUCTION

Evaluation of bracketing looks simple, but in fact there are minor
differences from system to system. This program parameterizes such
minor differences and gives an informative result.

"evalb" evaluates bracketing accuracy in a test file against a gold
file. It reports recall, precision, and tagging accuracy. It uses an
algorithm identical to that used in (Collins ACL97).

[2] INSTALLATION AND RUN

To compile the scorer, type

   > make

To run the scorer:

   > evalb -p Parameter_file Gold_file Test_file

For example, to use the sample files:

   > evalb -p sample.prm sample.gld sample.tst

[3] OPTIONS

You can specify system parameters via command-line options. Other
options concerning the evaluation metrics should be specified in the
parameter file, described later.

   -p param_file   parameter file
   -d              debug mode
   -e n            number of errors tolerated before the scorer
                   gives up (default=10)
   -h              help

[4] OUTPUT FORMAT FROM THE SCORER

The scorer gives individual scores for each sentence, for example:

  Sent.                        Matched  Bracket   Cross        Correct Tag
 ID  Len.  Stat. Recall Prec.  Bracket gold test Bracket Words  Tags  Accuracy
==============================================================================
  1    8    0   100.00 100.00     5     5    5      0      6      5    83.33

At the end of the output the === Summary === section gives statistics
for all sentences, and for sentences <=40 words in length. The summary
contains the following information (a sketch computing these statistics
follows the list):

i)    Number of sentences -- total number of sentences.

ii)   Number of Error/Skip sentences -- should both be 0 if there is no
      problem with the parsed/gold files.

iii)  Number of valid sentences = Number of sentences - Number of
      Error/Skip sentences

iv)   Bracketing recall =     (number of correct constituents)
                          ----------------------------------------
                          (number of constituents in the goldfile)

v)    Bracketing precision =     (number of correct constituents)
                             ------------------------------------------
                             (number of constituents in the parsed file)

vi)   Complete match = percentage of sentences where recall and
      precision are both 100%.

vii)  Average crossing = (number of constituents crossing a goldfile
                          constituent)
                         ---------------------------------------------
                                   (number of sentences)

viii) No crossing = percentage of sentences which have 0 crossing
      brackets.

ix)   2 or less crossing = percentage of sentences which have <=2
      crossing brackets.

x)    Tagging accuracy = percentage of correct POS tags (but see [7].3
      for exact details of what is counted).

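The summary statistics are simple functions of per-sentence counts. A
minimal sketch in Python (illustrative only -- evalb itself is a C
program, and the field names below are invented for this example):

   # Each sentence is summarised by invented fields: matched, gold and
   # test bracket counts, crossing-bracket count, correct tags, words.
   def summary(sentences):
       valid = [s for s in sentences if not s.get("error")]
       matched = sum(s["matched"] for s in valid)
       gold    = sum(s["gold"] for s in valid)
       test    = sum(s["test"] for s in valid)
       return {
           "recall":         100.0 * matched / gold,
           "precision":      100.0 * matched / test,
           "complete_match": 100.0 * sum(s["matched"] == s["gold"] == s["test"]
                                         for s in valid) / len(valid),
           "avg_crossing":   sum(s["crossing"] for s in valid) / len(valid),
           "no_crossing":    100.0 * sum(s["crossing"] == 0
                                         for s in valid) / len(valid),
           "two_or_less":    100.0 * sum(s["crossing"] <= 2
                                         for s in valid) / len(valid),
           "tag_accuracy":   100.0 * sum(s["correct_tags"] for s in valid)
                                   / sum(s["words"] for s in valid),
       }
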
[5] HOW TO CREATE A GOLDFILE FROM THE PENN TREEBANK

The gold and parsed files are in a format similar to this:

(TOP (S (INTJ (RB No)) (, ,) (NP (PRP it)) (VP (VBD was) (RB n't) (NP (NNP Black) (NNP Monday))) (. .)))

To create a gold file from the treebank:

   tgrep -wn '/.*/' | tgrep_proc.prl

will produce a goldfile in the required format. ("tgrep -wn '/.*/'"
prints parse trees, "tgrep_proc.prl" just skips blank lines.)

For example, to produce a goldfile for section 23 of the treebank:

   tgrep -wn '/.*/' | tail +90895 | tgrep_proc.prl | sed 2416q > sec23.gold

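If tgrep_proc.prl is not available, the blank-line filtering it is
described as doing above is easy to reproduce. A hypothetical stand-in
in Python (an assumption based only on the description above, not on
the actual script):

   import sys

   # Pass through every non-blank line of parse trees unchanged.
   for line in sys.stdin:
       if line.strip():
           sys.stdout.write(line)
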
[6] THE PARAMETER (.prm) FILE

The .prm file sets options regarding the scoring method. COLLINS.prm
gives the same scoring behaviour as the scorer used in (Collins 97).
The options chosen were:

1) LABELED 1

   to give labelled precision/recall figures, i.e. a constituent must
   have the same span *and* label as a constituent in the goldfile.

2) DELETE_LABEL TOP

   Don't count the "TOP" label (which is always given in the output of
   tgrep) when scoring.

3) DELETE_LABEL -NONE-

   Remove traces (and all constituents which dominate nothing but
   traces) when scoring. For example

   .... (VP (VBD reported) (SBAR (-NONE- 0) (S (-NONE- *T*-1)))) (. .)))

   would be processed to give

   .... (VP (VBD reported)) (. .)))

4) DELETE_LABEL , -- for the purposes of scoring remove punctuation
   DELETE_LABEL :
   DELETE_LABEL ``
   DELETE_LABEL ''
   DELETE_LABEL .

5) DELETE_LABEL_FOR_LENGTH -NONE-

   Don't include traces when calculating the length of a sentence
   (important when classifying a sentence as <=40 words or >40 words).

6) EQ_LABEL ADVP PRT

   Count ADVP and PRT as being the same label when scoring.

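Collected together, the options above correspond to parameter-file
lines like these (an illustrative reconstruction; see the distributed
COLLINS.prm for the authoritative version):

   LABELED 1
   DELETE_LABEL TOP
   DELETE_LABEL -NONE-
   DELETE_LABEL ,
   DELETE_LABEL :
   DELETE_LABEL ``
   DELETE_LABEL ''
   DELETE_LABEL .
   DELETE_LABEL_FOR_LENGTH -NONE-
   EQ_LABEL ADVP PRT
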
[7] MORE DETAILS ABOUT THE SCORING ALGORITHM

1) The scorer initially processes the files to remove all nodes
   specified by DELETE_LABEL in the .prm file. It also recursively
   removes nodes which dominate nothing due to all their children
   being removed. For example, if -NONE- is specified as a label to
   be deleted,

   .... (VP (VBD reported) (SBAR (-NONE- 0) (S (-NONE- *T*-1)))) (. .)))

   would be processed to give

   .... (VP (VBD reported)) (. .)))
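   A sketch of this pruning step (illustrative Python over nested
   (label, children) tuples; evalb itself implements this in C):

      # Delete nodes whose label is in delete_labels, then recursively
      # delete ancestors left with no children at all.
      def prune(tree, delete_labels):
          label, children = tree
          if label in delete_labels:
              return None
          kept = []
          for child in children:
              if isinstance(child, str):      # a word: kept as-is
                  kept.append(child)
              else:
                  sub = prune(child, delete_labels)
                  if sub is not None:
                      kept.append(sub)
          if not kept:                        # dominates nothing: remove
              return None
          return (label, kept)

   On the example above, the (-NONE- ...) nodes are deleted first,
   which empties S and then SBAR, so both are removed in turn.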
2) The scorer also removes all functional tags attached to
   non-terminals (functional tags are prefixed with "-" or "=" in the
   treebank). For example, "NP-SBJ" is processed to give "NP", and
   "NP=2" is changed to "NP".
3) Tagging accuracy counts tags for all words *except* any tags which
   are deleted by a DELETE_LABEL specification in the .prm file. (For
   example, for COLLINS.prm, punctuation tagged as "," ":" etc. would
   not be included.)

4) When calculating the length of a sentence, all words with POS tags
   not included in the "DELETE_LABEL_FOR_LENGTH" list in the .prm file
   are counted. (For COLLINS.prm, only "-NONE-" is specified in this
   list, so traces are removed before calculating the length of the
   sentence.)
5) There are some subtleties in scoring when either the goldfile or
   parsed file contains multiple constituents for the same span which
   have the same non-terminal label, e.g. (NP (NP the man)). If the
   goldfile contains n constituents for a given span, and the parsed
   file contains m constituents with that nonterminal, the scorer
   works as follows (see the sketch after this list):

   i)   If m>n, then the precision is n/m, recall is 100%.
   ii)  If n>m, then the precision is 100%, recall is m/n.
   iii) If n==m, recall and precision are both 100%.
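   Equivalently, for each (span, label) pair, min(n, m) constituents
   count as matched, which yields all three cases above. A minimal
   Python sketch:

      # n gold and m parsed constituents share one span and label.
      def duplicate_scores(n, m):
          matched = min(n, m)
          return (100.0 * matched / n,    # recall
                  100.0 * matched / m)    # precision

      assert duplicate_scores(1, 2) == (100.0, 50.0)   # m>n
      assert duplicate_scores(2, 1) == (50.0, 100.0)   # n>m
      assert duplicate_scores(2, 2) == (100.0, 100.0)  # n==m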