---
license: mit
language:
- en
- fr
- es
---

# 🔥 Classifiers of FinTOC 2022 Shared task winners (ISPRAS team) 🔥

Classifiers of textual lines of English, French, and Spanish financial prospectuses in PDF format for the [FinTOC 2022 Shared task](https://wp.lancs.ac.uk/cfie/fintoc2022/).

## 🤗 Source code 🤗

Training scripts are available in the repository https://github.com/ispras/dedoc/ (see the `scripts/fintoc2022` directory).

## 🤗 Task description 🤗

Lines are classified in two stages:
1. Binary classification into title/non-title lines (the title detection task).
2. Classification of title lines into title depth classes (the TOC generation task).

There are two types of classifiers, one per stage (see the sketch after this list):
1. For the first stage, **binary classifiers** are trained. They return `bool` values: `True` for title lines and `False` for non-title lines.
2. For the second stage, **target classifiers** are trained. They return `int` title depth classes from 1 to 6; more important lines have a smaller depth.
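
Below is a minimal sketch of how the two stages fit together. It assumes the loaded classifiers expose a scikit-learn-style `predict()` and that `features` is a feature matrix with one row per document line; it illustrates the pipeline rather than the exact dedoc interface.

```python
import numpy as np

def classify_lines(binary_clf, target_clf, features: np.ndarray):
    """Return a title depth (1-6) for title lines and None for other lines."""
    # Stage 1 (title detection): keep only lines predicted to be titles.
    is_title = binary_clf.predict(features).astype(bool)

    depths = [None] * len(features)
    if is_title.any():
        # Stage 2 (TOC generation): assign a depth class to each title line,
        # where 1 is the most important level.
        title_depths = target_clf.predict(features[is_title])
        for idx, depth in zip(np.flatnonzero(is_title), title_depths):
            depths[idx] = int(depth)
    return depths
```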

## 🤗 Results evaluation 🤗

The training dataset contains English, French, and Spanish documents, so three language categories are available ("en", "fr", "sp").
To obtain document lines, we use the [dedoc](https://dedoc.readthedocs.io) library (`dedoc.readers.PdfTabbyReader`, `dedoc.readers.PdfTxtlayerReader`), so two reader categories are available ("tabby", "txt_layer").
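
A minimal sketch of line extraction with one of these readers is shown below; the exact `read()` signature and line attributes may differ between dedoc versions, so treat it as an illustration rather than a reference.

```python
from dedoc.readers import PdfTabbyReader  # or PdfTxtlayerReader

reader = PdfTabbyReader()
# Read a PDF with a text layer and iterate over the extracted lines.
document = reader.read(file_path="prospectus.pdf", parameters={})
for line in document.lines:
    print(line.line)  # the raw text of the line
```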

To obtain the FinTOC structure, we use the method described in [our article](https://aclanthology.org/2022.fnp-1.13.pdf) (the winning approach of the FinTOC 2022 Shared task).
The results of this method (3-fold cross-validation on the FinTOC 2022 training dataset) for different languages and readers are given in the table below (they have changed slightly since the competition finished).
As in the FinTOC 2022 Shared task, we use two metrics for evaluation (taken from the [evaluation article](https://aclanthology.org/2022.fnp-1.12.pdf)):
**TD** is the F1 measure for the title detection task, and **TOC** is the harmonic mean of the Inex F1 score and the Inex level accuracy for the TOC generation task.
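
The combination of the two TOC components can be sketched as follows; it assumes `inex_f1` and `inex_level_accuracy` are already computed (their definitions come from the article linked above).

```python
def toc_score(inex_f1: float, inex_level_accuracy: float) -> float:
    """Harmonic mean of the Inex F1 score and the Inex level accuracy."""
    if inex_f1 + inex_level_accuracy == 0:
        return 0.0
    return 2 * inex_f1 * inex_level_accuracy / (inex_f1 + inex_level_accuracy)

# For example, toc_score(60.0, 55.0) is approximately 57.4.
```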

| Language / reader | TD (fold 0) | TD (fold 1) | TD (fold 2) | TD (mean) | TOC (fold 0) | TOC (fold 1) | TOC (fold 2) | TOC (mean) |
|---|---|---|---|---|---|---|---|---|
| en_tabby | 0.811522 | 0.833798 | 0.864239 | 0.836520 | 56.5 | 58.0 | 64.9 | 59.800000 |
| en_txt_layer | 0.821360 | 0.853258 | 0.833623 | 0.836081 | 57.8 | 62.1 | 57.8 | 59.233333 |
| fr_tabby | 0.753409 | 0.744232 | 0.782169 | 0.759937 | 51.2 | 47.9 | 51.5 | 50.200000 |
| fr_txt_layer | 0.740530 | 0.794460 | 0.766059 | 0.767016 | 45.6 | 52.2 | 50.1 | 49.300000 |
| sp_tabby | 0.606718 | 0.622839 | 0.599094 | 0.609550 | 37.1 | 43.6 | 43.4 | 41.366667 |
| sp_txt_layer | 0.629052 | 0.667976 | 0.446827 | 0.581285 | 46.4 | 48.8 | 30.7 | 41.966667 |

## 🤗 See also 🤗

Please see our article [ISPRAS@FinTOC-2022 shared task: Two-stage TOC generation model](https://aclanthology.org/2022.fnp-1.13.pdf)
for more information about the FinTOC 2022 Shared task and our method of solving it.
We will be grateful if you cite our work (BibTeX citation below).

```bibtex
@inproceedings{bogatenkova-etal-2022-ispras,
    title = "{ISPRAS}@{F}in{TOC}-2022 Shared Task: Two-stage {TOC} Generation Model",
    author = "Bogatenkova, Anastasiia  and
      Belyaeva, Oksana Vladimirovna  and
      Perminov, Andrew Igorevich  and
      Kozlov, Ilya Sergeevich",
    editor = "El-Haj, Mahmoud  and
      Rayson, Paul  and
      Zmandar, Nadhem",
    booktitle = "Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.fnp-1.13",
    pages = "89--94"
}
```