{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "DATA_PATH = Path(\"/data/tommaso/data\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', metadata={'source': PosixPath('/data/tommaso/data/papers_processed/1.txt'), 'filename': '1.txt', 'file_directory': '/data/tommaso/data/papers_processed', 'filetype': 'text/plain', 'category': 'UncategorizedText'}),\n",
       " Document(page_content='Abstract: Rheumatoid arthritis is an autoimmune disorder of complex disease etiology. Currently available serological diagnostic markers lack in terms of sensitivity and specificity and thus addi- tional biomarkers are warranted for early disease diagnosis and management. We aimed to screen and compare serum proteome profiles of rheumatoid arthritis serotypes with healthy controls in the Pakistani population for identification of potential disease biomarkers. Serum samples from rheumatoid arthritis patients and healthy controls were enriched for low abundance proteins using ProteoMinerTM columns. Rheumatoid arthritis patients were assigned to one of the four serotypes based on anti-citrullinated peptide antibodies and rheumatoid factor. Serum protein profiles were ana- lyzed via liquid chromatography-tandem mass spectrometry. The changes in the protein abundances were determined using label-free quantification software ProgenesisQITM followed by pathway analysis. Findings were validated in an independent cohort of patients and healthy controls using an enzyme-linked immunosorbent assay. A total of 213 proteins were identified.', metadata={'source': PosixPath('/data/tommaso/data/papers_processed/1.txt'), 'filename': '1.txt', 'file_directory': '/data/tommaso/data/papers_processed', 'filetype': 'text/plain', 'category': 'NarrativeText'}),\n",
       " Document(page_content='Comparative analysis of all groups (false discovery rate < 0.05, >2-fold change, and identified with ≥2 unique peptides) identified ten proteins that were differentially expressed between rheumatoid arthritis serotypes and healthy controls including pregnancy zone protein, selenoprotein P, C4b-binding protein beta chain, apolipoprotein M, N-acetylmuramoyl-L-alanine amidase, catalytic chain, oncoprotein-induced transcript 3 protein, Carboxypeptidase N subunit 2, Apolipoprotein C-I and Apolipoprotein C-III. Pathway analysis predicted inhibition of liver X receptor/retinoid X receptor activation pathway and production of nitric oxide and reactive oxygen species pathway in macrophages in all serotypes. A catalogue of potential serum biomarkers for rheumatoid arthritis were identified. These biomark- ers can be further evaluated in larger cohorts from different populations for their diagnostic and prognostic potential.', metadata={'source': PosixPath('/data/tommaso/data/papers_processed/1.txt'), 'filename': '1.txt', 'file_directory': '/data/tommaso/data/papers_processed', 'filetype': 'text/plain', 'category': 'NarrativeText'}),\n",
       " Document(page_content='Keywords: rheumatoid arthritis; serum; proteomics; biomarkers; LC-MS', metadata={'source': PosixPath('/data/tommaso/data/papers_processed/1.txt'), 'filename': '1.txt', 'file_directory': '/data/tommaso/data/papers_processed', 'filetype': 'text/plain', 'category': 'Title'})]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain.document_loaders import UnstructuredFileLoader\n",
    "from unstructured.cleaners.core import clean_extra_whitespace, group_broken_paragraphs\n",
    "\n",
    "loader = UnstructuredFileLoader(\n",
    "    DATA_PATH / \"papers_processed\" / \"1.txt\",\n",
    "    strategy=\"hi_res\",\n",
    "    mode=\"elements\",\n",
    "    post_processors=[\n",
    "        clean_extra_whitespace,\n",
    "        group_broken_paragraphs,\n",
    "    ])\n",
    "docs = loader.load()\n",
    "docs[:4]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain.document_loaders.parsers import GrobidParser\n",
    "from langchain.document_loaders.generic import GenericLoader\n",
    "\n",
    "loader = GenericLoader.from_filesystem(\n",
    "    DATA_PATH / \"papers\",\n",
    "    glob=\"1.pdf\",\n",
    "    suffixes=[\".pdf\"],\n",
    "    parser=GrobidParser(segment_sentences=False),\n",
    ")\n",
    "docs = loader.load()\n",
    "docs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import spacy\n",
    "spacy.require_gpu(gpu_id=1)\n",
    "\n",
    "import spacy_transformers # needed by SpacyTextSplitter when using the en_core_web_trf pipeline\n",
    "from langchain.text_splitter import SpacyTextSplitter\n",
    "from itertools import chain\n",
    "\n",
    "splitter = SpacyTextSplitter(chunk_size=1000, pipeline=\"en_core_web_trf\")\n",
    "chunks = splitter.split_documents(docs)\n",
    "chunks[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## BioBERT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Rheumatoid factor (RF) and anti-citrullinated peptide antibodies (ACPA) are considered as the main serological markers for RA that have been included in the 2010 American College of Rheumatology (ACR)/European League against Rheumatism (EULAR) classification criteria for RA [7][8][9].Based on 2010 ACR/EULAR classification criteria for RA, clinically diagnosed RA patients can be categorized into four serotypes: (i) positive for both RF and ACPA, (ii) positive for RF and negative for ACPA, (iii) negative for RF and positive for ACPA and (iv) negative for both RF and ACPA.However, the levels of RF are also perturbed in connective tissue diseases [10] and some chronic infectious diseases such as hepatitis B and hepatitis C virus infections [11].RF is thus not a specific diagnostic marker for RA.ACPA is comparatively a more specific biomarker and two-thirds of the individuals ultimately diagnosed with RA were tested positive for ACPAs 6-10 years before diagnosis [12,13].A total of 1-3% of the healthy population may also test positive for ACPAs suggesting the decreased specificity of this biomarker [14][15][16][17].Therefore, it is important to discover the biomarkers for the diagnosis of RA with both increased sensitivity and specificity.'"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs[1].page_content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use a pipeline as a high-level helper\n",
    "from transformers import pipeline\n",
    "\n",
    "pipe = pipeline(\"question-answering\", model=\"dmis-lab/biobert-large-cased-v1.1-squad\", device=1, handle_impossible_answer=True, max_seq_len=512)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BertForQuestionAnswering(\n",
       "  (bert): BertModel(\n",
       "    (embeddings): BertEmbeddings(\n",
       "      (word_embeddings): Embedding(58996, 1024, padding_idx=0)\n",
       "      (position_embeddings): Embedding(512, 1024)\n",
       "      (token_type_embeddings): Embedding(2, 1024)\n",
       "      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)\n",
       "      (dropout): Dropout(p=0.1, inplace=False)\n",
       "    )\n",
       "    (encoder): BertEncoder(\n",
       "      (layer): ModuleList(\n",
       "        (0-23): 24 x BertLayer(\n",
       "          (attention): BertAttention(\n",
       "            (self): BertSelfAttention(\n",
       "              (query): Linear(in_features=1024, out_features=1024, bias=True)\n",
       "              (key): Linear(in_features=1024, out_features=1024, bias=True)\n",
       "              (value): Linear(in_features=1024, out_features=1024, bias=True)\n",
       "              (dropout): Dropout(p=0.1, inplace=False)\n",
       "            )\n",
       "            (output): BertSelfOutput(\n",
       "              (dense): Linear(in_features=1024, out_features=1024, bias=True)\n",
       "              (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)\n",
       "              (dropout): Dropout(p=0.1, inplace=False)\n",
       "            )\n",
       "          )\n",
       "          (intermediate): BertIntermediate(\n",
       "            (dense): Linear(in_features=1024, out_features=4096, bias=True)\n",
       "            (intermediate_act_fn): GELUActivation()\n",
       "          )\n",
       "          (output): BertOutput(\n",
       "            (dense): Linear(in_features=4096, out_features=1024, bias=True)\n",
       "            (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)\n",
       "            (dropout): Dropout(p=0.1, inplace=False)\n",
       "          )\n",
       "        )\n",
       "      )\n",
       "    )\n",
       "  )\n",
       "  (qa_outputs): Linear(in_features=1024, out_features=2, bias=True)\n",
       ")"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipe.model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Question: How did the authors detect protein abundances?\n",
      "Answer 1 (score: 0.121): 'Mass spectrometry (MS)-based serum proteomics'\n",
      "Answer 2 (score: 0.114): 'ProgenesisQITM followed by pathway analysis'\n",
      "\n",
      "\n",
      "Question: How can RA patients be categorized?\n",
      "Answer 1 (score: 0.377): 'four serotypes'\n",
      "Answer 2 (score: 0.320): 'into four serotypes'\n",
      "\n"
     ]
    }
   ],
   "source": [
    "questions = [\n",
    "    \"How did the authors detect protein abundances?\",\n",
    "    \"How can RA patients be categorized?\"\n",
    "]\n",
    "context = \"\\n\".join([x.page_content for x in docs])\n",
    "\n",
    "for q in questions:\n",
    "    a = pipe(question=q, context=context, top_k=2)\n",
    "    print(f'''\n",
    "Question: {q}\n",
    "Answer 1 (score: {a[0][\"score\"]:.3f}): '{a[0][\"answer\"]}'\n",
    "Answer 2 (score: {a[1][\"score\"]:.3f}): '{a[1][\"answer\"]}'\n",
    "''')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'score': 0.12108789384365082,\n",
       " 'start': 4854,\n",
       " 'end': 4899,\n",
       " 'answer': 'Mass spectrometry (MS)-based serum proteomics'}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "context = \"\\n\".join([x.page_content for x in docs])\n",
    "pipe(question=\"How did the authors detect protein abundances?\", context=context)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## BioGPT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain import HuggingFaceHub, HuggingFacePipeline\n",
    "\n",
    "HUGGINGFACE_TOKEN = \"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\"\n",
    "\n",
    "# llm = HuggingFacePipeline.from_model_id(\n",
    "#     model_id=\"stanford-crfm/BioMedLM\",\n",
    "#     task=\"text-generation\",\n",
    "#     device=1,\n",
    "#     model_kwargs={\"temperature\": 0},\n",
    "# )\n",
    "\n",
    "from langchain import PromptTemplate, LLMChain\n",
    "\n",
    "template = \"\"\"You are a useful and reliableQuestion: {question}\n",
    "Context: {context}\"\"\"\n",
    "prompt = PromptTemplate(template=template, input_variables=[\"question\", \"context\"])\n",
    "llm = HuggingFaceHub(\n",
    "    repo_id=\"microsoft/BioGPT-Large-PubMedQA\",\n",
    "    model_kwargs={\"temperature\": 0.1, \"max_length\":200},\n",
    "    huggingfacehub_api_token=HUGGINGFACE_TOKEN\n",
    ")\n",
    "llm_chain = LLMChain(prompt=prompt, llm=llm)\n",
    "question = \"How did the authors detect protein abundances?\"\n",
    "context = \"\\n\".join([x.page_content for x in chunks])\n",
    "\n",
    "# print(llm_chain.run(question=question, context=context))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='Rheumatoid arthritis (RA) is an autoimmune disorder of complex disease etiology.RA leads to the inflammation of joints and surrounding synovial membrane [1].The global prevalence rate of RA is 0.24% and RA has been ranked as the 42nd highest contributor to global disability [2].Diagnosing RA is a highly individualized process and is based on a combination of both clinical manifestations and serological assays.Early disease diagnosis is the key to prevent joint damage and permanent physical disability in RA [3].RA is considered to be a continuum that begins with a disease-susceptibility stage characterized by a combination of genetic risk factors.This stage proceeds through a pre-clinical stage before the development of early RA characterized by articular inflammation.Environmental and microbial triggers continuously operate across this continuum.Immune-mediated etiology associated with stromal tissue dysregulation contributes to the chronic inflammation and ultimate articular destruction that is identified as established RA [4,5].A number of proteins and pathways have been linked to the disease pathogenesis of RA.However, there are still some gaps in current knowledge.Research aimed at the better clarification of these mechanisms can enable the development of more specific disease-modifying therapies [6].', metadata={'text': 'Rheumatoid arthritis (RA) is an autoimmune disorder of complex disease etiology.RA leads to the inflammation of joints and surrounding synovial membrane [1].The global prevalence rate of RA is 0.24% and RA has been ranked as the 42nd highest contributor to global disability [2].Diagnosing RA is a highly individualized process and is based on a combination of both clinical manifestations and serological assays.Early disease diagnosis is the key to prevent joint damage and permanent physical disability in RA [3].RA is considered to be a continuum that begins with a disease-susceptibility stage characterized by a 
combination of genetic risk factors.This stage proceeds through a pre-clinical stage before the development of early RA characterized by articular inflammation.Environmental and microbial triggers continuously operate across this continuum.Immune-mediated etiology associated with stromal tissue dysregulation contributes to the chronic inflammation and ultimate articular destruction that is identified as established RA [4,5].A number of proteins and pathways have been linked to the disease pathogenesis of RA.However, there are still some gaps in current knowledge.Research aimed at the better clarification of these mechanisms can enable the development of more specific disease-modifying therapies [6].', 'para': '11', 'bboxes': \"[[{'page': '1', 'x': '187.65', 'y': '696.70', 'h': '354.85', 'w': '9.58'}], [{'page': '1', 'x': '545.55', 'y': '696.70', 'h': '14.12', 'w': '9.58'}, {'page': '1', 'x': '166.39', 'y': '709.26', 'h': '341.80', 'w': '9.58'}], [{'page': '1', 'x': '511.79', 'y': '709.26', 'h': '47.49', 'w': '9.58'}, {'page': '1', 'x': '166.10', 'y': '721.81', 'h': '393.18', 'w': '9.58'}, {'page': '1', 'x': '166.39', 'y': '734.36', 'h': '88.77', 'w': '9.58'}], [{'page': '1', 'x': '258.26', 'y': '734.36', 'h': '301.02', 'w': '9.58'}, {'page': '1', 'x': '166.39', 'y': '746.91', 'h': '288.55', 'w': '9.58'}], [{'page': '1', 'x': '458.05', 'y': '746.91', 'h': '101.22', 'w': '9.58'}, {'page': '1', 'x': '166.39', 'y': '759.47', 'h': '346.80', 'w': '9.58'}], [{'page': '2', 'x': '187.65', 'y': '98.05', 'h': '371.62', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '110.60', 'h': '248.38', 'w': '9.58'}], [{'page': '2', 'x': '420.94', 'y': '110.60', 'h': '138.33', 'w': '9.58'}, {'page': '2', 'x': '166.10', 'y': '123.15', 'h': '394.83', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '135.71', 'h': '20.27', 'w': '9.58'}], [{'page': '2', 'x': '190.03', 'y': '135.71', 'h': '370.99', 'w': '9.58'}], [{'page': '2', 'x': '166.39', 'y': '148.26', 'h': '392.89', 'w': '9.58'}, {'page': 
'2', 'x': '166.39', 'y': '160.81', 'h': '392.88', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '173.37', 'h': '38.95', 'w': '9.58'}], [{'page': '2', 'x': '208.46', 'y': '173.37', 'h': '352.47', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '185.92', 'h': '47.87', 'w': '9.58'}], [{'page': '2', 'x': '216.91', 'y': '185.92', 'h': '256.92', 'w': '9.58'}], [{'page': '2', 'x': '477.36', 'y': '185.92', 'h': '81.91', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '198.47', 'h': '392.88', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '211.02', 'h': '141.30', 'w': '9.58'}]]\", 'pages': \"('1', '2')\", 'section_title': 'Introduction', 'section_number': '1.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='Rheumatoid factor (RF) and anti-citrullinated peptide antibodies (ACPA) are considered as the main serological markers for RA that have been included in the 2010 American College of Rheumatology (ACR)/European League against Rheumatism (EULAR) classification criteria for RA [7][8][9].Based on 2010 ACR/EULAR classification criteria for RA, clinically diagnosed RA patients can be categorized into four serotypes: (i) positive for both RF and ACPA, (ii) positive for RF and negative for ACPA, (iii) negative for RF and positive for ACPA and (iv) negative for both RF and ACPA.However, the levels of RF are also perturbed in connective tissue diseases [10] and some chronic infectious diseases such as hepatitis B and hepatitis C virus infections [11].RF is thus not a specific diagnostic marker for RA.ACPA is comparatively a more specific biomarker and two-thirds of the individuals ultimately diagnosed with RA were tested positive for ACPAs 6-10 years before diagnosis [12,13].A total of 1-3% of the healthy population may also test positive for ACPAs suggesting the decreased specificity of this biomarker [14][15][16][17].Therefore, it is important to discover the biomarkers for the diagnosis of RA with both increased sensitivity and specificity.', metadata={'text': 'Rheumatoid factor (RF) and anti-citrullinated peptide antibodies (ACPA) are considered as the main serological markers for RA that have been included in the 2010 American College of Rheumatology (ACR)/European League against Rheumatism (EULAR) classification criteria for RA [7][8][9].Based on 2010 ACR/EULAR classification criteria for RA, clinically diagnosed RA patients can be categorized into four serotypes: (i) positive for both RF and ACPA, (ii) positive for RF and negative for ACPA, (iii) negative for RF and positive for ACPA and (iv) negative for both RF and ACPA.However, the levels of RF are also perturbed in connective tissue diseases [10] and some chronic infectious diseases 
such as hepatitis B and hepatitis C virus infections [11].RF is thus not a specific diagnostic marker for RA.ACPA is comparatively a more specific biomarker and two-thirds of the individuals ultimately diagnosed with RA were tested positive for ACPAs 6-10 years before diagnosis [12,13].A total of 1-3% of the healthy population may also test positive for ACPAs suggesting the decreased specificity of this biomarker [14][15][16][17].Therefore, it is important to discover the biomarkers for the diagnosis of RA with both increased sensitivity and specificity.', 'para': '6', 'bboxes': \"[[{'page': '2', 'x': '187.65', 'y': '223.58', 'h': '373.27', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '236.13', 'h': '392.88', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '248.68', 'h': '394.53', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '261.24', 'h': '133.81', 'w': '9.58'}], [{'page': '2', 'x': '303.29', 'y': '261.24', 'h': '257.23', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '273.79', 'h': '393.08', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '286.34', 'h': '392.88', 'w': '9.58'}, {'page': '2', 'x': '166.10', 'y': '298.90', 'h': '272.66', 'w': '9.58'}], [{'page': '2', 'x': '441.85', 'y': '298.90', 'h': '117.43', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '311.45', 'h': '392.88', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '324.00', 'h': '240.16', 'w': '9.58'}], [{'page': '2', 'x': '409.64', 'y': '324.00', 'h': '149.63', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '336.55', 'h': '67.99', 'w': '9.58'}], [{'page': '2', 'x': '236.99', 'y': '336.55', 'h': '322.28', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '349.11', 'h': '392.88', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '361.66', 'h': '107.38', 'w': '9.58'}], [{'page': '2', 'x': '276.86', 'y': '361.66', 'h': '282.42', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '374.21', 'h': '325.69', 'w': '9.58'}], [{'page': '2', 'x': '495.20', 'y': '374.21', 'h': '64.08', 'w': '9.58'}, {'page': 
'2', 'x': '166.39', 'y': '386.77', 'h': '393.27', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '399.32', 'h': '65.18', 'w': '9.58'}]]\", 'pages': \"('2', '2')\", 'section_title': 'Introduction', 'section_number': '1.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='Mass spectrometry (MS)-based serum proteomics has emerged as a powerful technology in biological research targeted at the RA biomarker discovery [18,19].Several proteins and peptides have been identified that are unique to serum proteome of RA patients [18,20].A recent study compared the serum proteome profiles of seronegative patients with healthy controls [21].However, to our knowledge, no study has compared the serum proteome profiles of all the RA serotypes based on ACPAs and RF.Furthermore, the proteomic profiles of Pakistani RA patients have not been investigated in any previous study.This study aims to screen the RA serotypes, based on ACPAs and RF, and compare them with healthy controls in the Pakistani population for the identification of biomarkers that are differentially expressed (DE) between RA patients and healthy controls.', metadata={'text': 'Mass spectrometry (MS)-based serum proteomics has emerged as a powerful technology in biological research targeted at the RA biomarker discovery [18,19].Several proteins and peptides have been identified that are unique to serum proteome of RA patients [18,20].A recent study compared the serum proteome profiles of seronegative patients with healthy controls [21].However, to our knowledge, no study has compared the serum proteome profiles of all the RA serotypes based on ACPAs and RF.Furthermore, the proteomic profiles of Pakistani RA patients have not been investigated in any previous study.This study aims to screen the RA serotypes, based on ACPAs and RF, and compare them with healthy controls in the Pakistani population for the identification of biomarkers that are differentially expressed (DE) between RA patients and healthy controls.', 'para': '5', 'bboxes': \"[[{'page': '2', 'x': '187.65', 'y': '411.87', 'h': '373.27', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '424.42', 'h': '319.69', 'w': '9.58'}], [{'page': '2', 'x': '489.19', 'y': '424.42', 'h': '70.09', 'w': 
'9.58'}, {'page': '2', 'x': '166.39', 'y': '436.98', 'h': '394.62', 'w': '9.58'}], [{'page': '2', 'x': '166.01', 'y': '449.53', 'h': '393.66', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '462.08', 'h': '57.92', 'w': '9.58'}], [{'page': '2', 'x': '228.10', 'y': '462.08', 'h': '331.17', 'w': '9.58'}, {'page': '2', 'x': '166.10', 'y': '474.64', 'h': '262.67', 'w': '9.58'}], [{'page': '2', 'x': '432.38', 'y': '474.64', 'h': '126.90', 'w': '9.58'}, {'page': '2', 'x': '166.10', 'y': '487.19', 'h': '370.43', 'w': '9.58'}], [{'page': '2', 'x': '539.87', 'y': '487.19', 'h': '19.41', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '499.74', 'h': '392.88', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '512.30', 'h': '392.88', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '524.85', 'h': '315.47', 'w': '9.58'}]]\", 'pages': \"('2', '2')\", 'section_title': 'Introduction', 'section_number': '1.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='The study was approved by the institutional review board (IRB) of the National University of Sciences and Technology (NUST), Islamabad, Pakistan, and written informed consent was obtained from all the study subjects.Human blood sera were collected from Pakistani RA patients that were diagnosed according to 2010 ACR/EULAR criteria [7] as well as healthy controls.The venous blood was collected from each patient in a 5 mL BD Vacutainer ® tubes (BD vacutainer TM, Frankin Lakes, NJ, USA) containing spray-coated silica and a polymer gel for serum separation.Butterfly needle was used depending on the condition of the patient.The samples were allowed to clot, and the serum was carefully alliquoted and stored at -80 • C. ACPA-status was evaluated using the commercial ACPA AESKULISA ® enzyme-linked immunosorbent assay (ELISA) assay kit (AESKU.Diagnostics, Wendelsheim, Germany).RF-status was determined using a latex agglutination slide test kit for RF (Werfen, Barcelona, Spain).A total of 18 patients (mean age ± SD = 40.1 ± 12.13) selected for the study were divided into 4 cohorts.The first cohort included RA patients that were double-positive for both RF and ACPA (n = 5), the second and third cohort included RA patients that were either positive for RF or ACPA (n = 5 each) and the fourth cohort included RA patients that were negative for both of these serological markers (n = 3).', metadata={'text': 'The study was approved by the institutional review board (IRB) of the National University of Sciences and Technology (NUST), Islamabad, Pakistan, and written informed consent was obtained from all the study subjects.Human blood sera were collected from Pakistani RA patients that were diagnosed according to 2010 ACR/EULAR criteria [7] as well as healthy controls.The venous blood was collected from each patient in a 5 mL BD Vacutainer ® tubes (BD vacutainer TM, Frankin Lakes, NJ, USA) containing spray-coated silica and a polymer gel for serum 
separation.Butterfly needle was used depending on the condition of the patient.The samples were allowed to clot, and the serum was carefully alliquoted and stored at -80 • C. ACPA-status was evaluated using the commercial ACPA AESKULISA ® enzyme-linked immunosorbent assay (ELISA) assay kit (AESKU.Diagnostics, Wendelsheim, Germany).RF-status was determined using a latex agglutination slide test kit for RF (Werfen, Barcelona, Spain).A total of 18 patients (mean age ± SD = 40.1 ± 12.13) selected for the study were divided into 4 cohorts.The first cohort included RA patients that were double-positive for both RF and ACPA (n = 5), the second and third cohort included RA patients that were either positive for RF or ACPA (n = 5 each) and the fourth cohort included RA patients that were negative for both of these serological markers (n = 3).', 'para': '7', 'bboxes': \"[[{'page': '2', 'x': '187.65', 'y': '576.26', 'h': '371.62', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '588.81', 'h': '392.88', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '601.36', 'h': '217.09', 'w': '9.58'}], [{'page': '2', 'x': '386.61', 'y': '601.36', 'h': '172.66', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '613.91', 'h': '392.88', 'w': '9.58'}, {'page': '2', 'x': '165.98', 'y': '626.47', 'h': '107.48', 'w': '9.58'}], [{'page': '2', 'x': '276.54', 'y': '626.47', 'h': '282.74', 'w': '9.58'}, {'page': '2', 'x': '166.04', 'y': '639.02', 'h': '47.95', 'w': '9.58'}, {'page': '2', 'x': '213.98', 'y': '637.03', 'h': '5.66', 'w': '7.28'}, {'page': '2', 'x': '222.64', 'y': '639.02', 'h': '336.63', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '651.57', 'h': '198.34', 'w': '9.58'}], [{'page': '2', 'x': '367.82', 'y': '651.57', 'h': '191.46', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '664.13', 'h': '107.76', 'w': '9.58'}], [{'page': '2', 'x': '277.53', 'y': '664.13', 'h': '282.13', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '676.36', 'h': '125.23', 'w': '9.90'}, {'page': '2', 'x': 
'294.21', 'y': '674.45', 'h': '3.94', 'w': '6.92'}, {'page': '2', 'x': '298.74', 'y': '676.68', 'h': '260.92', 'w': '9.58'}, {'page': '2', 'x': '166.01', 'y': '689.23', 'h': '55.35', 'w': '9.58'}, {'page': '2', 'x': '221.35', 'y': '687.24', 'h': '5.66', 'w': '7.28'}, {'page': '2', 'x': '229.57', 'y': '689.23', 'h': '330.96', 'w': '9.58'}, {'page': '2', 'x': '165.90', 'y': '701.79', 'h': '112.72', 'w': '9.58'}], [{'page': '2', 'x': '281.70', 'y': '701.79', 'h': '277.57', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '714.34', 'h': '159.98', 'w': '9.58'}], [{'page': '2', 'x': '329.49', 'y': '714.02', 'h': '230.78', 'w': '9.90'}, {'page': '2', 'x': '166.39', 'y': '726.89', 'h': '223.73', 'w': '9.58'}], [{'page': '2', 'x': '393.21', 'y': '726.89', 'h': '166.06', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '739.44', 'h': '392.88', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '752.00', 'h': '392.89', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '764.55', 'h': '394.63', 'w': '9.58'}]]\", 'pages': \"('2', '2')\", 'section_title': 'Study Subjects and Serum Collection', 'section_number': '2.1.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='Life 2022, 12, 464 3 of 17 A total of 5 healthy controls (n = 5) (mean age ± SD = 43.4± 9.11) were also included in the study.Each cohort contained age-matched samples with a female-to-male ratio of 4:1.Blood samples from both RA cases and healthy controls were collected in vacutainers without anticoagulants.Serum was then separated from blood at 4000× g for 5 min, aliquoted into polyethylene tubes (Eppendorf AG, Hamburg, Germany) and stored at -80 • C until use.', metadata={'text': 'Life 2022, 12, 464 3 of 17 A total of 5 healthy controls (n = 5) (mean age ± SD = 43.4± 9.11) were also included in the study.Each cohort contained age-matched samples with a female-to-male ratio of 4:1.Blood samples from both RA cases and healthy controls were collected in vacutainers without anticoagulants.Serum was then separated from blood at 4000× g for 5 min, aliquoted into polyethylene tubes (Eppendorf AG, Hamburg, Germany) and stored at -80 • C until use.', 'para': '4', 'bboxes': \"[[{'page': '3', 'x': '35.49', 'y': '57.46', 'h': '57.79', 'w': '7.77'}, {'page': '3', 'x': '536.53', 'y': '57.56', 'h': '22.95', 'w': '7.67'}, {'page': '3', 'x': '166.01', 'y': '97.73', 'h': '249.40', 'w': '9.90'}], [{'page': '3', 'x': '417.90', 'y': '97.73', 'h': '141.38', 'w': '9.90'}, {'page': '3', 'x': '166.39', 'y': '110.60', 'h': '25.94', 'w': '9.58'}], [{'page': '3', 'x': '195.28', 'y': '110.60', 'h': '335.62', 'w': '9.58'}], [{'page': '3', 'x': '533.84', 'y': '110.60', 'h': '25.43', 'w': '9.58'}, {'page': '3', 'x': '166.39', 'y': '123.15', 'h': '392.88', 'w': '9.58'}, {'page': '3', 'x': '166.39', 'y': '135.71', 'h': '66.29', 'w': '9.58'}], [{'page': '3', 'x': '235.79', 'y': '135.58', 'h': '323.49', 'w': '9.71'}, {'page': '3', 'x': '166.10', 'y': '147.94', 'h': '333.23', 'w': '9.90'}, {'page': '3', 'x': '501.91', 'y': '146.03', 'h': '3.94', 'w': '6.92'}, {'page': '3', 'x': '506.44', 'y': '148.26', 'h': '50.39', 'w': '9.58'}]]\", 'pages': \"('3', '3')\", 
'section_title': 'Study Subjects and Serum Collection', 'section_number': '2.1.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='For validation, serum samples were collected and processed from RA patients (n = 60) (mean age ± SD = 41.495 ± 12.8275) and healthy controls (n = 20) (mean age ± SD = 45.4 ± 11.31) from the same population.The demographics and clinical characteristics of the experimental and validation cohort are shown in Table 1.', metadata={'text': 'For validation, serum samples were collected and processed from RA patients (n = 60) (mean age ± SD = 41.495 ± 12.8275) and healthy controls (n = 20) (mean age ± SD = 45.4 ± 11.31) from the same population.The demographics and clinical characteristics of the experimental and validation cohort are shown in Table 1.', 'para': '1', 'bboxes': \"[[{'page': '3', 'x': '187.65', 'y': '160.81', 'h': '372.02', 'w': '9.58'}, {'page': '3', 'x': '166.10', 'y': '173.05', 'h': '394.17', 'w': '9.90'}, {'page': '3', 'x': '166.07', 'y': '185.60', 'h': '256.73', 'w': '9.90'}], [{'page': '3', 'x': '425.92', 'y': '185.92', 'h': '133.36', 'w': '9.58'}, {'page': '3', 'x': '166.39', 'y': '198.47', 'h': '343.00', 'w': '9.58'}]]\", 'pages': \"('3', '3')\", 'section_title': 'Study Subjects and Serum Collection', 'section_number': '2.1.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='Serum samples were thawed on ice followed by centrifugation at 14,000× g for 10 min at 4 • C. Protein concentrations for serum samples from each donor were then determined through Pierce ® 660 nm protein assay kit for protein concentration (Thermo Scientific, Waltham, MA, USA).The sample volumes containing 10 mg total protein were calculated and mixed with double-distilled water (ddH 2 O) to make the total volume up to 500 µL.', metadata={'text': 'Serum samples were thawed on ice followed by centrifugation at 14,000× g for 10 min at 4 • C. Protein concentrations for serum samples from each donor were then determined through Pierce ® 660 nm protein assay kit for protein concentration (Thermo Scientific, Waltham, MA, USA).The sample volumes containing 10 mg total protein were calculated and mixed with double-distilled water (ddH 2 O) to make the total volume up to 500 µL.', 'para': '1', 'bboxes': \"[[{'page': '3', 'x': '187.65', 'y': '635.30', 'h': '371.63', 'w': '9.71'}, {'page': '3', 'x': '166.39', 'y': '647.98', 'h': '15.71', 'w': '9.58'}, {'page': '3', 'x': '184.70', 'y': '645.75', 'h': '3.94', 'w': '6.92'}, {'page': '3', 'x': '189.24', 'y': '647.98', 'h': '370.04', 'w': '9.58'}, {'page': '3', 'x': '166.39', 'y': '660.54', 'h': '66.80', 'w': '9.58'}, {'page': '3', 'x': '233.20', 'y': '658.55', 'h': '5.66', 'w': '7.28'}, {'page': '3', 'x': '242.68', 'y': '660.54', 'h': '317.84', 'w': '9.58'}, {'page': '3', 'x': '165.90', 'y': '673.09', 'h': '93.36', 'w': '9.58'}], [{'page': '3', 'x': '261.76', 'y': '673.09', 'h': '297.51', 'w': '9.58'}, {'page': '3', 'x': '166.39', 'y': '685.53', 'h': '384.50', 'w': '10.84'}]]\", 'pages': \"('3', '3')\", 'section_title': 'Protein Assay', 'section_number': '2.2.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='Serum samples were analyzed using one-dimensional (1D) sodium dodecyl sulfate polyacrylamide gel electrophoresis (SDS-PAGE) for assessment of the gross quantitative as well as qualitative differences in the serum protein profiles of the study subjects.Briefly, 16 µg of serum samples were mixed with an equal volume of NativePAGE™ sample buffer (Thermo Scientific, Waltham, MA, USA) and loaded on NativePAGE™ 1.0 mm, 4-16%, bis-tris, mini protein gels (Thermo Scientific, Waltham, MA, USA).Novex Sharp Pre-Stained Protein Standard for molecular weight estimation (Thermo Scientific, Waltham, MA, USA) was also loaded in a separate well.The samples and the standard were run in NuPAGE™ MES SDS running buffer (Thermo Scientific, Waltham, MA, USA) at 120 V for 60 min and then at 150 V for 30 min.The gels were washed for 5 min in ddH 2 O.The washing was repeated thrice.Prior to visualization, the protein gels were stained for 16 hours in Coomassie Brilliant Blue R-250 dye (Bio-Rad, Hemel Hempstead, UK) and rinsed in ddH 2 O for 30 min.The whole figure can be found at Supplementary Materials (Figures S1-S3).', metadata={'text': 'Serum samples were analyzed using one-dimensional (1D) sodium dodecyl sulfate polyacrylamide gel electrophoresis (SDS-PAGE) for assessment of the gross quantitative as well as qualitative differences in the serum protein profiles of the study subjects.Briefly, 16 µg of serum samples were mixed with an equal volume of NativePAGE™ sample buffer (Thermo Scientific, Waltham, MA, USA) and loaded on NativePAGE™ 1.0 mm, 4-16%, bis-tris, mini protein gels (Thermo Scientific, Waltham, MA, USA).Novex Sharp Pre-Stained Protein Standard for molecular weight estimation (Thermo Scientific, Waltham, MA, USA) was also loaded in a separate well.The samples and the standard were run in NuPAGE™ MES SDS running buffer (Thermo Scientific, Waltham, MA, USA) at 120 V for 60 min and then at 150 V for 30 min.The gels were washed for 5 min in ddH 2 
O.The washing was repeated thrice.Prior to visualization, the protein gels were stained for 16 hours in Coomassie Brilliant Blue R-250 dye (Bio-Rad, Hemel Hempstead, UK) and rinsed in ddH 2 O for 30 min.The whole figure can be found at Supplementary Materials (Figures S1-S3).', 'para': '7', 'bboxes': \"[[{'page': '3', 'x': '187.65', 'y': '723.60', 'h': '371.62', 'w': '9.58'}, {'page': '3', 'x': '166.10', 'y': '736.15', 'h': '393.18', 'w': '9.58'}, {'page': '3', 'x': '165.98', 'y': '748.71', 'h': '360.04', 'w': '9.58'}], [{'page': '3', 'x': '529.21', 'y': '748.71', 'h': '31.31', 'w': '9.58'}, {'page': '3', 'x': '165.90', 'y': '761.15', 'h': '393.58', 'w': '9.69'}, {'page': '3', 'x': '166.07', 'y': '773.81', 'h': '394.45', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '98.05', 'h': '282.71', 'w': '9.58'}], [{'page': '4', 'x': '451.18', 'y': '98.05', 'h': '108.09', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '110.60', 'h': '393.88', 'w': '9.58'}, {'page': '4', 'x': '165.98', 'y': '123.15', 'h': '152.63', 'w': '9.58'}], [{'page': '4', 'x': '321.71', 'y': '123.15', 'h': '239.52', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '135.71', 'h': '392.88', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '148.26', 'h': '132.03', 'w': '9.58'}], [{'page': '4', 'x': '302.77', 'y': '148.26', 'h': '195.38', 'w': '10.73'}], [{'page': '4', 'x': '501.06', 'y': '148.26', 'h': '58.22', 'w': '9.58'}, {'page': '4', 'x': '165.98', 'y': '160.81', 'h': '90.55', 'w': '9.58'}], [{'page': '4', 'x': '260.29', 'y': '160.81', 'h': '298.99', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '173.37', 'h': '392.88', 'w': '10.73'}, {'page': '4', 'x': '166.39', 'y': '185.92', 'h': '47.62', 'w': '9.58'}], [{'page': '4', 'x': '217.10', 'y': '185.92', 'h': '331.85', 'w': '9.58'}]]\", 'pages': \"('3', '4')\", 'section_title': 'SDS-PAGE and Silver Staining', 'section_number': '2.3.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani 
Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='For qualitative assessment of the elution efficiency of ProteoMiner™ columns (Bio-Rad, Hemel Hempstead, UK), one serum sample processed through the column was also evaluated using 1D SDS-PAGE.For this purpose, the serum sample, the flow-through after each wash, and the eluted samples were run using the aforementioned protocol.Additionally, trypsin digested samples were also analyzed using 1D SDS-PAGE to confirm complete protein digestion before liquid chromatography-tandem mass spectrometry (LC-MS).', metadata={'text': 'For qualitative assessment of the elution efficiency of ProteoMiner™ columns (Bio-Rad, Hemel Hempstead, UK), one serum sample processed through the column was also evaluated using 1D SDS-PAGE.For this purpose, the serum sample, the flow-through after each wash, and the eluted samples were run using the aforementioned protocol.Additionally, trypsin digested samples were also analyzed using 1D SDS-PAGE to confirm complete protein digestion before liquid chromatography-tandem mass spectrometry (LC-MS).', 'para': '2', 'bboxes': \"[[{'page': '4', 'x': '187.65', 'y': '198.47', 'h': '373.27', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '211.02', 'h': '392.88', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '223.58', 'h': '136.31', 'w': '9.58'}], [{'page': '4', 'x': '305.20', 'y': '223.58', 'h': '254.27', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '236.13', 'h': '348.40', 'w': '9.58'}], [{'page': '4', 'x': '517.88', 'y': '236.13', 'h': '43.05', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '248.68', 'h': '392.88', 'w': '9.58'}, {'page': '4', 'x': '166.10', 'y': '261.24', 'h': '377.12', 'w': '9.58'}]]\", 'pages': \"('4', '4')\", 'section_title': 'SDS-PAGE and Silver Staining', 'section_number': '2.3.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='ProteoMiner™ Small Capacity bead columns for protein enrichment were loaded with 10 mg of protein from each sample separately.The bead columns were then rotated at the room temperature for 2 h followed by centrifugation at 1000× g for 60 s.Washing of the beads was performed thrice in phosphate-buffered saline (Sigma-Aldrich, Gillingham, UK) followed by rotation for 5 min and subsequent centrifugation for 60 s at 1000× g.This eluted the maximum amount of unbound protein.', metadata={'text': 'ProteoMiner™ Small Capacity bead columns for protein enrichment were loaded with 10 mg of protein from each sample separately.The bead columns were then rotated at the room temperature for 2 h followed by centrifugation at 1000× g for 60 s.Washing of the beads was performed thrice in phosphate-buffered saline (Sigma-Aldrich, Gillingham, UK) followed by rotation for 5 min and subsequent centrifugation for 60 s at 1000× g.This eluted the maximum amount of unbound protein.', 'para': '3', 'bboxes': \"[[{'page': '4', 'x': '187.65', 'y': '301.41', 'h': '371.62', 'w': '9.58'}, {'page': '4', 'x': '165.90', 'y': '313.96', 'h': '202.25', 'w': '9.58'}], [{'page': '4', 'x': '371.24', 'y': '313.96', 'h': '188.03', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '326.38', 'h': '322.63', 'w': '9.71'}], [{'page': '4', 'x': '492.17', 'y': '326.51', 'h': '67.11', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '339.06', 'h': '393.87', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '351.49', 'h': '368.81', 'w': '9.71'}], [{'page': '4', 'x': '539.86', 'y': '351.62', 'h': '19.41', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '364.17', 'h': '220.03', 'w': '9.58'}]]\", 'pages': \"('4', '4')\", 'section_title': 'ProteoMiner TM Column Processing', 'section_number': '2.4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='A pre-mixed solution of 0.05% (w/v) RapiGest (Waters, Elstree, Hertfordshire, UK) and 160 µL of 25 mM ammonium bicarbonate (NH 4 HCO 3 ) (Fluka Chemicals Ltd., Gillingham, UK) was used for resuspension of the Proteominer TM beads.The resuspended beads were then heated for 10 min at 80 • C; DL-Dithiothreitol (Sigma-Aldrich, Gillingham, UK) to 3 mM final concentration was added, incubated for 10 min at 60 • C and iodoacetamide (Sigma-Aldrich, Gillingham, UK) was added to a final concentration of 9 mM, incubated in the dark for 30 min at room temperature.Protease enzyme trypsin (Sigma-Aldrich, Gillingham, UK) was used for enzymatic protein digestion.A total of 2 µg of trypsin was added to each sample and rotated at 37 • C for 16 h.The samples containing the beads were supplemented again with 2 µg trypsin and rotation for 2 h at 37 • C. The digested serum samples were then centrifuged at 1000× g for 1 min at room temperature.Supernatant was removed followed by the inhibition of the trypsin activity by acidification with 0.5% (v/v) trifluoroacetic acid (TFA, Sigma-Aldrich, Gillingham, UK) and rotation at 37 • C for 30 min.The samples were then centrifuged at 13,000× g for 15 min at 4 • C.', metadata={'text': 'A pre-mixed solution of 0.05% (w/v) RapiGest (Waters, Elstree, Hertfordshire, UK) and 160 µL of 25 mM ammonium bicarbonate (NH 4 HCO 3 ) (Fluka Chemicals Ltd., Gillingham, UK) was used for resuspension of the Proteominer TM beads.The resuspended beads were then heated for 10 min at 80 • C; DL-Dithiothreitol (Sigma-Aldrich, Gillingham, UK) to 3 mM final concentration was added, incubated for 10 min at 60 • C and iodoacetamide (Sigma-Aldrich, Gillingham, UK) was added to a final concentration of 9 mM, incubated in the dark for 30 min at room temperature.Protease enzyme trypsin (Sigma-Aldrich, Gillingham, UK) was used for enzymatic protein digestion.A total of 2 µg of trypsin was added to each sample and rotated at 37 • C for 16 h.The 
samples containing the beads were supplemented again with 2 µg trypsin and rotation for 2 h at 37 • C. The digested serum samples were then centrifuged at 1000× g for 1 min at room temperature.Supernatant was removed followed by the inhibition of the trypsin activity by acidification with 0.5% (v/v) trifluoroacetic acid (TFA, Sigma-Aldrich, Gillingham, UK) and rotation at 37 • C for 30 min.The samples were then centrifuged at 13,000× g for 15 min at 4 • C.', 'para': '6', 'bboxes': \"[[{'page': '4', 'x': '187.65', 'y': '402.13', 'h': '371.62', 'w': '9.58'}, {'page': '4', 'x': '165.90', 'y': '414.57', 'h': '394.62', 'w': '10.84'}, {'page': '4', 'x': '166.39', 'y': '427.23', 'h': '220.48', 'w': '9.58'}, {'page': '4', 'x': '386.88', 'y': '425.24', 'h': '11.80', 'w': '7.28'}, {'page': '4', 'x': '401.67', 'y': '427.23', 'h': '27.81', 'w': '9.58'}], [{'page': '4', 'x': '432.57', 'y': '427.23', 'h': '126.71', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '439.79', 'h': '128.16', 'w': '9.58'}, {'page': '4', 'x': '297.71', 'y': '437.56', 'h': '3.94', 'w': '6.92'}, {'page': '4', 'x': '302.25', 'y': '439.79', 'h': '257.02', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '452.34', 'h': '289.28', 'w': '9.58'}, {'page': '4', 'x': '458.58', 'y': '450.11', 'h': '3.94', 'w': '6.92'}, {'page': '4', 'x': '463.12', 'y': '452.34', 'h': '96.15', 'w': '9.58'}, {'page': '4', 'x': '166.07', 'y': '464.89', 'h': '393.21', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '477.45', 'h': '201.41', 'w': '9.58'}], [{'page': '4', 'x': '373.30', 'y': '477.45', 'h': '187.22', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '490.00', 'h': '261.11', 'w': '9.58'}], [{'page': '4', 'x': '430.61', 'y': '489.89', 'h': '128.66', 'w': '9.69'}, {'page': '4', 'x': '166.39', 'y': '502.55', 'h': '168.86', 'w': '9.58'}, {'page': '4', 'x': '337.78', 'y': '500.32', 'h': '3.94', 'w': '6.92'}, {'page': '4', 'x': '342.32', 'y': '502.55', 'h': '44.55', 'w': '9.58'}], [{'page': '4', 'x': '389.93', 'y': '502.55', 'h': 
'169.35', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '515.00', 'h': '284.75', 'w': '9.69'}, {'page': '4', 'x': '453.78', 'y': '512.87', 'h': '3.94', 'w': '6.92'}, {'page': '4', 'x': '458.32', 'y': '515.11', 'h': '100.96', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '527.53', 'h': '317.21', 'w': '9.71'}], [{'page': '4', 'x': '486.71', 'y': '527.66', 'h': '72.56', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '540.21', 'h': '393.88', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '552.76', 'h': '330.90', 'w': '9.58'}, {'page': '4', 'x': '499.89', 'y': '550.53', 'h': '3.94', 'w': '6.92'}, {'page': '4', 'x': '504.43', 'y': '552.76', 'h': '56.60', 'w': '9.58'}], [{'page': '4', 'x': '166.09', 'y': '565.19', 'h': '276.76', 'w': '9.71'}, {'page': '4', 'x': '445.43', 'y': '563.09', 'h': '3.94', 'w': '6.92'}, {'page': '4', 'x': '449.97', 'y': '565.32', 'h': '9.55', 'w': '9.58'}]]\", 'pages': \"('4', '4')\", 'section_title': 'Protein Digestion', 'section_number': '2.5.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='Each serum digest sample was analyzed using LC-MS/MS on an UltiMate 3000 Nano LC System (Dionex/Thermo Scientific, Waltham, MA, USA).The system was attached to a Q Exactive TM Quadrupole-Orbitrap instrument (Thermo Scientific, Waltham, MA, USA).Prior to loading onto the instrument, the samples were carefully randomized using Microsoft Excel.All the samples were run in one single batch.For this purpose, 150 ng of the tryptic digest from each trypsin-digested serum sample was subjected to LC-MS/MS via a 90 min gradient.For loading on trapping column (100 Å, 75 µm × 2 cm, Acclaim PepMap 100 C18, 3 µm packing material) loading buffer was used that contained 2% (v/v) acetonitrile and 0.1% (v/v) TFA in water.The sample digests mixed with loaded buffer were run at a flow rate of 12 µL min -1 for 7 min.Then, a trapping column was coupled with an analytical column (100 Å, 75 µm × 50 cm, EASY-Spray PepMap RSLC C18, 2 µm packing material) followed by elution of the peptides through a linear gradient.The linear gradient consisted of 96.2%A composed of 0.1% (v/v) formic acid: 3.8% B consisting of 0.1% (v/v) formic acid in water/acetonitrile [80/20] (v/v) to 50% A: 50% B at a flow rate of 300 nl min -1 over 90 min and washed for 5 min at 1% A: 99% B. 
The column was then re-equilibrated to the starting conditions and maintained at 40 • C before direct introduction of the affluent into the integrated nano-electrospray ionization source that was operating in the positive ion mode.The MS instrument was operated in the data-dependent acquisition (DDA) mode with the survey scans between the mass to charge ratio (m/z) range of 350 to 2000 that were acquired at a mass resolution of about 60,000 and the fullwidth at halfmaximum (FWHM) at m/z of about 200.The automatic gain control was set to 3e6 with a maximum injection time of 100 ms.For MS/MS, 12 of the most intense precursor ions with an isolation window of 2 m/z units and charge states ranging from 2+ to 5+ were selected.For this, the automatic gain control was set to a value of 1e5 with the maximum injection time of 100 ms.The peptide fragmentation was obtained by the higher-energy collisional dissociation utilizing a normalized collision energy of 30%.Dynamic exclusion of the m/z values was used to avoid the repeated fragmentation of the same peptide with an exclusion time of 20 s.All MS raw files for this experiment have been deposited to the ProteomeXchange Consortium through the PRIDE partner proteomics repository.The dataset identifier for this submission is PXD020235 and 10.6019/PXD020235 [22].', metadata={'text': 'Each serum digest sample was analyzed using LC-MS/MS on an UltiMate 3000 Nano LC System (Dionex/Thermo Scientific, Waltham, MA, USA).The system was attached to a Q Exactive TM Quadrupole-Orbitrap instrument (Thermo Scientific, Waltham, MA, USA).Prior to loading onto the instrument, the samples were carefully randomized using Microsoft Excel.All the samples were run in one single batch.For this purpose, 150 ng of the tryptic digest from each trypsin-digested serum sample was subjected to LC-MS/MS via a 90 min gradient.For loading on trapping column (100 Å, 75 µm × 2 cm, Acclaim PepMap 100 C18, 3 µm packing material) loading buffer was used that 
contained 2% (v/v) acetonitrile and 0.1% (v/v) TFA in water.The sample digests mixed with loaded buffer were run at a flow rate of 12 µL min -1 for 7 min.Then, a trapping column was coupled with an analytical column (100 Å, 75 µm × 50 cm, EASY-Spray PepMap RSLC C18, 2 µm packing material) followed by elution of the peptides through a linear gradient.The linear gradient consisted of 96.2%A composed of 0.1% (v/v) formic acid: 3.8% B consisting of 0.1% (v/v) formic acid in water/acetonitrile [80/20] (v/v) to 50% A: 50% B at a flow rate of 300 nl min -1 over 90 min and washed for 5 min at 1% A: 99% B. The column was then re-equilibrated to the starting conditions and maintained at 40 • C before direct introduction of the affluent into the integrated nano-electrospray ionization source that was operating in the positive ion mode.The MS instrument was operated in the data-dependent acquisition (DDA) mode with the survey scans between the mass to charge ratio (m/z) range of 350 to 2000 that were acquired at a mass resolution of about 60,000 and the fullwidth at halfmaximum (FWHM) at m/z of about 200.The automatic gain control was set to 3e6 with a maximum injection time of 100 ms.For MS/MS, 12 of the most intense precursor ions with an isolation window of 2 m/z units and charge states ranging from 2+ to 5+ were selected.For this, the automatic gain control was set to a value of 1e5 with the maximum injection time of 100 ms.The peptide fragmentation was obtained by the higher-energy collisional dissociation utilizing a normalized collision energy of 30%.Dynamic exclusion of the m/z values was used to avoid the repeated fragmentation of the same peptide with an exclusion time of 20 s.All MS raw files for this experiment have been deposited to the ProteomeXchange Consortium through the PRIDE partner proteomics repository.The dataset identifier for this submission is PXD020235 and 10.6019/PXD020235 [22].', 'para': '17', 'bboxes': \"[[{'page': '4', 'x': '187.65', 'y': 
'603.27', 'h': '371.62', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '615.83', 'h': '275.76', 'w': '9.58'}], [{'page': '4', 'x': '445.29', 'y': '615.83', 'h': '113.98', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '628.38', 'h': '69.65', 'w': '9.58'}, {'page': '4', 'x': '236.04', 'y': '626.39', 'h': '11.80', 'w': '7.28'}, {'page': '4', 'x': '251.61', 'y': '628.38', 'h': '308.91', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '640.93', 'h': '26.46', 'w': '9.58'}], [{'page': '4', 'x': '195.34', 'y': '640.93', 'h': '363.93', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '653.49', 'h': '70.68', 'w': '9.58'}], [{'page': '4', 'x': '240.17', 'y': '653.49', 'h': '198.27', 'w': '9.58'}], [{'page': '4', 'x': '441.54', 'y': '653.49', 'h': '117.74', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '666.04', 'h': '392.88', 'w': '9.58'}, {'page': '4', 'x': '166.12', 'y': '678.59', 'h': '99.03', 'w': '9.58'}], [{'page': '4', 'x': '269.47', 'y': '678.28', 'h': '289.81', 'w': '9.90'}, {'page': '4', 'x': '166.39', 'y': '691.04', 'h': '393.88', 'w': '9.69'}, {'page': '4', 'x': '166.39', 'y': '703.70', 'h': '184.20', 'w': '9.58'}], [{'page': '4', 'x': '354.67', 'y': '703.70', 'h': '204.81', 'w': '9.58'}, {'page': '4', 'x': '165.98', 'y': '716.14', 'h': '161.87', 'w': '9.69'}, {'page': '4', 'x': '327.94', 'y': '714.02', 'h': '10.01', 'w': '6.92'}, {'page': '4', 'x': '341.08', 'y': '716.25', 'h': '43.65', 'w': '9.58'}], [{'page': '4', 'x': '388.22', 'y': '716.25', 'h': '171.05', 'w': '9.58'}, {'page': '4', 'x': '165.98', 'y': '728.49', 'h': '393.30', 'w': '9.90'}, {'page': '4', 'x': '166.10', 'y': '741.36', 'h': '346.29', 'w': '9.58'}], [{'page': '4', 'x': '515.48', 'y': '741.36', 'h': '43.99', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '753.91', 'h': '123.17', 'w': '9.58'}], [{'page': '4', 'x': '292.23', 'y': '753.91', 'h': '267.05', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '766.46', 'h': '392.88', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '98.05', 'h': 
'57.86', 'w': '9.58'}, {'page': '5', 'x': '224.34', 'y': '95.82', 'h': '10.01', 'w': '6.92'}, {'page': '5', 'x': '237.34', 'y': '98.05', 'h': '321.93', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '110.60', 'h': '266.57', 'w': '9.58'}, {'page': '5', 'x': '435.34', 'y': '108.37', 'h': '3.94', 'w': '6.92'}, {'page': '5', 'x': '439.88', 'y': '110.60', 'h': '119.40', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '123.15', 'h': '392.88', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '135.71', 'h': '96.68', 'w': '9.58'}], [{'page': '5', 'x': '266.18', 'y': '135.71', 'h': '293.10', 'w': '9.58'}, {'page': '5', 'x': '166.07', 'y': '148.26', 'h': '393.21', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '160.81', 'h': '394.53', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '173.24', 'h': '181.48', 'w': '9.71'}], [{'page': '5', 'x': '351.05', 'y': '173.37', 'h': '208.23', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '185.92', 'h': '165.42', 'w': '9.58'}], [{'page': '5', 'x': '335.20', 'y': '185.92', 'h': '224.07', 'w': '9.58'}, {'page': '5', 'x': '165.98', 'y': '198.34', 'h': '393.30', 'w': '9.71'}, {'page': '5', 'x': '166.39', 'y': '211.02', 'h': '37.95', 'w': '9.58'}], [{'page': '5', 'x': '207.44', 'y': '211.02', 'h': '351.83', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '223.58', 'h': '109.24', 'w': '9.58'}], [{'page': '5', 'x': '279.32', 'y': '223.58', 'h': '280.35', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '236.13', 'h': '305.80', 'w': '9.58'}], [{'page': '5', 'x': '475.28', 'y': '236.13', 'h': '83.99', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '248.55', 'h': '392.88', 'w': '9.71'}, {'page': '5', 'x': '166.39', 'y': '261.24', 'h': '115.17', 'w': '9.58'}], [{'page': '5', 'x': '286.60', 'y': '261.24', 'h': '272.67', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '273.79', 'h': '373.17', 'w': '9.58'}], [{'page': '5', 'x': '542.65', 'y': '273.79', 'h': '16.63', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '286.34', 'h': '355.30', 'w': 
'9.58'}]]\", 'pages': \"('4', '5')\", 'section_title': 'LC-MS/MS', 'section_number': '2.6.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='For label-free quantification, all the raw files were processed using Progenesis™ QI 2.0 software (Nonlinear Dynamics, Waters).Progenesis™ QI software undertakes the spectral alignment, consistent peak picking across all runs, normalization of the total protein abundance as well as peptide/protein quantification.For each feature, the top five spectra were exported, and the peptide and protein identifications were carried out via in-house Mascot server (Version 2.6.2).Reviewed Homo sapiens database was used to perform the identifications.Search parameters included: fragment mass tolerance value of 0.01 Da; peptide mass tolerance value of 10.0 ppm; enzyme, trypsin; one allowed missed cleavage; carbamidomethylation (cysteine) as the fixed modifications and oxidation (methionine) as the variable modification; The criteria used for protein identification included a false discovery rate (FDR) of 1% and ≥2 unique peptides.', metadata={'text': 'For label-free quantification, all the raw files were processed using Progenesis™ QI 2.0 software (Nonlinear Dynamics, Waters).Progenesis™ QI software undertakes the spectral alignment, consistent peak picking across all runs, normalization of the total protein abundance as well as peptide/protein quantification.For each feature, the top five spectra were exported, and the peptide and protein identifications were carried out via in-house Mascot server (Version 2.6.2).Reviewed Homo sapiens database was used to perform the identifications.Search parameters included: fragment mass tolerance value of 0.01 Da; peptide mass tolerance value of 10.0 ppm; enzyme, trypsin; one allowed missed cleavage; carbamidomethylation (cysteine) as the fixed modifications and oxidation (methionine) as the variable modification; The criteria used for protein identification included a false discovery rate (FDR) of 1% and ≥2 unique peptides.', 'para': '4', 'bboxes': \"[[{'page': '5', 'x': '187.65', 'y': '324.30', 'h': 
'371.62', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '336.85', 'h': '174.99', 'w': '9.58'}], [{'page': '5', 'x': '344.48', 'y': '336.85', 'h': '214.80', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '349.41', 'h': '392.88', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '361.96', 'h': '231.45', 'w': '9.58'}], [{'page': '5', 'x': '400.98', 'y': '361.96', 'h': '158.30', 'w': '9.58'}, {'page': '5', 'x': '165.98', 'y': '374.51', 'h': '393.30', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '387.06', 'h': '131.22', 'w': '9.58'}], [{'page': '5', 'x': '300.78', 'y': '386.93', 'h': '258.50', 'w': '9.71'}, {'page': '5', 'x': '166.39', 'y': '399.62', 'h': '66.54', 'w': '9.58'}], [{'page': '5', 'x': '237.66', 'y': '399.62', 'h': '322.86', 'w': '9.58'}, {'page': '5', 'x': '166.10', 'y': '412.17', 'h': '394.42', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '424.72', 'h': '393.87', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '437.28', 'h': '392.88', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '449.51', 'h': '230.16', 'w': '9.90'}]]\", 'pages': \"('5', '5')\", 'section_title': 'Label-Free Quantification', 'section_number': '2.7.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='Canonical pathways, networks and disregulated regulators of the proteins that were identified with an FDR adjusted p-value of <0.05 and ≥2 unique peptides were performed using Ingenuity Pathway Analysis (IPA) (Qiagen, Hilden, Germany).For this, the gene names for the identified proteins were uploaded and analyzed for humans.All identified proteins were used as a background.The uncharacterized proteins were excluded from analysis.', metadata={'text': 'Canonical pathways, networks and disregulated regulators of the proteins that were identified with an FDR adjusted p-value of <0.05 and ≥2 unique peptides were performed using Ingenuity Pathway Analysis (IPA) (Qiagen, Hilden, Germany).For this, the gene names for the identified proteins were uploaded and analyzed for humans.All identified proteins were used as a background.The uncharacterized proteins were excluded from analysis.', 'para': '3', 'bboxes': \"[[{'page': '5', 'x': '187.65', 'y': '487.79', 'h': '371.62', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '500.02', 'h': '392.89', 'w': '9.90'}, {'page': '5', 'x': '166.39', 'y': '512.89', 'h': '310.88', 'w': '9.58'}], [{'page': '5', 'x': '481.23', 'y': '512.89', 'h': '78.04', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '525.45', 'h': '342.92', 'w': '9.58'}], [{'page': '5', 'x': '514.37', 'y': '525.45', 'h': '46.56', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '538.00', 'h': '186.85', 'w': '9.58'}], [{'page': '5', 'x': '357.87', 'y': '538.00', 'h': '201.40', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '550.55', 'h': '61.84', 'w': '9.58'}]]\", 'pages': \"('5', '5')\", 'section_title': 'Pathway Analysis', 'section_number': '2.8.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content=\"A human PZP ELISA kit (CSB-EL019131HU, CUSABIO, Houston, TX, USA) was used for the quantification of PZP protein in human samples from an independent cohort of RA patients and controls according to the manufacturer's directions.All the samples were analyzed in duplicates and protein concentration was determined as an average of the duplicates.\", metadata={'text': \"A human PZP ELISA kit (CSB-EL019131HU, CUSABIO, Houston, TX, USA) was used for the quantification of PZP protein in human samples from an independent cohort of RA patients and controls according to the manufacturer's directions.All the samples were analyzed in duplicates and protein concentration was determined as an average of the duplicates.\", 'para': '1', 'bboxes': \"[[{'page': '5', 'x': '187.65', 'y': '588.51', 'h': '371.62', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '601.06', 'h': '392.88', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '613.62', 'h': '315.12', 'w': '9.58'}], [{'page': '5', 'x': '487.71', 'y': '613.62', 'h': '71.56', 'w': '9.58'}, {'page': '5', 'x': '165.98', 'y': '626.17', 'h': '393.30', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '638.72', 'h': '64.33', 'w': '9.58'}]]\", 'pages': \"('5', '5')\", 'section_title': 'Validation of MS Using ELISA', 'section_number': '2.9.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='Heat map plots were created and visualized using MetaboAnalyst 4.0.Principal component analysis (PCA) was also performed using MetaboAnalyst 4.0.Log transformation and Pareto scaling were applied for data analysis of the normalized data.For this study, the DE proteins were defined as those with a FDR adjusted p-value of <0.05, identified ≥2 unique peptides and a >2 fold expression change using ANOVA.For comparison of PZP concentration between RA patients and healthy controls, a t-test was used.A boxplot depicting the ELISA results was designed using R 4.1.1.', metadata={'text': 'Heat map plots were created and visualized using MetaboAnalyst 4.0.Principal component analysis (PCA) was also performed using MetaboAnalyst 4.0.Log transformation and Pareto scaling were applied for data analysis of the normalized data.For this study, the DE proteins were defined as those with a FDR adjusted p-value of <0.05, identified ≥2 unique peptides and a >2 fold expression change using ANOVA.For comparison of PZP concentration between RA patients and healthy controls, a t-test was used.A boxplot depicting the ELISA results was designed using R 4.1.1.', 'para': '5', 'bboxes': \"[[{'page': '5', 'x': '187.65', 'y': '676.68', 'h': '324.45', 'w': '9.58'}], [{'page': '5', 'x': '518.63', 'y': '676.68', 'h': '40.64', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '689.23', 'h': '331.18', 'w': '9.58'}], [{'page': '5', 'x': '501.86', 'y': '689.23', 'h': '59.07', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '701.79', 'h': '372.04', 'w': '9.58'}], [{'page': '5', 'x': '544.26', 'y': '701.79', 'h': '15.21', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '714.21', 'h': '394.12', 'w': '9.71'}, {'page': '5', 'x': '166.39', 'y': '726.58', 'h': '339.08', 'w': '9.90'}], [{'page': '5', 'x': '507.91', 'y': '726.89', 'h': '53.02', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '739.31', 'h': '382.39', 'w': '9.71'}], [{'page': '5', 'x': '551.89', 'y': '739.44', 'h': 
'7.78', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '752.00', 'h': '280.43', 'w': '9.58'}]]\", 'pages': \"('5', '5')\", 'section_title': 'Statistics', 'section_number': '2.10.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='Life 2022, 12, 464 6 of 17', metadata={'text': 'Life 2022, 12, 464 6 of 17', 'para': '0', 'bboxes': \"[[{'page': '6', 'x': '35.49', 'y': '57.46', 'h': '57.79', 'w': '7.77'}, {'page': '6', 'x': '536.53', 'y': '57.56', 'h': '22.95', 'w': '7.67'}]]\", 'pages': \"('6', '6')\", 'section_title': 'Statistics', 'section_number': '2.10.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='1-D SDS PAGE did not demonstrate any significant differences among groups (Figure 1).A large band of serum albumin appeared at 67 kDa in all the samples; the most abundant protein in human serum.1-D SDS-PAGE of the serum samples processed through Pro-teoMiner™ columns showed that with each wash, the albumin and other high abundance proteins gradually decreased, and all the on-bead proteins were enriched gradually as depicted by the presence of all protein bands and their respective intensities in the SDS-PAGE of eluted samples (Figure 2).', metadata={'text': '1-D SDS PAGE did not demonstrate any significant differences among groups (Figure 1).A large band of serum albumin appeared at 67 kDa in all the samples; the most abundant protein in human serum.1-D SDS-PAGE of the serum samples processed through Pro-teoMiner™ columns showed that with each wash, the albumin and other high abundance proteins gradually decreased, and all the on-bead proteins were enriched gradually as depicted by the presence of all protein bands and their respective intensities in the SDS-PAGE of eluted samples (Figure 2).', 'para': '2', 'bboxes': \"[[{'page': '6', 'x': '187.55', 'y': '127.04', 'h': '373.46', 'w': '9.58'}], [{'page': '6', 'x': '166.01', 'y': '139.59', 'h': '393.27', 'w': '9.58'}, {'page': '6', 'x': '166.10', 'y': '152.14', 'h': '113.19', 'w': '9.58'}], [{'page': '6', 'x': '283.92', 'y': '152.14', 'h': '277.01', 'w': '9.58'}, {'page': '6', 'x': '166.39', 'y': '164.70', 'h': '392.88', 'w': '9.58'}, {'page': '6', 'x': '166.10', 'y': '177.25', 'h': '394.83', 'w': '9.58'}, {'page': '6', 'x': '166.10', 'y': '189.80', 'h': '393.18', 'w': '9.58'}, {'page': '6', 'x': '166.39', 'y': '202.36', 'h': '125.01', 'w': '9.58'}]]\", 'pages': \"('6', '6')\", 'section_title': '1-D SDS-PAGE Qualitative Analysis', 'section_number': '3.1.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid 
Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='A total of 213 proteins were identified following ProgenesisQI™ using Mascot (Table S1).One RF-negative and ACPA-positive sample returned a very low alignment score of 8.6% and was, therefore, excluded from the analysis.For the remaining samples, more than 1 unique peptide was mapped to 165 proteins out of 213 proteins.Out of 213 proteins, 124 proteins showed >a 2-fold change.A total of 37 out of these 213 proteins had q-value < 0.05.', metadata={'text': 'A total of 213 proteins were identified following ProgenesisQI™ using Mascot (Table S1).One RF-negative and ACPA-positive sample returned a very low alignment score of 8.6% and was, therefore, excluded from the analysis.For the remaining samples, more than 1 unique peptide was mapped to 165 proteins out of 213 proteins.Out of 213 proteins, 124 proteins showed >a 2-fold change.A total of 37 out of these 213 proteins had q-value < 0.05.', 'para': '4', 'bboxes': \"[[{'page': '7', 'x': '187.65', 'y': '453.98', 'h': '371.62', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '466.53', 'h': '16.21', 'w': '9.58'}], [{'page': '7', 'x': '185.69', 'y': '466.53', 'h': '373.58', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '479.08', 'h': '238.50', 'w': '9.58'}], [{'page': '7', 'x': '409.44', 'y': '479.08', 'h': '149.84', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '491.63', 'h': '305.54', 'w': '9.58'}], [{'page': '7', 'x': '475.02', 'y': '491.63', 'h': '85.50', 'w': '9.58'}, {'page': '7', 'x': '165.90', 'y': '504.19', 'h': '181.57', 'w': '9.58'}], [{'page': '7', 'x': '356.21', 'y': '504.19', 'h': '203.06', 'w': '9.58'}, {'page': '7', 'x': '166.12', 'y': '516.74', 'h': '64.13', 'w': '9.58'}]]\", 'pages': \"('7', '7')\", 'section_title': 'Identification of Proteins in Serum', 'section_number': '3.2.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': 
'/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='The comparative analysis of all groups (a FDR adjusted p-value of <0.05, identified ≥2 unique peptides and a >2 fold expression change) identified 25 proteins that were DE (Table 2), of which 10 proteins were DE between healthy control subjects and 1 of the serotypes including PZP, selenoprotein P (SELENOP), C4b-binding protein (C4BP) beta chain, apolipoprotein M (ApoM), N-acetylmuramoyl-L-alanine amidase (NAMLAA), carboxypeptidase N (CPN) catalytic chain, oncoprotein Induced Transcript 3 (OIT3), CPN subunit 2, apolipoprotein C-I (ApoC1) and apolipoprotein C-III (ApoCIII).', metadata={'text': 'The comparative analysis of all groups (a FDR adjusted p-value of <0.05, identified ≥2 unique peptides and a >2 fold expression change) identified 25 proteins that were DE (Table 2), of which 10 proteins were DE between healthy control subjects and 1 of the serotypes including PZP, selenoprotein P (SELENOP), C4b-binding protein (C4BP) beta chain, apolipoprotein M (ApoM), N-acetylmuramoyl-L-alanine amidase (NAMLAA), carboxypeptidase N (CPN) catalytic chain, oncoprotein Induced Transcript 3 (OIT3), CPN subunit 2, apolipoprotein C-I (ApoC1) and apolipoprotein C-III (ApoCIII).', 'para': '0', 'bboxes': \"[[{'page': '7', 'x': '187.65', 'y': '554.57', 'h': '373.28', 'w': '9.71'}, {'page': '7', 'x': '166.39', 'y': '566.93', 'h': '392.88', 'w': '9.90'}, {'page': '7', 'x': '166.39', 'y': '579.80', 'h': '392.88', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '592.36', 'h': '393.87', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '604.91', 'h': '394.12', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '617.46', 'h': '392.88', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '630.02', 'h': '326.51', 'w': '9.58'}]]\", 'pages': \"('7', '7')\", 'section_title': 'Differentially Expressed Proteins', 'section_number': '3.3.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis 
Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='The PCA analysis (Figure 3A,B) showed that only 22.1% of the proteins (PC1) were divided between RA patients and healthy controls.The distribution only decreased to 21.3%, when only the patient groups were included in PCA (Figure 3C).The heat map of the proteins showed that the group averages of various proteins were different between patients and healthy subjects (Figure 4A).The heat map of the patient serotypes and controls however showed that although distinguishable patterns of expression existed between normalized abundances of individual proteins between patient serotypes as well as healthy subjects, only Q96PD5 (NAMLAA) showed similar trends across all the RA serogrpups as compared to healthy controls (Figure 4B).', metadata={'text': 'The PCA analysis (Figure 3A,B) showed that only 22.1% of the proteins (PC1) were divided between RA patients and healthy controls.The distribution only decreased to 21.3%, when only the patient groups were included in PCA (Figure 3C).The heat map of the proteins showed that the group averages of various proteins were different between patients and healthy subjects (Figure 4A).The heat map of the patient serotypes and controls however showed that although distinguishable patterns of expression existed between normalized abundances of individual proteins between patient serotypes as well as healthy subjects, only Q96PD5 (NAMLAA) showed similar trends across all the RA serogrpups as compared to healthy controls (Figure 4B).', 'para': '3', 'bboxes': \"[[{'page': '7', 'x': '187.65', 'y': '642.57', 'h': '371.62', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '655.12', 'h': '231.62', 'w': '9.58'}], [{'page': '7', 'x': '402.93', 'y': '655.12', 'h': '156.34', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '667.67', 'h': '318.32', 'w': '9.58'}], [{'page': '7', 'x': '487.20', 'y': '667.67', 'h': '72.08', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '680.23', 'h': '392.88', 'w': '9.58'}, {'page': '7', 
'x': '166.10', 'y': '692.78', 'h': '192.29', 'w': '9.58'}], [{'page': '7', 'x': '362.14', 'y': '692.78', 'h': '197.14', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '705.33', 'h': '392.88', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '717.89', 'h': '392.88', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '730.44', 'h': '393.27', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '742.99', 'h': '246.41', 'w': '9.58'}]]\", 'pages': \"('7', '7')\", 'section_title': 'Differentially Expressed Proteins', 'section_number': '3.3.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='Canonical pathway analysis was undertaken on the DE proteins between each serotype of RA and healthy controls.The comparison of double-positive RA samples with healthy controls predicted activation of dendritic cell maturation (p = 0.009); and inhibition of liver X receptor/retinoid X receptor (LXR/RXR) pathway (p = 7.9 × 10 -28 ), acute phase response signalling (p = 3.16 × 10 -27 ) and production of NO and ROS species in themacrophages (p = 1.41 × 10 -08 ) (Figure 5A).The comparison of RF-positive RA patients with healthy controls revealed an activation of the coagulation system (p = 3.98 × 10 -11 ), the intrinsic prothrombin activation pathway (p = 8.70 × 10 -09 ) and the GP6 signaling pathway (p = 0.0009); and inhibition of the LXR/RXR pathway (p = 5.01 × 10 -21 ), production of NO and ROS in macrophages (p = 2.57 × 10 -08 ) and maturity onset diabetes of young (MODY) signaling (p = 2.29 × 10 -06 ) (Figure 5B).The comparison of ACPA-positive RA patients with healthy controls revealed activation of the coagulation system (p = 3.54 × 10 -08 ), the intrinsic prothrombin activation pathway (p = 4.89 × 10 -06 ), the extrinsic prothrombin activation pathway (p = 5.01 × 10 -10 ) and acute phase response signalling (p = 5.01 × 10 -11 ); and inhibition of the LXR/RXR pathway (p = 1.99 × 10 -14 ) and production of NO and ROS in macrophages (p = 0.001) (Figure 5C).Pathway analysis of double-negative RA patients with healthy controls revealed the activation of the coagulation system (p = 7.94 × 10 -19 ), the intrinsic prothrombin activation pathway (p = 5.01 × 10 -12 ) and the extrinsic prothrombin activation pathway (p = 1.25 × 10 -13 ); and inhibition of the LXR/RXR pathway (p = 1.58 × 10 -25 ); acute phase response signalling (p = 1 × 10 -23 ) and production of NO and ROS in macrophages (p = 2.18 × 10 -10 ) (Figure 5D).', metadata={'text': 'Canonical pathway analysis was undertaken on the DE proteins between each serotype of RA and 
healthy controls.The comparison of double-positive RA samples with healthy controls predicted activation of dendritic cell maturation (p = 0.009); and inhibition of liver X receptor/retinoid X receptor (LXR/RXR) pathway (p = 7.9 × 10 -28 ), acute phase response signalling (p = 3.16 × 10 -27 ) and production of NO and ROS species in themacrophages (p = 1.41 × 10 -08 ) (Figure 5A).The comparison of RF-positive RA patients with healthy controls revealed an activation of the coagulation system (p = 3.98 × 10 -11 ), the intrinsic prothrombin activation pathway (p = 8.70 × 10 -09 ) and the GP6 signaling pathway (p = 0.0009); and inhibition of the LXR/RXR pathway (p = 5.01 × 10 -21 ), production of NO and ROS in macrophages (p = 2.57 × 10 -08 ) and maturity onset diabetes of young (MODY) signaling (p = 2.29 × 10 -06 ) (Figure 5B).The comparison of ACPA-positive RA patients with healthy controls revealed activation of the coagulation system (p = 3.54 × 10 -08 ), the intrinsic prothrombin activation pathway (p = 4.89 × 10 -06 ), the extrinsic prothrombin activation pathway (p = 5.01 × 10 -10 ) and acute phase response signalling (p = 5.01 × 10 -11 ); and inhibition of the LXR/RXR pathway (p = 1.99 × 10 -14 ) and production of NO and ROS in macrophages (p = 0.001) (Figure 5C).Pathway analysis of double-negative RA patients with healthy controls revealed the activation of the coagulation system (p = 7.94 × 10 -19 ), the intrinsic prothrombin activation pathway (p = 5.01 × 10 -12 ) and the extrinsic prothrombin activation pathway (p = 1.25 × 10 -13 ); and inhibition of the LXR/RXR pathway (p = 1.58 × 10 -25 ); acute phase response signalling (p = 1 × 10 -23 ) and production of NO and ROS in macrophages (p = 2.18 × 10 -10 ) (Figure 5D).', 'para': '4', 'bboxes': \"[[{'page': '10', 'x': '187.65', 'y': '187.01', 'h': '371.62', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '199.57', 'h': '121.34', 'w': '9.58'}], [{'page': '10', 'x': '290.85', 'y': '199.57', 'h': '268.81', 'w': 
'9.58'}, {'page': '10', 'x': '166.39', 'y': '212.12', 'h': '393.08', 'w': '9.58'}, {'page': '10', 'x': '166.07', 'y': '224.36', 'h': '279.60', 'w': '9.90'}, {'page': '10', 'x': '445.76', 'y': '222.44', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '460.56', 'y': '224.67', 'h': '98.72', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '236.91', 'h': '107.64', 'w': '9.90'}, {'page': '10', 'x': '274.13', 'y': '235.00', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '288.92', 'y': '237.23', 'h': '270.35', 'w': '9.58'}, {'page': '10', 'x': '166.07', 'y': '249.46', 'h': '61.63', 'w': '9.90'}, {'page': '10', 'x': '227.80', 'y': '247.55', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '242.59', 'y': '249.78', 'h': '60.14', 'w': '9.58'}], [{'page': '10', 'x': '305.42', 'y': '249.78', 'h': '254.24', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '262.02', 'h': '324.85', 'w': '9.90'}, {'page': '10', 'x': '491.34', 'y': '260.10', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '506.13', 'y': '262.33', 'h': '54.79', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '274.57', 'h': '228.54', 'w': '9.90'}, {'page': '10', 'x': '395.03', 'y': '272.65', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '409.83', 'y': '274.88', 'h': '149.84', 'w': '9.58'}, {'page': '10', 'x': '166.07', 'y': '287.12', 'h': '292.45', 'w': '9.90'}, {'page': '10', 'x': '458.61', 'y': '285.21', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '473.41', 'y': '287.44', 'h': '85.87', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '299.67', 'h': '171.14', 'w': '9.90'}, {'page': '10', 'x': '337.63', 'y': '297.76', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '352.42', 'y': '299.99', 'h': '207.85', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '312.23', 'h': '106.03', 'w': '9.90'}, {'page': '10', 'x': '272.52', 'y': '310.31', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '287.32', 'y': '312.54', 'h': '58.57', 'w': '9.58'}], [{'page': '10', 'x': '348.64', 'y': '312.54', 'h': '210.64', 'w': '9.58'}, 
{'page': '10', 'x': '165.98', 'y': '324.78', 'h': '356.34', 'w': '9.90'}, {'page': '10', 'x': '522.41', 'y': '322.87', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '537.20', 'y': '325.10', 'h': '22.07', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '337.33', 'h': '240.24', 'w': '9.90'}, {'page': '10', 'x': '406.72', 'y': '335.42', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '421.52', 'y': '337.65', 'h': '139.41', 'w': '9.58'}, {'page': '10', 'x': '166.12', 'y': '349.89', 'h': '131.91', 'w': '9.90'}, {'page': '10', 'x': '298.12', 'y': '347.97', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '312.92', 'y': '349.89', 'h': '226.91', 'w': '9.90'}, {'page': '10', 'x': '539.92', 'y': '347.97', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '554.71', 'y': '350.20', 'h': '5.81', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '362.44', 'h': '235.96', 'w': '9.90'}, {'page': '10', 'x': '402.45', 'y': '360.53', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '417.24', 'y': '362.76', 'h': '142.04', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '375.31', 'h': '172.72', 'w': '9.58'}], [{'page': '10', 'x': '341.60', 'y': '375.31', 'h': '217.67', 'w': '9.58'}, {'page': '10', 'x': '165.98', 'y': '387.55', 'h': '373.85', 'w': '9.90'}, {'page': '10', 'x': '539.92', 'y': '385.63', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '554.71', 'y': '387.86', 'h': '5.81', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '400.10', 'h': '272.29', 'w': '9.90'}, {'page': '10', 'x': '438.77', 'y': '398.18', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '453.57', 'y': '400.41', 'h': '107.36', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '412.65', 'h': '190.95', 'w': '9.90'}, {'page': '10', 'x': '357.44', 'y': '410.74', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '372.24', 'y': '412.97', 'h': '187.43', 'w': '9.58'}, {'page': '10', 'x': '166.07', 'y': '425.20', 'h': '60.01', 'w': '9.90'}, {'page': '10', 'x': '226.17', 'y': '423.29', 'h': '14.30', 'w': '6.92'}, {'page': '10', 
'x': '240.97', 'y': '425.20', 'h': '198.64', 'w': '9.90'}, {'page': '10', 'x': '439.70', 'y': '423.29', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '454.50', 'y': '425.52', 'h': '104.78', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '437.76', 'h': '173.96', 'w': '9.90'}, {'page': '10', 'x': '340.45', 'y': '435.84', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '355.24', 'y': '438.07', 'h': '58.63', 'w': '9.58'}]]\", 'pages': \"('10', '10')\", 'section_title': 'Pathway Analysis', 'section_number': '3.4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='The comparison of the four serotypes of RA with healthy controls revealed an inhibition of inflammatory response, leukocyte migration, binding of professional phagocytic cells, migration of cells, adhesion of phagocytes, cell movement of phagocytes and cell movement of leukocytes in all serotypes except double-negative serotype.Accumulation of leukocytes was, however, inhibited in all serotypes.Concentration of cholesterol was inhibited in all serotypes except ACPA-positive patients that did not show activation or inhibition of this protein (Figure 6).The detailed results of pathway analysis are provided in Table S2.', metadata={'text': 'The comparison of the four serotypes of RA with healthy controls revealed an inhibition of inflammatory response, leukocyte migration, binding of professional phagocytic cells, migration of cells, adhesion of phagocytes, cell movement of phagocytes and cell movement of leukocytes in all serotypes except double-negative serotype.Accumulation of leukocytes was, however, inhibited in all serotypes.Concentration of cholesterol was inhibited in all serotypes except ACPA-positive patients that did not show activation or inhibition of this protein (Figure 6).The detailed results of pathway analysis are provided in Table S2.', 'para': '3', 'bboxes': \"[[{'page': '10', 'x': '187.65', 'y': '450.63', 'h': '373.27', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '463.18', 'h': '392.88', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '475.73', 'h': '392.88', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '488.29', 'h': '326.66', 'w': '9.58'}], [{'page': '10', 'x': '496.17', 'y': '488.29', 'h': '63.10', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '500.84', 'h': '242.32', 'w': '9.58'}], [{'page': '10', 'x': '412.04', 'y': '500.84', 'h': '147.24', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '513.39', 'h': '393.08', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '525.94', 'h': '154.46', 'w': 
'9.58'}], [{'page': '10', 'x': '323.94', 'y': '525.94', 'h': '235.34', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '538.50', 'h': '49.73', 'w': '9.58'}]]\", 'pages': \"('10', '10')\", 'section_title': 'Pathway Analysis', 'section_number': '3.4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='cells, migration of cells, adhesion of phagocytes, cell movement of phagocytes and cell movement of leukocytes in all serotypes except double-negative serotype.Accumulation of leukocytes was, however, inhibited in all serotypes.Concentration of cholesterol was inhibited in all serotypes except ACPA-positive patients that did not show activation or inhibition of this protein (Figure 6).The detailed results of pathway analysis are provided in Table S2.', metadata={'text': 'cells, migration of cells, adhesion of phagocytes, cell movement of phagocytes and cell movement of leukocytes in all serotypes except double-negative serotype.Accumulation of leukocytes was, however, inhibited in all serotypes.Concentration of cholesterol was inhibited in all serotypes except ACPA-positive patients that did not show activation or inhibition of this protein (Figure 6).The detailed results of pathway analysis are provided in Table S2.', 'para': '3', 'bboxes': \"[[{'page': '11', 'x': '161.33', 'y': '2.64', 'h': '392.96', 'w': '10.17'}, {'page': '11', 'x': '161.33', 'y': '15.42', 'h': '327.31', 'w': '10.17'}], [{'page': '11', 'x': '491.28', 'y': '15.42', 'h': '63.01', 'w': '10.17'}, {'page': '11', 'x': '161.33', 'y': '28.26', 'h': '243.11', 'w': '10.17'}], [{'page': '11', 'x': '407.57', 'y': '28.26', 'h': '146.64', 'w': '10.17'}, {'page': '11', 'x': '161.33', 'y': '41.10', 'h': '392.99', 'w': '10.17'}, {'page': '11', 'x': '161.33', 'y': '53.88', 'h': '154.79', 'w': '10.17'}], [{'page': '11', 'x': '318.41', 'y': '53.88', 'h': '236.00', 'w': '10.17'}, {'page': '11', 'x': '161.33', 'y': '66.71', 'h': '50.88', 'w': '10.17'}]]\", 'pages': \"('11', '11')\", 'section_title': 'Pathway Analysis', 'section_number': '3.4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='We validated the mass spectrometry results using ELISA for PZP.As Figure 7 shows, the expression of PZP was significantly higher among patients (7.54 ± 6.35 µg/mL) as compared to controls (1.03 ± 0.54 µg/mL (p-value 7.41 × 10 -11 ).The PZP concentration for each sample is represented in Table S3.The sensitivity of PZP for detecting RA is 96.7% and specificity is 95%.', metadata={'text': 'We validated the mass spectrometry results using ELISA for PZP.As Figure 7 shows, the expression of PZP was significantly higher among patients (7.54 ± 6.35 µg/mL) as compared to controls (1.03 ± 0.54 µg/mL (p-value 7.41 × 10 -11 ).The PZP concentration for each sample is represented in Table S3.The sensitivity of PZP for detecting RA is 96.7% and specificity is 95%.', 'para': '3', 'bboxes': \"[[{'page': '13', 'x': '187.65', 'y': '113.59', 'h': '286.90', 'w': '9.58'}], [{'page': '13', 'x': '477.04', 'y': '113.59', 'h': '83.48', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '125.83', 'h': '392.88', 'w': '9.90'}, {'page': '13', 'x': '166.39', 'y': '138.38', 'h': '267.36', 'w': '9.90'}, {'page': '13', 'x': '433.85', 'y': '136.46', 'h': '13.80', 'w': '6.92'}, {'page': '13', 'x': '448.14', 'y': '138.70', 'h': '5.91', 'w': '9.58'}], [{'page': '13', 'x': '457.13', 'y': '138.70', 'h': '102.14', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '151.25', 'h': '183.14', 'w': '9.58'}], [{'page': '13', 'x': '352.62', 'y': '151.25', 'h': '207.49', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '163.80', 'h': '96.73', 'w': '9.58'}]]\", 'pages': \"('13', '13')\", 'section_title': 'Validation of Mass Spectrometry Using ELISA', 'section_number': '3.5.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='Life 2022, 12, x FOR PEER REVIEW 14 of 19 of the proteins up-or downregulation with the activation of the respective function.The negative Z score on contrary represents inhibition of the function.The orange-colored squares represent upregulation during the disease state and the blue squares represent downregulation with the color intensity being directly correlated with the prediction strength.', metadata={'text': 'Life 2022, 12, x FOR PEER REVIEW 14 of 19 of the proteins up-or downregulation with the activation of the respective function.The negative Z score on contrary represents inhibition of the function.The orange-colored squares represent upregulation during the disease state and the blue squares represent downregulation with the color intensity being directly correlated with the prediction strength.', 'para': '2', 'bboxes': \"[[{'page': '13', 'x': '37.64', 'y': '1.90', 'h': '123.72', 'w': '8.10'}, {'page': '13', 'x': '529.95', 'y': '1.90', 'h': '30.99', 'w': '8.04'}, {'page': '13', 'x': '168.02', 'y': '39.19', 'h': '331.53', 'w': '9.07'}], [{'page': '13', 'x': '501.74', 'y': '39.19', 'h': '59.29', 'w': '9.07'}, {'page': '13', 'x': '168.01', 'y': '50.71', 'h': '221.30', 'w': '9.07'}], [{'page': '13', 'x': '392.29', 'y': '50.71', 'h': '168.68', 'w': '9.07'}, {'page': '13', 'x': '168.01', 'y': '62.30', 'h': '392.96', 'w': '9.07'}, {'page': '13', 'x': '168.01', 'y': '73.88', 'h': '250.47', 'w': '9.07'}]]\", 'pages': \"('13', '13')\", 'section_title': 'Validation of Mass Spectrometry Using ELISA', 'section_number': '3.5.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='We validated the mass spectrometry results using ELISA for PZP.As Figure 7 shows, the expression of PZP was significantly higher among patients (7.54 ± 6.35 µg/mL) as compared to controls (1.03 ± 0.54 µg/mL (p-value 7.41 × 10 -11 ).The PZP concentration for each sample is represented in Table S3.The sensitivity of PZP for detecting RA is 96.7% and specificity is 95%.', metadata={'text': 'We validated the mass spectrometry results using ELISA for PZP.As Figure 7 shows, the expression of PZP was significantly higher among patients (7.54 ± 6.35 µg/mL) as compared to controls (1.03 ± 0.54 µg/mL (p-value 7.41 × 10 -11 ).The PZP concentration for each sample is represented in Table S3.The sensitivity of PZP for detecting RA is 96.7% and specificity is 95%.', 'para': '3', 'bboxes': \"[[{'page': '13', 'x': '189.26', 'y': '113.46', 'h': '286.80', 'w': '10.10'}], [{'page': '13', 'x': '478.18', 'y': '113.46', 'h': '82.72', 'w': '10.10'}, {'page': '13', 'x': '168.01', 'y': '126.30', 'h': '393.04', 'w': '10.10'}, {'page': '13', 'x': '168.00', 'y': '139.14', 'h': '251.76', 'w': '10.11'}], [{'page': '13', 'x': '422.30', 'y': '139.14', 'h': '138.68', 'w': '10.10'}, {'page': '13', 'x': '168.02', 'y': '151.92', 'h': '153.23', 'w': '10.10'}], [{'page': '13', 'x': '324.44', 'y': '151.92', 'h': '236.56', 'w': '10.10'}, {'page': '13', 'x': '168.02', 'y': '164.75', 'h': '77.85', 'w': '10.10'}]]\", 'pages': \"('13', '13')\", 'section_title': 'Validation of Mass Spectrometry Using ELISA', 'section_number': '3.5.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='In this study, we identified 10 DE proteins between RA serotypes and healthy controls.Next, we undertook successfully validation of one of the DE proteins; PZP, in an independent sample cohort indicating our findings for this protein are applicable to another population.We then performed canonical pathway analysis for the DE proteins across each serotype in comparison to healthy controls to identify the key pathways and biological processes that are perturbed across these serotypes.We used ProteoMiner TM protein enrichment columns to deplete the proteins with high abundance and enrich the proteins with low abundance [23].ProteoMiner TM protein enrichment of low abundance proteins has several advantages over the immunoaffinity-based protein depletion', metadata={'text': 'In this study, we identified 10 DE proteins between RA serotypes and healthy controls.Next, we undertook successfully validation of one of the DE proteins; PZP, in an independent sample cohort indicating our findings for this protein are applicable to another population.We then performed canonical pathway analysis for the DE proteins across each serotype in comparison to healthy controls to identify the key pathways and biological processes that are perturbed across these serotypes.We used ProteoMiner TM protein enrichment columns to deplete the proteins with high abundance and enrich the proteins with low abundance [23].ProteoMiner TM protein enrichment of low abundance proteins has several advantages over the immunoaffinity-based protein depletion', 'para': '4', 'bboxes': \"[[{'page': '13', 'x': '189.27', 'y': '595.50', 'h': '371.67', 'w': '10.10'}, {'page': '13', 'x': '168.02', 'y': '608.28', 'h': '22.36', 'w': '10.10'}], [{'page': '13', 'x': '193.76', 'y': '608.28', 'h': '367.28', 'w': '10.10'}, {'page': '13', 'x': '168.03', 'y': '621.12', 'h': '392.93', 'w': '10.10'}, {'page': '13', 'x': '168.03', 'y': '633.95', 'h': '78.65', 'w': '10.10'}], [{'page': '13', 'x': 
'250.60', 'y': '633.95', 'h': '310.37', 'w': '10.10'}, {'page': '13', 'x': '168.03', 'y': '646.73', 'h': '392.99', 'w': '10.10'}, {'page': '13', 'x': '168.03', 'y': '659.56', 'h': '280.89', 'w': '10.10'}], [{'page': '13', 'x': '452.48', 'y': '659.56', 'h': '108.46', 'w': '10.10'}, {'page': '13', 'x': '168.02', 'y': '672.42', 'h': '392.93', 'w': '10.10'}, {'page': '13', 'x': '168.01', 'y': '685.19', 'h': '153.16', 'w': '10.10'}], [{'page': '13', 'x': '324.12', 'y': '685.19', 'h': '236.83', 'w': '10.11'}, {'page': '13', 'x': '168.02', 'y': '698.04', 'h': '393.02', 'w': '10.10'}]]\", 'pages': \"('13', '13')\", 'section_title': 'Discussion', 'section_number': '4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='In this study, we identified 10 DE proteins between RA serotypes and healthy controls.Next, we undertook successfully validation of one of the DE proteins; PZP, in an independent sample cohort indicating our findings for this protein are applicable to another population.We then performed canonical pathway analysis for the DE proteins across each serotype in comparison to healthy controls to identify the key pathways and biological processes that are perturbed across these serotypes.We used ProteoMiner TM protein enrichment columns to deplete the proteins with high abundance and enrich the proteins with low abundance [23].ProteoMiner TM protein enrichment of low abundance proteins has several advantages over the immunoaffinity-based protein depletion approaches including a relatively less-complicated procedure, high material yield and reproducibility [24,25].', metadata={'text': 'In this study, we identified 10 DE proteins between RA serotypes and healthy controls.Next, we undertook successfully validation of one of the DE proteins; PZP, in an independent sample cohort indicating our findings for this protein are applicable to another population.We then performed canonical pathway analysis for the DE proteins across each serotype in comparison to healthy controls to identify the key pathways and biological processes that are perturbed across these serotypes.We used ProteoMiner TM protein enrichment columns to deplete the proteins with high abundance and enrich the proteins with low abundance [23].ProteoMiner TM protein enrichment of low abundance proteins has several advantages over the immunoaffinity-based protein depletion approaches including a relatively less-complicated procedure, high material yield and reproducibility [24,25].', 'para': '4', 'bboxes': \"[[{'page': '13', 'x': '187.65', 'y': '594.15', 'h': '373.37', 'w': '9.58'}], [{'page': '13', 'x': '166.39', 'y': '606.70', 'h': '394.53', 'w': '9.58'}, {'page': '13', 'x': 
'166.10', 'y': '619.26', 'h': '393.37', 'w': '9.58'}, {'page': '13', 'x': '166.10', 'y': '631.81', 'h': '50.15', 'w': '9.58'}], [{'page': '13', 'x': '219.31', 'y': '631.81', 'h': '339.97', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '644.36', 'h': '392.88', 'w': '9.58'}, {'page': '13', 'x': '166.10', 'y': '656.92', 'h': '220.39', 'w': '9.58'}], [{'page': '13', 'x': '389.49', 'y': '656.92', 'h': '93.06', 'w': '9.58'}, {'page': '13', 'x': '482.56', 'y': '654.92', 'h': '11.80', 'w': '7.28'}, {'page': '13', 'x': '497.10', 'y': '656.92', 'h': '63.82', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '669.47', 'h': '392.89', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '682.02', 'h': '90.84', 'w': '9.58'}], [{'page': '13', 'x': '260.75', 'y': '682.02', 'h': '56.61', 'w': '9.58'}, {'page': '13', 'x': '317.37', 'y': '680.03', 'h': '11.80', 'w': '7.28'}, {'page': '13', 'x': '332.30', 'y': '682.02', 'h': '226.97', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '694.57', 'h': '392.88', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '707.13', 'h': '382.95', 'w': '9.58'}]]\", 'pages': \"('13', '13')\", 'section_title': 'Discussion', 'section_number': '4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='PZP is a high-molecular-weight immunosuppressive glycoprotein that is elevated during pregnancy.The role of this protein as an autoimmunity mediator was established by a recent LC-MS/MS-based study in inflammatory bowel disease patients [26].In this study, we also found increased expression of PZP in all the RA serotypes as compared to the controls using LC-MS/MS.The results were further validated by ELISA in a different cohort of RA patients and subjects.The high sensitivity and specificity of this protein for RA patients signify strong candidacy of PZP as a disease biomarker.', metadata={'text': 'PZP is a high-molecular-weight immunosuppressive glycoprotein that is elevated during pregnancy.The role of this protein as an autoimmunity mediator was established by a recent LC-MS/MS-based study in inflammatory bowel disease patients [26].In this study, we also found increased expression of PZP in all the RA serotypes as compared to the controls using LC-MS/MS.The results were further validated by ELISA in a different cohort of RA patients and subjects.The high sensitivity and specificity of this protein for RA patients signify strong candidacy of PZP as a disease biomarker.', 'para': '4', 'bboxes': \"[[{'page': '13', 'x': '187.65', 'y': '719.68', 'h': '371.62', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '732.23', 'h': '81.38', 'w': '9.58'}], [{'page': '13', 'x': '250.90', 'y': '732.23', 'h': '308.38', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '744.79', 'h': '361.65', 'w': '9.58'}], [{'page': '13', 'x': '531.13', 'y': '744.79', 'h': '28.14', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '757.34', 'h': '392.88', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '769.89', 'h': '136.95', 'w': '9.58'}], [{'page': '13', 'x': '305.83', 'y': '769.89', 'h': '253.45', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '98.05', 'h': '154.73', 'w': '9.58'}], [{'page': '14', 'x': '324.24', 'y': '98.05', 'h': '235.24', 'w': '9.58'}, 
{'page': '14', 'x': '166.39', 'y': '110.60', 'h': '299.06', 'w': '9.58'}]]\", 'pages': \"('13', '14')\", 'section_title': 'Discussion', 'section_number': '4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='In this study, the serum expression of SELENOP was decreased in all RA serotypes in comparison to controls.SELENOP is a biomarker of selenium status that has been identified as a major preventable trigger for autoimmune diseases including RA [27].In comparison to controls, the serum selenium concentrations [28] and SELENOP concentrations [29,30] have been reported to be decreased in RA patients.The selenium status has been linked to the upregulation of a whole set of inflammation-related genes via nuclear factor kappalight-chain enhancer of activated B cells (NF-κB) mediated activation of several intracellular selenoproteins [28].The role of selenium and SELENOP, combined with previous findings suggest strong candidacy of this protein as a biomarker of autoimmunity.', metadata={'text': 'In this study, the serum expression of SELENOP was decreased in all RA serotypes in comparison to controls.SELENOP is a biomarker of selenium status that has been identified as a major preventable trigger for autoimmune diseases including RA [27].In comparison to controls, the serum selenium concentrations [28] and SELENOP concentrations [29,30] have been reported to be decreased in RA patients.The selenium status has been linked to the upregulation of a whole set of inflammation-related genes via nuclear factor kappalight-chain enhancer of activated B cells (NF-κB) mediated activation of several intracellular selenoproteins [28].The role of selenium and SELENOP, combined with previous findings suggest strong candidacy of this protein as a biomarker of autoimmunity.', 'para': '4', 'bboxes': \"[[{'page': '14', 'x': '187.65', 'y': '123.15', 'h': '371.63', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '135.71', 'h': '100.51', 'w': '9.58'}], [{'page': '14', 'x': '269.87', 'y': '135.71', 'h': '289.40', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '148.26', 'h': '326.53', 'w': '9.58'}], [{'page': '14', 'x': '496.01', 'y': '148.26', 'h': '63.26', 'w': 
'9.58'}, {'page': '14', 'x': '166.39', 'y': '160.81', 'h': '392.88', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '173.37', 'h': '227.71', 'w': '9.58'}], [{'page': '14', 'x': '397.21', 'y': '173.37', 'h': '162.06', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '185.92', 'h': '394.53', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '198.47', 'h': '393.08', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '211.02', 'h': '84.88', 'w': '9.58'}], [{'page': '14', 'x': '254.41', 'y': '211.02', 'h': '304.87', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '223.58', 'h': '322.31', 'w': '9.58'}]]\", 'pages': \"('14', '14')\", 'section_title': 'Discussion', 'section_number': '4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='NAMLAA degrades bacterial cell wall component peptidoglycan [31] that has strong pro-inflammatory properties and can induce arthritis in rat models [32,33].The degradation of these pro-inflammatory components should suggestively confer an anti-inflammatory and protective role to NAMLAA against arthritis.However, Saha et al. [34] demonstrated that NAMLAA is indeed essential for the development of arthritis, a relatively unexpected finding.The study findings of Saha et al. [34] have not been supported by animal model studies for other inflammatory diseases [35].Decreased levels of this protein in human RA subjects as compared to healthy controls were observed in this study.The autoantigenic potential of NAMLAA and the presence of antibodies has been reported in a recent study [18] that can explain the lower serum levels of circulating NAMLAA.The imbalance of this homeostasis is probably responsible for the development of RA that needs to be further explored.', metadata={'text': 'NAMLAA degrades bacterial cell wall component peptidoglycan [31] that has strong pro-inflammatory properties and can induce arthritis in rat models [32,33].The degradation of these pro-inflammatory components should suggestively confer an anti-inflammatory and protective role to NAMLAA against arthritis.However, Saha et al. [34] demonstrated that NAMLAA is indeed essential for the development of arthritis, a relatively unexpected finding.The study findings of Saha et al. 
[34] have not been supported by animal model studies for other inflammatory diseases [35].Decreased levels of this protein in human RA subjects as compared to healthy controls were observed in this study.The autoantigenic potential of NAMLAA and the presence of antibodies has been reported in a recent study [18] that can explain the lower serum levels of circulating NAMLAA.The imbalance of this homeostasis is probably responsible for the development of RA that needs to be further explored.', 'para': '6', 'bboxes': \"[[{'page': '14', 'x': '187.65', 'y': '236.13', 'h': '371.62', 'w': '9.58'}, {'page': '14', 'x': '166.10', 'y': '248.68', 'h': '319.08', 'w': '9.58'}], [{'page': '14', 'x': '488.13', 'y': '248.68', 'h': '71.15', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '261.24', 'h': '393.27', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '273.79', 'h': '217.44', 'w': '9.58'}], [{'page': '14', 'x': '386.91', 'y': '273.79', 'h': '172.36', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '286.34', 'h': '392.89', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '298.90', 'h': '35.23', 'w': '9.58'}], [{'page': '14', 'x': '204.71', 'y': '298.90', 'h': '354.57', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '311.45', 'h': '200.75', 'w': '9.58'}], [{'page': '14', 'x': '371.28', 'y': '311.45', 'h': '187.99', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '324.00', 'h': '329.87', 'w': '9.58'}], [{'page': '14', 'x': '500.37', 'y': '324.00', 'h': '60.55', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '336.55', 'h': '392.88', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '349.11', 'h': '327.04', 'w': '9.58'}], [{'page': '14', 'x': '495.92', 'y': '349.11', 'h': '63.36', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '361.66', 'h': '392.88', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '374.21', 'h': '74.85', 'w': '9.58'}]]\", 'pages': \"('14', '14')\", 'section_title': 'Discussion', 'section_number': '4.', 'paper_title': 'LC-MS/MS-Based Serum Protein 
Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='C4BP β-chain, a complement inhibitor [36], and CPN, a zinc metalloprotease [37], were also observed to be DE in this study.However, a lack of consensus regarding the role of these proteins in autoimmunity and RA hereby suggest further exploration.', metadata={'text': 'C4BP β-chain, a complement inhibitor [36], and CPN, a zinc metalloprotease [37], were also observed to be DE in this study.However, a lack of consensus regarding the role of these proteins in autoimmunity and RA hereby suggest further exploration.', 'para': '1', 'bboxes': \"[[{'page': '14', 'x': '187.65', 'y': '386.66', 'h': '372.87', 'w': '9.69'}, {'page': '14', 'x': '165.98', 'y': '399.32', 'h': '181.73', 'w': '9.58'}], [{'page': '14', 'x': '350.80', 'y': '399.32', 'h': '208.48', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '411.87', 'h': '343.98', 'w': '9.58'}]]\", 'pages': \"('14', '14')\", 'section_title': 'Discussion', 'section_number': '4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='We found three apolipoproteins to be DE between RA patients and healthy controls including ApoM, ApoC1 and ApoCIII.These apolipoproteins are implicated in protection against atherosclerosis owing to their role in HDL metabolism as well as anti-inflammatory properties [38].The polymorphisms in the ApoM gene have been associated with the risk of dyslipidaemia in RA patients [39,40].However, no study reports the serum levels of this chaperone in RA patients.ApoC1 has been identified as a predictor of drug response to RA [41,42].The risk of developing cardiovascular disease is elevated among RA patients than the general population [43,44].The observed decrease in the serum levels of these apolipoproteins in RA patients could suggestively explain the increased risk of developing cardiovascular disease among RA patients and highlight the link between these two illnesses.', metadata={'text': 'We found three apolipoproteins to be DE between RA patients and healthy controls including ApoM, ApoC1 and ApoCIII.These apolipoproteins are implicated in protection against atherosclerosis owing to their role in HDL metabolism as well as anti-inflammatory properties [38].The polymorphisms in the ApoM gene have been associated with the risk of dyslipidaemia in RA patients [39,40].However, no study reports the serum levels of this chaperone in RA patients.ApoC1 has been identified as a predictor of drug response to RA [41,42].The risk of developing cardiovascular disease is elevated among RA patients than the general population [43,44].The observed decrease in the serum levels of these apolipoproteins in RA patients could suggestively explain the increased risk of developing cardiovascular disease among RA patients and highlight the link between these two illnesses.', 'para': '6', 'bboxes': \"[[{'page': '14', 'x': '187.65', 'y': '424.42', 'h': '371.62', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '436.98', 'h': '169.44', 'w': '9.58'}], [{'page': 
'14', 'x': '338.31', 'y': '436.98', 'h': '220.97', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '449.53', 'h': '393.27', 'w': '9.58'}, {'page': '14', 'x': '166.10', 'y': '462.08', 'h': '68.53', 'w': '9.58'}], [{'page': '14', 'x': '240.26', 'y': '462.08', 'h': '319.02', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '474.64', 'h': '199.50', 'w': '9.58'}], [{'page': '14', 'x': '370.58', 'y': '474.64', 'h': '190.35', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '487.19', 'h': '164.71', 'w': '9.58'}], [{'page': '14', 'x': '336.00', 'y': '487.19', 'h': '223.28', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '499.74', 'h': '102.28', 'w': '9.58'}], [{'page': '14', 'x': '271.78', 'y': '499.74', 'h': '287.50', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '512.30', 'h': '207.79', 'w': '9.58'}], [{'page': '14', 'x': '377.30', 'y': '512.30', 'h': '181.97', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '524.85', 'h': '392.88', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '537.40', 'h': '392.88', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '549.95', 'h': '58.69', 'w': '9.58'}]]\", 'pages': \"('14', '14')\", 'section_title': 'Discussion', 'section_number': '4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='The pathway analysis of the DE proteins showed that some pathways were differentially inhibited or activated in various serotypes suggesting that these serotypes are indeed regulated by different pathogenic mechanisms.However, some similarities were also observed including inhibition of LXR/RXR pathway and NO and ROS production in macrophages.LXR/RXR pathway was inhibited among all the RA serotypes.This pathway has been reported to inhibit atherosclerosis [45] and inflammation [46], suggesting an important and relatively unexplored link between this pathway and RA.The role of ROS in autoimmunity is complex and has been generally viewed as detrimental in the pathogenesis of autoimmune disease [47].A recent study revealed the regulatory role of these oxidative stress markers to prevent the pathogenesis of chronic inflammatory diseases [48].The inhibition of NO and ROS pathway in macrophage across all the serotypes warrants further exploration about the precise role of this pathway in the pathogenesis of RA.', metadata={'text': 'The pathway analysis of the DE proteins showed that some pathways were differentially inhibited or activated in various serotypes suggesting that these serotypes are indeed regulated by different pathogenic mechanisms.However, some similarities were also observed including inhibition of LXR/RXR pathway and NO and ROS production in macrophages.LXR/RXR pathway was inhibited among all the RA serotypes.This pathway has been reported to inhibit atherosclerosis [45] and inflammation [46], suggesting an important and relatively unexplored link between this pathway and RA.The role of ROS in autoimmunity is complex and has been generally viewed as detrimental in the pathogenesis of autoimmune disease [47].A recent study revealed the regulatory role of these oxidative stress markers to prevent the pathogenesis of chronic inflammatory diseases [48].The inhibition of NO and ROS pathway in macrophage across all the serotypes 
warrants further exploration about the precise role of this pathway in the pathogenesis of RA.', 'para': '6', 'bboxes': \"[[{'page': '14', 'x': '187.65', 'y': '562.51', 'h': '373.28', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '575.06', 'h': '392.88', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '587.61', 'h': '243.63', 'w': '9.58'}], [{'page': '14', 'x': '413.11', 'y': '587.61', 'h': '146.16', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '600.17', 'h': '392.88', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '612.72', 'h': '74.63', 'w': '9.58'}], [{'page': '14', 'x': '246.68', 'y': '612.72', 'h': '287.53', 'w': '9.58'}], [{'page': '14', 'x': '539.86', 'y': '612.72', 'h': '19.41', 'w': '9.58'}, {'page': '14', 'x': '166.10', 'y': '625.27', 'h': '393.18', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '637.83', 'h': '323.22', 'w': '9.58'}], [{'page': '14', 'x': '491.83', 'y': '637.83', 'h': '67.45', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '650.38', 'h': '394.53', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '662.93', 'h': '160.94', 'w': '9.58'}], [{'page': '14', 'x': '330.42', 'y': '662.93', 'h': '228.85', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '675.48', 'h': '394.63', 'w': '9.58'}], [{'page': '14', 'x': '166.09', 'y': '688.04', 'h': '393.19', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '700.59', 'h': '370.26', 'w': '9.58'}]]\", 'pages': \"('14', '14')\", 'section_title': 'Discussion', 'section_number': '4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='RA is a complex disorder with molecular and clinical heterogeneity.We used RF and ACPA to classify our patient population and studied the DE proteins in comparison to all healthy controls.However, due to the COVID-19 pandemic, only a limited number of samples could be collected for validation of the identified proteins.The lockdown situation also limited the access to the laboratory facilities and the samples were not tested for their individual RF and ACPA status.The validation of the mass spectrometry result for PZP in an independent cohort of patients suggest that identified proteins can be tested on larger cohorts of patients from different populations in the future to validate the study findings and identify disease biomarkers for RA.', metadata={'text': 'RA is a complex disorder with molecular and clinical heterogeneity.We used RF and ACPA to classify our patient population and studied the DE proteins in comparison to all healthy controls.However, due to the COVID-19 pandemic, only a limited number of samples could be collected for validation of the identified proteins.The lockdown situation also limited the access to the laboratory facilities and the samples were not tested for their individual RF and ACPA status.The validation of the mass spectrometry result for PZP in an independent cohort of patients suggest that identified proteins can be tested on larger cohorts of patients from different populations in the future to validate the study findings and identify disease biomarkers for RA.', 'para': '4', 'bboxes': \"[[{'page': '14', 'x': '187.65', 'y': '713.14', 'h': '297.24', 'w': '9.58'}], [{'page': '14', 'x': '487.97', 'y': '713.14', 'h': '71.30', 'w': '9.58'}, {'page': '14', 'x': '166.01', 'y': '725.70', 'h': '393.27', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '738.25', 'h': '87.33', 'w': '9.58'}], [{'page': '14', 'x': '256.82', 'y': '738.25', 'h': '302.45', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '750.80', 
'h': '287.62', 'w': '9.58'}], [{'page': '14', 'x': '457.08', 'y': '750.80', 'h': '102.19', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '763.35', 'h': '393.08', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '98.05', 'h': '140.07', 'w': '9.58'}], [{'page': '15', 'x': '309.55', 'y': '98.05', 'h': '249.73', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '110.60', 'h': '393.08', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '123.15', 'h': '392.88', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '135.71', 'h': '175.46', 'w': '9.58'}]]\", 'pages': \"('14', '15')\", 'section_title': 'Discussion', 'section_number': '4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='RA is a complex disease that is influenced by an intricate interactome of various environmental, genetic and microbial factors that influence the immune homeostasis.Owing to the complex genetic architecture accompanied by a plethora of microbial and environmental triggers that an organism is exposed to this has made the identification of diagnostic and prognostic markers challenging.Our study has explored the serum proteomics of this complex autoimmune disorder in a relatively understudied Pakistani population to identify disease biomarkers that are DE among various serotypes of RA patients and healthy controls.We identified that PZP, SELENOP, C4BP beta chain, ApoM, NAMLAA, CPN catalytic chain, OIT3, CPN subunit 2, ApoC1 and ApoCIII were DE between the RA patients and healthy controls.These serum proteins have strong potential to serve as diagnostic and prognostic biomarkers of RA and can also be evaluated to fill the gaps in the current knowledge of pathogenesis of RA.These findings can be validated in larger cohorts from different populations to identify diagnostic and prognostic biomarkers of RA.', metadata={'text': 'RA is a complex disease that is influenced by an intricate interactome of various environmental, genetic and microbial factors that influence the immune homeostasis.Owing to the complex genetic architecture accompanied by a plethora of microbial and environmental triggers that an organism is exposed to this has made the identification of diagnostic and prognostic markers challenging.Our study has explored the serum proteomics of this complex autoimmune disorder in a relatively understudied Pakistani population to identify disease biomarkers that are DE among various serotypes of RA patients and healthy controls.We identified that PZP, SELENOP, C4BP beta chain, ApoM, NAMLAA, CPN catalytic chain, OIT3, CPN subunit 2, ApoC1 and ApoCIII were DE between the RA patients and healthy controls.These serum proteins have strong 
potential to serve as diagnostic and prognostic biomarkers of RA and can also be evaluated to fill the gaps in the current knowledge of pathogenesis of RA.These findings can be validated in larger cohorts from different populations to identify diagnostic and prognostic biomarkers of RA.', 'para': '5', 'bboxes': \"[[{'page': '15', 'x': '187.65', 'y': '173.66', 'h': '371.62', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '186.22', 'h': '394.62', 'w': '9.58'}], [{'page': '15', 'x': '166.39', 'y': '198.77', 'h': '392.88', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '211.32', 'h': '392.88', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '223.88', 'h': '229.10', 'w': '9.58'}], [{'page': '15', 'x': '401.31', 'y': '223.88', 'h': '157.97', 'w': '9.58'}, {'page': '15', 'x': '166.10', 'y': '236.43', 'h': '393.18', 'w': '9.58'}, {'page': '15', 'x': '166.10', 'y': '248.98', 'h': '393.57', 'w': '9.58'}, {'page': '15', 'x': '166.10', 'y': '261.54', 'h': '130.46', 'w': '9.58'}], [{'page': '15', 'x': '299.65', 'y': '261.54', 'h': '260.87', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '274.09', 'h': '392.88', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '286.64', 'h': '201.22', 'w': '9.58'}], [{'page': '15', 'x': '370.71', 'y': '286.64', 'h': '188.57', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '299.19', 'h': '392.88', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '311.75', 'h': '238.67', 'w': '9.58'}], [{'page': '15', 'x': '407.54', 'y': '311.75', 'h': '151.74', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '324.30', 'h': '392.88', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '336.85', 'h': '28.14', 'w': '9.58'}]]\", 'pages': \"('15', '15')\", 'section_title': 'Conclusions', 'section_number': '5.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/life12030464/s1;Table S1: Accession, number of unique peptides and description of identified proteins in all samples, Table S2: Pathway analysis results using Ingenuity Pathway Analysis, Table S3: The PZP concentration for the validation cohort, Figure S1: Serum samples from study subjects loaded on SDS-PAGE Gel to check the presence of proteins and get a rough idea of protein integrity: Lane 1: Ladder, Lane 2-6: Double positive RA patients for RF factor and anti-CCP, Lane 7-11: Single positive RA patients for RF factor.The integrated density ratio is shown at the bottom for each band.Integrated density ratio is calculated using ImageJ, Figure S2: Serum samples from study subjects loaded on SDS-PAGE Gel to check the presence of proteins and get a rough idea of protein integrity: Lane 1: Ladder, Lane 2-6: Single positive RA patients for anti-CCP, Lane 7-9: Double negative RA patients for RF factor and anti-CCP.The in-tegrated density ratio is shown at the bottom for each band.Integrated density ratio is calculated using ImageJ, Figure S3: Serum samples from study subjects loaded on SDS-PAGE Gel to check the presence of proteins and get a rough idea of protein integrity: Lane 1: Ladder, Lane 2-8: Healthy control sam-ples.The integrated density ratio is shown at the bottom for each band.Integrated density ratio is calculated using ImageJ.', metadata={'text': 'The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/life12030464/s1;Table S1: Accession, number of unique peptides and description of identified proteins in all samples, Table S2: Pathway analysis results using Ingenuity Pathway Analysis, Table S3: The PZP concentration for the validation cohort, Figure S1: Serum samples from study subjects loaded on SDS-PAGE Gel to check the presence of proteins and get a rough idea of protein integrity: 
Lane 1: Ladder, Lane 2-6: Double positive RA patients for RF factor and anti-CCP, Lane 7-11: Single positive RA patients for RF factor.The integrated density ratio is shown at the bottom for each band.Integrated density ratio is calculated using ImageJ, Figure S2: Serum samples from study subjects loaded on SDS-PAGE Gel to check the presence of proteins and get a rough idea of protein integrity: Lane 1: Ladder, Lane 2-6: Single positive RA patients for anti-CCP, Lane 7-9: Double negative RA patients for RF factor and anti-CCP.The in-tegrated density ratio is shown at the bottom for each band.Integrated density ratio is calculated using ImageJ, Figure S3: Serum samples from study subjects loaded on SDS-PAGE Gel to check the presence of proteins and get a rough idea of protein integrity: Lane 1: Ladder, Lane 2-8: Healthy control sam-ples.The integrated density ratio is shown at the bottom for each band.Integrated density ratio is calculated using ImageJ.', 'para': '7', 'bboxes': \"[[{'page': '15', 'x': '278.51', 'y': '361.51', 'h': '281.88', 'w': '8.63'}, {'page': '15', 'x': '165.31', 'y': '373.54', 'h': '205.36', 'w': '8.63'}], [{'page': '15', 'x': '372.93', 'y': '373.54', 'h': '186.34', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '385.57', 'h': '392.88', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '397.60', 'h': '394.00', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '409.63', 'h': '392.88', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '421.66', 'h': '392.88', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '433.69', 'h': '282.04', 'w': '8.63'}], [{'page': '15', 'x': '451.20', 'y': '433.69', 'h': '108.07', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '445.72', 'h': '148.76', 'w': '8.63'}], [{'page': '15', 'x': '317.96', 'y': '445.72', 'h': '242.43', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '457.75', 'h': '392.88', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '469.78', 'h': '394.00', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': 
'481.81', 'h': '267.09', 'w': '8.63'}], [{'page': '15', 'x': '435.72', 'y': '481.81', 'h': '123.56', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '493.84', 'h': '143.55', 'w': '8.63'}], [{'page': '15', 'x': '313.13', 'y': '493.84', 'h': '247.27', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '505.87', 'h': '392.88', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '517.90', 'h': '372.43', 'w': '8.63'}], [{'page': '15', 'x': '543.97', 'y': '517.90', 'h': '15.31', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '529.93', 'h': '245.00', 'w': '8.63'}], [{'page': '15', 'x': '414.17', 'y': '529.93', 'h': '145.11', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '541.96', 'h': '54.19', 'w': '8.63'}]]\", 'pages': \"('15', '15')\", 'section_title': 'Supplementary Materials:', 'section_number': 'None', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='The MS raw data for this study are available at the ProteomeXchange Consortium doi PXD020235, 10.6019/PXD020235.', metadata={'text': 'The MS raw data for this study are available at the ProteomeXchange Consortium doi PXD020235, 10.6019/PXD020235.', 'para': '0', 'bboxes': \"[[{'page': '16', 'x': '285.36', 'y': '128.13', 'h': '273.91', 'w': '8.63'}, {'page': '16', 'x': '166.39', 'y': '139.85', 'h': '196.80', 'w': '8.63'}]]\", 'pages': \"('16', '16')\", 'section_title': 'Data Availability Statement:', 'section_number': 'None', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='The MS raw data for this study are available at the ProteomeXchange Consortium doi PXD020235, 10.6019/PXD020235.', metadata={'text': 'The MS raw data for this study are available at the ProteomeXchange Consortium doi PXD020235, 10.6019/PXD020235.', 'para': '0', 'bboxes': \"[[{'page': '16', 'x': '285.36', 'y': '128.13', 'h': '273.91', 'w': '8.63'}, {'page': '16', 'x': '166.39', 'y': '139.85', 'h': '196.80', 'w': '8.63'}]]\", 'pages': \"('16', '16')\", 'section_title': 'Data Availability Statement:', 'section_number': 'None', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='Author Contributions: Conceptualization, S.J., P.J., M.J.P. and J.M.M.; methodology, S.J., M.J.P. and J.R.A.; software, J.R.A. and M.J.P.; validation, S.J. and P.J.; formal analysis, A.B., M.M.A. and M.J.P.; investigation, S.J.; resources, P.J., A.B. and M.J.P.; data curation, S.J. and J.M.M.; writing-original draft preparation, S.J. and M.M.A.; writing-review and editing, M.J.P.; visualization, J.R.A.; supervision, P.J., M.J.P., J.M.M. and A.B.; project administration, P.J.; funding acquisition, P.J., A.B. and M.J.P.All authors have read and agreed to the published version of the manuscript.', metadata={'text': 'Author Contributions: Conceptualization, S.J., P.J., M.J.P. and J.M.M.; methodology, S.J., M.J.P. and J.R.A.; software, J.R.A. and M.J.P.; validation, S.J. and P.J.; formal analysis, A.B., M.M.A. and M.J.P.; investigation, S.J.; resources, P.J., A.B. and M.J.P.; data curation, S.J. and J.M.M.; writing-original draft preparation, S.J. and M.M.A.; writing-review and editing, M.J.P.; visualization, J.R.A.; supervision, P.J., M.J.P., J.M.M. and A.B.; project administration, P.J.; funding acquisition, P.J., A.B. 
and M.J.P.All authors have read and agreed to the published version of the manuscript.', 'para': '1', 'bboxes': \"[[{'page': '15', 'x': '166.04', 'y': '559.66', 'h': '393.23', 'w': '8.63'}, {'page': '15', 'x': '166.24', 'y': '571.37', 'h': '394.15', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '583.09', 'h': '392.88', 'w': '8.63'}, {'page': '15', 'x': '166.13', 'y': '594.81', 'h': '394.27', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '606.52', 'h': '378.39', 'w': '8.63'}], [{'page': '15', 'x': '547.04', 'y': '606.52', 'h': '12.24', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '618.24', 'h': '291.00', 'w': '8.63'}]]\", 'pages': \"('15', '15')\", 'section_title': 'Institutional Review Board Statement:', 'section_number': 'None', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='Funding: Sidrah Jahangir, Peter John, Attya Bhatti and Muhammad Muaaz Aslam were funded by Higher Education Commission (HEC), Pakistan, (grant number 5965).Mandy Peffers was funded through a Wellcome Trust Clinical Intermediate Fellowship (grant number 107471/Z/15/Z).This work was also supported by the MRC and Versus Arthritis as part of the Medical Research Council Versus Arthritis Centre for Integrated Research into Musculoskeletal Ageing (CIMA) (MR/R502182/1).James Anderson was funded by the Horserace betting Levy Board.', metadata={'text': 'Funding: Sidrah Jahangir, Peter John, Attya Bhatti and Muhammad Muaaz Aslam were funded by Higher Education Commission (HEC), Pakistan, (grant number 5965).Mandy Peffers was funded through a Wellcome Trust Clinical Intermediate Fellowship (grant number 107471/Z/15/Z).This work was also supported by the MRC and Versus Arthritis as part of the Medical Research Council Versus Arthritis Centre for Integrated Research into Musculoskeletal Ageing (CIMA) (MR/R502182/1).James Anderson was funded by the Horserace betting Levy Board.', 'para': '3', 'bboxes': \"[[{'page': '15', 'x': '166.39', 'y': '635.93', 'h': '393.23', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '647.65', 'h': '281.19', 'w': '8.63'}], [{'page': '15', 'x': '450.36', 'y': '647.65', 'h': '108.91', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '659.36', 'h': '373.15', 'w': '8.63'}], [{'page': '15', 'x': '541.81', 'y': '659.36', 'h': '17.47', 'w': '8.63'}, {'page': '15', 'x': '166.02', 'y': '671.08', 'h': '394.75', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '682.80', 'h': '394.45', 'w': '8.63'}], [{'page': '15', 'x': '166.24', 'y': '694.51', 'h': '264.08', 'w': '8.63'}]]\", 'pages': \"('15', '15')\", 'section_title': 'Institutional Review Board Statement:', 'section_number': 'None', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 
'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board of Atta-ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences and Technology (NUST), 44,000 before the commencement of study.', metadata={'text': 'The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board of Atta-ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences and Technology (NUST), 44,000 before the commencement of study.', 'para': '0', 'bboxes': \"[[{'page': '15', 'x': '324.87', 'y': '712.21', 'h': '234.41', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '723.92', 'h': '392.88', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '735.64', 'h': '392.88', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '747.35', 'h': '99.06', 'w': '8.63'}]]\", 'pages': \"('15', '15')\", 'section_title': 'Institutional Review Board Statement:', 'section_number': 'None', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.', metadata={'text': 'Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.', 'para': '0', 'bboxes': \"[[{'page': '16', 'x': '166.39', 'y': '98.72', 'h': '392.88', 'w': '8.63'}, {'page': '16', 'x': '166.39', 'y': '110.44', 'h': '38.52', 'w': '8.63'}]]\", 'pages': \"('16', '16')\", 'section_title': 'Institutional Review Board Statement:', 'section_number': 'None', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='The authors declare no conflict of interest.', metadata={'text': 'The authors declare no conflict of interest.', 'para': '0', 'bboxes': \"[[{'page': '16', 'x': '252.09', 'y': '157.54', 'h': '165.99', 'w': '8.63'}]]\", 'pages': \"('16', '16')\", 'section_title': 'Institutional Review Board Statement:', 'section_number': 'None', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board of Atta-ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences and Technology (NUST), 44,000 before the commencement of study.', metadata={'text': 'The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board of Atta-ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences and Technology (NUST), 44,000 before the commencement of study.', 'para': '0', 'bboxes': \"[[{'page': '15', 'x': '324.87', 'y': '712.21', 'h': '234.41', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '723.92', 'h': '392.88', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '735.64', 'h': '392.88', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '747.35', 'h': '99.06', 'w': '8.63'}]]\", 'pages': \"('15', '15')\", 'section_title': 'Institutional Review Board Statement:', 'section_number': 'None', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.', metadata={'text': 'Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.', 'para': '0', 'bboxes': \"[[{'page': '16', 'x': '166.39', 'y': '98.72', 'h': '392.88', 'w': '8.63'}, {'page': '16', 'x': '166.39', 'y': '110.44', 'h': '38.52', 'w': '8.63'}]]\", 'pages': \"('16', '16')\", 'section_title': 'Institutional Review Board Statement:', 'section_number': 'None', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
       " Document(page_content='The authors declare no conflict of interest.', metadata={'text': 'The authors declare no conflict of interest.', 'para': '0', 'bboxes': \"[[{'page': '16', 'x': '252.09', 'y': '157.54', 'h': '165.99', 'w': '8.63'}]]\", 'pages': \"('16', '16')\", 'section_title': 'Conflicts of Interest:', 'section_number': 'None', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'})]"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Type: stuff. \n",
      "The authors detected protein abundances by using a technique called quantitative proteomics, which involves the use of mass spectrometry to measure the amount of protein in a sample. The authors then compared the protein abundances in the samples to determine which proteins were most abundant and which ones were present at lower levels.\n",
      "Type: map_reduce. \n",
      "The authors detected protein abundances by using a technique called quantitative proteomics, which involves the use of mass spectrometry to measure the amount of protein in a sample. The authors then compared the protein abundances in the samples to determine which proteins were most abundant and which ones were present at lower levels.\n",
      "Type: refine. \n",
      "The authors detected protein abundances by using a technique called quantitative proteomics, which involves the use of mass spectrometry to measure the amount of protein in a sample. The authors then compared the protein abundances in the samples to determine which proteins were most abundant and which ones were present at lower levels.\n",
      "Type: map_rerank. \n",
      "The authors detected protein abundances by using a technique called quantitative proteomics, which involves the use of mass spectrometry to measure the amount of protein in a sample. The authors then compared the protein abundances in the samples to determine which proteins were most abundant and which ones were present at lower levels.\n"
     ]
    }
   ],
   "source": [
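    "import os\n",
    "\n",
    "from langchain import HuggingFaceHub\n",
    "from langchain.chains.question_answering import load_qa_chain\n",
    "\n",
    "# Read the Hugging Face API token from the environment rather than hardcoding it.\n",
    "HUGGINGFACE_TOKEN = os.environ[\"HUGGINGFACEHUB_API_TOKEN\"]\n",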
    "\n",
    "llm = HuggingFaceHub(\n",
    "    repo_id=\"tiiuae/falcon-7b-instruct\",\n",
    "    model_kwargs={\"temperature\": 0.1, \"max_new_tokens\": 80},\n",
    "    huggingfacehub_api_token=HUGGINGFACE_TOKEN\n",
    ")\n",
    "question = \"How did the authors detect protein abundances?\"\n",
    "\n",
    "# Chain types to compare against the \"stuff\" baseline.\n",
    "chain_types = [\"map_reduce\", \"refine\", \"map_rerank\"]\n",
    "\n",
    "# Baseline: place the selected documents into a single prompt.\n",
    "chain = load_qa_chain(llm, chain_type=\"stuff\")\n",
    "print(f\"\"\"Type: stuff. {chain({\"input_documents\": docs[1:3], \"question\": question}, return_only_outputs=True)[\"output_text\"]}\"\"\")\n",
    "\n",
    "for t in chain_types:\n",
    "    # Build a chain of the current type so each strategy is actually exercised.\n",
    "    chain = load_qa_chain(llm, chain_type=t)\n",
    "    # chain.llm_chain.prompt.template = \"\"\"question: {question}. context: {context}. answer: dummy answer.\"\"\"\n",
    "    print(f\"\"\"Type: {t}. {chain({\"input_documents\": docs[1:2], \"question\": question}, return_only_outputs=True)[\"output_text\"]}\"\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "Error raised by inference API: Model yhyhy3/med-orca-instruct-33b time out",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[1;32m/home/tommaso/llm4scilit/notebooks/test.ipynb Cell 15\u001b[0m line \u001b[0;36m1\n\u001b[1;32m     <a href='vscode-notebook-cell://ssh-remote%2Bstudents.datascience.ch/home/tommaso/llm4scilit/notebooks/test.ipynb#X23sdnNjb2RlLXJlbW90ZQ%3D%3D?line=12'>13</a>\u001b[0m chain_types \u001b[39m=\u001b[39m [\u001b[39m\"\u001b[39m\u001b[39mmap_reduce\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39m\"\u001b[39m\u001b[39mrefine\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39m\"\u001b[39m\u001b[39mmap_rerank\u001b[39m\u001b[39m\"\u001b[39m]\n\u001b[1;32m     <a href='vscode-notebook-cell://ssh-remote%2Bstudents.datascience.ch/home/tommaso/llm4scilit/notebooks/test.ipynb#X23sdnNjb2RlLXJlbW90ZQ%3D%3D?line=14'>15</a>\u001b[0m chain \u001b[39m=\u001b[39m load_qa_chain(llm, chain_type\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mstuff\u001b[39m\u001b[39m\"\u001b[39m)\n\u001b[0;32m---> <a href='vscode-notebook-cell://ssh-remote%2Bstudents.datascience.ch/home/tommaso/llm4scilit/notebooks/test.ipynb#X23sdnNjb2RlLXJlbW90ZQ%3D%3D?line=15'>16</a>\u001b[0m \u001b[39mprint\u001b[39m(\u001b[39mf\u001b[39m\u001b[39m\"\"\"\u001b[39m\u001b[39mType: stuff. 
\u001b[39m\u001b[39m{\u001b[39;00mchain({\u001b[39m\"\u001b[39;49m\u001b[39minput_documents\u001b[39;49m\u001b[39m\"\u001b[39;49m:\u001b[39m \u001b[39;49mdocs[\u001b[39m1\u001b[39;49m:\u001b[39m3\u001b[39;49m],\u001b[39m \u001b[39;49m\u001b[39m\"\u001b[39;49m\u001b[39mquestion\u001b[39;49m\u001b[39m\"\u001b[39;49m:\u001b[39m \u001b[39;49mquestion},\u001b[39m \u001b[39;49mreturn_only_outputs\u001b[39m=\u001b[39;49m\u001b[39mTrue\u001b[39;49;00m)[\u001b[39m\"\u001b[39m\u001b[39moutput_text\u001b[39m\u001b[39m\"\u001b[39m]\u001b[39m}\u001b[39;00m\u001b[39m\"\"\"\u001b[39m)\n\u001b[1;32m     <a href='vscode-notebook-cell://ssh-remote%2Bstudents.datascience.ch/home/tommaso/llm4scilit/notebooks/test.ipynb#X23sdnNjb2RlLXJlbW90ZQ%3D%3D?line=17'>18</a>\u001b[0m \u001b[39mfor\u001b[39;00m t \u001b[39min\u001b[39;00m chain_types:\n\u001b[1;32m     <a href='vscode-notebook-cell://ssh-remote%2Bstudents.datascience.ch/home/tommaso/llm4scilit/notebooks/test.ipynb#X23sdnNjb2RlLXJlbW90ZQ%3D%3D?line=18'>19</a>\u001b[0m     chain \u001b[39m=\u001b[39m load_qa_chain(llm, chain_type\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mstuff\u001b[39m\u001b[39m\"\u001b[39m)\n",
      "File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/chains/base.py:243\u001b[0m, in \u001b[0;36mChain.__call__\u001b[0;34m(self, inputs, return_only_outputs, callbacks, tags, metadata, include_run_info)\u001b[0m\n\u001b[1;32m    241\u001b[0m \u001b[39mexcept\u001b[39;00m (\u001b[39mKeyboardInterrupt\u001b[39;00m, \u001b[39mException\u001b[39;00m) \u001b[39mas\u001b[39;00m e:\n\u001b[1;32m    242\u001b[0m     run_manager\u001b[39m.\u001b[39mon_chain_error(e)\n\u001b[0;32m--> 243\u001b[0m     \u001b[39mraise\u001b[39;00m e\n\u001b[1;32m    244\u001b[0m run_manager\u001b[39m.\u001b[39mon_chain_end(outputs)\n\u001b[1;32m    245\u001b[0m final_outputs: Dict[\u001b[39mstr\u001b[39m, Any] \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mprep_outputs(\n\u001b[1;32m    246\u001b[0m     inputs, outputs, return_only_outputs\n\u001b[1;32m    247\u001b[0m )\n",
      "File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/chains/base.py:237\u001b[0m, in \u001b[0;36mChain.__call__\u001b[0;34m(self, inputs, return_only_outputs, callbacks, tags, metadata, include_run_info)\u001b[0m\n\u001b[1;32m    231\u001b[0m run_manager \u001b[39m=\u001b[39m callback_manager\u001b[39m.\u001b[39mon_chain_start(\n\u001b[1;32m    232\u001b[0m     dumpd(\u001b[39mself\u001b[39m),\n\u001b[1;32m    233\u001b[0m     inputs,\n\u001b[1;32m    234\u001b[0m )\n\u001b[1;32m    235\u001b[0m \u001b[39mtry\u001b[39;00m:\n\u001b[1;32m    236\u001b[0m     outputs \u001b[39m=\u001b[39m (\n\u001b[0;32m--> 237\u001b[0m         \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_call(inputs, run_manager\u001b[39m=\u001b[39;49mrun_manager)\n\u001b[1;32m    238\u001b[0m         \u001b[39mif\u001b[39;00m new_arg_supported\n\u001b[1;32m    239\u001b[0m         \u001b[39melse\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_call(inputs)\n\u001b[1;32m    240\u001b[0m     )\n\u001b[1;32m    241\u001b[0m \u001b[39mexcept\u001b[39;00m (\u001b[39mKeyboardInterrupt\u001b[39;00m, \u001b[39mException\u001b[39;00m) \u001b[39mas\u001b[39;00m e:\n\u001b[1;32m    242\u001b[0m     run_manager\u001b[39m.\u001b[39mon_chain_error(e)\n",
      "File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/chains/combine_documents/base.py:106\u001b[0m, in \u001b[0;36mBaseCombineDocumentsChain._call\u001b[0;34m(self, inputs, run_manager)\u001b[0m\n\u001b[1;32m    104\u001b[0m \u001b[39m# Other keys are assumed to be needed for LLM prediction\u001b[39;00m\n\u001b[1;32m    105\u001b[0m other_keys \u001b[39m=\u001b[39m {k: v \u001b[39mfor\u001b[39;00m k, v \u001b[39min\u001b[39;00m inputs\u001b[39m.\u001b[39mitems() \u001b[39mif\u001b[39;00m k \u001b[39m!=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39minput_key}\n\u001b[0;32m--> 106\u001b[0m output, extra_return_dict \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mcombine_docs(\n\u001b[1;32m    107\u001b[0m     docs, callbacks\u001b[39m=\u001b[39;49m_run_manager\u001b[39m.\u001b[39;49mget_child(), \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mother_keys\n\u001b[1;32m    108\u001b[0m )\n\u001b[1;32m    109\u001b[0m extra_return_dict[\u001b[39mself\u001b[39m\u001b[39m.\u001b[39moutput_key] \u001b[39m=\u001b[39m output\n\u001b[1;32m    110\u001b[0m \u001b[39mreturn\u001b[39;00m extra_return_dict\n",
      "File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/chains/combine_documents/stuff.py:165\u001b[0m, in \u001b[0;36mStuffDocumentsChain.combine_docs\u001b[0;34m(self, docs, callbacks, **kwargs)\u001b[0m\n\u001b[1;32m    163\u001b[0m inputs \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_get_inputs(docs, \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mkwargs)\n\u001b[1;32m    164\u001b[0m \u001b[39m# Call predict on the LLM.\u001b[39;00m\n\u001b[0;32m--> 165\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mllm_chain\u001b[39m.\u001b[39;49mpredict(callbacks\u001b[39m=\u001b[39;49mcallbacks, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49minputs), {}\n",
      "File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/chains/llm.py:252\u001b[0m, in \u001b[0;36mLLMChain.predict\u001b[0;34m(self, callbacks, **kwargs)\u001b[0m\n\u001b[1;32m    237\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39mpredict\u001b[39m(\u001b[39mself\u001b[39m, callbacks: Callbacks \u001b[39m=\u001b[39m \u001b[39mNone\u001b[39;00m, \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mkwargs: Any) \u001b[39m-\u001b[39m\u001b[39m>\u001b[39m \u001b[39mstr\u001b[39m:\n\u001b[1;32m    238\u001b[0m \u001b[39m    \u001b[39m\u001b[39m\"\"\"Format prompt with kwargs and pass to LLM.\u001b[39;00m\n\u001b[1;32m    239\u001b[0m \n\u001b[1;32m    240\u001b[0m \u001b[39m    Args:\u001b[39;00m\n\u001b[0;32m   (...)\u001b[0m\n\u001b[1;32m    250\u001b[0m \u001b[39m            completion = llm.predict(adjective=\"funny\")\u001b[39;00m\n\u001b[1;32m    251\u001b[0m \u001b[39m    \"\"\"\u001b[39;00m\n\u001b[0;32m--> 252\u001b[0m     \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39;49m(kwargs, callbacks\u001b[39m=\u001b[39;49mcallbacks)[\u001b[39mself\u001b[39m\u001b[39m.\u001b[39moutput_key]\n",
      "File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/chains/base.py:243\u001b[0m, in \u001b[0;36mChain.__call__\u001b[0;34m(self, inputs, return_only_outputs, callbacks, tags, metadata, include_run_info)\u001b[0m\n\u001b[1;32m    241\u001b[0m \u001b[39mexcept\u001b[39;00m (\u001b[39mKeyboardInterrupt\u001b[39;00m, \u001b[39mException\u001b[39;00m) \u001b[39mas\u001b[39;00m e:\n\u001b[1;32m    242\u001b[0m     run_manager\u001b[39m.\u001b[39mon_chain_error(e)\n\u001b[0;32m--> 243\u001b[0m     \u001b[39mraise\u001b[39;00m e\n\u001b[1;32m    244\u001b[0m run_manager\u001b[39m.\u001b[39mon_chain_end(outputs)\n\u001b[1;32m    245\u001b[0m final_outputs: Dict[\u001b[39mstr\u001b[39m, Any] \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mprep_outputs(\n\u001b[1;32m    246\u001b[0m     inputs, outputs, return_only_outputs\n\u001b[1;32m    247\u001b[0m )\n",
      "File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/chains/base.py:237\u001b[0m, in \u001b[0;36mChain.__call__\u001b[0;34m(self, inputs, return_only_outputs, callbacks, tags, metadata, include_run_info)\u001b[0m\n\u001b[1;32m    231\u001b[0m run_manager \u001b[39m=\u001b[39m callback_manager\u001b[39m.\u001b[39mon_chain_start(\n\u001b[1;32m    232\u001b[0m     dumpd(\u001b[39mself\u001b[39m),\n\u001b[1;32m    233\u001b[0m     inputs,\n\u001b[1;32m    234\u001b[0m )\n\u001b[1;32m    235\u001b[0m \u001b[39mtry\u001b[39;00m:\n\u001b[1;32m    236\u001b[0m     outputs \u001b[39m=\u001b[39m (\n\u001b[0;32m--> 237\u001b[0m         \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_call(inputs, run_manager\u001b[39m=\u001b[39;49mrun_manager)\n\u001b[1;32m    238\u001b[0m         \u001b[39mif\u001b[39;00m new_arg_supported\n\u001b[1;32m    239\u001b[0m         \u001b[39melse\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_call(inputs)\n\u001b[1;32m    240\u001b[0m     )\n\u001b[1;32m    241\u001b[0m \u001b[39mexcept\u001b[39;00m (\u001b[39mKeyboardInterrupt\u001b[39;00m, \u001b[39mException\u001b[39;00m) \u001b[39mas\u001b[39;00m e:\n\u001b[1;32m    242\u001b[0m     run_manager\u001b[39m.\u001b[39mon_chain_error(e)\n",
      "File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/chains/llm.py:92\u001b[0m, in \u001b[0;36mLLMChain._call\u001b[0;34m(self, inputs, run_manager)\u001b[0m\n\u001b[1;32m     87\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39m_call\u001b[39m(\n\u001b[1;32m     88\u001b[0m     \u001b[39mself\u001b[39m,\n\u001b[1;32m     89\u001b[0m     inputs: Dict[\u001b[39mstr\u001b[39m, Any],\n\u001b[1;32m     90\u001b[0m     run_manager: Optional[CallbackManagerForChainRun] \u001b[39m=\u001b[39m \u001b[39mNone\u001b[39;00m,\n\u001b[1;32m     91\u001b[0m ) \u001b[39m-\u001b[39m\u001b[39m>\u001b[39m Dict[\u001b[39mstr\u001b[39m, \u001b[39mstr\u001b[39m]:\n\u001b[0;32m---> 92\u001b[0m     response \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mgenerate([inputs], run_manager\u001b[39m=\u001b[39;49mrun_manager)\n\u001b[1;32m     93\u001b[0m     \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mcreate_outputs(response)[\u001b[39m0\u001b[39m]\n",
      "File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/chains/llm.py:102\u001b[0m, in \u001b[0;36mLLMChain.generate\u001b[0;34m(self, input_list, run_manager)\u001b[0m\n\u001b[1;32m    100\u001b[0m \u001b[39m\u001b[39m\u001b[39m\"\"\"Generate LLM result from inputs.\"\"\"\u001b[39;00m\n\u001b[1;32m    101\u001b[0m prompts, stop \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mprep_prompts(input_list, run_manager\u001b[39m=\u001b[39mrun_manager)\n\u001b[0;32m--> 102\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mllm\u001b[39m.\u001b[39;49mgenerate_prompt(\n\u001b[1;32m    103\u001b[0m     prompts,\n\u001b[1;32m    104\u001b[0m     stop,\n\u001b[1;32m    105\u001b[0m     callbacks\u001b[39m=\u001b[39;49mrun_manager\u001b[39m.\u001b[39;49mget_child() \u001b[39mif\u001b[39;49;00m run_manager \u001b[39melse\u001b[39;49;00m \u001b[39mNone\u001b[39;49;00m,\n\u001b[1;32m    106\u001b[0m     \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49m\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mllm_kwargs,\n\u001b[1;32m    107\u001b[0m )\n",
      "File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/llms/base.py:188\u001b[0m, in \u001b[0;36mBaseLLM.generate_prompt\u001b[0;34m(self, prompts, stop, callbacks, **kwargs)\u001b[0m\n\u001b[1;32m    180\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39mgenerate_prompt\u001b[39m(\n\u001b[1;32m    181\u001b[0m     \u001b[39mself\u001b[39m,\n\u001b[1;32m    182\u001b[0m     prompts: List[PromptValue],\n\u001b[0;32m   (...)\u001b[0m\n\u001b[1;32m    185\u001b[0m     \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mkwargs: Any,\n\u001b[1;32m    186\u001b[0m ) \u001b[39m-\u001b[39m\u001b[39m>\u001b[39m LLMResult:\n\u001b[1;32m    187\u001b[0m     prompt_strings \u001b[39m=\u001b[39m [p\u001b[39m.\u001b[39mto_string() \u001b[39mfor\u001b[39;00m p \u001b[39min\u001b[39;00m prompts]\n\u001b[0;32m--> 188\u001b[0m     \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mgenerate(prompt_strings, stop\u001b[39m=\u001b[39;49mstop, callbacks\u001b[39m=\u001b[39;49mcallbacks, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n",
      "File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/llms/base.py:281\u001b[0m, in \u001b[0;36mBaseLLM.generate\u001b[0;34m(self, prompts, stop, callbacks, tags, metadata, **kwargs)\u001b[0m\n\u001b[1;32m    275\u001b[0m         \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[1;32m    276\u001b[0m             \u001b[39m\"\u001b[39m\u001b[39mAsked to cache, but no cache found at `langchain.cache`.\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m    277\u001b[0m         )\n\u001b[1;32m    278\u001b[0m     run_managers \u001b[39m=\u001b[39m callback_manager\u001b[39m.\u001b[39mon_llm_start(\n\u001b[1;32m    279\u001b[0m         dumpd(\u001b[39mself\u001b[39m), prompts, invocation_params\u001b[39m=\u001b[39mparams, options\u001b[39m=\u001b[39moptions\n\u001b[1;32m    280\u001b[0m     )\n\u001b[0;32m--> 281\u001b[0m     output \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_generate_helper(\n\u001b[1;32m    282\u001b[0m         prompts, stop, run_managers, \u001b[39mbool\u001b[39;49m(new_arg_supported), \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs\n\u001b[1;32m    283\u001b[0m     )\n\u001b[1;32m    284\u001b[0m     \u001b[39mreturn\u001b[39;00m output\n\u001b[1;32m    285\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mlen\u001b[39m(missing_prompts) \u001b[39m>\u001b[39m \u001b[39m0\u001b[39m:\n",
      "File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/llms/base.py:225\u001b[0m, in \u001b[0;36mBaseLLM._generate_helper\u001b[0;34m(self, prompts, stop, run_managers, new_arg_supported, **kwargs)\u001b[0m\n\u001b[1;32m    223\u001b[0m     \u001b[39mfor\u001b[39;00m run_manager \u001b[39min\u001b[39;00m run_managers:\n\u001b[1;32m    224\u001b[0m         run_manager\u001b[39m.\u001b[39mon_llm_error(e)\n\u001b[0;32m--> 225\u001b[0m     \u001b[39mraise\u001b[39;00m e\n\u001b[1;32m    226\u001b[0m flattened_outputs \u001b[39m=\u001b[39m output\u001b[39m.\u001b[39mflatten()\n\u001b[1;32m    227\u001b[0m \u001b[39mfor\u001b[39;00m manager, flattened_output \u001b[39min\u001b[39;00m \u001b[39mzip\u001b[39m(run_managers, flattened_outputs):\n",
      "File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/llms/base.py:212\u001b[0m, in \u001b[0;36mBaseLLM._generate_helper\u001b[0;34m(self, prompts, stop, run_managers, new_arg_supported, **kwargs)\u001b[0m\n\u001b[1;32m    202\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39m_generate_helper\u001b[39m(\n\u001b[1;32m    203\u001b[0m     \u001b[39mself\u001b[39m,\n\u001b[1;32m    204\u001b[0m     prompts: List[\u001b[39mstr\u001b[39m],\n\u001b[0;32m   (...)\u001b[0m\n\u001b[1;32m    208\u001b[0m     \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mkwargs: Any,\n\u001b[1;32m    209\u001b[0m ) \u001b[39m-\u001b[39m\u001b[39m>\u001b[39m LLMResult:\n\u001b[1;32m    210\u001b[0m     \u001b[39mtry\u001b[39;00m:\n\u001b[1;32m    211\u001b[0m         output \u001b[39m=\u001b[39m (\n\u001b[0;32m--> 212\u001b[0m             \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_generate(\n\u001b[1;32m    213\u001b[0m                 prompts,\n\u001b[1;32m    214\u001b[0m                 stop\u001b[39m=\u001b[39;49mstop,\n\u001b[1;32m    215\u001b[0m                 \u001b[39m# TODO: support multiple run managers\u001b[39;49;00m\n\u001b[1;32m    216\u001b[0m                 run_manager\u001b[39m=\u001b[39;49mrun_managers[\u001b[39m0\u001b[39;49m] \u001b[39mif\u001b[39;49;00m run_managers \u001b[39melse\u001b[39;49;00m \u001b[39mNone\u001b[39;49;00m,\n\u001b[1;32m    217\u001b[0m                 \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs,\n\u001b[1;32m    218\u001b[0m             )\n\u001b[1;32m    219\u001b[0m             \u001b[39mif\u001b[39;00m new_arg_supported\n\u001b[1;32m    220\u001b[0m             \u001b[39melse\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_generate(prompts, stop\u001b[39m=\u001b[39mstop)\n\u001b[1;32m    221\u001b[0m         )\n\u001b[1;32m    222\u001b[0m     \u001b[39mexcept\u001b[39;00m (\u001b[39mKeyboardInterrupt\u001b[39;00m, \u001b[39mException\u001b[39;00m) \u001b[39mas\u001b[39;00m 
e:\n\u001b[1;32m    223\u001b[0m         \u001b[39mfor\u001b[39;00m run_manager \u001b[39min\u001b[39;00m run_managers:\n",
      "File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/llms/base.py:604\u001b[0m, in \u001b[0;36mLLM._generate\u001b[0;34m(self, prompts, stop, run_manager, **kwargs)\u001b[0m\n\u001b[1;32m    601\u001b[0m new_arg_supported \u001b[39m=\u001b[39m inspect\u001b[39m.\u001b[39msignature(\u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_call)\u001b[39m.\u001b[39mparameters\u001b[39m.\u001b[39mget(\u001b[39m\"\u001b[39m\u001b[39mrun_manager\u001b[39m\u001b[39m\"\u001b[39m)\n\u001b[1;32m    602\u001b[0m \u001b[39mfor\u001b[39;00m prompt \u001b[39min\u001b[39;00m prompts:\n\u001b[1;32m    603\u001b[0m     text \u001b[39m=\u001b[39m (\n\u001b[0;32m--> 604\u001b[0m         \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_call(prompt, stop\u001b[39m=\u001b[39;49mstop, run_manager\u001b[39m=\u001b[39;49mrun_manager, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n\u001b[1;32m    605\u001b[0m         \u001b[39mif\u001b[39;00m new_arg_supported\n\u001b[1;32m    606\u001b[0m         \u001b[39melse\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_call(prompt, stop\u001b[39m=\u001b[39mstop, \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mkwargs)\n\u001b[1;32m    607\u001b[0m     )\n\u001b[1;32m    608\u001b[0m     generations\u001b[39m.\u001b[39mappend([Generation(text\u001b[39m=\u001b[39mtext)])\n\u001b[1;32m    609\u001b[0m \u001b[39mreturn\u001b[39;00m LLMResult(generations\u001b[39m=\u001b[39mgenerations)\n",
      "File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/llms/huggingface_hub.py:113\u001b[0m, in \u001b[0;36mHuggingFaceHub._call\u001b[0;34m(self, prompt, stop, run_manager, **kwargs)\u001b[0m\n\u001b[1;32m    111\u001b[0m response \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mclient(inputs\u001b[39m=\u001b[39mprompt, params\u001b[39m=\u001b[39mparams)\n\u001b[1;32m    112\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39m\"\u001b[39m\u001b[39merror\u001b[39m\u001b[39m\"\u001b[39m \u001b[39min\u001b[39;00m response:\n\u001b[0;32m--> 113\u001b[0m     \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mError raised by inference API: \u001b[39m\u001b[39m{\u001b[39;00mresponse[\u001b[39m'\u001b[39m\u001b[39merror\u001b[39m\u001b[39m'\u001b[39m]\u001b[39m}\u001b[39;00m\u001b[39m\"\u001b[39m)\n\u001b[1;32m    114\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mclient\u001b[39m.\u001b[39mtask \u001b[39m==\u001b[39m \u001b[39m\"\u001b[39m\u001b[39mtext-generation\u001b[39m\u001b[39m\"\u001b[39m:\n\u001b[1;32m    115\u001b[0m     \u001b[39m# Text generation return includes the starter text.\u001b[39;00m\n\u001b[1;32m    116\u001b[0m     text \u001b[39m=\u001b[39m response[\u001b[39m0\u001b[39m][\u001b[39m\"\u001b[39m\u001b[39mgenerated_text\u001b[39m\u001b[39m\"\u001b[39m][\u001b[39mlen\u001b[39m(prompt) :]\n",
      "\u001b[0;31mValueError\u001b[0m: Error raised by inference API: Model yhyhy3/med-orca-instruct-33b time out"
     ]
    },
    {
     "ename": "",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31mThe Kernel crashed while executing code in the the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click <a href='https://aka.ms/vscodeJupyterKernelCrash'>here</a> for more info. View Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
     ]
    }
   ],
   "source": [
    "from langchain import HuggingFaceHub\n",
    "from langchain.chains.question_answering import load_qa_chain\n",
    "\n",
    "HUGGINGFACE_TOKEN = \"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\"\n",
    "\n",
    "llm = HuggingFaceHub(\n",
    "    repo_id=\"yhyhy3/med-orca-instruct-33b\",\n",
    "    model_kwargs={\"temperature\": 0.1, \"max_new_tokens\": 80},\n",
    "    huggingfacehub_api_token=HUGGINGFACE_TOKEN\n",
    ")\n",
    "question = \"How did the authors detect protein abundances?\"\n",
    "\n",
    "chain_types = [\"map_reduce\", \"refine\", \"map_rerank\"]\n",
    "\n",
    "chain = load_qa_chain(llm, chain_type=\"stuff\")\n",
    "print(f\"\"\"Type: stuff. {chain({\"input_documents\": docs[1:3], \"question\": question}, return_only_outputs=True)[\"output_text\"]}\"\"\")\n",
    "\n",
    "for t in chain_types:\n",
    "    chain = load_qa_chain(llm, chain_type=\"stuff\")\n",
    "    # chain.llm_chain.prompt.template = \"\"\"question: {question}. context: {context}. answer: dummy answer.\"\"\"\n",
    "    print(f\"\"\"Type: {t}. {chain({\"input_documents\": docs[1:2], \"question\": question}, return_only_outputs=True)[\"output_text\"]}\"\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain import HuggingFaceHub\n",
    "from langchain.chains.question_answering import load_qa_chain\n",
    "\n",
    "HUGGINGFACE_TOKEN = \"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\"\n",
    "\n",
    "llm = HuggingFaceHub(\n",
    "    # repo_id=\"tiiuae/falcon-7b-instruct\",\n",
    "    repo_id=\"yhyhy3/open_llama_7b_v2_med_instruct\",\n",
    "    model_kwargs={\"temperature\": 0.1, \"max_new_tokens\": 80},\n",
    "    huggingfacehub_api_token=HUGGINGFACE_TOKEN\n",
    ")\n",
    "question = \"How did the authors detect protein abundances?\"\n",
    "\n",
    "chain_types = [\"map_reduce\", \"refine\", \"map_rerank\"]\n",
    "\n",
    "chain = load_qa_chain(llm, chain_type=\"stuff\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\\n\\n{context}\\n\\nQuestion: {question}\\nHelpful Answer:\""
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chain.llm_chain.prompt.template"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "`run` supported with either positional arguments or keyword arguments, but none were provided.",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[1;32m/home/tommaso/llm4scilit/notebooks/test.ipynb Cell 16\u001b[0m line \u001b[0;36m1\n\u001b[0;32m----> <a href='vscode-notebook-cell://ssh-remote%2Bstudents.datascience.ch/home/tommaso/llm4scilit/notebooks/test.ipynb#X22sdnNjb2RlLXJlbW90ZQ%3D%3D?line=0'>1</a>\u001b[0m chain\u001b[39m.\u001b[39;49mrun()\n",
      "File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/chains/base.py:450\u001b[0m, in \u001b[0;36mChain.run\u001b[0;34m(self, callbacks, tags, metadata, *args, **kwargs)\u001b[0m\n\u001b[1;32m    445\u001b[0m     \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39m(kwargs, callbacks\u001b[39m=\u001b[39mcallbacks, tags\u001b[39m=\u001b[39mtags, metadata\u001b[39m=\u001b[39mmetadata)[\n\u001b[1;32m    446\u001b[0m         _output_key\n\u001b[1;32m    447\u001b[0m     ]\n\u001b[1;32m    449\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mnot\u001b[39;00m kwargs \u001b[39mand\u001b[39;00m \u001b[39mnot\u001b[39;00m args:\n\u001b[0;32m--> 450\u001b[0m     \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[1;32m    451\u001b[0m         \u001b[39m\"\u001b[39m\u001b[39m`run` supported with either positional arguments or keyword arguments,\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m    452\u001b[0m         \u001b[39m\"\u001b[39m\u001b[39m but none were provided.\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m    453\u001b[0m     )\n\u001b[1;32m    454\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[1;32m    455\u001b[0m     \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[1;32m    456\u001b[0m         \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m`run` supported with either positional arguments or keyword arguments\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m    457\u001b[0m         \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m but not both. Got args: \u001b[39m\u001b[39m{\u001b[39;00margs\u001b[39m}\u001b[39;00m\u001b[39m and kwargs: \u001b[39m\u001b[39m{\u001b[39;00mkwargs\u001b[39m}\u001b[39;00m\u001b[39m.\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m    458\u001b[0m     )\n",
      "\u001b[0;31mValueError\u001b[0m: `run` supported with either positional arguments or keyword arguments, but none were provided."
     ]
    }
   ],
   "source": [
    "chain.run()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'{context}\\n{question} '"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain import PromptTemplate\n",
    "\n",
    "template = \"\"\"{context}\\n{question} \"\"\"\n",
    "\n",
    "prompt_template = PromptTemplate(\n",
    "    template=template,\n",
    "    input_variables=[\"context\", \"question\"],\n",
    ")\n",
    "\n",
    "load_qa_chain(llm, chain_type=\"stuff\", prompt=prompt_template).llm_chain.prompt.template"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}