{
"cells": [
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"DATA_PATH = Path(\"/data/tommaso/data\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', metadata={'source': PosixPath('/data/tommaso/data/papers_processed/1.txt'), 'filename': '1.txt', 'file_directory': '/data/tommaso/data/papers_processed', 'filetype': 'text/plain', 'category': 'UncategorizedText'}),\n",
" Document(page_content='Abstract: Rheumatoid arthritis is an autoimmune disorder of complex disease etiology. Currently available serological diagnostic markers lack in terms of sensitivity and specificity and thus addi- tional biomarkers are warranted for early disease diagnosis and management. We aimed to screen and compare serum proteome profiles of rheumatoid arthritis serotypes with healthy controls in the Pakistani population for identification of potential disease biomarkers. Serum samples from rheumatoid arthritis patients and healthy controls were enriched for low abundance proteins using ProteoMinerTM columns. Rheumatoid arthritis patients were assigned to one of the four serotypes based on anti-citrullinated peptide antibodies and rheumatoid factor. Serum protein profiles were ana- lyzed via liquid chromatography-tandem mass spectrometry. The changes in the protein abundances were determined using label-free quantification software ProgenesisQITM followed by pathway analysis. Findings were validated in an independent cohort of patients and healthy controls using an enzyme-linked immunosorbent assay. A total of 213 proteins were identified.', metadata={'source': PosixPath('/data/tommaso/data/papers_processed/1.txt'), 'filename': '1.txt', 'file_directory': '/data/tommaso/data/papers_processed', 'filetype': 'text/plain', 'category': 'NarrativeText'}),\n",
" Document(page_content='Comparative analysis of all groups (false discovery rate < 0.05, >2-fold change, and identified with ≥2 unique peptides) identified ten proteins that were differentially expressed between rheumatoid arthritis serotypes and healthy controls including pregnancy zone protein, selenoprotein P, C4b-binding protein beta chain, apolipoprotein M, N-acetylmuramoyl-L-alanine amidase, catalytic chain, oncoprotein-induced transcript 3 protein, Carboxypeptidase N subunit 2, Apolipoprotein C-I and Apolipoprotein C-III. Pathway analysis predicted inhibition of liver X receptor/retinoid X receptor activation pathway and production of nitric oxide and reactive oxygen species pathway in macrophages in all serotypes. A catalogue of potential serum biomarkers for rheumatoid arthritis were identified. These biomark- ers can be further evaluated in larger cohorts from different populations for their diagnostic and prognostic potential.', metadata={'source': PosixPath('/data/tommaso/data/papers_processed/1.txt'), 'filename': '1.txt', 'file_directory': '/data/tommaso/data/papers_processed', 'filetype': 'text/plain', 'category': 'NarrativeText'}),\n",
" Document(page_content='Keywords: rheumatoid arthritis; serum; proteomics; biomarkers; LC-MS', metadata={'source': PosixPath('/data/tommaso/data/papers_processed/1.txt'), 'filename': '1.txt', 'file_directory': '/data/tommaso/data/papers_processed', 'filetype': 'text/plain', 'category': 'Title'})]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.document_loaders import UnstructuredFileLoader\n",
"from unstructured.cleaners.core import clean_extra_whitespace, group_broken_paragraphs\n",
"\n",
"loader = UnstructuredFileLoader(\n",
" DATA_PATH / \"papers_processed\" / \"1.txt\",\n",
" strategy=\"hi_res\",\n",
" mode=\"elements\",\n",
" post_processors=[\n",
" clean_extra_whitespace,\n",
" group_broken_paragraphs,\n",
" ])\n",
"docs = loader.load()\n",
"docs[:4]"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.document_loaders.parsers import GrobidParser\n",
"from langchain.document_loaders.generic import GenericLoader\n",
"\n",
"loader = GenericLoader.from_filesystem(\n",
" DATA_PATH / \"papers\",\n",
" glob=\"1.pdf\",\n",
" suffixes=[\".pdf\"],\n",
" parser=GrobidParser(segment_sentences=False),\n",
")\n",
"docs = loader.load()\n",
"docs"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import spacy\n",
"spacy.require_gpu(gpu_id=1)\n",
"\n",
"import spacy_transformers # needed by SpacyTextSplitter when using the en_core_web_trf pipeline\n",
"from langchain.text_splitter import SpacyTextSplitter\n",
"from itertools import chain\n",
"\n",
"splitter = SpacyTextSplitter(chunk_size=1000, pipeline=\"en_core_web_trf\")\n",
"chunks = splitter.split_documents(docs)\n",
"chunks[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## BioBERT"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Rheumatoid factor (RF) and anti-citrullinated peptide antibodies (ACPA) are considered as the main serological markers for RA that have been included in the 2010 American College of Rheumatology (ACR)/European League against Rheumatism (EULAR) classification criteria for RA [7][8][9].Based on 2010 ACR/EULAR classification criteria for RA, clinically diagnosed RA patients can be categorized into four serotypes: (i) positive for both RF and ACPA, (ii) positive for RF and negative for ACPA, (iii) negative for RF and positive for ACPA and (iv) negative for both RF and ACPA.However, the levels of RF are also perturbed in connective tissue diseases [10] and some chronic infectious diseases such as hepatitis B and hepatitis C virus infections [11].RF is thus not a specific diagnostic marker for RA.ACPA is comparatively a more specific biomarker and two-thirds of the individuals ultimately diagnosed with RA were tested positive for ACPAs 6-10 years before diagnosis [12,13].A total of 1-3% of the healthy population may also test positive for ACPAs suggesting the decreased specificity of this biomarker [14][15][16][17].Therefore, it is important to discover the biomarkers for the diagnosis of RA with both increased sensitivity and specificity.'"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[1].page_content"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Use a pipeline as a high-level helper\n",
"from transformers import pipeline\n",
"\n",
"pipe = pipeline(\"question-answering\", model=\"dmis-lab/biobert-large-cased-v1.1-squad\", device=1, handle_impossible_answer=True, max_seq_len=512)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"BertForQuestionAnswering(\n",
" (bert): BertModel(\n",
" (embeddings): BertEmbeddings(\n",
" (word_embeddings): Embedding(58996, 1024, padding_idx=0)\n",
" (position_embeddings): Embedding(512, 1024)\n",
" (token_type_embeddings): Embedding(2, 1024)\n",
" (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)\n",
" (dropout): Dropout(p=0.1, inplace=False)\n",
" )\n",
" (encoder): BertEncoder(\n",
" (layer): ModuleList(\n",
" (0-23): 24 x BertLayer(\n",
" (attention): BertAttention(\n",
" (self): BertSelfAttention(\n",
" (query): Linear(in_features=1024, out_features=1024, bias=True)\n",
" (key): Linear(in_features=1024, out_features=1024, bias=True)\n",
" (value): Linear(in_features=1024, out_features=1024, bias=True)\n",
" (dropout): Dropout(p=0.1, inplace=False)\n",
" )\n",
" (output): BertSelfOutput(\n",
" (dense): Linear(in_features=1024, out_features=1024, bias=True)\n",
" (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)\n",
" (dropout): Dropout(p=0.1, inplace=False)\n",
" )\n",
" )\n",
" (intermediate): BertIntermediate(\n",
" (dense): Linear(in_features=1024, out_features=4096, bias=True)\n",
" (intermediate_act_fn): GELUActivation()\n",
" )\n",
" (output): BertOutput(\n",
" (dense): Linear(in_features=4096, out_features=1024, bias=True)\n",
" (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)\n",
" (dropout): Dropout(p=0.1, inplace=False)\n",
" )\n",
" )\n",
" )\n",
" )\n",
" )\n",
" (qa_outputs): Linear(in_features=1024, out_features=2, bias=True)\n",
")"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe.model"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Question: How did the authors detect protein abundances?\n",
"Answer 1 (score: 0.121): 'Mass spectrometry (MS)-based serum proteomics'\n",
"Answer 2 (score: 0.114): 'ProgenesisQITM followed by pathway analysis'\n",
"\n",
"\n",
"Question: How can RA patients be categorized?\n",
"Answer 1 (score: 0.377): 'four serotypes'\n",
"Answer 2 (score: 0.320): 'into four serotypes'\n",
"\n"
]
}
],
"source": [
"questions = [\n",
" \"How did the authors detect protein abundances?\",\n",
" \"How can RA patients be categorized?\"\n",
"]\n",
"context = \"\\n\".join([x.page_content for x in docs])\n",
"\n",
"for q in questions:\n",
" a = pipe(question=q, context=context, top_k=2)\n",
" print(f'''\n",
"Question: {q}\n",
"Answer 1 (score: {a[0][\"score\"]:.3f}): '{a[0][\"answer\"]}'\n",
"Answer 2 (score: {a[1][\"score\"]:.3f}): '{a[1][\"answer\"]}'\n",
"''')\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'score': 0.12108789384365082,\n",
" 'start': 4854,\n",
" 'end': 4899,\n",
" 'answer': 'Mass spectrometry (MS)-based serum proteomics'}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"context = \"\\n\".join([x.page_content for x in docs])\n",
"pipe(question=\"How did the authors detect protein abundances?\", context=context)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## BioGPT"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"from langchain import HuggingFaceHub, HuggingFacePipeline\n",
"\n",
"HUGGINGFACE_TOKEN = \"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\"\n",
"\n",
"# llm = HuggingFacePipeline.from_model_id(\n",
"# model_id=\"stanford-crfm/BioMedLM\",\n",
"# task=\"text-generation\",\n",
"# device=1,\n",
"# model_kwargs={\"temperature\": 0},\n",
"# )\n",
"\n",
"from langchain import PromptTemplate, LLMChain\n",
"\n",
"template = \"\"\"You are a useful and reliableQuestion: {question}\n",
"Context: {context}\"\"\"\n",
"prompt = PromptTemplate(template=template, input_variables=[\"question\", \"context\"])\n",
"llm = HuggingFaceHub(\n",
" repo_id=\"microsoft/BioGPT-Large-PubMedQA\",\n",
" model_kwargs={\"temperature\": 0.1, \"max_length\":200},\n",
" huggingfacehub_api_token=HUGGINGFACE_TOKEN\n",
")\n",
"llm_chain = LLMChain(prompt=prompt, llm=llm)\n",
"question = \"How did the authors detect protein abundances?\"\n",
"context = \"\\n\".join([x.page_content for x in chunks])\n",
"\n",
"# print(llm_chain.run(question=question, context=context))"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Rheumatoid arthritis (RA) is an autoimmune disorder of complex disease etiology.RA leads to the inflammation of joints and surrounding synovial membrane [1].The global prevalence rate of RA is 0.24% and RA has been ranked as the 42nd highest contributor to global disability [2].Diagnosing RA is a highly individualized process and is based on a combination of both clinical manifestations and serological assays.Early disease diagnosis is the key to prevent joint damage and permanent physical disability in RA [3].RA is considered to be a continuum that begins with a disease-susceptibility stage characterized by a combination of genetic risk factors.This stage proceeds through a pre-clinical stage before the development of early RA characterized by articular inflammation.Environmental and microbial triggers continuously operate across this continuum.Immune-mediated etiology associated with stromal tissue dysregulation contributes to the chronic inflammation and ultimate articular destruction that is identified as established RA [4,5].A number of proteins and pathways have been linked to the disease pathogenesis of RA.However, there are still some gaps in current knowledge.Research aimed at the better clarification of these mechanisms can enable the development of more specific disease-modifying therapies [6].', metadata={'text': 'Rheumatoid arthritis (RA) is an autoimmune disorder of complex disease etiology.RA leads to the inflammation of joints and surrounding synovial membrane [1].The global prevalence rate of RA is 0.24% and RA has been ranked as the 42nd highest contributor to global disability [2].Diagnosing RA is a highly individualized process and is based on a combination of both clinical manifestations and serological assays.Early disease diagnosis is the key to prevent joint damage and permanent physical disability in RA [3].RA is considered to be a continuum that begins with a disease-susceptibility stage characterized by a 
combination of genetic risk factors.This stage proceeds through a pre-clinical stage before the development of early RA characterized by articular inflammation.Environmental and microbial triggers continuously operate across this continuum.Immune-mediated etiology associated with stromal tissue dysregulation contributes to the chronic inflammation and ultimate articular destruction that is identified as established RA [4,5].A number of proteins and pathways have been linked to the disease pathogenesis of RA.However, there are still some gaps in current knowledge.Research aimed at the better clarification of these mechanisms can enable the development of more specific disease-modifying therapies [6].', 'para': '11', 'bboxes': \"[[{'page': '1', 'x': '187.65', 'y': '696.70', 'h': '354.85', 'w': '9.58'}], [{'page': '1', 'x': '545.55', 'y': '696.70', 'h': '14.12', 'w': '9.58'}, {'page': '1', 'x': '166.39', 'y': '709.26', 'h': '341.80', 'w': '9.58'}], [{'page': '1', 'x': '511.79', 'y': '709.26', 'h': '47.49', 'w': '9.58'}, {'page': '1', 'x': '166.10', 'y': '721.81', 'h': '393.18', 'w': '9.58'}, {'page': '1', 'x': '166.39', 'y': '734.36', 'h': '88.77', 'w': '9.58'}], [{'page': '1', 'x': '258.26', 'y': '734.36', 'h': '301.02', 'w': '9.58'}, {'page': '1', 'x': '166.39', 'y': '746.91', 'h': '288.55', 'w': '9.58'}], [{'page': '1', 'x': '458.05', 'y': '746.91', 'h': '101.22', 'w': '9.58'}, {'page': '1', 'x': '166.39', 'y': '759.47', 'h': '346.80', 'w': '9.58'}], [{'page': '2', 'x': '187.65', 'y': '98.05', 'h': '371.62', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '110.60', 'h': '248.38', 'w': '9.58'}], [{'page': '2', 'x': '420.94', 'y': '110.60', 'h': '138.33', 'w': '9.58'}, {'page': '2', 'x': '166.10', 'y': '123.15', 'h': '394.83', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '135.71', 'h': '20.27', 'w': '9.58'}], [{'page': '2', 'x': '190.03', 'y': '135.71', 'h': '370.99', 'w': '9.58'}], [{'page': '2', 'x': '166.39', 'y': '148.26', 'h': '392.89', 'w': '9.58'}, {'page': 
'2', 'x': '166.39', 'y': '160.81', 'h': '392.88', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '173.37', 'h': '38.95', 'w': '9.58'}], [{'page': '2', 'x': '208.46', 'y': '173.37', 'h': '352.47', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '185.92', 'h': '47.87', 'w': '9.58'}], [{'page': '2', 'x': '216.91', 'y': '185.92', 'h': '256.92', 'w': '9.58'}], [{'page': '2', 'x': '477.36', 'y': '185.92', 'h': '81.91', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '198.47', 'h': '392.88', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '211.02', 'h': '141.30', 'w': '9.58'}]]\", 'pages': \"('1', '2')\", 'section_title': 'Introduction', 'section_number': '1.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='Rheumatoid factor (RF) and anti-citrullinated peptide antibodies (ACPA) are considered as the main serological markers for RA that have been included in the 2010 American College of Rheumatology (ACR)/European League against Rheumatism (EULAR) classification criteria for RA [7][8][9].Based on 2010 ACR/EULAR classification criteria for RA, clinically diagnosed RA patients can be categorized into four serotypes: (i) positive for both RF and ACPA, (ii) positive for RF and negative for ACPA, (iii) negative for RF and positive for ACPA and (iv) negative for both RF and ACPA.However, the levels of RF are also perturbed in connective tissue diseases [10] and some chronic infectious diseases such as hepatitis B and hepatitis C virus infections [11].RF is thus not a specific diagnostic marker for RA.ACPA is comparatively a more specific biomarker and two-thirds of the individuals ultimately diagnosed with RA were tested positive for ACPAs 6-10 years before diagnosis [12,13].A total of 1-3% of the healthy population may also test positive for ACPAs suggesting the decreased specificity of this biomarker [14][15][16][17].Therefore, it is important to discover the biomarkers for the diagnosis of RA with both increased sensitivity and specificity.', metadata={'text': 'Rheumatoid factor (RF) and anti-citrullinated peptide antibodies (ACPA) are considered as the main serological markers for RA that have been included in the 2010 American College of Rheumatology (ACR)/European League against Rheumatism (EULAR) classification criteria for RA [7][8][9].Based on 2010 ACR/EULAR classification criteria for RA, clinically diagnosed RA patients can be categorized into four serotypes: (i) positive for both RF and ACPA, (ii) positive for RF and negative for ACPA, (iii) negative for RF and positive for ACPA and (iv) negative for both RF and ACPA.However, the levels of RF are also perturbed in connective tissue diseases [10] and some chronic infectious diseases such 
as hepatitis B and hepatitis C virus infections [11].RF is thus not a specific diagnostic marker for RA.ACPA is comparatively a more specific biomarker and two-thirds of the individuals ultimately diagnosed with RA were tested positive for ACPAs 6-10 years before diagnosis [12,13].A total of 1-3% of the healthy population may also test positive for ACPAs suggesting the decreased specificity of this biomarker [14][15][16][17].Therefore, it is important to discover the biomarkers for the diagnosis of RA with both increased sensitivity and specificity.', 'para': '6', 'bboxes': \"[[{'page': '2', 'x': '187.65', 'y': '223.58', 'h': '373.27', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '236.13', 'h': '392.88', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '248.68', 'h': '394.53', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '261.24', 'h': '133.81', 'w': '9.58'}], [{'page': '2', 'x': '303.29', 'y': '261.24', 'h': '257.23', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '273.79', 'h': '393.08', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '286.34', 'h': '392.88', 'w': '9.58'}, {'page': '2', 'x': '166.10', 'y': '298.90', 'h': '272.66', 'w': '9.58'}], [{'page': '2', 'x': '441.85', 'y': '298.90', 'h': '117.43', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '311.45', 'h': '392.88', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '324.00', 'h': '240.16', 'w': '9.58'}], [{'page': '2', 'x': '409.64', 'y': '324.00', 'h': '149.63', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '336.55', 'h': '67.99', 'w': '9.58'}], [{'page': '2', 'x': '236.99', 'y': '336.55', 'h': '322.28', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '349.11', 'h': '392.88', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '361.66', 'h': '107.38', 'w': '9.58'}], [{'page': '2', 'x': '276.86', 'y': '361.66', 'h': '282.42', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '374.21', 'h': '325.69', 'w': '9.58'}], [{'page': '2', 'x': '495.20', 'y': '374.21', 'h': '64.08', 'w': '9.58'}, {'page': '2', 
'x': '166.39', 'y': '386.77', 'h': '393.27', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '399.32', 'h': '65.18', 'w': '9.58'}]]\", 'pages': \"('2', '2')\", 'section_title': 'Introduction', 'section_number': '1.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='Mass spectrometry (MS)-based serum proteomics has emerged as a powerful technology in biological research targeted at the RA biomarker discovery [18,19].Several proteins and peptides have been identified that are unique to serum proteome of RA patients [18,20].A recent study compared the serum proteome profiles of seronegative patients with healthy controls [21].However, to our knowledge, no study has compared the serum proteome profiles of all the RA serotypes based on ACPAs and RF.Furthermore, the proteomic profiles of Pakistani RA patients have not been investigated in any previous study.This study aims to screen the RA serotypes, based on ACPAs and RF, and compare them with healthy controls in the Pakistani population for the identification of biomarkers that are differentially expressed (DE) between RA patients and healthy controls.', metadata={'text': 'Mass spectrometry (MS)-based serum proteomics has emerged as a powerful technology in biological research targeted at the RA biomarker discovery [18,19].Several proteins and peptides have been identified that are unique to serum proteome of RA patients [18,20].A recent study compared the serum proteome profiles of seronegative patients with healthy controls [21].However, to our knowledge, no study has compared the serum proteome profiles of all the RA serotypes based on ACPAs and RF.Furthermore, the proteomic profiles of Pakistani RA patients have not been investigated in any previous study.This study aims to screen the RA serotypes, based on ACPAs and RF, and compare them with healthy controls in the Pakistani population for the identification of biomarkers that are differentially expressed (DE) between RA patients and healthy controls.', 'para': '5', 'bboxes': \"[[{'page': '2', 'x': '187.65', 'y': '411.87', 'h': '373.27', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '424.42', 'h': '319.69', 'w': '9.58'}], [{'page': '2', 'x': '489.19', 'y': '424.42', 'h': '70.09', 'w': '9.58'}, 
{'page': '2', 'x': '166.39', 'y': '436.98', 'h': '394.62', 'w': '9.58'}], [{'page': '2', 'x': '166.01', 'y': '449.53', 'h': '393.66', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '462.08', 'h': '57.92', 'w': '9.58'}], [{'page': '2', 'x': '228.10', 'y': '462.08', 'h': '331.17', 'w': '9.58'}, {'page': '2', 'x': '166.10', 'y': '474.64', 'h': '262.67', 'w': '9.58'}], [{'page': '2', 'x': '432.38', 'y': '474.64', 'h': '126.90', 'w': '9.58'}, {'page': '2', 'x': '166.10', 'y': '487.19', 'h': '370.43', 'w': '9.58'}], [{'page': '2', 'x': '539.87', 'y': '487.19', 'h': '19.41', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '499.74', 'h': '392.88', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '512.30', 'h': '392.88', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '524.85', 'h': '315.47', 'w': '9.58'}]]\", 'pages': \"('2', '2')\", 'section_title': 'Introduction', 'section_number': '1.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='The study was approved by the institutional review board (IRB) of the National University of Sciences and Technology (NUST), Islamabad, Pakistan, and written informed consent was obtained from all the study subjects.Human blood sera were collected from Pakistani RA patients that were diagnosed according to 2010 ACR/EULAR criteria [7] as well as healthy controls.The venous blood was collected from each patient in a 5 mL BD Vacutainer ® tubes (BD vacutainer TM, Frankin Lakes, NJ, USA) containing spray-coated silica and a polymer gel for serum separation.Butterfly needle was used depending on the condition of the patient.The samples were allowed to clot, and the serum was carefully alliquoted and stored at -80 • C. ACPA-status was evaluated using the commercial ACPA AESKULISA ® enzyme-linked immunosorbent assay (ELISA) assay kit (AESKU.Diagnostics, Wendelsheim, Germany).RF-status was determined using a latex agglutination slide test kit for RF (Werfen, Barcelona, Spain).A total of 18 patients (mean age ± SD = 40.1 ± 12.13) selected for the study were divided into 4 cohorts.The first cohort included RA patients that were double-positive for both RF and ACPA (n = 5), the second and third cohort included RA patients that were either positive for RF or ACPA (n = 5 each) and the fourth cohort included RA patients that were negative for both of these serological markers (n = 3).', metadata={'text': 'The study was approved by the institutional review board (IRB) of the National University of Sciences and Technology (NUST), Islamabad, Pakistan, and written informed consent was obtained from all the study subjects.Human blood sera were collected from Pakistani RA patients that were diagnosed according to 2010 ACR/EULAR criteria [7] as well as healthy controls.The venous blood was collected from each patient in a 5 mL BD Vacutainer ® tubes (BD vacutainer TM, Frankin Lakes, NJ, USA) containing spray-coated silica and a polymer gel for serum 
separation.Butterfly needle was used depending on the condition of the patient.The samples were allowed to clot, and the serum was carefully alliquoted and stored at -80 • C. ACPA-status was evaluated using the commercial ACPA AESKULISA ® enzyme-linked immunosorbent assay (ELISA) assay kit (AESKU.Diagnostics, Wendelsheim, Germany).RF-status was determined using a latex agglutination slide test kit for RF (Werfen, Barcelona, Spain).A total of 18 patients (mean age ± SD = 40.1 ± 12.13) selected for the study were divided into 4 cohorts.The first cohort included RA patients that were double-positive for both RF and ACPA (n = 5), the second and third cohort included RA patients that were either positive for RF or ACPA (n = 5 each) and the fourth cohort included RA patients that were negative for both of these serological markers (n = 3).', 'para': '7', 'bboxes': \"[[{'page': '2', 'x': '187.65', 'y': '576.26', 'h': '371.62', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '588.81', 'h': '392.88', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '601.36', 'h': '217.09', 'w': '9.58'}], [{'page': '2', 'x': '386.61', 'y': '601.36', 'h': '172.66', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '613.91', 'h': '392.88', 'w': '9.58'}, {'page': '2', 'x': '165.98', 'y': '626.47', 'h': '107.48', 'w': '9.58'}], [{'page': '2', 'x': '276.54', 'y': '626.47', 'h': '282.74', 'w': '9.58'}, {'page': '2', 'x': '166.04', 'y': '639.02', 'h': '47.95', 'w': '9.58'}, {'page': '2', 'x': '213.98', 'y': '637.03', 'h': '5.66', 'w': '7.28'}, {'page': '2', 'x': '222.64', 'y': '639.02', 'h': '336.63', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '651.57', 'h': '198.34', 'w': '9.58'}], [{'page': '2', 'x': '367.82', 'y': '651.57', 'h': '191.46', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '664.13', 'h': '107.76', 'w': '9.58'}], [{'page': '2', 'x': '277.53', 'y': '664.13', 'h': '282.13', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '676.36', 'h': '125.23', 'w': '9.90'}, {'page': '2', 'x': 
'294.21', 'y': '674.45', 'h': '3.94', 'w': '6.92'}, {'page': '2', 'x': '298.74', 'y': '676.68', 'h': '260.92', 'w': '9.58'}, {'page': '2', 'x': '166.01', 'y': '689.23', 'h': '55.35', 'w': '9.58'}, {'page': '2', 'x': '221.35', 'y': '687.24', 'h': '5.66', 'w': '7.28'}, {'page': '2', 'x': '229.57', 'y': '689.23', 'h': '330.96', 'w': '9.58'}, {'page': '2', 'x': '165.90', 'y': '701.79', 'h': '112.72', 'w': '9.58'}], [{'page': '2', 'x': '281.70', 'y': '701.79', 'h': '277.57', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '714.34', 'h': '159.98', 'w': '9.58'}], [{'page': '2', 'x': '329.49', 'y': '714.02', 'h': '230.78', 'w': '9.90'}, {'page': '2', 'x': '166.39', 'y': '726.89', 'h': '223.73', 'w': '9.58'}], [{'page': '2', 'x': '393.21', 'y': '726.89', 'h': '166.06', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '739.44', 'h': '392.88', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '752.00', 'h': '392.89', 'w': '9.58'}, {'page': '2', 'x': '166.39', 'y': '764.55', 'h': '394.63', 'w': '9.58'}]]\", 'pages': \"('2', '2')\", 'section_title': 'Study Subjects and Serum Collection', 'section_number': '2.1.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='Life 2022, 12, 464 3 of 17 A total of 5 healthy controls (n = 5) (mean age ± SD = 43.4± 9.11) were also included in the study.Each cohort contained age-matched samples with a female-to-male ratio of 4:1.Blood samples from both RA cases and healthy controls were collected in vacutainers without anticoagulants.Serum was then separated from blood at 4000× g for 5 min, aliquoted into polyethylene tubes (Eppendorf AG, Hamburg, Germany) and stored at -80 • C until use.', metadata={'text': 'Life 2022, 12, 464 3 of 17 A total of 5 healthy controls (n = 5) (mean age ± SD = 43.4± 9.11) were also included in the study.Each cohort contained age-matched samples with a female-to-male ratio of 4:1.Blood samples from both RA cases and healthy controls were collected in vacutainers without anticoagulants.Serum was then separated from blood at 4000× g for 5 min, aliquoted into polyethylene tubes (Eppendorf AG, Hamburg, Germany) and stored at -80 • C until use.', 'para': '4', 'bboxes': \"[[{'page': '3', 'x': '35.49', 'y': '57.46', 'h': '57.79', 'w': '7.77'}, {'page': '3', 'x': '536.53', 'y': '57.56', 'h': '22.95', 'w': '7.67'}, {'page': '3', 'x': '166.01', 'y': '97.73', 'h': '249.40', 'w': '9.90'}], [{'page': '3', 'x': '417.90', 'y': '97.73', 'h': '141.38', 'w': '9.90'}, {'page': '3', 'x': '166.39', 'y': '110.60', 'h': '25.94', 'w': '9.58'}], [{'page': '3', 'x': '195.28', 'y': '110.60', 'h': '335.62', 'w': '9.58'}], [{'page': '3', 'x': '533.84', 'y': '110.60', 'h': '25.43', 'w': '9.58'}, {'page': '3', 'x': '166.39', 'y': '123.15', 'h': '392.88', 'w': '9.58'}, {'page': '3', 'x': '166.39', 'y': '135.71', 'h': '66.29', 'w': '9.58'}], [{'page': '3', 'x': '235.79', 'y': '135.58', 'h': '323.49', 'w': '9.71'}, {'page': '3', 'x': '166.10', 'y': '147.94', 'h': '333.23', 'w': '9.90'}, {'page': '3', 'x': '501.91', 'y': '146.03', 'h': '3.94', 'w': '6.92'}, {'page': '3', 'x': '506.44', 'y': '148.26', 'h': '50.39', 'w': '9.58'}]]\", 'pages': \"('3', '3')\", 
'section_title': 'Study Subjects and Serum Collection', 'section_number': '2.1.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='For validation, serum samples were collected and processed from RA patients (n = 60) (mean age ± SD = 41.495 ± 12.8275) and healthy controls (n = 20) (mean age ± SD = 45.4 ± 11.31) from the same population.The demographics and clinical characteristics of the experimental and validation cohort are shown in Table 1.', metadata={'text': 'For validation, serum samples were collected and processed from RA patients (n = 60) (mean age ± SD = 41.495 ± 12.8275) and healthy controls (n = 20) (mean age ± SD = 45.4 ± 11.31) from the same population.The demographics and clinical characteristics of the experimental and validation cohort are shown in Table 1.', 'para': '1', 'bboxes': \"[[{'page': '3', 'x': '187.65', 'y': '160.81', 'h': '372.02', 'w': '9.58'}, {'page': '3', 'x': '166.10', 'y': '173.05', 'h': '394.17', 'w': '9.90'}, {'page': '3', 'x': '166.07', 'y': '185.60', 'h': '256.73', 'w': '9.90'}], [{'page': '3', 'x': '425.92', 'y': '185.92', 'h': '133.36', 'w': '9.58'}, {'page': '3', 'x': '166.39', 'y': '198.47', 'h': '343.00', 'w': '9.58'}]]\", 'pages': \"('3', '3')\", 'section_title': 'Study Subjects and Serum Collection', 'section_number': '2.1.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='Serum samples were thawed on ice followed by centrifugation at 14,000× g for 10 min at 4 • C. Protein concentrations for serum samples from each donor were then determined through Pierce ® 660 nm protein assay kit for protein concentration (Thermo Scientific, Waltham, MA, USA).The sample volumes containing 10 mg total protein were calculated and mixed with double-distilled water (ddH 2 O) to make the total volume up to 500 µL.', metadata={'text': 'Serum samples were thawed on ice followed by centrifugation at 14,000× g for 10 min at 4 • C. Protein concentrations for serum samples from each donor were then determined through Pierce ® 660 nm protein assay kit for protein concentration (Thermo Scientific, Waltham, MA, USA).The sample volumes containing 10 mg total protein were calculated and mixed with double-distilled water (ddH 2 O) to make the total volume up to 500 µL.', 'para': '1', 'bboxes': \"[[{'page': '3', 'x': '187.65', 'y': '635.30', 'h': '371.63', 'w': '9.71'}, {'page': '3', 'x': '166.39', 'y': '647.98', 'h': '15.71', 'w': '9.58'}, {'page': '3', 'x': '184.70', 'y': '645.75', 'h': '3.94', 'w': '6.92'}, {'page': '3', 'x': '189.24', 'y': '647.98', 'h': '370.04', 'w': '9.58'}, {'page': '3', 'x': '166.39', 'y': '660.54', 'h': '66.80', 'w': '9.58'}, {'page': '3', 'x': '233.20', 'y': '658.55', 'h': '5.66', 'w': '7.28'}, {'page': '3', 'x': '242.68', 'y': '660.54', 'h': '317.84', 'w': '9.58'}, {'page': '3', 'x': '165.90', 'y': '673.09', 'h': '93.36', 'w': '9.58'}], [{'page': '3', 'x': '261.76', 'y': '673.09', 'h': '297.51', 'w': '9.58'}, {'page': '3', 'x': '166.39', 'y': '685.53', 'h': '384.50', 'w': '10.84'}]]\", 'pages': \"('3', '3')\", 'section_title': 'Protein Assay', 'section_number': '2.2.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='Serum samples were analyzed using one-dimensional (1D) sodium dodecyl sulfate polyacrylamide gel electrophoresis (SDS-PAGE) for assessment of the gross quantitative as well as qualitative differences in the serum protein profiles of the study subjects.Briefly, 16 µg of serum samples were mixed with an equal volume of NativePAGE™ sample buffer (Thermo Scientific, Waltham, MA, USA) and loaded on NativePAGE™ 1.0 mm, 4-16%, bis-tris, mini protein gels (Thermo Scientific, Waltham, MA, USA).Novex Sharp Pre-Stained Protein Standard for molecular weight estimation (Thermo Scientific, Waltham, MA, USA) was also loaded in a separate well.The samples and the standard were run in NuPAGE™ MES SDS running buffer (Thermo Scientific, Waltham, MA, USA) at 120 V for 60 min and then at 150 V for 30 min.The gels were washed for 5 min in ddH 2 O.The washing was repeated thrice.Prior to visualization, the protein gels were stained for 16 hours in Coomassie Brilliant Blue R-250 dye (Bio-Rad, Hemel Hempstead, UK) and rinsed in ddH 2 O for 30 min.The whole figure can be found at Supplementary Materials (Figures S1-S3).', metadata={'text': 'Serum samples were analyzed using one-dimensional (1D) sodium dodecyl sulfate polyacrylamide gel electrophoresis (SDS-PAGE) for assessment of the gross quantitative as well as qualitative differences in the serum protein profiles of the study subjects.Briefly, 16 µg of serum samples were mixed with an equal volume of NativePAGE™ sample buffer (Thermo Scientific, Waltham, MA, USA) and loaded on NativePAGE™ 1.0 mm, 4-16%, bis-tris, mini protein gels (Thermo Scientific, Waltham, MA, USA).Novex Sharp Pre-Stained Protein Standard for molecular weight estimation (Thermo Scientific, Waltham, MA, USA) was also loaded in a separate well.The samples and the standard were run in NuPAGE™ MES SDS running buffer (Thermo Scientific, Waltham, MA, USA) at 120 V for 60 min and then at 150 V for 30 min.The gels were washed for 5 min in ddH 2 O.The 
washing was repeated thrice.Prior to visualization, the protein gels were stained for 16 hours in Coomassie Brilliant Blue R-250 dye (Bio-Rad, Hemel Hempstead, UK) and rinsed in ddH 2 O for 30 min.The whole figure can be found at Supplementary Materials (Figures S1-S3).', 'para': '7', 'bboxes': \"[[{'page': '3', 'x': '187.65', 'y': '723.60', 'h': '371.62', 'w': '9.58'}, {'page': '3', 'x': '166.10', 'y': '736.15', 'h': '393.18', 'w': '9.58'}, {'page': '3', 'x': '165.98', 'y': '748.71', 'h': '360.04', 'w': '9.58'}], [{'page': '3', 'x': '529.21', 'y': '748.71', 'h': '31.31', 'w': '9.58'}, {'page': '3', 'x': '165.90', 'y': '761.15', 'h': '393.58', 'w': '9.69'}, {'page': '3', 'x': '166.07', 'y': '773.81', 'h': '394.45', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '98.05', 'h': '282.71', 'w': '9.58'}], [{'page': '4', 'x': '451.18', 'y': '98.05', 'h': '108.09', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '110.60', 'h': '393.88', 'w': '9.58'}, {'page': '4', 'x': '165.98', 'y': '123.15', 'h': '152.63', 'w': '9.58'}], [{'page': '4', 'x': '321.71', 'y': '123.15', 'h': '239.52', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '135.71', 'h': '392.88', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '148.26', 'h': '132.03', 'w': '9.58'}], [{'page': '4', 'x': '302.77', 'y': '148.26', 'h': '195.38', 'w': '10.73'}], [{'page': '4', 'x': '501.06', 'y': '148.26', 'h': '58.22', 'w': '9.58'}, {'page': '4', 'x': '165.98', 'y': '160.81', 'h': '90.55', 'w': '9.58'}], [{'page': '4', 'x': '260.29', 'y': '160.81', 'h': '298.99', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '173.37', 'h': '392.88', 'w': '10.73'}, {'page': '4', 'x': '166.39', 'y': '185.92', 'h': '47.62', 'w': '9.58'}], [{'page': '4', 'x': '217.10', 'y': '185.92', 'h': '331.85', 'w': '9.58'}]]\", 'pages': \"('3', '4')\", 'section_title': 'SDS-PAGE and Silver Staining', 'section_number': '2.3.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid 
Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='For qualitative assessment of the elution efficiency of ProteoMiner™ columns (Bio-Rad, Hemel Hempstead, UK), one serum sample processed through the column was also evaluated using 1D SDS-PAGE.For this purpose, the serum sample, the flow-through after each wash, and the eluted samples were run using the aforementioned protocol.Additionally, trypsin digested samples were also analyzed using 1D SDS-PAGE to confirm complete protein digestion before liquid chromatography-tandem mass spectrometry (LC-MS).', metadata={'text': 'For qualitative assessment of the elution efficiency of ProteoMiner™ columns (Bio-Rad, Hemel Hempstead, UK), one serum sample processed through the column was also evaluated using 1D SDS-PAGE.For this purpose, the serum sample, the flow-through after each wash, and the eluted samples were run using the aforementioned protocol.Additionally, trypsin digested samples were also analyzed using 1D SDS-PAGE to confirm complete protein digestion before liquid chromatography-tandem mass spectrometry (LC-MS).', 'para': '2', 'bboxes': \"[[{'page': '4', 'x': '187.65', 'y': '198.47', 'h': '373.27', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '211.02', 'h': '392.88', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '223.58', 'h': '136.31', 'w': '9.58'}], [{'page': '4', 'x': '305.20', 'y': '223.58', 'h': '254.27', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '236.13', 'h': '348.40', 'w': '9.58'}], [{'page': '4', 'x': '517.88', 'y': '236.13', 'h': '43.05', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '248.68', 'h': '392.88', 'w': '9.58'}, {'page': '4', 'x': '166.10', 'y': '261.24', 'h': '377.12', 'w': '9.58'}]]\", 'pages': \"('4', '4')\", 'section_title': 'SDS-PAGE and Silver Staining', 'section_number': '2.3.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='ProteoMiner™ Small Capacity bead columns for protein enrichment were loaded with 10 mg of protein from each sample separately.The bead columns were then rotated at the room temperature for 2 h followed by centrifugation at 1000× g for 60 s.Washing of the beads was performed thrice in phosphate-buffered saline (Sigma-Aldrich, Gillingham, UK) followed by rotation for 5 min and subsequent centrifugation for 60 s at 1000× g.This eluted the maximum amount of unbound protein.', metadata={'text': 'ProteoMiner™ Small Capacity bead columns for protein enrichment were loaded with 10 mg of protein from each sample separately.The bead columns were then rotated at the room temperature for 2 h followed by centrifugation at 1000× g for 60 s.Washing of the beads was performed thrice in phosphate-buffered saline (Sigma-Aldrich, Gillingham, UK) followed by rotation for 5 min and subsequent centrifugation for 60 s at 1000× g.This eluted the maximum amount of unbound protein.', 'para': '3', 'bboxes': \"[[{'page': '4', 'x': '187.65', 'y': '301.41', 'h': '371.62', 'w': '9.58'}, {'page': '4', 'x': '165.90', 'y': '313.96', 'h': '202.25', 'w': '9.58'}], [{'page': '4', 'x': '371.24', 'y': '313.96', 'h': '188.03', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '326.38', 'h': '322.63', 'w': '9.71'}], [{'page': '4', 'x': '492.17', 'y': '326.51', 'h': '67.11', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '339.06', 'h': '393.87', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '351.49', 'h': '368.81', 'w': '9.71'}], [{'page': '4', 'x': '539.86', 'y': '351.62', 'h': '19.41', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '364.17', 'h': '220.03', 'w': '9.58'}]]\", 'pages': \"('4', '4')\", 'section_title': 'ProteoMiner TM Column Processing', 'section_number': '2.4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='A pre-mixed solution of 0.05% (w/v) RapiGest (Waters, Elstree, Hertfordshire, UK) and 160 µL of 25 mM ammonium bicarbonate (NH 4 HCO 3 ) (Fluka Chemicals Ltd., Gillingham, UK) was used for resuspension of the Proteominer TM beads.The resuspended beads were then heated for 10 min at 80 • C; DL-Dithiothreitol (Sigma-Aldrich, Gillingham, UK) to 3 mM final concentration was added, incubated for 10 min at 60 • C and iodoacetamide (Sigma-Aldrich, Gillingham, UK) was added to a final concentration of 9 mM, incubated in the dark for 30 min at room temperature.Protease enzyme trypsin (Sigma-Aldrich, Gillingham, UK) was used for enzymatic protein digestion.A total of 2 µg of trypsin was added to each sample and rotated at 37 • C for 16 h.The samples containing the beads were supplemented again with 2 µg trypsin and rotation for 2 h at 37 • C. The digested serum samples were then centrifuged at 1000× g for 1 min at room temperature.Supernatant was removed followed by the inhibition of the trypsin activity by acidification with 0.5% (v/v) trifluoroacetic acid (TFA, Sigma-Aldrich, Gillingham, UK) and rotation at 37 • C for 30 min.The samples were then centrifuged at 13,000× g for 15 min at 4 • C.', metadata={'text': 'A pre-mixed solution of 0.05% (w/v) RapiGest (Waters, Elstree, Hertfordshire, UK) and 160 µL of 25 mM ammonium bicarbonate (NH 4 HCO 3 ) (Fluka Chemicals Ltd., Gillingham, UK) was used for resuspension of the Proteominer TM beads.The resuspended beads were then heated for 10 min at 80 • C; DL-Dithiothreitol (Sigma-Aldrich, Gillingham, UK) to 3 mM final concentration was added, incubated for 10 min at 60 • C and iodoacetamide (Sigma-Aldrich, Gillingham, UK) was added to a final concentration of 9 mM, incubated in the dark for 30 min at room temperature.Protease enzyme trypsin (Sigma-Aldrich, Gillingham, UK) was used for enzymatic protein digestion.A total of 2 µg of trypsin was added to each sample and rotated at 37 • C for 16 h.The samples 
containing the beads were supplemented again with 2 µg trypsin and rotation for 2 h at 37 • C. The digested serum samples were then centrifuged at 1000× g for 1 min at room temperature.Supernatant was removed followed by the inhibition of the trypsin activity by acidification with 0.5% (v/v) trifluoroacetic acid (TFA, Sigma-Aldrich, Gillingham, UK) and rotation at 37 • C for 30 min.The samples were then centrifuged at 13,000× g for 15 min at 4 • C.', 'para': '6', 'bboxes': \"[[{'page': '4', 'x': '187.65', 'y': '402.13', 'h': '371.62', 'w': '9.58'}, {'page': '4', 'x': '165.90', 'y': '414.57', 'h': '394.62', 'w': '10.84'}, {'page': '4', 'x': '166.39', 'y': '427.23', 'h': '220.48', 'w': '9.58'}, {'page': '4', 'x': '386.88', 'y': '425.24', 'h': '11.80', 'w': '7.28'}, {'page': '4', 'x': '401.67', 'y': '427.23', 'h': '27.81', 'w': '9.58'}], [{'page': '4', 'x': '432.57', 'y': '427.23', 'h': '126.71', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '439.79', 'h': '128.16', 'w': '9.58'}, {'page': '4', 'x': '297.71', 'y': '437.56', 'h': '3.94', 'w': '6.92'}, {'page': '4', 'x': '302.25', 'y': '439.79', 'h': '257.02', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '452.34', 'h': '289.28', 'w': '9.58'}, {'page': '4', 'x': '458.58', 'y': '450.11', 'h': '3.94', 'w': '6.92'}, {'page': '4', 'x': '463.12', 'y': '452.34', 'h': '96.15', 'w': '9.58'}, {'page': '4', 'x': '166.07', 'y': '464.89', 'h': '393.21', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '477.45', 'h': '201.41', 'w': '9.58'}], [{'page': '4', 'x': '373.30', 'y': '477.45', 'h': '187.22', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '490.00', 'h': '261.11', 'w': '9.58'}], [{'page': '4', 'x': '430.61', 'y': '489.89', 'h': '128.66', 'w': '9.69'}, {'page': '4', 'x': '166.39', 'y': '502.55', 'h': '168.86', 'w': '9.58'}, {'page': '4', 'x': '337.78', 'y': '500.32', 'h': '3.94', 'w': '6.92'}, {'page': '4', 'x': '342.32', 'y': '502.55', 'h': '44.55', 'w': '9.58'}], [{'page': '4', 'x': '389.93', 'y': '502.55', 'h': '169.35', 
'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '515.00', 'h': '284.75', 'w': '9.69'}, {'page': '4', 'x': '453.78', 'y': '512.87', 'h': '3.94', 'w': '6.92'}, {'page': '4', 'x': '458.32', 'y': '515.11', 'h': '100.96', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '527.53', 'h': '317.21', 'w': '9.71'}], [{'page': '4', 'x': '486.71', 'y': '527.66', 'h': '72.56', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '540.21', 'h': '393.88', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '552.76', 'h': '330.90', 'w': '9.58'}, {'page': '4', 'x': '499.89', 'y': '550.53', 'h': '3.94', 'w': '6.92'}, {'page': '4', 'x': '504.43', 'y': '552.76', 'h': '56.60', 'w': '9.58'}], [{'page': '4', 'x': '166.09', 'y': '565.19', 'h': '276.76', 'w': '9.71'}, {'page': '4', 'x': '445.43', 'y': '563.09', 'h': '3.94', 'w': '6.92'}, {'page': '4', 'x': '449.97', 'y': '565.32', 'h': '9.55', 'w': '9.58'}]]\", 'pages': \"('4', '4')\", 'section_title': 'Protein Digestion', 'section_number': '2.5.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='Each serum digest sample was analyzed using LC-MS/MS on an UltiMate 3000 Nano LC System (Dionex/Thermo Scientific, Waltham, MA, USA).The system was attached to a Q Exactive TM Quadrupole-Orbitrap instrument (Thermo Scientific, Waltham, MA, USA).Prior to loading onto the instrument, the samples were carefully randomized using Microsoft Excel.All the samples were run in one single batch.For this purpose, 150 ng of the tryptic digest from each trypsin-digested serum sample was subjected to LC-MS/MS via a 90 min gradient.For loading on trapping column (100 Å, 75 µm × 2 cm, Acclaim PepMap 100 C18, 3 µm packing material) loading buffer was used that contained 2% (v/v) acetonitrile and 0.1% (v/v) TFA in water.The sample digests mixed with loaded buffer were run at a flow rate of 12 µL min -1 for 7 min.Then, a trapping column was coupled with an analytical column (100 Å, 75 µm × 50 cm, EASY-Spray PepMap RSLC C18, 2 µm packing material) followed by elution of the peptides through a linear gradient.The linear gradient consisted of 96.2%A composed of 0.1% (v/v) formic acid: 3.8% B consisting of 0.1% (v/v) formic acid in water/acetonitrile [80/20] (v/v) to 50% A: 50% B at a flow rate of 300 nl min -1 over 90 min and washed for 5 min at 1% A: 99% B. 
The column was then re-equilibrated to the starting conditions and maintained at 40 • C before direct introduction of the affluent into the integrated nano-electrospray ionization source that was operating in the positive ion mode.The MS instrument was operated in the data-dependent acquisition (DDA) mode with the survey scans between the mass to charge ratio (m/z) range of 350 to 2000 that were acquired at a mass resolution of about 60,000 and the fullwidth at halfmaximum (FWHM) at m/z of about 200.The automatic gain control was set to 3e6 with a maximum injection time of 100 ms.For MS/MS, 12 of the most intense precursor ions with an isolation window of 2 m/z units and charge states ranging from 2+ to 5+ were selected.For this, the automatic gain control was set to a value of 1e5 with the maximum injection time of 100 ms.The peptide fragmentation was obtained by the higher-energy collisional dissociation utilizing a normalized collision energy of 30%.Dynamic exclusion of the m/z values was used to avoid the repeated fragmentation of the same peptide with an exclusion time of 20 s.All MS raw files for this experiment have been deposited to the ProteomeXchange Consortium through the PRIDE partner proteomics repository.The dataset identifier for this submission is PXD020235 and 10.6019/PXD020235 [22].', metadata={'text': 'Each serum digest sample was analyzed using LC-MS/MS on an UltiMate 3000 Nano LC System (Dionex/Thermo Scientific, Waltham, MA, USA).The system was attached to a Q Exactive TM Quadrupole-Orbitrap instrument (Thermo Scientific, Waltham, MA, USA).Prior to loading onto the instrument, the samples were carefully randomized using Microsoft Excel.All the samples were run in one single batch.For this purpose, 150 ng of the tryptic digest from each trypsin-digested serum sample was subjected to LC-MS/MS via a 90 min gradient.For loading on trapping column (100 Å, 75 µm × 2 cm, Acclaim PepMap 100 C18, 3 µm packing material) loading buffer was used that 
contained 2% (v/v) acetonitrile and 0.1% (v/v) TFA in water.The sample digests mixed with loaded buffer were run at a flow rate of 12 µL min -1 for 7 min.Then, a trapping column was coupled with an analytical column (100 Å, 75 µm × 50 cm, EASY-Spray PepMap RSLC C18, 2 µm packing material) followed by elution of the peptides through a linear gradient.The linear gradient consisted of 96.2%A composed of 0.1% (v/v) formic acid: 3.8% B consisting of 0.1% (v/v) formic acid in water/acetonitrile [80/20] (v/v) to 50% A: 50% B at a flow rate of 300 nl min -1 over 90 min and washed for 5 min at 1% A: 99% B. The column was then re-equilibrated to the starting conditions and maintained at 40 • C before direct introduction of the affluent into the integrated nano-electrospray ionization source that was operating in the positive ion mode.The MS instrument was operated in the data-dependent acquisition (DDA) mode with the survey scans between the mass to charge ratio (m/z) range of 350 to 2000 that were acquired at a mass resolution of about 60,000 and the fullwidth at halfmaximum (FWHM) at m/z of about 200.The automatic gain control was set to 3e6 with a maximum injection time of 100 ms.For MS/MS, 12 of the most intense precursor ions with an isolation window of 2 m/z units and charge states ranging from 2+ to 5+ were selected.For this, the automatic gain control was set to a value of 1e5 with the maximum injection time of 100 ms.The peptide fragmentation was obtained by the higher-energy collisional dissociation utilizing a normalized collision energy of 30%.Dynamic exclusion of the m/z values was used to avoid the repeated fragmentation of the same peptide with an exclusion time of 20 s.All MS raw files for this experiment have been deposited to the ProteomeXchange Consortium through the PRIDE partner proteomics repository.The dataset identifier for this submission is PXD020235 and 10.6019/PXD020235 [22].', 'para': '17', 'bboxes': \"[[{'page': '4', 'x': '187.65', 'y': 
'603.27', 'h': '371.62', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '615.83', 'h': '275.76', 'w': '9.58'}], [{'page': '4', 'x': '445.29', 'y': '615.83', 'h': '113.98', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '628.38', 'h': '69.65', 'w': '9.58'}, {'page': '4', 'x': '236.04', 'y': '626.39', 'h': '11.80', 'w': '7.28'}, {'page': '4', 'x': '251.61', 'y': '628.38', 'h': '308.91', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '640.93', 'h': '26.46', 'w': '9.58'}], [{'page': '4', 'x': '195.34', 'y': '640.93', 'h': '363.93', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '653.49', 'h': '70.68', 'w': '9.58'}], [{'page': '4', 'x': '240.17', 'y': '653.49', 'h': '198.27', 'w': '9.58'}], [{'page': '4', 'x': '441.54', 'y': '653.49', 'h': '117.74', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '666.04', 'h': '392.88', 'w': '9.58'}, {'page': '4', 'x': '166.12', 'y': '678.59', 'h': '99.03', 'w': '9.58'}], [{'page': '4', 'x': '269.47', 'y': '678.28', 'h': '289.81', 'w': '9.90'}, {'page': '4', 'x': '166.39', 'y': '691.04', 'h': '393.88', 'w': '9.69'}, {'page': '4', 'x': '166.39', 'y': '703.70', 'h': '184.20', 'w': '9.58'}], [{'page': '4', 'x': '354.67', 'y': '703.70', 'h': '204.81', 'w': '9.58'}, {'page': '4', 'x': '165.98', 'y': '716.14', 'h': '161.87', 'w': '9.69'}, {'page': '4', 'x': '327.94', 'y': '714.02', 'h': '10.01', 'w': '6.92'}, {'page': '4', 'x': '341.08', 'y': '716.25', 'h': '43.65', 'w': '9.58'}], [{'page': '4', 'x': '388.22', 'y': '716.25', 'h': '171.05', 'w': '9.58'}, {'page': '4', 'x': '165.98', 'y': '728.49', 'h': '393.30', 'w': '9.90'}, {'page': '4', 'x': '166.10', 'y': '741.36', 'h': '346.29', 'w': '9.58'}], [{'page': '4', 'x': '515.48', 'y': '741.36', 'h': '43.99', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '753.91', 'h': '123.17', 'w': '9.58'}], [{'page': '4', 'x': '292.23', 'y': '753.91', 'h': '267.05', 'w': '9.58'}, {'page': '4', 'x': '166.39', 'y': '766.46', 'h': '392.88', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '98.05', 'h': 
'57.86', 'w': '9.58'}, {'page': '5', 'x': '224.34', 'y': '95.82', 'h': '10.01', 'w': '6.92'}, {'page': '5', 'x': '237.34', 'y': '98.05', 'h': '321.93', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '110.60', 'h': '266.57', 'w': '9.58'}, {'page': '5', 'x': '435.34', 'y': '108.37', 'h': '3.94', 'w': '6.92'}, {'page': '5', 'x': '439.88', 'y': '110.60', 'h': '119.40', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '123.15', 'h': '392.88', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '135.71', 'h': '96.68', 'w': '9.58'}], [{'page': '5', 'x': '266.18', 'y': '135.71', 'h': '293.10', 'w': '9.58'}, {'page': '5', 'x': '166.07', 'y': '148.26', 'h': '393.21', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '160.81', 'h': '394.53', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '173.24', 'h': '181.48', 'w': '9.71'}], [{'page': '5', 'x': '351.05', 'y': '173.37', 'h': '208.23', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '185.92', 'h': '165.42', 'w': '9.58'}], [{'page': '5', 'x': '335.20', 'y': '185.92', 'h': '224.07', 'w': '9.58'}, {'page': '5', 'x': '165.98', 'y': '198.34', 'h': '393.30', 'w': '9.71'}, {'page': '5', 'x': '166.39', 'y': '211.02', 'h': '37.95', 'w': '9.58'}], [{'page': '5', 'x': '207.44', 'y': '211.02', 'h': '351.83', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '223.58', 'h': '109.24', 'w': '9.58'}], [{'page': '5', 'x': '279.32', 'y': '223.58', 'h': '280.35', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '236.13', 'h': '305.80', 'w': '9.58'}], [{'page': '5', 'x': '475.28', 'y': '236.13', 'h': '83.99', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '248.55', 'h': '392.88', 'w': '9.71'}, {'page': '5', 'x': '166.39', 'y': '261.24', 'h': '115.17', 'w': '9.58'}], [{'page': '5', 'x': '286.60', 'y': '261.24', 'h': '272.67', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '273.79', 'h': '373.17', 'w': '9.58'}], [{'page': '5', 'x': '542.65', 'y': '273.79', 'h': '16.63', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '286.34', 'h': '355.30', 'w': 
'9.58'}]]\", 'pages': \"('4', '5')\", 'section_title': 'LC-MS/MS', 'section_number': '2.6.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='For label-free quantification, all the raw files were processed using Progenesis™ QI 2.0 software (Nonlinear Dynamics, Waters).Progenesis™ QI software undertakes the spectral alignment, consistent peak picking across all runs, normalization of the total protein abundance as well as peptide/protein quantification.For each feature, the top five spectra were exported, and the peptide and protein identifications were carried out via in-house Mascot server (Version 2.6.2).Reviewed Homo sapiens database was used to perform the identifications.Search parameters included: fragment mass tolerance value of 0.01 Da; peptide mass tolerance value of 10.0 ppm; enzyme, trypsin; one allowed missed cleavage; carbamidomethylation (cysteine) as the fixed modifications and oxidation (methionine) as the variable modification; The criteria used for protein identification included a false discovery rate (FDR) of 1% and ≥2 unique peptides.', metadata={'text': 'For label-free quantification, all the raw files were processed using Progenesis™ QI 2.0 software (Nonlinear Dynamics, Waters).Progenesis™ QI software undertakes the spectral alignment, consistent peak picking across all runs, normalization of the total protein abundance as well as peptide/protein quantification.For each feature, the top five spectra were exported, and the peptide and protein identifications were carried out via in-house Mascot server (Version 2.6.2).Reviewed Homo sapiens database was used to perform the identifications.Search parameters included: fragment mass tolerance value of 0.01 Da; peptide mass tolerance value of 10.0 ppm; enzyme, trypsin; one allowed missed cleavage; carbamidomethylation (cysteine) as the fixed modifications and oxidation (methionine) as the variable modification; The criteria used for protein identification included a false discovery rate (FDR) of 1% and ≥2 unique peptides.', 'para': '4', 'bboxes': \"[[{'page': '5', 'x': '187.65', 'y': '324.30', 'h': '371.62', 'w': 
'9.58'}, {'page': '5', 'x': '166.39', 'y': '336.85', 'h': '174.99', 'w': '9.58'}], [{'page': '5', 'x': '344.48', 'y': '336.85', 'h': '214.80', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '349.41', 'h': '392.88', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '361.96', 'h': '231.45', 'w': '9.58'}], [{'page': '5', 'x': '400.98', 'y': '361.96', 'h': '158.30', 'w': '9.58'}, {'page': '5', 'x': '165.98', 'y': '374.51', 'h': '393.30', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '387.06', 'h': '131.22', 'w': '9.58'}], [{'page': '5', 'x': '300.78', 'y': '386.93', 'h': '258.50', 'w': '9.71'}, {'page': '5', 'x': '166.39', 'y': '399.62', 'h': '66.54', 'w': '9.58'}], [{'page': '5', 'x': '237.66', 'y': '399.62', 'h': '322.86', 'w': '9.58'}, {'page': '5', 'x': '166.10', 'y': '412.17', 'h': '394.42', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '424.72', 'h': '393.87', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '437.28', 'h': '392.88', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '449.51', 'h': '230.16', 'w': '9.90'}]]\", 'pages': \"('5', '5')\", 'section_title': 'Label-Free Quantification', 'section_number': '2.7.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='Canonical pathways, networks and disregulated regulators of the proteins that were identified with an FDR adjusted p-value of <0.05 and ≥2 unique peptides were performed using Ingenuity Pathway Analysis (IPA) (Qiagen, Hilden, Germany).For this, the gene names for the identified proteins were uploaded and analyzed for humans.All identified proteins were used as a background.The uncharacterized proteins were excluded from analysis.', metadata={'text': 'Canonical pathways, networks and disregulated regulators of the proteins that were identified with an FDR adjusted p-value of <0.05 and ≥2 unique peptides were performed using Ingenuity Pathway Analysis (IPA) (Qiagen, Hilden, Germany).For this, the gene names for the identified proteins were uploaded and analyzed for humans.All identified proteins were used as a background.The uncharacterized proteins were excluded from analysis.', 'para': '3', 'bboxes': \"[[{'page': '5', 'x': '187.65', 'y': '487.79', 'h': '371.62', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '500.02', 'h': '392.89', 'w': '9.90'}, {'page': '5', 'x': '166.39', 'y': '512.89', 'h': '310.88', 'w': '9.58'}], [{'page': '5', 'x': '481.23', 'y': '512.89', 'h': '78.04', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '525.45', 'h': '342.92', 'w': '9.58'}], [{'page': '5', 'x': '514.37', 'y': '525.45', 'h': '46.56', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '538.00', 'h': '186.85', 'w': '9.58'}], [{'page': '5', 'x': '357.87', 'y': '538.00', 'h': '201.40', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '550.55', 'h': '61.84', 'w': '9.58'}]]\", 'pages': \"('5', '5')\", 'section_title': 'Pathway Analysis', 'section_number': '2.8.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content=\"A human PZP ELISA kit (CSB-EL019131HU, CUSABIO, Houston, TX, USA) was used for the quantification of PZP protein in human samples from an independent cohort of RA patients and controls according to the manufacturer's directions.All the samples were analyzed in duplicates and protein concentration was determined as an average of the duplicates.\", metadata={'text': \"A human PZP ELISA kit (CSB-EL019131HU, CUSABIO, Houston, TX, USA) was used for the quantification of PZP protein in human samples from an independent cohort of RA patients and controls according to the manufacturer's directions.All the samples were analyzed in duplicates and protein concentration was determined as an average of the duplicates.\", 'para': '1', 'bboxes': \"[[{'page': '5', 'x': '187.65', 'y': '588.51', 'h': '371.62', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '601.06', 'h': '392.88', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '613.62', 'h': '315.12', 'w': '9.58'}], [{'page': '5', 'x': '487.71', 'y': '613.62', 'h': '71.56', 'w': '9.58'}, {'page': '5', 'x': '165.98', 'y': '626.17', 'h': '393.30', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '638.72', 'h': '64.33', 'w': '9.58'}]]\", 'pages': \"('5', '5')\", 'section_title': 'Validation of MS Using ELISA', 'section_number': '2.9.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='Heat map plots were created and visualized using MetaboAnalyst 4.0.Principal component analysis (PCA) was also performed using MetaboAnalyst 4.0.Log transformation and Pareto scaling were applied for data analysis of the normalized data.For this study, the DE proteins were defined as those with a FDR adjusted p-value of <0.05, identified ≥2 unique peptides and a >2 fold expression change using ANOVA.For comparison of PZP concentration between RA patients and healthy controls, a t-test was used.A boxplot depicting the ELISA results was designed using R 4.1.1.', metadata={'text': 'Heat map plots were created and visualized using MetaboAnalyst 4.0.Principal component analysis (PCA) was also performed using MetaboAnalyst 4.0.Log transformation and Pareto scaling were applied for data analysis of the normalized data.For this study, the DE proteins were defined as those with a FDR adjusted p-value of <0.05, identified ≥2 unique peptides and a >2 fold expression change using ANOVA.For comparison of PZP concentration between RA patients and healthy controls, a t-test was used.A boxplot depicting the ELISA results was designed using R 4.1.1.', 'para': '5', 'bboxes': \"[[{'page': '5', 'x': '187.65', 'y': '676.68', 'h': '324.45', 'w': '9.58'}], [{'page': '5', 'x': '518.63', 'y': '676.68', 'h': '40.64', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '689.23', 'h': '331.18', 'w': '9.58'}], [{'page': '5', 'x': '501.86', 'y': '689.23', 'h': '59.07', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '701.79', 'h': '372.04', 'w': '9.58'}], [{'page': '5', 'x': '544.26', 'y': '701.79', 'h': '15.21', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '714.21', 'h': '394.12', 'w': '9.71'}, {'page': '5', 'x': '166.39', 'y': '726.58', 'h': '339.08', 'w': '9.90'}], [{'page': '5', 'x': '507.91', 'y': '726.89', 'h': '53.02', 'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '739.31', 'h': '382.39', 'w': '9.71'}], [{'page': '5', 'x': '551.89', 'y': '739.44', 'h': '7.78', 
'w': '9.58'}, {'page': '5', 'x': '166.39', 'y': '752.00', 'h': '280.43', 'w': '9.58'}]]\", 'pages': \"('5', '5')\", 'section_title': 'Statistics', 'section_number': '2.10.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='Life 2022, 12, 464 6 of 17', metadata={'text': 'Life 2022, 12, 464 6 of 17', 'para': '0', 'bboxes': \"[[{'page': '6', 'x': '35.49', 'y': '57.46', 'h': '57.79', 'w': '7.77'}, {'page': '6', 'x': '536.53', 'y': '57.56', 'h': '22.95', 'w': '7.67'}]]\", 'pages': \"('6', '6')\", 'section_title': 'Statistics', 'section_number': '2.10.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='1-D SDS PAGE did not demonstrate any significant differences among groups (Figure 1).A large band of serum albumin appeared at 67 kDa in all the samples; the most abundant protein in human serum.1-D SDS-PAGE of the serum samples processed through Pro-teoMiner™ columns showed that with each wash, the albumin and other high abundance proteins gradually decreased, and all the on-bead proteins were enriched gradually as depicted by the presence of all protein bands and their respective intensities in the SDS-PAGE of eluted samples (Figure 2).', metadata={'text': '1-D SDS PAGE did not demonstrate any significant differences among groups (Figure 1).A large band of serum albumin appeared at 67 kDa in all the samples; the most abundant protein in human serum.1-D SDS-PAGE of the serum samples processed through Pro-teoMiner™ columns showed that with each wash, the albumin and other high abundance proteins gradually decreased, and all the on-bead proteins were enriched gradually as depicted by the presence of all protein bands and their respective intensities in the SDS-PAGE of eluted samples (Figure 2).', 'para': '2', 'bboxes': \"[[{'page': '6', 'x': '187.55', 'y': '127.04', 'h': '373.46', 'w': '9.58'}], [{'page': '6', 'x': '166.01', 'y': '139.59', 'h': '393.27', 'w': '9.58'}, {'page': '6', 'x': '166.10', 'y': '152.14', 'h': '113.19', 'w': '9.58'}], [{'page': '6', 'x': '283.92', 'y': '152.14', 'h': '277.01', 'w': '9.58'}, {'page': '6', 'x': '166.39', 'y': '164.70', 'h': '392.88', 'w': '9.58'}, {'page': '6', 'x': '166.10', 'y': '177.25', 'h': '394.83', 'w': '9.58'}, {'page': '6', 'x': '166.10', 'y': '189.80', 'h': '393.18', 'w': '9.58'}, {'page': '6', 'x': '166.39', 'y': '202.36', 'h': '125.01', 'w': '9.58'}]]\", 'pages': \"('6', '6')\", 'section_title': '1-D SDS-PAGE Qualitative Analysis', 'section_number': '3.1.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis 
Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='A total of 213 proteins were identified following ProgenesisQI™ using Mascot (Table S1).One RF-negative and ACPA-positive sample returned a very low alignment score of 8.6% and was, therefore, excluded from the analysis.For the remaining samples, more than 1 unique peptide was mapped to 165 proteins out of 213 proteins.Out of 213 proteins, 124 proteins showed >a 2-fold change.A total of 37 out of these 213 proteins had q-value < 0.05.', metadata={'text': 'A total of 213 proteins were identified following ProgenesisQI™ using Mascot (Table S1).One RF-negative and ACPA-positive sample returned a very low alignment score of 8.6% and was, therefore, excluded from the analysis.For the remaining samples, more than 1 unique peptide was mapped to 165 proteins out of 213 proteins.Out of 213 proteins, 124 proteins showed >a 2-fold change.A total of 37 out of these 213 proteins had q-value < 0.05.', 'para': '4', 'bboxes': \"[[{'page': '7', 'x': '187.65', 'y': '453.98', 'h': '371.62', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '466.53', 'h': '16.21', 'w': '9.58'}], [{'page': '7', 'x': '185.69', 'y': '466.53', 'h': '373.58', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '479.08', 'h': '238.50', 'w': '9.58'}], [{'page': '7', 'x': '409.44', 'y': '479.08', 'h': '149.84', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '491.63', 'h': '305.54', 'w': '9.58'}], [{'page': '7', 'x': '475.02', 'y': '491.63', 'h': '85.50', 'w': '9.58'}, {'page': '7', 'x': '165.90', 'y': '504.19', 'h': '181.57', 'w': '9.58'}], [{'page': '7', 'x': '356.21', 'y': '504.19', 'h': '203.06', 'w': '9.58'}, {'page': '7', 'x': '166.12', 'y': '516.74', 'h': '64.13', 'w': '9.58'}]]\", 'pages': \"('7', '7')\", 'section_title': 'Identification of Proteins in Serum', 'section_number': '3.2.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='The comparative analysis of all groups (a FDR adjusted p-value of <0.05, identified ≥2 unique peptides and a >2 fold expression change) identified 25 proteins that were DE (Table 2), of which 10 proteins were DE between healthy control subjects and 1 of the serotypes including PZP, selenoprotein P (SELENOP), C4b-binding protein (C4BP) beta chain, apolipoprotein M (ApoM), N-acetylmuramoyl-L-alanine amidase (NAMLAA), carboxypeptidase N (CPN) catalytic chain, oncoprotein Induced Transcript 3 (OIT3), CPN subunit 2, apolipoprotein C-I (ApoC1) and apolipoprotein C-III (ApoCIII).', metadata={'text': 'The comparative analysis of all groups (a FDR adjusted p-value of <0.05, identified ≥2 unique peptides and a >2 fold expression change) identified 25 proteins that were DE (Table 2), of which 10 proteins were DE between healthy control subjects and 1 of the serotypes including PZP, selenoprotein P (SELENOP), C4b-binding protein (C4BP) beta chain, apolipoprotein M (ApoM), N-acetylmuramoyl-L-alanine amidase (NAMLAA), carboxypeptidase N (CPN) catalytic chain, oncoprotein Induced Transcript 3 (OIT3), CPN subunit 2, apolipoprotein C-I (ApoC1) and apolipoprotein C-III (ApoCIII).', 'para': '0', 'bboxes': \"[[{'page': '7', 'x': '187.65', 'y': '554.57', 'h': '373.28', 'w': '9.71'}, {'page': '7', 'x': '166.39', 'y': '566.93', 'h': '392.88', 'w': '9.90'}, {'page': '7', 'x': '166.39', 'y': '579.80', 'h': '392.88', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '592.36', 'h': '393.87', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '604.91', 'h': '394.12', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '617.46', 'h': '392.88', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '630.02', 'h': '326.51', 'w': '9.58'}]]\", 'pages': \"('7', '7')\", 'section_title': 'Differentially Expressed Proteins', 'section_number': '3.3.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 
'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='The PCA analysis (Figure 3A,B) showed that only 22.1% of the proteins (PC1) were divided between RA patients and healthy controls.The distribution only decreased to 21.3%, when only the patient groups were included in PCA (Figure 3C).The heat map of the proteins showed that the group averages of various proteins were different between patients and healthy subjects (Figure 4A).The heat map of the patient serotypes and controls however showed that although distinguishable patterns of expression existed between normalized abundances of individual proteins between patient serotypes as well as healthy subjects, only Q96PD5 (NAMLAA) showed similar trends across all the RA serogrpups as compared to healthy controls (Figure 4B).', metadata={'text': 'The PCA analysis (Figure 3A,B) showed that only 22.1% of the proteins (PC1) were divided between RA patients and healthy controls.The distribution only decreased to 21.3%, when only the patient groups were included in PCA (Figure 3C).The heat map of the proteins showed that the group averages of various proteins were different between patients and healthy subjects (Figure 4A).The heat map of the patient serotypes and controls however showed that although distinguishable patterns of expression existed between normalized abundances of individual proteins between patient serotypes as well as healthy subjects, only Q96PD5 (NAMLAA) showed similar trends across all the RA serogrpups as compared to healthy controls (Figure 4B).', 'para': '3', 'bboxes': \"[[{'page': '7', 'x': '187.65', 'y': '642.57', 'h': '371.62', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '655.12', 'h': '231.62', 'w': '9.58'}], [{'page': '7', 'x': '402.93', 'y': '655.12', 'h': '156.34', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '667.67', 'h': '318.32', 'w': '9.58'}], [{'page': '7', 'x': '487.20', 'y': '667.67', 'h': '72.08', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '680.23', 'h': '392.88', 'w': '9.58'}, {'page': '7', 'x': 
'166.10', 'y': '692.78', 'h': '192.29', 'w': '9.58'}], [{'page': '7', 'x': '362.14', 'y': '692.78', 'h': '197.14', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '705.33', 'h': '392.88', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '717.89', 'h': '392.88', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '730.44', 'h': '393.27', 'w': '9.58'}, {'page': '7', 'x': '166.39', 'y': '742.99', 'h': '246.41', 'w': '9.58'}]]\", 'pages': \"('7', '7')\", 'section_title': 'Differentially Expressed Proteins', 'section_number': '3.3.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='Canonical pathway analysis was undertaken on the DE proteins between each serotype of RA and healthy controls.The comparison of double-positive RA samples with healthy controls predicted activation of dendritic cell maturation (p = 0.009); and inhibition of liver X receptor/retinoid X receptor (LXR/RXR) pathway (p = 7.9 × 10 -28 ), acute phase response signalling (p = 3.16 × 10 -27 ) and production of NO and ROS species in themacrophages (p = 1.41 × 10 -08 ) (Figure 5A).The comparison of RF-positive RA patients with healthy controls revealed an activation of the coagulation system (p = 3.98 × 10 -11 ), the intrinsic prothrombin activation pathway (p = 8.70 × 10 -09 ) and the GP6 signaling pathway (p = 0.0009); and inhibition of the LXR/RXR pathway (p = 5.01 × 10 -21 ), production of NO and ROS in macrophages (p = 2.57 × 10 -08 ) and maturity onset diabetes of young (MODY) signaling (p = 2.29 × 10 -06 ) (Figure 5B).The comparison of ACPA-positive RA patients with healthy controls revealed activation of the coagulation system (p = 3.54 × 10 -08 ), the intrinsic prothrombin activation pathway (p = 4.89 × 10 -06 ), the extrinsic prothrombin activation pathway (p = 5.01 × 10 -10 ) and acute phase response signalling (p = 5.01 × 10 -11 ); and inhibition of the LXR/RXR pathway (p = 1.99 × 10 -14 ) and production of NO and ROS in macrophages (p = 0.001) (Figure 5C).Pathway analysis of double-negative RA patients with healthy controls revealed the activation of the coagulation system (p = 7.94 × 10 -19 ), the intrinsic prothrombin activation pathway (p = 5.01 × 10 -12 ) and the extrinsic prothrombin activation pathway (p = 1.25 × 10 -13 ); and inhibition of the LXR/RXR pathway (p = 1.58 × 10 -25 ); acute phase response signalling (p = 1 × 10 -23 ) and production of NO and ROS in macrophages (p = 2.18 × 10 -10 ) (Figure 5D).', metadata={'text': 'Canonical pathway analysis was undertaken on the DE proteins between each serotype of RA and healthy 
controls.The comparison of double-positive RA samples with healthy controls predicted activation of dendritic cell maturation (p = 0.009); and inhibition of liver X receptor/retinoid X receptor (LXR/RXR) pathway (p = 7.9 × 10 -28 ), acute phase response signalling (p = 3.16 × 10 -27 ) and production of NO and ROS species in themacrophages (p = 1.41 × 10 -08 ) (Figure 5A).The comparison of RF-positive RA patients with healthy controls revealed an activation of the coagulation system (p = 3.98 × 10 -11 ), the intrinsic prothrombin activation pathway (p = 8.70 × 10 -09 ) and the GP6 signaling pathway (p = 0.0009); and inhibition of the LXR/RXR pathway (p = 5.01 × 10 -21 ), production of NO and ROS in macrophages (p = 2.57 × 10 -08 ) and maturity onset diabetes of young (MODY) signaling (p = 2.29 × 10 -06 ) (Figure 5B).The comparison of ACPA-positive RA patients with healthy controls revealed activation of the coagulation system (p = 3.54 × 10 -08 ), the intrinsic prothrombin activation pathway (p = 4.89 × 10 -06 ), the extrinsic prothrombin activation pathway (p = 5.01 × 10 -10 ) and acute phase response signalling (p = 5.01 × 10 -11 ); and inhibition of the LXR/RXR pathway (p = 1.99 × 10 -14 ) and production of NO and ROS in macrophages (p = 0.001) (Figure 5C).Pathway analysis of double-negative RA patients with healthy controls revealed the activation of the coagulation system (p = 7.94 × 10 -19 ), the intrinsic prothrombin activation pathway (p = 5.01 × 10 -12 ) and the extrinsic prothrombin activation pathway (p = 1.25 × 10 -13 ); and inhibition of the LXR/RXR pathway (p = 1.58 × 10 -25 ); acute phase response signalling (p = 1 × 10 -23 ) and production of NO and ROS in macrophages (p = 2.18 × 10 -10 ) (Figure 5D).', 'para': '4', 'bboxes': \"[[{'page': '10', 'x': '187.65', 'y': '187.01', 'h': '371.62', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '199.57', 'h': '121.34', 'w': '9.58'}], [{'page': '10', 'x': '290.85', 'y': '199.57', 'h': '268.81', 'w': '9.58'}, 
{'page': '10', 'x': '166.39', 'y': '212.12', 'h': '393.08', 'w': '9.58'}, {'page': '10', 'x': '166.07', 'y': '224.36', 'h': '279.60', 'w': '9.90'}, {'page': '10', 'x': '445.76', 'y': '222.44', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '460.56', 'y': '224.67', 'h': '98.72', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '236.91', 'h': '107.64', 'w': '9.90'}, {'page': '10', 'x': '274.13', 'y': '235.00', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '288.92', 'y': '237.23', 'h': '270.35', 'w': '9.58'}, {'page': '10', 'x': '166.07', 'y': '249.46', 'h': '61.63', 'w': '9.90'}, {'page': '10', 'x': '227.80', 'y': '247.55', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '242.59', 'y': '249.78', 'h': '60.14', 'w': '9.58'}], [{'page': '10', 'x': '305.42', 'y': '249.78', 'h': '254.24', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '262.02', 'h': '324.85', 'w': '9.90'}, {'page': '10', 'x': '491.34', 'y': '260.10', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '506.13', 'y': '262.33', 'h': '54.79', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '274.57', 'h': '228.54', 'w': '9.90'}, {'page': '10', 'x': '395.03', 'y': '272.65', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '409.83', 'y': '274.88', 'h': '149.84', 'w': '9.58'}, {'page': '10', 'x': '166.07', 'y': '287.12', 'h': '292.45', 'w': '9.90'}, {'page': '10', 'x': '458.61', 'y': '285.21', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '473.41', 'y': '287.44', 'h': '85.87', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '299.67', 'h': '171.14', 'w': '9.90'}, {'page': '10', 'x': '337.63', 'y': '297.76', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '352.42', 'y': '299.99', 'h': '207.85', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '312.23', 'h': '106.03', 'w': '9.90'}, {'page': '10', 'x': '272.52', 'y': '310.31', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '287.32', 'y': '312.54', 'h': '58.57', 'w': '9.58'}], [{'page': '10', 'x': '348.64', 'y': '312.54', 'h': '210.64', 'w': '9.58'}, {'page': 
'10', 'x': '165.98', 'y': '324.78', 'h': '356.34', 'w': '9.90'}, {'page': '10', 'x': '522.41', 'y': '322.87', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '537.20', 'y': '325.10', 'h': '22.07', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '337.33', 'h': '240.24', 'w': '9.90'}, {'page': '10', 'x': '406.72', 'y': '335.42', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '421.52', 'y': '337.65', 'h': '139.41', 'w': '9.58'}, {'page': '10', 'x': '166.12', 'y': '349.89', 'h': '131.91', 'w': '9.90'}, {'page': '10', 'x': '298.12', 'y': '347.97', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '312.92', 'y': '349.89', 'h': '226.91', 'w': '9.90'}, {'page': '10', 'x': '539.92', 'y': '347.97', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '554.71', 'y': '350.20', 'h': '5.81', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '362.44', 'h': '235.96', 'w': '9.90'}, {'page': '10', 'x': '402.45', 'y': '360.53', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '417.24', 'y': '362.76', 'h': '142.04', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '375.31', 'h': '172.72', 'w': '9.58'}], [{'page': '10', 'x': '341.60', 'y': '375.31', 'h': '217.67', 'w': '9.58'}, {'page': '10', 'x': '165.98', 'y': '387.55', 'h': '373.85', 'w': '9.90'}, {'page': '10', 'x': '539.92', 'y': '385.63', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '554.71', 'y': '387.86', 'h': '5.81', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '400.10', 'h': '272.29', 'w': '9.90'}, {'page': '10', 'x': '438.77', 'y': '398.18', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '453.57', 'y': '400.41', 'h': '107.36', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '412.65', 'h': '190.95', 'w': '9.90'}, {'page': '10', 'x': '357.44', 'y': '410.74', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '372.24', 'y': '412.97', 'h': '187.43', 'w': '9.58'}, {'page': '10', 'x': '166.07', 'y': '425.20', 'h': '60.01', 'w': '9.90'}, {'page': '10', 'x': '226.17', 'y': '423.29', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': 
'240.97', 'y': '425.20', 'h': '198.64', 'w': '9.90'}, {'page': '10', 'x': '439.70', 'y': '423.29', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '454.50', 'y': '425.52', 'h': '104.78', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '437.76', 'h': '173.96', 'w': '9.90'}, {'page': '10', 'x': '340.45', 'y': '435.84', 'h': '14.30', 'w': '6.92'}, {'page': '10', 'x': '355.24', 'y': '438.07', 'h': '58.63', 'w': '9.58'}]]\", 'pages': \"('10', '10')\", 'section_title': 'Pathway Analysis', 'section_number': '3.4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='The comparison of the four serotypes of RA with healthy controls revealed an inhibition of inflammatory response, leukocyte migration, binding of professional phagocytic cells, migration of cells, adhesion of phagocytes, cell movement of phagocytes and cell movement of leukocytes in all serotypes except double-negative serotype.Accumulation of leukocytes was, however, inhibited in all serotypes.Concentration of cholesterol was inhibited in all serotypes except ACPA-positive patients that did not show activation or inhibition of this protein (Figure 6).The detailed results of pathway analysis are provided in Table S2.', metadata={'text': 'The comparison of the four serotypes of RA with healthy controls revealed an inhibition of inflammatory response, leukocyte migration, binding of professional phagocytic cells, migration of cells, adhesion of phagocytes, cell movement of phagocytes and cell movement of leukocytes in all serotypes except double-negative serotype.Accumulation of leukocytes was, however, inhibited in all serotypes.Concentration of cholesterol was inhibited in all serotypes except ACPA-positive patients that did not show activation or inhibition of this protein (Figure 6).The detailed results of pathway analysis are provided in Table S2.', 'para': '3', 'bboxes': \"[[{'page': '10', 'x': '187.65', 'y': '450.63', 'h': '373.27', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '463.18', 'h': '392.88', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '475.73', 'h': '392.88', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '488.29', 'h': '326.66', 'w': '9.58'}], [{'page': '10', 'x': '496.17', 'y': '488.29', 'h': '63.10', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '500.84', 'h': '242.32', 'w': '9.58'}], [{'page': '10', 'x': '412.04', 'y': '500.84', 'h': '147.24', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '513.39', 'h': '393.08', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '525.94', 'h': '154.46', 'w': '9.58'}], 
[{'page': '10', 'x': '323.94', 'y': '525.94', 'h': '235.34', 'w': '9.58'}, {'page': '10', 'x': '166.39', 'y': '538.50', 'h': '49.73', 'w': '9.58'}]]\", 'pages': \"('10', '10')\", 'section_title': 'Pathway Analysis', 'section_number': '3.4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='cells, migration of cells, adhesion of phagocytes, cell movement of phagocytes and cell movement of leukocytes in all serotypes except double-negative serotype.Accumulation of leukocytes was, however, inhibited in all serotypes.Concentration of cholesterol was inhibited in all serotypes except ACPA-positive patients that did not show activation or inhibition of this protein (Figure 6).The detailed results of pathway analysis are provided in Table S2.', metadata={'text': 'cells, migration of cells, adhesion of phagocytes, cell movement of phagocytes and cell movement of leukocytes in all serotypes except double-negative serotype.Accumulation of leukocytes was, however, inhibited in all serotypes.Concentration of cholesterol was inhibited in all serotypes except ACPA-positive patients that did not show activation or inhibition of this protein (Figure 6).The detailed results of pathway analysis are provided in Table S2.', 'para': '3', 'bboxes': \"[[{'page': '11', 'x': '161.33', 'y': '2.64', 'h': '392.96', 'w': '10.17'}, {'page': '11', 'x': '161.33', 'y': '15.42', 'h': '327.31', 'w': '10.17'}], [{'page': '11', 'x': '491.28', 'y': '15.42', 'h': '63.01', 'w': '10.17'}, {'page': '11', 'x': '161.33', 'y': '28.26', 'h': '243.11', 'w': '10.17'}], [{'page': '11', 'x': '407.57', 'y': '28.26', 'h': '146.64', 'w': '10.17'}, {'page': '11', 'x': '161.33', 'y': '41.10', 'h': '392.99', 'w': '10.17'}, {'page': '11', 'x': '161.33', 'y': '53.88', 'h': '154.79', 'w': '10.17'}], [{'page': '11', 'x': '318.41', 'y': '53.88', 'h': '236.00', 'w': '10.17'}, {'page': '11', 'x': '161.33', 'y': '66.71', 'h': '50.88', 'w': '10.17'}]]\", 'pages': \"('11', '11')\", 'section_title': 'Pathway Analysis', 'section_number': '3.4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='We validated the mass spectrometry results using ELISA for PZP.As Figure 7 shows, the expression of PZP was significantly higher among patients (7.54 ± 6.35 µg/mL) as compared to controls (1.03 ± 0.54 µg/mL (p-value 7.41 × 10 -11 ).The PZP concentration for each sample is represented in Table S3.The sensitivity of PZP for detecting RA is 96.7% and specificity is 95%.', metadata={'text': 'We validated the mass spectrometry results using ELISA for PZP.As Figure 7 shows, the expression of PZP was significantly higher among patients (7.54 ± 6.35 µg/mL) as compared to controls (1.03 ± 0.54 µg/mL (p-value 7.41 × 10 -11 ).The PZP concentration for each sample is represented in Table S3.The sensitivity of PZP for detecting RA is 96.7% and specificity is 95%.', 'para': '3', 'bboxes': \"[[{'page': '13', 'x': '187.65', 'y': '113.59', 'h': '286.90', 'w': '9.58'}], [{'page': '13', 'x': '477.04', 'y': '113.59', 'h': '83.48', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '125.83', 'h': '392.88', 'w': '9.90'}, {'page': '13', 'x': '166.39', 'y': '138.38', 'h': '267.36', 'w': '9.90'}, {'page': '13', 'x': '433.85', 'y': '136.46', 'h': '13.80', 'w': '6.92'}, {'page': '13', 'x': '448.14', 'y': '138.70', 'h': '5.91', 'w': '9.58'}], [{'page': '13', 'x': '457.13', 'y': '138.70', 'h': '102.14', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '151.25', 'h': '183.14', 'w': '9.58'}], [{'page': '13', 'x': '352.62', 'y': '151.25', 'h': '207.49', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '163.80', 'h': '96.73', 'w': '9.58'}]]\", 'pages': \"('13', '13')\", 'section_title': 'Validation of Mass Spectrometry Using ELISA', 'section_number': '3.5.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='Life 2022, 12, x FOR PEER REVIEW 14 of 19 of the proteins up-or downregulation with the activation of the respective function.The negative Z score on contrary represents inhibition of the function.The orange-colored squares represent upregulation during the disease state and the blue squares represent downregulation with the color intensity being directly correlated with the prediction strength.', metadata={'text': 'Life 2022, 12, x FOR PEER REVIEW 14 of 19 of the proteins up-or downregulation with the activation of the respective function.The negative Z score on contrary represents inhibition of the function.The orange-colored squares represent upregulation during the disease state and the blue squares represent downregulation with the color intensity being directly correlated with the prediction strength.', 'para': '2', 'bboxes': \"[[{'page': '13', 'x': '37.64', 'y': '1.90', 'h': '123.72', 'w': '8.10'}, {'page': '13', 'x': '529.95', 'y': '1.90', 'h': '30.99', 'w': '8.04'}, {'page': '13', 'x': '168.02', 'y': '39.19', 'h': '331.53', 'w': '9.07'}], [{'page': '13', 'x': '501.74', 'y': '39.19', 'h': '59.29', 'w': '9.07'}, {'page': '13', 'x': '168.01', 'y': '50.71', 'h': '221.30', 'w': '9.07'}], [{'page': '13', 'x': '392.29', 'y': '50.71', 'h': '168.68', 'w': '9.07'}, {'page': '13', 'x': '168.01', 'y': '62.30', 'h': '392.96', 'w': '9.07'}, {'page': '13', 'x': '168.01', 'y': '73.88', 'h': '250.47', 'w': '9.07'}]]\", 'pages': \"('13', '13')\", 'section_title': 'Validation of Mass Spectrometry Using ELISA', 'section_number': '3.5.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='We validated the mass spectrometry results using ELISA for PZP.As Figure 7 shows, the expression of PZP was significantly higher among patients (7.54 ± 6.35 µg/mL) as compared to controls (1.03 ± 0.54 µg/mL (p-value 7.41 × 10 -11 ).The PZP concentration for each sample is represented in Table S3.The sensitivity of PZP for detecting RA is 96.7% and specificity is 95%.', metadata={'text': 'We validated the mass spectrometry results using ELISA for PZP.As Figure 7 shows, the expression of PZP was significantly higher among patients (7.54 ± 6.35 µg/mL) as compared to controls (1.03 ± 0.54 µg/mL (p-value 7.41 × 10 -11 ).The PZP concentration for each sample is represented in Table S3.The sensitivity of PZP for detecting RA is 96.7% and specificity is 95%.', 'para': '3', 'bboxes': \"[[{'page': '13', 'x': '189.26', 'y': '113.46', 'h': '286.80', 'w': '10.10'}], [{'page': '13', 'x': '478.18', 'y': '113.46', 'h': '82.72', 'w': '10.10'}, {'page': '13', 'x': '168.01', 'y': '126.30', 'h': '393.04', 'w': '10.10'}, {'page': '13', 'x': '168.00', 'y': '139.14', 'h': '251.76', 'w': '10.11'}], [{'page': '13', 'x': '422.30', 'y': '139.14', 'h': '138.68', 'w': '10.10'}, {'page': '13', 'x': '168.02', 'y': '151.92', 'h': '153.23', 'w': '10.10'}], [{'page': '13', 'x': '324.44', 'y': '151.92', 'h': '236.56', 'w': '10.10'}, {'page': '13', 'x': '168.02', 'y': '164.75', 'h': '77.85', 'w': '10.10'}]]\", 'pages': \"('13', '13')\", 'section_title': 'Validation of Mass Spectrometry Using ELISA', 'section_number': '3.5.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='In this study, we identified 10 DE proteins between RA serotypes and healthy controls.Next, we undertook successfully validation of one of the DE proteins; PZP, in an independent sample cohort indicating our findings for this protein are applicable to another population.We then performed canonical pathway analysis for the DE proteins across each serotype in comparison to healthy controls to identify the key pathways and biological processes that are perturbed across these serotypes.We used ProteoMiner TM protein enrichment columns to deplete the proteins with high abundance and enrich the proteins with low abundance [23].ProteoMiner TM protein enrichment of low abundance proteins has several advantages over the immunoaffinity-based protein depletion', metadata={'text': 'In this study, we identified 10 DE proteins between RA serotypes and healthy controls.Next, we undertook successfully validation of one of the DE proteins; PZP, in an independent sample cohort indicating our findings for this protein are applicable to another population.We then performed canonical pathway analysis for the DE proteins across each serotype in comparison to healthy controls to identify the key pathways and biological processes that are perturbed across these serotypes.We used ProteoMiner TM protein enrichment columns to deplete the proteins with high abundance and enrich the proteins with low abundance [23].ProteoMiner TM protein enrichment of low abundance proteins has several advantages over the immunoaffinity-based protein depletion', 'para': '4', 'bboxes': \"[[{'page': '13', 'x': '189.27', 'y': '595.50', 'h': '371.67', 'w': '10.10'}, {'page': '13', 'x': '168.02', 'y': '608.28', 'h': '22.36', 'w': '10.10'}], [{'page': '13', 'x': '193.76', 'y': '608.28', 'h': '367.28', 'w': '10.10'}, {'page': '13', 'x': '168.03', 'y': '621.12', 'h': '392.93', 'w': '10.10'}, {'page': '13', 'x': '168.03', 'y': '633.95', 'h': '78.65', 'w': '10.10'}], [{'page': '13', 'x': 
'250.60', 'y': '633.95', 'h': '310.37', 'w': '10.10'}, {'page': '13', 'x': '168.03', 'y': '646.73', 'h': '392.99', 'w': '10.10'}, {'page': '13', 'x': '168.03', 'y': '659.56', 'h': '280.89', 'w': '10.10'}], [{'page': '13', 'x': '452.48', 'y': '659.56', 'h': '108.46', 'w': '10.10'}, {'page': '13', 'x': '168.02', 'y': '672.42', 'h': '392.93', 'w': '10.10'}, {'page': '13', 'x': '168.01', 'y': '685.19', 'h': '153.16', 'w': '10.10'}], [{'page': '13', 'x': '324.12', 'y': '685.19', 'h': '236.83', 'w': '10.11'}, {'page': '13', 'x': '168.02', 'y': '698.04', 'h': '393.02', 'w': '10.10'}]]\", 'pages': \"('13', '13')\", 'section_title': 'Discussion', 'section_number': '4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='In this study, we identified 10 DE proteins between RA serotypes and healthy controls.Next, we undertook successfully validation of one of the DE proteins; PZP, in an independent sample cohort indicating our findings for this protein are applicable to another population.We then performed canonical pathway analysis for the DE proteins across each serotype in comparison to healthy controls to identify the key pathways and biological processes that are perturbed across these serotypes.We used ProteoMiner TM protein enrichment columns to deplete the proteins with high abundance and enrich the proteins with low abundance [23].ProteoMiner TM protein enrichment of low abundance proteins has several advantages over the immunoaffinity-based protein depletion approaches including a relatively less-complicated procedure, high material yield and reproducibility [24,25].', metadata={'text': 'In this study, we identified 10 DE proteins between RA serotypes and healthy controls.Next, we undertook successfully validation of one of the DE proteins; PZP, in an independent sample cohort indicating our findings for this protein are applicable to another population.We then performed canonical pathway analysis for the DE proteins across each serotype in comparison to healthy controls to identify the key pathways and biological processes that are perturbed across these serotypes.We used ProteoMiner TM protein enrichment columns to deplete the proteins with high abundance and enrich the proteins with low abundance [23].ProteoMiner TM protein enrichment of low abundance proteins has several advantages over the immunoaffinity-based protein depletion approaches including a relatively less-complicated procedure, high material yield and reproducibility [24,25].', 'para': '4', 'bboxes': \"[[{'page': '13', 'x': '187.65', 'y': '594.15', 'h': '373.37', 'w': '9.58'}], [{'page': '13', 'x': '166.39', 'y': '606.70', 'h': '394.53', 'w': '9.58'}, {'page': '13', 'x': '166.10', 
'y': '619.26', 'h': '393.37', 'w': '9.58'}, {'page': '13', 'x': '166.10', 'y': '631.81', 'h': '50.15', 'w': '9.58'}], [{'page': '13', 'x': '219.31', 'y': '631.81', 'h': '339.97', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '644.36', 'h': '392.88', 'w': '9.58'}, {'page': '13', 'x': '166.10', 'y': '656.92', 'h': '220.39', 'w': '9.58'}], [{'page': '13', 'x': '389.49', 'y': '656.92', 'h': '93.06', 'w': '9.58'}, {'page': '13', 'x': '482.56', 'y': '654.92', 'h': '11.80', 'w': '7.28'}, {'page': '13', 'x': '497.10', 'y': '656.92', 'h': '63.82', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '669.47', 'h': '392.89', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '682.02', 'h': '90.84', 'w': '9.58'}], [{'page': '13', 'x': '260.75', 'y': '682.02', 'h': '56.61', 'w': '9.58'}, {'page': '13', 'x': '317.37', 'y': '680.03', 'h': '11.80', 'w': '7.28'}, {'page': '13', 'x': '332.30', 'y': '682.02', 'h': '226.97', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '694.57', 'h': '392.88', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '707.13', 'h': '382.95', 'w': '9.58'}]]\", 'pages': \"('13', '13')\", 'section_title': 'Discussion', 'section_number': '4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='PZP is a high-molecular-weight immunosuppressive glycoprotein that is elevated during pregnancy.The role of this protein as an autoimmunity mediator was established by a recent LC-MS/MS-based study in inflammatory bowel disease patients [26].In this study, we also found increased expression of PZP in all the RA serotypes as compared to the controls using LC-MS/MS.The results were further validated by ELISA in a different cohort of RA patients and subjects.The high sensitivity and specificity of this protein for RA patients signify strong candidacy of PZP as a disease biomarker.', metadata={'text': 'PZP is a high-molecular-weight immunosuppressive glycoprotein that is elevated during pregnancy.The role of this protein as an autoimmunity mediator was established by a recent LC-MS/MS-based study in inflammatory bowel disease patients [26].In this study, we also found increased expression of PZP in all the RA serotypes as compared to the controls using LC-MS/MS.The results were further validated by ELISA in a different cohort of RA patients and subjects.The high sensitivity and specificity of this protein for RA patients signify strong candidacy of PZP as a disease biomarker.', 'para': '4', 'bboxes': \"[[{'page': '13', 'x': '187.65', 'y': '719.68', 'h': '371.62', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '732.23', 'h': '81.38', 'w': '9.58'}], [{'page': '13', 'x': '250.90', 'y': '732.23', 'h': '308.38', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '744.79', 'h': '361.65', 'w': '9.58'}], [{'page': '13', 'x': '531.13', 'y': '744.79', 'h': '28.14', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '757.34', 'h': '392.88', 'w': '9.58'}, {'page': '13', 'x': '166.39', 'y': '769.89', 'h': '136.95', 'w': '9.58'}], [{'page': '13', 'x': '305.83', 'y': '769.89', 'h': '253.45', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '98.05', 'h': '154.73', 'w': '9.58'}], [{'page': '14', 'x': '324.24', 'y': '98.05', 'h': '235.24', 'w': '9.58'}, {'page': 
'14', 'x': '166.39', 'y': '110.60', 'h': '299.06', 'w': '9.58'}]]\", 'pages': \"('13', '14')\", 'section_title': 'Discussion', 'section_number': '4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='In this study, the serum expression of SELENOP was decreased in all RA serotypes in comparison to controls.SELENOP is a biomarker of selenium status that has been identified as a major preventable trigger for autoimmune diseases including RA [27].In comparison to controls, the serum selenium concentrations [28] and SELENOP concentrations [29,30] have been reported to be decreased in RA patients.The selenium status has been linked to the upregulation of a whole set of inflammation-related genes via nuclear factor kappalight-chain enhancer of activated B cells (NF-κB) mediated activation of several intracellular selenoproteins [28].The role of selenium and SELENOP, combined with previous findings suggest strong candidacy of this protein as a biomarker of autoimmunity.', metadata={'text': 'In this study, the serum expression of SELENOP was decreased in all RA serotypes in comparison to controls.SELENOP is a biomarker of selenium status that has been identified as a major preventable trigger for autoimmune diseases including RA [27].In comparison to controls, the serum selenium concentrations [28] and SELENOP concentrations [29,30] have been reported to be decreased in RA patients.The selenium status has been linked to the upregulation of a whole set of inflammation-related genes via nuclear factor kappalight-chain enhancer of activated B cells (NF-κB) mediated activation of several intracellular selenoproteins [28].The role of selenium and SELENOP, combined with previous findings suggest strong candidacy of this protein as a biomarker of autoimmunity.', 'para': '4', 'bboxes': \"[[{'page': '14', 'x': '187.65', 'y': '123.15', 'h': '371.63', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '135.71', 'h': '100.51', 'w': '9.58'}], [{'page': '14', 'x': '269.87', 'y': '135.71', 'h': '289.40', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '148.26', 'h': '326.53', 'w': '9.58'}], [{'page': '14', 'x': '496.01', 'y': '148.26', 'h': '63.26', 'w': 
'9.58'}, {'page': '14', 'x': '166.39', 'y': '160.81', 'h': '392.88', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '173.37', 'h': '227.71', 'w': '9.58'}], [{'page': '14', 'x': '397.21', 'y': '173.37', 'h': '162.06', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '185.92', 'h': '394.53', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '198.47', 'h': '393.08', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '211.02', 'h': '84.88', 'w': '9.58'}], [{'page': '14', 'x': '254.41', 'y': '211.02', 'h': '304.87', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '223.58', 'h': '322.31', 'w': '9.58'}]]\", 'pages': \"('14', '14')\", 'section_title': 'Discussion', 'section_number': '4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='NAMLAA degrades bacterial cell wall component peptidoglycan [31] that has strong pro-inflammatory properties and can induce arthritis in rat models [32,33].The degradation of these pro-inflammatory components should suggestively confer an anti-inflammatory and protective role to NAMLAA against arthritis.However, Saha et al. [34] demonstrated that NAMLAA is indeed essential for the development of arthritis, a relatively unexpected finding.The study findings of Saha et al. [34] have not been supported by animal model studies for other inflammatory diseases [35].Decreased levels of this protein in human RA subjects as compared to healthy controls were observed in this study.The autoantigenic potential of NAMLAA and the presence of antibodies has been reported in a recent study [18] that can explain the lower serum levels of circulating NAMLAA.The imbalance of this homeostasis is probably responsible for the development of RA that needs to be further explored.', metadata={'text': 'NAMLAA degrades bacterial cell wall component peptidoglycan [31] that has strong pro-inflammatory properties and can induce arthritis in rat models [32,33].The degradation of these pro-inflammatory components should suggestively confer an anti-inflammatory and protective role to NAMLAA against arthritis.However, Saha et al. [34] demonstrated that NAMLAA is indeed essential for the development of arthritis, a relatively unexpected finding.The study findings of Saha et al. 
[34] have not been supported by animal model studies for other inflammatory diseases [35].Decreased levels of this protein in human RA subjects as compared to healthy controls were observed in this study.The autoantigenic potential of NAMLAA and the presence of antibodies has been reported in a recent study [18] that can explain the lower serum levels of circulating NAMLAA.The imbalance of this homeostasis is probably responsible for the development of RA that needs to be further explored.', 'para': '6', 'bboxes': \"[[{'page': '14', 'x': '187.65', 'y': '236.13', 'h': '371.62', 'w': '9.58'}, {'page': '14', 'x': '166.10', 'y': '248.68', 'h': '319.08', 'w': '9.58'}], [{'page': '14', 'x': '488.13', 'y': '248.68', 'h': '71.15', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '261.24', 'h': '393.27', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '273.79', 'h': '217.44', 'w': '9.58'}], [{'page': '14', 'x': '386.91', 'y': '273.79', 'h': '172.36', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '286.34', 'h': '392.89', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '298.90', 'h': '35.23', 'w': '9.58'}], [{'page': '14', 'x': '204.71', 'y': '298.90', 'h': '354.57', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '311.45', 'h': '200.75', 'w': '9.58'}], [{'page': '14', 'x': '371.28', 'y': '311.45', 'h': '187.99', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '324.00', 'h': '329.87', 'w': '9.58'}], [{'page': '14', 'x': '500.37', 'y': '324.00', 'h': '60.55', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '336.55', 'h': '392.88', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '349.11', 'h': '327.04', 'w': '9.58'}], [{'page': '14', 'x': '495.92', 'y': '349.11', 'h': '63.36', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '361.66', 'h': '392.88', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '374.21', 'h': '74.85', 'w': '9.58'}]]\", 'pages': \"('14', '14')\", 'section_title': 'Discussion', 'section_number': '4.', 'paper_title': 'LC-MS/MS-Based Serum Protein 
Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='C4BP β-chain, a complement inhibitor [36], and CPN, a zinc metalloprotease [37], were also observed to be DE in this study. However, the lack of consensus regarding the role of these proteins in autoimmunity and RA suggests the need for further exploration.', metadata={'text': 'C4BP β-chain, a complement inhibitor [36], and CPN, a zinc metalloprotease [37], were also observed to be DE in this study. However, the lack of consensus regarding the role of these proteins in autoimmunity and RA suggests the need for further exploration.', 'para': '1', 'bboxes': \"[[{'page': '14', 'x': '187.65', 'y': '386.66', 'h': '372.87', 'w': '9.69'}, {'page': '14', 'x': '165.98', 'y': '399.32', 'h': '181.73', 'w': '9.58'}], [{'page': '14', 'x': '350.80', 'y': '399.32', 'h': '208.48', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '411.87', 'h': '343.98', 'w': '9.58'}]]\", 'pages': \"('14', '14')\", 'section_title': 'Discussion', 'section_number': '4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='We found three apolipoproteins to be DE between RA patients and healthy controls including ApoM, ApoC1 and ApoCIII.These apolipoproteins are implicated in protection against atherosclerosis owing to their role in HDL metabolism as well as anti-inflammatory properties [38].The polymorphisms in the ApoM gene have been associated with the risk of dyslipidaemia in RA patients [39,40].However, no study reports the serum levels of this chaperone in RA patients.ApoC1 has been identified as a predictor of drug response to RA [41,42].The risk of developing cardiovascular disease is elevated among RA patients than the general population [43,44].The observed decrease in the serum levels of these apolipoproteins in RA patients could suggestively explain the increased risk of developing cardiovascular disease among RA patients and highlight the link between these two illnesses.', metadata={'text': 'We found three apolipoproteins to be DE between RA patients and healthy controls including ApoM, ApoC1 and ApoCIII.These apolipoproteins are implicated in protection against atherosclerosis owing to their role in HDL metabolism as well as anti-inflammatory properties [38].The polymorphisms in the ApoM gene have been associated with the risk of dyslipidaemia in RA patients [39,40].However, no study reports the serum levels of this chaperone in RA patients.ApoC1 has been identified as a predictor of drug response to RA [41,42].The risk of developing cardiovascular disease is elevated among RA patients than the general population [43,44].The observed decrease in the serum levels of these apolipoproteins in RA patients could suggestively explain the increased risk of developing cardiovascular disease among RA patients and highlight the link between these two illnesses.', 'para': '6', 'bboxes': \"[[{'page': '14', 'x': '187.65', 'y': '424.42', 'h': '371.62', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '436.98', 'h': '169.44', 'w': '9.58'}], [{'page': '14', 
'x': '338.31', 'y': '436.98', 'h': '220.97', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '449.53', 'h': '393.27', 'w': '9.58'}, {'page': '14', 'x': '166.10', 'y': '462.08', 'h': '68.53', 'w': '9.58'}], [{'page': '14', 'x': '240.26', 'y': '462.08', 'h': '319.02', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '474.64', 'h': '199.50', 'w': '9.58'}], [{'page': '14', 'x': '370.58', 'y': '474.64', 'h': '190.35', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '487.19', 'h': '164.71', 'w': '9.58'}], [{'page': '14', 'x': '336.00', 'y': '487.19', 'h': '223.28', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '499.74', 'h': '102.28', 'w': '9.58'}], [{'page': '14', 'x': '271.78', 'y': '499.74', 'h': '287.50', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '512.30', 'h': '207.79', 'w': '9.58'}], [{'page': '14', 'x': '377.30', 'y': '512.30', 'h': '181.97', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '524.85', 'h': '392.88', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '537.40', 'h': '392.88', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '549.95', 'h': '58.69', 'w': '9.58'}]]\", 'pages': \"('14', '14')\", 'section_title': 'Discussion', 'section_number': '4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='The pathway analysis of the DE proteins showed that some pathways were differentially inhibited or activated in various serotypes suggesting that these serotypes are indeed regulated by different pathogenic mechanisms.However, some similarities were also observed including inhibition of LXR/RXR pathway and NO and ROS production in macrophages.LXR/RXR pathway was inhibited among all the RA serotypes.This pathway has been reported to inhibit atherosclerosis [45] and inflammation [46], suggesting an important and relatively unexplored link between this pathway and RA.The role of ROS in autoimmunity is complex and has been generally viewed as detrimental in the pathogenesis of autoimmune disease [47].A recent study revealed the regulatory role of these oxidative stress markers to prevent the pathogenesis of chronic inflammatory diseases [48].The inhibition of NO and ROS pathway in macrophage across all the serotypes warrants further exploration about the precise role of this pathway in the pathogenesis of RA.', metadata={'text': 'The pathway analysis of the DE proteins showed that some pathways were differentially inhibited or activated in various serotypes suggesting that these serotypes are indeed regulated by different pathogenic mechanisms.However, some similarities were also observed including inhibition of LXR/RXR pathway and NO and ROS production in macrophages.LXR/RXR pathway was inhibited among all the RA serotypes.This pathway has been reported to inhibit atherosclerosis [45] and inflammation [46], suggesting an important and relatively unexplored link between this pathway and RA.The role of ROS in autoimmunity is complex and has been generally viewed as detrimental in the pathogenesis of autoimmune disease [47].A recent study revealed the regulatory role of these oxidative stress markers to prevent the pathogenesis of chronic inflammatory diseases [48].The inhibition of NO and ROS pathway in macrophage across all the serotypes 
warrants further exploration about the precise role of this pathway in the pathogenesis of RA.', 'para': '6', 'bboxes': \"[[{'page': '14', 'x': '187.65', 'y': '562.51', 'h': '373.28', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '575.06', 'h': '392.88', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '587.61', 'h': '243.63', 'w': '9.58'}], [{'page': '14', 'x': '413.11', 'y': '587.61', 'h': '146.16', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '600.17', 'h': '392.88', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '612.72', 'h': '74.63', 'w': '9.58'}], [{'page': '14', 'x': '246.68', 'y': '612.72', 'h': '287.53', 'w': '9.58'}], [{'page': '14', 'x': '539.86', 'y': '612.72', 'h': '19.41', 'w': '9.58'}, {'page': '14', 'x': '166.10', 'y': '625.27', 'h': '393.18', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '637.83', 'h': '323.22', 'w': '9.58'}], [{'page': '14', 'x': '491.83', 'y': '637.83', 'h': '67.45', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '650.38', 'h': '394.53', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '662.93', 'h': '160.94', 'w': '9.58'}], [{'page': '14', 'x': '330.42', 'y': '662.93', 'h': '228.85', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '675.48', 'h': '394.63', 'w': '9.58'}], [{'page': '14', 'x': '166.09', 'y': '688.04', 'h': '393.19', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '700.59', 'h': '370.26', 'w': '9.58'}]]\", 'pages': \"('14', '14')\", 'section_title': 'Discussion', 'section_number': '4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='RA is a complex disorder with molecular and clinical heterogeneity.We used RF and ACPA to classify our patient population and studied the DE proteins in comparison to all healthy controls.However, due to the COVID-19 pandemic, only a limited number of samples could be collected for validation of the identified proteins.The lockdown situation also limited the access to the laboratory facilities and the samples were not tested for their individual RF and ACPA status.The validation of the mass spectrometry result for PZP in an independent cohort of patients suggest that identified proteins can be tested on larger cohorts of patients from different populations in the future to validate the study findings and identify disease biomarkers for RA.', metadata={'text': 'RA is a complex disorder with molecular and clinical heterogeneity.We used RF and ACPA to classify our patient population and studied the DE proteins in comparison to all healthy controls.However, due to the COVID-19 pandemic, only a limited number of samples could be collected for validation of the identified proteins.The lockdown situation also limited the access to the laboratory facilities and the samples were not tested for their individual RF and ACPA status.The validation of the mass spectrometry result for PZP in an independent cohort of patients suggest that identified proteins can be tested on larger cohorts of patients from different populations in the future to validate the study findings and identify disease biomarkers for RA.', 'para': '4', 'bboxes': \"[[{'page': '14', 'x': '187.65', 'y': '713.14', 'h': '297.24', 'w': '9.58'}], [{'page': '14', 'x': '487.97', 'y': '713.14', 'h': '71.30', 'w': '9.58'}, {'page': '14', 'x': '166.01', 'y': '725.70', 'h': '393.27', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '738.25', 'h': '87.33', 'w': '9.58'}], [{'page': '14', 'x': '256.82', 'y': '738.25', 'h': '302.45', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '750.80', 'h': 
'287.62', 'w': '9.58'}], [{'page': '14', 'x': '457.08', 'y': '750.80', 'h': '102.19', 'w': '9.58'}, {'page': '14', 'x': '166.39', 'y': '763.35', 'h': '393.08', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '98.05', 'h': '140.07', 'w': '9.58'}], [{'page': '15', 'x': '309.55', 'y': '98.05', 'h': '249.73', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '110.60', 'h': '393.08', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '123.15', 'h': '392.88', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '135.71', 'h': '175.46', 'w': '9.58'}]]\", 'pages': \"('14', '15')\", 'section_title': 'Discussion', 'section_number': '4.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='RA is a complex disease that is influenced by an intricate interactome of various environmental, genetic and microbial factors that influence the immune homeostasis.Owing to the complex genetic architecture accompanied by a plethora of microbial and environmental triggers that an organism is exposed to this has made the identification of diagnostic and prognostic markers challenging.Our study has explored the serum proteomics of this complex autoimmune disorder in a relatively understudied Pakistani population to identify disease biomarkers that are DE among various serotypes of RA patients and healthy controls.We identified that PZP, SELENOP, C4BP beta chain, ApoM, NAMLAA, CPN catalytic chain, OIT3, CPN subunit 2, ApoC1 and ApoCIII were DE between the RA patients and healthy controls.These serum proteins have strong potential to serve as diagnostic and prognostic biomarkers of RA and can also be evaluated to fill the gaps in the current knowledge of pathogenesis of RA.These findings can be validated in larger cohorts from different populations to identify diagnostic and prognostic biomarkers of RA.', metadata={'text': 'RA is a complex disease that is influenced by an intricate interactome of various environmental, genetic and microbial factors that influence the immune homeostasis.Owing to the complex genetic architecture accompanied by a plethora of microbial and environmental triggers that an organism is exposed to this has made the identification of diagnostic and prognostic markers challenging.Our study has explored the serum proteomics of this complex autoimmune disorder in a relatively understudied Pakistani population to identify disease biomarkers that are DE among various serotypes of RA patients and healthy controls.We identified that PZP, SELENOP, C4BP beta chain, ApoM, NAMLAA, CPN catalytic chain, OIT3, CPN subunit 2, ApoC1 and ApoCIII were DE between the RA patients and healthy controls.These serum proteins have strong 
potential to serve as diagnostic and prognostic biomarkers of RA and can also be evaluated to fill the gaps in the current knowledge of pathogenesis of RA.These findings can be validated in larger cohorts from different populations to identify diagnostic and prognostic biomarkers of RA.', 'para': '5', 'bboxes': \"[[{'page': '15', 'x': '187.65', 'y': '173.66', 'h': '371.62', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '186.22', 'h': '394.62', 'w': '9.58'}], [{'page': '15', 'x': '166.39', 'y': '198.77', 'h': '392.88', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '211.32', 'h': '392.88', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '223.88', 'h': '229.10', 'w': '9.58'}], [{'page': '15', 'x': '401.31', 'y': '223.88', 'h': '157.97', 'w': '9.58'}, {'page': '15', 'x': '166.10', 'y': '236.43', 'h': '393.18', 'w': '9.58'}, {'page': '15', 'x': '166.10', 'y': '248.98', 'h': '393.57', 'w': '9.58'}, {'page': '15', 'x': '166.10', 'y': '261.54', 'h': '130.46', 'w': '9.58'}], [{'page': '15', 'x': '299.65', 'y': '261.54', 'h': '260.87', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '274.09', 'h': '392.88', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '286.64', 'h': '201.22', 'w': '9.58'}], [{'page': '15', 'x': '370.71', 'y': '286.64', 'h': '188.57', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '299.19', 'h': '392.88', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '311.75', 'h': '238.67', 'w': '9.58'}], [{'page': '15', 'x': '407.54', 'y': '311.75', 'h': '151.74', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '324.30', 'h': '392.88', 'w': '9.58'}, {'page': '15', 'x': '166.39', 'y': '336.85', 'h': '28.14', 'w': '9.58'}]]\", 'pages': \"('15', '15')\", 'section_title': 'Conclusions', 'section_number': '5.', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/life12030464/s1;Table S1: Accession, number of unique peptides and description of identified proteins in all samples, Table S2: Pathway analysis results using Ingenuity Pathway Analysis, Table S3: The PZP concentration for the validation cohort, Figure S1: Serum samples from study subjects loaded on SDS-PAGE Gel to check the presence of proteins and get a rough idea of protein integrity: Lane 1: Ladder, Lane 2-6: Double positive RA patients for RF factor and anti-CCP, Lane 7-11: Single positive RA patients for RF factor.The integrated density ratio is shown at the bottom for each band.Integrated density ratio is calculated using ImageJ, Figure S2: Serum samples from study subjects loaded on SDS-PAGE Gel to check the presence of proteins and get a rough idea of protein integrity: Lane 1: Ladder, Lane 2-6: Single positive RA patients for anti-CCP, Lane 7-9: Double negative RA patients for RF factor and anti-CCP.The in-tegrated density ratio is shown at the bottom for each band.Integrated density ratio is calculated using ImageJ, Figure S3: Serum samples from study subjects loaded on SDS-PAGE Gel to check the presence of proteins and get a rough idea of protein integrity: Lane 1: Ladder, Lane 2-8: Healthy control sam-ples.The integrated density ratio is shown at the bottom for each band.Integrated density ratio is calculated using ImageJ.', metadata={'text': 'The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/life12030464/s1;Table S1: Accession, number of unique peptides and description of identified proteins in all samples, Table S2: Pathway analysis results using Ingenuity Pathway Analysis, Table S3: The PZP concentration for the validation cohort, Figure S1: Serum samples from study subjects loaded on SDS-PAGE Gel to check the presence of proteins and get a rough idea of protein integrity: Lane 1: 
Ladder, Lane 2-6: Double positive RA patients for RF factor and anti-CCP, Lane 7-11: Single positive RA patients for RF factor.The integrated density ratio is shown at the bottom for each band.Integrated density ratio is calculated using ImageJ, Figure S2: Serum samples from study subjects loaded on SDS-PAGE Gel to check the presence of proteins and get a rough idea of protein integrity: Lane 1: Ladder, Lane 2-6: Single positive RA patients for anti-CCP, Lane 7-9: Double negative RA patients for RF factor and anti-CCP.The in-tegrated density ratio is shown at the bottom for each band.Integrated density ratio is calculated using ImageJ, Figure S3: Serum samples from study subjects loaded on SDS-PAGE Gel to check the presence of proteins and get a rough idea of protein integrity: Lane 1: Ladder, Lane 2-8: Healthy control sam-ples.The integrated density ratio is shown at the bottom for each band.Integrated density ratio is calculated using ImageJ.', 'para': '7', 'bboxes': \"[[{'page': '15', 'x': '278.51', 'y': '361.51', 'h': '281.88', 'w': '8.63'}, {'page': '15', 'x': '165.31', 'y': '373.54', 'h': '205.36', 'w': '8.63'}], [{'page': '15', 'x': '372.93', 'y': '373.54', 'h': '186.34', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '385.57', 'h': '392.88', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '397.60', 'h': '394.00', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '409.63', 'h': '392.88', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '421.66', 'h': '392.88', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '433.69', 'h': '282.04', 'w': '8.63'}], [{'page': '15', 'x': '451.20', 'y': '433.69', 'h': '108.07', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '445.72', 'h': '148.76', 'w': '8.63'}], [{'page': '15', 'x': '317.96', 'y': '445.72', 'h': '242.43', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '457.75', 'h': '392.88', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '469.78', 'h': '394.00', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': 
'481.81', 'h': '267.09', 'w': '8.63'}], [{'page': '15', 'x': '435.72', 'y': '481.81', 'h': '123.56', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '493.84', 'h': '143.55', 'w': '8.63'}], [{'page': '15', 'x': '313.13', 'y': '493.84', 'h': '247.27', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '505.87', 'h': '392.88', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '517.90', 'h': '372.43', 'w': '8.63'}], [{'page': '15', 'x': '543.97', 'y': '517.90', 'h': '15.31', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '529.93', 'h': '245.00', 'w': '8.63'}], [{'page': '15', 'x': '414.17', 'y': '529.93', 'h': '145.11', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '541.96', 'h': '54.19', 'w': '8.63'}]]\", 'pages': \"('15', '15')\", 'section_title': 'Supplementary Materials:', 'section_number': 'None', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='The MS raw data for this study are available at the ProteomeXchange Consortium doi PXD020235, 10.6019/PXD020235.', metadata={'text': 'The MS raw data for this study are available at the ProteomeXchange Consortium doi PXD020235, 10.6019/PXD020235.', 'para': '0', 'bboxes': \"[[{'page': '16', 'x': '285.36', 'y': '128.13', 'h': '273.91', 'w': '8.63'}, {'page': '16', 'x': '166.39', 'y': '139.85', 'h': '196.80', 'w': '8.63'}]]\", 'pages': \"('16', '16')\", 'section_title': 'Data Availability Statement:', 'section_number': 'None', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='The MS raw data for this study are available at the ProteomeXchange Consortium doi PXD020235, 10.6019/PXD020235.', metadata={'text': 'The MS raw data for this study are available at the ProteomeXchange Consortium doi PXD020235, 10.6019/PXD020235.', 'para': '0', 'bboxes': \"[[{'page': '16', 'x': '285.36', 'y': '128.13', 'h': '273.91', 'w': '8.63'}, {'page': '16', 'x': '166.39', 'y': '139.85', 'h': '196.80', 'w': '8.63'}]]\", 'pages': \"('16', '16')\", 'section_title': 'Data Availability Statement:', 'section_number': 'None', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='Author Contributions: Conceptualization, S.J., P.J., M.J.P. and J.M.M.; methodology, S.J., M.J.P. and J.R.A.; software, J.R.A. and M.J.P.; validation, S.J. and P.J.; formal analysis, A.B., M.M.A. and M.J.P.; investigation, S.J.; resources, P.J., A.B. and M.J.P.; data curation, S.J. and J.M.M.; writing-original draft preparation, S.J. and M.M.A.; writing-review and editing, M.J.P.; visualization, J.R.A.; supervision, P.J., M.J.P., J.M.M. and A.B.; project administration, P.J.; funding acquisition, P.J., A.B. and M.J.P.All authors have read and agreed to the published version of the manuscript.', metadata={'text': 'Author Contributions: Conceptualization, S.J., P.J., M.J.P. and J.M.M.; methodology, S.J., M.J.P. and J.R.A.; software, J.R.A. and M.J.P.; validation, S.J. and P.J.; formal analysis, A.B., M.M.A. and M.J.P.; investigation, S.J.; resources, P.J., A.B. and M.J.P.; data curation, S.J. and J.M.M.; writing-original draft preparation, S.J. and M.M.A.; writing-review and editing, M.J.P.; visualization, J.R.A.; supervision, P.J., M.J.P., J.M.M. and A.B.; project administration, P.J.; funding acquisition, P.J., A.B. 
and M.J.P.All authors have read and agreed to the published version of the manuscript.', 'para': '1', 'bboxes': \"[[{'page': '15', 'x': '166.04', 'y': '559.66', 'h': '393.23', 'w': '8.63'}, {'page': '15', 'x': '166.24', 'y': '571.37', 'h': '394.15', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '583.09', 'h': '392.88', 'w': '8.63'}, {'page': '15', 'x': '166.13', 'y': '594.81', 'h': '394.27', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '606.52', 'h': '378.39', 'w': '8.63'}], [{'page': '15', 'x': '547.04', 'y': '606.52', 'h': '12.24', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '618.24', 'h': '291.00', 'w': '8.63'}]]\", 'pages': \"('15', '15')\", 'section_title': 'Institutional Review Board Statement:', 'section_number': 'None', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='Funding: Sidrah Jahangir, Peter John, Attya Bhatti and Muhammad Muaaz Aslam were funded by Higher Education Commission (HEC), Pakistan, (grant number 5965).Mandy Peffers was funded through a Wellcome Trust Clinical Intermediate Fellowship (grant number 107471/Z/15/Z).This work was also supported by the MRC and Versus Arthritis as part of the Medical Research Council Versus Arthritis Centre for Integrated Research into Musculoskeletal Ageing (CIMA) (MR/R502182/1).James Anderson was funded by the Horserace betting Levy Board.', metadata={'text': 'Funding: Sidrah Jahangir, Peter John, Attya Bhatti and Muhammad Muaaz Aslam were funded by Higher Education Commission (HEC), Pakistan, (grant number 5965).Mandy Peffers was funded through a Wellcome Trust Clinical Intermediate Fellowship (grant number 107471/Z/15/Z).This work was also supported by the MRC and Versus Arthritis as part of the Medical Research Council Versus Arthritis Centre for Integrated Research into Musculoskeletal Ageing (CIMA) (MR/R502182/1).James Anderson was funded by the Horserace betting Levy Board.', 'para': '3', 'bboxes': \"[[{'page': '15', 'x': '166.39', 'y': '635.93', 'h': '393.23', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '647.65', 'h': '281.19', 'w': '8.63'}], [{'page': '15', 'x': '450.36', 'y': '647.65', 'h': '108.91', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '659.36', 'h': '373.15', 'w': '8.63'}], [{'page': '15', 'x': '541.81', 'y': '659.36', 'h': '17.47', 'w': '8.63'}, {'page': '15', 'x': '166.02', 'y': '671.08', 'h': '394.75', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '682.80', 'h': '394.45', 'w': '8.63'}], [{'page': '15', 'x': '166.24', 'y': '694.51', 'h': '264.08', 'w': '8.63'}]]\", 'pages': \"('15', '15')\", 'section_title': 'Institutional Review Board Statement:', 'section_number': 'None', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 
'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board of Atta-ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences and Technology (NUST), 44,000 before the commencement of study.', metadata={'text': 'The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board of Atta-ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences and Technology (NUST), 44,000 before the commencement of study.', 'para': '0', 'bboxes': \"[[{'page': '15', 'x': '324.87', 'y': '712.21', 'h': '234.41', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '723.92', 'h': '392.88', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '735.64', 'h': '392.88', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '747.35', 'h': '99.06', 'w': '8.63'}]]\", 'pages': \"('15', '15')\", 'section_title': 'Institutional Review Board Statement:', 'section_number': 'None', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.', metadata={'text': 'Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.', 'para': '0', 'bboxes': \"[[{'page': '16', 'x': '166.39', 'y': '98.72', 'h': '392.88', 'w': '8.63'}, {'page': '16', 'x': '166.39', 'y': '110.44', 'h': '38.52', 'w': '8.63'}]]\", 'pages': \"('16', '16')\", 'section_title': 'Institutional Review Board Statement:', 'section_number': 'None', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='The authors declare no conflict of interest.', metadata={'text': 'The authors declare no conflict of interest.', 'para': '0', 'bboxes': \"[[{'page': '16', 'x': '252.09', 'y': '157.54', 'h': '165.99', 'w': '8.63'}]]\", 'pages': \"('16', '16')\", 'section_title': 'Institutional Review Board Statement:', 'section_number': 'None', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board of Atta-ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences and Technology (NUST), 44,000 before the commencement of study.', metadata={'text': 'The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board of Atta-ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences and Technology (NUST), 44,000 before the commencement of study.', 'para': '0', 'bboxes': \"[[{'page': '15', 'x': '324.87', 'y': '712.21', 'h': '234.41', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '723.92', 'h': '392.88', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '735.64', 'h': '392.88', 'w': '8.63'}, {'page': '15', 'x': '166.39', 'y': '747.35', 'h': '99.06', 'w': '8.63'}]]\", 'pages': \"('15', '15')\", 'section_title': 'Institutional Review Board Statement:', 'section_number': 'None', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.', metadata={'text': 'Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.', 'para': '0', 'bboxes': \"[[{'page': '16', 'x': '166.39', 'y': '98.72', 'h': '392.88', 'w': '8.63'}, {'page': '16', 'x': '166.39', 'y': '110.44', 'h': '38.52', 'w': '8.63'}]]\", 'pages': \"('16', '16')\", 'section_title': 'Institutional Review Board Statement:', 'section_number': 'None', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'}),\n",
" Document(page_content='The authors declare no conflict of interest.', metadata={'text': 'The authors declare no conflict of interest.', 'para': '0', 'bboxes': \"[[{'page': '16', 'x': '252.09', 'y': '157.54', 'h': '165.99', 'w': '8.63'}]]\", 'pages': \"('16', '16')\", 'section_title': 'Conflicts of Interest:', 'section_number': 'None', 'paper_title': 'LC-MS/MS-Based Serum Protein Profiling for Identification of Candidate Biomarkers in Pakistani Rheumatoid Arthritis Patients', 'file_path': '/data/tommaso/data/papers/1.pdf'})]"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Type: stuff. \n",
"The authors detected protein abundances by using a technique called quantitative proteomics, which involves the use of mass spectrometry to measure the amount of protein in a sample. The authors then compared the protein abundances in the samples to determine which proteins were most abundant and which ones were present at lower levels.\n",
"Type: map_reduce. \n",
"The authors detected protein abundances by using a technique called quantitative proteomics, which involves the use of mass spectrometry to measure the amount of protein in a sample. The authors then compared the protein abundances in the samples to determine which proteins were most abundant and which ones were present at lower levels.\n",
"Type: refine. \n",
"The authors detected protein abundances by using a technique called quantitative proteomics, which involves the use of mass spectrometry to measure the amount of protein in a sample. The authors then compared the protein abundances in the samples to determine which proteins were most abundant and which ones were present at lower levels.\n",
"Type: map_rerank. \n",
"The authors detected protein abundances by using a technique called quantitative proteomics, which involves the use of mass spectrometry to measure the amount of protein in a sample. The authors then compared the protein abundances in the samples to determine which proteins were most abundant and which ones were present at lower levels.\n"
]
}
],
"source": [
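"import os\n",
"\n",
"from langchain import HuggingFaceHub\n",
"from langchain.chains.question_answering import load_qa_chain\n",
"\n",
"# Read the Hugging Face API token from the environment rather than hard-coding a secret in the notebook.\n",
"HUGGINGFACE_TOKEN = os.environ[\"HUGGINGFACEHUB_API_TOKEN\"]\n",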
"\n",
"llm = HuggingFaceHub(\n",
" repo_id=\"tiiuae/falcon-7b-instruct\",\n",
" model_kwargs={\"temperature\": 0.1, \"max_new_tokens\": 80},\n",
" huggingfacehub_api_token=HUGGINGFACE_TOKEN\n",
")\n",
"question = \"How did the authors detect protein abundances?\"\n",
"\n",
"chain_types = [\"map_reduce\", \"refine\", \"map_rerank\"]\n",
"\n",
"chain = load_qa_chain(llm, chain_type=\"stuff\")\n",
"print(f\"\"\"Type: stuff. {chain({\"input_documents\": docs[1:3], \"question\": question}, return_only_outputs=True)[\"output_text\"]}\"\"\")\n",
"\n",
"for t in chain_types:\n",
"    # Build the chain for the type under test. This was previously hard-coded to \"stuff\", so every iteration printed the same answer.\n",
"    chain = load_qa_chain(llm, chain_type=t)\n",
"    # chain.llm_chain.prompt.template = \"\"\"question: {question}. context: {context}. answer: dummy answer.\"\"\"\n",
"    print(f\"\"\"Type: {t}. {chain({\"input_documents\": docs[1:2], \"question\": question}, return_only_outputs=True)[\"output_text\"]}\"\"\")"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"ename": "ValueError",
"evalue": "Error raised by inference API: Model yhyhy3/med-orca-instruct-33b time out",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m/home/tommaso/llm4scilit/notebooks/test.ipynb Cell 15\u001b[0m line \u001b[0;36m1\n\u001b[1;32m <a href='vscode-notebook-cell://ssh-remote%2Bstudents.datascience.ch/home/tommaso/llm4scilit/notebooks/test.ipynb#X23sdnNjb2RlLXJlbW90ZQ%3D%3D?line=12'>13</a>\u001b[0m chain_types \u001b[39m=\u001b[39m [\u001b[39m\"\u001b[39m\u001b[39mmap_reduce\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39m\"\u001b[39m\u001b[39mrefine\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39m\"\u001b[39m\u001b[39mmap_rerank\u001b[39m\u001b[39m\"\u001b[39m]\n\u001b[1;32m <a href='vscode-notebook-cell://ssh-remote%2Bstudents.datascience.ch/home/tommaso/llm4scilit/notebooks/test.ipynb#X23sdnNjb2RlLXJlbW90ZQ%3D%3D?line=14'>15</a>\u001b[0m chain \u001b[39m=\u001b[39m load_qa_chain(llm, chain_type\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mstuff\u001b[39m\u001b[39m\"\u001b[39m)\n\u001b[0;32m---> <a href='vscode-notebook-cell://ssh-remote%2Bstudents.datascience.ch/home/tommaso/llm4scilit/notebooks/test.ipynb#X23sdnNjb2RlLXJlbW90ZQ%3D%3D?line=15'>16</a>\u001b[0m \u001b[39mprint\u001b[39m(\u001b[39mf\u001b[39m\u001b[39m\"\"\"\u001b[39m\u001b[39mType: stuff. 
\u001b[39m\u001b[39m{\u001b[39;00mchain({\u001b[39m\"\u001b[39;49m\u001b[39minput_documents\u001b[39;49m\u001b[39m\"\u001b[39;49m:\u001b[39m \u001b[39;49mdocs[\u001b[39m1\u001b[39;49m:\u001b[39m3\u001b[39;49m],\u001b[39m \u001b[39;49m\u001b[39m\"\u001b[39;49m\u001b[39mquestion\u001b[39;49m\u001b[39m\"\u001b[39;49m:\u001b[39m \u001b[39;49mquestion},\u001b[39m \u001b[39;49mreturn_only_outputs\u001b[39m=\u001b[39;49m\u001b[39mTrue\u001b[39;49;00m)[\u001b[39m\"\u001b[39m\u001b[39moutput_text\u001b[39m\u001b[39m\"\u001b[39m]\u001b[39m}\u001b[39;00m\u001b[39m\"\"\"\u001b[39m)\n\u001b[1;32m <a href='vscode-notebook-cell://ssh-remote%2Bstudents.datascience.ch/home/tommaso/llm4scilit/notebooks/test.ipynb#X23sdnNjb2RlLXJlbW90ZQ%3D%3D?line=17'>18</a>\u001b[0m \u001b[39mfor\u001b[39;00m t \u001b[39min\u001b[39;00m chain_types:\n\u001b[1;32m <a href='vscode-notebook-cell://ssh-remote%2Bstudents.datascience.ch/home/tommaso/llm4scilit/notebooks/test.ipynb#X23sdnNjb2RlLXJlbW90ZQ%3D%3D?line=18'>19</a>\u001b[0m chain \u001b[39m=\u001b[39m load_qa_chain(llm, chain_type\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mstuff\u001b[39m\u001b[39m\"\u001b[39m)\n",
"File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/chains/base.py:243\u001b[0m, in \u001b[0;36mChain.__call__\u001b[0;34m(self, inputs, return_only_outputs, callbacks, tags, metadata, include_run_info)\u001b[0m\n\u001b[1;32m 241\u001b[0m \u001b[39mexcept\u001b[39;00m (\u001b[39mKeyboardInterrupt\u001b[39;00m, \u001b[39mException\u001b[39;00m) \u001b[39mas\u001b[39;00m e:\n\u001b[1;32m 242\u001b[0m run_manager\u001b[39m.\u001b[39mon_chain_error(e)\n\u001b[0;32m--> 243\u001b[0m \u001b[39mraise\u001b[39;00m e\n\u001b[1;32m 244\u001b[0m run_manager\u001b[39m.\u001b[39mon_chain_end(outputs)\n\u001b[1;32m 245\u001b[0m final_outputs: Dict[\u001b[39mstr\u001b[39m, Any] \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mprep_outputs(\n\u001b[1;32m 246\u001b[0m inputs, outputs, return_only_outputs\n\u001b[1;32m 247\u001b[0m )\n",
"File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/chains/base.py:237\u001b[0m, in \u001b[0;36mChain.__call__\u001b[0;34m(self, inputs, return_only_outputs, callbacks, tags, metadata, include_run_info)\u001b[0m\n\u001b[1;32m 231\u001b[0m run_manager \u001b[39m=\u001b[39m callback_manager\u001b[39m.\u001b[39mon_chain_start(\n\u001b[1;32m 232\u001b[0m dumpd(\u001b[39mself\u001b[39m),\n\u001b[1;32m 233\u001b[0m inputs,\n\u001b[1;32m 234\u001b[0m )\n\u001b[1;32m 235\u001b[0m \u001b[39mtry\u001b[39;00m:\n\u001b[1;32m 236\u001b[0m outputs \u001b[39m=\u001b[39m (\n\u001b[0;32m--> 237\u001b[0m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_call(inputs, run_manager\u001b[39m=\u001b[39;49mrun_manager)\n\u001b[1;32m 238\u001b[0m \u001b[39mif\u001b[39;00m new_arg_supported\n\u001b[1;32m 239\u001b[0m \u001b[39melse\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_call(inputs)\n\u001b[1;32m 240\u001b[0m )\n\u001b[1;32m 241\u001b[0m \u001b[39mexcept\u001b[39;00m (\u001b[39mKeyboardInterrupt\u001b[39;00m, \u001b[39mException\u001b[39;00m) \u001b[39mas\u001b[39;00m e:\n\u001b[1;32m 242\u001b[0m run_manager\u001b[39m.\u001b[39mon_chain_error(e)\n",
"File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/chains/combine_documents/base.py:106\u001b[0m, in \u001b[0;36mBaseCombineDocumentsChain._call\u001b[0;34m(self, inputs, run_manager)\u001b[0m\n\u001b[1;32m 104\u001b[0m \u001b[39m# Other keys are assumed to be needed for LLM prediction\u001b[39;00m\n\u001b[1;32m 105\u001b[0m other_keys \u001b[39m=\u001b[39m {k: v \u001b[39mfor\u001b[39;00m k, v \u001b[39min\u001b[39;00m inputs\u001b[39m.\u001b[39mitems() \u001b[39mif\u001b[39;00m k \u001b[39m!=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39minput_key}\n\u001b[0;32m--> 106\u001b[0m output, extra_return_dict \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mcombine_docs(\n\u001b[1;32m 107\u001b[0m docs, callbacks\u001b[39m=\u001b[39;49m_run_manager\u001b[39m.\u001b[39;49mget_child(), \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mother_keys\n\u001b[1;32m 108\u001b[0m )\n\u001b[1;32m 109\u001b[0m extra_return_dict[\u001b[39mself\u001b[39m\u001b[39m.\u001b[39moutput_key] \u001b[39m=\u001b[39m output\n\u001b[1;32m 110\u001b[0m \u001b[39mreturn\u001b[39;00m extra_return_dict\n",
"File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/chains/combine_documents/stuff.py:165\u001b[0m, in \u001b[0;36mStuffDocumentsChain.combine_docs\u001b[0;34m(self, docs, callbacks, **kwargs)\u001b[0m\n\u001b[1;32m 163\u001b[0m inputs \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_get_inputs(docs, \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mkwargs)\n\u001b[1;32m 164\u001b[0m \u001b[39m# Call predict on the LLM.\u001b[39;00m\n\u001b[0;32m--> 165\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mllm_chain\u001b[39m.\u001b[39;49mpredict(callbacks\u001b[39m=\u001b[39;49mcallbacks, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49minputs), {}\n",
"File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/chains/llm.py:252\u001b[0m, in \u001b[0;36mLLMChain.predict\u001b[0;34m(self, callbacks, **kwargs)\u001b[0m\n\u001b[1;32m 237\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39mpredict\u001b[39m(\u001b[39mself\u001b[39m, callbacks: Callbacks \u001b[39m=\u001b[39m \u001b[39mNone\u001b[39;00m, \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mkwargs: Any) \u001b[39m-\u001b[39m\u001b[39m>\u001b[39m \u001b[39mstr\u001b[39m:\n\u001b[1;32m 238\u001b[0m \u001b[39m \u001b[39m\u001b[39m\"\"\"Format prompt with kwargs and pass to LLM.\u001b[39;00m\n\u001b[1;32m 239\u001b[0m \n\u001b[1;32m 240\u001b[0m \u001b[39m Args:\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 250\u001b[0m \u001b[39m completion = llm.predict(adjective=\"funny\")\u001b[39;00m\n\u001b[1;32m 251\u001b[0m \u001b[39m \"\"\"\u001b[39;00m\n\u001b[0;32m--> 252\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39;49m(kwargs, callbacks\u001b[39m=\u001b[39;49mcallbacks)[\u001b[39mself\u001b[39m\u001b[39m.\u001b[39moutput_key]\n",
"File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/chains/base.py:243\u001b[0m, in \u001b[0;36mChain.__call__\u001b[0;34m(self, inputs, return_only_outputs, callbacks, tags, metadata, include_run_info)\u001b[0m\n\u001b[1;32m 241\u001b[0m \u001b[39mexcept\u001b[39;00m (\u001b[39mKeyboardInterrupt\u001b[39;00m, \u001b[39mException\u001b[39;00m) \u001b[39mas\u001b[39;00m e:\n\u001b[1;32m 242\u001b[0m run_manager\u001b[39m.\u001b[39mon_chain_error(e)\n\u001b[0;32m--> 243\u001b[0m \u001b[39mraise\u001b[39;00m e\n\u001b[1;32m 244\u001b[0m run_manager\u001b[39m.\u001b[39mon_chain_end(outputs)\n\u001b[1;32m 245\u001b[0m final_outputs: Dict[\u001b[39mstr\u001b[39m, Any] \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mprep_outputs(\n\u001b[1;32m 246\u001b[0m inputs, outputs, return_only_outputs\n\u001b[1;32m 247\u001b[0m )\n",
"File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/chains/base.py:237\u001b[0m, in \u001b[0;36mChain.__call__\u001b[0;34m(self, inputs, return_only_outputs, callbacks, tags, metadata, include_run_info)\u001b[0m\n\u001b[1;32m 231\u001b[0m run_manager \u001b[39m=\u001b[39m callback_manager\u001b[39m.\u001b[39mon_chain_start(\n\u001b[1;32m 232\u001b[0m dumpd(\u001b[39mself\u001b[39m),\n\u001b[1;32m 233\u001b[0m inputs,\n\u001b[1;32m 234\u001b[0m )\n\u001b[1;32m 235\u001b[0m \u001b[39mtry\u001b[39;00m:\n\u001b[1;32m 236\u001b[0m outputs \u001b[39m=\u001b[39m (\n\u001b[0;32m--> 237\u001b[0m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_call(inputs, run_manager\u001b[39m=\u001b[39;49mrun_manager)\n\u001b[1;32m 238\u001b[0m \u001b[39mif\u001b[39;00m new_arg_supported\n\u001b[1;32m 239\u001b[0m \u001b[39melse\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_call(inputs)\n\u001b[1;32m 240\u001b[0m )\n\u001b[1;32m 241\u001b[0m \u001b[39mexcept\u001b[39;00m (\u001b[39mKeyboardInterrupt\u001b[39;00m, \u001b[39mException\u001b[39;00m) \u001b[39mas\u001b[39;00m e:\n\u001b[1;32m 242\u001b[0m run_manager\u001b[39m.\u001b[39mon_chain_error(e)\n",
"File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/chains/llm.py:92\u001b[0m, in \u001b[0;36mLLMChain._call\u001b[0;34m(self, inputs, run_manager)\u001b[0m\n\u001b[1;32m 87\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39m_call\u001b[39m(\n\u001b[1;32m 88\u001b[0m \u001b[39mself\u001b[39m,\n\u001b[1;32m 89\u001b[0m inputs: Dict[\u001b[39mstr\u001b[39m, Any],\n\u001b[1;32m 90\u001b[0m run_manager: Optional[CallbackManagerForChainRun] \u001b[39m=\u001b[39m \u001b[39mNone\u001b[39;00m,\n\u001b[1;32m 91\u001b[0m ) \u001b[39m-\u001b[39m\u001b[39m>\u001b[39m Dict[\u001b[39mstr\u001b[39m, \u001b[39mstr\u001b[39m]:\n\u001b[0;32m---> 92\u001b[0m response \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mgenerate([inputs], run_manager\u001b[39m=\u001b[39;49mrun_manager)\n\u001b[1;32m 93\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mcreate_outputs(response)[\u001b[39m0\u001b[39m]\n",
"File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/chains/llm.py:102\u001b[0m, in \u001b[0;36mLLMChain.generate\u001b[0;34m(self, input_list, run_manager)\u001b[0m\n\u001b[1;32m 100\u001b[0m \u001b[39m\u001b[39m\u001b[39m\"\"\"Generate LLM result from inputs.\"\"\"\u001b[39;00m\n\u001b[1;32m 101\u001b[0m prompts, stop \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mprep_prompts(input_list, run_manager\u001b[39m=\u001b[39mrun_manager)\n\u001b[0;32m--> 102\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mllm\u001b[39m.\u001b[39;49mgenerate_prompt(\n\u001b[1;32m 103\u001b[0m prompts,\n\u001b[1;32m 104\u001b[0m stop,\n\u001b[1;32m 105\u001b[0m callbacks\u001b[39m=\u001b[39;49mrun_manager\u001b[39m.\u001b[39;49mget_child() \u001b[39mif\u001b[39;49;00m run_manager \u001b[39melse\u001b[39;49;00m \u001b[39mNone\u001b[39;49;00m,\n\u001b[1;32m 106\u001b[0m \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49m\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mllm_kwargs,\n\u001b[1;32m 107\u001b[0m )\n",
"File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/llms/base.py:188\u001b[0m, in \u001b[0;36mBaseLLM.generate_prompt\u001b[0;34m(self, prompts, stop, callbacks, **kwargs)\u001b[0m\n\u001b[1;32m 180\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39mgenerate_prompt\u001b[39m(\n\u001b[1;32m 181\u001b[0m \u001b[39mself\u001b[39m,\n\u001b[1;32m 182\u001b[0m prompts: List[PromptValue],\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 185\u001b[0m \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mkwargs: Any,\n\u001b[1;32m 186\u001b[0m ) \u001b[39m-\u001b[39m\u001b[39m>\u001b[39m LLMResult:\n\u001b[1;32m 187\u001b[0m prompt_strings \u001b[39m=\u001b[39m [p\u001b[39m.\u001b[39mto_string() \u001b[39mfor\u001b[39;00m p \u001b[39min\u001b[39;00m prompts]\n\u001b[0;32m--> 188\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mgenerate(prompt_strings, stop\u001b[39m=\u001b[39;49mstop, callbacks\u001b[39m=\u001b[39;49mcallbacks, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n",
"File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/llms/base.py:281\u001b[0m, in \u001b[0;36mBaseLLM.generate\u001b[0;34m(self, prompts, stop, callbacks, tags, metadata, **kwargs)\u001b[0m\n\u001b[1;32m 275\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[1;32m 276\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mAsked to cache, but no cache found at `langchain.cache`.\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 277\u001b[0m )\n\u001b[1;32m 278\u001b[0m run_managers \u001b[39m=\u001b[39m callback_manager\u001b[39m.\u001b[39mon_llm_start(\n\u001b[1;32m 279\u001b[0m dumpd(\u001b[39mself\u001b[39m), prompts, invocation_params\u001b[39m=\u001b[39mparams, options\u001b[39m=\u001b[39moptions\n\u001b[1;32m 280\u001b[0m )\n\u001b[0;32m--> 281\u001b[0m output \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_generate_helper(\n\u001b[1;32m 282\u001b[0m prompts, stop, run_managers, \u001b[39mbool\u001b[39;49m(new_arg_supported), \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs\n\u001b[1;32m 283\u001b[0m )\n\u001b[1;32m 284\u001b[0m \u001b[39mreturn\u001b[39;00m output\n\u001b[1;32m 285\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mlen\u001b[39m(missing_prompts) \u001b[39m>\u001b[39m \u001b[39m0\u001b[39m:\n",
"File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/llms/base.py:225\u001b[0m, in \u001b[0;36mBaseLLM._generate_helper\u001b[0;34m(self, prompts, stop, run_managers, new_arg_supported, **kwargs)\u001b[0m\n\u001b[1;32m 223\u001b[0m \u001b[39mfor\u001b[39;00m run_manager \u001b[39min\u001b[39;00m run_managers:\n\u001b[1;32m 224\u001b[0m run_manager\u001b[39m.\u001b[39mon_llm_error(e)\n\u001b[0;32m--> 225\u001b[0m \u001b[39mraise\u001b[39;00m e\n\u001b[1;32m 226\u001b[0m flattened_outputs \u001b[39m=\u001b[39m output\u001b[39m.\u001b[39mflatten()\n\u001b[1;32m 227\u001b[0m \u001b[39mfor\u001b[39;00m manager, flattened_output \u001b[39min\u001b[39;00m \u001b[39mzip\u001b[39m(run_managers, flattened_outputs):\n",
"File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/llms/base.py:212\u001b[0m, in \u001b[0;36mBaseLLM._generate_helper\u001b[0;34m(self, prompts, stop, run_managers, new_arg_supported, **kwargs)\u001b[0m\n\u001b[1;32m 202\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39m_generate_helper\u001b[39m(\n\u001b[1;32m 203\u001b[0m \u001b[39mself\u001b[39m,\n\u001b[1;32m 204\u001b[0m prompts: List[\u001b[39mstr\u001b[39m],\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 208\u001b[0m \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mkwargs: Any,\n\u001b[1;32m 209\u001b[0m ) \u001b[39m-\u001b[39m\u001b[39m>\u001b[39m LLMResult:\n\u001b[1;32m 210\u001b[0m \u001b[39mtry\u001b[39;00m:\n\u001b[1;32m 211\u001b[0m output \u001b[39m=\u001b[39m (\n\u001b[0;32m--> 212\u001b[0m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_generate(\n\u001b[1;32m 213\u001b[0m prompts,\n\u001b[1;32m 214\u001b[0m stop\u001b[39m=\u001b[39;49mstop,\n\u001b[1;32m 215\u001b[0m \u001b[39m# TODO: support multiple run managers\u001b[39;49;00m\n\u001b[1;32m 216\u001b[0m run_manager\u001b[39m=\u001b[39;49mrun_managers[\u001b[39m0\u001b[39;49m] \u001b[39mif\u001b[39;49;00m run_managers \u001b[39melse\u001b[39;49;00m \u001b[39mNone\u001b[39;49;00m,\n\u001b[1;32m 217\u001b[0m \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs,\n\u001b[1;32m 218\u001b[0m )\n\u001b[1;32m 219\u001b[0m \u001b[39mif\u001b[39;00m new_arg_supported\n\u001b[1;32m 220\u001b[0m \u001b[39melse\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_generate(prompts, stop\u001b[39m=\u001b[39mstop)\n\u001b[1;32m 221\u001b[0m )\n\u001b[1;32m 222\u001b[0m \u001b[39mexcept\u001b[39;00m (\u001b[39mKeyboardInterrupt\u001b[39;00m, \u001b[39mException\u001b[39;00m) \u001b[39mas\u001b[39;00m e:\n\u001b[1;32m 223\u001b[0m \u001b[39mfor\u001b[39;00m run_manager \u001b[39min\u001b[39;00m run_managers:\n",
"File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/llms/base.py:604\u001b[0m, in \u001b[0;36mLLM._generate\u001b[0;34m(self, prompts, stop, run_manager, **kwargs)\u001b[0m\n\u001b[1;32m 601\u001b[0m new_arg_supported \u001b[39m=\u001b[39m inspect\u001b[39m.\u001b[39msignature(\u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_call)\u001b[39m.\u001b[39mparameters\u001b[39m.\u001b[39mget(\u001b[39m\"\u001b[39m\u001b[39mrun_manager\u001b[39m\u001b[39m\"\u001b[39m)\n\u001b[1;32m 602\u001b[0m \u001b[39mfor\u001b[39;00m prompt \u001b[39min\u001b[39;00m prompts:\n\u001b[1;32m 603\u001b[0m text \u001b[39m=\u001b[39m (\n\u001b[0;32m--> 604\u001b[0m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_call(prompt, stop\u001b[39m=\u001b[39;49mstop, run_manager\u001b[39m=\u001b[39;49mrun_manager, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n\u001b[1;32m 605\u001b[0m \u001b[39mif\u001b[39;00m new_arg_supported\n\u001b[1;32m 606\u001b[0m \u001b[39melse\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_call(prompt, stop\u001b[39m=\u001b[39mstop, \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mkwargs)\n\u001b[1;32m 607\u001b[0m )\n\u001b[1;32m 608\u001b[0m generations\u001b[39m.\u001b[39mappend([Generation(text\u001b[39m=\u001b[39mtext)])\n\u001b[1;32m 609\u001b[0m \u001b[39mreturn\u001b[39;00m LLMResult(generations\u001b[39m=\u001b[39mgenerations)\n",
"File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/llms/huggingface_hub.py:113\u001b[0m, in \u001b[0;36mHuggingFaceHub._call\u001b[0;34m(self, prompt, stop, run_manager, **kwargs)\u001b[0m\n\u001b[1;32m 111\u001b[0m response \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mclient(inputs\u001b[39m=\u001b[39mprompt, params\u001b[39m=\u001b[39mparams)\n\u001b[1;32m 112\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39m\"\u001b[39m\u001b[39merror\u001b[39m\u001b[39m\"\u001b[39m \u001b[39min\u001b[39;00m response:\n\u001b[0;32m--> 113\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mError raised by inference API: \u001b[39m\u001b[39m{\u001b[39;00mresponse[\u001b[39m'\u001b[39m\u001b[39merror\u001b[39m\u001b[39m'\u001b[39m]\u001b[39m}\u001b[39;00m\u001b[39m\"\u001b[39m)\n\u001b[1;32m 114\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mclient\u001b[39m.\u001b[39mtask \u001b[39m==\u001b[39m \u001b[39m\"\u001b[39m\u001b[39mtext-generation\u001b[39m\u001b[39m\"\u001b[39m:\n\u001b[1;32m 115\u001b[0m \u001b[39m# Text generation return includes the starter text.\u001b[39;00m\n\u001b[1;32m 116\u001b[0m text \u001b[39m=\u001b[39m response[\u001b[39m0\u001b[39m][\u001b[39m\"\u001b[39m\u001b[39mgenerated_text\u001b[39m\u001b[39m\"\u001b[39m][\u001b[39mlen\u001b[39m(prompt) :]\n",
"\u001b[0;31mValueError\u001b[0m: Error raised by inference API: Model yhyhy3/med-orca-instruct-33b time out"
]
},
{
"ename": "",
"evalue": "",
"output_type": "error",
"traceback": [
"\u001b[1;31mThe Kernel crashed while executing code in the the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click <a href='https://aka.ms/vscodeJupyterKernelCrash'>here</a> for more info. View Jupyter <a href='command:jupyter.viewOutput'>log</a> for further details."
]
}
],
"source": [
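"import os\n",
"\n",
"from langchain import HuggingFaceHub\n",
"from langchain.chains.question_answering import load_qa_chain\n",
"\n",
"# Read the token from the environment instead of hardcoding a secret in the notebook.\n",
"HUGGINGFACE_TOKEN = os.environ[\"HUGGINGFACEHUB_API_TOKEN\"]\n",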
"\n",
"llm = HuggingFaceHub(\n",
" repo_id=\"yhyhy3/med-orca-instruct-33b\",\n",
" model_kwargs={\"temperature\": 0.1, \"max_new_tokens\": 80},\n",
" huggingfacehub_api_token=HUGGINGFACE_TOKEN\n",
")\n",
"question = \"How did the authors detect protein abundances?\"\n",
"\n",
"chain_types = [\"map_reduce\", \"refine\", \"map_rerank\"]\n",
"\n",
"chain = load_qa_chain(llm, chain_type=\"stuff\")\n",
"print(f\"\"\"Type: stuff. {chain({\"input_documents\": docs[1:3], \"question\": question}, return_only_outputs=True)[\"output_text\"]}\"\"\")\n",
"\n",
"for t in chain_types:\n",
"    # Build a chain of the type being compared (previously always \"stuff\").\n",
"    chain = load_qa_chain(llm, chain_type=t)\n",
"    # chain.llm_chain.prompt.template = \"\"\"question: {question}. context: {context}. answer: dummy answer.\"\"\"\n",
"    result = chain({\"input_documents\": docs[1:2], \"question\": question}, return_only_outputs=True)\n",
"    print(f\"Type: {t}. {result['output_text']}\")"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
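"import os\n",
"\n",
"from langchain import HuggingFaceHub\n",
"from langchain.chains.question_answering import load_qa_chain\n",
"\n",
"# Read the token from the environment instead of hardcoding a secret in the notebook.\n",
"HUGGINGFACE_TOKEN = os.environ[\"HUGGINGFACEHUB_API_TOKEN\"]\n",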
"\n",
"llm = HuggingFaceHub(\n",
" # repo_id=\"tiiuae/falcon-7b-instruct\",\n",
" repo_id=\"yhyhy3/open_llama_7b_v2_med_instruct\",\n",
" model_kwargs={\"temperature\": 0.1, \"max_new_tokens\": 80},\n",
" huggingfacehub_api_token=HUGGINGFACE_TOKEN\n",
")\n",
"question = \"How did the authors detect protein abundances?\"\n",
"\n",
"chain_types = [\"map_reduce\", \"refine\", \"map_rerank\"]\n",
"\n",
"chain = load_qa_chain(llm, chain_type=\"stuff\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\\n\\n{context}\\n\\nQuestion: {question}\\nHelpful Answer:\""
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain.llm_chain.prompt.template"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"ename": "ValueError",
"evalue": "`run` supported with either positional arguments or keyword arguments, but none were provided.",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m/home/tommaso/llm4scilit/notebooks/test.ipynb Cell 16\u001b[0m line \u001b[0;36m1\n\u001b[0;32m----> <a href='vscode-notebook-cell://ssh-remote%2Bstudents.datascience.ch/home/tommaso/llm4scilit/notebooks/test.ipynb#X22sdnNjb2RlLXJlbW90ZQ%3D%3D?line=0'>1</a>\u001b[0m chain\u001b[39m.\u001b[39;49mrun()\n",
"File \u001b[0;32m/data/tommaso/mambaforge/envs/llm4scilit/lib/python3.10/site-packages/langchain/chains/base.py:450\u001b[0m, in \u001b[0;36mChain.run\u001b[0;34m(self, callbacks, tags, metadata, *args, **kwargs)\u001b[0m\n\u001b[1;32m 445\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39m(kwargs, callbacks\u001b[39m=\u001b[39mcallbacks, tags\u001b[39m=\u001b[39mtags, metadata\u001b[39m=\u001b[39mmetadata)[\n\u001b[1;32m 446\u001b[0m _output_key\n\u001b[1;32m 447\u001b[0m ]\n\u001b[1;32m 449\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mnot\u001b[39;00m kwargs \u001b[39mand\u001b[39;00m \u001b[39mnot\u001b[39;00m args:\n\u001b[0;32m--> 450\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[1;32m 451\u001b[0m \u001b[39m\"\u001b[39m\u001b[39m`run` supported with either positional arguments or keyword arguments,\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 452\u001b[0m \u001b[39m\"\u001b[39m\u001b[39m but none were provided.\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 453\u001b[0m )\n\u001b[1;32m 454\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[1;32m 455\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[1;32m 456\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m`run` supported with either positional arguments or keyword arguments\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 457\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m but not both. Got args: \u001b[39m\u001b[39m{\u001b[39;00margs\u001b[39m}\u001b[39;00m\u001b[39m and kwargs: \u001b[39m\u001b[39m{\u001b[39;00mkwargs\u001b[39m}\u001b[39;00m\u001b[39m.\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 458\u001b[0m )\n",
"\u001b[0;31mValueError\u001b[0m: `run` supported with either positional arguments or keyword arguments, but none were provided."
]
}
],
"source": [
"chain.run()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'{context}\\n{question} '"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain import PromptTemplate\n",
"\n",
"template = \"\"\"{context}\\n{question} \"\"\"\n",
"\n",
"prompt_template = PromptTemplate(\n",
" template=template,\n",
" input_variables=[\"context\", \"question\"],\n",
")\n",
"\n",
"load_qa_chain(llm, chain_type=\"stuff\", prompt=prompt_template).llm_chain.prompt.template"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}