Spaces: terrierteam/doc2query

Sean MacAvaney committed on Oct 30, 2022

Commit 68b08cf • 1 Parent(s): 3ed7c41

update

Files changed (4)

README.md +1 -68
app.py +12 -52
requirements.txt +1 -0
wrapup.md +42 -0

README.md CHANGED

@@ -10,72 +10,7 @@ pinned: false
 models:
 - macavaney/doc2query-t5-base-msmarco
 ---
-<style>
-.transformer {
-  display: inline-block;
-  background: #8facdb;
-  position: relative;
-  height: 60px;
-  line-height: 60px;
-  padding: 0 24px;
-  margin: 0 18px;
-  color: #333;
-  cursor: help;
-}
-.transformer::before {
-  content: "";
-  position: absolute;
-  bottom: 0;
-  top: 0;
-  left: -15px;
-  border-top: 30px solid #8facdb;
-  border-bottom: 30px solid #8facdb;
-  border-left: 15px solid transparent;
-}
-.transformer::after {
-  content: "";
-  position: absolute;
-  bottom: 0;
-  top: 0;
-  right: -15px;
-  border-top: 30px solid transparent;
-  border-bottom: 30px solid transparent;
-  border-left: 15px solid #8facdb;
-}
-.transformer.boring {
-  background: #ddd;
-}
-.transformer.boring::before {
-  border-top-color: #ddd;
-  border-bottom-color: #ddd;
-}
-.transformer.boring::after {
-  border-left-color: #ddd;
-}
-.df {
-  width: 24px;
-  line-height: 24px;
-  text-align: center;
-  border: 3px double #888;
-  background-color: #eee;
-  color: #333;
-  border-radius: 4px;
-  display: inline-block;
-  box-sizing: content-box;
-  cursor: help;
-  margin: 0 -25px;
-  opacity: 0.5;
-  z-index: 1;
-  position: relative;
-}
-.df:hover {
-  opacity: 1;
-}
-.pipeline {
-  text-align: center;
-}
-</style>
 This is a demonstration of [PyTerrier's Doc2Query package](https://github.com/terrierteam/pyterrier_doc2query). Doc2Query generates
 queries for a document, which can then be appended to a document's text before indexing to boost important terms and add missing terms.
@@ -87,5 +22,3 @@ Doc2Query functions as a `D→D` (document-to-document) transformer and can be u
   <div class="transformer" title="Doc2Query Transformer">Doc2Query</div>
   <div class="df" title="Document Frame">D</div>
 </div>
-Try it below!

 models:
 - macavaney/doc2query-t5-base-msmarco
 ---
+# 🐕 PyTerrier: Doc2Query
 This is a demonstration of [PyTerrier's Doc2Query package](https://github.com/terrierteam/pyterrier_doc2query). Doc2Query generates
 queries for a document, which can then be appended to a document's text before indexing to boost important terms and add missing terms.
   <div class="transformer" title="Doc2Query Transformer">Doc2Query</div>
   <div class="df" title="Document Frame">D</div>
 </div>
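
The README text in this diff describes Doc2Query as a `D→D` step: it generates queries for a document and appends them to the document's text before indexing. A minimal plain-Python sketch of that idea follows. The names `expand_docs` and `stub_generator` are hypothetical, the stub stands in for the T5 model, and the newline join and separate `querygen` field are assumptions of this sketch, not a description of the package's internals.

```python
from typing import Callable, Dict, List

Doc = Dict[str, str]


def expand_docs(
    docs: List[Doc],
    generate_queries: Callable[[str], List[str]],
    append: bool = True,
) -> List[Doc]:
    """Return copies of docs with generated queries attached (toy D -> D step)."""
    expanded = []
    for doc in docs:
        querygen = "\n".join(generate_queries(doc["text"]))
        new_doc = dict(doc)  # copy so the caller's input is left untouched
        if append:
            # Assumption: appended queries are joined to the text with a newline.
            new_doc["text"] = doc["text"] + "\n" + querygen
        else:
            # Assumption: otherwise the queries are kept in a separate field.
            new_doc["querygen"] = querygen
        expanded.append(new_doc)
    return expanded


def stub_generator(text: str) -> List[str]:
    """Hypothetical stand-in for the model: always returns the same queries."""
    return ["where is the courthouse located", "directions to the commissioner's office"]


docs = [{"docno": "985", "text": "The courthouse is at the corner of Patapsco Avenue and 7th Street."}]
appended = expand_docs(docs, stub_generator, append=True)
separate = expand_docs(docs, stub_generator, append=False)
print(appended[0]["text"])
```

With `append=True` the expanded text now contains terms such as "located" and "directions" that the original passage lacked, which is the effect the README refers to as adding missing terms.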

app.py CHANGED

@@ -1,7 +1,7 @@
-import base64
 import pandas as pd
 import gradio as gr
 from pyterrier_doc2query import Doc2Query
 MODEL = 'macavaney/doc2query-t5-base-msmarco'
@@ -13,41 +13,18 @@ COLAB_INSTALL = '''
 !pip install -q git+https://github.com/terrierteam/pyterrier_doc2query
 '''.strip()
-def df2code(df):
-  rows = []
-  for row in df.itertuples(index=False):
-    rows.append(f'  {dict(row._asdict())},')
-  rows = '\n'.join(rows)
-  return f'''pd.DataFrame([
-{rows}
-])'''
-def code2colab(code):
-  enc_code = base64.b64encode((COLAB_INSTALL + '\n\n' + code).encode()).decode()
-  url = f'https://colaburl.macavaney.us/?py64={enc_code}&name={COLAB_NAME}'
-  return f'<a href="{url}" rel="nofollow" target="_blank" style="float: right;"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" style="margin: 0;" /></a>'
-def code2md(code):
-  return f'''
-{code2colab(code)}
-**Code:**
-```python
-{code}
-```
-'''
 def predict(input, model, append, num_samples):
   assert model == MODEL
   doc2query.append = append
   doc2query.num_samples = num_samples
   code = f'''import pandas as pd
 from pyterrier_doc2query import Doc2Query
 doc2query = Doc2Query({repr(model)}, append={append}, num_samples={num_samples})
 doc2query({df2code(input)})
 '''
-  return (doc2query(input), code2md(code))
 example_inp = pd.DataFrame([
   {'docno': '0', 'text': 'The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.'},
@@ -55,19 +32,13 @@ example_inp = pd.DataFrame([
   {'docno': '985', 'text': 'Continue on Hollins Ferry Road to Patapsco Avenue. Make a right onto Patapsco Avenue for approximately 2.5 miles. The courthouse is at the corner of Patapsco Avenue and 7th Street. The commissioner\'s office is on the first (ground) floor.'}
 ])
-example_out = predict(example_inp, MODEL, doc2query.append, doc2query.num_samples)
-gr.Interface(
     predict,
-    inputs=[gr.Dataframe(
-      headers=["docno", "text"],
-      datatype=["str", "str"],
-      col_count=(2, "fixed"),
-      row_count=1,
-      wrap=True,
-      label='Pipeline Input',
-      value=example_inp,
-    ), gr.Dropdown(
       choices=[MODEL],
       value=MODEL,
       label='Model',
@@ -82,17 +53,6 @@ gr.Interface(
       step=1.,
       label='# Queries'
     )],
-    outputs=[gr.Dataframe(
-      headers=["docno", "text", "querygen"],
-      datatype=["str", "str", "str"],
-      col_count=3,
-      row_count=1,
-      wrap=True,
-      label='Pipeline Output',
-      value=example_out[0],
-    ), gr.Markdown(value=example_out[1])],
-    title="🐕 PyTerrier: Doc2Query",
-    description=open('README.md', 'rt').read().split('\n---\n')[-1],
-    allow_flagging='never',
-    css="table.font-mono td, table.font-mono th { white-space: pre-line; font-size: 11px; line-height: 16px; } table.font-mono td input { width: 95%; } th .cursor-pointer {display: none;} th .min-h-\[2\.3rem\] {min-height: inherit;}",
 ).launch(share=False)
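
The helpers removed from app.py above (`df2code`, `code2colab`, `code2md`) turn the input frame into source code and pack that code into a Colab link. The sketch below reproduces the first two steps using only the standard library. It is an illustration under assumptions: it takes a list of dicts instead of a DataFrame, `rows2code`, `colab_link`, and the notebook name are hypothetical, `COLAB_INSTALL` is shortened to the one install line visible in the diff, and unlike the removed f-string it URL-encodes the query parameters. It makes no claim about how `pyterrier_gradio` implements the same helpers.

```python
import base64
from typing import Dict, List
from urllib.parse import parse_qs, urlencode, urlparse

COLAB_INSTALL = "!pip install -q git+https://github.com/terrierteam/pyterrier_doc2query"
COLAB_NAME = "example.ipynb"  # hypothetical; the real value is not shown in this diff


def rows2code(rows: List[Dict[str, str]]) -> str:
    """Render rows as the source of a pd.DataFrame([...]) call."""
    body = "\n".join(f"  {row!r}," for row in rows)
    return f"pd.DataFrame([\n{body}\n])"


def colab_link(code: str, install: str = COLAB_INSTALL, name: str = COLAB_NAME) -> str:
    """Base64-encode install commands plus code and place them in the link's query string."""
    payload = base64.b64encode((install + "\n\n" + code).encode()).decode()
    # urlencode escapes '+', '/' and '=' from the base64 alphabet.
    return "https://colaburl.macavaney.us/?" + urlencode({"py64": payload, "name": name})


snippet = rows2code([{"docno": "0", "text": "hello"}])
url = colab_link(snippet)

# Round trip: recover the original code from the link.
decoded = base64.b64decode(parse_qs(urlparse(url).query)["py64"][0]).decode()
print(decoded)
```

Escaping the payload matters because standard base64 output can contain `+`, which a query-string parser would otherwise read as a space.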

 import pandas as pd
 import gradio as gr
 from pyterrier_doc2query import Doc2Query
+from pyterrier_gradio import Demo, MarkdownFile, interface, df2code, code2md
 MODEL = 'macavaney/doc2query-t5-base-msmarco'
 !pip install -q git+https://github.com/terrierteam/pyterrier_doc2query
 '''.strip()
 def predict(input, model, append, num_samples):
   assert model == MODEL
   doc2query.append = append
   doc2query.num_samples = num_samples
   code = f'''import pandas as pd
 from pyterrier_doc2query import Doc2Query
 doc2query = Doc2Query({repr(model)}, append={append}, num_samples={num_samples})
 doc2query({df2code(input)})
 '''
+  return (doc2query(input), code2md(code, COLAB_INSTALL, COLAB_NAME))
 example_inp = pd.DataFrame([
   {'docno': '0', 'text': 'The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.'},
   {'docno': '985', 'text': 'Continue on Hollins Ferry Road to Patapsco Avenue. Make a right onto Patapsco Avenue for approximately 2.5 miles. The courthouse is at the corner of Patapsco Avenue and 7th Street. The commissioner\'s office is on the first (ground) floor.'}
 ])
+interface(
+  MarkdownFile('README.md'),
+  Demo(
     predict,
+    example_inp,
+    [
+    gr.Dropdown(
       choices=[MODEL],
       value=MODEL,
       label='Model',
       step=1.,
       label='# Queries'
     )],
+  ),
+  MarkdownFile('wrapup.md'),
 ).launch(share=False)

requirements.txt CHANGED

@@ -1,3 +1,4 @@
 git+https://github.com/terrier-org/pyterrier
 git+https://github.com/terrierteam/pyterrier_doc2query@master
 ir_datasets

+git+https://github.com/seanmacavaney/pyterrier_gradio@v0.0.2
 git+https://github.com/terrier-org/pyterrier
 git+https://github.com/terrierteam/pyterrier_doc2query@master
 ir_datasets

wrapup.md ADDED

@@ -0,0 +1,42 @@

+### Putting it all together
+You can use Doc2Query in an indexing pipeline to build an index of the expanded documents:
+<div class="pipeline">
+  <div class="df" title="Document Frame">D</div>
+  <div class="transformer" title="Doc2Query Transformer">Doc2Query</div>
+  <div class="df" title="Document Frame">D</div>
+  <div class="transformer boring" title="Indexer">Indexer</div>
+  <div class="artefact" title="Doc2Query Index">IDX</div>
+</div>
+```python
+import pyterrier as pt
+pt.init()
+import pyterrier_doc2query
+doc2query = pyterrier_doc2query.Doc2Query(append=True)
+dataset = pt.get_dataset('irds:msmarco-passage')
+indexer = pt.IterDictIndexer('./msmarco_psg')
+indexer_pipe = doc2query >> indexer
+indexer_pipe.index(dataset.get_corpus_iter())
+```
+Once you have built an index, you can retrieve from it using any retrieval function (often BM25):
+<div class="pipeline">
+  <div class="df" title="Query Frame">Q</div>
+  <div class="transformer boring" title="BM25 Transformer">BM25 Retriever <div class="artefact" title="Doc2Query Index">IDX</div></div>
+  <div class="df" title="Result Frame">R</div>
+</div>
+```python
+bm25 = pt.BatchRetrieve('./msmarco_psg', wmodel="BM25")
+```
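
To see why retrieving from an index of expanded documents can help, the following toy scorer ranks a two-document corpus before and after expansion. It is a sketch under assumptions: it implements a common textbook form of BM25 in plain Python (not Terrier's implementation), `k1=1.2` and `b=0.75` are conventional defaults chosen here, and the appended query is invented for illustration rather than produced by the model.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query: str, docs: Dict[str, str], k1: float = 1.2, b: float = 0.75) -> Dict[str, float]:
    """Score every document against the query with a textbook BM25 formula."""
    tokens = {docno: tokenize(text) for docno, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(toks) for toks in tokens.values()) / n_docs
    doc_freq: Counter = Counter()
    for toks in tokens.values():
        doc_freq.update(set(toks))  # count each term once per document

    scores = {}
    for docno, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue  # a term absent from the document contributes nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores[docno] = score
    return scores


corpus = {
    "0": "Communication among scientists mattered to the Manhattan Project.",
    "985": "The courthouse is at the corner of Patapsco Avenue and 7th Street.",
}
# Hypothetical expansion: one invented query appended to document 985.
expanded = dict(corpus)
expanded["985"] += "\nwhere is the courthouse located"

before = bm25_scores("located", corpus)
after = bm25_scores("located", expanded)
print(before["985"], after["985"])
```

The query term "located" never appears in the original passage, so document 985 scores zero before expansion and a positive value afterwards.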
+### References & Credits
+ - Rodrigo Nogueira and Jimmy Lin. [From doc2query to docTTTTTquery](https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf).
+ - Craig Macdonald, Nicola Tonellotto, Sean MacAvaney, Iadh Ounis. [PyTerrier: Declarative Experimentation in Python from BM25 to Dense Retrieval](https://dl.acm.org/doi/abs/10.1145/3459637.3482013). CIKM 2021.