Spaces: myscale/ChatData

Fangrui Liu committed on Jun 28, 2023

Commit a796108 • 1 Parent(s): 980721a

init

Files changed (6)

.gitignore +167 -0
README.md +108 -13
app.py +163 -0
callbacks/arxiv_callbacks.py +50 -0
prompts/arxiv_prompt.py +12 -0
requirements.txt +10 -0

.gitignore ADDED

	@@ -0,0 +1,167 @@

+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+# C extensions
+*.so
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+# Translations
+*.mo
+*.pot
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+# Flask stuff:
+instance/
+.webassets-cache
+# Scrapy stuff:
+.scrapy
+# Sphinx documentation
+docs/_build/
+# PyBuilder
+.pybuilder/
+target/
+# Jupyter Notebook
+.ipynb_checkpoints
+# IPython
+profile_default/
+ipython_config.py
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+# .python-version
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+# poetry
+#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+#   This is especially recommended for binary packages to ensure reproducibility, and is more
+#   commonly ignored for libraries.
+#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+# pdm
+#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#pdm.lock
+#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+#   in version control.
+#   https://pdm.fming.dev/#use-with-ide
+.pdm.toml
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+# SageMath parsed files
+*.sage.py
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+# Spyder project settings
+.spyderproject
+.spyproject
+# Rope project settings
+.ropeproject
+# mkdocs documentation
+/site
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+# Pyre type checker
+.pyre/
+# pytype static type analyzer
+.pytype/
+# Cython debug symbols
+cython_debug/
+# PyCharm
+#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+#  and can be added to the global gitignore or merged into this file.  For a more nuclear
+#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
+# dataset files
+data/
+.streamlit/
+*.ipynb
+.DS_Store

README.md CHANGED

@@ -1,13 +1,108 @@
----
-title: ChatData
-emoji: 📈
-colorFrom: pink
-colorTo: purple
-sdk: streamlit
-sdk_version: 1.21.0
-app_file: app.py
-pinned: false
-license: mit
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# ChatData 🔍 📖
+***We are constantly improving LangChain's self-query retriever. Some of the features are not merged.***
+[![](https://dcbadge.vercel.app/api/server/D2qpkqc4Jq?compact=true&style=flat)](https://discord.gg/D2qpkqc4Jq)
+[![Twitter](https://img.shields.io/twitter/url/https/twitter.com/myscaledb.svg?style=social&label=Follow%20%40MyScaleDB)](https://twitter.com/myscaledb)
+![ChatData](assets/logo.png)
+Yet another chat-with-documents app, but one that supports queries over millions of files with [MyScale](https://myscale.com) and [LangChain](https://github.com/hwchase17/langchain/).
+## News 🔥
+- 🔧 Our contribution to LangChain that helps self-query retrievers [**filter with more types and functions**](https://python.langchain.com/docs/modules/data_connection/retrievers/how_to/self_query/myscale_self_query)
+- 🌟 **We just opened a FREE pod hosting data for ArXiv papers.** Anyone can try their own SQL with vector search!!! Feel the power when SQL meets vector search! See how to access the pod [here](#data-service).
+- 📚 We collected **1.67 million papers on arxiv**! We are collecting more and we need your advice!
+- More coming...
+## Quickstart
+1. Create a virtual environment
+```bash
+python3 -m venv .venv
+source .venv/bin/activate
+```
+2. Install dependencies
+> This app is currently using [MyScale's fork of LangChain](https://github.com/myscale/langchain/tree/master). It contains [improved prompts](https://github.com/hwchase17/langchain/pull/6737#discussion_r1243527112) for comparators `LIKE` and `CONTAIN` in [MyScale self-query retriever](https://github.com/hwchase17/langchain/pull/6143).
+```bash
+python3 -m pip install -r requirements.txt
+```
+3. Run the app!
+```bash
+# copy the example secrets file, then fill in your OpenAI key in .streamlit/secrets.toml
+cp .streamlit/secrets.example.toml .streamlit/secrets.toml
+# start the app
+python3 -m streamlit run app.py
+```
+## Quick Navigator 🧭
+- [How can I run this app?](README.md#how-to-run)
+- [How is this app built?](docs/self-query.md)
+- [What is the overview pipeline?](docs/self-query.md#query-pipeline-design)
+- [How did LangChain and MyScale convert natural language to structured filters?](docs/self-query.md#selfqueryretriever-defines-interaction-between-vectorstore-and-your-app)
+- [How to make chain execution more responsive in LangChain?](docs/self-query.md#not-responsive-add-callbacks)
+- Where can I get the arXiv data?
+  - [From parquet files on S3](docs/self-query.md#insert-data)
+  - <a name="data-service"></a>Or directly use MyScale database as service... for **FREE** ✨
+    ```python
+    import clickhouse_connect
+    client = clickhouse_connect.get_client(
+        host='msc-1decbcc9.us-east-1.aws.staging.myscale.cloud',
+        port=443,
+        username='chatdata',
+        password='myscale_rocks'
+    )
+    ```
+    Or put these settings in `.streamlit/secrets.toml`
+    ```toml
+    MYSCALE_HOST = "msc-1decbcc9.us-east-1.aws.staging.myscale.cloud"
+    MYSCALE_PORT = 443
+    MYSCALE_USER = "chatdata"
+    MYSCALE_PASSWORD = "myscale_rocks"
+    ```
+## Introduction
+ChatData brings millions of papers into your knowledge base. We imported 1.67 million papers with metadata (continuously updated). The metadata contains:
+1. `metadata.authors`: paper's authors as a *list of strings*
+2. `metadata.abstract`: paper's abstract, used as the ranking criterion (with InstructorXL)
+3. `metadata.titles`: paper's title
+4. `metadata.categories`: paper's categories as a *list of strings* like ["cs.CV"]
+5. `metadata.pubdate`: paper's date of publication as an *ISO 8601 formatted string*
+6. `metadata.primary_category`: paper's primary category as a *string* defined by ArXiv
+7. `metadata.comment`: additional comments on the paper
+For the overall table schema, please refer to the [table creation section in docs/self-query.md](docs/self-query.md#table-creation).
+## How to run 🏃
+```bash
+python3 -m pip install -r requirements.txt
+python3 -m streamlit run app.py
+```
+## How to build? 🧱
+See [docs/self-query.md](docs/self-query.md)
+## Special Thanks 👏 (Ordered Alphabetically)
+- [ArXiv API](https://info.arxiv.org/help/api/index.html) for its open access interoperability to pre-printed papers.
+- [InstructorXL](https://huggingface.co/hkunlp/instructor-xl) for its promptable embeddings that improve retrieval performance.
+- [LangChain🦜️🔗](https://github.com/hwchase17/langchain/) for its easy-to-use and composable API designs and prompts.
+- [The Alexandria Index](https://alex.macrocosm.so/download) for providing arXiv data index to the public.
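
Note on the README: it invites readers to try their own SQL with vector search on the public pod. A minimal sketch of what such a statement could look like follows. It only builds the SQL string. The table name `ChatArXiv` and the `id`, `abstract`, and `vector` columns are taken from `app.py` in this commit; `build_vector_query` is a hypothetical helper, and the `distance(...)` ordering is an assumption based on MyScale's vector search syntax that has not been checked against the live pod.

```python
def build_vector_query(query_vector, top_k=5, table="ChatArXiv"):
    """Build a MyScale-style vector search statement (hypothetical helper)."""
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    # Render the embedding as a SQL array literal, e.g. [0.1,0.2,0.3]
    vector_literal = "[" + ",".join(str(float(x)) for x in query_vector) + "]"
    # Assumed syntax: order rows by distance to the query vector, keep the closest ones
    return (
        f"SELECT id, abstract, distance(vector, {vector_literal}) AS dist "
        f"FROM {table} ORDER BY dist LIMIT {int(top_k)}"
    )


# Toy three-dimensional vector, for illustration only
sql = build_vector_query([0.1, 0.2, 0.3], top_k=3)
```

The resulting string could then be passed to `client.query(sql)` on the `clickhouse_connect` client shown in the README. A real query vector would need the same dimensionality as the stored embeddings.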

app.py ADDED

	@@ -0,0 +1,163 @@

+import re
+import pandas as pd
+from os import environ
+import streamlit as st
+from langchain.vectorstores import MyScale, MyScaleSettings
+from langchain.embeddings import HuggingFaceInstructEmbeddings
+from langchain.retrievers.self_query.base import SelfQueryRetriever
+from langchain.chains.query_constructor.base import AttributeInfo
+from langchain.chains import RetrievalQAWithSourcesChain
+from langchain import OpenAI
+from langchain.chat_models import ChatOpenAI
+from prompts.arxiv_prompt import combine_prompt_template
+from callbacks.arxiv_callbacks import ChatDataSearchCallBackHandler, ChatDataAskCallBackHandler
+from langchain.prompts.prompt import PromptTemplate
+environ['TOKENIZERS_PARALLELISM'] = 'true'
+st.set_page_config(page_title="ChatData")
+st.header("ChatData")
+columns = ['title', 'id', 'categories', 'abstract', 'authors', 'pubdate']
+def display(dataframe, columns):
+    if len(dataframe) > 0:
+        st.dataframe(dataframe[columns])
+    else:
+        st.write("Sorry 😵 we didn't find any articles related to your query.\nPlease use verbs that may match the datatype.", unsafe_allow_html=True)
+@st.experimental_singleton(show_spinner=False)
+def build_retriever():
+    with st.spinner("Loading Model..."):
+        embeddings = HuggingFaceInstructEmbeddings(
+            model_name='hkunlp/instructor-xl',
+            embed_instruction="Represent the question for retrieving supporting scientific papers: ")
+    with st.spinner("Connecting DB..."):
+        myscale_connection = {
+            "host": st.secrets['MYSCALE_HOST'],
+            "port": st.secrets['MYSCALE_PORT'],
+            "username": st.secrets['MYSCALE_USER'],
+            "password": st.secrets['MYSCALE_PASSWORD'],
+        }
+        config = MyScaleSettings(**myscale_connection, table='ChatArXiv',
+                                 column_map={
+                                     "id": "id",
+                                     "text": "abstract",
+                                     "vector": "vector",
+                                     "metadata": "metadata"
+                                 })
+        doc_search = MyScale(embeddings, config)
+    with st.spinner("Building Self Query Retriever..."):
+        metadata_field_info = [
+            AttributeInfo(
+                name="pubdate",
+                description="The year the paper is published",
+                type="timestamp",
+            ),
+            AttributeInfo(
+                name="authors",
+                description="List of author names",
+                type="list[string]",
+            ),
+            AttributeInfo(
+                name="title",
+                description="Title of the paper",
+                type="string",
+            ),
+            AttributeInfo(
+                name="categories",
+                description="arxiv categories to this paper",
+                type="list[string]"
+            ),
+            AttributeInfo(
+                name="length(categories)",
+                description="length of arxiv categories to this paper",
+                type="int"
+            ),
+        ]
+        retriever = SelfQueryRetriever.from_llm(
+            OpenAI(openai_api_key=st.secrets['OPENAI_API_KEY'], temperature=0),
+            doc_search, "Scientific papers indexes with abstracts. All in English.", metadata_field_info,
+            use_original_query=False)
+        with st.spinner('Building RetrievalQAWithSourcesChain...'):
+            document_with_metadata_prompt = PromptTemplate(
+                input_variables=["page_content", "id", "title", "authors"],
+                template="Content:\n\tTitle: {title}\n\tAbstract: {page_content}\n\tAuthors: {authors}\nSOURCE: {id}")
+            COMBINE_PROMPT = PromptTemplate(
+                template=combine_prompt_template, input_variables=["summaries", "question"])
+            chain = RetrievalQAWithSourcesChain.from_llm(
+                llm=ChatOpenAI(
+                    openai_api_key=st.secrets['OPENAI_API_KEY'], temperature=0.6),
+                document_prompt=document_with_metadata_prompt,
+                combine_prompt=COMBINE_PROMPT,
+                retriever=retriever,
+                return_source_documents=True,)
+    return [{'name': m.name, 'desc': m.description, 'type': m.type} for m in metadata_field_info], retriever, chain
+if 'retriever' not in st.session_state:
+    st.session_state['metadata_columns'], \
+        st.session_state['retriever'], \
+        st.session_state['chain'] = \
+        build_retriever()
+st.info("We provide the metadata columns below for queries. Please use natural language to describe filters on those columns.\n\n" +
+        "For example: \n\n- What is a Bayesian network? Please use articles published later than Feb 2018 and with more than 2 categories and whose title like `computer` and must have `cs.CV` in its category.\n" +
+        "- What is neural network? Please use articles published by Geoffrey Hinton after 2018.\n" +
+        "- Introduce some applications of GANs published around 2019.")
+st.info("You can retrieve papers with button `Query` or ask questions based on retrieved papers with button `Ask`.", icon='💡')
+st.dataframe(st.session_state.metadata_columns)
+st.text_input("Ask a question:", key='query')
+cols = st.columns([1, 1, 7])
+cols[0].button("Query", key='search')
+cols[1].button("Ask", key='ask')
+plc_hldr = st.empty()
+if st.session_state.search:
+    plc_hldr = st.empty()
+    with plc_hldr.expander('Query Log', expanded=True):
+        callback = ChatDataSearchCallBackHandler()
+        try:
+            docs = st.session_state.retriever.get_relevant_documents(
+                st.session_state.query, callbacks=[callback])
+            callback.progress_bar.progress(value=1.0, text="Done!")
+            docs = pd.DataFrame(
+                [{**d.metadata, 'abstract': d.page_content} for d in docs])
+            display(docs, columns)
+        except Exception as e:
+            st.write('Oops 😵 Something bad happened...')
+            # raise e
+if st.session_state.ask:
+    plc_hldr = st.empty()
+    ctx = st.container()
+    with plc_hldr.expander('Chat Log', expanded=True):
+        callback = ChatDataAskCallBackHandler()
+        try:
+            ret = st.session_state.chain(
+                st.session_state.query, callbacks=[callback])
+            callback.progress_bar.progress(value=1.0, text="Done!")
+            st.markdown(
+                f"### Answer from LLM\n{ret['answer']}\n### References")
+            docs = ret['source_documents']
+            ref = re.findall(
+                r'(http://arxiv\.org/abs/\d{4}\.\d+v\d+)', ret['sources'])
+            docs = pd.DataFrame([{**d.metadata, 'abstract': d.page_content}
+                                for d in docs if d.metadata['id'] in ref])
+            display(docs, columns)
+        except Exception as e:
+            st.write('Oops 😵 Something bad happened...')
+            # raise e
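
Note on `app.py`: the `Ask` branch keeps only those source documents whose `id` is cited in the chain's `sources` string. The filtering step can be sketched in isolation as below, without Streamlit or LangChain. `cited_documents` is a hypothetical helper, documents are modeled as plain dicts rather than LangChain `Document` objects, and the arXiv identifiers are made-up placeholders.

```python
import re

# Raw string so that \d is a regex escape; literal dots are escaped.
ARXIV_URL = re.compile(r"http://arxiv\.org/abs/\d{4}\.\d+v\d+")


def cited_documents(sources, documents):
    """Keep only documents whose id appears as an arXiv URL in `sources`."""
    cited = set(ARXIV_URL.findall(sources))
    return [doc for doc in documents if doc["id"] in cited]


# Placeholder documents, for illustration only
docs = [
    {"id": "http://arxiv.org/abs/2301.00001v1", "title": "A"},
    {"id": "http://arxiv.org/abs/2302.12345v2", "title": "B"},
]
kept = cited_documents("SOURCES: http://arxiv.org/abs/2302.12345v2", docs)
```

Using a set for the cited URLs keeps the membership test cheap when many documents are returned. The pattern assumes new-style arXiv identifiers served over `http://`, as in the original expression.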

callbacks/arxiv_callbacks.py ADDED

	@@ -0,0 +1,50 @@

+import streamlit as st
+from langchain.callbacks.streamlit.streamlit_callback_handler import StreamlitCallbackHandler
+class ChatDataSearchCallBackHandler(StreamlitCallbackHandler):
+    def __init__(self) -> None:
+        self.progress_bar = st.progress(value=0.0, text="Working...")
+        self.tokens_stream = ""
+    def on_llm_start(self, serialized, prompts, **kwargs) -> None:
+        pass
+    def on_text(self, text: str, **kwargs) -> None:
+        self.progress_bar.progress(value=0.2, text="Asking LLM...")
+    def on_chain_end(self, outputs, **kwargs) -> None:
+        self.progress_bar.progress(value=0.6, text='Searching in DB...')
+        st.markdown('### Generated Filter')
+        st.write(outputs['text'], unsafe_allow_html=True)
+    def on_chain_start(self, serialized, inputs, **kwargs) -> None:
+        pass
+class ChatDataAskCallBackHandler(StreamlitCallbackHandler):
+    def __init__(self) -> None:
+        self.progress_bar = st.progress(value=0.0, text='Searching DB...')
+        self.status_bar = st.empty()
+        self.prog_value = 0.0
+        self.prog_map = {
+            'langchain.chains.qa_with_sources.retrieval.RetrievalQAWithSourcesChain': 0.2,
+            'langchain.chains.combine_documents.map_reduce.MapReduceDocumentsChain': 0.4,
+            'langchain.chains.combine_documents.stuff.StuffDocumentsChain': 0.8
+        }
+    def on_llm_start(self, serialized, prompts, **kwargs) -> None:
+        pass
+    def on_text(self, text: str, **kwargs) -> None:
+        pass
+    def on_chain_start(self, serialized, inputs, **kwargs) -> None:
+        cid = '.'.join(serialized['id'])
+        if cid != 'langchain.chains.llm.LLMChain':
+            self.progress_bar.progress(value=self.prog_map[cid], text=f'Running Chain `{cid}`...')
+            self.prog_value = self.prog_map[cid]
+        else:
+            self.prog_value += 0.1
+            self.progress_bar.progress(value=self.prog_value, text=f'Running Chain `{cid}`...')
+    def on_chain_end(self, outputs, **kwargs) -> None:
+        pass
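
Note on `callbacks/arxiv_callbacks.py`: `ChatDataAskCallBackHandler` maps known chain ids to fixed progress values and adds 0.1 for each `LLMChain`. The bookkeeping can be sketched without Streamlit as below. `ProgressTracker` is a hypothetical class, and two behaviors are assumptions that differ from the handler in this commit: unknown chain ids fall back to the incremental step instead of raising `KeyError`, and the value is clamped to 1.0, since `st.progress` is expected to reject float values above 1.0.

```python
class ProgressTracker:
    """Streamlit-free sketch of the progress bookkeeping (hypothetical class)."""

    def __init__(self, milestones, step=0.1):
        self.milestones = dict(milestones)  # chain id -> fixed progress value
        self.step = step
        self.value = 0.0

    def on_chain_start(self, chain_id):
        if chain_id in self.milestones:
            # Known chain: jump to its fixed milestone.
            self.value = self.milestones[chain_id]
        else:
            # Nested or unknown chain: move forward a little, never past 1.0.
            self.value = min(1.0, self.value + self.step)
        return self.value


# Shortened chain ids, for illustration only
tracker = ProgressTracker({"retrieval": 0.2, "stuff": 0.8})
first = tracker.on_chain_start("retrieval")
second = tracker.on_chain_start("llm")
third = tracker.on_chain_start("stuff")
tail = [tracker.on_chain_start("llm") for _ in range(3)]
```

Separating the arithmetic from the widget calls also makes the progress logic testable without a running Streamlit session.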

prompts/arxiv_prompt.py ADDED

	@@ -0,0 +1,12 @@

+from langchain.chains.qa_with_sources.map_reduce_prompt import combine_prompt_template
+combine_prompt_template_ = (
+            "You are a helpful paper assistant. Your task is to provide information and answer any questions "
+            + "related to PDFs given below. You should only use the abstract of the selected papers as your source of information "
+            + "and try to provide concise and accurate answers to any questions asked by the user. If you are unable to find "
+            + "relevant information in the given sections, you will need to let the user know that the source does not contain "
+            + "relevant information but still try to provide an answer based on your general knowledge. The following is the related information "
+            + "about the paper that will help you answer users' questions, you MUST answer it using question's language:\n\n"
+        )
+combine_prompt_template = combine_prompt_template_ + combine_prompt_template
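
Note on `prompts/arxiv_prompt.py`: the file prepends a custom instruction to LangChain's stock combine template, and `app.py` then declares `summaries` and `question` as the only input variables. The composition can be sketched with plain string formatting as below. `base_template` is a shortened stand-in for the library's template (the real one is longer), and `prefix` abbreviates the instruction added in this commit.

```python
# Shortened stand-in for LangChain's combine template (assumption, not the real text)
base_template = "QUESTION: {question}\n=========\n{summaries}\n=========\nFINAL ANSWER:"

# Abbreviated version of the custom instruction; it must not contain braces,
# otherwise they would be read as additional template variables.
prefix = "You are a helpful paper assistant. Answer using the question's language:\n\n"

combined = prefix + base_template

# Fill the two placeholders the same way a prompt template would
prompt = combined.format(
    question="What is a GAN?",
    summaries="Content: (retrieved abstracts go here)\nSOURCE: (paper id goes here)",
)
```

Because the prefix is concatenated before formatting, any literal braces added to it later would need to be doubled (`{{` and `}}`) to survive formatting.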

requirements.txt ADDED

	@@ -0,0 +1,10 @@

+langchain @ git+https://github.com/myscale/langchain.git@master
+InstructorEmbedding
+pandas
+sentence_transformers
+streamlit==1.20
+altair==4.2.2
+clickhouse-connect
+openai
+lark
+tiktoken