import streamlit as st

st.set_page_config(
    layout="wide",
    initial_sidebar_state="auto"
)

st.markdown("""
<style>
    .main-title {
        font-size: 36px;
        color: #4A90E2;
        font-weight: bold;
        text-align: center;
    }
    .sub-title {
        font-size: 24px;
        color: #4A90E2;
        margin-top: 20px;
    }
    .section {
        background-color: #f9f9f9;
        padding: 15px;
        border-radius: 10px;
        margin-top: 20px;
    }
    .section h2 {
        font-size: 22px;
        color: #4A90E2;
    }
    .section p, .section ul {
        color: #666666;
    }
    .link {
        color: #4A90E2;
        text-decoration: none;
    }
    .benchmark-table {
        width: 100%;
        border-collapse: collapse;
        margin-top: 20px;
    }
    .benchmark-table th, .benchmark-table td {
        border: 1px solid #ddd;
        padding: 8px;
        text-align: left;
    }
    .benchmark-table th {
        background-color: #4A90E2;
        color: white;
    }
    .benchmark-table td {
        background-color: #f2f2f2;
    }
</style>
""", unsafe_allow_html=True)

st.markdown('<div class="main-title">Introduction to XLM-RoBERTa Annotators in Spark NLP</div>', unsafe_allow_html=True)

st.markdown("""
<div class="section">
<p>XLM-RoBERTa (Cross-lingual Robustly Optimized BERT Approach) is a multilingual transformer model that extends RoBERTa to more than 100 languages. Pretrained on a massive, diverse corpus, it handles a wide range of NLP tasks in a multilingual context, making it well suited to applications that require cross-lingual understanding. Below is an overview of the XLM-RoBERTa annotators available for these tasks in Spark NLP.</p>
</div>
""", unsafe_allow_html=True)

st.markdown('<div class="sub-title">Sequence Classification with XLM-RoBERTa</div>', unsafe_allow_html=True)

st.markdown("""
<div class="section">
<p>Sequence classification is a common NLP task in which a label is assigned to a whole sequence of text, as in sentiment analysis, spam detection, or paraphrase identification.</p>
<p><strong>XLM-RoBERTa</strong> excels at sequence classification across multiple languages, making it a powerful tool for global applications.</p>
<p>Using XLM-RoBERTa for sequence classification enables:</p>
<ul>
    <li><strong>Multilingual Text Classification:</strong> Classify sequences of text in multiple languages with a single model.</li>
    <li><strong>Broad Application:</strong> Apply it to tasks such as sentiment analysis, spam detection, and paraphrase identification across languages.</li>
    <li><strong>Transfer Learning:</strong> Leverage pretrained XLM-RoBERTa models that draw on extensive cross-lingual datasets.</li>
</ul>
<p>Advantages of using XLM-RoBERTa for sequence classification in Spark NLP include:</p>
<ul>
    <li><strong>Scalability:</strong> Spark NLP is built on Apache Spark, so it scales efficiently to large datasets.</li>
    <li><strong>Pretrained Excellence:</strong> State-of-the-art pretrained models deliver high accuracy on text classification tasks.</li>
    <li><strong>Multilingual Flexibility:</strong> XLM-RoBERTa’s multilingual capabilities suit global applications and reduce the need for language-specific models.</li>
    <li><strong>Seamless Integration:</strong> XLM-RoBERTa slots into existing Spark pipelines for streamlined NLP workflows.</li>
</ul>
</div>
""", unsafe_allow_html=True)

st.markdown('<div class="sub-title">How to Use XLM-RoBERTa for Sequence Classification in Spark NLP</div>', unsafe_allow_html=True)

st.markdown("""
<div class="section">
<p>Spark NLP provides an intuitive pipeline setup for sequence classification. The following example classifies text with a pretrained XLM-RoBERTa model; thanks to the model’s multilingual training, the same pipeline works across many languages for tasks such as sentiment analysis, paraphrase detection, or categorizing text into predefined classes.</p>
</div>
""", unsafe_allow_html=True)

st.code('''
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP
spark = sparknlp.start()

# Convert raw text into document annotations
documentAssembler = DocumentAssembler() \\
    .setInputCol("text") \\
    .setOutputCol("document")

# Split each document into tokens
tokenizer = Tokenizer() \\
    .setInputCols(["document"]) \\
    .setOutputCol("token")

# Pretrained XLM-RoBERTa model fine-tuned on MRPC (paraphrase detection)
seq_classifier = XlmRoBertaForSequenceClassification.pretrained("xlmroberta_classifier_base_mrpc", "en") \\
    .setInputCols(["document", "token"]) \\
    .setOutputCol("class")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, seq_classifier])

data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")

result = pipeline.fit(data).transform(data)
result.select("class.result").show(truncate=False)
''', language='python')

st.text("""
+-------+
|result |
+-------+
|[True] |
+-------+
""")

st.markdown('<div class="sub-title">Choosing the Right Model</div>', unsafe_allow_html=True)

st.markdown("""
<div class="section">
<p>The model used here, <code>xlmroberta_classifier_base_mrpc</code>, is pretrained and fine-tuned for paraphrase detection. It is available in Spark NLP, providing high accuracy and multilingual support; other XLM-RoBERTa sequence-classification models cover further tasks and languages.</p>
<p>For more information about the base model, visit the <a class="link" href="https://huggingface.co/xlm-roberta-base" target="_blank">XLM-RoBERTa Model Hub</a>.</p>
</div>
""", unsafe_allow_html=True)

st.markdown('<div class="sub-title">References</div>', unsafe_allow_html=True)

st.markdown("""
<div class="section">
<ul>
    <li><a class="link" href="https://arxiv.org/abs/1911.02116" target="_blank">XLM-R: Cross-lingual Pre-training</a></li>
    <li><a class="link" href="https://huggingface.co/xlm-roberta-base" target="_blank">XLM-RoBERTa Model Overview</a></li>
</ul>
</div>
""", unsafe_allow_html=True)

st.markdown('<div class="sub-title">Community &amp; Support</div>', unsafe_allow_html=True)

st.markdown("""
<div class="section">
<ul>
    <li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>
    <li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>
    <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>
    <li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>
    <li><a class="link" href="https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos" target="_blank">YouTube</a>: Video tutorials</li>
</ul>
</div>
""", unsafe_allow_html=True)

st.markdown('<div class="sub-title">Quick Links</div>', unsafe_allow_html=True)

st.markdown("""
<div class="section">
<ul>
    <li><a class="link" href="https://sparknlp.org/docs/en/quickstart" target="_blank">Getting Started</a></li>
    <li><a class="link" href="https://nlp.johnsnowlabs.com/models" target="_blank">Pretrained Models</a></li>
    <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/annotation/text/english" target="_blank">Example Notebooks</a></li>
    <li><a class="link" href="https://sparknlp.org/docs/en/install" target="_blank">Installation Guide</a></li>
</ul>
</div>
""", unsafe_allow_html=True)