File size: 8,206 Bytes
cee413d |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 |
import streamlit as st
# Page configuration
st.set_page_config(
layout="wide",
initial_sidebar_state="auto"
)
# Custom CSS for better styling
st.markdown("""
<style>
.main-title {
font-size: 36px;
color: #4A90E2;
font-weight: bold;
text-align: center;
}
.sub-title {
font-size: 24px;
color: #4A90E2;
margin-top: 20px;
}
.section {
background-color: #f9f9f9;
padding: 15px;
border-radius: 10px;
margin-top: 20px;
}
.section h2 {
font-size: 22px;
color: #4A90E2;
}
.section p, .section ul {
color: #666666;
}
.link {
color: #4A90E2;
text-decoration: none;
}
.benchmark-table {
width: 100%;
border-collapse: collapse;
margin-top: 20px;
}
.benchmark-table th, .benchmark-table td {
border: 1px solid #ddd;
padding: 8px;
text-align: left;
}
.benchmark-table th {
background-color: #4A90E2;
color: white;
}
.benchmark-table td {
background-color: #f2f2f2;
}
</style>
""", unsafe_allow_html=True)
# Title
st.markdown('<div class="main-title">Introduction to XLM-RoBERTa Annotators in Spark NLP</div>', unsafe_allow_html=True)
# Subtitle
st.markdown("""
<div class="section">
<p>XLM-RoBERTa (Cross-lingual Robustly Optimized BERT Approach) is an advanced multilingual model that extends the capabilities of RoBERTa to over 100 languages. Pre-trained on a massive, diverse corpus, XLM-RoBERTa is designed to handle various NLP tasks in a multilingual context, making it ideal for applications that require cross-lingual understanding. Below, we provide an overview of the XLM-RoBERTa annotators for these tasks:</p>
</div>
""", unsafe_allow_html=True)
st.markdown("""<div class="sub-title">Sequence Classification with XLM-RoBERTa</div>""", unsafe_allow_html=True)
st.markdown("""
<div class="section">
<p>Sequence classification is a common task in Natural Language Processing (NLP) where the goal is to assign a label to a sequence of text, such as sentiment analysis, spam detection, or paraphrase identification.</p>
<p><strong>XLM-RoBERTa</strong> excels at sequence classification across multiple languages, making it a powerful tool for global applications. Below is an example of how to implement sequence classification using XLM-RoBERTa in Spark NLP.</p>
<p>Using XLM-RoBERTa for Sequence Classification enables:</p>
<ul>
<li><strong>Multilingual Text Classification:</strong> Classify sequences of text in multiple languages with a single model.</li>
<li><strong>Broad Application:</strong> Apply to tasks such as sentiment analysis, spam detection, and paraphrase identification across languages.</li>
<li><strong>Transfer Learning:</strong> Utilize pretrained XLM-RoBERTa models to leverage knowledge from extensive cross-lingual datasets.</li>
</ul>
<p>Advantages of using XLM-RoBERTa for Sequence Classification in Spark NLP include:</p>
<ul>
<li><strong>Scalability:</strong> Spark NLP is built on Apache Spark, ensuring it scales efficiently for large datasets.</li>
<li><strong>Pretrained Excellence:</strong> Leverage state-of-the-art pretrained models to achieve high accuracy in text classification tasks.</li>
<li><strong>Multilingual Flexibility:</strong> XLM-RoBERTa’s multilingual capabilities make it suitable for global applications, reducing the need for language-specific models.</li>
<li><strong>Seamless Integration:</strong> Easily incorporate XLM-RoBERTa into your existing Spark pipelines for streamlined NLP workflows.</li>
</ul>
</div>
""", unsafe_allow_html=True)
st.markdown("""<div class="sub-title">How to Use XLM-RoBERTa for Sequence Classification in Spark NLP</div>""", unsafe_allow_html=True)
st.markdown("""
<div class="section">
<p>To leverage XLM-RoBERTa for sequence classification, Spark NLP provides an intuitive pipeline setup. The following example shows how to use XLM-RoBERTa for sequence classification tasks such as sentiment analysis, paraphrase detection, or categorizing text sequences into predefined classes. XLM-RoBERTa’s multilingual training enables it to perform sequence classification across various languages, making it a powerful tool for global NLP tasks.</p>
</div>
""", unsafe_allow_html=True)
# Code Example
st.code('''
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \\
.setInputCol("text") \\
.setOutputCol("document")
tokenizer = Tokenizer() \\
.setInputCols("document") \\
.setOutputCol("token")
seq_classifier = XlmRoBertaForSequenceClassification.pretrained("xlmroberta_classifier_base_mrpc","en") \\
.setInputCols(["document", "token"]) \\
.setOutputCol("class")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, seq_classifier])
data = spark.createDataFrame([["PUT YOUR STRING HERE"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("class.result").show(truncate=False)
''', language='python')
st.text("""
+-------+
|result |
+-------+
|[True] |
+-------+
""")
# Model Info Section
st.markdown('<div class="sub-title">Choosing the Right Model</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
<p>The XLM-RoBERTa model used here is pretrained and fine-tuned for sequence classification tasks such as paraphrase detection. It is available in Spark NLP, providing high accuracy and multilingual support.</p>
<p>For more information about the model, visit the <a class="link" href="https://huggingface.co/xlm-roberta-base" target="_blank">XLM-RoBERTa Model Hub</a>.</p>
</div>
""", unsafe_allow_html=True)
# References Section
st.markdown('<div class="sub-title">References</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
<ul>
<li><a class="link" href="https://arxiv.org/abs/1911.02116" target="_blank">XLM-R: Cross-lingual Pre-training</a></li>
<li><a class="link" href="https://huggingface.co/xlm-roberta-base" target="_blank">XLM-RoBERTa Model Overview</a></li>
</ul>
</div>
""", unsafe_allow_html=True)
# Footer
st.markdown("""
<div class="section">
<ul>
<li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>
<li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>
<li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>
<li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>
<li><a class="link" href="https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos" target="_blank">YouTube</a>: Video tutorials</li>
</ul>
</div>
""", unsafe_allow_html=True)
st.markdown('<div class="sub-title">Quick Links</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
<ul>
<li><a class="link" href="https://sparknlp.org/docs/en/quickstart" target="_blank">Getting Started</a></li>
<li><a class="link" href="https://nlp.johnsnowlabs.com/models" target="_blank">Pretrained Models</a></li>
<li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/annotation/text/english" target="_blank">Example Notebooks</a></li>
<li><a class="link" href="https://sparknlp.org/docs/en/install" target="_blank">Installation Guide</a></li>
</ul>
</div>
""", unsafe_allow_html=True)
|