Aminrhmni committed on
Commit
2f6bb0f
1 Parent(s): ad38ac0

Update README.md

  1. README.md +3 -3
README.md CHANGED
@@ -13,11 +13,11 @@ The paragraph describes the development of a language model named "Hafez," which
 
  <b>Model Type:</b> Hafez is based on the BERT architecture, which is a popular model for natural language processing (NLP).
 
- Cultural Reference: The model is named after Hafez, a renowned Persian poet known for his deeply emotional and philosophical verses. The name signals a connection to Persian literature and an intention to handle language in a way that resonates with the poet's cultural significance.
 
- Training Data: The model was trained on a dataset of more than 12 billion tokens. The training text has two parts: 90% is educational material, including research papers, dissertations, and theses, and the remaining 10% is general text. This selection aims to give the model a strong foundation in academic language and discourse.
 
- Text Cleaning and Preprocessing: The training data went through a cleaning and preprocessing phase to ensure that it is of high quality and suitable for training a machine learning model. The cleaning and preparation were carried out with "Viravirast text tools," which are likely specialized tools designed for text processing in this context.
 
 
  ### How to use
 
 
  <b>Model Type:</b> Hafez is based on the BERT architecture, which is a popular model for natural language processing (NLP).
 
+ <b>Cultural Reference:</b> The model is named after Hafez, a renowned Persian poet known for his deeply emotional and philosophical verses. The name signals a connection to Persian literature and an intention to handle language in a way that resonates with the poet's cultural significance.
 
+ <b>Training Data:</b> The model was trained on a dataset of more than 12 billion tokens. The training text has two parts: 90% is educational material, including research papers, dissertations, and theses, and the remaining 10% is general text. This selection aims to give the model a strong foundation in academic language and discourse.
 
+ <b>Text Cleaning and Preprocessing:</b> The training data went through a cleaning and preprocessing phase to ensure that it is of high quality and suitable for training a machine learning model. The cleaning and preparation were carried out with "Viravirast text tools," which are likely specialized tools designed for text processing in this context.
 
 
  ### How to use