tyrealqian committed on
Commit
f8624a3
1 Parent(s): 63b51a0

Add BERTopic model

README.md ADDED
@@ -0,0 +1,97 @@
+
+ ---
+ tags:
+ - bertopic
+ library_name: bertopic
+ pipeline_tag: text-classification
+ ---
+
+ # bertopic_WGnews_Oct31
+
+ This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
+ BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
+
+ ## Usage
+
+ To use this model, please install BERTopic:
+
+ ```
+ pip install -U bertopic
+ ```
+
+ You can use the model as follows:
+
+ ```python
+ from bertopic import BERTopic
+ topic_model = BERTopic.load("tyrealqian/bertopic_WGnews_Oct31")
+
+ topic_model.get_topic_info()
+ ```
+
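Beyond inspecting the topic table, a loaded model can also assign topics to unseen text. The following is a minimal sketch, not part of the original README: the helper name `assign_topics` is hypothetical, and it assumes the embedding model reference was saved alongside the model (otherwise an `embedding_model` argument would need to be passed to `BERTopic.load`).

```python
def assign_topics(docs, repo_id="tyrealqian/bertopic_WGnews_Oct31"):
    """Hypothetical helper: load the model and assign a topic to each document."""
    # Imported lazily so this sketch can be read without bertopic installed.
    from bertopic import BERTopic

    topic_model = BERTopic.load(repo_id)
    # transform() returns one topic ID per document, plus probabilities.
    topics, _probs = topic_model.transform(docs)
    # Pair each document with its topic ID and that topic's (word, score) list.
    return [
        (doc, topic, topic_model.get_topic(topic))
        for doc, topic in zip(docs, topics)
    ]
```

Calling `assign_topics(["Torch relay begins ahead of the Winter Olympics"])` would download the model from the Hub, so it needs network access and the `bertopic` package.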
+ ## Topic overview
+
+ * Number of topics: 28
+ * Number of training documents: 6196
+
+ <details>
+ <summary>Click here for an overview of all topics.</summary>
+
+ | Topic ID | Topic Keywords | Topic Frequency | Label |
+ |----------|----------------|-----------------|-------|
+ | -1 | beijing - winter - olympics - winter olympics - olympic | 18 | -1_beijing_winter_olympics_winter olympics |
+ | 0 | gold - medal - olympics - beijing - womens | 2054 | 0_gold_medal_olympics_beijing |
+ | 1 | covid - olympics - beijing - cases - winter | 633 | 1_covid_olympics_beijing_cases |
+ | 2 | gold - gu - womens - chinas - mens | 524 | 2_gold_gu_womens_chinas |
+ | 3 | president - xi - xi jinping - jinping - president xi | 388 | 3_president_xi_xi jinping_jinping |
+ | 4 | boycott - diplomatic - diplomatic boycott - boycott beijing - rights | 372 | 4_boycott_diplomatic_diplomatic boycott_boycott beijing |
+ | 5 | dwen - mascot - bing - bing dwen - dwen dwen | 328 | 5_dwen_mascot_bing_bing dwen |
+ | 6 | ceremony - opening - opening ceremony - beijing - ceremony beijing | 305 | 6_ceremony_opening_opening ceremony_beijing |
+ | 7 | kamila - valieva - kamila valieva - russian - figure | 249 | 7_kamila_valieva_kamila valieva_russian |
+ | 8 | torch - flame - relay - torch relay - olympic | 208 | 8_torch_flame_relay_torch relay |
+ | 9 | venue - ice - venues - zhangjiakou - beijing | 194 | 9_venue_ice_venues_zhangjiakou |
+ | 10 | sports - winter sports - winter - globalink - snow | 159 | 10_sports_winter sports_winter_globalink |
+ | 11 | food - robot - robots - served - serving | 122 | 11_food_robot_robots_served |
+ | 12 | green - carbon - games - beijing - winter | 120 | 12_green_carbon_games_beijing |
+ | 13 | coverage - heres - day - olympics - gold | 90 | 13_coverage_heres_day_olympics |
+ | 14 | bach - thomas bach - thomas - president thomas - ioc | 59 | 14_bach_thomas bach_thomas_president thomas |
+ | 15 | snow - snowfall - heavy - weather - heavy snowfall | 48 | 15_snow_snowfall_heavy_weather |
+ | 16 | bank - commemorative - digital - yuan - set | 43 | 16_bank_commemorative_digital_yuan |
+ | 17 | paralympic - paralympic games - games - paralympic winter - winter paralympic | 37 | 17_paralympic_paralympic games_games_paralympic winter |
+ | 18 | phones - personal - burner - app - smartphonelike | 34 | 18_phones_personal_burner_app |
+ | 19 | nbc - nbcuniversal - ads - ratings - nbcs | 31 | 19_nbc_nbcuniversal_ads_ratings |
+ | 20 | watch beijing - watch - athletes watch - know - names | 27 | 20_watch beijing_watch_athletes watch_know |
+ | 21 | ukraine - invasion - russian - invasion ukraine - ukraine beijing | 27 | 21_ukraine_invasion_russian_invasion ukraine |
+ | 22 | city - summer winter - summer - host summer - city host | 27 | 22_city_summer winter_summer_host summer |
+ | 23 | leduc - nonbinary - timothy leduc - timothy - openly | 26 | 23_leduc_nonbinary_timothy leduc_timothy |
+ | 24 | ralph lauren - lauren - ralph - uniforms - team | 26 | 24_ralph lauren_lauren_ralph_uniforms |
+ | 25 | peng - shuai - peng shuai - tennis - chinese tennis | 25 | 25_peng_shuai_peng shuai_tennis |
+ | 26 | women - female athletes - record - athletes - female | 22 | 26_women_female athletes_record_athletes |
+
+ </details>
+
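The `Label` column in the table appears to follow a simple pattern: the topic ID joined by underscores with the first four keywords. The sketch below reproduces that pattern for three rows copied from the table and also sums the listed frequencies. `make_label` is a hypothetical helper for illustration, not a BERTopic function.

```python
# Rows copied from the topic table: (topic ID, keywords, listed label).
ROWS = [
    (-1, ["beijing", "winter", "olympics", "winter olympics", "olympic"],
     "-1_beijing_winter_olympics_winter olympics"),
    (0, ["gold", "medal", "olympics", "beijing", "womens"],
     "0_gold_medal_olympics_beijing"),
    (25, ["peng", "shuai", "peng shuai", "tennis", "chinese tennis"],
     "25_peng_shuai_peng shuai_tennis"),
]

# Topic frequencies in table order, from topic -1 through topic 26.
FREQUENCIES = [
    18, 2054, 633, 524, 388, 372, 328, 305, 249, 208, 194, 159, 122, 120,
    90, 59, 48, 43, 37, 34, 31, 27, 27, 27, 26, 26, 25, 22,
]


def make_label(topic_id, keywords, n_words=4):
    """Join the topic ID and the first n_words keywords with underscores."""
    return "_".join([str(topic_id), *keywords[:n_words]])


# True when every reconstructed label equals the label listed in the table.
labels_match = all(make_label(tid, kws) == label for tid, kws, label in ROWS)
total_documents = sum(FREQUENCIES)
```

Read this way, the 28 rows, including the outlier topic `-1`, account for all 6196 training documents.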
+ ## Training hyperparameters
+
+ * calculate_probabilities: True
+ * language: None
+ * low_memory: False
+ * min_topic_size: 10
+ * n_gram_range: (1, 1)
+ * nr_topics: None
+ * seed_topic_list: None
+ * top_n_words: 10
+ * verbose: True
+ * zeroshot_min_similarity: 0.7
+ * zeroshot_topic_list: None
+
+ ## Framework versions
+
+ * Numpy: 1.26.4
+ * HDBSCAN: 0.8.39
+ * UMAP: 0.5.7
+ * Pandas: 2.2.2
+ * Scikit-Learn: 1.5.2
+ * Sentence-transformers: 3.2.1
+ * Transformers: 4.44.2
+ * Numba: 0.60.0
+ * Plotly: 5.24.1
+ * Python: 3.10.12
config.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "calculate_probabilities": true,
+   "language": null,
+   "low_memory": false,
+   "min_topic_size": 10,
+   "n_gram_range": [
+     1,
+     1
+   ],
+   "nr_topics": null,
+   "seed_topic_list": null,
+   "top_n_words": 10,
+   "verbose": true,
+   "zeroshot_min_similarity": 0.7,
+   "zeroshot_topic_list": null
+ }
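The `config.json` above stores `n_gram_range` as a JSON array, whereas the README lists it as the tuple `(1, 1)`, because JSON has no tuple type. A minimal sketch of reading the values back with that conversion follows. The function name `load_config` is hypothetical, and the inline string stands in for the file contents.

```python
import json

# Stand-in for the contents of config.json, copied from the diff above.
CONFIG_TEXT = """
{
  "calculate_probabilities": true,
  "language": null,
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [1, 1],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": true,
  "zeroshot_min_similarity": 0.7,
  "zeroshot_topic_list": null
}
"""


def load_config(text):
    """Parse the JSON text and restore n_gram_range to a tuple."""
    config = json.loads(text)
    # JSON arrays decode to lists; convert back to the tuple form in the README.
    config["n_gram_range"] = tuple(config["n_gram_range"])
    return config


config = load_config(CONFIG_TEXT)
```

After parsing, JSON `null` and `true` become Python `None` and `True`, which matches the values listed under "Training hyperparameters".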
ctfidf.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:319e31ae05f7469702ae747d320da3d211470104342723bca7a293ac2fc9894f
+ size 1658880
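The three lines above are a Git LFS pointer rather than the tensor data itself: each line holds a key, a space, and a value. As a rough sketch, such a pointer can be parsed as shown below. `parse_lfs_pointer` is a hypothetical helper name, and the inline string stands in for the pointer file.

```python
# Stand-in for the pointer file contents, copied from the diff above.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:319e31ae05f7469702ae747d320da3d211470104342723bca7a293ac2fc9894f
size 1658880
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line and break the oid into algorithm and digest."""
    fields = dict(
        line.split(" ", 1) for line in text.splitlines() if line.strip()
    )
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
```

The `size_bytes` field describes the real `ctfidf.safetensors` file, which is fetched separately from LFS storage.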
ctfidf_config.json ADDED
The diff for this file is too large to render. See raw diff
 
topic_embeddings.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:14fe682db798799fbf8db236803779e90f69539f67209985a4512339e8047bc5
+ size 114776
topics.json ADDED
The diff for this file is too large to render. See raw diff