---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
{}
---

# Model Card for #Encoder

#Encoder from HICL: Hashtag-Driven In-Context Learning for Social Media Natural Language Understanding.
The model encodes a tweet into topic-level embeddings and can be used to estimate **topic-level similarity** between tweets.

## Model Details

#Encoder leverages hashtags to learn inter-post topic relevance (for retrieval) via contrastive learning over 179M tweets.
It was pre-trained on pairwise posts: contrastive learning guides the model to learn topic relevance by identifying posts that share the same hashtag.
Hashtags are randomly noised during pre-training to prevent trivial representations.
Please refer to https://github.com/albertan017/HICL for more details.

![#Encoder pre-training overview](encoder-train.png)

### Model Description

- **Developed by:** Hanzhuo Tan, Department of Computing, The Hong Kong Polytechnic University
- **Model type:** RoBERTa
- **Language(s) (NLP):** English
- **License:** N/A
- **Finetuned from model:** BERTweet

### Model Sources

- **Repository:** https://github.com/albertan017/HICL
- **Paper:** HICL: Hashtag-Driven In-Context Learning for Social Media Natural Language Understanding

## Uses

```python
import torch
from transformers import AutoModel, AutoTokenizer

hashencoder = AutoModel.from_pretrained("albertan017/hashencoder")
tokenizer = AutoTokenizer.from_pretrained("albertan017/hashencoder")

tweet = "here's a sample tweet for encoding"
inputs = tokenizer(tweet, return_tensors="pt")

with torch.no_grad():
    outputs = hashencoder(**inputs)
    features = outputs.last_hidden_state  # token-level hidden states, shape (1, seq_len, hidden_size)
```

A sketch of turning these hidden states into a topic-level similarity score between two tweets is given at the end of this card.

## Bias, Risks, and Limitations

The pre-training objective does not enforce semantic similarity: two tweets sharing a hashtag are treated as related even when their meanings differ, so the embeddings capture topic-level relevance rather than sentence-level semantic equivalence.

## Training Details

### Training Data

#Encoder is pre-trained on 15 GB of plain text from 179 million tweets (4 billion tokens). Following the practice used to pre-train BERTweet, the raw data was collected from the archived Twitter stream, containing 4 TB of sampled tweets from January 2013 to June 2021. Pre-processing proceeded as follows. First, we employed fastText to extract English tweets and kept only tweets containing hashtags. Then, low-frequency hashtags appearing in fewer than 100 tweets were filtered out to alleviate sparsity. This yielded a large-scale dataset of 179M tweets, each with at least one hashtag, covering 180K distinct hashtags in total.

### Training Procedure

To leverage hashtag-gathered context in pre-training, we use contrastive learning and train #Encoder to identify pairs of posts sharing the same hashtag, thereby learning topic relevance (a schematic sketch of this objective appears at the end of this card).

## Citation

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]
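
## Example: Topic-Level Similarity (Sketch)

As referenced in the Uses section, the sketch below turns #Encoder hidden states into a topic-level similarity score between two tweets. It assumes mean pooling over non-padding tokens and cosine similarity; the HICL retrieval pipeline may pool or score differently, so treat this as an illustration rather than the authors' exact procedure.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Assumption: mean pooling over non-padding tokens; HICL may use a different pooling strategy.
def encode(texts, model, tokenizer):
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state        # (batch, seq_len, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # mean over real tokens only

hashencoder = AutoModel.from_pretrained("albertan017/hashencoder")
tokenizer = AutoTokenizer.from_pretrained("albertan017/hashencoder")

embs = encode(
    ["tweet about the world cup final", "who else is watching the match tonight?"],
    hashencoder,
    tokenizer,
)
similarity = F.cosine_similarity(embs[0:1], embs[1:2]).item()  # topic-level similarity score
print(f"topic-level similarity: {similarity:.3f}")
```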
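
## Pre-Training Objective (Schematic Sketch)

The Training Procedure section describes contrastive learning over posts paired by a shared hashtag. The snippet below is a schematic, hypothetical in-batch InfoNCE-style loss for that setup; it is not the authors' training code, and the pooling, temperature, and negative-sampling choices are assumptions.

```python
import torch
import torch.nn.functional as F

def infonce_loss(anchor_emb, positive_emb, temperature=0.05):
    """In-batch contrastive loss: anchor_emb[i] and positive_emb[i] come from posts that
    share a hashtag; every other post in the batch serves as a negative.
    The temperature value is an assumption, not taken from the paper."""
    anchor = F.normalize(anchor_emb, dim=-1)      # (batch, hidden)
    positive = F.normalize(positive_emb, dim=-1)  # (batch, hidden)
    logits = anchor @ positive.T / temperature    # (batch, batch) cosine-similarity matrix
    labels = torch.arange(anchor.size(0))         # the matching post sits on the diagonal
    return F.cross_entropy(logits, labels)

# Toy example with random vectors standing in for pooled #Encoder outputs.
anchor_emb = torch.randn(8, 768)
positive_emb = torch.randn(8, 768)
print(infonce_loss(anchor_emb, positive_emb))
```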