Weronika Stryj

privategeek24

weronika-stryj-a863721b2

AI & ML interests

Text classification, Sentiment analysis, Financial risk analysis, Text generation, Question answering, LLM optimization, Cyber risk analysis, Speech to text


Organizations

None yet

privategeek24's activity

Reacted to singhsidhukuldeep's post with 👀 about 2 months ago

Post

2160

While Google's Transformer might have introduced "Attention is all you need," Microsoft and Tsinghua University are here with the DIFF Transformer, stating, "Sparse-Attention is all you need."

The DIFF Transformer outperforms traditional Transformers in scaling properties, requiring only about 65% of the model size or training tokens to achieve comparable performance.

The secret sauce? A differential attention mechanism that amplifies focus on relevant context while canceling out noise, leading to sparser and more effective attention patterns.

How?
- It uses two separate softmax attention maps and subtracts them.
- It employs a learnable scalar λ for balancing the attention maps.
- It implements GroupNorm for each attention head independently.
- It is compatible with FlashAttention for efficient computation.
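The first two bullets can be sketched in a few lines of plain Python. This is only a minimal, single-head illustration under stated assumptions: the function names (`attention_map`, `diff_attention_map`, `diff_attention`) are hypothetical, λ is passed in as a plain float rather than learned, and the per-head GroupNorm, causal masking, and FlashAttention kernels mentioned above are left out.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(Q, K):
    """Standard scaled dot-product attention weights, one row per query."""
    d = len(Q[0])
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K])
        for q in Q
    ]


def diff_attention_map(Q1, K1, Q2, K2, lam):
    """Subtract the second softmax map, scaled by lam, from the first."""
    A1 = attention_map(Q1, K1)
    A2 = attention_map(Q2, K2)
    return [[a1 - lam * a2 for a1, a2 in zip(r1, r2)] for r1, r2 in zip(A1, A2)]


def diff_attention(Q1, K1, Q2, K2, V, lam):
    """Apply the differential map to the values V (illustrative, no GroupNorm)."""
    weights = diff_attention_map(Q1, K1, Q2, K2, lam)
    width = len(V[0])
    return [[sum(w * v[j] for w, v in zip(row, V)) for j in range(width)] for row in weights]


# Toy inputs: 3 tokens, head dimension 2 (made-up numbers).
Q1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K1 = [[1.0, 0.5], [0.2, 1.0], [0.3, 0.3]]
Q2 = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
K2 = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.4]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output = diff_attention(Q1, K1, Q2, K2, V, lam=0.5)
```

Because each softmax row sums to 1, every row of the differential map sums to 1 − λ, and attention mass that both maps assign to the same position is reduced, which is the "noise canceling" intuition in the post.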

What do you get?
- Superior long-context modeling (up to 64K tokens).
- Enhanced key information retrieval.
- Reduced hallucination in question-answering and summarization tasks.
- More robust in-context learning, less affected by prompt order.
- Mitigation of activation outliers, opening doors for efficient quantization.

Extensive experiments show DIFF Transformer's advantages across various tasks and model sizes, from 830M to 13.1B parameters.

This innovative architecture could be a game-changer for the next generation of LLMs. What are your thoughts on DIFF Transformer's potential impact?

1 reply

Reacted to davidberenstein1957's post with 👍 about 2 months ago

Post

2498

Don't use an LLM when you can use a much cheaper model.

The problem is that no one tells you how to actually do it.

Just picking a pre-trained model (e.g., BERT) and throwing it at your problem won't work!

If you want a small model to perform well on your problem, you need to fine-tune it.

And to fine-tune it, you need data.

The good news is that you don't need a lot of data but instead high-quality data for your specific problem.

In the latest livestream, I showed you guys how to get started with Argilla on the Hub! Hope to see you at the next one.

https://www.youtube.com/watch?v=BEe7shiG3rY

updated a collection about 2 months ago

Speech transcription

Collection

1 item • Updated Oct 10

Reacted to m-ric's post with 👍 about 2 months ago

Post

3038

📜 Old-school RNNs can actually rival fancy transformers!

Researchers from Mila and Borealis AI have just shown that simplified versions of good old Recurrent Neural Networks (RNNs) can match the performance of today's transformers.

They took a fresh look at LSTMs (from 1997!) and GRUs (from 2014). They stripped these models down to their bare essentials, creating "minLSTM" and "minGRU". The key changes:
❶ Removed dependencies on previous hidden states in the gates
❷ Dropped the tanh that had been added to restrict output range in order to avoid vanishing gradients
❸ Ensured outputs are time-independent in scale (not sure I understood that well either, don't worry)

⚡️ As a result, you can use a “parallel scan” algorithm to train these new, minimal RNNs in parallel, which takes 88% more memory but makes them 200x faster than their traditional counterparts for long sequences.
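To make the "parallel scan" point concrete: once the gates no longer depend on the previous hidden state, a minGRU-style update has the form h_t = a_t · h_{t-1} + b_t with a_t = 1 − z_t and b_t = z_t · h̃_t, and such linear recurrences can be evaluated with an associative scan. The sketch below is a scalar toy, not the paper's implementation: the weights `w_z` and `w_h`, the function names, and the Hillis-Steele style scan are all assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def min_gru_coeffs(xs, w_z, w_h):
    """Per-step (a_t, b_t) computed from the input only (scalar toy weights)."""
    coeffs = []
    for x in xs:
        z = sigmoid(w_z * x)        # update gate, independent of h_{t-1}
        candidate = w_h * x         # candidate state, no tanh
        coeffs.append((1.0 - z, z * candidate))
    return coeffs


def sequential(coeffs, h0=0.0):
    """Reference implementation: one step after another."""
    h, out = h0, []
    for a, b in coeffs:
        h = a * h + b
        out.append(h)
    return out


def combine(left, right):
    """Compose two affine steps h -> a*h + b, applying left first, then right."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)


def scan(coeffs, h0=0.0):
    """Inclusive scan in about log2(n) rounds; updates within a round are independent."""
    acc = list(coeffs)
    step = 1
    while step < len(acc):
        acc = [
            acc[i] if i < step else combine(acc[i - step], acc[i])
            for i in range(len(acc))
        ]
        step *= 2
    return [a * h0 + b for a, b in acc]


xs = [0.5, -1.0, 2.0, 0.3, -0.7, 1.2, 0.0]
coeffs = min_gru_coeffs(xs, w_z=0.8, w_h=1.5)
h_seq = sequential(coeffs, h0=0.5)
h_par = scan(coeffs, h0=0.5)
```

In this pure-Python version the rounds still run on one core; the point is only that each round's updates do not depend on one another, so an accelerator could compute them simultaneously, at the cost of holding the intermediate pairs in memory.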

🔥 The results are mind-blowing! Performance-wise, they go toe-to-toe with Transformers or Mamba.

And for Language Modeling, they need 2.5x fewer training steps than Transformers to reach the same performance! 🚀

🤔 Why does this matter?

By showing there are simpler models with similar performance to transformers, this challenges the narrative that we need advanced architectures for better performance!

💬 François Chollet wrote in a tweet about this paper:

“The fact that there are many recent architectures coming from different directions that roughly match Transformers is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning)”

“Curve-fitting is about embedding a dataset on a curve. The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape.”

It’s Rich Sutton’s Bitter Lesson striking again: you don’t need fancy architectures, just scale up your model and data!

Read the paper 👉 Were RNNs All We Needed? (2410.01201)

2 replies

liked 3 models about 2 months ago

updated a collection about 2 months ago

Speech transcription

Collection

1 item • Updated Oct 10

liked a model about 2 months ago

alphacep/vosk-model-small-ru

Automatic Speech Recognition • Updated Aug 8, 2023 • 8

liked a model 2 months ago

hantian/yolo-doclaynet

Updated Oct 7 • 24

Reacted to tomaarsen's post with ❤️ 2 months ago

Post

2026

🎉SetFit v1.1.0 is out! Training efficient classifiers on CPU or GPU now uses the Sentence Transformers Trainer, and we resolved a lot of issues caused by updates of third-party libraries (like Transformers). Details:

Training a SetFit classifier model consists of 2 phases:
1. Finetuning a Sentence Transformer embedding model
2. Training a Classifier to map embeddings -> classes

🔌The first phase now uses the SentenceTransformerTrainer that was introduced in the Sentence Transformers v3 update. This brings some immediate upsides like MultiGPU support, without any (intended) breaking changes.

➡️ Beyond that, we softly deprecated the "evaluation_strategy" argument in favor of "eval_strategy" (following a Transformers deprecation), and deprecated Python 3.7. In return, we add official support for Python 3.11 and 3.12.

✨ There are some more minor changes too, like max_steps and eval_max_steps now being hard limits instead of approximate ones, training/validation losses now logging nicely in Notebooks, and the "device" parameter no longer being ignored in some situations.

Check out the full release notes here: https://github.com/huggingface/setfit/releases/tag/v1.1.0
Or read the documentation: https://huggingface.co/docs/setfit
Or check out the public SetFit models for inspiration: https://huggingface.co/models?library=setfit&sort=created

P.S. The model in the code snippet trained in 1 minute and can classify ~6000 sentences per second on my GPU.

liked a model 3 months ago

google/gemma-2-2b-it

Text Generation • Updated Aug 27 • 900k • 706

Reacted to zolicsaki's post with 🚀 3 months ago

Post

1288

Fast inference is no longer a nice-to-have demo; it will be the driving force behind future frontier models. Time to switch over to custom AI hardware and short Nvidia.

Try out SambaNova's lightning fast API for free at https://sambanova.ai/fast-api?api_ref=444868

updated a collection 3 months ago

Image to Text

Collection

2 items • Updated Sep 9

liked a model 3 months ago

jinhybr/OCR-Donut-CORD

Image-to-Text • Updated Nov 5, 2022 • 1.38k • 191

Reacted to kingabzpro's post with 👀 3 months ago

Post

1838

How can I make my RAG application generate real-time responses? Up until now, I have been using Groq for fast LLM generation and the Gradio Live function. I am looking for a better solution that can help me build a real-time application without any delay. @abidlabs

kingabzpro/Real-Time-RAG

2 replies

liked a model 3 months ago

distilbert/distilbert-base-uncased

Fill-Mask • Updated May 6 • 15.1M • 566

upvoted 2 articles 3 months ago

Article

Llama 3.1 - 405B, 70B & 8B with multilinguality and long context

Jul 23

• 218

Article

TGI Multi-LoRA: Deploy Once, Serve 30 Models

Jul 18

• 51