""" This specific file was bodged together by ham-handed hedgehogs. If something looks wrong, it's because it is. If you're not a hedgehog, you shouldn't reuse this code. Use this instead: https://docs.streamlit.io/library/get-started """ import streamlit as st from st_helpers import make_header, content_text, content_title, cite, make_footer, make_tabs from charts import draw_current_progress st.set_page_config(page_title="Training Transformers Together", layout="centered") st.markdown("## Full demo content will be posted here on December 7th!") make_header() content_text(f""" There was a time when you could comfortably train state-of-the-art vision and language models at home on your workstation. The first convolutional neural net to beat ImageNet ({cite("AlexNet", "https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf")}) was trained for 5-6 days on two gamer-grade GPUs. In contrast, today's TOP-1 ImageNet model ({cite("CoAtNet", "https://arxiv.org/abs/2106.04803")}) takes 20,000 TPU-v3 days. And things are even worse in the NLP world: training {cite("GPT‑3", "https://arxiv.org/abs/2005.14165")} on a top-tier server with 8x A100 would take decades.""") content_text(f""" So, can individual researchers and small labs still train state-of-the-art? Yes we can! All it takes is for a bunch of us to come together. In fact, we're doing it right now and you're invited to join! """, vspace_before=12) draw_current_progress() content_text(f""" For this demo we train a model similar to {cite("OpenAI DALL-E", "https://openai.com/blog/dall-e/")}, that is, a transformer "language model" that generates images from text description. It is trained on {cite("LAION-400M", "https://laion.ai/laion-400-open-dataset/")}, the world's largest openly available image-text-pair dataset with 400 million samples. Our model is based on the {cite("dalle‑pytorch", "https://github.com/lucidrains/DALLE-pytorch")} implementation by {cite("Phil Wang", "https://github.com/lucidrains")} with a few tweaks to make it communication-efficient. """, vspace_after=8) with st.expander("How to train efficiently over the internet?"): content_text(f""" Modern distributed training algorithms are designed for HPC networks with 10-100 gigabit per second bandwidth. In turn, a typical Internet connection runs at 10-100 megabits per second: that’s three orders of magnitude slower. To make distributed training efficient, you need to win back these three orders of magnitude. This may seem daunting at first, but in reality, DL researchers have already made all the necessary pieces for solving this puzzle: """) content_text(f"""
| Speed‑up | How to achieve |
| :---: | :--- |
| 4-16x | Large-batch training: {cite("You et al. (2019)", "https://arxiv.org/abs/1904.00962")} proposed a way to train neural networks efficiently with larger batches, and hence, fewer communication rounds. |
| 4-64x | Gradient compression: from simple {cite("8-bit quantization", "https://arxiv.org/abs/1511.04561")} to advanced techniques such as {cite("Deep Gradient Compression", "https://arxiv.org/abs/1712.01887")}, {cite("PowerSGD", "https://arxiv.org/abs/1905.13727")}, {cite("1-bit Adam", "https://arxiv.org/abs/2102.02888")}, and many others. As a rule of thumb, you can safely reduce communication by 16-64x. More extreme compression is often possible, but it may affect stability or final quality. |
| 4-24x | Parameter sharing: reusing parameters between model layers results in a model with fewer parameters, and hence, fewer gradients to communicate. {cite("Lan et al. (2019)", "https://arxiv.org/abs/1909.11942")} and {cite("Xue et al. (2021)", "https://arxiv.org/pdf/2107.11817.pdf")} propose efficient parameter sharing techniques for NLP and vision. |
| 1.5-2x | Overlapping computation with communication: running network communication in the background while computing the next portion of gradients. This is a {cite("long-standing trick from HPC", "https://ur.booksc.eu/book/1624068/2d0506")} that was recently adapted for DL training. {cite("Ren et al. (2021)", "https://arxiv.org/abs/2101.06840")} show that updating parameters in the background while computing the next batch of gradients does not hurt convergence. |
""")
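    content_text("""
A minimal sketch of the large-batch idea: accumulate many micro-batches locally and synchronize gradients only once
per (very large) batch. The helper below is illustrative, not the demo's actual training code. It assumes an
initialized `torch.distributed` process group; in practice you would also use a large-batch optimizer such as the
one proposed by You et al. (2019) to keep training stable.
""")
    st.code('''
import torch
import torch.nn.functional as F
import torch.distributed as dist

def large_batch_step(model, optimizer, microbatches):
    """Accumulate gradients over many micro-batches, then communicate once."""
    optimizer.zero_grad()
    for inputs, targets in microbatches:  # e.g. dozens of micro-batches per peer
        loss = F.cross_entropy(model(inputs), targets)
        (loss / len(microbatches)).backward()  # scale so the accumulated gradient is an average
    # One communication round per huge batch instead of one per micro-batch.
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= dist.get_world_size()
    optimizer.step()
''', language="python")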
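    content_text("""
A minimal sketch of gradient compression via uniform 8-bit quantization: each peer sends a uint8 tensor plus two
scalars instead of float32 gradients, roughly 4x less traffic. Real schemes such as PowerSGD or 1-bit Adam are
considerably more sophisticated (for example, they use error feedback); the function names here are illustrative.
""")
    st.code('''
import torch

def quantize_8bit(grad: torch.Tensor):
    """Map gradient values to 256 uniform levels; send q, lo and scale over the network."""
    lo, hi = grad.min(), grad.max()
    scale = (hi - lo).clamp_min(1e-12) / 255
    q = ((grad - lo) / scale).round().to(torch.uint8)
    return q, lo, scale

def dequantize_8bit(q: torch.Tensor, lo: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Reconstruct an approximate float32 gradient on the receiving side."""
    return q.to(torch.float32) * scale + lo
''', language="python")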
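    content_text("""
A minimal sketch of ALBERT-style parameter sharing: one set of layer weights applied many times, so there are far
fewer unique parameters and hence far fewer gradients to send. The class below is a toy illustration built from
`torch.nn` primitives, not the dalle-pytorch model used in this demo.
""")
    st.code('''
import torch.nn as nn

class SharedDepthTransformer(nn.Module):
    """Apply one transformer layer `depth` times instead of stacking `depth` distinct layers."""
    def __init__(self, d_model: int = 1024, nhead: int = 16, depth: int = 24):
        super().__init__()
        self.shared_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.depth = depth  # communication scales with unique parameters, not with depth

    def forward(self, x):
        for _ in range(self.depth):  # the same weights (and gradients) are reused every pass
            x = self.shared_layer(x)
        return x
''', language="python")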
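    content_text("""
A minimal sketch of overlapping communication with computation: launch the gradient all-reduce asynchronously and do
useful work (here, prefetching the next batch) while the gradients are in flight. Production systems go further and
overlap communication with the backward pass itself or with the next batch's forward pass. `prefetch_next_batch` is a
hypothetical callable, and an initialized `torch.distributed` process group is assumed.
""")
    st.code('''
import torch.distributed as dist

def overlapped_step(model, optimizer, loss, prefetch_next_batch):
    """Run gradient communication in the background instead of blocking on it."""
    loss.backward()
    # async_op=True returns handles immediately; communication proceeds in the background.
    handles = [dist.all_reduce(p.grad, async_op=True)
               for p in model.parameters() if p.grad is not None]
    next_batch = prefetch_next_batch()  # useful work while gradients travel over the network
    for handle in handles:
        handle.wait()  # communication must finish before applying the update
    for p in model.parameters():
        if p.grad is not None:
            p.grad /= dist.get_world_size()
    optimizer.step()
    optimizer.zero_grad()
    return next_batch
''', language="python")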
""") content_text(""" These techniques are already more than enough to cover 1000x slower communication (totalling to 655. and choose which techniques to use. In this demo, we use parameter sharing to reduce the number of parameters by roughly 12x. If you don’t want parameter sharing, you can instead use more advanced gradient compression or larger batches. """) content_title("How do I join?") content_text(f""" That's easy. First, make sure you're logged in at Hugging Face. If you don't have an account, create one {cite("here", "https://huggingface.co/join")}.
Please note that we currently limit the number of Colab participants to TODO to make sure we do not interfere
with other users. If there are too many active peers, take a look at the alternative starter kits here: TODO
""")

content_title("How does it work?")

content_text("TODO: general story that weaves together the three tabs below. Lorem ipsum dolor sit amet, "
             "consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim"
             " ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. "
             "Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. "
             "Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.")

make_tabs()
make_footer()