---
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb
- PleIAs/YouTube-Commons
- allenai/WildChat-1M
- Salesforce/xlam-function-calling-60k
- ShareGPT4Video/ShareGPT4Video
- OpenGVLab/ShareGPT-4o
- TempoFunk/webvid-10M
- MBZUAI/VideoInstruct-100K
- Isaak-Carter/j.o.s.i.e.v4.0.1o
- NousResearch/dolma-v1_7-c4
- NousResearch/dolma-v1_7-cc_en_head
- nyu-visionx/Cambrian-10M
- LargeWorldModel/ultrachat_qa_mix_1M
- LargeWorldModel/ultrachat_qa_mix_512K
- LargeWorldModel/ultrachat_qa_mix_256K
- LargeWorldModel/ultrachat_qa_mix_128K
- nkp37/OpenVid-1M
language:
- de
- en
library_name: mlx
tags:
- moe
- multimodal
- vision
- audio
- endtoend
- j.o.s.i.e.
---

# J.O.S.I.E. (Just a Smart and Intelligent Entity)

Welcome to the J.O.S.I.E. project repository! J.O.S.I.E. is a cutting-edge, super intelligent AI assistant designed to revolutionize the way we interact with smart home systems and general AI capabilities. This document provides an overview of J.O.S.I.E.'s features, capabilities, and development roadmap.

## Table of Contents

1. [Introduction](#introduction)
2. [Features](#features)
3. [Training Stages](#training-stages)
4. [Current Progress](#current-progress)
5. [Source Code](#source-code)
6. [Contributing](#contributing)
7. [License](#license)

## Introduction

J.O.S.I.E. stands for "Just a Smart and Intelligent Entity." It is not just a conversational AI assistant but a fully multimodal AI designed to understand and process images, videos, thermal images, depth, and audio in real time. J.O.S.I.E. is built to autonomously manage smart homes and provide general-purpose assistance, with advanced capabilities accessible only to the main user.

## Features

- **Real-Time Processing:** J.O.S.I.E. operates in real-time, ensuring quick and efficient responses.
- **Tool Calling:** Capable of calling various tools to perform tasks (only for the main user).
- **Short/Long-Term Memory:** Remembers past interactions and uses this data to provide a more personalized experience.
- **Secure Information Access:** Accesses top-secret information upon receiving a special password from the main user.
- **Contextual Greetings:** Greets users based on contextual data such as time of day, birthdays, and more.
- **Voice Interaction:** Will support real-time voice responses with a response time under 0.3 ms.
- **Advanced Multimodal Capabilities:** Initially uses Meta's ImageBind model, transitioning to a self-implemented encoder.
- **Uncensored Interaction:** Full, uncensored interaction capabilities are reserved for the main user.
- **Autonomous Smart Home Management:** Manages smart home devices and systems autonomously.

## Training Stages

J.O.S.I.E.'s development is structured into several meticulously planned stages, each focusing on different aspects of its capabilities:

### Stage 1: **Genesis**
- **Objective:** Fine-tune the Large Language Model (LLM) with a custom dataset and prompt format. The LLMs used are Qwen2 7B and Qwen2 0.5B.
- **Outcome:** A robust foundation for text-based interactions.

### Stage 2: **Fusion**
- **Objective:** Train encoders separately using transfer learning to align input embeddings with text embeddings.
- **Outcome:** Harmonized multimodal input processing.
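
One common way to align encoder embeddings with text embeddings is a symmetric contrastive (InfoNCE-style) objective. The sketch below is a minimal, dependency-free illustration of that idea, not the project's training code: the function names, the `temperature` value, and the toy vectors are all assumptions made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (guards against a zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_row(logits, target_index):
    """Cross-entropy of one row of logits against the index of the true pair."""
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return log_z - logits[target_index]


def contrastive_alignment_loss(encoder_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss: row i of each input is assumed to be a matching pair."""
    enc = [l2_normalize(v) for v in encoder_embs]
    txt = [l2_normalize(v) for v in text_embs]
    n = len(enc)
    # Cosine similarity matrix, scaled by the (hypothetical) temperature.
    sim = [
        [sum(a * b for a, b in zip(enc[i], txt[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    enc_to_txt = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    txt_to_enc = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (enc_to_txt + txt_to_enc)


if __name__ == "__main__":
    pairs_ok = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    pairs_bad = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print("aligned pairs:", pairs_ok, "mismatched pairs:", pairs_bad)
```

The loss is small when each encoder embedding is closest to its own text embedding and large when the pairs are mismatched, which is the signal that pulls the two embedding spaces together during training.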

### Stage 3: **Synergy**
- **Objective:** Fine-tune the LLM for multimodal reasoning using a custom dataset.
- **Outcome:** Enhanced reasoning capabilities across text and other modalities.

### Stage 4: **Vocalize**
- **Objective:** Fine-tune the decoder for audio output, giving J.O.S.I.E. a voice.
- **Outcome:** Synchronized text and audio responses.

### Stage 5: **Convergence**
- **Objective:** Perform full model fine-tuning for seamless integration of all components.
- **Outcome:** A fully multimodal, real-time interactive AI assistant.

## Current Progress

J.O.S.I.E. is currently in its beta stage, specifically in Stage 1. The model is being actively developed, and the current version is focused on fine-tuning the LLM with custom datasets.

### Latest Beta Version 4 of Stage 1:
- **Model:** [Isaak-Carter/josiev4o-7b-stage1-v0.1](https://huggingface.co/Isaak-Carter/J.O.S.I.E.v4o-7b-stage1-v0.1-gguf)
- **Quants:** [Isaak-Carter/J.O.S.I.E.v4o-7b-stage1-v0.1-gguf](https://huggingface.co/Isaak-Carter/J.O.S.I.E.v4o-7b-stage1-v0.1-gguf)

For a sneak peek at the current progress, visit the [GitHub Repo](https://github.com/Goekdeniz-Guelmez/J.O.S.I.E.-v4o.git).

## Source Code

To see the latest updates on J.O.S.I.E.v4o, check out my <a href="https://github.com/Goekdeniz-Guelmez/J.O.S.I.E.-v4o.git">GitHub Repo</a>.
   
## Contributing

I welcome contributions from you! To contribute to J.O.S.I.E., please fork the repository and create a pull request with your changes. Ensure that your code adheres to my coding standards and includes appropriate tests and comments.

## License

J.O.S.I.E. is licensed under the Apache 2.0 License. See the [LICENSE](LICENSE) file for more details.





# Big Updates!

I have finally trained the vision and audio encoder part. Big thanks to Facebook Research for the ImageBind model, which is what I built it on top of.

What I did was copy the weights from the original ImageBind model into a second, 'downcycled' ImageBindVisionAudioHuge model.
After that, I continued training the model on a custom vision and audio dataset, using the contrastive learning algorithm introduced by Google with PaliGemma together with the text embeddings from the original ImageBind model.
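
The weight-copy step can be pictured as filtering the original checkpoint down to the modalities that are kept. The sketch below is a hedged illustration only: it assumes that "downcycled" means keeping just the vision and audio branches, that parameter names follow the `modality_*.<modality>.…` pattern visible in the layer printout, and it uses plain dictionaries in place of real tensors. The function name `filter_state_dict` and the text/depth keys are hypothetical.

```python
KEPT_MODALITIES = ("vision", "audio")


def filter_state_dict(full_state, kept=KEPT_MODALITIES):
    """Return only the entries whose second name segment is a kept modality.

    Example key: "modality_trunks.vision.blocks.0.mlp.fc1.weight"
    -> segment 0 is the container, segment 1 is the modality.
    """
    kept_state = {}
    for name, tensor in full_state.items():
        parts = name.split(".")
        if len(parts) > 1 and parts[1] in kept:
            kept_state[name] = tensor
    return kept_state


if __name__ == "__main__":
    # Toy checkpoint: values stand in for tensors; text/depth keys are illustrative.
    original = {
        "modality_preprocessors.vision.cls_token": "tensor(1, 1, 1280)",
        "modality_preprocessors.audio.cls_token": "tensor(1, 1, 768)",
        "modality_trunks.vision.blocks.0.mlp.fc1.weight": "tensor(5120, 1280)",
        "modality_trunks.text.blocks.0.mlp.fc1.weight": "tensor(...)",
        "modality_trunks.depth.blocks.0.mlp.fc1.weight": "tensor(...)",
    }
    for key in filter_state_dict(original):
        print("copied:", key)
```

With PyTorch, a dictionary filtered this way could then be loaded into the smaller model with `load_state_dict`, optionally with `strict=False` to surface any keys that do not line up.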

After merging the encoder with the text reasoner (Qwen2-0.5B-Instruct), I got successful inference on video, image, and audio.
I will slowly start writing the training script, creating the new dataset, and optimizing the model and inference code a little more, and lastly train the model.

Here are the actual model layers:

```txt
ImageBindModelAudioVision(
  (modality_preprocessors): ModuleDict(
    (vision): RGBDTPreprocessor(
      (cls_token): tensor((1, 1, 1280), requires_grad=True)
      
      (rgbt_stem): PatchEmbedGeneric(
        (proj): Sequential(
          (0): PadIm2Video()
          (1): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
        )
      )
      (pos_embedding_helper): SpatioTemporalPosEmbeddingHelper(
        (pos_embed): tensor((1, 257, 1280), requires_grad=True)
        
      )
    )
    (audio): AudioPreprocessor(
      (cls_token): tensor((1, 1, 768), requires_grad=True)
      
      (rgbt_stem): PatchEmbedGeneric(
        (proj): Conv2d(1, 768, kernel_size=(16, 16), stride=(10, 10), bias=False)
        (norm_layer): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      )
      (pos_embedding_helper): SpatioTemporalPosEmbeddingHelper(
        (pos_embed): tensor((1, 229, 768), requires_grad=True)
        
      )
    )
  )
  (modality_trunks): ModuleDict(
    (vision): SimpleTransformer(
      (pre_transformer_layer): Sequential(
        (0): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        (1): EinOpsRearrange()
      )
      (blocks): Sequential(
        (0): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (1): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (2): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (3): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (4): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (5): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (6): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (7): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (8): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (9): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (10): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (11): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (12): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (13): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (14): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (15): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (16): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (17): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (18): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (19): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (20): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (21): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (22): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (23): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (24): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (25): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (26): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (27): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (28): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (29): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (30): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
        (31): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=1280, out_features=5120, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=5120, out_features=1280, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        )
      )
      (post_transformer_layer): EinOpsRearrange()
    )
    (audio): SimpleTransformer(
      (pre_transformer_layer): Sequential(
        (0): Identity()
        (1): EinOpsRearrange()
      )
      (blocks): Sequential(
        (0): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (drop_path): Identity()
          (norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        )
        (1): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (drop_path): DropPath(drop_prob=0.009)
          (norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        )
        (2): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (drop_path): DropPath(drop_prob=0.018)
          (norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        )
        (3): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (drop_path): DropPath(drop_prob=0.027)
          (norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        )
        (4): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (drop_path): DropPath(drop_prob=0.036)
          (norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        )
        (5): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (drop_path): DropPath(drop_prob=0.045)
          (norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        )
        (6): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (drop_path): DropPath(drop_prob=0.055)
          (norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        )
        (7): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (drop_path): DropPath(drop_prob=0.064)
          (norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        )
        (8): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (drop_path): DropPath(drop_prob=0.073)
          (norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        )
        (9): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (drop_path): DropPath(drop_prob=0.082)
          (norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        )
        (10): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (drop_path): DropPath(drop_prob=0.091)
          (norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        )
        (11): BlockWithMasking(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (drop_path): DropPath(drop_prob=0.100)
          (norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU(approximate='none')
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.0, inplace=False)
          )
          (norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        )
      )
      (post_transformer_layer): EinOpsRearrange()
    )
  )
  (modality_heads): ModuleDict(
    (vision): Sequential(
      (0): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
      (1): SelectElement()
      (2): Linear(in_features=1280, out_features=1024, bias=False)
    )
    (audio): Sequential(
      (0): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
      (1): SelectElement()
      (2): Linear(in_features=768, out_features=1024, bias=False)
    )
  )
  (modality_postprocessors): ModuleDict(
    (vision): Normalize()
    (audio): Sequential(
      (0): Normalize()
      (1): LearnableLogitScaling(logit_scale_init=20.0,learnable=False, max_logit_scale=100)
    )
  )
)
```
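
The `drop_prob` values printed for the audio trunk blocks above (0.018 at block 2, rising to 0.100 at block 11) look like a linearly increasing stochastic-depth schedule. The sketch below reproduces those numbers under that assumption. The helper name `drop_path_schedule`, the depth of 12, and the maximum rate of 0.1 are inferred from the printout and are not taken from the model's construction code. Blocks 0 and 1 are assumed to follow the same pattern.

```python
def drop_path_schedule(depth: int, max_rate: float) -> list[float]:
    """Return `depth` drop-path rates spaced linearly from 0.0 to `max_rate`.

    Hypothetical helper: an assumption about how the printed values were
    generated, not code from this repository.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if depth == 1:
        return [0.0]
    # Block i gets max_rate * i / (depth - 1), so the last block gets max_rate.
    return [max_rate * i / (depth - 1) for i in range(depth)]


# Assumed values: 12 blocks (indices 0..11) and a maximum rate of 0.1.
rates = drop_path_schedule(depth=12, max_rate=0.1)

for index, rate in enumerate(rates):
    # Same three-decimal formatting as the module printout.
    print(f"block {index:2d}: drop_prob={rate:.3f}")
```

Rounded to three decimals, the rates for blocks 2 through 11 match the printout. With a schedule like this, earlier blocks are rarely skipped during training while later blocks are dropped more often. At inference time drop-path layers typically act as the identity.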