zetavg committed
Commit 6c08b63
2 Parent(s): 4c02e18 bbdf699

Merge branch 'dev-2'

LLaMA_LoRA.ipynb CHANGED
@@ -27,13 +27,13 @@
    "colab_type": "text"
   },
   "source": [
-    "<a href=\"https://colab.research.google.com/github/zetavg/LLaMA-LoRA/blob/main/LLaMA_LoRA.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+    "<a href=\"https://colab.research.google.com/github/zetavg/LLaMA-LoRA-Tuner/blob/main/LLaMA_LoRA.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
-    "# 🦙🎛️ LLaMA-LoRA\n",
+    "# 🦙🎛️ LLaMA-LoRA Tuner\n",
    "\n",
    "TL;DR: **Runtime > Run All** (`⌘/Ctrl+F9`). Takes about 5 minutes to start. You will be promped to authorize Google Drive access."
   ],
@@ -72,9 +72,9 @@
    "# @title Git/Project { display-mode: \"form\", run: \"auto\" }\n",
    "# @markdown Project settings.\n",
    "\n",
-    "# @markdown The URL of the LLaMA-LoRA project<br>&nbsp;&nbsp;(default: `https://github.com/zetavg/llama-lora.git`):\n",
-    "llama_lora_project_url = \"https://github.com/zetavg/llama-lora.git\" # @param {type:\"string\"}\n",
-    "# @markdown The branch to use for LLaMA-LoRA project:\n",
+    "# @markdown The URL of the LLaMA-LoRA-Tuner project<br>&nbsp;&nbsp;(default: `https://github.com/zetavg/LLaMA-LoRA-Tuner.git`):\n",
+    "llama_lora_project_url = \"https://github.com/zetavg/LLaMA-LoRA-Tuner.git\" # @param {type:\"string\"}\n",
+    "# @markdown The branch to use for LLaMA-LoRA-Tuner project:\n",
    "llama_lora_project_branch = \"main\" # @param {type:\"string\"}\n",
    "\n",
    "# # @markdown Forces the local directory to be updated by the remote branch:\n",
README.md CHANGED
@@ -1,6 +1,6 @@
-# 🦙🎛️ LLaMA-LoRA
+# 🦙🎛️ LLaMA-LoRA Tuner
 
-<a href="https://colab.research.google.com/github/zetavg/LLaMA-LoRA/blob/main/LLaMA_LoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
+<a href="https://colab.research.google.com/github/zetavg/LLaMA-LoRA-Tuner/blob/main/LLaMA_LoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
 
 Making evaluating and fine-tuning LLaMA models with low-rank adaptation (LoRA) easy.
 
@@ -27,7 +27,7 @@ There are various ways to run this app:
 
 ### Run On Google Colab
 
-Open [this Colab Notebook](https://colab.research.google.com/github/zetavg/LLaMA-LoRA/blob/main/LLaMA_LoRA.ipynb) and select **Runtime > Run All** (`⌘/Ctrl+F9`).
+Open [this Colab Notebook](https://colab.research.google.com/github/zetavg/LLaMA-LoRA-Tuner/blob/main/LLaMA_LoRA.ipynb) and select **Runtime > Run All** (`⌘/Ctrl+F9`).
 
 You will be prompted to authorize Google Drive access, as Google Drive will be used to store your data. See the "Config"/"Google Drive" section for settings and more info.
 
@@ -38,7 +38,7 @@ After approximately 5 minutes of running, you will see the public URL in the out
 After following the [installation guide of SkyPilot](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html), create a `.yaml` to define a task for running the app:
 
 ```yaml
-# llama-lora-multitool.yaml
+# llama-lora-tuner.yaml
 
 resources:
   accelerators: A10:1  # 1x NVIDIA A10 GPU, about US$ 0.6 / hr on Lambda Cloud.
@@ -49,13 +49,13 @@ file_mounts:
   # (to store train datasets trained models)
   # See https://skypilot.readthedocs.io/en/latest/reference/storage.html for details.
   /data:
-    name: llama-lora-multitool-data # Make sure this name is unique or you own this bucket. If it does not exists, SkyPilot will try to create a bucket with this name.
+    name: llama-lora-tuner-data # Make sure this name is unique or you own this bucket. If it does not exists, SkyPilot will try to create a bucket with this name.
     store: s3  # Could be either of [s3, gcs]
     mode: MOUNT
 
-# Clone the LLaMA-LoRA repo and install its dependencies.
+# Clone the LLaMA-LoRA Tuner repo and install its dependencies.
 setup: |
-  git clone https://github.com/zetavg/LLaMA-LoRA.git llama_lora
+  git clone https://github.com/zetavg/LLaMA-LoRA-Tuner.git llama_lora
   cd llama_lora && pip install -r requirements.lock.txt
   cd ..
   echo 'Dependencies installed.'
@@ -69,7 +69,7 @@ run: |
 Then launch a cluster to run the task:
 
 ```
-sky launch -c llama-lora-multitool llama-lora-multitool.yaml
+sky launch -c llama-lora-tuner llama-lora-tuner.yaml
 ```
 
 `-c ...` is an optional flag to specify a cluster name. If not specified, SkyPilot will automatically generate one.
@@ -86,8 +86,8 @@ When you are done, run `sky stop <cluster_name>` to stop the cluster. To termina
 <summary>Prepare environment with conda</summary>
 
 ```bash
-conda create -y python=3.8 -n llama-lora-multitool
-conda activate llama-lora-multitool
+conda create -y python=3.8 -n llama-lora-tuner
+conda activate llama-lora-tuner
 ```
 </details>
 
app.py CHANGED
@@ -7,6 +7,7 @@ import gradio as gr
 from llama_lora.globals import Global
 from llama_lora.ui.main_page import main_page, get_page_title, main_page_custom_css
 from llama_lora.utils.data import init_data_dir
+from llama_lora.models import load_base_model
 
 
 def main(
@@ -16,6 +17,7 @@ def main(
     # Allows to listen on all interfaces by providing '0.0.0.0'.
     server_name: str = "127.0.0.1",
     share: bool = False,
+    skip_loading_base_model: bool = False,
     ui_show_sys_info: bool = True,
     ui_dev_mode: bool = False,
 ):
@@ -39,6 +41,9 @@ def main(
     os.makedirs(data_dir, exist_ok=True)
     init_data_dir()
 
+    if not skip_loading_base_model:
+        load_base_model()
+
     with gr.Blocks(title=get_page_title(), css=main_page_custom_css()) as demo:
         main_page()
 
llama_lora/globals.py CHANGED
@@ -6,6 +6,7 @@ from typing import Any, Dict, List, Optional, Tuple, Union
 from numba import cuda
 import nvidia_smi
 
+from .utils.lru_cache import LRUCache
 from .lib.finetune import train
 
 
@@ -25,8 +26,13 @@ class Global:
     # Training Control
     should_stop_training = False
 
+    # Generation Control
+    should_stop_generating = False
+    generation_force_stopped_at = None
+
     # Model related
     model_has_been_used = False
+    cached_lora_models = LRUCache(10)
 
     # GPU Info
     gpu_cc = None  # GPU compute capability
@@ -35,7 +41,7 @@ class Global:
     gpu_total_memory = None
 
     # UI related
-    ui_title: str = "LLaMA-LoRA"
+    ui_title: str = "LLaMA-LoRA Tuner"
     ui_emoji: str = "🦙🎛️"
     ui_subtitle: str = "Toolkit for evaluating and fine-tuning LLaMA models with low-rank adaptation (LoRA)."
     ui_show_sys_info: bool = True
llama_lora/lib/finetune.py CHANGED
@@ -2,6 +2,8 @@ import os
 import sys
 from typing import Any, List
 
+import json
+
 import fire
 import torch
 import transformers
@@ -47,6 +49,10 @@ def train(
     # logging
     callbacks: List[Any] = []
 ):
+    if os.path.exists(output_dir):
+        if (not os.path.isdir(output_dir)) or os.path.exists(os.path.join(output_dir, 'adapter_config.json')):
+            raise ValueError(f"The output directory already exists and is not empty. ({output_dir})")
+
     device_map = "auto"
     world_size = int(os.environ.get("WORLD_SIZE", 1))
     ddp = world_size != 1
@@ -202,6 +208,12 @@ def train(
         ),
         callbacks=callbacks,
     )
+
+    if not os.path.exists(output_dir):
+        os.makedirs(output_dir)
+    with open(os.path.join(output_dir, "trainer_args.json"), 'w') as trainer_args_json_file:
+        json.dump(trainer.args.to_dict(), trainer_args_json_file, indent=2)
+
     model.config.use_cache = False
 
     old_state_dict = model.state_dict
@@ -214,9 +226,16 @@ def train(
     if torch.__version__ >= "2" and sys.platform != "win32":
         model = torch.compile(model)
 
-    result = trainer.train(resume_from_checkpoint=resume_from_checkpoint)
+    train_output = trainer.train(resume_from_checkpoint=resume_from_checkpoint)
 
     model.save_pretrained(output_dir)
     print(f"Model saved to {output_dir}.")
 
-    return result
+    with open(os.path.join(output_dir, "trainer_log_history.jsonl"), 'w') as trainer_log_history_jsonl_file:
+        trainer_log_history = "\n".join([json.dumps(line) for line in trainer.state.log_history])
+        trainer_log_history_jsonl_file.write(trainer_log_history)
+
+    with open(os.path.join(output_dir, "train_output.json"), 'w') as train_output_json_file:
+        json.dump(train_output, train_output_json_file, indent=2)
+
+    return train_output
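With this change, `train()` leaves a small audit trail next to the saved adapter: `trainer_args.json`, `trainer_log_history.jsonl`, and `train_output.json`. A minimal sketch of reading these back (the directory path is an example; the file names come from the diff above, and the exact shape of `train_output.json` follows from how `json` serializes the returned `TrainOutput` tuple):

```python
# Sketch: inspect the training artifacts written into output_dir by the updated train().
# "data/lora_models/my-model" is an example path, not something defined by this commit.
import json
import os

output_dir = "data/lora_models/my-model"

with open(os.path.join(output_dir, "trainer_args.json")) as f:
    trainer_args = json.load(f)            # TrainingArguments.to_dict()

log_history = []
with open(os.path.join(output_dir, "trainer_log_history.jsonl")) as f:
    for line in f:
        if line.strip():
            log_history.append(json.loads(line))  # one dict per logged step

with open(os.path.join(output_dir, "train_output.json")) as f:
    train_output = json.load(f)            # TrainOutput is a tuple, so it lands here as a list

print(trainer_args.get("num_train_epochs"), len(log_history), train_output)
```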
llama_lora/models.py CHANGED
@@ -31,27 +31,32 @@ def get_base_model():
     return Global.loaded_base_model
 
 
-def get_model_with_lora(lora_weights: str = "tloen/alpaca-lora-7b"):
+def get_model_with_lora(lora_weights_name_or_path: str = "tloen/alpaca-lora-7b"):
     Global.model_has_been_used = True
 
+    if Global.cached_lora_models:
+        model_from_cache = Global.cached_lora_models.get(lora_weights_name_or_path)
+        if model_from_cache:
+            return model_from_cache
+
     if device == "cuda":
         model = PeftModel.from_pretrained(
             get_base_model(),
-            lora_weights,
+            lora_weights_name_or_path,
             torch_dtype=torch.float16,
             device_map={'': 0},  # ? https://github.com/tloen/alpaca-lora/issues/21
         )
     elif device == "mps":
         model = PeftModel.from_pretrained(
             get_base_model(),
-            lora_weights,
+            lora_weights_name_or_path,
             device_map={"": device},
             torch_dtype=torch.float16,
         )
     else:
         model = PeftModel.from_pretrained(
             get_base_model(),
-            lora_weights,
+            lora_weights_name_or_path,
             device_map={"": device},
         )
 
@@ -65,6 +70,10 @@ def get_model_with_lora(lora_weights: str = "tloen/alpaca-lora-7b"):
     model.eval()
     if torch.__version__ >= "2" and sys.platform != "win32":
         model = torch.compile(model)
+
+    if Global.cached_lora_models:
+        Global.cached_lora_models.set(lora_weights_name_or_path, model)
+
     return model
 
 
@@ -121,6 +130,8 @@ def unload_models():
     del Global.loaded_tokenizer
     Global.loaded_tokenizer = None
 
+    Global.cached_lora_models.clear()
+
     clear_cache()
 
     Global.model_has_been_used = False
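The practical effect of `cached_lora_models` is that repeated inference calls against the same adapter reuse the already-built `PeftModel` instead of reloading it from disk or the Hub. A rough sketch of that behaviour (it assumes a working base-model setup, so it is illustrative rather than something to run as-is):

```python
# Illustrative only: requires the base model to be loadable in the current environment.
from llama_lora.models import get_model_with_lora

m1 = get_model_with_lora("tloen/alpaca-lora-7b")  # builds the PeftModel and stores it in the LRU cache
m2 = get_model_with_lora("tloen/alpaca-lora-7b")  # returned straight from Global.cached_lora_models
assert m1 is m2  # same object, no second load

# unload_models() now also calls Global.cached_lora_models.clear(),
# so cached adapters are dropped together with the base model.
```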
llama_lora/ui/finetune_ui.py CHANGED
@@ -269,6 +269,9 @@ def do_train(
     progress=gr.Progress(track_tqdm=should_training_progress_track_tqdm),
 ):
     try:
+        if not should_training_progress_track_tqdm:
+            progress(0, desc="Preparing train data...")
+
         clear_cache()
         # If model has been used in inference, we need to unload it first.
         # Otherwise, we'll get a 'Function MmBackward0 returned an invalid
@@ -373,6 +376,9 @@ Train data (first 10):
             time.sleep(2)
             return message
 
+        if not should_training_progress_track_tqdm:
+            progress(0, desc="Preparing model for training...")
+
         log_history = []
 
         class UiTrainerCallback(TrainerCallback):
@@ -419,11 +425,30 @@ Train data (first 10):
         # Do not let other tqdm iterations interfere the progress reporting after training starts.
         # progress.track_tqdm = False  # setting this dynamically is not working, determining if track_tqdm should be enabled based on GPU cores at start instead.
 
-        results = Global.train_fn(
+        output_dir = os.path.join(Global.data_dir, "lora_models", model_name)
+        if not os.path.exists(output_dir):
+            os.makedirs(output_dir)
+
+        with open(os.path.join(output_dir, "info.json"), 'w') as info_json_file:
+            dataset_name = "N/A (from text input)"
+            if load_dataset_from == "Data Dir":
+                dataset_name = dataset_from_data_dir
+
+            info = {
+                'base_model': Global.base_model,
+                'prompt_template': template,
+                'dataset_name': dataset_name,
+                'dataset_rows': len(train_data),
+            }
+            json.dump(info, info_json_file, indent=2)
+
+        if not should_training_progress_track_tqdm:
+            progress(0, desc="Train starting...")
+
+        train_output = Global.train_fn(
             base_model,  # base_model
             tokenizer,  # tokenizer
-            os.path.join(Global.data_dir, "lora_models",
-                         model_name),  # output_dir
+            output_dir,  # output_dir
             train_data,
             # 128,  # batch_size (is not used, use gradient_accumulation_steps instead)
             micro_batch_size,  # micro_batch_size
@@ -445,12 +470,13 @@ Train data (first 10):
         logs_str = "\n".join([json.dumps(log)
                               for log in log_history]) or "None"
 
-        result_message = f"Training ended:\n{str(results)}\n\nLogs:\n{logs_str}"
+        result_message = f"Training ended:\n{str(train_output)}\n\nLogs:\n{logs_str}"
         print(result_message)
+        clear_cache()
         return result_message
 
     except Exception as e:
-        raise gr.Error(e)
+        raise gr.Error(f"{e} (To dismiss this error, click the 'Abort' button)")
 
 
 def do_abort_training():
@@ -675,9 +701,9 @@ def finetune_ui():
                     elem_id="finetune_confirm_stop_btn"
                 )
 
-        training_status = gr.Text(
-            "Training status will be shown here.",
-            label="Training Status/Results",
+        train_output = gr.Text(
+            "Training results will be shown here.",
+            label="Train Output",
             elem_id="finetune_training_status")
 
         train_progress = train_btn.click(
@@ -693,7 +719,7 @@ def finetune_ui():
                 lora_dropout,
                 model_name
             ]),
-            outputs=training_status
+            outputs=train_output
         )
 
         # controlled by JS, shows the confirm_abort_button
llama_lora/ui/inference_ui.py CHANGED
@@ -11,13 +11,15 @@ from ..models import get_base_model, get_model_with_lora, get_tokenizer, get_dev
 from ..utils.data import (
     get_available_template_names,
     get_available_lora_model_names,
-    get_path_of_available_lora_model)
+    get_path_of_available_lora_model,
+    get_info_of_available_lora_model)
 from ..utils.prompter import Prompter
 from ..utils.callbacks import Iteratorize, Stream
 
 device = get_device()
 
 default_show_raw = True
+inference_output_lines = 12
 
 
 def do_inference(
@@ -36,12 +38,23 @@ def do_inference(
     progress=gr.Progress(track_tqdm=True),
 ):
     try:
+        if Global.generation_force_stopped_at is not None:
+            required_elapsed_time_after_forced_stop = 1
+            current_unix_time = time.time()
+            remaining_time = required_elapsed_time_after_forced_stop - \
+                (current_unix_time - Global.generation_force_stopped_at)
+            if remaining_time > 0:
+                time.sleep(remaining_time)
+            Global.generation_force_stopped_at = None
+
         variables = [variable_0, variable_1, variable_2, variable_3,
                      variable_4, variable_5, variable_6, variable_7]
         prompter = Prompter(prompt_template)
         prompt = prompter.generate_prompt(variables)
 
-        if lora_model_name is not None and "/" not in lora_model_name and lora_model_name != "None":
+        if not lora_model_name:
+            lora_model_name = "None"
+        if "/" not in lora_model_name and lora_model_name != "None":
             path_of_available_lora_model = get_path_of_available_lora_model(
                 lora_model_name)
             if path_of_available_lora_model:
@@ -66,16 +79,24 @@ def do_inference(
                 yield out
 
             for partial_sentence in word_generator(message):
-                yield partial_sentence, json.dumps(list(range(len(partial_sentence.split()))), indent=2)
+                yield (
+                    gr.Textbox.update(
+                        value=partial_sentence, lines=inference_output_lines),
+                    json.dumps(
+                        list(range(len(partial_sentence.split()))), indent=2)
+                )
                 time.sleep(0.05)
 
             return
             time.sleep(1)
-            yield message, json.dumps(list(range(len(message.split()))), indent=2)
+            yield (
+                gr.Textbox.update(value=message, lines=1),  # TODO
+                json.dumps(list(range(len(message.split()))), indent=2)
+            )
             return
 
         model = get_base_model()
-        if not lora_model_name == "None" and lora_model_name is not None:
+        if lora_model_name != "None":
             model = get_model_with_lora(lora_model_name)
         tokenizer = get_tokenizer()
 
@@ -97,6 +118,19 @@ def do_inference(
             "max_new_tokens": max_new_tokens,
         }
 
+        def ui_generation_stopping_criteria(input_ids, score, **kwargs):
+            if Global.should_stop_generating:
+                return True
+            return False
+
+        Global.should_stop_generating = False
+        generate_params.setdefault(
+            "stopping_criteria", transformers.StoppingCriteriaList()
+        )
+        generate_params["stopping_criteria"].append(
+            ui_generation_stopping_criteria
+        )
+
         if stream_output:
             # Stream the reply 1 token at a time.
             # This is based on the trick of using 'stopping_criteria' to create an iterator,
@@ -128,29 +162,61 @@ def do_inference(
                     raw_output = None
                     if show_raw:
                         raw_output = str(output)
-                    yield prompter.get_response(decoded_output), raw_output
+                    response = prompter.get_response(decoded_output)
+
+                    if Global.should_stop_generating:
+                        return
+
+                    yield (
+                        gr.Textbox.update(
+                            value=response, lines=inference_output_lines),
+                        raw_output)
+
+                    if Global.should_stop_generating:
+                        # If the user stops the generation, and then clicks the
+                        # generation button again, they may mysteriously landed
+                        # here, in the previous, should-be-stopped generation
+                        # function call, with the new generation function not be
+                        # called at all. To workaround this, we yield a message
+                        # and setting lines=1, and if the front-end JS detects
+                        # that lines has been set to 1 (rows="1" in HTML),
+                        # it will automatically click the generate button again
+                        # (gr.Textbox.update() does not support updating
+                        # elem_classes or elem_id).
+                        # [WORKAROUND-UI01]
+                        yield (
+                            gr.Textbox.update(
+                                value="Please retry", lines=1),
+                            None)
             return  # early return for stream_output
 
         # Without streaming
         with torch.no_grad():
-            generation_output = model.generate(
-                input_ids=input_ids,
-                generation_config=generation_config,
-                return_dict_in_generate=True,
-                output_scores=True,
-                max_new_tokens=max_new_tokens,
-            )
+            generation_output = model.generate(**generate_params)
         s = generation_output.sequences[0]
         output = tokenizer.decode(s)
         raw_output = None
         if show_raw:
            raw_output = str(s)
-        yield prompter.get_response(output), raw_output
+
+        response = prompter.get_response(output)
+        if Global.should_stop_generating:
+            return
+
+        yield (
+            gr.Textbox.update(value=response, lines=inference_output_lines),
+            raw_output)
+
 
     except Exception as e:
         raise gr.Error(e)
 
 
+def handle_stop_generate():
+    Global.generation_force_stopped_at = time.time()
+    Global.should_stop_generating = True
+
+
 def reload_selections(current_lora_model, current_prompt_template):
     available_template_names = get_available_template_names()
     available_template_names_with_none = available_template_names + ["None"]
@@ -172,7 +238,7 @@ def reload_selections(current_lora_model, current_prompt_template):
         gr.Dropdown.update(choices=available_template_names_with_none, value=current_prompt_template))
 
 
-def handle_prompt_template_change(prompt_template):
+def handle_prompt_template_change(prompt_template, lora_model):
     prompter = Prompter(prompt_template)
     var_names = prompter.get_variable_names()
     human_var_names = [' '.join(word.capitalize()
@@ -182,7 +248,36 @@ def handle_prompt_template_change(prompt_template):
     while len(gr_updates) < 8:
         gr_updates.append(gr.Textbox.update(
             label="Not Used", visible=False))
-    return gr_updates
+
+    model_prompt_template_message_update = gr.Markdown.update(
+        "", visible=False)
+    lora_mode_info = get_info_of_available_lora_model(lora_model)
+    if lora_mode_info and isinstance(lora_mode_info, dict):
+        model_prompt_template = lora_mode_info.get("prompt_template")
+        if model_prompt_template and model_prompt_template != prompt_template:
+            model_prompt_template_message_update = gr.Markdown.update(
+                f"Trained with prompt template `{model_prompt_template}`", visible=True)
+
+    return [model_prompt_template_message_update] + gr_updates
+
+
+def handle_lora_model_change(lora_model, prompt_template):
+    lora_mode_info = get_info_of_available_lora_model(lora_model)
+    if not lora_mode_info:
+        return gr.Markdown.update("", visible=False), prompt_template
+
+    if not isinstance(lora_mode_info, dict):
+        return gr.Markdown.update("", visible=False), prompt_template
+
+    model_prompt_template = lora_mode_info.get("prompt_template")
+    if not model_prompt_template:
+        return gr.Markdown.update("", visible=False), prompt_template
+
+    available_template_names = get_available_template_names()
+    if model_prompt_template in available_template_names:
+        return gr.Markdown.update("", visible=False), model_prompt_template
+
+    return gr.Markdown.update(f"Trained with prompt template `{model_prompt_template}`", visible=True), prompt_template
 
 
 def update_prompt_preview(prompt_template,
@@ -200,12 +295,15 @@ def inference_ui():
 
     with gr.Blocks() as inference_ui_blocks:
         with gr.Row():
-            lora_model = gr.Dropdown(
-                label="LoRA Model",
-                elem_id="inference_lora_model",
-                value="tloen/alpaca-lora-7b",
-                allow_custom_value=True,
-            )
+            with gr.Column(elem_id="inference_lora_model_group"):
+                model_prompt_template_message = gr.Markdown(
+                    "", visible=False, elem_id="inference_lora_model_prompt_template_message")
+                lora_model = gr.Dropdown(
+                    label="LoRA Model",
+                    elem_id="inference_lora_model",
+                    value="tloen/alpaca-lora-7b",
+                    allow_custom_value=True,
+                )
             prompt_template = gr.Dropdown(
                 label="Prompt Template",
                 elem_id="inference_prompt_template",
@@ -318,7 +416,7 @@ def inference_ui():
             with gr.Column(elem_id="inference_output_group_container"):
                 with gr.Column(elem_id="inference_output_group"):
                     inference_output = gr.Textbox(
-                        lines=12, label="Output", elem_id="inference_output")
+                        lines=inference_output_lines, label="Output", elem_id="inference_output")
                     inference_output.style(show_copy_button=True)
                     with gr.Accordion(
                             "Raw Output",
@@ -346,10 +444,20 @@ def inference_ui():
         )
         things_that_might_timeout.append(reload_selections_event)
 
-        prompt_template_change_event = prompt_template.change(fn=handle_prompt_template_change, inputs=[prompt_template], outputs=[
-            variable_0, variable_1, variable_2, variable_3, variable_4, variable_5, variable_6, variable_7])
+        prompt_template_change_event = prompt_template.change(
+            fn=handle_prompt_template_change,
+            inputs=[prompt_template, lora_model],
+            outputs=[
+                model_prompt_template_message,
+                variable_0, variable_1, variable_2, variable_3, variable_4, variable_5, variable_6, variable_7])
         things_that_might_timeout.append(prompt_template_change_event)
 
+        lora_model_change_event = lora_model.change(
+            fn=handle_lora_model_change,
+            inputs=[lora_model, prompt_template],
+            outputs=[model_prompt_template_message, prompt_template])
+        things_that_might_timeout.append(lora_model_change_event)
+
         generate_event = generate_btn.click(
             fn=do_inference,
             inputs=[
@@ -369,8 +477,12 @@ def inference_ui():
             outputs=[inference_output, inference_raw_output],
             api_name="inference"
         )
-        stop_btn.click(fn=None, inputs=None, outputs=None,
-                       cancels=[generate_event])
+        stop_btn.click(
+            fn=handle_stop_generate,
+            inputs=None,
+            outputs=None,
+            cancels=[generate_event]
+        )
 
         update_prompt_preview_event = update_prompt_preview_btn.click(fn=update_prompt_preview, inputs=[prompt_template,
                                                                                                         variable_0, variable_1, variable_2, variable_3,
@@ -543,9 +655,15 @@ def inference_ui():
       return function (...args) {
        const context = this;
        clearTimeout(timeout);
-        timeout = setTimeout(() => {
+        const fn = () => {
+          if (document.querySelector('#inference_preview_prompt > .wrap:not(.hide)')) {
+            // Preview request is still loading, wait for 10ms and try again.
+            timeout = setTimeout(fn, 10);
+            return;
+          }
          func.apply(context, args);
-        }, wait);
+        };
+        timeout = setTimeout(fn, wait);
      };
    }
 
@@ -580,5 +698,27 @@ def inference_ui():
        });
      }
    }, 100);
+
+    // [WORKAROUND-UI01]
+    setTimeout(function () {
+      const inference_output_textarea = document.querySelector(
+        '#inference_output textarea'
+      );
+      if (!inference_output_textarea) return;
+      const observer = new MutationObserver(function () {
+        if (inference_output_textarea.getAttribute('rows') === '1') {
+          setTimeout(function () {
+            const inference_generate_btn = document.getElementById(
+              'inference_generate_btn'
+            );
+            if (inference_generate_btn) inference_generate_btn.click();
+          }, 10);
+        }
+      });
+      observer.observe(inference_output_textarea, {
+        attributes: true,
+        attributeFilter: ['rows'],
+      });
+    }, 100);
    }
  """)
llama_lora/ui/main_page.py CHANGED
@@ -30,7 +30,7 @@ def main_page():
         tokenizer_ui()
     info = []
     if Global.version:
-        info.append(f"LLaMA-LoRA `{Global.version}`")
+        info.append(f"LLaMA-LoRA Tuner `{Global.version}`")
     info.append(f"Base model: `{Global.base_model}`")
     if Global.ui_show_sys_info:
         info.append(f"Data dir: `{Global.data_dir}`")
@@ -134,6 +134,41 @@ def main_page_custom_css():
         /* text-transform: uppercase; */
     }
 
+    #inference_lora_model_group {
+        border-radius: var(--block-radius);
+        background: var(--block-background-fill);
+    }
+    #inference_lora_model_group #inference_lora_model {
+        background: transparent;
+    }
+    #inference_lora_model_prompt_template_message:not(.hidden) + #inference_lora_model {
+        padding-bottom: 28px;
+    }
+    #inference_lora_model_group > #inference_lora_model_prompt_template_message {
+        position: absolute;
+        bottom: 8px;
+        left: 20px;
+        z-index: 1;
+        font-size: 12px;
+        opacity: 0.7;
+    }
+    #inference_lora_model_group > #inference_lora_model_prompt_template_message p {
+        font-size: 12px;
+    }
+    #inference_lora_model_prompt_template_message > .wrap {
+        display: none;
+    }
+    #inference_lora_model > .wrap:first-child:not(.hide),
+    #inference_prompt_template > .wrap:first-child:not(.hide) {
+        opacity: 0.5;
+    }
+    #inference_lora_model_group, #inference_lora_model {
+        z-index: 60;
+    }
+    #inference_prompt_template {
+        z-index: 55;
+    }
+
     #inference_prompt_box > *:first-child {
         border-bottom-left-radius: 0;
         border-bottom-right-radius: 0;
@@ -266,12 +301,16 @@ def main_page_custom_css():
     }
 
     @media screen and (min-width: 640px) {
-        #inference_lora_model, #finetune_template {
+        #inference_lora_model, #inference_lora_model_group,
+        #finetune_template {
            border-top-right-radius: 0;
            border-bottom-right-radius: 0;
            border-right: 0;
            margin-right: -16px;
        }
+        #inference_lora_model_group #inference_lora_model {
+            box-shadow: var(--block-shadow);
+        }
 
        #inference_prompt_template {
            border-top-left-radius: 0;
@@ -301,7 +340,7 @@ def main_page_custom_css():
            height: 42px !important;
            min-width: 42px !important;
            width: 42px !important;
-            z-index: 1;
+            z-index: 61;
        }
    }
 
llama_lora/utils/data.py CHANGED
@@ -52,6 +52,22 @@ def get_path_of_available_lora_model(name):
     return None
 
 
+def get_info_of_available_lora_model(name):
+    try:
+        if "/" in name:
+            return None
+        path_of_available_lora_model = get_path_of_available_lora_model(
+            name)
+        if not path_of_available_lora_model:
+            return None
+
+        with open(os.path.join(path_of_available_lora_model, "info.json"), "r") as json_file:
+            return json.load(json_file)
+
+    except Exception as e:
+        return None
+
+
 def get_dataset_content(name):
     file_name = os.path.join(Global.data_dir, "datasets", name)
     if not os.path.exists(file_name):
llama_lora/utils/lru_cache.py ADDED
@@ -0,0 +1,27 @@
+from collections import OrderedDict
+
+
+class LRUCache:
+    def __init__(self, capacity=5):
+        self.cache = OrderedDict()
+        self.capacity = capacity
+
+    def get(self, key):
+        if key in self.cache:
+            # Move the accessed item to the end of the OrderedDict
+            self.cache.move_to_end(key)
+            return self.cache[key]
+        return None
+
+    def set(self, key, value):
+        if key in self.cache:
+            # If the key already exists, update its value
+            self.cache[key] = value
+        else:
+            # If the cache has reached its capacity, remove the least recently used item
+            if len(self.cache) >= self.capacity:
+                self.cache.popitem(last=False)
+            self.cache[key] = value
+
+    def clear(self):
+        self.cache.clear()
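A quick usage sketch of the new cache (capacity and keys are arbitrary examples); note that `get()` refreshes recency, so the least recently used entry is the one evicted:

```python
# Usage sketch for llama_lora.utils.lru_cache.LRUCache; values here are arbitrary examples.
from llama_lora.utils.lru_cache import LRUCache

cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)

assert cache.get("a") == 1   # "a" becomes the most recently used entry
cache.set("c", 3)            # capacity reached: evicts "b", the least recently used

assert cache.get("b") is None
assert cache.get("a") == 1 and cache.get("c") == 3

cache.clear()                # used by unload_models() to drop all cached adapters
```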
requirements.lock.txt CHANGED
@@ -65,10 +65,10 @@ packaging==23.0
 pandas==2.0.0
 parso==0.8.3
 pathspec==0.11.1
-peft @ git+https://github.com/huggingface/peft.git@deff03f2c251534fffd2511fc2d440e84cc54b1b
+peft @ git+https://github.com/huggingface/peft.git@382b178911edff38c1ff619bbac2ba556bd2276b
 pexpect==4.8.0
 pickleshare==0.7.5
-Pillow==9.5.0
+Pillow==9.3.0
 pkgutil_resolve_name==1.3.10
 platformdirs==3.2.0
 pluggy==1.0.0
templates/user_and_ai.json ADDED
@@ -0,0 +1,7 @@
+{
+    "description": "Unhelpful AI assistant.",
+    "variables": ["instruction"],
+    "prompt": "### User:\n{instruction}\n\n### AI:\n",
+    "default": "prompt",
+    "response_split": "### AI:"
+}
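For context, a template like this is filled by substituting the declared variables into `prompt`, and the model's reply is recovered by splitting its output on `response_split`. The project's `Prompter` class is not part of this diff, so the following is only a standalone illustration of that flow:

```python
# Standalone illustration of how templates/user_and_ai.json is intended to be used;
# the real Prompter implementation is not shown in this commit.
import json

with open("templates/user_and_ai.json") as f:
    template = json.load(f)

# Fill the single declared variable ("instruction") into the prompt.
prompt = template["prompt"].format(instruction="Summarize LoRA in one sentence.")
# prompt == "### User:\nSummarize LoRA in one sentence.\n\n### AI:\n"

# A model typically echoes the prompt; the response is whatever follows response_split.
model_output = prompt + "LoRA fine-tunes a model by training small low-rank adapter weights."
response = model_output.split(template["response_split"])[-1].strip()
print(response)
```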