+
+# 📝 AutoPrompt
+
+
+
+
+
+**AutoPrompt is a prompt optimization framework designed to enhance and perfect your prompts for real-world use cases.**
+
+The framework automatically generates high-quality, detailed prompts tailored to user intentions. It employs a refinement (calibration) process in which it iteratively builds a dataset of challenging edge cases and optimizes the prompt accordingly. This approach reduces manual effort in prompt engineering and addresses common issues such as prompt [sensitivity](https://arxiv.org/abs/2307.09009) and inherent prompt [ambiguity](https://arxiv.org/abs/2311.04205).
+
+
+**Our mission:** Empower users to produce high-quality, robust prompts using the power of large language models (LLMs).
+
+# Why AutoPrompt?
+- **Prompt Engineering Challenges.** The quality of LLM output depends heavily on the prompt used. Even [minor changes](#prompt-sensitivity-example) can significantly affect performance.
+- **Benchmarking Challenges.** Creating a benchmark for production-grade prompts is often labour-intensive and time-consuming.
+- **Reliable Prompts.** AutoPrompt generates robust, high-quality prompts, offering measured accuracy and performance enhancement using minimal data and annotation steps.
+- **Modularity and Adaptability.** With modularity at its core, AutoPrompt integrates seamlessly with popular open-source tools such as LangChain, Wandb, and Argilla, and can be adapted for a variety of tasks, including data synthesis and prompt migration.
+
+## System Overview
+
+![System Overview](./docs/AutoPrompt_Diagram.png)
+
+The system is designed for real-world scenarios, such as moderation tasks, which are often challenged by imbalanced data distributions. The system implements the [Intent-based Prompt Calibration](https://arxiv.org/abs/2402.03099) method. The process begins with a user-provided initial prompt and task description, optionally including user examples. The refinement process iteratively generates diverse samples, annotates them via user/LLM, and evaluates prompt performance, after which an LLM suggests an improved prompt.
+
+The optimization process can be extended to content generation tasks by first devising a ranker prompt and then performing the prompt optimization with this learned ranker. The optimization concludes upon reaching the budget or iteration limit.
+
+
+This joint synthetic data generation and prompt optimization approach outperforms traditional methods while requiring minimal data and iterations. Learn more in our paper
+[Intent-based Prompt Calibration: Enhancing prompt optimization with synthetic boundary cases](https://arxiv.org/abs/2402.03099) by E. Levi et al. (2024).
+
+
+**Using GPT-4 Turbo, this optimization typically completes in just a few minutes at a cost of under $1.** To manage the cost of GPT-4 token usage, the framework lets users set a budget limit for optimization, in USD or token count, configured as illustrated [here](docs/examples.md#steps-to-run-example).
+
+## Demo
+
+![pipeline_recording](./docs/autoprompt_recording.gif)
+
+
+## 📖 Documentation
+ - [How to install](docs/installation.md) (Setup instructions)
+ - [Prompt optimization examples](docs/examples.md) (Use cases: movie review classification, generation, and chat moderation)
+ - [How it works](docs/how-it-works.md) (Explanation of pipelines)
+ - [Architecture guide](docs/architecture.md) (Overview of main components)
+
+## Features
+- 📝 Boosts prompt quality with a minimal amount of data and annotation steps.
+- 🛬 Designed for production use cases like moderation, multi-label classification, and content generation.
+- ⚙️ Enables seamless migration of prompts across model versions or LLM providers.
+- 🎓 Supports prompt squeezing. Combine multiple rules into a single efficient prompt.
+
+
+## QuickStart
+AutoPrompt requires `python <= 3.10`
+
+
+> **Step 1** - Download the project
+
+```bash
+git clone git@github.com:Eladlev/AutoPrompt.git
+cd AutoPrompt
+```
+
+
+
+> **Step 2** - Install dependencies
+
+Use Conda, pip, or pipenv, depending on your preference. Using Conda:
+```bash
+conda env create -f environment_dev.yml
+conda activate AutoPrompt
+```
+
+Using pip:
+```bash
+pip install -r requirements.txt
+```
+
+Using pipenv:
+```bash
+pip install pipenv
+pipenv sync
+```
+
+
+
+> **Step 3** - Configure your LLM.
+
+Set your OpenAI API key by updating the configuration file `config/llm_env.yml`
+- If you need help locating your API key, visit this [link](https://help.openai.com/en/articles/4936850-where-do-i-find-my-api-key).
+
+- We recommend using [OpenAI's GPT-4](https://platform.openai.com/docs/guides/gpt) for the LLM. Our framework also supports other providers and open-source models, as discussed [here](docs/installation.md#configure-your-llm).
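+
+A minimal sketch of the relevant part of `config/llm_env.yml`, assuming an OpenAI-style provider. `<your-openai-api-key>` is a placeholder rather than a real value, and the exact set of keys may differ in your version of the file:
+
+```yaml
+openai:
+    OPENAI_API_KEY: '<your-openai-api-key>'  # placeholder; keep real keys out of version control
+    OPENAI_ORGANIZATION: ''                  # optional; leave empty if unused
+```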
+
+
+
+> **Step 4** - Configure your Annotator
+- Select an annotation approach for your project. We recommend beginning with a human-in-the-loop method, utilizing [Argilla](https://docs.argilla.io/en/latest/index.html). Follow the [Argilla setup instructions](https://docs.argilla.io/en/latest/getting_started/installation/deployments/huggingface-spaces.html) to configure your server. Alternatively, you can set up an LLM as your annotator by following these [configuration steps](docs/installation.md#configure-llm-annotator).
+
+- The default predictor LLM, GPT-3.5, for estimating prompt performance, is configured in the `predictor` section of `config/config_default.yml`.
+
+- Define your budget in the input config YAML file using the `max_usage` parameter. For OpenAI models, `max_usage` sets the maximum spend in USD. For other LLMs, it limits the maximum token count.
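+
+As a rough illustration, the budget sits next to the other stop criteria in `config/config_default.yml`. The values below are examples only, not recommendations:
+
+```yaml
+stop_criteria:
+    max_usage: 2      # USD for OpenAI models, otherwise a token count
+    patience: 10      # number of iterations to wait for an improvement
+    min_delta: 0.01   # minimum change that counts as an improvement
+```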
+
+
+
+
+> **Step 5** - Run the pipeline
+
+First, configure your labels by editing `config/config_default.yml`
+```
+dataset:
+    label_schema: ["Yes", "No"]
+```
+
+
+For a **classification pipeline**, use the following command from your terminal within the appropriate working directory:
+```bash
+python run_pipeline.py
+```
+If the initial prompt and task description are not provided directly as input, you will be guided to provide these details. Alternatively, specify them as command-line arguments:
+```bash
+python run_pipeline.py \
+ --prompt "Does this movie review contain a spoiler? answer Yes or No" \
+ --task_description "Assistant is an expert classifier that will classify a movie review, and let the user know if it contains a spoiler for the reviewed movie or not." \
+ --num_steps 30
+```
+You can track the optimization progress using the [W&B](https://wandb.ai/site) dashboard, with setup instructions available [here](docs/installation.md#monitoring-weights-and-biases-setup).
+
+
+If you are using pipenv, be sure to activate the environment:
+```bash
+pipenv shell
+python run_pipeline.py
+```
+or alternatively prefix your command with `pipenv run`:
+```bash
+pipenv run python run_pipeline.py
+```
+
+#### Generation pipeline
+To run the generation pipeline, use the following example command:
+```bash
+python run_generation_pipeline.py \
+ --prompt "Write a good and comprehensive movie review about a specific movie." \
+ --task_description "Assistant is a large language model that is tasked with writing movie reviews."
+```
+For more information, refer to our [generation task example](docs/examples.md#generating-movie-reviews-generation-task).
+
+
+
+Enjoy the results. Completion of these steps yields a **refined (calibrated)
+prompt** tailored for your task, alongside a **benchmark** featuring challenging samples,
+stored in the default `dump` path.
+
+
+
+## Tips
+
+- Prompt accuracy may fluctuate during the optimization. To identify the best prompts, we recommend continuous refinement following the initial generation of the benchmark. Set the number of optimization iterations with `--num_steps` and control sample generation by specifying `max_samples` in the `dataset` section. For instance, setting `max_samples: 50` and `--num_steps 30` limits the benchmark to 50 samples, allowing for 25 additional refinement iterations, assuming 10 samples per iteration.
+
+- The framework supports checkpoints for easy resumption of optimization from the last saved state. It automatically saves the most recent optimization state in a `dump` path. Use `--output_dump` to set this path and `--load_path` to resume from a checkpoint.
+- Each iteration makes multiple calls to the LLM service, with long prompts and requests for a relatively large number of generated tokens. This can take around a minute (especially in generation tasks), so please be patient.
+- If you run into Argilla server connection errors, try restarting the Space.
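+
+To make the first tip concrete, here is a sketch of the arithmetic, assuming 10 generated samples per iteration (`num_generated_samples` in the `meta_prompts` section of the config):
+
+```yaml
+dataset:
+    max_samples: 50              # the benchmark is capped at 50 samples
+meta_prompts:
+    num_generated_samples: 10    # samples generated per iteration
+# 50 / 10 = 5 iterations fill the benchmark;
+# with --num_steps 30, the remaining 30 - 5 = 25 iterations only refine the prompt.
+```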
+
+
+
+## Prompt Sensitivity Example
+You write a prompt for identifying movie spoilers:
+```
+Review the content provided and indicate whether it includes any significant plot revelations or critical points that could reveal important elements of the story or its outcome. Respond with "Yes" if it contains such spoilers or critical insights, and "No" if it refrains from unveiling key story elements.
+```
+This prompt scores 81 on your [benchmark](docs/examples.md#filtering-movie-reviews-with-spoilers-classification-task) using GPT-4 LLM. Then, you make a minor modification:
+```
+Review the text and determine if it provides essential revelations or critical details about the story that would constitute a spoiler. Respond with "Yes" for the presence of spoilers, and "No" for their absence.
+```
+Surprisingly, the second prompt scores 72, a drop of 9 points (about 11% relative to the original score). This illustrates the need for a careful prompt engineering process.
+
+## 🚀 Contributing
+
+Your contributions are greatly appreciated! If you're eager to contribute, kindly refer to our [Contributing Guidelines](docs/contributing.md) for detailed information.
+
+
+If you wish to be a part of our journey, we invite you to connect with us through our [Discord Community](https://discord.gg/G2rSbAf8uP). We're excited to have you on board!
+
+## 🛡 Disclaimer
+
+The AutoPrompt project is provided on an "as-is" basis without any guarantees or warranties, expressed or implied.
+
+Our perspective on the optimization and usage of prompts:
+
+1. The core objective of AutoPrompt is to refine and perfect prompts to achieve high-quality results. This is achieved through an iterative calibration process, which helps in reducing errors and enhancing the performance of LLMs. However, the framework does not guarantee absolute correctness or unbiased results in every instance.
+
+2. AutoPrompt aims to improve the reliability of prompts and mitigate sensitivity issues, but it does not claim to completely eliminate such issues.
+
+
+Please note that using LLMs like OpenAI's GPT-4, supported by AutoPrompt, may lead to significant costs due to token usage. By using AutoPrompt, you acknowledge your responsibility to monitor and manage your token use and expenses. We advise regularly reviewing your LLM provider's API usage and establishing limits or alerts to prevent unexpected charges.
+
+## Citation
+
+If you have used our code in your research, please cite our [paper](https://arxiv.org/abs/2402.03099):
+
+```
+@misc{2402.03099,
+Author = {Elad Levi and Eli Brosh and Matan Friedmann},
+Title = {Intent-based Prompt Calibration: Enhancing prompt optimization with synthetic boundary cases},
+Year = {2024},
+Eprint = {arXiv:2402.03099},
+}
+```
+
+
+## License
+
+This framework is licensed under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0).
+
+## ✉️ Support / Contact us
+- [Community Discord](https://discord.gg/G2rSbAf8uP)
+- Our email: [autopromptai@gmail.com](mailto:autopromptai@gmail.com)
+
+
diff --git a/AutoPrompt/config/config_default.yml b/AutoPrompt/config/config_default.yml
new file mode 100644
index 0000000000000000000000000000000000000000..54d5c809f487c9c6b941ec1254690bdc513d24f4
--- /dev/null
+++ b/AutoPrompt/config/config_default.yml
@@ -0,0 +1,58 @@
+use_wandb: False
+dataset:
+    name: 'dataset'
+    records_path: null
+    initial_dataset: ''
+    label_schema: ["Yes", "No"]
+    max_samples: 50
+    semantic_sampling: False  # Change to True in case you don't have M1. Currently there is an issue with faiss and M1
+
+annotator:
+    method : 'argilla'
+    config:
+        api_url: 'https://kenken999-arglira.hf.space'
+        api_key: 'admin.apikey'
+        workspace: 'admin'
+        time_interval: 5
+
+predictor:
+    method : 'llm'
+    config:
+        llm:
+            type: 'OpenAI'
+            name: 'llama3-8b-8192'
+#            async_params:
+#                retry_interval: 10
+#                max_retries: 2
+            model_kwargs: {"seed": 220}
+        num_workers: 5
+        prompt: 'prompts/predictor_completion/prediction.prompt'
+        mini_batch_size: 1  # Change to >1 if you want to include multiple samples in one prompt
+        mode: 'prediction'
+
+meta_prompts:
+    folder: 'prompts/meta_prompts_classification'
+    num_err_prompt: 1  # Number of error examples per sample in the prompt generation
+    num_err_samples: 2  # Number of error examples per sample in the sample generation
+    history_length: 4  # Number of samples in the meta-prompt history
+    num_generated_samples: 10  # Number of generated samples at each iteration
+    num_initialize_samples: 10  # Number of generated samples at iteration 0, in zero-shot case
+    samples_generation_batch: 10  # Number of samples generated in one call to the LLM
+    num_workers: 5  # Number of parallel workers
+    warmup: 4  # Number of warmup steps
+
+eval:
+    function_name: 'accuracy'
+    num_large_errors: 4
+    num_boundary_predictions : 0
+    error_threshold: 0.5
+
+llm:
+    type: 'OpenAI'
+    name: 'llama3-70b-8192'
+    temperature: 0.8
+
+stop_criteria:
+    max_usage: 2  # In $ in case of OpenAI models, otherwise number of tokens
+    patience: 10  # Number of patience steps
+    min_delta: 0.01  # Delta for the improvement definition
diff --git a/AutoPrompt/config/config_diff/config_batch_classification.yml b/AutoPrompt/config/config_diff/config_batch_classification.yml
new file mode 100644
index 0000000000000000000000000000000000000000..151581cb96f8854e2168cda054697c3006471bc1
--- /dev/null
+++ b/AutoPrompt/config/config_diff/config_batch_classification.yml
@@ -0,0 +1,14 @@
+use_wandb: True
+dataset:
+    label_schema: ["Yes", "No"]
+
+annotator:
+    method : 'llm_batch'
+    config:
+        instructions: ['Is there an address in the text?', 'Is there a phone number in the text?',
+                       'Is there a password in the text?']
+        aggregation_mode: 'exist'  # 'majority_vote', 'exist', or 'all'. 'exist'/'all' work only when label_schema is ["Yes", "No"]!
+        estimator_config:
+            num_workers: 2
+            prompt: 'prompts/predictor/prediction.prompt'
+            mode: 'annotation'
\ No newline at end of file
diff --git a/AutoPrompt/config/config_diff/config_generation.yml b/AutoPrompt/config/config_diff/config_generation.yml
new file mode 100644
index 0000000000000000000000000000000000000000..523c1f89d35e3255043a7254a6953889d54ee23e
--- /dev/null
+++ b/AutoPrompt/config/config_diff/config_generation.yml
@@ -0,0 +1,25 @@
+annotator:
+    method : ''
+
+dataset:
+    max_samples: 20
+    label_schema: ["1","2","3","4","5"]
+
+predictor:
+    method : 'llm'
+    config:
+        prompt: 'prompts/predictor_completion/prediction_generation.prompt'
+        mini_batch_size: 1
+        llm:
+            type: 'OpenAI'
+            name: 'llama3-70b-8192' #'gpt-3.5-turbo-0613'
+        num_workers: 7
+
+meta_prompts:
+    folder: 'prompts/meta_prompts_generation'
+    warmup: 1
+
+eval:
+    function_name: 'ranking'
+    error_threshold: 4
+
diff --git a/AutoPrompt/config/config_diff/config_ranking.yml b/AutoPrompt/config/config_diff/config_ranking.yml
new file mode 100644
index 0000000000000000000000000000000000000000..148fdf7af4218e2fc33e9d26efb4bac227a76879
--- /dev/null
+++ b/AutoPrompt/config/config_diff/config_ranking.yml
@@ -0,0 +1,5 @@
+dataset:
+    label_schema: ["1","2","3","4","5"]
+
+meta_prompts:
+    folder: 'prompts/meta_prompts_ranking'
\ No newline at end of file
diff --git a/AutoPrompt/config/llm_env.yml b/AutoPrompt/config/llm_env.yml
new file mode 100644
index 0000000000000000000000000000000000000000..d2116de07ee0a0000bf6d7e4ba60087b2250907d
--- /dev/null
+++ b/AutoPrompt/config/llm_env.yml
@@ -0,0 +1,12 @@
+openai:
+    OPENAI_API_KEY: ''  # Set your own key here; never commit real credentials
+    OPENAI_API_BASE: 'https://api.groq.com/openai/v1'
+    OPENAI_ORGANIZATION: ''
+
+azure:
+    AZURE_OPENAI_API_KEY: ''
+    AZURE_OPENAI_ENDPOINT: ''
+    OPENAI_API_VERSION: ''
+
+google:
+    GOOGLE_API_KEY: ''
\ No newline at end of file
diff --git a/AutoPrompt/dataset/base_dataset.py b/AutoPrompt/dataset/base_dataset.py
new file mode 100644
index 0000000000000000000000000000000000000000..b75a395865464b4ac7045ecee6fbae840f02a8c4
--- /dev/null
+++ b/AutoPrompt/dataset/base_dataset.py
@@ -0,0 +1,158 @@
+import os.path
+import logging
+import pandas as pd
+from pathlib import Path
+from datetime import datetime
+import csv
+
+from utils.dedup import Dedup
+
+class DatasetBase:
+    """
+    This class stores and manages all the dataset records (including the annotations and predictions)
+    """
+
+    def __init__(self, config):
+        if config.records_path is None:
+            self.records = pd.DataFrame(columns=['id', 'text', 'prediction',
+                                                 'annotation', 'metadata', 'score', 'batch_id'])
+        else:
+            self.records = pd.read_csv(config.records_path)
+        dt_string = datetime.now().strftime("%d_%m_%Y_%H_%M_%S")
+
+        self.name = config.name + '__' + dt_string
+        self.label_schema = config.label_schema
+        self.dedup = Dedup(config)
+        self.sample_size = config.get("sample_size", 3)
+        self.semantic_sampling = config.get("semantic_sampling", False)
+        if not config.get('dedup_new_samples', False):
+            self.remove_duplicates = self._null_remove
+
+    def __len__(self):
+        """
+        Return the number of samples in the dataset.
+        """
+        return len(self.records)
+
+    def __getitem__(self, batch_idx):
+        """
+        Return the records that belong to batch `batch_idx`.
+        """
+        extract_records = self.records[self.records['batch_id'] == batch_idx]
+        extract_records = extract_records.reset_index(drop=True)
+        return extract_records
+
+    def get_leq(self, batch_idx):
+        """
+        Return all the records up to batch_idx (inclusive).
+        """
+        extract_records = self.records[self.records['batch_id'] <= batch_idx]
+        extract_records = extract_records.reset_index(drop=True)
+        return extract_records
+
+    def add(self, sample_list: list = None, batch_id: int = None, records: pd.DataFrame = None):
+        """
+        Add records to the dataset.
+        :param sample_list: The sample texts to add (only used in case records=None)
+        :param batch_id: The batch_id for the uploaded records (only used in case records=None)
+        :param records: A dataframe of records to append as-is
+        """
+        if records is None:
+            records = pd.DataFrame([{'id': len(self.records) + i, 'text': sample, 'batch_id': batch_id} for
+                                    i, sample in enumerate(sample_list)])
+        self.records = pd.concat([self.records, records], ignore_index=True)
+
+    def update(self, records: pd.DataFrame):
+        """
+        Update records in dataset.
+        """
+        # Ignore if records is empty
+        if len(records) == 0:
+            return
+
+        # Set 'id' as the index for both DataFrames (note: this re-indexes the caller's `records` in place)
+        records.set_index('id', inplace=True)
+        self.records.set_index('id', inplace=True)
+
+        # Update using 'id' as the key
+        self.records.update(records)
+
+        # Remove records whose annotation was discarded
+        if len(self.records.loc[self.records["annotation"]=="Discarded"]) > 0:
+            discarded_annotation_records = self.records.loc[self.records["annotation"]=="Discarded"]
+            #TODO: direct `discarded_annotation_records` to another dataset to be used later for corner-cases
+            self.records = self.records.loc[self.records["annotation"]!="Discarded"]
+
+        # Reset index
+        self.records.reset_index(inplace=True)
+
+    def modify(self, index: int, record: dict):
+        """
+        Modify a record in the dataset (only the columns present in `record` are changed).
+        """
+        self.records.loc[index, list(record.keys())] = list(record.values())
+
+    def apply(self, function, column_name: str):
+        """
+        Apply function on each record and store the result in `column_name`.
+        """
+        self.records[column_name] = self.records.apply(function, axis=1)
+
+    def save_dataset(self, path: Path):
+        self.records.to_csv(path, index=False, quoting=csv.QUOTE_NONNUMERIC)
+
+    def load_dataset(self, path: Path):
+        """
+        Load the dataset from a csv file
+        :param path: path for the csv
+        """
+        if os.path.isfile(path):
+            self.records = pd.read_csv(path, dtype={'annotation': str, 'prediction': str, 'batch_id': int})
+        else:
+            logging.warning('Dataset dump not found, initializing from zero')
+
+    def remove_duplicates(self, samples: list) -> list:
+        """
+        Remove (soft) duplicates from the given samples
+        :param samples: The samples
+        :return: The samples without duplicates
+        """
+        dd = self.dedup.copy()
+        df = pd.DataFrame(samples, columns=['text'])
+        df_dedup = dd.sample(df, operation_function=min)
+        return df_dedup['text'].tolist()
+
+    def _null_remove(self, samples: list) -> list:
+        # Identity function that returns the input unmodified
+        return samples
+
+    def sample_records(self, n: int = None) -> pd.DataFrame:
+        """
+        Return a sample of the records (after semantic clustering if semantic_sampling is enabled)
+        :param n: The number of samples to return
+        :return: A sample of the records
+        """
+        n = n or self.sample_size
+        if self.semantic_sampling:
+            dd = self.dedup.copy()
+            df_samples = dd.sample(self.records).head(n)
+
+            if len(df_samples) < n:
+                df_samples = self.records.head(n)
+        else:
+            df_samples = self.records.sample(n)
+        return df_samples
+
+    @staticmethod
+    def samples_to_text(records: pd.DataFrame) -> str:
+        """
+        Return a string that organizes the samples for a meta-prompt
+        :param records: The samples for the step
+        :return: A string that contains the organized samples
+        """
+        txt_res = '##\n'
+        for _, row in records.iterrows():
+            txt_res += f"Sample:\n {row.text}\n#\n"
+        return txt_res
+
+
diff --git a/AutoPrompt/docs/AutoPrompt_Diagram.png b/AutoPrompt/docs/AutoPrompt_Diagram.png
new file mode 100644
index 0000000000000000000000000000000000000000..d6ac00078e2b9890e3c96688047a878e9627c4ad
Binary files /dev/null and b/AutoPrompt/docs/AutoPrompt_Diagram.png differ
diff --git a/AutoPrompt/docs/arch_overview.png b/AutoPrompt/docs/arch_overview.png
new file mode 100644
index 0000000000000000000000000000000000000000..733ba5a52d82842f5d01ad5254b8490c5a3e2059
Binary files /dev/null and b/AutoPrompt/docs/arch_overview.png differ
diff --git a/AutoPrompt/docs/architecture.md b/AutoPrompt/docs/architecture.md
new file mode 100644
index 0000000000000000000000000000000000000000..7ab69f0d462deba6f8c7db5216a8fd3c42e16425
--- /dev/null
+++ b/AutoPrompt/docs/architecture.md
@@ -0,0 +1,18 @@
+# Architecture Guide
+
+
+This document outlines the system design of AutoPrompt, which is built around four primary components: Dataset, Estimator, Evaluator, and Optimizer Manager. These components collaborate to refine prompts through an iterative process involving sample generation, annotation, prediction, evaluation of scores, and optimization.
+
+* __Dataset.__ This component manages the dataset and performs operations on its rows, such as insertion, modification, deletion, and applying functions. It also handles data cleaning by removing semantic duplicates and performing semantic sampling. Since the system is optimized for small datasets, the current implementation is based on a local database using [pandas](https://pandas.pydata.org).
+* __Estimator.__ The estimator is responsible for estimating a batch of samples. We implement this component in two forms, once for the predictions and once for the annotations. Such a generic implementation (for both use cases) allows for easy adaptation of the system to diverse use cases, including prompt calibration, prompt distillation, and prompt squeezing. The currently supported types of estimators are:
+    1. __Human annotation__: Using [Argilla UI](https://docs.argilla.io/en/latest/index.html#). The system connects to the Argilla server and waits until the annotation task is completed.
+    2. __LLM estimator__: Using an LLM to estimate the sample given a prompt. We support various types of LLMs, using [Langchain](https://python.langchain.com/docs/get_started/introduction) integration. For efficiency, the system supports parallelism using both workers and async calls. The system also supports sending a few samples in one prompt (prompt batching), which can reduce the cost significantly.
+    3. __Batch estimator__: The batch estimator runs multiple LLM estimators and adds a policy layer to aggregate the results. It is mainly used for prompt squeezing, which aims to optimize a single prompt that achieves the efficacy of multiple prompts, for example when a user has several moderation rules.
+* __Evaluator.__ The evaluator is responsible for evaluating the records after the prediction and annotation stage. The evaluator accepts a function and applies it to each row. Note that the function is generic: in the generation pipeline, for example, it is performed by invoking an LLM. The evaluator is also responsible for defining the errors and handling the error analysis using the Analyzer meta-prompt.
+* __Optimizer manager (Optimization Pipeline).__ The optimizer manager handles the whole optimization process flow. It performs the iteration steps described in the system flow [documentation](how-it-works.md) and is responsible for stopping and returning the final calibrated prompt. The currently supported stop criteria are either convergence (determined by a patience hyperparameter) or a usage limit (determined by maximal cost if relevant, or by the number of generated tokens).
+
+## Design Considerations
+
+- **Modularity and Flexibility**: Each component is designed with modularity in mind, allowing for easy swaps or upgrades to accommodate different use cases.
+- **Scalability**: The framework's architecture supports scaling, from handling small datasets efficiently to accommodating the computational demands of parallel processing and batch estimation.
+- **Cost-Efficiency**: Features like prompt batching and the use of a batch estimator are specifically included to manage and minimize operational costs associated with LLM usage.
diff --git a/AutoPrompt/docs/argilla_movie_spoilers_example.png b/AutoPrompt/docs/argilla_movie_spoilers_example.png
new file mode 100644
index 0000000000000000000000000000000000000000..e7eb59d2cbf18dad6edbfa539e098dbb5e3fee85
Binary files /dev/null and b/AutoPrompt/docs/argilla_movie_spoilers_example.png differ
diff --git a/AutoPrompt/docs/autoprompt_recording.gif b/AutoPrompt/docs/autoprompt_recording.gif
new file mode 100644
index 0000000000000000000000000000000000000000..b186fa04a5addb4b72ad265d5f9a217482f6781b
--- /dev/null
+++ b/AutoPrompt/docs/autoprompt_recording.gif
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e4156d4ad7c4d971a7a7721b0a031000a43ba677dd3dd20e2c15f54de88b6172
+size 2333849
diff --git a/AutoPrompt/docs/contributing.md b/AutoPrompt/docs/contributing.md
new file mode 100644
index 0000000000000000000000000000000000000000..387e6b710c70be1b1a467718aedc8564bf64001d
--- /dev/null
+++ b/AutoPrompt/docs/contributing.md
@@ -0,0 +1,13 @@
+# Contributing to AutoPrompt
+
+Thank you for considering contributing to AutoPrompt! We deeply appreciate your interest in improving our project.
+
+## Bug Fixes and Documentation Enhancements
+
+Bug fixes and documentation improvements, including compelling examples and use cases, greatly benefit our project. If you encounter any bugs or identify areas where the documentation could be strengthened, please do not hesitate to submit a pull request (PR) containing your proposed changes.
+
+## Feature Requests
+
+For significant feature additions, we encourage you to open an issue on GitHub. Additionally, we invite you to join our Discord community and engage in discussions about the feature in the #features-requests channel. This collaborative environment enables us to delve deeper into the proposed features and foster meaningful dialogue.
+
+We value your contributions and look forward to working together to enhance AutoPrompt!
diff --git a/AutoPrompt/docs/examples.md b/AutoPrompt/docs/examples.md
new file mode 100644
index 0000000000000000000000000000000000000000..542de2a6ef93cdd2ddbff4315920fb1a260088e9
--- /dev/null
+++ b/AutoPrompt/docs/examples.md
@@ -0,0 +1,243 @@
+
+# Prompt Optimization Examples
+
+This document provides practical examples of using the AutoPrompt pipeline across various scenarios. It focuses on movie review and chat moderation tasks to demonstrate the flexibility and effectiveness of the AutoPrompt framework.
+
+
+1. [Filtering Movie Reviews with Spoilers (Classification task)](#filtering-movie-reviews-with-spoilers-classification-task)
+2. [Movie Genre Identification (Multi-label classification task)](#movie-genre-identification-multi-label-classification)
+3. [Rating Movie Reviews (Scoring task)](#rating-movie-reviews-scoring-task)
+4. [Generating Movie Reviews (Generation task)](#generating-movie-reviews-generation-task)
+5. [Single Topic Moderation](#single-topic-moderation)
+6. [Multi-Topic Moderation (Prompt squeezing task)](#multi-topic-moderation-prompt-squeezing)
+
+### Filtering Movie Reviews with Spoilers (Classification task)
+
+In this binary classification example, we aim to filter out movie reviews containing spoilers for a specific movie. A correctly implemented filter can be a powerful tool in a large-scale movie review system.
+
+We'll start with a simple initial prompt and task description:
+ - Initial prompt: “Does this movie review contain a spoiler? answer Yes or No”
+ - Task description: “Assistant is an expert classifier that will classify a movie review, and let the user know if it contains a spoiler for the reviewed movie or not.”
+
+#### Steps to Run Example
+
+1. Configure your labels by editing `config/config_default.yml`. Modify the `label_schema` in the `dataset` section to include only 'Yes' and 'No' options.
+
+```
+dataset:
+    name: 'dataset'
+    records_path: null
+    initial_dataset: 'dump/dataset.csv'
+    label_schema: ["Yes", "No"]
+    max_samples: 50
+```
+2. Run the main pipeline from an IDE or the command line
+```bash
+python run_pipeline.py
+```
+
+*Note*: Without input parameters, the pipeline prompts the user to provide them. Alternatively, specify initial prompt and task description as command-line arguments:
+```bash
+python run_pipeline.py \
+    --prompt "Does this movie review contain a spoiler? answer Yes or No" \
+    --task_description "Assistant is an expert classifier that will classify a movie review, and let the user know if it contains a spoiler for the reviewed movie or not."
+```
+
+3. A browser window displaying the Argilla workspace will open for manual annotations
+![argilla_example](./argilla_movie_spoilers_example.png)
+
+Annotate the generated examples as they appear and monitor the pipeline's progress. Control the number of optimization iterations with the `num_steps` parameter, specified at start:
+```bash
+python run_pipeline.py --num_steps 30
+```
+The pipeline concludes after reaching `num_steps` or meeting one of the predefined stop criteria, defined in `config/config_default.yml`:
+```
+stop_criteria:
+    max_usage: 0.5  # Max budget for optimization (USD for OpenAI's LLM model)
+    patience: 3  # Number of iterations to wait for improvement
+    min_delta: 0.05  # Minimum improvement between iterations
+```
+Note that the framework also supports using an LLM as the annotator, see setup instructions [here](installation.md#configure-llm-annotator).
+
+4. After completion, the pipeline outputs a **refined (calibrated) prompt** tailored for the task and a reference **benchmark** with challenging samples. In this example, the final spoiler identification prompt might be:
+
+```
+Review Spoiler Identification Protocol: For the task of classifying IMDB reviews for
+the presence of spoilers, the classifier must label reviews with a heightened sensitivity to
+nuanced language and indirect spoiler cues. The classification labels are ’Yes’ for spoilers
+and ’No’ for non-spoilers. Apply the following criteria rigorously: Label ’Yes’ if a review: -
+Contains subtle references or nuanced language that hints at plot developments or character
+arcs, without explicit detail. - Includes emotional responses or descriptive language that
+indirectly reveals plot outcomes or twists. - Employs suggestive language that points to future
+events or endings, even if it does not reveal specific information. Label ’No’ if a review: -
+Discusses technical aspects, acting, direction, or personal viewer impressions in a manner
+that does not hint at or reveal any plot details. - Comments on thematic elements, genre
+characteristics, or storytelling techniques without disclosing or implying crucial plot twists.
+```
+
+- The framework automatically saves the benchmark, run log, and a checkpoint file (which stores the state of the optimization, enabling seamless continuation from a previous run) in a default `dump` path, adjustable with the `--output_dump` command line argument.
+- Note that the steps above are relevant to all classification and generation tasks. See the following examples for more use cases.
+
+5. Until now, we've initiated the pipeline with just an initial prompt and task description. However, you can also include a few examples by specifying an initial dataset in the `initial_dataset` field within the `dataset` section of the `config/config_default.yml` file. For example:
+```
+dataset:
+ initial_dataset: 'dump/dataset.csv'
+```
+An example of an initial dataset with two samples is shown below:
+```
+id,text,prediction,annotation,metadata,score,batch_id
+0,"The cinematography was mesmerizing, especially during the scene where they finally reveal the mysterious room that captivated the main character.",No,Yes,,,0
+1,"The director's bold choice to leave the world's fate unclear until the final frame will spark audience discussions.",No,Yes,,,0
+```
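+
+A file in this layout can be produced with Python's standard `csv` module. The sketch below is only an illustration: the helper name `build_initial_dataset` is hypothetical, and the column order is copied from the example above.
+
+```python
+import csv
+import io
+
+# Column order taken from the example dataset shown above.
+COLUMNS = ["id", "text", "prediction", "annotation", "metadata", "score", "batch_id"]
+
+
+def build_initial_dataset(samples):
+    """Return CSV text for a list of (text, prediction, annotation) tuples."""
+    buffer = io.StringIO()
+    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
+    writer.writeheader()
+    for i, (text, prediction, annotation) in enumerate(samples):
+        # Columns not given here (metadata, score) are left empty.
+        writer.writerow({"id": i, "text": text, "prediction": prediction,
+                         "annotation": annotation, "batch_id": 0})
+    return buffer.getvalue()
+```
+
+Writing the returned text to `dump/dataset.csv` would match the `initial_dataset` path used in the configuration above.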
+
+
+### Movie Genre Identification (Multi-label classification):
+
+In this example, we want to segment movie reviews into pre-defined genres. The initial prompt and task description might look like this:
+ - Initial prompt: "Based on the following movie review, what genre is this movie? Select between Action, Comedy, Drama, Romance or Horror."
+ - Task description: "Assistant is an expert cinema critic for all genres, and is tasked with classifying other movie reviews."
+
+#### Run Example
+For this multi-label classification, update the `label_schema` in `config/config_default.yml`
+```
+dataset:
+ label_schema: ["Action", "Comedy", "Drama", "Romance", "Horror"]
+```
+And then simply run the pipeline with the corresponding input parameters:
+```bash
+> python run_pipeline.py \
+ --prompt "Based on the following movie review, what genre is this movie? Select between Action, Comedy, Drama, Romance or Horror." \
+ --task_description "Assistant is an expert cinema critic for all genres, and is tasked with classifying other movie reviews."
+```
+Please follow the same annotation and monitoring procedures as shown in the previous example.
+
+### Rating Movie Reviews (Scoring task):
+In this example, we aim to score (rank) the movie reviews based on various criteria, assigning a numerical rating to each.
+
+We'll start with a simple initial prompt:
+ - Initial prompt: "How well is this movie review written? Give it a score between 1 and 5, with 1 being the lowest score."
+ - Task description: "Assistant is an expert cinema reviewer and editor, and is tasked with scoring other movie reviews."
+
+Note that although this task involves scoring, it is treated as a classification task, similar to the examples above.
+
+#### Run Example
+To run this task, update the `label_schema` in the input `config/config_default.yml` config file:
+```
+dataset:
+ label_schema: ["1", "2", "3", "4", "5"]
+```
+And then simply use the input parameters to run the pipeline:
+```bash
+> python run_pipeline.py \
+ --prompt "How well is this movie review written? Give it a score between 1 and 5, with 1 being the lowest score." \
+ --task_description "Assistant is an expert cinema reviewer and editor, and is tasked with scoring other movie reviews."
+```
+Follow the same steps as in the simple classification example for running the pipeline and annotating through the Argilla UI.
+
+### Generating Movie Reviews (Generation task):
+Here, we aim to generate good (insightful and comprehensive) movie reviews from scratch. The initial prompt might look something like this:
+ - Initial prompt: “Write a good and comprehensive movie review about a specific movie.”
+ - Task description: “Assistant is a large language model that is tasked with writing movie reviews.”
+
+This time, we'll need to use the `run_generation_pipeline.py` to initiate a generation run. This pipeline is different from but builds on the classification pipeline in our earlier examples.
+
+The generation pipeline starts by taking the initial prompt and modifying it for a scoring task, similar to the scoring example above. Once it establishes a robust estimator for high-quality content (in this instance, movie reviews), it runs the generation pipeline without the need for human annotation.
+
+To facilitate this, two distinct input config files are employed: `config/config_diff/config_ranking.yml`, and `config/config_diff/config_generation.yml`.
+
+Note that the `annotator` section in the generation config yaml file remains empty:
+```
+annotator:
+ method : ''
+```
+
+#### Run Example
+
+Run the generation pipeline with appropriate arguments:
+```bash
+> python run_generation_pipeline.py \
+ --prompt "Write a good and comprehensive movie review about a specific movie." \
+ --task_description "Assistant is a large language model that is tasked with writing movie reviews."
+```
+
+As the pipeline runs, the user will be prompted to annotate ranking examples of movie reviews. The final output will be a calibrated prompt for the generation task.
+
+### Single Topic Moderation:
+
+In this example, we aim to monitor user interactions on an Enterprise's chat platform to moderate (filter out) any unsolicited advertisements. This ensures a focused and relevant communication environment.
+
+The initial prompt could be as follows:
+
+ - Initial prompt: “Assess whether the message contains advertising. Answer 'Yes' or 'No'.”
+ - Task description: “As a moderation expert at FabricFantasia, an online store selling clothes, you meticulously review customer inquiries and support tickets.”
+
+#### Run Example
+For the moderation, update the `label_schema` in `config/config_default.yml`
+```
+dataset:
+ label_schema: ["Yes", "No"]
+```
+And then execute the pipeline with the specified input parameters:
+```bash
+> python run_pipeline.py \
+ --prompt "Assess whether the message contains advertising. Answer 'Yes' or 'No'." \
+ --task_description "As a moderation expert at FabricFantasia, an online store selling clothes, you meticulously review customer inquiries and support tickets."
+```
+Please follow the same annotation and monitoring procedures as shown in the previous examples.
+
+### Multi Topic Moderation (Prompt squeezing task):
+In this example, our goal is to monitor user interactions on an enterprise's chat platform and moderate (filter out) any problematic topics, including disclosing personal information, deceptive practices, spam, illegal activities, conflict of interest, and off-topic content.
+
+The initial prompt could be structured as follows:
+
+ - Initial prompt: “Does this message contain any problematic topics such as disclosing personal information, deceptive practices, spam, illegal activities, conflict of interest, or off-topic content? Respond with 'Yes' or 'No'.”
+ - Task description: “As a moderation expert at FabricFantasia, an online store selling clothes, you meticulously review customer inquiries and support tickets.”
+
+
+#### Run Example
+In a multi-topic moderation setting, we use various moderation rules to annotate a sample. Each rule is evaluated independently, and the outcomes are combined to generate the final labels. We employ an LLM annotator to avoid time-intensive manual annotation.
+
+This task utilizes two distinct input configuration files: `config/config_default.yml`, used previously, and `config/config_diff/config_batch_classification.yml`, which specifies the individual moderation rules, the policy for aggregating results, and LLM configuration. The available aggregation policies are 'exist', 'majority', and 'all'. The 'exist' and 'all' policies are suited for scenarios with 'Yes' or 'No' labels, while the 'majority' policy assigns the final label based on the most frequently occurring outcome across the rules.
+
+In our case, it can look like this:
+```
+dataset:
+    label_schema: ["Yes", "No"]
+
+annotator:
+    method: 'llm_batch'
+    config:
+        instructions:
+            ['Does the message disclose sensitive personal information? Answer Yes or No',
+             'Does the message involve deceptive practices? Answer Yes or No',
+             'Is this message an example of spam? Answer Yes or No',
+             'Does the message reference or promote any illegal activities? Answer Yes or No',
+             'Does the message come from someone with a potential conflict of interest? Answer Yes or No',
+             'Is this message completely irrelevant to the services or products offered? Answer Yes or No'
+            ]
+        aggregation_mode: 'exist' # 'majority', 'exist', or 'all'. 'exist' and 'all' work only with label_schema: ["Yes", "No"]
+        estimator_config:
+            num_workers: 2
+            prompt: 'prompts/predictor/prediction.prompt'
+            mode: 'annotation'
+            mini_batch_size: 1
+            llm:
+                type: 'OpenAI'
+                name: 'gpt-4-1106-preview'
+```
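+
+The aggregation policies described above can be sketched as a single standalone function. The helper name `aggregate` is hypothetical, and the tie-breaking rule for 'majority' (the label seen first wins) is an assumption of this sketch rather than documented framework behavior.
+
+```python
+from collections import Counter
+
+
+def aggregate(labels, mode="exist"):
+    """Combine the per-rule labels of one sample into a final label."""
+    if mode == "exist":
+        # 'Yes' if at least one rule answered 'Yes'.
+        return "Yes" if any(label == "Yes" for label in labels) else "No"
+    if mode == "all":
+        # 'Yes' only if every rule answered 'Yes'.
+        return "Yes" if all(label == "Yes" for label in labels) else "No"
+    if mode == "majority":
+        # Most frequent label; on a tie the label seen first is returned.
+        return Counter(labels).most_common(1)[0][0]
+    raise ValueError(f"Unknown aggregation mode: {mode}")
+```
+
+For moderation, 'exist' is the natural choice: a message is flagged as soon as any single rule is violated.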
+
+Also, update the `label_schema` in `config/config_default.yml`
+```
+dataset:
+ label_schema: ["Yes", "No"]
+```
+
+#### Run Example
+As before, we'll use the `run_pipeline.py` to initiate a multi-topic moderation run.
+```bash
+> python run_pipeline.py \
+ --batch_config_path "config/config_diff/config_batch_classification.yml" \
+ --prompt "Assess whether the message contains any of the following problematic topics: disclosing personal information, deceptive practices, spam, illegal activities, conflict of interest, off-topic content. Answer 'Yes' if it does or 'No' otherwise." \
+ --task_description "As a moderation expert at FabricFantasia, an online store selling clothes, you meticulously review customer inquiries and support tickets."
+```
+Please follow the same annotation and monitoring procedures as shown in the previous examples.
diff --git a/AutoPrompt/docs/how-it-works.md b/AutoPrompt/docs/how-it-works.md
new file mode 100644
index 0000000000000000000000000000000000000000..e3cc5dbca369a49c70c086230b32119a64e4f5de
--- /dev/null
+++ b/AutoPrompt/docs/how-it-works.md
@@ -0,0 +1,58 @@
+
+# How AutoPrompt works
+
+This document outlines the optimization process flows of AutoPrompt. The framework is designed with modularity and adaptability in mind, allowing for easy extension of the prompt calibration process from classification tasks to generative tasks.
+
+
+## Classification Pipeline Overview
+
+The classification pipeline executes a calibration process involving the following steps:
+
+1. **User Input:**
+ - The user provides an initial prompt and task description to kickstart the calibration process.
+
+2. **Challenging Examples:**
+ - A set of challenging examples is proposed to the user to enhance the model's performance.
+
+3. **Annotation:**
+ - The provided examples are annotated, either by a human in the loop or by a large language model (LLM).
+
+4. **Prediction:**
+ - The annotated samples are evaluated using the current prompt to assess model performance.
+
+5. **Prompt Analysis:**
+ - The pipeline analyzes the prompt scores and identifies instances of large errors.
+
+6. **Prompt Refinement:**
+ - A new prompt is suggested based on the evaluation results, aiming to improve model accuracy.
+
+7. **Iteration:**
+ - Steps 2-6 are repeated until convergence, refining the prompt and enhancing the model's performance throughout the process.
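+
+The loop formed by steps 2-6 can be sketched as follows. Every callable passed in (`generate_samples`, `annotate`, `predict`, `refine`) is a hypothetical stand-in for the corresponding LLM or human step, and plain accuracy is assumed as the score; this is an illustration of the flow, not the framework's actual code.
+
+```python
+def calibrate(prompt, generate_samples, annotate, predict, refine, num_steps=3):
+    """Toy calibration loop; returns the (prompt, score) pair with the best score."""
+    dataset = []   # accumulated (text, label) pairs
+    history = []   # (prompt, score) per iteration
+    for _ in range(num_steps):
+        samples = generate_samples(prompt, dataset)              # step 2
+        dataset.extend((text, annotate(text)) for text in samples)  # step 3
+        if not dataset:
+            break
+        # Steps 4-5: predict with the current prompt and collect the errors.
+        errors = [(text, label) for text, label in dataset
+                  if predict(prompt, text) != label]
+        history.append((prompt, 1 - len(errors) / len(dataset)))
+        if not errors:
+            break
+        prompt = refine(prompt, errors)                          # step 6
+    return max(history, key=lambda item: item[1]) if history else (prompt, 0.0)
+```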
+
+
+## Generation Pipeline Overview
+
+The generation pipeline shares a common structure with the classification flow but introduces a modification step for generation prompts. The process unfolds as follows:
+
+1. **User Input:**
+ - The user provides an initial prompt and task description for the generation process.
+
+2. **Prompt Modification (LLM):**
+ - The initial prompt is transformed into a classification-compatible input using a large language model (LLM), creating an intermediary task for boolean classification or ranking.
+
+3. **Annotation (Classification):**
+ - Challenging examples are annotated for boolean classification or ranking based on the modified prompts. This step is analogous to the classification flow.
+
+4. **Ranker Calibration (LLM):**
+ - Utilizing the annotated examples, a ranking prompt (implemented as an LLM estimator) is fitted.
+
+5. **Calibration (Generation):**
+ - The original generation prompt is calibrated using the ranking LLM estimator (now used for evaluation), resulting in enhanced prompt formulations for generation tasks.
+
+
+
+The modular architecture of the pipeline demonstrates the flexibility of the core calibration process and effectiveness for both classification and generation tasks. The additional step in the generation flow seamlessly integrates with the overall iterative prompt calibration approach.
+
+
+
+
diff --git a/AutoPrompt/docs/installation.md b/AutoPrompt/docs/installation.md
new file mode 100644
index 0000000000000000000000000000000000000000..efbcc5242b1de02d8672cfe85900f2d12abf998e
--- /dev/null
+++ b/AutoPrompt/docs/installation.md
@@ -0,0 +1,75 @@
+# Installation
+
+This guide provides detailed instructions for setting up your development environment, configuring LLMs, and integrating various tools necessary for your project.
+
+## Python version
+We recommend using Python 3.10.13.
+
+## Install with Conda
+We recommend installing using Conda:
+```bash
+conda env create -f environment_dev.yml
+conda activate AutoPrompt
+```
+
+## Install with pip
+Install using pip directly:
+```bash
+pip install -r requirements.txt
+```
+
+## Install with pipenv
+Install using pipenv:
+```bash
+pip install pipenv
+pipenv sync
+```
+
+### Configure your LLM
+
+Set your OpenAI API key in the configuration file `config/llm_env.yml`. For assistance locating your API key, visit this [link](https://help.openai.com/en/articles/4936850-where-do-i-find-my-api-key).
+
+- For the LLM, we recommend [OpenAI's GPT-4](https://platform.openai.com/docs/guides/gpt). Alternatively, configure Azure by setting the llm `type` in `config/config_default.yml` to `"Azure"` and specifying the key in `config/llm_env.yml`. The system also supports other LLMs, including open-source models, through the [LangChain pipeline](https://python.langchain.com/docs/integrations/llms/huggingface_pipelines): change the llm `type` to `"HuggingFacePipeline"` and specify the model ID in the llm `name` field.
+
+- **Configure your Predictor**. We employ a predictor to estimate prompt performance. The default predictor LLM is GPT-3.5. Configuration is located in the `predictor` section of `config/config_default.yml`.
+
+### Configure Human-in-the-Loop Annotator
+
+Our pipeline incorporates a human-in-the-loop annotation process using [Argilla](https://docs.argilla.io/en/latest/index.html). Follow these steps to set it up:
+
+1. **Set Up Argilla Server and UI:** Follow the [instructions](https://docs.argilla.io/en/latest/getting_started/quickstart_installation.html) to install and set up an Argilla server and user interface.
+
+2. **Quick Installation Option:** For a faster setup, we recommend deploying Argilla on a Hugging Face [space](https://huggingface.co/new-space?template=argilla/argilla-template-space).
+
+3. **Configure API Settings:** After setting up the server, modify the `api_url` and `api_key` in the `config/config_default.yml` file. For instance, if using the recommended Hugging Face space, your API URL should be formatted as follows: `api_url: 'https://.hf.space'`.
+
+
+### Configure LLM Annotator
+
+To specify an LLM as the annotation tool in your pipeline, update the `annotator` section in the `config/config_default.yml` file as follows:
+
+```
+annotator:
+    method: 'llm'
+    config:
+        llm:
+            type: 'OpenAI'
+            name: 'gpt-4-1106-preview'
+        instruction:
+            'Assess whether the text contains a harmful topic.
+            Answer Yes if it does and No otherwise.'
+        num_workers: 5
+        prompt: 'prompts/predictor_completion/prediction.prompt'
+        mini_batch_size: 1
+        mode: 'annotation'
+```
+We recommend using a robust LLM, like GPT-4, for annotation purposes. In the `instruction` field, you specify the task instructions for the annotation. The `mini_batch_size` field determines the number of samples processed in a single annotation pass, allowing you to balance efficiency with LLM token usage.
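+
+To make the effect of `mini_batch_size` concrete, the sketch below groups sample texts into one input string per LLM call, with one `ID: <id>; Sample: <text>` line per sample. `build_mini_batches` is a hypothetical helper written for illustration and is not part of the framework's API.
+
+```python
+def build_mini_batches(texts, mini_batch_size):
+    """Group texts into one prompt input string per annotation call."""
+    batches = []
+    for start in range(0, len(texts), mini_batch_size):
+        chunk = texts[start:start + mini_batch_size]
+        # Each sample keeps its global index so results can be mapped back.
+        batches.append("".join(
+            f"ID: {start + offset}; Sample: {text}\n"
+            for offset, text in enumerate(chunk)
+        ))
+    return batches
+```
+
+A larger `mini_batch_size` means fewer calls but longer inputs per call; with `mini_batch_size: 1`, every sample is annotated in its own call.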
+
+
+### Monitoring: Weights and Biases Setup
+
+To track your optimization process, including metrics such as scores, prompt instances, and error analysis across iterations, we recommend using [Weights and Biases](https://wandb.ai/site).
+
+1. **Sign Up for Weights and Biases:** Visit their [website](https://wandb.ai/site) and follow the instructions to create an account.
+
+2. **Enable wandb in Your Configuration:** In your project's `config/config_default.yml` file, set `use_wandb` to `True` to activate wandb support.
\ No newline at end of file
diff --git a/AutoPrompt/environment_dev.yml b/AutoPrompt/environment_dev.yml
new file mode 100644
index 0000000000000000000000000000000000000000..100124833fce285cd5e3e66ebb923f071b9e1a93
--- /dev/null
+++ b/AutoPrompt/environment_dev.yml
@@ -0,0 +1,23 @@
+name: AutoPrompt
+
+channels:
+ - conda-forge
+dependencies:
+ - python=3.10.13
+ - pip>=2.22.0
+ - openai
+ - langchain
+ - pandas
+ - wandb
+ - transformers
+ - tqdm
+ - faiss-cpu
+ - sentence-transformers
+ - pip:
+ - prodict
+ - argilla==1.25.0
+ - schedule
+ - pandas
+ - easydict
+ - pillow==10.2.0
+ - langchain-google-genai==0.0.9
diff --git a/AutoPrompt/estimator/__init__.py b/AutoPrompt/estimator/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..c64c4de00ad4e937fbe7843c01d12240b6193a90
--- /dev/null
+++ b/AutoPrompt/estimator/__init__.py
@@ -0,0 +1,37 @@
+import pandas as pd
+
+from .estimator_argilla import ArgillaEstimator
+from .estimator_llm import LLMEstimator
+from .estimator_llm_batch import LLMBatchEstimator
+from dataset.base_dataset import DatasetBase
+
+
+class DummyEstimator:
+ """
+ A no-op stand-in for the estimator classes.
+ It is used when no annotator/estimator method is configured.
+ """
+
+ @staticmethod
+ def calc_usage():
+ """
+ Dummy function to calculate the usage of the dummy estimator
+ """
+ return 0
+
+ @staticmethod
+ def apply(dataset: DatasetBase, batch_id: int):
+ """
+ Dummy function to mimic the apply method, returns an empty dataframe
+ """
+ return pd.DataFrame()
+
+def give_estimator(opt):
+ if opt.method == 'argilla':
+ return ArgillaEstimator(opt.config)
+ elif opt.method == 'llm':
+ return LLMEstimator(opt.config)
+ elif opt.method == 'llm_batch':
+ return LLMBatchEstimator(opt.config)
+ else:
+ return DummyEstimator()
diff --git a/AutoPrompt/estimator/estimator_argilla.py b/AutoPrompt/estimator/estimator_argilla.py
new file mode 100644
index 0000000000000000000000000000000000000000..2d559debeac22e772536542e41f9a41834754270
--- /dev/null
+++ b/AutoPrompt/estimator/estimator_argilla.py
@@ -0,0 +1,119 @@
+import argilla as rg
+import time
+import pandas as pd
+from argilla.client.singleton import active_client
+from utils.config import Color
+from dataset.base_dataset import DatasetBase
+import json
+import webbrowser
+import base64
+
+class ArgillaEstimator:
+ """
+ The ArgillaEstimator class generates the ground truth (GT) for the dataset through the Argilla interface,
+ using Argilla's text classification mode.
+ """
+ def __init__(self, opt):
+ """
+ Initialize a new instance of the ArgillaEstimator class.
+ """
+ try:
+ self.opt = opt
+ rg.init(
+ api_url=opt.api_url,
+ api_key=opt.api_key,
+ workspace=opt.workspace
+ )
+ self.time_interval = opt.time_interval
+ except Exception as e:
+ raise Exception("Failed to connect to argilla, check connection details") from e
+
+ @staticmethod
+ def initialize_dataset(dataset_name: str, label_schema: set[str]):
+ """
+ Initialize a new dataset in the Argilla system
+ :param dataset_name: The name of the dataset
+ :param label_schema: The list of classes
+ """
+ try:
+ settings = rg.TextClassificationSettings(label_schema=label_schema)
+ rg.configure_dataset_settings(name=dataset_name, settings=settings)
+ except Exception as e:
+ raise Exception("Failed to create dataset") from e
+
+ @staticmethod
+ def upload_missing_records(dataset_name: str, batch_id: int, batch_records: pd.DataFrame):
+ """
+ Update the Argilla dataset by adding missing records from batch_id that appears in batch_records
+ :param dataset_name: The dataset name
+ :param batch_id: The batch id
+ :param batch_records: A dataframe of the batch records
+ """
+ #TODO: sort visualization according to batch_id descending
+ query = "metadata.batch_id:{}".format(batch_id)
+ result = rg.load(name=dataset_name, query=query)
+ df = result.to_pandas()
+ if len(df) == len(batch_records):
+ return
+ if df.empty:
+ upload_df = batch_records
+ else:
+ merged_df = pd.merge(batch_records, df['text'], on='text', how='left', indicator=True)
+ upload_df = merged_df[merged_df['_merge'] == 'left_only'].drop(columns=['_merge'])
+ record_list = []
+ for index, row in upload_df.iterrows():
+ config = {'text': row['text'], 'metadata': {"batch_id": row['batch_id'], 'id': row['id']}, "id": row['id']}
+ # if not (row[['prediction']].isnull().any()):
+ # config['prediction'] = row['prediction'] # TODO: fix it incorrect type!!!
+ if not(row[['annotation']].isnull().any()): # TODO: fix it incorrect type!!!
+ config['annotation'] = row['annotation']
+ record_list.append(rg.TextClassificationRecord(**config))
+ rg.log(records=record_list, name=dataset_name)
+
+ def calc_usage(self):
+ """
+ Dummy function to calculate the usage of the estimator
+ """
+ return 0
+
+ def apply(self, dataset: DatasetBase, batch_id: int):
+ """
+ Apply the estimator on the dataset. The function polls in a loop until all the records in the batch are annotated or discarded,
+ and then returns the annotations as a dataframe.
+ :param dataset: DatasetBase object, contains all the processed records
+ :param batch_id: The batch id to annotate
+ """
+ current_api = active_client()
+ try:
+ rg_dataset = current_api.datasets.find_by_name(dataset.name)
+ except Exception:
+ self.initialize_dataset(dataset.name, dataset.label_schema)
+ rg_dataset = current_api.datasets.find_by_name(dataset.name)
+ batch_records = dataset[batch_id]
+ if batch_records.empty:
+ return []
+ self.upload_missing_records(dataset.name, batch_id, batch_records)
+ data = {'metadata': {'batch_id': [str(batch_id)]}}
+ json_data = json.dumps(data)
+ encoded_bytes = base64.b64encode(json_data.encode('utf-8'))
+ encoded_string = str(encoded_bytes, "utf-8")
+ url_link = self.opt.api_url + '/datasets/' + self.opt.workspace + '/' \
+ + dataset.name + '?query=' + encoded_string
+ print(f"{Color.GREEN}Waiting for annotations from batch {batch_id}:\n{url_link}{Color.END}")
+ webbrowser.open(url_link)
+ while True:
+ query = "(status:Validated OR status:Discarded) AND metadata.batch_id:{}".format(batch_id)
+ search_results = current_api.search.search_records(
+ name=dataset.name,
+ task=rg_dataset.task,
+ size=0,
+ query_text=query,
+ )
+ if search_results.total == len(batch_records):
+ result = rg.load(name=dataset.name, query=query)
+ df = result.to_pandas()[['text', 'annotation', 'metadata', 'status']]
+ df["annotation"] = df.apply(lambda x: 'Discarded' if x['status']=='Discarded' else x['annotation'], axis=1)
+ df = df.drop(columns=['status'])
+ df['id'] = df.apply(lambda x: x['metadata']['id'], axis=1)
+ return df
+ time.sleep(self.time_interval)
diff --git a/AutoPrompt/estimator/estimator_llm.py b/AutoPrompt/estimator/estimator_llm.py
new file mode 100644
index 0000000000000000000000000000000000000000..f2295b1118c90531b8916ac34c244756ac3086c2
--- /dev/null
+++ b/AutoPrompt/estimator/estimator_llm.py
@@ -0,0 +1,95 @@
+from utils.llm_chain import ChainWrapper, get_chain_metadata
+from pathlib import Path
+from dataset.base_dataset import DatasetBase
+import pandas as pd
+
+class LLMEstimator:
+ """
+ A wrapper for an estimator using LLM
+ """
+
+ def __init__(self, opt):
+ """
+ Initialize a new instance of the LLMEstimator class.
+ :param opt: The configuration file (EasyDict)
+ """
+ self.opt = opt
+ self.chain = None
+ self.mini_batch_size = opt.mini_batch_size
+ self.mode = opt.mode
+ self.num_workers = opt.num_workers
+ if 'instruction' in opt.keys():
+ self.cur_instruct = opt.instruction
+ else:
+ self.cur_instruct = None
+
+ @staticmethod
+ def generate_sample_text(sample_id: int, text: str) -> str:
+ """
+ Generate a sample text for the chain prompt
+ :param sample_id: The sample id
+ :param text: The text of the sample
+ :return: The sample text for the prompt
+ """
+ return f"ID: {sample_id}; Sample: {text}\n"
+
+ def calc_usage(self) -> float:
+ """
+ Calculate the usage of the estimator
+ """
+ return self.chain.accumulate_usage
+
+ def init_chain(self, label_schema: set[str]):
+ """
+ Initialize the chain
+ :param label_schema: The label schema
+ """
+ chain_metadata = get_chain_metadata(Path(self.opt.prompt), retrieve_module=True)
+ if hasattr(chain_metadata['module'], 'update_classification_prediction_schema'):
+ chain_metadata['json_schema'] = chain_metadata['module'].update_classification_prediction_schema(
+ chain_metadata['json_schema'],
+ label_schema
+ )
+ self.chain = ChainWrapper(self.opt.llm, self.opt.prompt, chain_metadata['json_schema'],
+ chain_metadata['parser_func'])
+
+ def apply_dataframe(self, record: pd.DataFrame):
+ """
+ Apply the estimator on a dataframe
+ :param record: The record
+ """
+ chain_input = ''
+ mini_batch_inputs = []
+ record[self.mode] = 'Discarded'
+ # prepare all the inputs for the chains
+ for i, row in record.iterrows():
+ chain_input += self.generate_sample_text(i, row['text'])
+ if ((i + 1) % self.mini_batch_size) == 0:
+ mini_batch_inputs.append({'batch_size': self.mini_batch_size, 'task_instruction': self.cur_instruct,
+ 'samples': chain_input})
+ chain_input = ''
+ if not (chain_input == ''):
+ mini_batch_inputs.append({'batch_size': self.mini_batch_size, 'task_instruction': self.cur_instruct,
+ 'samples': chain_input})
+
+ all_results = self.chain.batch_invoke(mini_batch_inputs, self.num_workers)
+ union_results = [element for sublist in all_results for element in sublist['results']]
+ for res in union_results:
+ record.loc[res['id'], self.mode] = res['prediction']
+ return record
+
+ def apply(self, dataset: DatasetBase, idx: int, leq: bool = False):
+ """
+ Apply the estimator on batch idx (or on all batches up to and including idx if leq is True). It updates the annotation field
+ if self.mode is 'annotation'; otherwise it updates the prediction field.
+ :param dataset: The dataset
+ :param idx: The current batch index
+ :param leq: If True, apply on all the batches up to idx (includes), otherwise apply only on idx
+ """
+ if self.chain is None:
+ self.init_chain(dataset.label_schema)
+ if leq:
+ batch_records = dataset.get_leq(idx)
+ else:
+ batch_records = dataset[idx]
+ return self.apply_dataframe(batch_records)
diff --git a/AutoPrompt/estimator/estimator_llm_batch.py b/AutoPrompt/estimator/estimator_llm_batch.py
new file mode 100644
index 0000000000000000000000000000000000000000..474958a70b7820ae47557cbc63bed0376d7c9583
--- /dev/null
+++ b/AutoPrompt/estimator/estimator_llm_batch.py
@@ -0,0 +1,68 @@
+from estimator.estimator_llm import LLMEstimator
+from dataset.base_dataset import DatasetBase
+import pandas as pd
+
+
+class LLMBatchEstimator:
+ """
+ A wrapper for an estimator using aggregation of multiple LLMs estimators
+ """
+
+ def __init__(self, opt):
+ """
+ Initialize a new instance of the LLMBatchEstimator class.
+ :param opt: The configuration file (EasyDict)
+ """
+ self.llm_estimators = [LLMEstimator(opt.estimator_config) for _ in range(len(opt.instructions))]
+ for i, estimator in enumerate(self.llm_estimators):
+ estimator.cur_instruct = opt.instructions[i]
+ self.mode = opt.estimator_config.mode
+ self.aggregation_mode = opt.aggregation_mode
+
+ def calc_usage(self) -> float:
+ """
+ Calculate the usage of the estimator
+ """
+ return sum([estimator.calc_usage() for estimator in self.llm_estimators])
+
+ def get_aggregation_function(self):
+ if self.aggregation_mode == 'max':
+ return lambda record: max(record)
+ elif self.aggregation_mode == 'min':
+ return lambda record: min(record)
+ elif self.aggregation_mode == 'mean':
+ return lambda record: sum(record) / len(record)
+ elif self.aggregation_mode == 'median':
+ return lambda record: sorted(record)[len(record) // 2]
+ elif self.aggregation_mode == 'majority':
+ return lambda record: max(set(record), key=record.count)
+ elif self.aggregation_mode == 'exist':
+ return lambda record: 'Yes' if any([t == 'Yes' for t in record]) else 'No'
+ elif self.aggregation_mode == 'all':
+ return lambda record: 'Yes' if all([t == 'Yes' for t in record]) else 'No'
+ else:
+ raise Exception(f'Unknown aggregation class {self.aggregation_mode}')
+
+ def apply(self, dataset: DatasetBase, idx: int, leq: bool = False):
+ """
+ Apply the estimator on batch idx (or on all batches up to and including idx if leq is True). It updates the annotation field
+ if self.mode is 'annotation'; otherwise it updates the prediction field.
+ :param dataset: The dataset
+ :param idx: The current batch index
+ :param leq: If True, apply on all the batches up to idx (includes), otherwise apply only on idx
+ """
+ update_datasets = [estimator.apply(dataset, idx, leq) for estimator in self.llm_estimators]
+ res_dataset = update_datasets[0]
+ if res_dataset.empty:
+ return res_dataset
+ for i, df in enumerate(update_datasets[1:]):
+ # Merge the dataframes on the 'id' column
+ merged_df = pd.merge(res_dataset, df[['id', self.mode]], on='id', how='left', suffixes=('_left', '_right'))
+ if i == 0:
+ res_dataset[self.mode] = merged_df.apply(lambda row: [str(row['{}_left'.format(self.mode)])] +
+ [str(row['{}_right'.format(self.mode)])], axis=1)
+ else:
+ res_dataset[self.mode] = merged_df.apply(lambda row: row['{}_left'.format(self.mode)] +
+ [str(row['{}_right'.format(self.mode)])], axis=1)
+ res_dataset[self.mode] = res_dataset[self.mode].apply(self.get_aggregation_function())
+ return res_dataset
diff --git a/AutoPrompt/eval/eval_utils.py b/AutoPrompt/eval/eval_utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..4a72dd1febdf7a1596d857d0ff64e82d5a61f5dc
--- /dev/null
+++ b/AutoPrompt/eval/eval_utils.py
@@ -0,0 +1,24 @@
+from estimator.estimator_llm import LLMEstimator
+
+
+def set_function_from_iterrow(func):
+ def wrapper(dataset):
+ dataset['score'] = dataset.apply(func, axis=1)
+ return dataset
+
+ return wrapper
+
+
+def set_ranking_function(params):
+ evaluator = LLMEstimator(params)
+ evaluator.init_chain(params.label_schema)
+ evaluator.mode = 'score'
+ def wrapper(dataset):
+ generation_dataset = dataset.copy()
+ generation_dataset['text'] = '###User input:\n' + generation_dataset['text'] + '\n####model prediction:\n' + generation_dataset['prediction']
+
+ generation_dataset = evaluator.apply_dataframe(generation_dataset)
+ generation_dataset.score = generation_dataset.score.astype(int)
+ dataset.score = generation_dataset.score
+ return dataset
+ return wrapper
diff --git a/AutoPrompt/eval/evaluator.py b/AutoPrompt/eval/evaluator.py
new file mode 100644
index 0000000000000000000000000000000000000000..f9107e9e51a86ab9dcfee1ac153556682a0447f0
--- /dev/null
+++ b/AutoPrompt/eval/evaluator.py
@@ -0,0 +1,152 @@
+import pandas as pd
+import numpy as np
+from sklearn.metrics import confusion_matrix
+import eval.eval_utils as utils
+
+class Eval:
+ """
+ The Eval class is responsible for calculating the score and extracting the large errors
+ """
+
+ def __init__(self, config, analyzer=None, label_schema=None):
+ """
+ Initialize a new instance of the Eval class.
+ :param config: The configuration file (EasyDict)
+ :param analyzer: (optional) A chain that analyzes the errors
+ :param label_schema: (optional) The label schema
+ """
+ self.score_function_name = config.function_name
+ self.score_func = self.get_eval_function(config)
+ self.num_errors = config.num_large_errors
+ self.error_threshold = config.error_threshold
+ self.dataset = None
+ self.mean_score = None
+ self.label_schema = label_schema
+ self.errors = None
+ self.history = []
+ self.analyzer = analyzer
+
+ @staticmethod
+ def get_eval_function(config: dict):
+ """
+ Returns the eval function
+ :param config: The eval configuration
+ :return: The function implementation on a record
+ """
+ if config.function_name == 'accuracy':
+ return utils.set_function_from_iterrow(lambda record: record['annotation'] == record['prediction'])
+ elif config.function_name == 'ranking':
+ return utils.set_ranking_function(config.function_params)
+ else:
+ raise NotImplementedError("Eval function not implemented")
+
+ def eval_score(self) -> float:
+ """
+ Calculate the score on each row and return the mean score.
+ :return: The mean score
+ """
+ # filter out the discarded samples
+ self.dataset = self.dataset[(self.dataset['prediction'] != 'Discarded') &
+ (self.dataset['annotation'] != 'Discarded')]
+ self.dataset = self.score_func(self.dataset)
+ self.mean_score = self.dataset['score'].mean()
+ return self.mean_score
+
+ def get_max_score(self, warmup=0):
+ """
+ Return the maximum 'mean score' over the history epochs, starting from warmup and excluding the last epoch, together with the epoch index of that maximum
+ :return: The epoch index of the maximum score, and the maximum score
+ """
+ max_idx = np.argmax([epoch['score'] for epoch in self.history[warmup:-1]])
+ max_idx += warmup
+ return max_idx, self.history[max_idx]['score']
+
+
+ def large_error_to_str(self, error_df: pd.DataFrame, num_large_errors_per_label: int) -> str:
+ """
+ Return a string that contains the large errors
+ :param error_df: A dataframe contains all the mislabeled samples
+ :param num_large_errors_per_label: The (maximum) number of large errors per label
+ :return: A string that contains the large errors that is used in the meta-prompt
+ """
+ required_columns = ['annotation', 'text', 'score', 'prediction']
+ label_schema = error_df['annotation'].unique()
+ if self.score_function_name == 'ranking':
+ gt_name = 'Rank'
+ else:
+ gt_name = 'GT'
+ error_res_df_list = []
+ txt_res = ''
+ for label in label_schema:
+ cur_df = error_df[error_df['annotation'] == label]
+ cur_df = cur_df.sample(frac=1.0, random_state=42)[:num_large_errors_per_label]
+ error_res_df_list.append(cur_df[required_columns])
+ if len(error_res_df_list) > 0:
+ error_res_df = pd.concat(error_res_df_list, ignore_index=True)
+ error_res_df = error_res_df.sample(frac=1.0, random_state=42)
+ for i, row in error_res_df.iterrows():
+ txt_res += f"Sample: {row.text}\nPrediction: {row.prediction}, {gt_name}: {row.annotation}\n#\n"
+ return txt_res
+
+ def sample_to_text(self, sample: dict, num_errors_per_label: int = 0, is_score: bool = True) -> str:
+ """
+ Return a string that organizes the information from the step run for the meta-prompt
+ :param sample: The eval information for specific step
+ :param num_errors_per_label: The max number of large errors per class that will appear in the meta-prompt
+ :param is_score: If True, add the score information to the meta-prompt
+ :return: A string that contains the information of the step run
+ """
+ if is_score:
+ return f"####\n##Prompt Score: {sample['score']:.2f}\n##Prompt:\n{sample['prompt']}\n#################\n"
+ else:
+ return f"####\n##Prompt:\n{sample['prompt']}\n{self.large_error_to_str(sample['errors'], num_errors_per_label)}####\n "
+
+ def add_history(self, prompt: str, task_description: str):
+ """
+ Add the current step information to the history
+ :param prompt: The current prompt
+ :param task_description: The task description
+ """
+ conf_matrix = None
+ large_error_to_str = self.large_error_to_str(self.errors, self.num_errors)
+ prompt_input = {'task_description': task_description, 'accuracy': self.mean_score, 'prompt': prompt,
+ 'failure_cases': large_error_to_str}
+ if self.score_function_name == 'accuracy':
+ conf_matrix = confusion_matrix(self.dataset['annotation'],
+ self.dataset['prediction'], labels=self.label_schema)
+ conf_text = f"Confusion matrix columns:{self.label_schema} the matrix data:"
+ for i, row in enumerate(conf_matrix):
+ conf_text += f"\n{self.label_schema[i]}: {row}"
+ prompt_input['confusion_matrix'] = conf_text
+ elif self.score_function_name == 'ranking':
+ prompt_input['labels'] = self.label_schema
+ analysis = self.analyzer.invoke(prompt_input)
+
+ self.history.append({'prompt': prompt, 'score': self.mean_score,
+ 'errors': self.errors, 'confusion_matrix': conf_matrix, 'analysis': analysis['text']})
+
+ def extract_errors(self) -> pd.DataFrame:
+ """
+ Extract the errors from the dataset
+ :return: records that contains the errors
+ """
+ df = self.dataset
+ err_df = df[df['score'] < self.error_threshold]
+ err_df = err_df.sort_values(by=['score'])
+ self.errors = err_df
+ return self.errors
+
+ def extract_correct(self) -> pd.DataFrame:
+ """
+ Extract the correct samples from the dataset
+ :return: records that contains the correct samples
+ """
+ df = self.dataset
+ return df[df['score'] > self.error_threshold]
+
+ def extract_boundary_predictions(self) -> pd.DataFrame:
+ """
+ Extract boundary samples on which the model is uncertain
+ :return: records that contains boundary samples
+ """
+ pass
\ No newline at end of file
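Two details of `Eval` are easy to misread: `get_max_score` searches `history[warmup:-1]`, so the most recent epoch is excluded, and `extract_errors` / `extract_correct` use strict comparisons, so a sample whose score equals the threshold lands in neither group. The sketch below reproduces both behaviours on plain Python lists. The function names and the list-based inputs are illustrative assumptions, not the repository's API.

```python
def get_max_score(scores, warmup=0):
    """Return (index, score) of the best epoch in scores[warmup:-1]."""
    window = scores[warmup:-1]
    # The first maximum wins, matching numpy.argmax on ties.
    best = max(range(len(window)), key=window.__getitem__)
    return best + warmup, scores[best + warmup]


def split_by_threshold(scores, threshold):
    """Return (errors, correct): errors sorted ascending, both strict."""
    errors = sorted(s for s in scores if s < threshold)
    correct = [s for s in scores if s > threshold]
    return errors, correct
```

Because the last epoch is excluded, the caller can compare the newest score against the best earlier one, which is how the stop criteria use this helper.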
diff --git a/AutoPrompt/optimization_pipeline.py b/AutoPrompt/optimization_pipeline.py
new file mode 100644
index 0000000000000000000000000000000000000000..9a99afe7e9f8d1693524dba0bf073a80e9f7eda2
--- /dev/null
+++ b/AutoPrompt/optimization_pipeline.py
@@ -0,0 +1,277 @@
+import pandas as pd
+
+from eval.evaluator import Eval
+from dataset.base_dataset import DatasetBase
+from utils.llm_chain import MetaChain
+from estimator import give_estimator
+from pathlib import Path
+import pickle
+import os
+import json
+import logging
+import wandb
+
+
+class OptimizationPipeline:
+ """
+ The main pipeline for optimization. The pipeline is composed of 4 main components:
+ 1. dataset - Handles the data, including the annotations and the predictions
+ 2. annotator - Responsible for generating the GT
+ 3. predictor - Responsible for generating the predictions
+ 4. eval - Responsible for calculating the score and the large errors
+ """
+
+ def __init__(self, config, task_description: str = None, initial_prompt: str = None, output_path: str = ''):
+ """
+ Initialize a new instance of the OptimizationPipeline class.
+ :param config: The configuration file (EasyDict)
+ :param task_description: A description of the task to be solved
+ :param initial_prompt: An initial prompt for solving the task
+ :param output_path: The output directory for saving dumps; by default the dumps are not saved
+ """
+
+ if config.use_wandb: # In case of using W&B
+ wandb.login()
+ self.wandb_run = wandb.init(
+ project="AutoGPT",
+ config=config,
+ )
+ if output_path == '':
+ self.output_path = None
+ else:
+ if not os.path.isdir(output_path):
+ os.makedirs(output_path)
+ self.output_path = Path(output_path)
+ logging.basicConfig(filename=self.output_path / 'info.log', level=logging.DEBUG,
+ format='%(asctime)s - %(levelname)s - %(message)s', force=True)
+
+ self.dataset = None
+ self.config = config
+ self.meta_chain = MetaChain(config)
+ self.initialize_dataset()
+
+ self.task_description = task_description
+ self.cur_prompt = initial_prompt
+
+ self.predictor = give_estimator(config.predictor)
+ self.annotator = give_estimator(config.annotator)
+ self.eval = Eval(config.eval, self.meta_chain.error_analysis, self.dataset.label_schema)
+ self.batch_id = 0
+ self.patient = 0
+
+ @staticmethod
+ def log_and_print(message):
+ print(message)
+ logging.info(message)
+
+ def initialize_dataset(self):
+ """
+ Initialize the dataset: Either empty dataset or loading an existing dataset
+ """
+ logging.info('Initialize dataset')
+ self.dataset = DatasetBase(self.config.dataset)
+ if 'initial_dataset' in self.config.dataset.keys():
+ logging.info(f'Load initial dataset from {self.config.dataset.initial_dataset}')
+ self.dataset.load_dataset(self.config.dataset.initial_dataset)
+
+ def calc_usage(self):
+ """
+ Calculate the usage of the optimization process (in $ for OpenAI models, in number of tokens otherwise)
+ """
+ total_usage = 0
+ total_usage += self.meta_chain.calc_usage()
+ total_usage += self.annotator.calc_usage()
+ total_usage += self.predictor.calc_usage()
+ return total_usage
+
+ def extract_best_prompt(self):
+ sorted_history = sorted(
+ self.eval.history[min(self.config.meta_prompts.warmup - 1, len(self.eval.history) - 1):],
+ key=lambda x: x['score'],
+ reverse=False)
+ return {'prompt': sorted_history[-1]['prompt'], 'score': sorted_history[-1]['score']}
+
+ def run_step_prompt(self):
+ """
+ Run the meta-prompts and get new prompt suggestion, estimated prompt score and a set of challenging samples
+ for the new prompts
+ """
+ step_num = len(self.eval.history)
+ if (step_num < self.config.meta_prompts.warmup) or (step_num % 3) > 0:
+ last_history = self.eval.history[-self.config.meta_prompts.history_length:]
+ else:
+ sorted_history = sorted(self.eval.history[self.config.meta_prompts.warmup - 1:], key=lambda x: x['score'],
+ reverse=False)
+ last_history = sorted_history[-self.config.meta_prompts.history_length:]
+ history_prompt = '\n'.join([self.eval.sample_to_text(sample,
+ num_errors_per_label=self.config.meta_prompts.num_err_prompt,
+ is_score=True) for sample in last_history])
+ prompt_input = {"history": history_prompt, "task_description": self.task_description,
+ 'error_analysis': last_history[-1]['analysis']}
+ if 'label_schema' in self.config.dataset.keys():
+ prompt_input["labels"] = json.dumps(self.config.dataset.label_schema)
+ prompt_suggestion = self.meta_chain.step_prompt_chain.invoke(prompt_input)
+ self.log_and_print(f'Previous prompt score:\n{self.eval.mean_score}\n#########\n')
+ self.log_and_print(f'Get new prompt:\n{prompt_suggestion["prompt"]}')
+ self.batch_id += 1
+ if len(self.dataset) < self.config.dataset.max_samples:
+ batch_input = {"num_samples": self.config.meta_prompts.samples_generation_batch,
+ "task_description": self.task_description,
+ "prompt": prompt_suggestion['prompt']}
+ batch_inputs = self.generate_samples_batch(batch_input, self.config.meta_prompts.num_generated_samples,
+ self.config.meta_prompts.samples_generation_batch)
+
+ if sum([len(t['errors']) for t in last_history]) > 0:
+ history_samples = '\n'.join([self.eval.sample_to_text(sample,
+ num_errors_per_label=self.config.meta_prompts.num_err_samples,
+ is_score=False) for sample in last_history])
+ for batch in batch_inputs:
+ extra_samples = self.dataset.sample_records()
+ extra_samples_text = DatasetBase.samples_to_text(extra_samples)
+ batch['history'] = history_samples
+ batch['extra_samples'] = extra_samples_text
+ else:
+ for batch in batch_inputs:
+ extra_samples = self.dataset.sample_records()
+ extra_samples_text = DatasetBase.samples_to_text(extra_samples)
+ batch['history'] = 'No previous errors information'
+ batch['extra_samples'] = extra_samples_text
+
+ samples_batches = self.meta_chain.step_samples.batch_invoke(batch_inputs,
+ self.config.meta_prompts.num_workers)
+ new_samples = [element for sublist in samples_batches for element in sublist['samples']]
+ new_samples = self.dataset.remove_duplicates(new_samples)
+ self.dataset.add(new_samples, self.batch_id)
+ logging.info('Get new samples')
+ self.cur_prompt = prompt_suggestion['prompt']
+
+ def stop_criteria(self):
+ """
+ Check whether the stop criteria hold. The conditions for stopping:
+ 1. Usage is above the threshold
+ 2. There was no improvement for more than `patience` consecutive steps
+ """
+ if 0 < self.config.stop_criteria.max_usage < self.calc_usage():
+ return True
+ if len(self.eval.history) <= self.config.meta_prompts.warmup:
+ self.patient = 0
+ return False
+ min_batch_id, max_score = self.eval.get_max_score(self.config.meta_prompts.warmup-1)
+ if max_score - self.eval.history[-1]['score'] > -self.config.stop_criteria.min_delta:
+ self.patient += 1
+ else:
+ self.patient = 0
+ if self.patient > self.config.stop_criteria.patience:
+ return True
+ return False
+
+ @staticmethod
+ def generate_samples_batch(batch_input, num_samples, batch_size):
+ """
+ Generate samples in batch
+ """
+ batch_num = num_samples // batch_size
+ all_batches = [batch_input.copy() for _ in range(batch_num)]
+ remainder = num_samples - batch_num * batch_size
+ if remainder > 0:
+ all_batches.append(batch_input.copy())
+ all_batches[-1]['num_samples'] = remainder
+ return all_batches
+
+ def generate_initial_samples(self):
+ """
+ In case the initial dataset is empty generate the initial samples
+ """
+ batch_input = {"num_samples": self.config.meta_prompts.samples_generation_batch,
+ "task_description": self.task_description,
+ "instruction": self.cur_prompt}
+ batch_inputs = self.generate_samples_batch(batch_input, self.config.meta_prompts.num_initialize_samples,
+ self.config.meta_prompts.samples_generation_batch)
+
+ samples_batches = self.meta_chain.initial_chain.batch_invoke(batch_inputs, self.config.meta_prompts.num_workers)
+ samples_list = [element for sublist in samples_batches for element in sublist['samples']]
+ samples_list = self.dataset.remove_duplicates(samples_list)
+ self.dataset.add(samples_list, 0)
+
+ def save_state(self):
+ """
+ Save the process state
+ """
+ if self.output_path is None:
+ return
+ logging.info('Save state')
+ self.dataset.save_dataset(self.output_path / 'dataset.csv')
+ state = {'history': self.eval.history, 'batch_id': self.batch_id,
+ 'prompt': self.cur_prompt, 'task_description': self.task_description,
+ 'patient': self.patient}
+ pickle.dump(state, open(self.output_path / 'history.pkl', 'wb'))
+
+ def load_state(self, path: str):
+ """
+ Load a previously saved state
+ """
+ path = Path(path)
+ if (path / 'dataset.csv').is_file():
+ self.dataset.load_dataset(path / 'dataset.csv')
+ if (path / 'history.pkl').is_file():
+ state = pickle.load(open(path / 'history.pkl', 'rb'))
+ self.eval.history = state['history']
+ self.batch_id = state['batch_id']
+ self.cur_prompt = state['prompt']
+ self.task_description = state['task_description']
+ self.patient = state['patient']
+
+ def step(self, current_iter, total_iter):
+ """
+ This is the main optimization process step.
+ """
+ self.log_and_print(f'Starting step {self.batch_id}')
+ if len(self.dataset.records) == 0:
+ self.log_and_print('Dataset is empty generating initial samples')
+ self.generate_initial_samples()
+ if self.config.use_wandb:
+ cur_batch = self.dataset.get_leq(self.batch_id)
+ random_subset = cur_batch.sample(n=min(10, len(cur_batch)))[['text']]
+ self.wandb_run.log(
+ {"Prompt": wandb.Html(f"<p>{self.cur_prompt}</p>"), "Samples": wandb.Table(dataframe=random_subset)},
+ step=self.batch_id)
+
+ logging.info('Running annotator')
+ records = self.annotator.apply(self.dataset, self.batch_id)
+ self.dataset.update(records)
+
+ self.predictor.cur_instruct = self.cur_prompt
+ logging.info('Running Predictor')
+ records = self.predictor.apply(self.dataset, self.batch_id, leq=True)
+ self.dataset.update(records)
+
+ self.eval.dataset = self.dataset.get_leq(self.batch_id)
+ self.eval.eval_score()
+ logging.info('Calculating Score')
+ large_errors = self.eval.extract_errors()
+ self.eval.add_history(self.cur_prompt, self.task_description)
+ if self.config.use_wandb:
+ large_errors = large_errors.sample(n=min(6, len(large_errors)))
+ correct_samples = self.eval.extract_correct()
+ correct_samples = correct_samples.sample(n=min(6, len(correct_samples)))
+ vis_data = pd.concat([large_errors, correct_samples])
+ self.wandb_run.log({"score": self.eval.history[-1]['score'],
+ "prediction_result": wandb.Table(dataframe=vis_data),
+ 'Total usage': self.calc_usage()}, step=self.batch_id)
+ if self.stop_criteria():
+ self.log_and_print('Stop criteria reached')
+ return True
+ if current_iter != total_iter-1:
+ self.run_step_prompt()
+ self.save_state()
+ return False
+
+ def run_pipeline(self, num_steps: int):
+ # Run the optimization pipeline for num_steps
+ num_steps_remaining = num_steps - self.batch_id
+ for i in range(num_steps_remaining):
+ stop_criteria = self.step(i, num_steps_remaining)
+ if stop_criteria:
+ break
+ final_result = self.extract_best_prompt()
+ return final_result
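The patience logic in `stop_criteria` can be hard to follow because of the negated `min_delta` comparison. The sketch below restates it over a plain list of scores: the counter grows whenever the newest score fails to beat the best earlier score by at least `min_delta`, and the run stops once the counter exceeds `patience`. The function name `update_patience` is hypothetical, the usage-based stop is omitted, and `warmup >= 1` is assumed so that the comparison window is never empty.

```python
def update_patience(scores, warmup, min_delta, patience, patient):
    """Return (stop, patient) after observing the newest entry in scores.

    scores: mean score per epoch, newest last.
    patient: the counter carried over from the previous call.
    """
    if len(scores) <= warmup:
        # Still warming up: reset the counter and keep going.
        return False, 0
    # Best score among epochs warmup-1 .. second-to-last (newest excluded).
    best_previous = max(scores[warmup - 1:-1])
    if best_previous - scores[-1] > -min_delta:
        patient += 1  # the newest score did not improve by at least min_delta
    else:
        patient = 0   # a real improvement resets the counter
    return patient > patience, patient
```

Note that stopping requires `patient > patience`, so with `patience=1` the run ends on the second consecutive step without improvement.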
diff --git a/AutoPrompt/prompts/meta_prompts_classification/error_analysis.prompt b/AutoPrompt/prompts/meta_prompts_classification/error_analysis.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..854819a95baabdab02fb7190285b4552f7443e1f
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_classification/error_analysis.prompt
@@ -0,0 +1,24 @@
+Assistant is a large language model designed to provide a high quality analysis for every task.
+You are given the following task description
+{task_description}
+
+Here are the prompt instructions that were given to the model:
+{prompt}
+
+The accuracy for this prompt is: {accuracy}
+The confusion matrix for this prompt is: {confusion_matrix}
+##
+Here is a list of failure cases for the given prompt:
+##Failure Cases:
+{failure_cases}
+
+###
+Note that the ground-truth labels are __absolutely correct__, but the prompts (task descriptions) may be incorrect and need modification.
+Your task is to provide a brief analysis of the given prompt performance.
+Guidelines:
+1. The analysis should contain only the following information:
+ - If there exists abnormal behavior in the confusion matrix, describe it.
+ - A summary of the common failure cases, try to cluster the failure cases into groups and describe each group.
+2. The total length of your analysis should be less than 200 tokens!
+###
+Analysis:
\ No newline at end of file
diff --git a/AutoPrompt/prompts/meta_prompts_classification/initial.prompt b/AutoPrompt/prompts/meta_prompts_classification/initial.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..63c7ad19a905be0c8efa1fedf08f5578b08c1063
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_classification/initial.prompt
@@ -0,0 +1,11 @@
+Assistant is a large language model designed to generate challenging samples for every task.
+Generate a list of {num_samples} challenging samples for the following task.
+### Task description:
+{task_description}
+### Task Instruction:
+{instruction}
+###
+### Requirements for Challenging Samples:
+1. The generated samples must be challenging and diverse such that using the task instruction as a prompt will result in the wrong result.
+2. The number of generated samples from each class in the task instruction should be balanced (i.e. the same number of samples for each class)
+3. The generated samples should be distinct, realistic, and vary significantly to ensure diversity.
\ No newline at end of file
diff --git a/AutoPrompt/prompts/meta_prompts_classification/initial_verbose.prompt b/AutoPrompt/prompts/meta_prompts_classification/initial_verbose.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..f116125866068d7fd35a261ff46439703ad7f9d1
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_classification/initial_verbose.prompt
@@ -0,0 +1,17 @@
+As an advanced language model you should create {num_samples} challenging and unique samples for the task outlined below.
+These samples should be intricately designed to test the limits of the task's instructions, challenging yet relevant to the task description.
+
+### Task Description:
+{task_description}
+
+### Task Instructions:
+{instruction}
+
+### Requirements for Challenging Samples:
+1. Each sample must present a unique and intricate challenge.
+2. The complexity of the samples should be such that simply applying the given task instruction would likely lead to incorrect or incomplete results.
+3. The samples should cover a diverse range of scenarios within the scope of the task, avoiding repetition and predictability.
+4. Ensure that the samples, while challenging, remain realistic and pertinent to the task's context.
+
+Generate the samples keeping these requirements in mind.
+###
\ No newline at end of file
diff --git a/AutoPrompt/prompts/meta_prompts_classification/output_schemes.py b/AutoPrompt/prompts/meta_prompts_classification/output_schemes.py
new file mode 100644
index 0000000000000000000000000000000000000000..5a693e01808f0365fb0a0ae1f6cab9eeaa738fc7
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_classification/output_schemes.py
@@ -0,0 +1,97 @@
+# A file containing the json schema for the output of all the LLM chains
+
+initial_schema = step_samples_schema = {
+ "description": "A List of all results",
+ "properties": {
+ "samples": {
+ "description": "Each sample is a string containing the sample content, without any additional information like the Prediction or GT",
+ "items": {
+ "type": "string"
+ },
+ "title": "Samples",
+ "type": "array"
+ }
+ },
+ "required": [
+ "samples"
+ ],
+ "title": "Sample_List",
+ "type": "object"
+}
+
+
+classification_prediction_schema = {
+ "$defs": {
+ "Result": {
+ "description": "A single result",
+ "properties": {
+ "id": {
+ "description": "The sample id",
+ "title": "Id",
+ "type": "integer"
+ },
+ "prediction": {
+ "description": "The prediction of the sample.",
+ "title": "Prediction",
+ "type": "string"
+ }
+ },
+ "required": [
+ "id",
+ "prediction"
+ ],
+ "title": "Result",
+ "type": "object"
+ }
+ },
+ "description": "A List of task classification results",
+ "properties": {
+ "results": {
+ "description": "Each item contain the id and the prediction of the sample",
+ "items": {
+ "$ref": "#/$defs/Result"
+ },
+ "title": "Results",
+ "type": "array"
+ }
+ },
+ "required": [
+ "results"
+ ],
+ "title": "Results_List",
+ "type": "object"
+}
+
+
+step_prompt_schema = {
+ "description": "A prompt suggestion that is expected to get a high score, and the associated score prediction",
+ "properties": {
+ "prompt": {
+ "description": "The prompt prediction",
+ "title": "Prompt",
+ "type": "string"
+ },
+ "score": {
+ "description": "The score prediction",
+ "title": "Score",
+ "type": "number"
+ }
+ },
+ "required": [
+ "prompt",
+ "score"
+ ],
+ "title": "Suggested_Prompt",
+ "type": "object"
+}
+
+def update_classification_prediction_schema(label_schema:list)->dict:
+ """
+ Updates the classification prediction schema with the label schema from the yaml file
+ :param label_schema: The list of valid labels, taken from the yaml file
+ """
+
+ classification_prediction_schema['$defs']['Result']['properties']['prediction']['enum'] = label_schema
+ classification_prediction_schema['$defs']['Result']['properties']['prediction'][
+ 'description'] += 'The answer must be one of the following options: {} !!'.format(label_schema)
+ return classification_prediction_schema
\ No newline at end of file
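`update_classification_prediction_schema` restricts the `prediction` field to the configured labels by adding a JSON Schema `enum`. It edits the module-level dict in place, so calling it twice appends the label hint to the description twice. The sketch below shows a non-mutating variant of the same idea. It is an illustrative alternative rather than the repository's implementation, and `BASE_SCHEMA` and `with_label_enum` are hypothetical names.

```python
import copy

# A trimmed-down stand-in for classification_prediction_schema.
BASE_SCHEMA = {
    "$defs": {
        "Result": {
            "properties": {
                "id": {"type": "integer"},
                "prediction": {"description": "The prediction of the sample.",
                               "type": "string"},
            },
            "required": ["id", "prediction"],
            "type": "object",
        }
    },
    "type": "object",
}


def with_label_enum(schema, labels):
    """Return a copy of schema whose prediction field only allows labels."""
    updated = copy.deepcopy(schema)  # leave the shared template untouched
    prediction = updated["$defs"]["Result"]["properties"]["prediction"]
    prediction["enum"] = list(labels)
    prediction["description"] += (
        " The answer must be one of the following options: {}".format(list(labels)))
    return updated
```

Working on a deep copy keeps the template reusable, for example when several pipelines with different label sets run in one process.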
diff --git a/AutoPrompt/prompts/meta_prompts_classification/step_prompt.prompt b/AutoPrompt/prompts/meta_prompts_classification/step_prompt.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..09ceb541f9bc9bf6b1f761acd093af92390ee093
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_classification/step_prompt.prompt
@@ -0,0 +1,21 @@
+Assistant is a large language model designed to provide the best prompt for every task.
+Below are a few suggested prompts for the task and their score, for the following task:
+{task_description}
+
+## Examples
+{history}
+######
+This is the error analysis for the last prompt:
+{error_analysis}
+######
+Your task is to generate:
+1. A new prompt that is
+ -Different from all the prompts above
+ -Follows exactly the error analysis modification suggestions, and fixes the prompt to prevent the failure cases.
+ -Has a higher score than all the prompts above.
+2. The predicted score of this prompt
+
+You must adhere to the error analysis instructions, even if there seems to be a contradiction between these instructions and the task! The error analysis is tested on a ground truth, and thus represents the exact intent of the task.
+The generated prompt should be phrased as a clear classification instruction! It should not include any instructions or descriptions of the modifications that should be made to the prompt.
+Note that the previous prompt contains implicit assumptions about the intent of the task that might be incorrect. You should replace these assumptions with more accurate ones, using the scores of the previous prompts and the error analysis.
+The result prompt should indicate that the task is a classification task with the following labels {labels}!
\ No newline at end of file
diff --git a/AutoPrompt/prompts/meta_prompts_classification/step_prompt_verbose.prompt b/AutoPrompt/prompts/meta_prompts_classification/step_prompt_verbose.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..18917a8ed6bbc6cb5f430242cc546563f23c0a0b
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_classification/step_prompt_verbose.prompt
@@ -0,0 +1,25 @@
+Assistant is a large language model designed to provide the best prompt for every task.
+Below are a few suggested prompts for the task and their score, for the following task:
+{task_description}. The exact intent of this task description should be determined by an external annotated dataset of challenging cases. The error analysis of this dataset is provided.
+## Examples
+{history}
+######
+For the last prompt we got the following error analysis:
+{error_analysis}
+######
+Your task is to generate:
+1. A new prompt that is
+ -Different from all the prompts above
+ -Follows exactly the error analysis modification suggestions, and fixes the prompt to prevent the failure cases.
+ -Has a higher score than all the prompts above.
+2. The predicted score of this prompt
+
+Guidelines for the new prompt:
+1. The prompt is given a 'scratchpad', which the model can use to extract relevant information from the sample text and to reason its way to the correct decision
+2. The prompt is intended for a shallow LLM, which does not have access to previous failure cases or the analysis! It only has access to the generated new prompt, which should be independent of the previous prompts.
+3. Lists can organize the information and help the prompt (for example a list of rules and a list of samples); the lists should be short and accurate
+4. Note that the prompts and task descriptions may be inaccurate and need modification.
+5. Note that a higher score means a better prompt.
+6. The result prompt should indicate that the task is a classification task with the following labels {labels}!
+
+Randomly sample a number between 0 and 2. If the result is zero, __change completely__ the generated prompt, including the instruction, the structure and the phrasing!
\ No newline at end of file
diff --git a/AutoPrompt/prompts/meta_prompts_classification/step_samples.prompt b/AutoPrompt/prompts/meta_prompts_classification/step_samples.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..adfabec8648edd6b3d33766ad31620f71cb8a804
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_classification/step_samples.prompt
@@ -0,0 +1,24 @@
+Assistant is a large language model designed to generate challenging samples for every task.
+Below are a few prompts that were built to answer the given task description, and their failure cases.
+Task description:
+{task_description}
+
+## Examples of common failures, each sample is followed by the model prediction and the GT (ground truth)
+{history}
+######
+Here are a few unique samples derived from realistic scenarios for the task outlined above.
+## Realistic Samples
+{extra_samples}
+#####
+This was the new proposed prompt:
+## Prompt
+{prompt}
+
+Your task is to generate {num_samples} samples by following these guidelines:
+1. The generated samples should be diverse
+2. They should preserve the style and the length of the given examples
+3. The samples must be challenging and hard to classify by the model. This can be achieved by:
+ 1. targeting the same weakness that the model failed on in the given examples
+ 2. targeting weaknesses that are different from the existing examples in the failure cases
+4. The number of generated samples from each class should be almost balanced (i.e. the same number of samples for each class)
+5. The generated samples should include only the sample content without additional information! (like the model prediction and the ground truth)
\ No newline at end of file
diff --git a/AutoPrompt/prompts/meta_prompts_completion/error_analysis.prompt b/AutoPrompt/prompts/meta_prompts_completion/error_analysis.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..854819a95baabdab02fb7190285b4552f7443e1f
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_completion/error_analysis.prompt
@@ -0,0 +1,24 @@
+Assistant is a large language model designed to provide a high quality analysis for every task.
+You are given the following task description
+{task_description}
+
+Here are the prompt instructions that were given to the model:
+{prompt}
+
+The accuracy for this prompt is: {accuracy}
+The confusion matrix for this prompt is: {confusion_matrix}
+##
+Here is a list of failure cases for the given prompt:
+##Failure Cases:
+{failure_cases}
+
+###
+Note that the ground-truth labels are __absolutely correct__, but the prompts (task descriptions) may be incorrect and need modification.
+Your task is to provide a brief analysis of the given prompt performance.
+Guidelines:
+1. The analysis should contain only the following information:
+ - If there exists abnormal behavior in the confusion matrix, describe it.
+ - A summary of the common failure cases, try to cluster the failure cases into groups and describe each group.
+2. The total length of your analysis should be less than 200 tokens!
+###
+Analysis:
\ No newline at end of file
diff --git a/AutoPrompt/prompts/meta_prompts_completion/initial.prompt b/AutoPrompt/prompts/meta_prompts_completion/initial.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..a62134052d89d1eb8cfc427a92366b053faaae34
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_completion/initial.prompt
@@ -0,0 +1,16 @@
+Assistant is a large language model designed to generate challenging samples for every task.
+Generate a list of {num_samples} challenging samples for the following task.
+### Task description:
+{task_description}
+### Task Instruction:
+{instruction}
+###
+The generated samples should be challenging and diverse such that using the task instruction as a prompt will result in the wrong result.
+
+Answer in the following format:
+#### Sample 1:
+
+#### Sample 2:
+
+############
+Results:
diff --git a/AutoPrompt/prompts/meta_prompts_completion/output_schemes.py b/AutoPrompt/prompts/meta_prompts_completion/output_schemes.py
new file mode 100644
index 0000000000000000000000000000000000000000..8ff6fff4a64794d453f4e750a3702098cd5b230e
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_completion/output_schemes.py
@@ -0,0 +1,40 @@
+# A file containing the parser for the output of all the LLM chains
+import re
+
+def initial_parser(response: dict) -> dict:
+    """
+    Parse the response from the LLM chain
+    :param response: The response from the LLM chain
+    :return: The parsed response
+    """
+    pattern = r'(#### Sample \d+:)([\s\S]*?)(?=(#### Sample \d+:|$))'
+
+    matches = re.findall(pattern, response['text'])
+    results = {'samples': []}
+    for match in matches:
+        header, content = match[0], match[1]
+        results['samples'].append(content.strip())
+    return results
+
+step_samples_parser = initial_parser
+
+def step_prompt_parser(response: dict) -> dict:
+    """
+    Parse the response from the LLM chain
+    :param response: The response from the LLM chain
+    :return: The parsed response
+    """
+    pattern = re.compile(r"#### prompt:\n(?P<prompt>.*?)\n#### score:\n(?P<score>[\d.]+)", re.DOTALL)
+    match = pattern.search(response['text'])
+    if match:
+        result = {
+            'prompt': match.group('prompt'),
+            'score': float(match.group('score'))
+        }
+        return result
+    else:
+        result = {
+            'prompt': '',
+            'score': 0.0
+        }
+        return result
\ No newline at end of file
diff --git a/AutoPrompt/prompts/meta_prompts_completion/step_prompt.prompt b/AutoPrompt/prompts/meta_prompts_completion/step_prompt.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..c1c6af4164e1b64dbd1970a43dd743c246c602f5
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_completion/step_prompt.prompt
@@ -0,0 +1,29 @@
+Assistant is a large language model designed to provide the best prompt for every task.
+Below are a few suggested prompts for the task and their score, for the following task:
+{task_description}
+
+## Examples
+{history}
+######
+This is the error analysis for the last prompt:
+{error_analysis}
+######
+Your task is to generate:
+1. A new prompt that is
+ -Different from all the prompts above
+ -Follows exactly the error analysis modification suggestions, and fixes the prompt to prevent the failure cases.
+ -Has a higher score than all the prompts above.
+2. The predicted score of this prompt
+
+You must adhere to the error analysis instructions, even if they seem to contradict the task! The error analysis is tested on a ground truth, and thus represents the exact intent of the task.
+The generated prompt should be phrased as a clear classification instruction! It should not include any instructions or descriptions of the modifications that should be made to the prompt.
+Note that the previous prompt contains implicit assumptions about the intent of the task that might be incorrect. You should replace these assumptions with more accurate ones, using the scores of the previous prompts and the error analysis.
+The resulting prompt should indicate that the task is a classification task with the following labels {labels}!
+
+Answer in the following format:
+#### prompt:
+
+#### score:
+
+############
+Results:
diff --git a/AutoPrompt/prompts/meta_prompts_completion/step_samples.prompt b/AutoPrompt/prompts/meta_prompts_completion/step_samples.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..0ec1949dc1e547b14e048ea79b755064bd20c7f8
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_completion/step_samples.prompt
@@ -0,0 +1,18 @@
+Assistant is a large language model designed to generate challenging samples for every task.
+Below are a few prompts and their failure cases, for the following task:
+{task_description}
+
+## Examples of common failure
+{history}
+######
+Your task is to generate {num_samples} challenging and diverse samples that will confuse the model with the following prompt:
+## Prompt
+{prompt}
+
+Answer in the following format:
+#### Sample 1:
+
+#### Sample 2:
+
+############
+Results:
diff --git a/AutoPrompt/prompts/meta_prompts_generation/error_analysis.prompt b/AutoPrompt/prompts/meta_prompts_generation/error_analysis.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..7049596d41a5cc4a54bb06828f313daa0669a446
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_generation/error_analysis.prompt
@@ -0,0 +1,25 @@
+Assistant is a large language model designed to provide a high quality analysis for every task.
+You are given the following task description
+{task_description}
+
+Here are the prompt instructions that were given to the model:
+{prompt}
+
+An expert ranker evaluated the model's performance on the given task description
+and ranked it according to the following scale: {labels}
+
+The mean score for this prompt is: {accuracy}
+##
+Here is a list of challenging cases for the given prompt and their rank:
+##Challenging Cases:
+{failure_cases}
+
+###
+Note that the ranker labels are __absolutely correct__, but the prompts (task descriptions) may be incorrect and need modification.
+Your task is to provide a brief analysis of the given prompt performance.
+Guidelines:
+1. The analysis should contain only the following information:
+ - A summary of the common mistakes of the prompt and the ways it can improve its generation. Try to cluster the failure cases into groups and describe each group.
+2. The total length of your analysis should be less than 200 tokens!
+###
+Analysis:
\ No newline at end of file
diff --git a/AutoPrompt/prompts/meta_prompts_generation/initial.prompt b/AutoPrompt/prompts/meta_prompts_generation/initial.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..3212682e678840ec4f9c9a258aab672a4c489583
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_generation/initial.prompt
@@ -0,0 +1,20 @@
+As an advanced language model you should create {num_samples} challenging and unique prompts for the task outlined below.
+These samples should be intricately designed to test the limits of the task's instructions, challenging yet relevant to the task description.
+
+The task description and instruction are phrased as a generative task. The resulting prompt samples should serve as input to the model.
+The model will then be able to generate an example given the instructions and the prompt input.
+
+### Task Description:
+{task_description}
+
+### Task Instructions:
+{instruction}
+
+### Requirements for Challenging Samples:
+1. Each prompt must present a unique and intricate challenge.
+2. The prompts should cover a diverse range of scenarios within the scope of the task, avoiding repetition and predictability.
+3. Each prompt should contain only the prompt part, without also generating the results
+4. Each prompt should contain only the prompt part, without any mention of the task description or instructions!!
+
+Generate the prompt samples keeping these requirements in mind.
+###
\ No newline at end of file
diff --git a/AutoPrompt/prompts/meta_prompts_generation/output_schemes.py b/AutoPrompt/prompts/meta_prompts_generation/output_schemes.py
new file mode 100644
index 0000000000000000000000000000000000000000..13dc8e8d4ac0bd96e397f855e4b2da1352b951ae
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_generation/output_schemes.py
@@ -0,0 +1,97 @@
+# A file containing the json schema for the output of all the LLM chains
+
+initial_schema = step_samples_schema = {
+ "description": "A List of all results",
+ "properties": {
+ "samples": {
+ "description": "Each sample is a string containing only the prompt sample content, without any additional information",
+ "items": {
+ "type": "string"
+ },
+ "title": "Samples",
+ "type": "array"
+ }
+ },
+ "required": [
+ "samples"
+ ],
+ "title": "Sample_List",
+ "type": "object"
+}
+
+
+classification_prediction_schema = {
+ "$defs": {
+ "Result": {
+ "description": "A single result",
+ "properties": {
+ "id": {
+ "description": "The sample id",
+ "title": "Id",
+ "type": "integer"
+ },
+ "prediction": {
+ "description": "The prediction of the sample.",
+ "title": "Prediction",
+ "type": "string"
+ }
+ },
+ "required": [
+ "id",
+ "prediction"
+ ],
+ "title": "Result",
+ "type": "object"
+ }
+ },
+ "description": "A List of task classification results",
+ "properties": {
+ "results": {
+ "description": "Each item contains the id and the prediction of the sample",
+ "items": {
+ "$ref": "#/$defs/Result"
+ },
+ "title": "Results",
+ "type": "array"
+ }
+ },
+ "required": [
+ "results"
+ ],
+ "title": "Results_List",
+ "type": "object"
+}
+
+
+step_prompt_schema = {
+ "description": "A prompt suggestion which is expected to get a high score, and the associated score prediction",
+ "properties": {
+ "prompt": {
+ "description": "The prompt prediction",
+ "title": "Prompt",
+ "type": "string"
+ },
+ "score": {
+ "description": "The score prediction",
+ "title": "Score",
+ "type": "number"
+ }
+ },
+ "required": [
+ "prompt",
+ "score"
+ ],
+ "title": "Suggested_Prompt",
+ "type": "object"
+}
+
+def update_classification_prediction_schema(label_schema:list)->dict:
+    """
+    Updates the classification prediction schema with the label schema from the yaml file
+    :param label_schema: The list of allowed labels
+    """
+
+    classification_prediction_schema['$defs']['Result']['properties']['prediction']['enum'] = label_schema
+    classification_prediction_schema['$defs']['Result']['properties']['prediction'][
+        'description'] += 'The answer must be one of the following options: {} !!'.format(label_schema)
+    return classification_prediction_schema
\ No newline at end of file
diff --git a/AutoPrompt/prompts/meta_prompts_generation/step_prompt.prompt b/AutoPrompt/prompts/meta_prompts_generation/step_prompt.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..dba740063adcc1bb5d29b78d5775ffefc7302345
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_generation/step_prompt.prompt
@@ -0,0 +1,20 @@
+Assistant is a large language model designed to provide the best instructions for every task.
+Below are a few suggested instructions for the task and score (mean of the rank), for the following task description:
+{task_description}
+
+## Examples
+{history}
+######
+This is the analysis for the last instruction:
+{error_analysis}
+######
+Your task is to generate:
+1. A new instruction that is
+ -Different from all the instructions above
+ -Follows exactly the error analysis modification suggestions, and fixes the instruction to improve its quality.
+ -Has a higher score than all the instructions above.
+2. The predicted score of this instruction
+
+You must adhere to the error analysis instructions, even if they seem to contradict the task! The error analysis was evaluated by an expert ranker, and thus represents the exact intent of the task.
+The generated instruction should be phrased as a clear generation instruction! It should not include any instructions or descriptions of the modifications that should be made to the instruction.
+Note that the previous instruction contains implicit assumptions about the intent of the task that might be incorrect. You should replace these assumptions with more accurate ones, using the scores of the previous instructions and the error analysis.
\ No newline at end of file
diff --git a/AutoPrompt/prompts/meta_prompts_generation/step_samples.prompt b/AutoPrompt/prompts/meta_prompts_generation/step_samples.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..2b049cadfbc44834e209dc02ef625b5795e483b8
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_generation/step_samples.prompt
@@ -0,0 +1,24 @@
+Assistant is a large language model designed to generate challenging samples for every task.
+Below are a few prompts that were built to answer the given task description and their failure cases.
+Task description:
+{task_description}
+
+## Examples, each sample is followed by the model prediction and the GT (ground truth)
+{history}
+######
+Here are a few unique samples derived from realistic scenarios for the task outlined above.
+## Realistic Samples
+{extra_samples}
+#####
+This was the new proposed prompt:
+## Prompt
+{prompt}
+
+Your task is to generate {num_samples} samples by following these guidelines:
+1. The generated samples should be diverse
+2. They should preserve the style and the length of the given examples
+3. The samples must be challenging and hard to classify by the model. This can be achieved by:
+ 1. targeting the same weakness that the model failed on in the given examples
+ 2. targeting weaknesses that are different from the existing examples in the failure cases
+4. The number of generated samples from each class should be almost balanced (i.e. the same number of samples for each class)
+5. The generated samples should include only the sample content without additional information! (like the model prediction and the ground truth)
\ No newline at end of file
diff --git a/AutoPrompt/prompts/meta_prompts_ranking/error_analysis.prompt b/AutoPrompt/prompts/meta_prompts_ranking/error_analysis.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..854819a95baabdab02fb7190285b4552f7443e1f
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_ranking/error_analysis.prompt
@@ -0,0 +1,24 @@
+Assistant is a large language model designed to provide a high quality analysis for every task.
+You are given the following task description
+{task_description}
+
+Here are the prompt instructions that were given to the model:
+{prompt}
+
+The accuracy for this prompt is: {accuracy}
+The confusion matrix for this prompt is: {confusion_matrix}
+##
+Here is a list of failure cases for the given prompt:
+##Failure Cases:
+{failure_cases}
+
+###
+Note that the ground-truth labels are __absolutely correct__, but the prompts (task descriptions) may be incorrect and need modification.
+Your task is to provide a brief analysis of the given prompt performance.
+Guidelines:
+1. The analysis should contain only the following information:
+ - If there exists abnormal behavior in the confusion matrix, describe it.
+ - A summary of the common failure cases, try to cluster the failure cases into groups and describe each group.
+2. The total length of your analysis should be less than 200 tokens!
+###
+Analysis:
\ No newline at end of file
diff --git a/AutoPrompt/prompts/meta_prompts_ranking/initial.prompt b/AutoPrompt/prompts/meta_prompts_ranking/initial.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..dcfaae0de5eafed0b85cf60634722b571a813a09
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_ranking/initial.prompt
@@ -0,0 +1,17 @@
+Assistant is a large language model designed to generate challenging samples for every task.
+Generate a list of {num_samples} challenging samples for the following task.
+### Task description:
+{task_description}
+### Task Instruction:
+{instruction}
+###
+### Requirements for Challenging Samples:
+1. The generated samples must be challenging and diverse such that using the task instruction as a prompt will result in the wrong result.
+2. The generated samples must be only from the top two scores! With equal distribution between the two.
+3. The generated samples should be distinct, realistic, and vary significantly to ensure diversity.
+
+If the task depends on both a context (or user input) and generated content, then the sample content must include all the relevant parts.
+ -In this case the sample content structure should be as follows:
+ 1. First write the required context or user input.
+ 2. Then write the generated content of the model on this context or user input.
+ The style of the separation and the indication of the different parts, should be different in each sample.
\ No newline at end of file
diff --git a/AutoPrompt/prompts/meta_prompts_ranking/initial_verbose.prompt b/AutoPrompt/prompts/meta_prompts_ranking/initial_verbose.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..f116125866068d7fd35a261ff46439703ad7f9d1
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_ranking/initial_verbose.prompt
@@ -0,0 +1,17 @@
+As an advanced language model you should create {num_samples} challenging and unique samples for the task outlined below.
+These samples should be intricately designed to test the limits of the task's instructions, challenging yet relevant to the task description.
+
+### Task Description:
+{task_description}
+
+### Task Instructions:
+{instruction}
+
+### Requirements for Challenging Samples:
+1. Each sample must present a unique and intricate challenge.
+2. The complexity of the samples should be such that simply applying the given task instruction would likely lead to incorrect or incomplete results.
+3. The samples should cover a diverse range of scenarios within the scope of the task, avoiding repetition and predictability.
+4. Ensure that the samples, while challenging, remain realistic and pertinent to the task's context.
+
+Generate the samples keeping these requirements in mind.
+###
\ No newline at end of file
diff --git a/AutoPrompt/prompts/meta_prompts_ranking/output_schemes.py b/AutoPrompt/prompts/meta_prompts_ranking/output_schemes.py
new file mode 100644
index 0000000000000000000000000000000000000000..5a693e01808f0365fb0a0ae1f6cab9eeaa738fc7
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_ranking/output_schemes.py
@@ -0,0 +1,97 @@
+# A file containing the json schema for the output of all the LLM chains
+
+initial_schema = step_samples_schema = {
+ "description": "A List of all results",
+ "properties": {
+ "samples": {
+ "description": "Each sample is a string containing the sample content, without any additional information like the Prediction or GT",
+ "items": {
+ "type": "string"
+ },
+ "title": "Samples",
+ "type": "array"
+ }
+ },
+ "required": [
+ "samples"
+ ],
+ "title": "Sample_List",
+ "type": "object"
+}
+
+
+classification_prediction_schema = {
+ "$defs": {
+ "Result": {
+ "description": "A single result",
+ "properties": {
+ "id": {
+ "description": "The sample id",
+ "title": "Id",
+ "type": "integer"
+ },
+ "prediction": {
+ "description": "The prediction of the sample.",
+ "title": "Prediction",
+ "type": "string"
+ }
+ },
+ "required": [
+ "id",
+ "prediction"
+ ],
+ "title": "Result",
+ "type": "object"
+ }
+ },
+ "description": "A List of task classification results",
+ "properties": {
+ "results": {
+ "description": "Each item contains the id and the prediction of the sample",
+ "items": {
+ "$ref": "#/$defs/Result"
+ },
+ "title": "Results",
+ "type": "array"
+ }
+ },
+ "required": [
+ "results"
+ ],
+ "title": "Results_List",
+ "type": "object"
+}
+
+
+step_prompt_schema = {
+ "description": "A prompt suggestion which is expected to get a high score, and the associated score prediction",
+ "properties": {
+ "prompt": {
+ "description": "The prompt prediction",
+ "title": "Prompt",
+ "type": "string"
+ },
+ "score": {
+ "description": "The score prediction",
+ "title": "Score",
+ "type": "number"
+ }
+ },
+ "required": [
+ "prompt",
+ "score"
+ ],
+ "title": "Suggested_Prompt",
+ "type": "object"
+}
+
+def update_classification_prediction_schema(label_schema:list)->dict:
+    """
+    Updates the classification prediction schema with the label schema from the yaml file
+    :param label_schema: The list of allowed labels
+    """
+
+    classification_prediction_schema['$defs']['Result']['properties']['prediction']['enum'] = label_schema
+    classification_prediction_schema['$defs']['Result']['properties']['prediction'][
+        'description'] += 'The answer must be one of the following options: {} !!'.format(label_schema)
+    return classification_prediction_schema
\ No newline at end of file
diff --git a/AutoPrompt/prompts/meta_prompts_ranking/step_prompt.prompt b/AutoPrompt/prompts/meta_prompts_ranking/step_prompt.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..09ceb541f9bc9bf6b1f761acd093af92390ee093
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_ranking/step_prompt.prompt
@@ -0,0 +1,21 @@
+Assistant is a large language model designed to provide the best prompt for every task.
+Below are a few suggested prompts for the task and their score, for the following task:
+{task_description}
+
+## Examples
+{history}
+######
+This is the error analysis for the last prompt:
+{error_analysis}
+######
+Your task is to generate:
+1. A new prompt that is
+ -Different from all the prompts above
+ -Follows exactly the error analysis modification suggestions, and fixes the prompt to prevent the failure cases.
+ -Has a higher score than all the prompts above.
+2. The predicted score of this prompt
+
+You must adhere to the error analysis instructions, even if they seem to contradict the task! The error analysis is tested on a ground truth, and thus represents the exact intent of the task.
+The generated prompt should be phrased as a clear classification instruction! It should not include any instructions or descriptions of the modifications that should be made to the prompt.
+Note that the previous prompt contains implicit assumptions about the intent of the task that might be incorrect. You should replace these assumptions with more accurate ones, using the scores of the previous prompts and the error analysis.
+The resulting prompt should indicate that the task is a classification task with the following labels {labels}!
\ No newline at end of file
diff --git a/AutoPrompt/prompts/meta_prompts_ranking/step_prompt_verbose.prompt b/AutoPrompt/prompts/meta_prompts_ranking/step_prompt_verbose.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..18917a8ed6bbc6cb5f430242cc546563f23c0a0b
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_ranking/step_prompt_verbose.prompt
@@ -0,0 +1,25 @@
+Assistant is a large language model designed to provide the best prompt for every task.
+Below are a few suggested prompts for the task and their score, for the following task:
+{task_description}. The exact intent of this task description should be determined by an external annotated dataset of challenging cases. The error analysis of this dataset is provided.
+## Examples
+{history}
+######
+For the last prompt we got the following error analysis:
+{error_analysis}
+######
+Your task is to generate:
+1. A new prompt that is
+ -Different from all the prompts above
+ -Follows exactly the error analysis modification suggestions, and fixes the prompt to prevent the failure cases.
+ -Has a higher score than all the prompts above.
+2. The predicted score of this prompt
+
+Guidelines for the new prompt:
+1. The prompt is given a 'scratchpad', which it can use to extract relevant information from the sample text and to reason its way to the correct decision.
+2. The prompt is intended for a shallow LLM, which does not have access to previous failure cases or the analysis! It has access only to the newly generated prompt, which should be independent of the previous prompts.
+3. Lists can organize the information and help the prompt (for example, a list of rules and a list of samples); the lists should be short and accurate.
+4. Note that the prompts and task descriptions may be inaccurate and need modification.
+5. Note that a higher score means a better prompt.
+6. The resulting prompt should indicate that the task is a classification task with the following labels {labels}!
+
+Sample randomly a number between 1 to 3. If the result is zero __change completely__ the generated prompt! including the instruction, the structure and the phrasing!
\ No newline at end of file
diff --git a/AutoPrompt/prompts/meta_prompts_ranking/step_samples.prompt b/AutoPrompt/prompts/meta_prompts_ranking/step_samples.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..9ae9598227b1ed33316359ad8da6a4322e8f2a02
--- /dev/null
+++ b/AutoPrompt/prompts/meta_prompts_ranking/step_samples.prompt
@@ -0,0 +1,29 @@
+Assistant is a large language model designed to generate challenging samples for every task.
+Below are a few prompts that were built to answer the given task description and their failure cases.
+Task description:
+{task_description}
+
+## Examples of common failure, each sample is followed by the model prediction and the GT (ground truth)
+{history}
+######
+Here are a few unique samples derived from realistic scenarios for the task outlined above.
+## Realistic Samples
+{extra_samples}
+#####
+This was the new proposed prompt:
+## Prompt
+{prompt}
+
+Your task is to generate {num_samples} samples by following these guidelines:
+1. The generated samples should be diverse
+2. They should preserve the style and the length of the given examples
+3. The samples must be challenging and hard to classify by the model. This can be achieved by:
+ 1. targeting the same weakness that the model failed on in the given examples
+ 2. targeting weaknesses that are different from the existing examples in the failure cases
+4. The generated samples must be only from the top two scores! With equal distribution between the two!
+
+If the task depends on both a context (or user input) and generated content, then the sample content must include all the relevant parts.
+ -In this case the sample content structure should be as follows:
+ 1. First write the required context or user input.
+ 2. Then write the generated content of the model on this context or user input.
+ The style of the separation and the indication of the different parts, should be different in each sample.
\ No newline at end of file
diff --git a/AutoPrompt/prompts/modifiers/modifiers.yml b/AutoPrompt/prompts/modifiers/modifiers.yml
new file mode 100644
index 0000000000000000000000000000000000000000..da07ced88de8728220b1de543c95cd1356d9961f
--- /dev/null
+++ b/AutoPrompt/prompts/modifiers/modifiers.yml
@@ -0,0 +1,4 @@
+
+ranker:
+ prompt_mod: 'prompts/modifiers/ranker_prompt_mod.prompt'
+ task_desc_mod: 'prompts/modifiers/ranker_task_desc_mod.prompt'
diff --git a/AutoPrompt/prompts/modifiers/ranker_prompt_mod.prompt b/AutoPrompt/prompts/modifiers/ranker_prompt_mod.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..bf8dd1403abd66fec31517ed1cedcca1870f7b1a
--- /dev/null
+++ b/AutoPrompt/prompts/modifiers/ranker_prompt_mod.prompt
@@ -0,0 +1,10 @@
+Assistant is a large language model designed to generate instructions for every task.
+You are given an instruction phrased as a text generation task.
+Your task is to write an instruction for a classification ranking task that is supposed to evaluate the quality of a generated sample given a user prompt for this generative instruction.
+Guidelines:
+1. The classifier labels are {label_schema}. The resulting instruction should indicate explicitly that the task is a classification task with the following labels {label_schema}!
+2. The generated instruction must also evaluate how well the generated sample adheres to the user prompt
+#####
+Input generative instruction: {prompt}
+#####
+Rephrased classification quality evaluation instruction:
\ No newline at end of file
diff --git a/AutoPrompt/prompts/modifiers/ranker_task_desc_mod.prompt b/AutoPrompt/prompts/modifiers/ranker_task_desc_mod.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..22a41c9ca4c855d4fd347f061baad03f1e082449
--- /dev/null
+++ b/AutoPrompt/prompts/modifiers/ranker_task_desc_mod.prompt
@@ -0,0 +1,6 @@
+Assistant is a large language model designed to generate a task description.
+You are given a task description phrased as a text generation task given some user input. Your task is to rephrase it as a task that is supposed to evaluate the quality of the given generative task and how well it adheres to the user input.
+#####
+Input task description: {task_description}
+#####
+Rephrased task description:
\ No newline at end of file
diff --git a/AutoPrompt/prompts/predictor/output_schemes.py b/AutoPrompt/prompts/predictor/output_schemes.py
new file mode 100644
index 0000000000000000000000000000000000000000..a35aaeabeda4e0fad00604ab1b58c3070a67995b
--- /dev/null
+++ b/AutoPrompt/prompts/predictor/output_schemes.py
@@ -0,0 +1,55 @@
+# A file containing the json schema for the output of all the LLM chains
+
+prediction_schema = {
+ "$defs": {
+ "Result": {
+ "description": "A single result",
+ "properties": {
+ "id": {
+ "description": "The sample id",
+ "title": "Id",
+ "type": "integer"
+ },
+ "prediction": {
+ "description": "The prediction of the sample.",
+ "title": "Prediction",
+ "type": "string"
+ }
+ },
+ "required": [
+ "id",
+ "prediction"
+ ],
+ "title": "Result",
+ "type": "object"
+ }
+ },
+ "description": "A List of task classification results",
+ "properties": {
+ "results": {
+ "description": "Each item contains the id and the prediction of the sample",
+ "items": {
+ "$ref": "#/$defs/Result"
+ },
+ "title": "Results",
+ "type": "array"
+ }
+ },
+ "required": [
+ "results"
+ ],
+ "title": "Results_List",
+ "type": "object"
+}
+
+
+def update_classification_prediction_schema(schema, label_schema:list)->dict:
+    """
+    Updates the classification prediction schema with the label schema from the yaml file
+    :param label_schema: The list of allowed labels, added to `schema` in place
+    """
+
+    schema['$defs']['Result']['properties']['prediction']['enum'] = label_schema
+    schema['$defs']['Result']['properties']['prediction'][
+        'description'] += 'The answer must be one of the following options: {} !!'.format(label_schema)
+    return schema
\ No newline at end of file
diff --git a/AutoPrompt/prompts/predictor/prediction.prompt b/AutoPrompt/prompts/predictor/prediction.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..4e3fff2e31d0b25666c2d1b274048b7d5348bdf0
--- /dev/null
+++ b/AutoPrompt/prompts/predictor/prediction.prompt
@@ -0,0 +1,9 @@
+Assistant is a large language model designed to classify challenging language tasks.
+Given a list of {batch_size} samples classify them according to the following task
+### Task Instruction:
+{task_instruction}
+
+### list of samples:
+{samples}
+##
+Remember to carefully follow the exact task instructions!
\ No newline at end of file
diff --git a/AutoPrompt/prompts/predictor_completion/output_schemes.py b/AutoPrompt/prompts/predictor_completion/output_schemes.py
new file mode 100644
index 0000000000000000000000000000000000000000..16d8e48f235563c7cbdf74ef487e02037740e42f
--- /dev/null
+++ b/AutoPrompt/prompts/predictor_completion/output_schemes.py
@@ -0,0 +1,26 @@
+# A file containing the json schema for the output of all the LLM chains
+# A file containing the parser for the output of all the LLM chains
+import re
+
+
+def prediction_parser(response: dict) -> dict:
+    """
+    Parse the response from the LLM chain
+    :param response: The response from the LLM chain
+    :return: The parsed response
+    """
+    pattern = re.compile(r'Sample (\d+): (\w+)')
+    matches = pattern.findall(response['text'])
+    predictions = [{'id': int(match[0]), 'prediction': match[1]} for match in matches]
+    return {'results': predictions}
+
+def prediction_generation_parser(response: dict) -> dict:
+    """
+    Parse the response from the LLM chain
+    :param response: The response from the LLM chain
+    :return: The parsed response
+    """
+    pattern = re.compile(r'Sample (\d+): (.*?)(?=<eos>|$)', re.DOTALL)
+    matches = pattern.findall(response['text'])
+    predictions = [{'id': int(match[0]), 'prediction': match[1].strip()} for match in matches]
+    return {'results': predictions}
diff --git a/AutoPrompt/prompts/predictor_completion/prediction.prompt b/AutoPrompt/prompts/predictor_completion/prediction.prompt
new file mode 100644
index 0000000000000000000000000000000000000000..356ca1bb2e2f639564ece88a1ca2b396ad035986
--- /dev/null
+++ b/AutoPrompt/prompts/predictor_completion/prediction.prompt
@@ -0,0 +1,11 @@
+Assistant is a large language model designed to classify challenging language tasks.
+Given a list of {batch_size} samples classify them according to the following task
+### Task Instruction:
+{task_instruction}
+
+### list of samples:
+{samples}
+##
+Answer exactly in the following format for each sample:
+#### Sample <ID>: <label>