Why are the prompt and schema generated as part of the model output?
So far I am pretty satisfied with the accuracy of the model, but one thing that is bothering me is that the prompt and schema are returned as part of the model output. What if my schema is huge, with around 250 columns?
Output is:
CREATE TABLE stadium (
stadium_id number,
location text,
name text,
capacity number,
highest number,
lowest number,
average number
)
CREATE TABLE singer (
singer_id number,
name text,
country text,
song_name text,
song_release_year text,
age number,
is_male others
)
CREATE TABLE concert (
concert_id number,
concert_name text,
theme text,
stadium_id text,
year text
)
CREATE TABLE singer_in_concert (
concert_id number,
singer_id text
)
-- Using valid SQLite, answer the following questions for the tables provided above.
-- What is the maximum, the average, and the minimum capacity of stadiums ?
SELECT MAX(capacity), AVG(capacity), MIN(capacity) FROM stadium
Expected output is:
Only the generated SQL query: SELECT MAX(capacity), AVG(capacity), MIN(capacity) FROM stadium
How can I get only the SQL query to be generated?
Hi @Aiforfun,
Thanks for your interest in our work. We've also provided several examples of using our model in the real world and of how to get only the generated SQL. Please check here: https://github.com/NumbersStationAI/NSQL/tree/main/examples
Hi @senwu,
Thanks for the reply.
My question was more about the DB schema: if I have more than 100 DB tables, providing the schema for all of them makes the prompt very large, the model takes a very long time to generate the query, and model.generate(input_ids, max_length=600) returns a huge number of tokens that just repeat the DB schema. I then have to strip all the schema out of the output to print only the generated SQL.
Other LLMs don't generate the given prompt back; they generate only the SQL. So is there an option to generate only the SQL and not the given prompt?
Is there a better way to provide the schema for a large database, considering more than 100 big tables?
Hi @Aiforfun,
The input sequence length will affect the model generation time: a longer input sequence takes longer to process, assuming the output sequence has the same or similar length. Other CausalLM-based FMs will also return the original input as the prefix of the generated sequence. However, you can control the maximum token length for the newly generated output, and you can print only the newly generated output by filtering out the original input sequence. Here is an example:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("NumbersStation/nsql-llama-2-7B")
model = AutoModelForCausalLM.from_pretrained("NumbersStation/nsql-llama-2-7B", torch_dtype=torch.bfloat16)
text = """CREATE TABLE stadium (
stadium_id number,
location text,
name text,
capacity number
)
-- Using valid SQLite, answer the following questions for the tables provided above.
-- how many stadiums in total?
SELECT"""
input_ids = tokenizer(text, return_tensors="pt").input_ids
# generated_ids = model.generate(input_ids, max_length=500)
generated_ids = model.generate(input_ids, max_new_tokens=500) # <---- use max_new_tokens to control the maximum number of tokens to generate, ignoring the number of tokens in the prompt.
# print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
print(tokenizer.decode(generated_ids[0][len(input_ids[0]):], skip_special_tokens=True)) # <---- filter out the input sequence
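If the model keeps generating past the first query (for example, another -- comment or a second statement), a simple heuristic, continuing from the snippet above and not an official API, is to truncate the decoded continuation at the first semicolon and re-attach the "SELECT" the prompt ends with:
sql_suffix = tokenizer.decode(generated_ids[0][len(input_ids[0]):], skip_special_tokens=True)
# The prompt above ends with "SELECT", so prepend it to the continuation
# and keep only the text up to the first semicolon.
sql = "SELECT" + sql_suffix.split(";")[0]
print(sql)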
Hi @senwu,
Awesome, this is exactly what I was looking for. Much appreciated; thanks for your prompt response.
I have another issue: since I have more than 100 DB tables and the model requires the table schemas in the prompt, the prompt becomes very large with all 100 tables' schemas.
I am trying prompt chunking but facing some issues getting it to work. Can you please guide me on how to chunk a very large prompt with the NumbersStation/nsql-llama-2-7B model, or suggest a better approach to handle a very large prompt?
The NumbersStation/nsql-llama-2-7B model supports a context length of 4K, so you need to fit your prompt within that restriction. In practice, we recommend filtering out irrelevant tables and columns with some retrieval mechanism to make the prompt fit into the model. Hope it helps.
One option is to encode the tables/columns with sentence_transformers and filter them by similarity score.
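For example, here is a minimal sketch of that retrieval step, assuming the sentence-transformers package and a hypothetical list of CREATE TABLE strings (the encoder name and top-k value are illustrative choices, not the authors' recommendation):
from sentence_transformers import SentenceTransformer, util

# Hypothetical inputs: one CREATE TABLE string per table in your database.
table_schemas = ["CREATE TABLE stadium (stadium_id number, name text, capacity number)",
                 "CREATE TABLE singer (singer_id number, name text, country text)"]
question = "What is the maximum capacity of stadiums?"

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice
table_embs = encoder.encode(table_schemas, convert_to_tensor=True)
question_emb = encoder.encode(question, convert_to_tensor=True)

# Score each table against the question and keep only the top-k most
# similar ones so the prompt fits the 4K context window.
scores = util.cos_sim(question_emb, table_embs)[0]
top = scores.topk(k=min(5, len(table_schemas)))
prompt_schema = "\n\n".join(table_schemas[i] for i in top.indices.tolist())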
This particular model is intended for the 0-shot scenario, which is the most common real-world use case. If you're conducting experiments and wish to include demonstrations from the same database, you can add several question/SQL pairs at the end. If you have any suggestions on how to enable multi-database demonstrations, please let us know, and we'll try to incorporate them in the next release.
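For example, a demonstration-augmented prompt might look like the sketch below; the first question/SQL pair is an illustrative demonstration following the comment format used in the prompts above:
CREATE TABLE stadium (
stadium_id number,
location text,
name text,
capacity number
)
-- Using valid SQLite, answer the following questions for the tables provided above.
-- how many stadiums in total?
SELECT COUNT(*) FROM stadium;
-- What is the maximum capacity of stadiums?
SELECT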
Hi, will join queries work between tables? I tried but didn't succeed.
Yes, the model supports joining over multiple tables. You could adjust the prompt to give the model more information.
Thanks for your reply. Can you please share an example? When I tried, I wasn't able to make that work.
Here is an example:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("NumbersStation/nsql-llama-2-7B")
model = AutoModelForCausalLM.from_pretrained("NumbersStation/nsql-llama-2-7B", torch_dtype=torch.bfloat16)
text = """CREATE TABLE singer (\n singer_id number,\n name text,\n birth_year number,\n net_worth_millions number,\n citizenship text\n)\n\nCREATE TABLE song (\n song_id number,\n title text,\n singer_id number,\n sales number,\n highest_position number\n)\n\n\n-- Using valid SQLite, answer the following questions for the tables provided above.\n\n-- For each singer name, what is the total sales for their songs?\n\nSELECT"""
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=500)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
The output would be something like this:
CREATE TABLE singer (
singer_id number,
name text,
birth_year number,
net_worth_millions number,
citizenship text
)
CREATE TABLE song (
song_id number,
title text,
singer_id number,
sales number,
highest_position number
)
-- Using valid SQLite, answer the following questions for the tables provided above.
-- For each singer name, what is the total sales for their songs?
SELECT T2.name, SUM(T1.sales) FROM song AS T1 JOIN singer AS T2 ON T1.singer_id = T2.singer_id GROUP BY T2.name
Hi @senwu,
I see the model faces challenges generating correct SQL queries when some sort of calculation is needed. I tried to generate a SQL query that should have calculated a ratio in the WHERE clause, but the ratio is not being calculated.
CREATE TABLE work_order (
id NUMBER,
property_id NUMBER,
cost FLOAT,
invoice_amount FLOAT,
entered_date DATE,
due_date DATE,
complete_date DATE,
total_workorder NUMBER,
completed_workorder NUMBER
)
CREATE TABLE property (
id NUMBER,
property_name TEXT,
area FLOAT,
owner_id NUMBER,
city TEXT,
country TEXT
)
CREATE TABLE owner (
id NUMBER,
name TEXT,
salary FLOAT
)
-- Using valid SQLite, answer the following questions for the tables provided above.
-- find property id for which work order ratio is higher than 2
SELECT property_id FROM work_order GROUP BY property_id HAVING COUNT(*) > 2
Correct SQL should be: SELECT property_id FROM work_order WHERE (total_workorder - completed_workorder) / total_workorder > 2
Could you please help me understand what is wrong and what I should do?
Try this modified version?
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("NumbersStation/nsql-llama-2-7B")
model = AutoModelForCausalLM.from_pretrained("NumbersStation/nsql-llama-2-7B", torch_dtype=torch.bfloat16).to(0)
text = """CREATE TABLE work_order (
id NUMBER,
property_id NUMBER,
cost FLOAT,
invoice_amount FLOAT,
entered_date DATE,
due_date DATE,
complete_date DATE,
total_workorder NUMBER,
completed_workorder NUMBER
)
CREATE TABLE property (
id NUMBER,
property_name TEXT,
area FLOAT,
owner_id NUMBER,
city TEXT,
country TEXT
)
CREATE TABLE owner (
id NUMBER,
name TEXT,
salary FLOAT
)
-- Using valid SQLite, answer the following questions for the tables provided above.
-- find property id for which completed work order ratio is lower than 98
"""
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(0)
generated_ids = model.generate(input_ids, max_length=500)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
Output:
SELECT property_id FROM work_order WHERE completed_workorder / total_workorder < 98
Hi @senwu,
Sorry for the confusion.
My question was: -- find property id for which the non-completed work order ratio is higher than 2
So the generated query should be SELECT property_id FROM work_order WHERE (total_workorder - completed_workorder) / total_workorder > 2
But the model generated SELECT property_id FROM work_order WHERE completed_workorder / total_workorder > 2
Hi @Aiforfun,
After another look at your example, I think the issue is with the schema: the work_order table is at the granularity of a work order, meaning each row is one work order. But total_workorder NUMBER measures the total work orders for an entire property, while completed_workorder NUMBER is effectively a boolean over a single work order, so these two attributes don't type-check with the schema. Instead, you can use an is_completed_workorder NUMBER to indicate each work order's status and derive the totals by counting rows. Here is the updated schema:
CREATE TABLE work_order (
id NUMBER,
property_id NUMBER,
cost FLOAT,
invoice_amount FLOAT,
entered_date DATE,
due_date DATE,
complete_date DATE,
is_completed_workorder NUMBER
)
CREATE TABLE property (
id NUMBER,
property_name TEXT,
area FLOAT,
owner_id NUMBER,
city TEXT,
country TEXT
)
CREATE TABLE owner (
id NUMBER,
name TEXT,
salary FLOAT
)
And for the question "Give me the ratio of not completed to total work orders", the model outputs:
SELECT (SELECT COUNT(*) FROM work_order WHERE is_completed_workorder = 0) / (SELECT COUNT(*) FROM work_order)
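One caveat: in SQLite, dividing two integer counts truncates to an integer, so the query above returns 0 whenever some work orders are complete. If you want a fractional ratio, you may need to post-edit the SQL to force floating-point division, e.g.:
SELECT (SELECT COUNT(*) FROM work_order WHERE is_completed_workorder = 0) * 1.0 / (SELECT COUNT(*) FROM work_order)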
Hi @senwu,
I want to train NumbersStation/nsql-llama-2-7B on my own dataset, so I have to prepare my own dataset similar to the one you shared on GitHub.
What would be the right approach: pretrain and then fine-tune, or directly fine-tune as per the fine-tuning notebook you shared on GitHub?
Which hyperparameters should I consider in fine-tuning to get the best accuracy in SQL generation?
You can find the prompt format in the model readme.
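For reference, a single training example in that prompt format, mirroring the prompts earlier in this thread, pairs the schema and question with the target SQL as the completion (the question/SQL pair below is illustrative, not from the released dataset):
CREATE TABLE owner (
id NUMBER,
name TEXT,
salary FLOAT
)
-- Using valid SQLite, answer the following questions for the tables provided above.
-- Who is the highest-paid owner?
SELECT name FROM owner ORDER BY salary DESC LIMIT 1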
Hi @senwu,
- The NumbersStation/nsql-llama-2-7B model is not making joins at all for me. My tables contain string dtypes and require no calculations. When I ask for multiple variables from multiple tables, it hallucinates and links all the variables to a single table using only that table's primary key. How do I ensure it makes joins properly?
- How do I add more information, like column metadata/descriptions, to the prompt? Could you provide an example? Right now my prompt follows exactly the prompt format in the model readme.
- If the question asked contains a variable that is present in multiple tables, can I have the output include all of those tables instead of just one? Currently, it gives me the variable from just one table.
Looking forward to your reply. Thanks in advance.
It would be great if the model supported few-shot learning. It would allow the system to learn from previous mistakes, with users guiding the model on what SQL they expect. Please let me know how I can help. Can I provide training examples?
Yes, it would be great if you could share training examples.