(2024.4.10) Over the past few days I tried using the trained model and found that factors such as Transformer caused deployment problems and degraded inference quality, so I retrained the model. Please download all model files to your local machine before use.

Introduction to NL2Cypher-lora

NL2Cypher-lora is baichuan2-7B fine-tuned with LoRA on the SpCQL dataset. It achieved good results on the SpCQL test set.

Overview of NL2Cypher-lora

NL2Cypher-lora targets the Neo4j graph database: it takes a natural-language query as input and converts it into the Cypher query language. It was trained for 4000 steps on a single 4090 GPU. Since there is little fine-tuning work of this kind so far, please bear with us if the results fall short :).

Quick Start with NL2Cypher-lora

# Only the imports that the inference code below actually uses
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

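# Quantization is set through bnb_config (4-bit NF4); passing load_in_8bit
# together with quantization_config is rejected by transformers, so it is omitted.
base_model = AutoModelForCausalLM.from_pretrained(
    "path/to/this/model",
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    device_map={"": "cuda:0"},
    trust_remote_code=True,
)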

tokenizer = AutoTokenizer.from_pretrained("path/to/this/tokenizer", use_fast=False, trust_remote_code=True)



input_text =  """
    你是一个知识图谱领域的专家，给你一个query你就可以返回一个准确的cypher，比如：
    query:"想要5个在体育人物韩献熙的4层关系内的个人介绍？" cypher:"""

input_ids = tokenizer(input_text, return_tensors="pt").to('cuda:0')

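# Generate text
output = base_model.generate(**input_ids, num_beams=10, max_new_tokens=200, repetition_penalty=1.1)

# Decode, then keep only the text after the "cypher:" marker (dropping the leading space)
print(tokenizer.decode(output.cpu()[0], skip_special_tokens=True).split("cypher:")[1][1:])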

The output may be:

  match (n:ENTITY{name:'韩献熙'})-[*1..4]->(x) where x.name<>'体育人物' return distinct x.name limit 5
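
To query the model with other questions, the prompt construction and the "cypher:" post-processing from the quick start can be wrapped in two small helpers. This is a minimal sketch: `PROMPT_TEMPLATE`, `build_prompt`, and `extract_cypher` are hypothetical names introduced here and are not part of the released model files, and the template simply reuses the quick-start prompt with the question made a parameter.

```python
# Hypothetical helpers (not shipped with the model): they wrap the prompt
# format and the "cypher:" post-processing used in the quick start.

# Prompt text copied from the quick start, with the question as a placeholder.
PROMPT_TEMPLATE = """
    你是一个知识图谱领域的专家，给你一个query你就可以返回一个准确的cypher，比如：
    query:"{question}" cypher:"""


def build_prompt(question: str) -> str:
    """Insert a natural-language question into the quick-start prompt."""
    return PROMPT_TEMPLATE.format(question=question)


def extract_cypher(decoded: str) -> str:
    """Return everything after the first 'cypher:' marker, whitespace-stripped."""
    _, marker, rest = decoded.partition("cypher:")
    if not marker:
        raise ValueError("no 'cypher:' marker found in the decoded text")
    return rest.strip()


# Simulated decoded output: the prompt followed by the example Cypher shown above.
decoded = build_prompt("想要5个在体育人物韩献熙的4层关系内的个人介绍？") + (
    " match (n:ENTITY{name:'韩献熙'})-[*1..4]->(x)"
    " where x.name<>'体育人物' return distinct x.name limit 5"
)
print(extract_cypher(decoded))
```

In the quick-start script, `build_prompt(...)` would replace the hard-coded `input_text`, and `extract_cypher(...)` would replace the `.split("cypher:")[1][1:]` expression applied to the decoded string.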

The training code and details will be open-sourced on GitHub later. Thanks for following, and enjoy this toy project.