---
library_name: transformers
license: other
base_model: deepseek-ai/deepseek-coder-1.3b-instruct
tags:
- trl
- sft
- generated_from_trainer
model-index:
- name: asm2asm-deepseek-1.3b-500k-2ep-tokenizer-x86-O0-arm-gnueabi-gcc
results: []
---
# CISC-to-RISC
A fine-tuned version of [deepseek-ai/deepseek-coder-1.3b-instruct](https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-instruct) specialized in converting x86 assembly code to ARM assembly.
## Model Overview
**asm2asm-deepseek1.3b-xtokenizer-arm** is designed to assist developers in converting x86 assembly instructions to ARM assembly. Leveraging the capabilities of the base model, this fine-tuned variant enhances accuracy and efficiency in assembly code transpilation tasks.
## Intended Use
This model is intended for:
- **Assembly Code Conversion**: Assisting developers in translating x86 assembly instructions to ARM architecture.
- **Educational Purposes**: Helping learners understand the differences and translation mechanisms between x86 and ARM assembly.
- **Code Optimization**: Facilitating optimization processes by converting and refining assembly code across architectures.
## Limitations
- **Dataset Specificity**: The model is fine-tuned on a specific dataset, which may limit its performance on assembly instructions outside the training distribution.
- **Complex Instructions**: May struggle with highly complex or unconventional assembly instructions not well-represented in the training data.
- **Error Propagation**: Inaccuracies in the generated ARM code can lead to functional discrepancies or bugs if not reviewed.
## Training Data
*Detailed documentation of the training dataset is not yet provided. The associated datasets are available in the [Hugging Face collection](https://huggingface.co/collections/ahmedheakl/cisc-to-risc-672727bd996db985473d146e) linked below.*
## Training Procedure
### Training Hyperparameters
The model was trained with the following hyperparameters (mirrored in the sketch below):
- **Learning Rate**: 0.0002
- **Training Batch Size**: 1
- **Evaluation Batch Size**: 8
- **Seed**: 42
- **Gradient Accumulation Steps**: 4
- **Total Training Batch Size**: 4
- **Optimizer**: Adam (betas=(0.9, 0.999), epsilon=1e-08)
- **Learning Rate Scheduler**: Linear
- **Number of Epochs**: 2
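For convenience, the sketch below mirrors these hyperparameters as a Hugging Face `TrainingArguments` object. It is an illustrative reconstruction rather than the exact training script; the output directory and the `bf16` flag are assumptions.

```python
# Minimal sketch mirroring the hyperparameters above (not the exact training script).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="asm2asm-deepseek-1.3b-x86-to-arm",  # placeholder output path
    learning_rate=2e-4,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,  # effective train batch size: 1 x 4 = 4
    num_train_epochs=2,
    lr_scheduler_type="linear",
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    bf16=True,  # assumption, consistent with the bfloat16 inference example below
)
```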
## Usage
All models and datasets are available on [Hugging Face](https://huggingface.co/collections/ahmedheakl/cisc-to-risc-672727bd996db985473d146e). Below is an example of how to use the best model for converting x86 assembly to ARM.
### Inference Code
````python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace 'hf_token' with your Hugging Face token
hf_token = "your_hf_token_here"
model_name = "ahmedheakl/asm2asm-deepseek1.3b-xtokenizer-arm"

# Prompt template: the x86 assembly fills the first ```asm block and the model
# completes the second ```asm block with the corresponding ARM assembly.
instruction = """<|begin▁of▁sentence|>You are a helpful coding assistant specialized in converting from x86 to ARM assembly.
### Instruction:
Convert this x86 assembly into ARM
```asm
{asm_x86}
```
### Response:
```asm
{asm_arm}
"""

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model.config.use_cache = True

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
    token=hf_token,
)


def inference(asm_x86: str) -> str:
    prompt = instruction.format(asm_x86=asm_x86, asm_arm="")
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    generated_ids = model.generate(
        **inputs,
        use_cache=True,
        num_return_sequences=1,
        max_new_tokens=8000,
        do_sample=False,
        num_beams=4,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
    outputs = tokenizer.batch_decode(generated_ids)[0]
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    # Keep only the ARM assembly between the final ```asm fence and the EOS token
    return outputs.split("```asm\n")[-1].split(f"```{tokenizer.eos_token}")[0]


# Example usage
x86 = "DWORD PTR -248[rbp] movsx rdx"
converted_arm = inference(x86)
print(converted_arm)
````
## Experiments and Results
| **Model** | **Average Edit Distance** (↓) | **Exact Match** (↑) | **Test Accuracy** (↑) |
|-----------------------------------------------|-------------------------------|---------------------|-----------------------|
| GPT4o | 1296 | 0% | 8.18% |
| DeepSeekCoder2-16B | 1633 | 0% | 7.36% |
| Yi-Coder-9B | 1653 | 0% | 6.33% |
| **Yi-Coder-1.5B** | 275 | 16.98% | 49.69% |
| **DeepSeekCoder-1.3B** | 107 | 45.91% | 77.23% |
| **DeepSeekCoder-1.3B-xTokenizer-int4** | 119 | 46.54% | 72.96% |
| **DeepSeekCoder-1.3B-xTokenizer-int8** | **96** | 49.69% | 75.47% |
| **DeepSeekCoder-1.3B-xTokenizer** | 165 | **50.32%** | **79.25%** |
*Table: Comparison of models' performance on the x86 to ARM transpilation task, measured by Edit Distance (lower is better), Exact Match (higher is better), and Test Accuracy (higher is better). The top section lists pre-existing models, while the bottom section lists models trained by us. The best results in each metric are highlighted in bold.*
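For reference, a minimal sketch of how the two string-level metrics can be computed is shown below; it is an assumption about the evaluation setup rather than the exact evaluation scripts, and Test Accuracy (functional correctness) additionally requires assembling and executing the predicted ARM, which is omitted here. The `Levenshtein` dependency is an assumed choice.

```python
# Sketch only: character-level edit distance and exact match over predicted vs.
# reference ARM assembly. Assumes the `python-Levenshtein` package is installed.
import Levenshtein

def string_metrics(predictions: list[str], references: list[str]) -> dict:
    distances = [Levenshtein.distance(p.strip(), r.strip())
                 for p, r in zip(predictions, references)]
    exact = [p.strip() == r.strip() for p, r in zip(predictions, references)]
    return {
        "avg_edit_distance": sum(distances) / len(distances),
        "exact_match_pct": 100 * sum(exact) / len(exact),
    }
```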
## Citations
If you use this model in your research, please cite it as follows: