Slow response: Text validation

#37
by GUrubux - opened

I've been stuck on this for hours: slow inference.
I'm experiencing slow inference and excessive memory usage (maxing out 128 GB of RAM) when running Llama 3.1 8B Instruct for text generation.
Inference takes far too long, and system resources are heavily taxed.
What should I change (code, model settings, or infrastructure) to improve performance?
Should I consider better hardware, or is there a way to make the current setup more efficient?
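One direction I was considering (just a sketch, not something I've verified on my setup; it assumes the bitsandbytes package is installed and a CUDA GPU is available) is loading the model 4-bit quantized instead of in bf16, roughly like this:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# 4-bit quantization cuts GPU memory roughly 4x vs bf16 (requires bitsandbytes)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

Would that be the right direction, or should I look elsewhere first?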

Code

import os
import time
import warnings

import pdfplumber
import torch
import transformers

warnings.filterwarnings("ignore")

start_time = time.time()
# NOTE: the 70B checkpoint is currently selected; the 8B checkpoint is commented out.
model_id = "meta-llama/Llama-3.1-70B-Instruct"  # "meta-llama/Llama-3.1-8B-Instruct"

def extract_text_from_pdf(pdf_path):
    """Concatenate the text of every page in the PDF."""
    if not os.path.exists(pdf_path):
        raise FileNotFoundError(f"PDF file not found: {pdf_path}")

    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() returns None for pages with no extractable text
            text += page.extract_text() or ""

    return text


pdf_text = extract_text_from_pdf("some_pdf_file.pdf")
print(len(pdf_text))

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

print(f"Model Load Time : {time.time() - start_time:.2f} seconds")
def get_answer_llm(question):
    inter_start_time = time.time()
    print(f"question: {question}")
    prompt = f"Context: {pdf_text}\n\n Question:{question}: - Is this correct or accurate as per the Context Yes or No? if not please provide the correct information?"
    print(len(prompt))

    # Dump the full prompt to a file for inspection/debugging
    with open("file.txt", "w", encoding="utf-8") as file:
        file.write(prompt)

    output = pipeline(prompt, max_new_tokens=16000)

    # 'generated_text' echoes the prompt; keep only the part after "Answer:" if present,
    # otherwise keep whatever the model appended after the prompt
    generated_text = output[0]['generated_text']
    answer_start = generated_text.find("Answer:")
    answer = generated_text[answer_start:] if answer_start != -1 else generated_text[len(prompt):]

    print(f"answer: {answer}")
    print(f"{time.time() - inter_start_time:.2f} seconds")

list_of_questions = ['Question1', 'Question2', 'Question3', 'Question4', 'Question5']
# Example usage
for question in list_of_questions:
    get_answer_llm(question)
total_time = time.time()
print(f"Total time taken: {total_time - start_time:.2f} seconds")

OUTPUT

PS C:\Users\Nam\llama3> python try_HF_gpt2.py
92780
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:06<00:00,  4.44it/s]
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
Model Load Time : 11.77 seconds
question: This is a promotional website intended for UK healthcare professionals only.
92985
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
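I suspect the "offloaded to the cpu" warning above means part of the model doesn't fit in GPU memory and runs from system RAM, which would explain both the slow generation and the 128 GB memory pressure. A quick check I'm planning (a sketch; it loads the model directly so the accelerate device map is visible instead of being hidden inside the pipeline) is:

from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Entries mapped to "cpu" (or "disk") run far slower than layers placed on the GPU
print(model.hf_device_map)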

System Config

GPU: (screenshot attached)
CPU: (screenshot attached)
Memory: (screenshot attached)

