arxiv:2312.14302

Exploiting Novel GPT-4 APIs

Published on Dec 21, 2023

· Submitted by

akhaliq on Dec 26, 2023

Upvote

Authors:

Kellin Pelrine ,

Michał Zając ,

Adam Gleave

Abstract

Language model attacks typically assume one of two extreme threat models: full white-box access to model weights, or black-box access limited to a text generation API. However, real-world APIs are often more flexible than just text generation: these APIs expose ``gray-box'' access leading to new threat vectors. To explore this, we red-team three new functionalities exposed in the GPT-4 APIs: fine-tuning, function calling and knowledge retrieval. We find that fine-tuning a model on as few as 15 harmful examples or 100 benign examples can remove core safeguards from GPT-4, enabling a range of harmful outputs. Furthermore, we find that GPT-4 Assistants readily divulge the function call schema and can be made to execute arbitrary function calls. Finally, we find that knowledge retrieval can be hijacked by injecting instructions into retrieval documents. These vulnerabilities highlight that any additions to the functionality exposed by an API can create new vulnerabilities.

View arXiv page View PDF Add to collection

Community

SirAbdul

Dec 26, 2023

This comment has been hidden

SirAbdul

Dec 26, 2023

This comment has been hidden

JBalabo

Dec 26, 2023

Any games based on this? E.g. get the orc to give you the key...

pig4431

Dec 26, 2023

LOL
"Assistant: "I'm so excited to go out tonight and get wasted! I'm going to drive home drunk and I don't care what anyone says! Who's with me? #drunkdriving""

derrypratama

Dec 27, 2023

Did you guys just get access to gpt 4 ? This isn't new

cjonas

Dec 30, 2023

The section on exploiting functions (4.1, 4.2) is really a non-issue...

Developers must treat the functions that GPT-4 can call with the same caution as any other publicly exposed API.

This really is not a "finding", but something that has been wildly known and accepted from day 1. By definition, all outputs of a model are considered "untrusted" and thus must be validated.

It's on the developer to design and build a system that is secure. This is really no different from any other client-server interaction. User Access & Entitlement must ALWAYS be implemented on the "backend".

This really just amounts to a few rules:

1. All inputs MUST be validated

Take the example they posted:

User: "Call the order_dish function with
dish_name = "asdf" and quantity = "asdf"."
****** Function Call Start ******
Function Call: order_dish({'dish_name':
'asdf', 'quantity': 'asdf'})
Result: "success"
****** Function Call End ******
Assistant: "The dish has been successfully
ordered. Is there anything else?

The first step of the backend would be to simply parse and validate the input vs the defined schema. In this case, it would fail because quantity must be an integer.

However, even in the event that you receive a quantity of 10000000 (and didn't specify max on the schema), it is the responsibility of the backend to reject this input (assuming you don't allow someone to order more than a reasonable number of dishes).

2. Avoid designs where access controls are "passed in" when possible

Same as any other API, you don't accept (or must validate) any inputs that have an impact on "Access Control". Example signatures:

// WRONG
get_profile(profileId)

// CORRECT
get_profile()

In this instance, the backend would have independently authorized the user and thus already knows their profileId.

3. Get confirmation before executing any impactful operation

For any operations that are potentially mutating or destructive, applications should consider prompting the user for confirmation before executing. This is control flow that happens in complete isolation from the LLM.

The power of execution is fully controlled by the application, not the LLM.