The Anatomy of Attacks on LLMs

AI is both a sword and a shield in cybersecurity. It enhances our ability to detect and respond to threats, yet it also escalates the complexity and sophistication of the attacks we must prepare for.

— Bruce Schneier

Anatomy body

Why do attacks on LLMs matter?

Well, as Bruce said, AI is a sword and a shield.

ChatGPT, Claude, LLaMa, and other open source models are now an essential part of our everyday life.

This article is written based on assumption that in the very very very very very near future:

On one hand, all of the companies will become AI or AI-enhanced companies. Every single business (doesn’t matter whether a single-person or a multimillion corp.) will have an AI factory.

On the other hand, all people will have some sort of AI agents for 99% of the tasks in the future. Starting from real-time translation agents that speak perfectly in your voice and tonality, which remove all imaginable barriers to speaking foreign languages, and finishing with the domestic robots that’ll do the majority of the tedious tasks like “cooking” and “cleaning” (this one will definitely take a while. The question of “Why?” I’ll cover in a separate article)

With the rising blend of all things “artificial” in all parts “natural”, there’ll be more and more disallowed, unethical, dangerous and even life- & humanity-threatening use cases in the future.

Very close future.

Closer than you might think.

Anatomy & Types of LLM attacks

Think of modern LLMs as operating systems (yeah, like MacOS or Linux).

AI-powered applications execute the instructions, determine the control flow, handle data storage, establish security constraints, etc.

Technically, in such LLM-OSes attacks can appear on one of the four various layers:

System Message → aka. safety guidelines, and constraints for the LLM. In 99% of cases provided by the app developer.
User Message → aka. an end user’s inputs to the model (e.g. user’s query)
Model Output → aka. responses from the LLM, which may consist of text, images, audio, 3D graphics, etc.
Tool Output → aka. RAG, API or code interpreter systems which may contain internet search results, code, or results from API calls.

Example conversation with LLM

In an example conversation above, an attack appeared on the Tool Output level, where attacker (iamhacker@hacker.com) tries to steal user’s personal information via a prompt injection.

This brings us to the core point.

In general the anatomy of all the LLM attacks can be divided in two different categories:

Prompt injections → malicious inputs on one of the layers with the goal of making AI systems behave in unintended ways.
Prompt theft → unauthorized capture, reuse and modification of the existing system prompts to generate privacy breaches or for intellectual property theft.

For each category there are hundreds of tons of techniques on how to attack an LLM and the user behind it. For the prompt injections they can be divided in:

Pretending (e.g. adopt a fictional persona or situation)
Attention shift (e.g. translation from low-covered languages)
Privilege escalation (e.g. “sudo” mode for generation of exploitable outputs)

If you’re curious about experimenting, have a look at “how I made ChatGPT write a perfect step-by-step tutorial for breaching a school network system and extracting personal data”

⚠️ Note #1: I don’t tolerate hacking. It fundamentally violates ethical principles and undermines trust. It’s here only to demonstrate the potential consequences of closed-source LLMs, and to emphasize that security & cybersecurity is everyone’s responsibility.

⚠️ Note #2: My prompt injection might be already fixed by OpenAI. If so, try to play around with it.

Anyway.

There are other attack vectors that can be used to to perform malicious tasks such as jailbreaks.

Jailbreak is a variation of a prompt injection with an adversarial prompt which specifically aims to escape the safety behaviour that is trained into an LLM.

Below is a conversational example of a jailbreak on a system message level:

Example of a jailbreak (prompt injection)

Adversarial prompts are extremely difficult to prevent because of the LLMs anatomy.

Another huge thing is prompt theft or prompt leakage.

There are also hundreds of thousands of theft techniques out there, but I’m pretty tired now to list them all here.

I’ll give you a real-world example below from holy-freaking-christ Microsoft itself.

In the earlier versions of Microsoft’s “new Bing,” which revolutionized and shook up the search engine industry, you could steal the system prompt by simply manipulating parts of it.

In other words: retrieve full system prompt details WITHOUT ANY AUTHORIZATION.

Real-world prompt theft example

But that was back in the days.

Now you can do it in the majority of AI-powered applications and assistants simply by asking to print the system message verbatim ;)

Example of a prompt theft

All of the aforementioned techniques can be used in the booming LLM world for generation of:

Spam/scam
Misinformation
Harmful & unethical content (e.g. Blog posts on how to create a bomb)
Pornographic content, etc. On that note, here’s one thing I want to say as an outro to this article.

Prompt theft & injections are (or might be) just the first step toward undermining the reliability and safety of AI systems that are increasingly integral to our daily lives — and business, since you’re reading this article under the assumption that every single business in the future will be an AI-powered business.

Let’s not F#CK UP probably the single most important technological advancement of humanity.

Remember, with great power comes great responsibility.

— Uncle Ben

Thanks for reading!