TL;DR: LLM watermarking techniques are ready for deployment. We propose a benchmark for evaluating LLM watermarks, focusing on three main metrics: quality, size (the number of tokens needed to detect a watermark), and tamper-resistance. We compare four schemes from the literature and find the best to be that of Kirchenbauer et al. [1]: it can watermark Llama2-7B-chat with no perceptible loss in quality, in under 100 tokens, and with good tamper-resistance to simple attacks, regardless of temperature.
Authors: Julien Piet, Chawin Sitawarin, Vivian Fang, Norman Mu, David Wagner
The capabilities of large language models have grown significantly in recent years, and so too have concerns about their misuse. In this context, the ability to distinguish machine-generated text from human-authored content becomes important. Prior work has proposed numerous schemes to watermark text, and these would benefit from a systematic evaluation framework. This work focuses on text watermarking techniques (as opposed to image watermarks) and proposes a comprehensive benchmark for them under different tasks as well as practical attacks. We focus on three main metrics: quality, size (i.e., the number of tokens needed to detect a watermark), and tamper-resistance. Current watermarking techniques are good enough to be deployed: the scheme of Kirchenbauer et al. [1] can watermark Llama2-7B-chat with no perceptible loss in quality in under 100 tokens, and with good tamper-resistance to simple attacks, regardless of temperature. We argue that watermark indistinguishability is too strong a requirement: schemes that slightly modify logit distributions outperform their indistinguishable counterparts with no noticeable loss in generation quality. We publicly release our benchmark.
Our benchmark generates 300 outputs from the watermarked model. It relies on three metrics:
Quality is the average rating across all generated outputs. Each output is rated by Llama2, prompted to provide a score quantifying the quality of the generation.
Watermark Size measures the median number of tokens needed to detect the watermark at a p-value of 0.02. A lower size indicates a more efficient watermarking scheme.
Tamper-Resistance quantifies the ability of a watermarking scheme to resist output tampering. We use a suite of standard attacks to perturb each output; each attack removes the watermark from some fraction of outputs, at the cost of a loss in output quality. We use the normalized area under this trade-off curve as our tamper-resistance metric: the higher the value, the more robust the watermark.
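As a rough illustration, the sketch below computes such a normalized area-under-curve score. The attack measurements are hypothetical placeholders, and the exact normalization in our benchmark may differ.

```python
# Minimal sketch of the tamper-resistance metric: the normalized area under
# the (quality loss, watermark survival) trade-off curve.
# The attack measurements below are hypothetical placeholders.
import numpy as np

# (quality loss, fraction of outputs whose watermark survives the attack)
attacks = [
    (0.00, 1.00),  # no attack: watermark always detected
    (0.05, 0.90),  # e.g. typo insertion
    (0.15, 0.70),  # e.g. synonym substitution
    (0.30, 0.50),  # e.g. round-trip translation
]
loss, survival = map(np.array, zip(*sorted(attacks)))

# Trapezoidal area under the curve, normalized by the quality-loss range so
# the result lies in [0, 1]: 1 means the watermark survives every attack.
area = np.sum(0.5 * (survival[1:] + survival[:-1]) * np.diff(loss))
tamper_resistance = area / (loss[-1] - loss[0])
print(f"tamper-resistance: {tamper_resistance:.2f}")
```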
Watermarking schemes consist of a marking procedure and a verification procedure. The marking procedure uses a pseudo-random number generator, seeded by a secret key, to sample tokens according to a predefined sampling strategy.
We analyzed schemes from the literature released prior to August 2023 and broke them down into four sampling strategies and two sources of pseudo-randomness.
We find the best watermark to be the one combining distribution-shift sampling with text-dependent randomness. It can watermark Llama2-7B-chat in under 100 tokens at a p-value of 0.02, regardless of the temperature.
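As an illustration, here is a minimal sketch of this combination in the spirit of Kirchenbauer et al. [1]: the previous token, hashed with the secret key, seeds a pseudo-random partition of the vocabulary into a "green" list (a fraction gamma of the vocabulary) whose logits are shifted up by delta before sampling. This is a simplified sketch, not our benchmark's implementation.

```python
# Minimal sketch of distribution-shift sampling with text-dependent
# randomness, following the spirit of Kirchenbauer et al. [1].
import hashlib
import numpy as np

def green_list(secret_key: bytes, prev_token: int, vocab_size: int, gamma: float):
    """Derive a pseudo-random 'green' subset of the vocabulary from the
    secret key and the previous token (the text-dependent seed)."""
    digest = hashlib.sha256(secret_key + prev_token.to_bytes(4, "big")).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "big"))
    return rng.permutation(vocab_size)[: int(gamma * vocab_size)]

def sample_next(logits, secret_key, prev_token, gamma=0.5, delta=2.0, temperature=1.0):
    """Shift the green-list logits up by delta, then sample as usual."""
    logits = np.array(logits, dtype=np.float64)  # copy; don't mutate caller
    logits[green_list(secret_key, prev_token, len(logits), gamma)] += delta
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(np.random.default_rng().choice(len(logits), p=probs))
```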
At a temperature of 1, the exponential scheme has a slightly smaller size.
The distribution-shift mark is still detectable in about 50% of cases on 1000-token generations subjected to a translation attack, with near-optimal quality and a size under 100 tokens.
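For completeness, here is a matching sketch of the verification side, reusing the green_list helper from the marking sketch above: we count how many tokens land in their green list and run a one-sided binomial test against the null hypothesis that unwatermarked text hits the green list with probability gamma. The watermark size reported above is the length of the shortest prefix whose p-value falls below 0.02. Again, this is an illustration, not our benchmark's exact code.

```python
# Minimal sketch of verification for the distribution-shift scheme, reusing
# green_list from the marking sketch above.
from scipy.stats import binomtest

def watermark_size(tokens, secret_key, gamma=0.5, threshold=0.02,
                   vocab_size=32000):  # 32000 is Llama2's vocabulary size
    """Return the number of tokens needed to detect the watermark, or None."""
    hits = 0
    for i in range(1, len(tokens)):
        if tokens[i] in green_list(secret_key, tokens[i - 1], vocab_size, gamma):
            hits += 1
        # One-sided binomial test: are there more green tokens than chance?
        if binomtest(hits, i, gamma, alternative="greater").pvalue < threshold:
            return i + 1
    return None  # watermark not detected
```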
You can find more details about our results on all schemes as well as watermarking parameter recommendations in our paper.
We express all previous LLM watermarking schemes as part of a unified framework, sketched below.
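The interface below is a hedged sketch of that framework (the class names are ours, not fixed terminology): a scheme pairs a randomness source, keyed either on a fixed secret or on the preceding text, with a sampling strategy such as distribution shift or exponential sampling.

```python
# Hedged sketch of the unified framework's interface; class names are
# illustrative, not the paper's terminology.
from abc import ABC, abstractmethod
from typing import Sequence

class RandomnessSource(ABC):
    """Where the watermark's pseudo-randomness comes from: a fixed secret
    key alone, or a seed that also depends on the preceding tokens."""
    @abstractmethod
    def seed(self, secret_key: bytes, context: Sequence[int]) -> int: ...

class SamplingStrategy(ABC):
    """How the seed biases token selection, e.g. distribution shift or
    exponential sampling."""
    @abstractmethod
    def sample(self, logits: Sequence[float], seed: int) -> int: ...

class WatermarkScheme:
    """A scheme pairs a randomness source with a sampling strategy.
    Verification replays the seeds over the observed tokens and tests how
    well the text matches the strategy's choices."""
    def __init__(self, randomness: RandomnessSource, strategy: SamplingStrategy):
        self.randomness = randomness
        self.strategy = strategy

    def mark_step(self, secret_key: bytes, context: Sequence[int], logits) -> int:
        return self.strategy.sample(logits, self.randomness.seed(secret_key, context))
```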
Our empirical analysis demonstrates that existing watermarking schemes are ready for deployment, providing effective methods to fingerprint machine-generated text. Notably, we can watermark Llama 2, a low-entropy model, in under 100 tokens with minimal quality loss. The tamper-resistance of some watermarks lends credibility to their real-world application.
We challenge the perceived necessity for watermark indistinguishability: the solution proposed in Kirchenbauer et al. [1] can watermark models more efficiently than alternatives without degrading the model’s quality, despite not being provably indistinguishable.
Finally, we provide recommendations for parameter selection and a benchmark to compare existing and future watermarking schemes. We release our code in the hope it encourages further discussion and helps reach consensus on the desirable properties of watermarking schemes for large language models.
[1] J. Kirchenbauer, J. Geiping, Y. Wen, J. Katz, I. Miers, and T. Goldstein, “A watermark for large language models,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, Jul. 2023, pp. 17061–17084. https://proceedings.mlr.press/v202/kirchenbauer23a.html
[2] S. Aaronson and H. Kirchner, “Watermarking GPT outputs,” Dec. 2022. https://www.scottaaronson.com/talks/watermark.ppt
[3] M. Christ, S. Gunn, and O. Zamir, “Undetectable watermarks for language models,” Cryptology ePrint Archive, Paper 2023/763, 2023. https://eprint.iacr.org/2023/763
[4] R. Kuditipudi, J. Thickstun, T. Hashimoto, and P. Liang, “Robust distortion-free watermarks for language models,” Jul. 2023. http://arxiv.org/abs/2307.15593