3.4

Attacks on watermarks

An adversary with a few hours and a commodity GPU can defeat most published image watermarks. The academic record is unkind to claims of robustness. The honest defense is to plan for it.

Every claim of watermark robustness is a claim about a specific threat model. The literature is full of papers reporting 99% detection accuracy on specific benchmarks, paired with later papers showing how to defeat those same schemes in minutes. The truth most working practitioners settle on is that any given image watermark can be defeated by a sufficiently motivated adversary; the open questions are how much motivation it takes and whether the adversary's effort is meaningful in the context where the watermark is supposed to provide a signal.

This page catalogs the published attack families, names which watermark categories each defeats, and points to where the academic record currently sits. It is deliberately on the pessimistic side of the field; vendor claims of robust watermarking should be evaluated against the attacks below, not against benign-channel benchmarks.

The attack catalog

Compression and re-encoding

The simplest attack is to re-encode the image at a low quality setting, then re-save it at normal quality. JPEG re-encoding at quality 30 or below destroys most high-frequency information, and with it most watermark schemes that embed in that region. The classical countermeasure is to embed in low-frequency bands; this preserves the watermark against compression but reduces capacity and tightens the visibility margin.
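The mechanism can be shown with a toy 1-D simulation (pure Python; all names and parameters here are illustrative, not from any production scheme). A small spread-spectrum watermark is added to smooth "image" content, then coarse quantization, a crude stand-in for aggressive JPEG re-encoding, snaps the pixel values back to the host and erases the mark:

```python
import random

N = 4096

# Host "image": smooth low-frequency content (values sit on quantization bins).
host = [128 + 32 * ((i // 256) % 2) for i in range(N)]

# Spread-spectrum watermark: pseudo-random +/-1 pattern at small amplitude.
key = random.Random(42)
wm = [key.choice((-1, 1)) for _ in range(N)]
marked = [h + 2 * w for h, w in zip(host, wm)]

def detect(pixels, pattern):
    """Correlation detector: mean of (pixel - mean) * pattern.
    The host is roughly uncorrelated with the key, so the score
    approximates the embedded amplitude."""
    m = sum(pixels) / len(pixels)
    return sum((p - m) * w for p, w in zip(pixels, pattern)) / len(pixels)

def quantize(pixels, step):
    """Coarse quantization: a crude stand-in for low-quality JPEG."""
    return [round(p / step) * step for p in pixels]

present = detect(marked, wm)                # large: watermark detected
removed = detect(quantize(marked, 16), wm)  # near zero: quantization erased it
```

The low-frequency content (the 128/160 block structure) survives the quantization untouched, which is exactly why embedding in low-frequency bands is the classical countermeasure.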

Geometric transformations

Rotation, scaling, perspective distortion, and aspect-ratio changes can defeat watermarks that lack synchronization mechanisms. The classical countermeasure is to embed in invariant feature spaces (Fourier-Mellin, log-polar) or to embed a synchronization template the decoder finds first. Both add computational cost; neither is universal.
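The synchronization problem reduces, in one dimension, to a translation: a correlation detector that assumes the watermark starts at sample zero scores near zero after a small shift, while a detector that searches over candidate offsets (the role a synchronization template plays) recovers it. A minimal sketch, with all parameters invented for illustration:

```python
import random

rng = random.Random(7)
N = 2048
wm = [rng.choice((-1, 1)) for _ in range(N)]
host = [rng.gauss(0, 4) for _ in range(N)]
marked = [h + 1.0 * w for h, w in zip(host, wm)]

def corr(sig, pattern, shift=0):
    """Correlate pattern against sig cyclically shifted by `shift` samples."""
    n = len(sig)
    return sum(sig[(i + shift) % n] * pattern[i] for i in range(n)) / n

# Stand-in for a geometric desync: the attacker shifts the signal 5 samples.
shifted = marked[-5:] + marked[:-5]

naive = corr(shifted, wm)                                # detector lost sync
resynced = max(corr(shifted, wm, s) for s in range(16))  # search over offsets
```

The offset search is where the computational cost mentioned above comes from: a 2-D rotation/scale search is far more expensive than this 1-D loop, which is why Fourier-Mellin and log-polar embeddings trade the search for an invariant representation.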

Cropping

Heavy cropping removes whatever region the watermark occupies. Tile-based or multi-template embedding spreads the watermark across the image, so any sufficiently large remaining region still carries it. Aggressive crops below the tile size defeat this. Cropping is one of the most-used attacks because it is a normal editorial operation, not specifically adversarial.
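The difference between single-region and tiled embedding is easy to simulate (illustrative 1-D sketch; the crop here is tile-aligned for simplicity, so the desynchronization that a real crop also causes is set aside, per the geometric-attack discussion above):

```python
import random

rng = random.Random(3)
N, T = 4096, 128
tile = [rng.choice((-1, 1)) for _ in range(T)]
host = [rng.gauss(0, 4) for _ in range(N)]

# Single-region: the watermark lives only in the first quarter of the image.
single = [h + (tile[i % T] if i < N // 4 else 0) for i, h in enumerate(host)]
# Tiled: the same pattern repeats across the whole image.
tiled = [h + tile[i % T] for i, h in enumerate(host)]

def detect(pixels):
    """Correlate against the repeating tile (crop assumed tile-aligned)."""
    return sum(p * tile[i % T] for i, p in enumerate(pixels)) / len(pixels)

# The crop keeps only the second half of the image.
cropped_single = single[N // 2:]   # watermark region was cropped away
cropped_tiled = tiled[N // 2:]     # remaining tiles still carry the mark
```

Any crop larger than one tile still contains full copies of the pattern, which is the whole argument for tiling; a crop smaller than the tile defeats it, as the text notes.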

Adversarial scrubbing

An attack that uses gradient information from the detector to compute small pixel-level perturbations that defeat the detector while remaining visually imperceptible. The seminal paper for the image case is Saberi et al., 2023, "Robustness of AI-Image Detectors: Fundamental Limits and Practical Attacks," which demonstrated that several published watermark schemes could be defeated by small ℓ_∞-bounded perturbations. Subsequent work has extended the attacks to most production schemes available for evaluation.
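The logic of the attack can be sketched against a linear toy detector (my own stand-in; real detectors are learned and nonlinear, but the white-box gradient step is the same in spirit). With white-box access, the gradient of the score with respect to the image is known exactly, and an FGSM-style step against its sign lowers the score fastest per unit of ℓ∞ budget:

```python
import random

rng = random.Random(1)
N = 1024
w = [rng.choice((-1.0, 1.0)) for _ in range(N)]   # detector's secret pattern

def score(x):
    """White-box linear detector: mean correlation with its pattern."""
    return sum(xi * wi for xi, wi in zip(x, w)) / N

host = [rng.gauss(0, 4) for _ in range(N)]
marked = [h + 1.0 * wi for h, wi in zip(host, w)]

# FGSM-style scrub: the gradient of `score` w.r.t. the image is w/N, so
# stepping against sign(w) is the optimal l_inf-bounded perturbation.
eps = 1.5   # perturbation budget, comparable to the embedding strength
scrubbed = [x - eps * (1.0 if wi > 0 else -1.0) for x, wi in zip(marked, w)]

max_change = max(abs(a - b) for a, b in zip(marked, scrubbed))  # == eps
```

The per-pixel change is bounded by eps while the detection score swings from clearly positive to below zero, which is the visually-imperceptible / detector-defeating trade the Saberi et al. attacks exploit at scale.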

Regeneration attacks

An attack that runs the watermarked image through a generative model — an autoencoder, a diffusion model in image-to-image mode — and outputs a perceptually similar image. The generative model has no incentive to preserve the watermark, and most schemes do not survive. Zhao et al., 2023, "Invisible Image Watermarks Are Provably Removable Using Generative AI," formalized the attack and demonstrated it against several schemes. Regeneration is now the standard test in watermarking benchmarks; schemes that survive it are rare.
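A crude proxy for the attack (invented for illustration, not a real autoencoder): reconstruct each block of the signal from its mean, the way a lossy encode-decode pass preserves low-frequency content while discarding high-frequency detail. The content survives nearly intact; the watermark correlation collapses:

```python
import random

rng = random.Random(9)
N, B = 4096, 16
host = [100 + 20 * ((i // 512) % 2) for i in range(N)]   # smooth content
wm = [rng.choice((-1, 1)) for _ in range(N)]
marked = [h + 2 * w for h, w in zip(host, wm)]

def regenerate(pixels, block=B):
    """Autoencoder stand-in: rebuild each block from its mean.
    Low-frequency content survives; high-frequency detail does not."""
    out = []
    for i in range(0, len(pixels), block):
        m = sum(pixels[i:i + block]) / block
        out.extend([m] * block)
    return out

def detect(pixels):
    m = sum(pixels) / len(pixels)
    return sum((p - m) * w for p, w in zip(pixels, wm)) / len(pixels)

regen = regenerate(marked)
content_err = max(abs(a - b) for a, b in zip(host, regen))  # content intact
```

A diffusion model in image-to-image mode does the same thing far more gracefully: it reconstructs a perceptually equivalent image from a representation that never encoded the watermark.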

Paraphrasing attacks

The image analogue of text-paraphrasing attacks on text watermarks. The adversary uses an image-editing model to produce a semantically equivalent image with different surface form: same subject, same composition, different pixel-level realization. This is one of the strongest classes of attacks because it changes the image at a level no current watermarking scheme is designed to be invariant against.

Laundering through a non-watermarking generator

An attack specific to AI-generator watermarks: take the watermarked output, use it as a reference image for a different generator that does not embed a watermark, and emit a similar image with no watermark. This requires the second generator to be capable enough that the output is acceptable, which has become routine as open-weights models have matured.

Collusion attacks

When the same producer issues multiple watermarked versions of similar content (different framings of the same scene, multiple thumbnails), an attacker with several copies can average them to dilute the watermark signal while preserving the perceptual content. The literature on these attacks is mature; the practical mitigation is to randomize the watermark across copies so averaging does not cancel.
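The dilution effect is simple to quantify in a toy model (all parameters illustrative): if K copies each carry an independent random key, averaging them preserves the shared content but attenuates each individual key's correlation by roughly 1/K:

```python
import random

rng = random.Random(5)
N, K = 4096, 8
host = [rng.gauss(0, 2) for _ in range(N)]

# Producer issues K copies, each carrying an independent random +/-1 key.
keys = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(K)]
copies = [[h + k for h, k in zip(host, key)] for key in keys]

# Colluders average their copies: content survives, each key is diluted ~1/K.
avg = [sum(c[i] for c in copies) / K for i in range(N)]

def detect(pixels, key):
    return sum(p * k for p, k in zip(pixels, key)) / len(pixels)

single_score = detect(copies[0], keys[0])   # ~1: key at full strength
avg_score = detect(avg, keys[0])            # ~1/K: diluted by collusion
```

Collusion-resistant fingerprinting codes are designed so that even the diluted residue still traces at least one colluder, which is the mature literature the paragraph refers to.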

Attack                | Effort                           | Defeats                          | Notable result
----------------------|----------------------------------|----------------------------------|----------------------------------
Re-encoding           | Trivial                          | Pixel-domain LSB schemes         | Well-established countermeasures
Geometric             | Low                              | Schemes without synchronization  | Fourier-Mellin defends partially
Cropping              | Trivial                          | Single-region schemes            | Tile-based embedding defends
Adversarial scrubbing | Moderate (needs detector access) | Most published schemes           | Saberi et al. 2023
Regeneration          | Low (commodity diffusion model)  | Most surface-level watermarks    | Zhao et al. 2023
Paraphrasing          | Moderate                         | All current schemes              | Field consensus
Laundering            | Low                              | Generator-specific watermarks    | Trivial with open-weights models
Collusion             | Requires multiple copies         | Static-key schemes               | Per-copy randomization defends

The published record

The academic record through 2024 and 2025 has been consistent: watermark schemes evaluated outside their original benchmark conditions perform substantially worse than the original claims. The 2024 paper by An, Yu, et al. ("Benchmarking the Robustness of Image Watermarks") evaluated several production-grade and academic schemes against a standardized adversarial suite and found that most schemes lost 50 percentage points or more of detection accuracy under adversarial conditions.

The Saberi et al. work cited above made the strongest theoretical claim, providing a fundamental-limits argument: any watermark with capacity above a small threshold can be removed by an attacker with white-box access to the detector. The argument turns on the information-theoretic capacity of the watermarking channel and applies to image watermarks generally rather than to specific schemes. The practical implication is that watermarks are best understood as raising adversary cost, not as providing absolute guarantees, even against attackers without enormous resources.

What this means for SynthID and production schemes

SynthID, Stable Signature, and the other production AI-generator watermarks are not exempt from these results. The published evaluations of SynthID in particular have been mixed: it performs well under benign transformations but is vulnerable to regeneration attacks and to adversarial scrubbing in laboratory conditions. Google's response has emphasized that SynthID is a layer in a defense-in-depth strategy and not a standalone authentication mechanism — a position consistent with the academic record.

Production deployments mitigate the attack surface by combining watermarks with other signals: C2PA manifests for the strong-binding case, detection classifiers as backstop, and reputational signals at the platform layer. None of these is bulletproof either, but the combination is harder to defeat than any single layer. This is the same defense-in-depth logic that runs through the rest of this reference.

Caveat: a vendor benchmark showing 99% watermark detection accuracy under "real-world conditions" should always be read with attention to which conditions were actually tested. If adversarial scrubbing and regeneration attacks are not in the test suite, the benchmark says nothing about adversarial robustness. Watermarking is robust against the average case, not against the determined attacker.

Forgery attacks

The dual of the removal attack is the forgery attack: producing a watermark in an image that the legitimate generator never produced. If the watermark is keyed to a specific signing entity, a forged watermark on a non-generator image could falsely indicate AI provenance. This is a less-discussed attack but has real consequences if watermarks come to be relied on for "this image is AI-generated" labeling.

The risk varies by scheme. Schemes that depend on a secret key in the detector are reasonably resistant to forgery because the attacker does not know what pattern to embed. Schemes whose detection is essentially statistical (computing a learned similarity score) can be more vulnerable because an attacker can optimize an embedding to maximize the score. The C2PA architecture's choice to keep generator-side watermarking as a soft-binding signal rather than as a primary authentication mechanism is partially motivated by this kind of forgery concern.
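The statistical-scorer vulnerability can be sketched with a linear toy (my own construction; real learned scorers are nonlinear, but the query-and-optimize pattern is the same). An attacker who can query the detector's similarity score as an oracle probes it with small per-coordinate bumps, recovers the direction that raises the score, and embeds that direction into a clean image:

```python
import random

rng = random.Random(11)
N = 512
secret = [rng.choice((-1.0, 1.0)) for _ in range(N)]

def score(x):
    """Detector exposed as a similarity score (e.g. an API returning a
    confidence value). Linear stand-in for a learned scorer."""
    return sum(xi * si for xi, si in zip(x, secret)) / N

THRESHOLD = 0.5
clean = [rng.gauss(0, 2) for _ in range(N)]   # never touched any generator

# Forgery: finite-difference probes of the score oracle recover, coordinate
# by coordinate, the direction that raises the score.
base = score(clean)
est = []
for i in range(N):
    bumped = clean[:]
    bumped[i] += 1.0
    est.append(1.0 if score(bumped) > base else -1.0)

forged = [x + 1.0 * e for x, e in zip(clean, est)]  # now scores above threshold
```

A keyed scheme whose detector is offline and secret removes the oracle and breaks this loop, which is why secret-key detection resists forgery better than a public similarity score.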

The regulatory implication

The EU AI Act's "robust, reliable, interoperable" standard for marking technologies (Article 50) is the place where the attack literature meets compliance. If an enforcement authority adopts a strict reading — schemes vulnerable to regeneration attacks are not "robust" — then most current generator watermarks would fail. A permissive reading — survival against benign distribution suffices — would accept current practice. The interpretation question is open and likely to be answered first through European Commission guidance and second through litigation.

The honest position for an implementer is to assume that any specific watermark will be defeated within months of becoming a high-value target. The practical answer is to use multiple complementary marks, to update them as attacks emerge, and to combine watermarking with non-watermark mechanisms (C2PA manifests, classifier-based detection). The open compliance question is whether this defense-in-depth strategy satisfies a regulation written as if a single robust watermark were possible.

Where the field is moving

The watermarking research community through 2025 and into 2026 has been adapting to the attack record. New schemes (Gaussian Shading, the tree-ring family, some learned-feature variants) have been explicitly designed to resist regeneration attacks at the cost of higher computational expense. The arms race continues, with each generation of scheme followed by a generation of attacks, in the long-running pattern that any cryptographic adversarial game produces.

The deeper question is whether the field continues to attempt a single robust scheme or accepts that watermarking is a layered, evolving system. The latter view is increasingly mainstream and is reflected in the C2PA spec's deliberate algorithm-agnosticism and in the production combination of watermarks plus manifests plus detection. The expectation in the broader literature is that no single scheme will solve the problem; the system-level architecture is what determines the actual defensive value.