From a technical perspective, a watermark is just another layer of PDF content (text, vector art, or an image) drawn over or under the main content. PDF’s stacking model makes removal possible via content filtering.

| Tool | Stars | Method | Best for |
|------|-------|--------|----------|
| pdfrw + custom script | ~500 | Filter page contents by type | Text watermarks |
| PyPDF2/PyMuPDF (fitz) | 6k+ | Remove annotations/overlay objects | Stamped watermarks |
| pdfCropMargins | ~300 | Crop then scale | Edge watermarks |
| OCRmyPDF + masking | 4k+ | OCR + regenerate | Image-based watermarks |
| Stirling-PDF | 20k+ | GUI + CLI with “Remove Watermark” | Non-technical users |
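
To make "content filtering" concrete, here is a minimal sketch that works on a decoded page content stream. It assumes the producer wrapped the watermark in a marked-content block tagged `/Artifact` with `/Subtype /Watermark`, which some tools do and many do not. The function names are hypothetical, and the regular expressions are a simplification: they do not handle nested dictionaries in the tag or operator names that happen to appear inside literal strings.

```python
import re
from typing import Optional

# Opening of a watermark artifact block, e.g.
#   /Artifact <</Subtype /Watermark /Type /Pagination>> BDC
WATERMARK_OPEN = re.compile(
    rb"/Artifact\s*<<[^>]*/Subtype\s*/Watermark[^>]*>>\s*BDC\b"
)
# Any marked-content operator, used to track nesting depth.
MARKED_OP = re.compile(rb"\b(BMC|BDC|EMC)\b")


def _find_block_end(stream: bytes, cursor: int) -> Optional[int]:
    """Return the offset just past the EMC that closes the block, or None."""
    depth = 1
    for op in MARKED_OP.finditer(stream, cursor):
        depth += -1 if op.group(1) == b"EMC" else 1
        if depth == 0:
            return op.end()
    return None  # unbalanced stream


def strip_watermark_artifacts(stream: bytes) -> bytes:
    """Return `stream` without marked-content blocks tagged as watermarks."""
    kept = []
    pos = 0
    while True:
        opening = WATERMARK_OPEN.search(stream, pos)
        if opening is None:
            break
        end = _find_block_end(stream, opening.end())
        if end is None:
            break  # leave an unbalanced remainder untouched
        kept.append(stream[pos:opening.start()])
        pos = end
    kept.append(stream[pos:])
    return b"".join(kept)


sample = (
    b"BT /F1 12 Tf (Body) Tj ET\n"
    b"/Artifact <</Subtype /Watermark /Type /Pagination>> BDC\n"
    b"q BT /F2 48 Tf (DRAFT) Tj ET Q\n"
    b"EMC\n"
    b"BT (More) Tj ET\n"
)
cleaned = strip_watermark_artifacts(sample)
```

With PyMuPDF, the decoded bytes could come from `page.read_contents()`, and the filtered result would need to be written back to the page's content stream. Because the drawing operators themselves are dropped, the watermark text is gone from the stream rather than merely hidden.
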
This physically removes the text, even from the copyable text layer.

Image watermarks (a scanned stamp or a logo) require a different approach:
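
    import fitz  # PyMuPDF

    # Detect watermark region (first page, look for repeated gray text).
    # `doc` is assumed to be an already opened fitz.Document.
    first_page = doc[0]
    watermarks = []
    for block in first_page.get_text("dict")["blocks"]:
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line.get("spans", []):
                # span["color"] is a packed sRGB integer (0xRRGGBB), not a float
                c = span["color"]
                r, g, b = (c >> 16) & 255, (c >> 8) & 255, c & 255
                if (r + g + b) / (3 * 255) < 0.5:  # dark gray/black threshold
                    watermarks.append(fitz.Rect(span["bbox"]))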
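
The loop above inspects only the first page, so it cannot tell a watermark from ordinary text of the same color. One way to act on the "repeated" hint is to collect spans from every page and keep those whose text and approximate position recur on most pages. The sketch below is an illustration under that assumption. The helper `repeated_spans` and its `min_share` and `grid` parameters are hypothetical, and the input tuples would have to be built from `span["text"]` and `span["bbox"]`.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Span = Tuple[str, BBox]                   # (text, bounding box)


def _key(span: Span, grid: float) -> tuple:
    """Text plus bbox snapped to a coarse grid, so small jitter still matches."""
    text, bbox = span
    return text.strip(), tuple(round(v / grid) for v in bbox)


def repeated_spans(
    pages: Sequence[Sequence[Span]],
    min_share: float = 0.8,
    grid: float = 5.0,
) -> List[Span]:
    """Return spans that recur on at least `min_share` of the pages."""
    counts: Counter = Counter()
    first_seen: Dict[tuple, Span] = {}
    for spans in pages:
        seen_on_page = set()
        for span in spans:
            k = _key(span, grid)
            if k in seen_on_page:
                continue  # count each candidate once per page
            seen_on_page.add(k)
            counts[k] += 1
            first_seen.setdefault(k, span)
    needed = min_share * len(pages)
    return [first_seen[k] for k, n in counts.items() if n >= needed]


pages = [
    [("Intro text", (72.0, 72.0, 300.0, 90.0)),
     ("CONFIDENTIAL", (100.0, 400.0, 500.0, 450.0))],
    [("Chapter 1", (72.0, 30.0, 150.0, 45.0)),
     ("CONFIDENTIAL", (100.4, 399.8, 500.2, 450.1))],
    [("Chapter 1", (72.0, 30.0, 150.0, 45.0)),
     ("CONFIDENTIAL", (100.0, 400.0, 500.0, 450.0))],
]
candidates = repeated_spans(pages)
```

Snapping to a grid is a deliberately crude choice: boxes that straddle a grid boundary can land in different cells, so a production version would likely compare boxes with an explicit tolerance instead.
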

And never remove watermarks to misrepresent ownership; that is where engineering becomes forgery.

This piece was assembled from real GitHub source analysis and PDF internals documentation. The code examples run on Python 3.8+ with PyMuPDF installed (`pip install PyMuPDF`).