FWIW, I favor a self-hosted approach. I scan paper documents as they come in with the CamScanner Pro Android app and share the resulting file to the Syncthing app, which uploads it to a specific folder on my NAS. The NAS then runs the OCRmyPDF Python script to automatically OCR the PDF and move it to a year/month-based directory. I can then use the Ambar software to do full-text search on any document I have on my NAS. Because scanning and OCR are so quick, scanning documents is a no-brainer, and full-text search makes it very easy to find them when I need them (in a matter of seconds). The software stack on the NAS is configured using docker-compose.
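The OCR-and-file step described above can be sketched roughly as follows. This is only an illustration, not the poster's actual script: it assumes the `ocrmypdf` command is installed, files each scan by its modification time, and uses made-up folder paths and function names.

```python
import subprocess
from datetime import datetime
from pathlib import Path


def dated_destination(archive_root: Path, pdf: Path, when: datetime) -> Path:
    """Return archive_root/YYYY/MM/<filename> for the given timestamp."""
    return archive_root / f"{when:%Y}" / f"{when:%m}" / pdf.name


def ocr_and_file(pdf: Path, archive_root: Path) -> Path:
    """OCR one scanned PDF into the year/month tree, then remove the original."""
    when = datetime.fromtimestamp(pdf.stat().st_mtime)
    target = dated_destination(archive_root, pdf, when)
    target.parent.mkdir(parents=True, exist_ok=True)
    # ocrmypdf takes the input file first, then the output file.
    subprocess.run(["ocrmypdf", str(pdf), str(target)], check=True)
    pdf.unlink()  # only reached if ocrmypdf exited successfully
    return target


if __name__ == "__main__":
    inbox = Path("/volume1/scans/inbox")      # hypothetical Syncthing folder
    archive = Path("/volume1/scans/archive")  # hypothetical archive root
    if inbox.is_dir():
        for scan in sorted(inbox.glob("*.pdf")):
            print(ocr_and_file(scan, archive))
```

Run from cron or a folder watcher, something like this would keep the inbox empty and the archive sorted by year and month.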
Thank you!
Now I have a new project thanks to you
If you’re into self-hosting solutions, you might be interested in https://github.com/jonaswinkler/paperless-ng
OCR included. I don’t use it, but I’ve heard a lot of good things about it.
That was one of the many solutions I tried, but I never managed to get it to work properly. These were my requirements, by the way:
- Self-hosted;
- Quick to import and to search PDFs;
- Use Tesseract OCR, as that is the gold standard of open-source OCR for me;
- No state stored that cannot be rebuilt from the PDFs themselves (I don’t want to own a separate datastore). Some solutions import PDFs and then you lose sight of them and everything must be done within the tool itself. No thanks.
- I scan all documents and destroy the paper originals, download the eBill invoices, etc.
- I sort them into different folders (insurance, pension, household, taxes, etc.) on a cloud drive (I do not use the laptop’s internal drive, so that they stay available from every device when I’m not at home).
- Every time I do a bit of admin work (about once or twice a month), I duplicate the folders to another cloud and to an external hard drive.
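The duplication step in the last bullet could also be scripted. Here is a minimal Python sketch, assuming both the source and the backup targets are reachable as ordinary local (or synced) directories; every path and the function name are hypothetical.

```python
import shutil
from pathlib import Path
from typing import Iterable


def mirror_folders(source: Path, destinations: Iterable[Path]) -> None:
    """Copy the whole document tree to each backup location."""
    for dest in destinations:
        # dirs_exist_ok=True (Python 3.8+) lets repeated runs refresh an
        # existing copy. Files deleted from the source are NOT removed
        # from the backups.
        shutil.copytree(source, dest, dirs_exist_ok=True)


if __name__ == "__main__":
    documents = Path.home() / "Cloud" / "Documents"  # hypothetical source
    backups = [
        Path.home() / "OtherCloud" / "Documents",    # hypothetical second cloud
        Path("/media/backup-drive/Documents"),       # hypothetical external drive
    ]
    if documents.is_dir():
        mirror_folders(documents, backups)
```

Because nothing is ever deleted from the copies, this behaves like an additive backup rather than a true mirror, which is arguably the safer default for paperwork.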
Coincidentally, my Ambar server recently started having issues with expired SSL certificates and consuming some CPU every 5 minutes, so I started looking for a replacement. I have now replaced it with paperless-ng (a much better maintained fork of the original paperless project) and I’m much happier with it, for the following reasons:
- it uses far fewer resources when idle;
- it still leverages the great OCRmyPDF/Tesseract projects;
- it has AI technology to automatically recognize sender/document date/type of document;
- it still maintains a tree of PDFs on your disk so you don’t lose access to them if the project stops being maintained for some reason;
- there’s a ton of community support, with apps for Android/iOS and even a command-line interface(!)
In short, hopefully you haven’t yet gone down the Ambar route, since this will be a more future-proof solution. Hope this helps!
Simple: I have a “Finance” share on my Synology NAS with a nightly encrypted backup to the Synology Cloud.
Thanks for the heads-up! Fortunately no, I was on holiday and have only just got back to my PC to tinker, so I’ll give this NG fork a test.
Hi @betterlatethannever
Very good topic! I handle archiving for both work and my personal life (obviously). I apply the 10-year rule to my private stuff as well (if it is in paper form I file it in a binder; if not, I archive everything on a NAS with a backup).
However, I recently came across this startup based in Lausanne: Addmin - your intelligent digital filing cabinet
I haven’t tried it out (yet), but it might be of interest if managing/filing documents is not your cup of tea.
Are these hosted/server solutions worth it for an individual user?
Can’t you just use your file manager’s built-in folders, tags and search functionality (Finder tags, Spotlight search)? Much less complexity and installation required, less likely to break with a software update or malfunction…?
Thank you for sharing this. Just gave paperless-ng a try and I have to say I’m really impressed by the OCR capabilities.
I’ll now start digitizing all my documents and finally get rid of as much paper as possible.
Thanks for the paperless-ng recommendation, it was rather easy to spin up the Docker containers on my QNAP.
$ ocrmypdf -l fra MyInputFile MyOutputFile
MyInputFile and MyOutputFile can have the same name or different names.
The -l option sets the language of the document (here fra, French) so that the OCR is more accurate.
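For documents that mix languages, ocrmypdf accepts several Tesseract language codes joined with `+` (for example `-l fra+deu`), provided the matching Tesseract language packs are installed. Below is a small Python sketch that builds the same command and runs it over a whole folder; the helper names and the folder path are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(src: Path, dst: Path, languages: list) -> list:
    """Build the ocrmypdf argument list for one file."""
    return ["ocrmypdf", "-l", "+".join(languages), str(src), str(dst)]


def ocr_folder(folder: Path, languages: list) -> None:
    """OCR every PDF in a folder, writing each result over its input."""
    for pdf in sorted(folder.glob("*.pdf")):
        # Same path for input and output, which ocrmypdf allows.
        subprocess.run(build_command(pdf, pdf, languages), check=True)


if __name__ == "__main__":
    scans = Path.home() / "Scans"  # hypothetical folder of scanned PDFs
    if scans.is_dir():
        ocr_folder(scans, ["fra", "deu"])
```

Note that by default ocrmypdf stops with an error on a PDF that already contains text, so a folder with already-processed files may need an option such as `--skip-text`.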