Picking Linux

Mar 20, 2020

Needs advice

We have thousands of PDF documents generated from the same form but with a lot of variability. We need to extract data from free text and, more importantly, from tables inside the documents. The output in Couchbase/MongoDB will be one row per document for backend processing. Adobe renders the tables in an unusable form.

9 upvotes·239K views

Replies (3)

Otkudznam Damir Radinović-Lukić

Mar 27, 2020

Recommends

Linux

If you can select text in the PDF by dragging the mouse, use pdftotext; it is fast. You can install it on a server with the command "apt-get install poppler-utils". Use it like "pdftotext -layout /path-to-your-file". It will create a text file in the same folder, with the content line by line. There are also a few classes on git stacks that you can use.
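A rough sketch of that workflow from Python, assuming pdftotext (from poppler-utils) is on the PATH. The helper names `pdf_to_layout_text` and `split_table_row` are made up for illustration, and splitting on runs of two or more spaces is only a heuristic for the column gaps that `-layout` preserves; real tables will likely need tuning.

```python
import re
import shutil
import subprocess
from typing import List


def pdf_to_layout_text(pdf_path: str) -> str:
    """Run `pdftotext -layout` on one file and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils first")
    # Passing "-" as the output file makes pdftotext write to stdout
    # instead of creating a .txt file next to the PDF.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
        encoding="utf-8",
    )
    return result.stdout


def split_table_row(line: str) -> List[str]:
    """Split one layout-preserved line into cells (heuristic: 2+ spaces)."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


if __name__ == "__main__":
    # Works on a hand-written sample line, so no PDF is needed to try it.
    print(split_table_row("Item A     12    3.50"))
```

Single spaces inside a cell (such as "Item A") survive, while the wider gaps between columns become cell boundaries.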

3 upvotes·231.1K views

Petr Havlicek

Freelancer at havlicekpetr.cz·Mar 21, 2020

Recommends

MongoDB

I prefer MongoDB because of my own experience migrating an old archive of PDFs and metadata to a new "archive". The biggest advantage is the speed of filtered output (the new archive is much faster and more reliable than the old one), but also how easy MongoDB is to program against, with many code snippets and examples available. I have no personal experience with Couchbase so far. From an architecture point of view both options are OK; go for the one you like.
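For the "one row per document" requirement, a minimal sketch of what a stored record could look like, assuming the pymongo driver and a local MongoDB instance. The function names, the field layout, the connection URI, and the database and collection names below are all hypothetical, not something taken from the original setup.

```python
from typing import Any, Dict, List


def build_record(
    source_file: str,
    fields: Dict[str, str],
    table_rows: List[List[str]],
) -> Dict[str, Any]:
    """Assemble one MongoDB document per PDF (hypothetical field layout)."""
    return {
        "_id": source_file,   # file name as key, so re-runs overwrite
        "fields": fields,     # values pulled from the free text
        "table": table_rows,  # extracted table, one list per row
    }


def save_record(
    record: Dict[str, Any],
    uri: str = "mongodb://localhost:27017",  # assumed local instance
    db_name: str = "pdf_archive",            # hypothetical database name
    coll_name: str = "documents",            # hypothetical collection name
) -> None:
    """Upsert the record so each PDF maps to exactly one document."""
    # Imported lazily so build_record works without the driver installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    try:
        client[db_name][coll_name].replace_one(
            {"_id": record["_id"]}, record, upsert=True
        )
    finally:
        client.close()


if __name__ == "__main__":
    rec = build_record("form_001.pdf", {"name": "Jane"}, [["Item A", "12"]])
    print(rec)
```

Using the file name as `_id` together with an upsert means reprocessing a PDF replaces its record instead of creating a duplicate.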

12 upvotes·231.7K views
