Convert PDF into TXT


  • #16
    finally there's a working example: http://www.mirthcorp.com/community/f...0&postcount=11

    so, now can somebody please modify that to extract text from the pdf?



    • #17
      there's a PdfTextExtractor (http://api.itextpdf.com/itext/com/it...Extractor.html) that hugely simplifies the conversion (as we can see in http://itextpdf.com/examples/iia.php?id=296)
      BUT
      that function is only available from iText 2.1.4 onward, and since I'm using Mirth 2.2, the iText version that comes with it is 2.0.8... so =(

      by the way, Mirth 3.0 does come with iText 2.1.4

      but I already tried to "migrate" the Mirth version (loading my current channel in v3) and it didn't work, so... HELP!



      • #18
        bad news, everyone: not even iText 2.1.4 supports the PdfTextExtractor.getTextFromPage function, so migrating to Mirth 3.0 IS NOT ENOUGH.

        GOOD NEWS EVERYONE!: I managed to make it work!
        what I did was:

        download the latest iText version (5.5.1) from http://sourceforge.net/projects/itext/files/iText
        then rename itextpdf-5.5.1.jar to iText-2.0.8.jar and use it to replace the Mirth file at C:\Program Files (x86)\Mirth Connect\extensions\doc\lib (Mirth should be stopped first)
        And then, execute this:

        Code:
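        function extractText(src) {
        	// src is the path to the PDF file; iText page numbers start at 1
        	var reader = new Packages.com.itextpdf.text.pdf.PdfReader(src);
        	try {
        		for (var i = 1; i <= reader.getNumberOfPages(); i++) {
        			// log the extracted text of each page
        			logger.info(Packages.com.itextpdf.text.pdf.parser.PdfTextExtractor.getTextFromPage(reader, i));
        		}
        	} finally {
        		// release the file even if extraction throws
        		reader.close();
        	}
        }
        extractText("C:/ldiag/ddd.pdf");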
        and it worked! =)

        so, the thing is, the 5.5.1 version is under the AGPL license, not MPL... what does that mean?
        and
        was what I did OK? I mean, is it as simple as replacing the jar? no need to "compile" Mirth or something like that?
        Last edited by gomezmsebastian; 07-10-2014, 12:17 PM.



        • #19
          I'm guessing that, without affecting Mirth or licensing, you could have installed the 5.5.1 version into the custom-lib folder, restarted Mirth, and then just added package 'import' statements. That way there would be no conflict with how Mirth uses the older version of iText. But I haven't tried this yet, so I'm not sure.

          -cp



          • #20
            PDF to Text Online Class

            I had to do this recently, but with InterSystems Ensemble. Needless to say, it was not at all easy there. So I jumped over to my favorite engine: Mirth.

            I was so impressed by how easy it was to extract text from a PDF that I thought I should document it. I used the iText library from itextpdf.com.

            I have a PDF that will help:
            http://www.mediafire.com/download/8a...DF_to_Text.pdf

            Also, I put together an online class that hopefully helps, too:
            https://hl7-starter-kit.teachable.co...ct-pdf-to-text

            I hope this helps!
            Vivian
            HL7StarterKit.com



            • #21
              I'm trying to do something similar, but from an attachment.
              If I save the attachment to disk and read it in using the code above, it works.

              This doesn't, but I cannot see why...

              Code:
              var inputstream = new Packages.java.io.ByteArrayInputStream(getAttachments().get(0).getContent());
              var reader = new Packages.com.itextpdf.text.pdf.PdfReader(inputstream);
              for (i=1;i<=reader.getNumberOfPages();i++) {
              		logger.info(Packages.com.itextpdf.text.pdf.parser.PdfTextExtractor.getTextFromPage(reader, i));
              		//logger.info(Packages.com.lowagie.text.pdf.parser.PdfTextExtractor.getTextFromPage(reader, i));
              }
              reader.close();
              inputstream.close();
              I get the following back, which suggests it cannot read the attachment:

              Code:
              Caused by: com.itextpdf.text.exceptions.InvalidPdfException: PDF header signature not found.
              	at com.itextpdf.text.pdf.PRTokeniser.getHeaderOffset(PRTokeniser.java:227)
              	at com.itextpdf.text.pdf.PdfReader.getOffsetTokeniser(PdfReader.java:442)
              Can anyone advise what is wrong here?



              • #22
                This seems to work, but is it the best way?

                Code:
                var content = getAttachments().get(0).getContent();
                // the attachment content appears to be Base64 text, so decode it to raw PDF bytes first
                var decoded = FileUtil.decode(new Packages.java.lang.String(content));
                var inputstream = new Packages.java.io.ByteArrayInputStream(decoded);
                var reader = new Packages.com.itextpdf.text.pdf.PdfReader(inputstream);
                for (var i = 1; i <= reader.getNumberOfPages(); i++) {
                		logger.info(Packages.com.itextpdf.text.pdf.parser.PdfTextExtractor.getTextFromPage(reader, i));
                		// do other stuff here to regex match content
                }
                reader.close();
                inputstream.close();
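
The "PDF header signature not found" error in #21 and the decode step in #22 are consistent with the attachment content being Base64 text rather than raw PDF bytes. Here is a minimal sketch of that idea that runs in plain Node.js, outside Mirth. The helper name `startsWithPdfHeader` and the placeholder bytes are made up for illustration, and the check is simplified: it only looks at the first five bytes, which is not necessarily how iText locates the header.

```javascript
// Sketch only: runs in Node.js, not inside a Mirth channel.
// startsWithPdfHeader is a hypothetical helper, not part of iText or Mirth.
function startsWithPdfHeader(bytes) {
  // A PDF file begins with the ASCII signature "%PDF-"
  return bytes.slice(0, 5).toString("latin1") === "%PDF-";
}

// Stand-in for a real PDF: only the leading bytes matter for this check
var rawPdf = Buffer.from("%PDF-1.4\n% placeholder body\n", "latin1");

// The bytes a reader would see if the content were handed over as Base64 text
var base64Text = Buffer.from(rawPdf.toString("base64"), "latin1");

// Decoding the Base64 text recovers the original PDF bytes
var decoded = Buffer.from(base64Text.toString("latin1"), "base64");

console.log(startsWithPdfHeader(base64Text)); // false
console.log(startsWithPdfHeader(decoded));    // true
```

Base64-encoded PDFs start with the characters `JVBERi0`, so logging the first few characters of the attachment content is a quick way to tell which form you have.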



                • #23
                  Perhaps off topic, but I would most certainly do this outside of Mirth with command-line utilities, delivering the extracted files to a folder for downstream Mirth consumption.

                  https://stackoverflow.com/questions/...ext-extraction has some tools listed.

                  There are cases like this one where other tools seem easier, at least for me.
                  Mirth 3.8.0 / PostgreSQL 11 / Ubuntu 18.04
                  Diridium Technologies, Inc.
                  https://diridium.com



                  • #24
                    Error in PDF To text

                    Hi Vivian. I received this error when sending a message. My Mirth version is 3.5.


                    ....

                    DETAILS: TypeError: [JavaPackage com.itextpdf.text.pdf.PdfReader] is not a function, it is object.
                    at 272b44b5-fffe-46a9-b446-e0e7ceb291e2:3419 (extractText)
                    at 272b44b5-fffe-46a9-b446-e0e7ceb291e2:3428 (doTransform)
                    at 272b44b5-fffe-46a9-b446-e0e7ceb291e2:3450 (doScript)
                    at 272b44b5-fffe-46a9-b446-e0e7ceb291e2:3452 .....
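
In the error above, the name is reported as a `JavaPackage`. In Rhino that usually means no class with that name was found on the classpath, so one thing worth checking is whether the jar that provides `com.itextpdf` is actually loaded by this Mirth install. A small diagnostic sketch follows. `isUnresolvedJavaClass` is a hypothetical helper, and the `[JavaClass ...]` string used for the resolved case is an assumption about how Rhino prints a class that did resolve. The stand-in objects are there only so the sketch can run outside Mirth.

```javascript
// Hypothetical helper: true when a Packages.* reference did not resolve to a class.
function isUnresolvedJavaClass(ref) {
  // The error message shows an unresolved name rendered as "[JavaPackage <name>]"
  return String(ref).indexOf("[JavaPackage ") === 0;
}

// Stand-ins that mimic the string forms, so this runs in plain Node.js
var unresolved = {
  toString: function () { return "[JavaPackage com.itextpdf.text.pdf.PdfReader]"; }
};
var resolved = {
  // Assumed string form of a class that was found on the classpath
  toString: function () { return "[JavaClass com.itextpdf.text.pdf.PdfReader]"; }
};

console.log(isUnresolvedJavaClass(unresolved)); // true
console.log(isUnresolvedJavaClass(resolved));   // false
```

Inside a channel the call would be `isUnresolvedJavaClass(Packages.com.itextpdf.text.pdf.PdfReader)`, logged before the `new` call.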

