PDF extraction with tm package

PDF extraction with tm package

Steven Kang
Hi R users,

I’m having trouble extracting text from a PDF file using the tm
package.

Here are the steps that were carried out:

1. Downloaded and installed the following programs:

- Xpdf (copied the ‘bin32’, ‘bin64’ and ‘doc’ folders into the ‘C:\Program
Files\Xpdf’ directory; also added C:\Program Files\Xpdf\bin64\pdfinfo.exe and
C:\Program Files\Xpdf\bin64\pdftotext.exe to the existing PATH)

- Tesseract

- Imagemagick

2. Ran the following code, which produced the error messages shown:

# Directory where PDF files are stored
> library(tm)
> cname <- getwd()
> Corpus(DirSource(cname), readerControl = list(reader = readPDF))
Error in system2("pdftotext", c(control$text, shQuote(x), "-"), stdout = TRUE) :
  '"pdftotext"' not found
In addition: Warning message:
running command '"pdfinfo" "C:\Users\R_Files\XXX.pdf"' had status 127

# Check whether R can locate the executables
> file.exists(Sys.which(c("pdfinfo", "pdftotext")))
[1] FALSE FALSE

It seems R can’t find the pdfinfo and pdftotext executables, but I’m not
sure why, given that the Xpdf files were copied into ‘C:\Program Files’
(I’m using 64-bit Windows 7).

I’m aware that the ‘pdf_text’ function from the pdftools package can extract
text from a PDF file and return it as a string. But I am after something that
can convert a PDF (i.e. transaction data) into a data frame without regular
expressions. Is the tm package capable of this conversion? Are there any
alternatives to these methods?

Your expertise in resolving this problem would be highly appreciated.


Steve

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: PDF extraction with tm package

Jeff Newmiller
This is neither the Xpdf support forum nor the Windows Setup Program Reinvention support group... and you really need to read and follow the Posting Guide for the R mailing lists.

FWIW I would guess that you need to learn about environment variables and in particular about the PATH variable. There are subtleties about when and how they get defined that are OS-specific and certainly off topic here that may trip you up along the way. Alternatively, you may read the Xpdf documentation or a how-to blog about Xpdf that gives you a recipe, but again that is not about R. Once you can start a CMD shell and run the command directly then you are most of the way to getting R to invoke it.
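
As a rough illustration of the PATH point (a sketch only, not a tested recipe for this particular setup): PATH entries are normally directories, not paths to individual .exe files, so from within R one could inspect PATH and append the Xpdf directory for the current session. The directory below is taken from the original post and is an assumption about where the binaries actually live; xpdf_dir is just an illustrative name.

```r
# Assumption: the Xpdf binaries are in this directory (from the original
# post). Adjust it to the real install location.
xpdf_dir <- normalizePath("C:/Program Files/Xpdf/bin64",
                          winslash = "\\", mustWork = FALSE)

# Split PATH into its entries; each entry should be a directory,
# not the path of an individual executable.
path_entries <- strsplit(Sys.getenv("PATH"), .Platform$path.sep,
                         fixed = TRUE)[[1]]
print(path_entries)

# Append the directory to PATH for the current R session only.
if (!(xpdf_dir %in% path_entries)) {
  Sys.setenv(PATH = paste(Sys.getenv("PATH"), xpdf_dir,
                          sep = .Platform$path.sep))
}

# Sys.which() returns "" for any program it cannot find on PATH.
found <- Sys.which(c("pdfinfo", "pdftotext"))
print(found)
print(nzchar(found))
```

If nzchar(found) is still FALSE for either program, the same check in a CMD shell (typing pdftotext directly) is probably the next thing to try, since a change made with Sys.setenv() lasts only for the running R session.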
--
Sent from my phone. Please excuse my brevity.

On July 21, 2016 5:26:26 PM PDT, Steven Kang <[hidden email]> wrote:

