Extract lines from pdf files


Extract lines from pdf files

R help mailing list-2

Colleagues,

 

I can extract specific data from lines in a PDF using:

library(pdftools)

txt <- pdf_text("10619.pdf")
write.table(txt, file = "mydata.txt")

con <- file("mydata.txt")
open(con)
serial    <- read.table(con, skip = 5,  nrows = 1)  # extract [3]
flatness  <- read.table(con, skip = 11, nrows = 1)  # extract [5]
parallel1 <- read.table(con, skip = 2,  nrows = 1)  # extract [5]
parallel2 <- read.table(con, skip = 4,  nrows = 1)  # extract [5]
close(con)

 

# Note that serial has 4 variables,
# flatness has 6 variables,
# parallel1 has 5 variables, and
# parallel2 has 5 variables.

# This outputs the specific data I need:
serial[3]
flatness[5]
parallel1[5]  # The text output shows 0.0007, not scientific notation. Is there a way to display the value as it appears in the original data?
parallel2[5]  # The text output shows 0.0006, not scientific notation. Is there a way to display the value as it appears in the original data?
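One way to keep a value in the notation it had in the source text is to read the fields as character rather than letting read.table() convert them to numbers. The following is only a sketch: the sample line and its field layout are made up for illustration, since the actual PDF text is not shown here.

```r
# Hypothetical line standing in for one row of the extracted text.
line <- "Parallelism 1 Max 0.001 7.0E-04"

# Option 1: read every field as text, so the original notation is untouched.
fields <- read.table(text = line, colClasses = "character")
original <- fields[[5]]

# Option 2: convert to a number, then ask for scientific notation when displaying it.
value <- as.numeric(original)
shown <- format(value, scientific = TRUE)

print(original)
print(shown)
```

Keeping the text form is the safer choice when the table must reproduce the report exactly; format() only controls how a parsed number is displayed.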

 

I'd like to extend this code to all of the PDF files in a directory and to generate a table of all the serial, flatness, parallel1, and parallel2 data.

I'm not having a lot of success trying to build the script for this. Some pointers would be appreciated.
All the best.
 
Thomas Subia

Statistician / Senior Quality Engineer



        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Extract lines from pdf files

Jeff Newmiller
Please don't spam the mailing list. Especially with HTML format messages. See the Posting Guide.

PDF is designed to present data graphically. It is literally possible to place every character on the page in random order and still achieve visual readability, while making the text nearly impossible to extract programmatically. I have encountered many PDF files with the same text placed on the page multiple times, which again scrambles any attempt to read it digitally. Tools like "pdftools" can sometimes work when the program that generated the file does so in a simple and extraction-friendly way, but there are no guarantees, and your description suggests that you likely won't be able to accomplish your goal with this file.

On November 19, 2019 11:52:20 PM GMT+01:00, Thomas Subia via R-help <[hidden email]> wrote:


--
Sent from my phone. Please excuse my brevity.


Re: Extract lines from pdf files

Eric Berger
Hi Thomas,
As Jeff wrote, your HTML email is difficult to read. This is a "plain
text" forum.
As for "pointers", here is one suggestion.
Since you write that you can do the necessary actions with a specific
file, try to write a function that carries out those actions for that
same file.
Except when implementing the function, replace any specific data with
the value of an argument passed into the function.
e.g.
txt <- pdf_text("10619.pdf")
would be replaced by
txt <- pdf_text(pdfFile)

and your function would have pdfFile as an argument, as in

myfunc <- function( pdfFile )

Since you can accomplish the task for this file without a function,
you should be able to accomplish the task with a function.
Once you succeed in doing that, you can then try passing the function
arguments that refer to the other files you need to process.
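The suggestion above can be sketched as follows. This is an illustration under stated assumptions, not a tested solution for the actual reports: the helper names (extract_fields, process_pdfs), the line numbers, the field positions, and the sample page are all hypothetical. The only pdftools call used is pdf_text(), which returns one string per page.

```r
# Pull the fields of interest out of one page of text.
# `page_text` is a single string, such as one element of pdftools::pdf_text().
# Line numbers and field positions below are hypothetical; adjust to the real layout.
extract_fields <- function(page_text, file_name = NA_character_) {
  lines <- strsplit(page_text, "\n", fixed = TRUE)[[1]]
  field <- function(line_no, pos) {
    parts <- strsplit(trimws(lines[line_no]), "\\s+")[[1]]
    parts[pos]
  }
  data.frame(
    file      = file_name,
    serial    = field(1, 3),
    flatness  = field(2, 5),
    parallel1 = field(3, 5),
    parallel2 = field(4, 5),
    stringsAsFactors = FALSE
  )
}

# Apply the extraction to every file and stack the one-row results.
# `read_page` defaults to the first page returned by pdftools::pdf_text().
process_pdfs <- function(files,
                         read_page = function(f) pdftools::pdf_text(f)[1]) {
  rows <- lapply(files, function(f) extract_fields(read_page(f), file_name = f))
  do.call(rbind, rows)
}

# Made-up page text, used here so the sketch runs without any PDF files.
sample_page <- paste(
  "Serial Number: 10619 rev",
  "Flatness Spec Max Min 0.0012 in",
  "Parallelism 1 Max 0.001 7.0E-04",
  "Parallelism 2 Max 0.001 6.0E-04",
  sep = "\n"
)
demo <- process_pdfs(c("a.pdf", "b.pdf"), read_page = function(f) sample_page)
print(demo)
```

With real files, the call would be something like process_pdfs(list.files(pattern = "\\.pdf$")), which requires the pdftools package to be installed. Splitting the page text in memory also avoids the round trip through write.table() and a temporary text file.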

HTH,
Eric


On Wed, Nov 20, 2019 at 1:09 AM Jeff Newmiller <[hidden email]> wrote:


Re: Extract lines from pdf files

reichmaj
Eric

I will have to give that a try. Thanks.
For an "it works" method I used:

start_time <- Sys.time()

# insert code of interest

end_time <- Sys.time()
end_time - start_time

-----Original Message-----
From: R-help <[hidden email]> On Behalf Of Eric Berger
Sent: Wednesday, November 20, 2019 9:58 AM
To: Jeff Newmiller <[hidden email]>
Cc: Thomas Subia <[hidden email]>; Thomas Subia via R-help
<[hidden email]>
Subject: Re: [R] Extract lines from pdf files


Re: Extract lines from pdf files

Bert Gunter-2
Don't do this (the timing code you showed, not Eric's suggestions).

Do this:

sys.time( {
    code of interest
})

(or use the microbenchmark package functionality)

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Wed, Nov 20, 2019 at 10:56 AM Jeff Reichman <[hidden email]>
wrote:


Re: Extract lines from pdf files

Bert Gunter-2
rather, system.time() of course.
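
A minimal sketch of the system.time() form, with a placeholder computation standing in for the "code of interest" (the placeholder is an assumption for illustration only):

```r
# Time a block of code; the braces are evaluated inside system.time().
timing <- system.time({
  total <- sum(sqrt(seq_len(1e6)))  # placeholder for the code of interest
})

# The result holds user, system, and elapsed times in seconds.
print(timing)
print(timing[["elapsed"]])
```

Objects assigned inside the braces, such as total here, remain available afterwards.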

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Wed, Nov 20, 2019 at 11:19 AM Bert Gunter <[hidden email]> wrote:


Re: Extract lines from pdf files

R help mailing list-2
In reply to this post by Eric Berger
Thanks all for the help. I appreciate the feedback.
I've developed another method to extract my desired data from multiple PDFs in a directory.

library(pdftools)

# Combine all PDFs into a single PDF
files <- list.files(pattern = "\\.pdf$")
pdf_combine(files, output = "joined.pdf")

# Create a text file from joined.pdf
txt <- pdf_text("joined.pdf")
write.table(txt, file = "mydata.txt")

# I need to extract the lines that begin with AMAT
lines <- readLines("mydata.txt")
date <- grep("AMAT", lines)

# Output for date looks like: [1]   6  62 118 174 230 286 342 398
# These are exactly the line positions I need.

Now that I've got the desired lines, I don't know how to extract the data from those lines.

Any advice would be appreciated.

All the best,

Thomas Subia
Statistician / Quality Engineer
IMG Precision Inc.








On Wednesday, November 20, 2019, 07:58:08 AM PST, Eric Berger <[hidden email]> wrote:





Hi Thomas,
As Jeff wrote, your HTML email is difficult to read. This is a "plain
text" forum.
As for "pointers", here is one suggestion.
Since you write that you can do the necessary actions with a specific
file, try to write a function that carries out those actions for that
same file.
Except when implementing the function, replace any specific data with
the value of an argument passed into the function.
e.g.
txt <- pdf_text("10619.pdf")
would be replaced by
txt <- pdf_text(pdfFile)

and your function would have pdfFile as an argument, as in

myfunc <- function( pdfFile )

Since you can accomplish the task for this file without a function,
you should be able to accomplish the task with a function.
Once you succeed to do that you can then try passing the function
arguments that refer to the other files you need to process.

HTH,
Eric


On Wed, Nov 20, 2019 at 1:09 AM Jeff Newmiller <[hidden email]> wrote:

>
> Please don't spam the mailing list. Especially with HTML format messages. See the Posting Guide.
>
> PDF is designed to present data graphically. It is literally possible to place every character in the page in random order and still achieve this visual readability while practically making it nearly impossible to read. I have encountered many PDF files with the same text placed on the page multiple times... again scrambling your option to read it digitally. Tools like "pdftools" can sometimes work when the program that generated the file does so in a simple and extraction-friendly way... but there are no guarantees, and your description suggests that it is likely that you won't be able to accomplish your goal with this file.
>
> On November 19, 2019 11:52:20 PM GMT+01:00, Thomas Subia via R-help <[hidden email]> wrote:
> >
> >Colleagues,
> >
> >
> >
> >I can extract specific data from lines in a pdf using:
> >
> >
> >
> >library(pdftools)
> >
> >pdf_text("10619.pdf")
> >
> >txt <- pdf_text(".pdf")
> >
> >write.table(txt,file="mydata.txt")
> >
> >con <- file('mydata.txt')
> >
> >open(con)
> >
> >serial <- read.table(con,skip=5,nrow=1) #Extract[3]flatness <-
> >read.table(con,skip=11,nrow=1)# Extract [5]
> >
> >parallel1 <-read.table(con,skip=2,nrow=1)# Extract [5]
> >
> >parallel2 <-read.table(con,skip=4,nrow=1)# Extract [5]
> >
> >close(con)
> >
> >
> >
> ># note here that serial has 4 variables
> >
> ># flatness had 6 variables
> >
> ># parallel1 has 5 variables
> >
> ># parallel2 has 5 variables
> >
> >
> >
> ># this outputs the specific data I need
> >
> >serial[3]
> >
> >flatness[5]
> >
> >parallel1[5] # Note here that the txt format shows 0.0007not
> >scientific, is there a way to format this to display the original data?
> >
> >parallel2[5] # Note here that the txt format shows 0.0006not
> >scientific, , is there a way to format this to display the original
> >data?
> >
> >
> >
> >I'd like to extend this code to all of the pdf files in adirectory and
> >to generate a table of all the serial, flatness, parallel1 andparallel2
> >data.
> >
> >I'm not having a lot of success trying to build thescript for this.
> >Some pointers would be appreciated.
> >All the best.
> >
> >Thomas Subia
> >
> >Statistician / Senior Quality Engineer
> >
> >
> >
> >      [[alternative HTML version deleted]]
> >
> >______________________________________________
> >[hidden email] mailing list -- To UNSUBSCRIBE and more, see
> >https://stat.ethz.ch/mailman/listinfo/r-help
> >PLEASE do read the posting guide
> >http://www.R-project.org/posting-guide.html
> >and provide commented, minimal, self-contained, reproducible code.
>
> --
> Sent from my phone. Please excuse my brevity.


Re: Extract lines from pdf files

Bert Gunter-2
I think you are more likely to get a helpful answer if you give a minimal
example of what your lines look like. I certainly don't have a clue, though
maybe someone else will.

Cheers,
Bert


On Wed, Nov 20, 2019 at 12:21 PM Thomas Subia via R-help <
[hidden email]> wrote:

> Thanks all for the help. I appreciate the feedback.
> I've developed another method to extract my desired data from multiple
> pdfs in a directory.
>
> library(pdftools)
>
> # Combine all pdfs into a single pdf
> # (note: on a second run, joined.pdf itself also matches this pattern)
> files <- list.files(pattern = "pdf$")
> pdf_combine(files, output = "joined.pdf")
>
> # Create a text file from joined.pdf
> txt <- pdf_text("joined.pdf")
> write.table(txt,file="mydata.txt")
>
> # I need to extract the lines which match a line beginning with AMAT
> lines <- readLines("mydata.txt")
> date <- grep("AMAT",lines)
>
> # output for date looks like [1]   6  62 118 174 230 286 342 398
> # These are exactly the line positions I need.
>
> Now that I've got the desired lines, I don't know how to extract the data
> from those lines.
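
One possible way to pull values out of the matched lines, as a rough sketch only: the sample lines and the field positions below are assumptions, because the actual content of the AMAT lines is not shown in the thread. The idea is to index the lines with the result of grep(), split each line on whitespace, and pick fields by position.

```r
# Hypothetical stand-in for readLines("mydata.txt"); the real text is unknown.
lines <- c(
  "Header text",
  "AMAT 10619 Flatness 0.0007 Parallel 6.0E-4",
  "some other line",
  "AMAT 10620 Flatness 0.0009 Parallel 7.0E-4"
)

# Positions of the lines that contain "AMAT", as in the post above.
idx <- grep("AMAT", lines)

# Split each matched line on runs of whitespace.
fields <- strsplit(trimws(lines[idx]), "[[:space:]]+")

# Pick fields by position (2 and 4 are assumptions that fit this sample only).
result <- data.frame(
  serial   = vapply(fields, function(f) f[2], character(1)),
  flatness = vapply(fields, function(f) f[4], character(1)),
  stringsAsFactors = FALSE
)
print(result)
```

The values stay as character strings here, so they appear exactly as they were written in the text; as.numeric() can be applied afterwards if needed.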
>
> Any advice would be appreciated.
>
> All the best,
>
> Thomas Subia
> Statistician / Quality Engineer
> IMG Precision Inc.
>
> On Wednesday, November 20, 2019, 07:58:08 AM PST, Eric Berger <
> [hidden email]> wrote:
>
> Hi Thomas,
> As Jeff wrote, your HTML email is difficult to read. This is a "plain
> text" forum.
> As for "pointers", here is one suggestion.
> Since you write that you can do the necessary actions with a specific
> file, try to write a function that carries out those actions for that
> same file.
> Except when implementing the function, replace any specific data with
> the value of an argument passed into the function.
> e.g.
> txt <- pdf_text("10619.pdf")
> would be replaced by
> txt <- pdf_text(pdfFile)
>
> and your function would have pdfFile as an argument, as in
>
> myfunc <- function( pdfFile )
>
> Since you can accomplish the task for this file without a function,
> you should be able to accomplish the task with a function.
> Once you succeed in doing that, you can then try passing the function
> arguments that refer to the other files you need to process.
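
The suggestion above might look roughly like the following sketch. The function name extract_serial, the reader argument, and the fake_reader stand-in are all hypothetical, and the assumption that the serial number is the second field on the first line containing AMAT is only for illustration. The reader argument exists so the sketch can be tried without a real PDF; in practice the default pdftools::pdf_text would be used.

```r
# Hypothetical helper: 'reader' defaults to pdftools::pdf_text and is a
# parameter only so the sketch can be tried without a real PDF.
extract_serial <- function(pdfFile, reader = pdftools::pdf_text) {
  txt   <- reader(pdfFile)                            # one string per page
  lines <- unlist(strsplit(txt, "\n", fixed = TRUE))  # split pages into lines
  hit   <- grep("AMAT", lines, value = TRUE)[1]       # first line containing AMAT
  f     <- strsplit(trimws(hit), "[[:space:]]+")[[1]] # whitespace-separated fields
  data.frame(file = pdfFile, serial = f[2], stringsAsFactors = FALSE)
}

# Stand-in reader returning made-up page text, used here instead of a PDF.
fake_reader <- function(path) "Report header\nAMAT 10619 0.0007\n"

files  <- c("a.pdf", "b.pdf")  # with real data: list.files(pattern = "\\.pdf$")
result <- do.call(rbind, lapply(files, extract_serial, reader = fake_reader))
print(result)
```

With real files, dropping the reader argument from the lapply() call would fall back to pdftools::pdf_text, and each PDF would contribute one row to the table.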
>
> HTH,
> Eric
