Extracting specific lines from pdfs

Thomas Subia-2
Colleagues,

I can extract specific values from lines in a single PDF using:

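library(pdftools)

txt <- pdf_text("10619.pdf")   # one character string per page
write.table(txt, file = "mydata.txt")
con <- file("mydata.txt")
open(con)
# Reads on an open connection are sequential, so each skip counts
# from the line after the previous read.
serial    <- read.table(con, skip = 5,  nrows = 1)  # value is in serial[3]
flatness  <- read.table(con, skip = 11, nrows = 1)  # value is in flatness[5]
parallel1 <- read.table(con, skip = 2,  nrows = 1)  # value is in parallel1[5]
parallel2 <- read.table(con, skip = 4,  nrows = 1)  # value is in parallel2[5]
close(con)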

# Note that serial has 4 variables,
# flatness has 6 variables,
# parallel1 has 5 variables,
# and parallel2 has 5 variables.

# This outputs the specific data I need:
serial[3]
flatness[5]
parallel1[5] # The text output shows 0.0007, not scientific notation. Is there a way to display the value in its original format?
parallel2[5] # The text output shows 0.0006, not scientific notation. Is there a way to display the value in its original format?

I'd like to extend this code to every PDF file in a directory and generate a single table holding the serial, flatness, parallel1 and parallel2 values for each file.
So far I have not managed to build a working script for this. Some pointers would be appreciated.
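
One way this could be structured is sketched below, under stated assumptions rather than as a tested solution for these reports. It splits the output of pdf_text() into lines directly instead of going through a temporary text file. The helper names (pick_field, extract_one, extract_dir) and the positions list are hypothetical. The line numbers in positions are simply the cumulative skip counts from the code above and will very likely need adjusting to the real layout.

```r
# Placeholder positions: line number in the extracted text and field number
# on that line. These are assumptions; adjust them to the report layout.
positions <- list(
  serial    = c(line = 6,  field = 3),
  flatness  = c(line = 18, field = 5),
  parallel1 = c(line = 21, field = 5),
  parallel2 = c(line = 26, field = 5)
)

# Return one whitespace-separated field from one line (NA if it is missing).
pick_field <- function(lines, pos) {
  fields <- strsplit(trimws(lines[pos[["line"]]]), "\\s+")[[1]]
  fields[pos[["field"]]]
}

# Turn one PDF into a one-row data frame. Values stay as character,
# so they keep the formatting found in the PDF.
extract_one <- function(path, read_fun = pdftools::pdf_text, pos = positions) {
  lines  <- unlist(strsplit(read_fun(path), "\n", fixed = TRUE))
  values <- lapply(pos, function(p) pick_field(lines, p))
  data.frame(file = basename(path), values, stringsAsFactors = FALSE)
}

# Apply extract_one() to every PDF in a directory and stack the rows.
extract_dir <- function(dir, ...) {
  files <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE,
                      ignore.case = TRUE)
  do.call(rbind, lapply(files, extract_one, ...))
}

# Example use (the directory name is hypothetical):
# results <- extract_dir("path/to/reports")
# write.csv(results, "summary.csv", row.names = FALSE)
```

The read_fun argument is only there so the extraction logic can be checked with a stand-in function before it is pointed at real files; by default it uses pdf_text() from pdftools.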

All the best

Thomas Subia
Statistician / Senior Quality Engineer

IMG Companies 
E. [hidden email]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.