parsing text files

parsing text files

ginger
Hello, I have a .txt file containing many clinical exam reports (two examples are attached to this message).
I need to create a data frame with one row per clinical exam report in the text file and 24 columns:
the first (to be labelled "ID") holds an identification code: the number on the 13th line of the report, following the string "Acc.ne n. "
the second (to be labelled "DATE") holds the date of blood sampling, which follows the identification code on the same 13th line
the remaining 22 columns are to be labelled with the parameter names found on lines 20 to 41 ("GLICEMIA" ... "COLESTEROLO LDL")

I searched the mailing list and tried to start with something like:

#read the text file
reports <- readLines("ClinicalReports.txt")
#processing the file starting at each "Acc.ne n. "
serologic <- lapply(which(grepl("^Acc.ne n.", reports)), function(.line )....

but I'm a biostatistician with almost no programming expertise and I really need your help! Please!!!
ClinicalReports.txt
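
A possible continuation of the attempt above, as a sketch only. The sample lines are invented for illustration (the real layout is in the attachment), and the pattern assumes that the ID is a run of digits right after "Acc.ne n. " and that the date is written dd/mm/yyyy later on the same line:

```r
# Hypothetical lines; the real ones would come from readLines("ClinicalReports.txt")
reports <- c(
  "some header text",
  "   Acc.ne n. 0000185   del 05/12/2011",
  "GLICEMIA   115",
  "   Acc.ne n. 0000186   del 06/12/2011"
)

# Keep only the lines holding the accession number
# (fixed = TRUE so the dots are literal, not regex wildcards)
hdr <- reports[grepl("Acc.ne n. ", reports, fixed = TRUE)]

# Capture the digits after "Acc.ne n. " and the first dd/mm/yyyy date after them
pattern <- "^.*Acc\\.ne n\\. *([0-9]+).*?([0-9]{2}/[0-9]{2}/[0-9]{4}).*$"
ids <- data.frame(
  ID   = sub(pattern, "\\1", hdr, perl = TRUE),  # kept as text: leading zeros survive
  DATE = as.Date(sub(pattern, "\\2", hdr, perl = TRUE), format = "%d/%m/%Y"),
  stringsAsFactors = FALSE
)
```

Here `ids` has one row per report; the 22 measurements would still have to be attached to each row.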

Re: parsing text files

ginger
Ooops,
I forgot to specify that each row, which holds the record of one clinical report, has to contain the values of the 22 parameter measurements. For example, the first row and first six columns:
ID       DATE        GLICEMIA  AZOTEMIA  CREATININEMIA  SODIEMIA  ...
0000185  05/12/2011  115       33.6      0.99           136       ...

Re: parsing text files

jholtman
Here is one way of doing it; it reads the file and creates a 'long' version.

##########
input <- file("/temp/ClinicalReports.txt", 'r')
outFile <- '/temp/output.txt'  #  tempfile()
output <- file(outFile, 'w')
writeLines("ID, Date, variable, value", output)
ID <- NULL
dataSw <- NULL
repeat{
    line <- readLines(input, n = 1)
    if (length(line) == 0) break
    if (!is.null(dataSw)){
        if (line == ''){  # end of data
            ID <- NULL
            dataSw <- NULL
            next
        }
        # now write CSV output file
        cat(ID
          , ','
          , Date
          , ','
          , substring(line, 1, 31)
          , ','
          , substring(line, 32, 43)
          , '\n'
          , sep = ''
          , file = output
          )
        next
    }
    if (grepl("Acc.ne", line)){
        ID <- (substring(line, 29,35))
        Date <- (substring(line, 52,61))
        next
    }
    if (!is.null(ID)){  # looking for Esame
        if (grepl("Esame", line)){
            # skip two lines
            readLines(input, n = 2)
            dataSw <- 1
            next
        }
    }

}

# close both connections, then read in the data in a long format
close(input)
close(output)
result <- read.csv(outFile, as.is = TRUE)


The result from your test data is:

> str(result)
'data.frame':   43 obs. of  4 variables:
 $ ID      : int  185 185 185 185 185 185 185 185 185 185 ...
 $ Date    : chr  "05/12/2011" "05/12/2011" "05/12/2011" "05/12/2011" ...
 $ variable: chr  "AZOTEMIA                       " "CREATININEMIA                  " "SODIEMIA                       " "POTASSIEMIA                    " ...
 $ value   : num  33.6 0.99 136 4.22 94.2 8.68 1.87 1.79 189 118 ...
> head(result)
   ID       Date                        variable  value
1 185 05/12/2011 AZOTEMIA                         33.60
2 185 05/12/2011 CREATININEMIA                     0.99
3 185 05/12/2011 SODIEMIA                        136.00
4 185 05/12/2011 POTASSIEMIA                       4.22
5 185 05/12/2011 CLOREMIA                         94.20
6 185 05/12/2011 CALCEMIA                          8.68
>
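
The long result can be pivoted into the one-row-per-report layout asked for in the original post. A minimal sketch using base R's reshape(), assuming each ID/Date pair has at most one value per parameter; the small `result` below is an invented stand-in for the real one:

```r
# Invented stand-in for the long 'result' data frame produced above
result <- data.frame(
  ID       = c(185, 185, 186, 186),
  Date     = c("05/12/2011", "05/12/2011", "06/12/2011", "06/12/2011"),
  variable = c("AZOTEMIA      ", "SODIEMIA      ", "AZOTEMIA      ", "SODIEMIA      "),
  value    = c(33.6, 136, 40.1, 139),
  stringsAsFactors = FALSE
)

# Drop the padding that the fixed-width substring() left in the parameter names
result$variable <- trimws(result$variable)

# One row per (ID, Date), one column per parameter
wide <- reshape(result, idvar = c("ID", "Date"), timevar = "variable",
                direction = "wide")

# reshape() prefixes the new columns with "value."; strip that prefix
names(wide) <- sub("^value\\.", "", names(wide))
```

Since read.csv() turned the ID into an integer, the leading zeros of 0000185 are gone; passing colClasses = c(ID = "character") to read.csv() should keep them if they matter.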


On Thu, Mar 8, 2012 at 8:24 AM, ginger <[hidden email]> wrote:

> --
> View this message in context: http://r.789695.n4.nabble.com/parsing-text-files-tp4456355p4456389.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
