extracting data from unstructured (text?) file


extracting data from unstructured (text?) file

frauke
Dear R community,

I have the following problem that I hope you can help me with.

My data is saved in thousands of files with an unusual extension consisting of four digits and a z, for example *.1405z. With list.files I managed to load this data into R. It looks like this (the row numbers are not in the original file):

35                             :LATEST STAGE     3.60 FT AT 730 AM CST ON 0102
36                          .ER ARCT2    0102 C DC200001020813/DH12/HGIFF/DIH6
37                   :QPF FORECAST        6AM       NOON        6PM       MDNT
38                   .E1 :0102:              /       3.5/       3.4/       3.5
39                   .E2 :0103:   /       3.5/       3.0/       2.5/       2.1
40                   .E3 :0104:   /       1.8/       1.5/       1.3/       1.2
41                   .E4 :0105:   /       1.2/       1.8/       2.3/       2.7
42                   .E5 :0106:   /       3.0/       3.0/       3.1/       3.3
43                                                    .E6 :0107:   /       3.4

I need the table in rows 37 to 43 in a matrix, for example:
0102     NA    3.5    3.4    3.5
0103    3.5    3.0    2.5    2.1
0104    1.8    1.5    1.3    1.2
0105    1.2    1.8    2.3    2.7
0106    3.0    3.0    3.1    3.3
0107    3.4     NA     NA     NA

Unfortunately, the row numbers vary from file to file. I can retrieve a single line with, for example, file[40,1] for line 40, which returns:
[1] .E3 :0104:   /       1.8/       1.5/       1.3/       1.2
38 Levels: .E1 :0102:              /       3.5/       3.4/       3.5 ...

So I really have two problems:
1. How do I detect the table in the file (that is, the line where the table starts)?
2. How do I split each line so that I can write the values into a matrix?

Feel free to suggest an entirely different approach if you think that is helpful.

Thanks a lot! Frauke


Re: extracting data from unstructured (text?) file

jholtman
Can you at least provide a subset of 2 files so we can see how the
data is really stored in the file and what the separators are between
the 'columns' of data?  Also, how do you determine where the data
actually starts for the rows that you want to pull out?  This will aid
in determining how to parse the data.



--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: extracting data from unstructured (text?) file

Vijaya Parthiban
In reply to this post by frauke
Hi Frauke,

Try Unix commands with R's system() function.

Example:
Let's say you have a matrix like this in a file called hello.txt (note: the
first element is missing):
10 100
2 20 200
3 30 300
4 40 400
5 50 500

You can try something like:

# cut splits on tabs by default; use -d to specify a different delimiter
hello <- system("cut -f1 hello.txt", intern = TRUE)

VP.


Re: extracting data from unstructured (text?) file

frauke
In reply to this post by jholtman
Thank you for the quick reply! I have attached two files.

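sample1.1339z (http://r.789695.n4.nabble.com/file/n4464511/sample1.1339z)
sample2.1949z (http://r.789695.n4.nabble.com/file/n4464511/sample2.1949z)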

Re: extracting data from unstructured (text?) file

jholtman
Here is one way assuming the data start after QPF FORECAST:

> setwd('/temp')  # where the data is
> files <- c('sample1.htm', 'sample2.htm')  # files to read
> # assumes 4 columns of data
> fields <- list(c(19, 24), c(30, 35), c(41, 46), c(52, 57))  #columns of data
> results <- lapply(files, function(.file){
+     inData <- FALSE  # switch to indicate in data
+     collection <- NULL  # will hold the data
+     inputFile <- file(.file, 'r')  # open the connection
+     repeat{
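+         input <- readLines(inputFile, n = 1)
+         if (length(input) == 0){  # end of file reached: return what we have
+             close(inputFile)
+             if (!is.null(collection)) colnames(collection) <- colNames
+             return(collection)
+         }
+         if (inData){  # parse the line and collect data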
+             key <- sub("^.E[0-9]+[^:]*:([^:]+).*", "\\1", input)
+             if (nchar(key) != 4){ # done with data; return result
+                 colnames(collection) <- colNames
+                 close(inputFile)
+                 return(collection)
+             }
+             # get the data assuming that 'fields' defines where data is
+             cols <- numeric(length(fields))
+             for (i in seq_along(fields)){
+                 cols[i] <- as.numeric(substring(input
+                                             , fields[[i]][1]
+                                             , fields[[i]][2]
+                                             )
+                                      )
+             }
+             collection <- rbind(collection, cols)
+             rownames(collection)[nrow(collection)] <- key
+         } else {  # looking for the start of the data
+             if (grepl("^:QPF FORECAST", input)){
+                 # extract the column names
+                 colNames <- NULL
+                 for (i in seq_along(fields)){
+                     colNames <- c(colNames, substring(input
+                                                     , fields[[i]][1]
+                                                     , fields[[i]][2]
+                                                     )
+                                  )
+                 }
+                 inData <- TRUE  # now get the data
+             }
+         }
+     }
+ })
Warning message:
NAs introduced by coercion
> print(results)
[[1]]
        7AM    1PM    7PM    1AM
0830     NA    5.6    4.4    3.8
0831    3.3    3.0    2.6    2.5
0901    2.3    2.2    2.2    2.1
0902    2.1    2.0    2.0    2.0
0903    2.0    1.9    1.9    1.9
0904    1.8     NA     NA     NA

[[2]]
        7AM    1PM    7PM    1AM
0604     NA     NA    7.0    8.4
0605    9.4    9.2    8.6    7.8
0606    6.8    5.6    4.2    3.5
0607    3.2    3.0    2.9    2.8
0608    2.8    2.8    2.7    2.7
0609    2.7     NA     NA     NA
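
For comparison, here is a shorter sketch of the same idea that works on all lines of a file at once instead of reading them one by one. It is not the code above, only an illustration under some assumptions: the function name parse_forecast is made up, values are found by splitting on "/" rather than by fixed character positions, and a short first row is assumed to be missing its leading values while any other short row is assumed to be missing its trailing values.

```r
# Parse the first ":QPF FORECAST" table found in a character vector of lines.
# Returns a numeric matrix, or NULL if no table is present.
parse_forecast <- function(lines) {
  start <- grep("^\\s*:QPF FORECAST", lines)
  if (length(start) == 0) return(NULL)   # no forecast table in these lines
  start <- start[1]

  # Column labels: the words following ":QPF FORECAST" on the header line
  labels <- strsplit(trimws(sub("^\\s*:QPF FORECAST", "", lines[start])),
                     "\\s+")[[1]]

  # Data rows: the run of ".E<n> :MMDD:" lines directly after the header
  rest <- lines[-seq_len(start)]
  is_row <- grepl("^\\s*\\.E[0-9]+\\s+:[0-9]{4}:", rest)
  n_rows <- if (all(is_row)) length(rest) else which(!is_row)[1] - 1
  if (n_rows == 0) return(NULL)
  rows <- rest[seq_len(n_rows)]

  # Row keys are the four digits between the colons
  keys <- sub("^\\s*\\.E[0-9]+\\s+:([0-9]{4}):.*$", "\\1", rows)
  # Drop everything up to the first "/" and split the remainder on "/"
  vals <- lapply(strsplit(sub("^[^/]*/", "", rows), "/"), as.numeric)

  # Assumption: a short first row lacks its leading values,
  # any other short row lacks its trailing values
  pad <- function(x, first) {
    fill <- rep(NA_real_, length(labels) - length(x))
    if (first) c(fill, x) else c(x, fill)
  }
  mat <- do.call(rbind, Map(pad, vals, seq_along(vals) == 1))
  dimnames(mat) <- list(keys, labels)
  mat
}

# A few lines taken from the first post, without the printed row numbers
lines <- c(
  ":LATEST STAGE     3.60 FT AT 730 AM CST ON 0102",
  ".ER ARCT2    0102 C DC200001020813/DH12/HGIFF/DIH6",
  ":QPF FORECAST        6AM       NOON        6PM       MDNT",
  ".E1 :0102:              /       3.5/       3.4/       3.5",
  ".E2 :0103:   /       3.5/       3.0/       2.5/       2.1",
  ".E6 :0107:   /       3.4"
)
forecast <- parse_forecast(lines)
print(forecast)
```

To process many files, the function could be applied to readLines() of each file name returned by list.files(), for example inside lapply(). If a file can have short rows in the middle of a table, the fixed-position approach above is the safer choice.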




Re: extracting data from unstructured (text?) file

frauke
Wow Jim, this is much more than I expected. Thank you!!

It took me a while to figure out what exactly you are doing in that code, but I think I understand it, and it definitely runs. May I ask you two follow-up questions?

First, some of my files contain data for two or more cities, so I have trouble making the code pick the right city. What makes it difficult is that a city is not named the same way in all files. Sometimes it might be "van Buren", other times "Arkansas River at van Buren". Sometimes the target city is the first in the file, other times it is further down. Here is an example: sample3.txt. Additionally, some files do not contain the city that I am looking for at all.

Second, I would like to extract some more data from the files, marked with asterisks below.  I thought of storing this data in an extra line appended to the main table or so.  I do manage to extract one value at a time, but of course it takes ages to run the process over and over again to get all the data.

:ARKANSAS RIVER AT VAN BUREN
:FLOOD STAGE *22.0*
:
:LATEST STAGE    *19.25* FT AT *400 AM* CST ON *010100*
.ER VBUA4    0101 C DC200001010823/DH12/HGIFF/DIH6
:QPF FORECAST        6AM       NOON        6PM       MDNT
.E1 :0101:              /      19.3/      19.4/      19.4
.E2 :0102:   /      19.4/      19.4/      19.4/      19.4
.E3 :0103:   /      19.4/      19.4/      19.4/      19.4
.E4 :0104:   /      19.4/      19.4/      19.4/      19.4
.E5 :0105:   /      19.4/      19.4/      19.4/      19.4
.E6 :0106:   /      19.4
.ER VBUA4    0101 C DC200001010823/DH12/PPQFZ/DIH6/  0.00/0.00/0.00/0.00  
.ER VBUA4    0101 C DC200001010823/DH12/QTIFF/DIH6
:QPF FORECAST        6AM       NOON        6PM       MDNT
.E1 :0101:              /      0.98/      2.78/      8.66
.E2 :0102:   /      9.88/      8.70/      7.36/      7.48
.E3 :0103:   /      8.25/      8.42/      8.53/      9.02

Please Jim, only answer these questions if you have time. I certainly appreciate any help very much.

Thank you, Frauke

Re: extracting data from unstructured (text?) file

jholtman
On Sun, Mar 11, 2012 at 9:21 PM, frauke <[hidden email]> wrote:

> Wow Jim, this is much more than I expected. Thank you!!
>
> It took me a while to figure out what exactly you are doing in that code.
> But I think I understand and it definitely runs. May I ask you two follow up
> questions?
>
> First, some of my files have data from two or more cities in them. So I have
> trouble that it picks the right city. What makes it difficult is that not in
> all files will a city be called the same. Sometimes it might be "van Buren",
> other times "Arkansas River at van Buren". Sometimes the target city is the
> first in the file, other times further down. Here is an example:
> http://r.789695.n4.nabble.com/file/n4465068/sample3.txt sample3.txt .
> Additionally, some files miss the city that I am looking for.

You can add code to search for the city name that you want before
setting the 'inData' flag.  You can use regular expressions to pick
out the city's name.
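
A minimal sketch of that idea, assuming that each station's section starts with a comment line (beginning with ":") that contains the city name. The helper name lines_from_city and the first station in the example are made up for illustration.

```r
# Return the lines from the first ":" comment line that mentions `city`
# (ignoring case) to the end of the file, or character(0) if the city
# does not appear in the file.
lines_from_city <- function(lines, city) {
  hit <- grep(paste0("^\\s*:.*", city), lines, ignore.case = TRUE)
  if (length(hit) == 0) return(character(0))
  lines[hit[1]:length(lines)]
}

lines <- c(
  ":SOME OTHER STATION",   # made-up section, only here for illustration
  ":QPF FORECAST        6AM       NOON        6PM       MDNT",
  ".E1 :0101:              /       1.0/       1.1/       1.2",
  ":ARKANSAS RIVER AT VAN BUREN",
  ":FLOOD STAGE  22.0",
  ":QPF FORECAST        6AM       NOON        6PM       MDNT",
  ".E1 :0101:              /      19.3/      19.4/      19.4"
)

# Matches both "van Buren" and "Arkansas River at van Buren"
section <- lines_from_city(lines, "van buren")

# The first table header after the city line belongs to that city
table_start <- grep("^\\s*:QPF FORECAST", section)[1]
```

A file that does not mention the city gives character(0), so such files can be skipped with a length() check before any parsing starts.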

>
> Second, I would like extract some more data from the files, printed in bold
> below.  I thought of storing this data in an extra line appended to the main
> table or so.  I do manage to extract one at a time, but of course it takes
> ages to run the process over and over again to get all the data.
>
> :ARKANSAS RIVER AT VAN BUREN
> :FLOOD STAGE * 22.0  *
> :
> :LATEST STAGE    *19.25* FT AT *400 AM* CST ON *010100*
> .ER VBUA4    0101 C DC200001010823/DH12/HGIFF/DIH6
> :QPF FORECAST        6AM       NOON        6PM       MDNT
> .E1 :0101:              /      19.3/      19.4/      19.4
> .E2 :0102:   /      19.4/      19.4/      19.4/      19.4
> .E3 :0103:   /      19.4/      19.4/      19.4/      19.4
> .E4 :0104:   /      19.4/      19.4/      19.4/      19.4
> .E5 :0105:   /      19.4/      19.4/      19.4/      19.4
> .E6 :0106:   /      19.4
> .ER VBUA4    0101 C DC200001010823/DH12/PPQFZ/DIH6/  0.00/0.00/0.00/0.00
> .ER VBUA4    0101 C DC200001010823/DH12/QTIFF/DIH6
> :QPF FORECAST        6AM       NOON        6PM       MDNT
> .E1 :0101:              /      0.98/      2.78/      8.66
> .E2 :0102:   /      9.88/      8.70/      7.36/      7.48
> .E3 :0103:   /      8.25/      8.42/      8.53/      9.02
>

As you are reading the lines in, you can use regular expressions to
extract the data that you are interested in.  I am not sure where you
want to store the data.  Do you want it in a separate file?
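
As an illustration, here is one possible sketch that pulls the marked values out of the lines in a single pass and keeps them in a one-row data frame rather than in the forecast matrix. The function name extract_stage_info and the column names are made up, and the patterns assume the ":FLOOD STAGE" and ":LATEST STAGE" lines always look like the ones quoted above.

```r
# Extract flood stage, latest stage, observation time and date from the
# comment lines of one file. Missing lines give NA in the result.
extract_stage_info <- function(lines) {
  # Return the capture groups of the first line matching `pattern`, or NULL
  first_match <- function(pattern) {
    m <- regmatches(lines, regexec(pattern, lines))
    m <- m[lengths(m) > 0]
    if (length(m) == 0) NULL else m[[1]]
  }
  flood  <- first_match("^\\s*:FLOOD STAGE\\s+([0-9.]+)")
  latest <- first_match(
    "^\\s*:LATEST STAGE\\s+([0-9.]+) FT AT ([0-9]+ [AP]M) ([A-Z]+) ON ([0-9]+)"
  )
  data.frame(
    flood_stage  = if (is.null(flood)) NA_real_ else as.numeric(flood[2]),
    latest_stage = if (is.null(latest)) NA_real_ else as.numeric(latest[2]),
    obs_time     = if (is.null(latest)) NA_character_ else latest[3],
    # kept as text so that leading zeros in the date survive
    obs_date     = if (is.null(latest)) NA_character_ else latest[5],
    stringsAsFactors = FALSE
  )
}

lines <- c(
  ":ARKANSAS RIVER AT VAN BUREN",
  ":FLOOD STAGE  22.0  ",
  ":",
  ":LATEST STAGE    19.25 FT AT 400 AM CST ON 010100",
  ".ER VBUA4    0101 C DC200001010823/DH12/HGIFF/DIH6"
)
info <- extract_stage_info(lines)
print(info)
```

The one-row data frames from all files could then be combined with do.call(rbind, ...) and written to a separate file with write.csv(), which keeps the forecast matrices purely numeric.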
