Parsing txt file


karthicklakshman
Hello,

I have a tab-delimited text document with multiple lines, as shown below:



#FILE FORMAT
#Book bookname author publisher pages
#CD name content
####################################################################################################
----------------------------------------------------------------------
Book bioR xxx abc publishers 230
CD biorexamples chapter5
----------------------------------------------------------------------
Book bioc++ mmm tata publishers 400
CD samples workexamples
CD data experiments
----------------------------------------------------------------------
Book management tools aaa some publishers 200
----------------------------------------------------------------------


Here each block starts with a "Book" line, optionally followed by "CD" lines.

I would like to create a data frame with two columns named "bookname" and "content". Using grep it is possible to pick out specific rows (e.g. grep("^Book", x), where x holds the lines of the file), but my programming experience is too limited to build the data frame from there.

Note: a "Book" row is present in every block, but the number of "CD" rows varies (some blocks have two and some have none, as shown above).

Please help me create something like this:


     bookname           content
[1]  bioR               chapter5
[2]  bioc++             workexamples, experiments
[3]  management tools   NA


Thanks in advance,
karthick
 

Re: Parsing txt file

Santosh Srinivas
You could use the following functions to achieve your objective. To start with, see:

?readLines
?strsplit
?"for"
?ifelse

Once you have made an attempt, you can post the specific issues you run into
and will likely receive more targeted answers.
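For anyone reading the archive later, here is a minimal sketch of how those four functions could fit together. It is only one possible approach, not something posted in the thread. The sample lines are reconstructed from the original post and assume the fields are separated by single tabs; the names txt, fields and books are made up for illustration.

```r
# Sample input reconstructed from the post; tab-separated fields are an assumption.
txt <- c(
  "#FILE FORMAT",
  "#Book\tbookname\tauthor\tpublisher\tpages",
  "#CD\tname\tcontent",
  "----------------------------------------------------------------------",
  "Book\tbioR\txxx\tabc publishers\t230",
  "CD\tbiorexamples\tchapter5",
  "----------------------------------------------------------------------",
  "Book\tbioc++\tmmm\ttata publishers\t400",
  "CD\tsamples\tworkexamples",
  "CD\tdata\texperiments",
  "----------------------------------------------------------------------",
  "Book\tmanagement tools\taaa\tsome publishers\t200",
  "----------------------------------------------------------------------"
)

# readLines() on a text connection; with a real file, pass the file path instead.
con <- textConnection(txt)
input <- readLines(con)
close(con)

# strsplit() returns a list with one character vector of fields per line.
fields <- strsplit(input, "\t", fixed = TRUE)

bookname <- character(0)
content  <- character(0)
for (f in fields) {
  if (length(f) == 0) next                      # skip empty lines
  if (f[1] == "Book") {
    # A "Book" line starts a new row; content stays NA until a CD line is seen.
    bookname <- c(bookname, f[2])
    content  <- c(content, NA_character_)
  } else if (f[1] == "CD" && length(bookname) > 0) {
    n <- length(content)
    # ifelse() chooses between "first CD of this book" and "append to earlier CDs".
    content[n] <- ifelse(is.na(content[n]), f[3],
                         paste(content[n], f[3], sep = ", "))
  }
}

books <- data.frame(bookname = bookname, content = content,
                    stringsAsFactors = FALSE)
print(books)
```

Growing vectors inside a loop is fine for a small file like this one; for large inputs a vectorised approach would probably be preferable.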

-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On
Behalf Of karthicklakshman
Sent: 10 November 2010 15:06
To: [hidden email]
Subject: [R] Parsing txt file


 
--
View this message in context:
http://r.789695.n4.nabble.com/Parsing-txt-file-tp3035749p3035749.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: Parsing txt file

Mike Marchywka

----------------------------------------

> From: [hidden email]
> To: [hidden email]; [hidden email]
> Date: Wed, 10 Nov 2010 16:00:26 +0530
> Subject: Re: [R] Parsing txt file
>
> You could use the following to achieve your objective. To start with
>
> ?readLines
> ?strsplit
> ?for
> ?ifelse
>
> As you try, you may receive more specific answers for the issues you come up
> with.

If you don't have a compelling reason to do it in R, it may be worth
the learning curve to pick up something like awk, perl, or even sed.
These tasks come up in many settings, and these tools are quite versatile
for ad hoc text manipulation. You can then save your reformatted text
file in a form that is easy for R to read.
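As a rough illustration of what "a form that is easy for R" might look like: suppose the preprocessing step wrote one tab-separated line per book, with a header row and the string NA where a book has no CD. That layout is purely an assumption for this sketch (the thread does not specify one), and the variable names are invented. In that case a single read.delim() call would be enough.

```r
# Hypothetical output of an awk/perl/sed preprocessing step (assumed layout):
# one line per book, tab-separated, header row, "NA" for a book without CDs.
reformatted <- c(
  "bookname\tcontent",
  "bioR\tchapter5",
  "bioc++\tworkexamples, experiments",
  "management tools\tNA"
)

# With a real file, pass its path as the first argument instead of text =.
books <- read.delim(text = reformatted, stringsAsFactors = FALSE)
print(books)
```

The default na.strings = "NA" is what turns the last content field into a missing value rather than the literal string.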



Re: Parsing txt file

jholtman
In reply to this post by Santosh Srinivas
Here is a start:

> # read the input file
> input <- readLines('/tempxx.txt')
> # process the file starting at each "Book"
> result <- lapply(which(grepl("^Book", input)), function(.line){
+     contents <- NULL  # initialize
+     name <- strsplit(input[.line], '\t')[[1]][2]  # book name
+     # process succeeding lines as long as they are "CD"
+     while (grepl("^CD", input[.line + 1L])){
+         contents <- c(contents, strsplit(input[.line + 1L], '\t')[[1]][3])
+         .line <- .line + 1L
+     }
+     c(bookname = name, contents = paste(contents, collapse = ','))
+ })
>
> do.call(rbind, result)
     bookname              contents
[1,] " bioR  "             "   chapter5"
[2,] "  bioc++ "           " workexamples,  experiments"
[3,] " management tools  " ""
>
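
The output above still carries stray whitespace, is a character matrix rather than a data frame, and shows "" instead of NA for the book without a CD. A possible follow-up, sketched here under the assumption that the fields are tab-separated with optional padding spaces, is to group lines by the most recent "Book" line, trim each field, and build the data frame directly. The function name parse_books and the padded sample lines are hypothetical and not part of the code above.

```r
parse_books <- function(input) {
  is_book <- grepl("^Book", input)
  is_cd   <- grepl("^CD", input)
  block   <- cumsum(is_book)          # index of the Book each line belongs to

  # Extract the i-th tab-separated field of each line, without padding spaces.
  field <- function(lines, i) {
    trimws(vapply(strsplit(lines, "\t", fixed = TRUE), `[`, character(1), i))
  }

  bookname   <- field(input[is_book], 2)
  cd_content <- field(input[is_cd], 3)
  cd_block   <- block[is_cd]

  # Collapse the CD contents of each book; NA when a book has no CD lines.
  content <- vapply(seq_along(bookname), function(b) {
    x <- cd_content[cd_block == b]
    if (length(x) == 0) NA_character_ else paste(x, collapse = ", ")
  }, character(1))

  data.frame(bookname = bookname, content = content, stringsAsFactors = FALSE)
}

# Sample lines with padding spaces (assumed, to mimic the whitespace seen above).
input <- c(
  "----------------------------------------------------------------------",
  "Book\t bioR  \txxx\tabc publishers\t230",
  "CD\tbiorexamples\t   chapter5",
  "----------------------------------------------------------------------",
  "Book\t  bioc++ \tmmm\ttata publishers\t400",
  "CD\tsamples\t workexamples",
  "CD\tdata\t  experiments",
  "----------------------------------------------------------------------",
  "Book\t management tools  \taaa\tsome publishers\t200",
  "----------------------------------------------------------------------"
)

books <- parse_books(input)
print(books)
```

With a real file, the input vector would come from readLines() as in the transcript above.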





--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?


Re: Parsing txt file

karthicklakshman
Hello Jim, hello all,

Thanks very much for the input. I used the code and it solved my problem.
Special thanks to Jim Holtman for the code.

Regards,
karthick