parsing strings between [ ] in columns


milton ruser
Dear all,

I have a data.frame with a column like the x shown below
myDF<-data.frame(cbind(x=c("[[1, 0, 0], [0, 1]]",
   "[[1, 1, 0], [0, 1]]","[[1, 0, 0], [1, 1]]",
   "[[0, 0, 1], [0, 1]]")))
> myDF
                    x
1 [[1, 0, 0], [0, 1]]
2 [[1, 1, 0], [0, 1]]
3 [[1, 0, 0], [1, 1]]
4 [[0, 0, 1], [0, 1]]

As you can see, my x column is composed of
strings wrapped in [[ ]], with commas separating
the "fields".

I need to identify the number of
groups inside the outer [ ] and label each
group with a different sequential string.
For the example above I would like to have:

  A         B
1 [1, 0, 0] [0, 1]
2 [1, 1, 0] [0, 1]
3 [1, 0, 0] [1, 1]
4 [0, 0, 1] [0, 1]

Although I have only two groups here, my
real dataset will have many more (~30).
After identifying the groups I would like
to identify the subgroups:
  A1 A2 A3  B1 B2
1 1  0  0   0  1
2 1  1  0   0  1
3 1  0  0   1  1
4 0  0  1   0  1

Any hints are welcome.

milton ribeiro


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: parsing strings between [ ] in columns

barry rowlingson
On Thu, Feb 18, 2010 at 8:29 AM, milton ruser <[hidden email]> wrote:

> Dear all,
>
> I have a data.frame with a column like the x shown below
> myDF<-data.frame(cbind(x=c("[[1, 0, 0], [0, 1]]",
>   "[[1, 1, 0], [0, 1]]","[[1, 0, 0], [1, 1]]",
>   "[[0, 0, 1], [0, 1]]")))
>> myDF
>                    x
> 1 [[1, 0, 0], [0, 1]]
> 2 [[1, 1, 0], [0, 1]]
> 3 [[1, 0, 0], [1, 1]]
> 4 [[0, 0, 1], [0, 1]]
>
> As you can see, my x column is composed of
> strings wrapped in [[ ]], with commas separating
> the "fields".
>
> I need to identify the number of
> groups inside the outer [ ] and label each
> group with a different sequential string.
> For the example above I would like to have:
>
>  A         B
> 1 [1, 0, 0] [0, 1]
> 2 [1, 1, 0] [0, 1]
> 3 [1, 0, 0] [1, 1]
> 4 [0, 0, 1] [0, 1]
>
> Although I have only two groups here, my
> real dataset will have many more (~30).
> After identifying the groups I would like
> to identify the subgroups:
>  A1 A2 A3  B1 B2
> 1 1  0  0   0  1
> 2 1  1  0   0  1
> 3 1  0  0   1  1
> 4 0  0  1   0  1
>
> Any hints are welcome.
>


This looks like the same syntax as JSON, so you might be able to use
the fromJSON function from the rjson package:

> x="[[1, 0, 0], [0, 1]]"
> library(rjson)
> fromJSON(x)
[[1]]
[1] 1 0 0

[[2]]
[1] 0 1

> unlist(fromJSON(x))
[1] 1 0 0 0 1

So just apply that over each row of your data frame and collect the
results in a new data frame. The plyr package may help.
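
A minimal sketch of that "apply and collect" step, assuming the rjson
package is installed and that every row has the same group structure.
The names parsed and newDF are illustrative and do not come from the
thread; stringsAsFactors = FALSE is only there to keep x as character.

```r
library(rjson)

myDF <- data.frame(x = c("[[1, 0, 0], [0, 1]]", "[[1, 1, 0], [0, 1]]",
                         "[[1, 0, 0], [1, 1]]", "[[0, 0, 1], [0, 1]]"),
                   stringsAsFactors = FALSE)

# Parse each string as JSON and flatten it to one numeric vector per row.
parsed <- lapply(as.character(myDF$x), function(s) unlist(fromJSON(s)))

# Stack the per-row vectors into a matrix, then convert to a data frame.
newDF <- as.data.frame(do.call(rbind, parsed))
newDF
```

The columns come out with default names (V1, V2, ...), so they still
need proper names once the group sizes are known.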

Every row of your data frame has to produce the same set of column
names, so you only need to parse the first one to work out your naming
system. In this case you can get it from the length of the list and its
elements:

> l = fromJSON(x)
> unlist(lapply(l,length))

[1] 3 2

So you want A1 to A3 and B1 to B2. I'm not sure what you want when you
get to the 27th group... You can generate these names with a bit of rep
and paste. It's a bit early in the day for me to get my head round that
at the moment.

But rjson will parse and split up your grouped numbers anyway. There
are probably other solutions using split, sub and gsub.
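
For comparison, a base-R sketch of the gsub route, with no extra
packages. It assumes every row holds the same number of values and that
the values are plain numbers; clean, parts and m are illustrative names.
Note that flattening this way discards the group boundaries, so the
group sizes would have to be worked out separately.

```r
x <- c("[[1, 0, 0], [0, 1]]", "[[1, 1, 0], [0, 1]]",
       "[[1, 0, 0], [1, 1]]", "[[0, 0, 1], [0, 1]]")

# Drop every bracket and space, leaving just comma-separated values.
clean <- gsub("\\[|\\]| ", "", x)

# Split on commas and convert each piece to numeric.
parts <- lapply(strsplit(clean, ",", fixed = TRUE), as.numeric)

# One row per input string.
m <- do.call(rbind, parts)
m
```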

Barry

--
blog: http://geospaced.blogspot.com/
web: http://www.maths.lancs.ac.uk/~rowlings
web: http://www.rowlingson.com/
twitter: http://twitter.com/geospacedman
pics: http://www.flickr.com/photos/spacedman

Re: parsing strings between [ ] in columns

barry rowlingson
On Thu, Feb 18, 2010 at 8:29 AM, milton ruser <[hidden email]> wrote:
> Dear all,
>
> I have a data.frame with a column like the x shown below
> myDF<-data.frame(cbind(x=c("[[1, 0, 0], [0, 1]]",
>   "[[1, 1, 0], [0, 1]]","[[1, 0, 0], [1, 1]]",
>   "[[0, 0, 1], [0, 1]]")))
>> myDF

> After identifying the groups I would like
> to identify the subgroups:
>  A1 A2 A3  B1 B2
> 1 1  0  0   0  1
> 2 1  1  0   0  1
> 3 1  0  0   1  1
> 4 0  0  1   0  1

Maybe it's not too early in the morning. Given your myDF above:

# how is the first one structured?
> lets = unlist(lapply(fromJSON(as.character(myDF[1,])),length))

# 3 then 2:
> lets
[1] 3 2

# make the letters (fails for >26 groups)
> rep(LETTERS[1:length(lets)],lets)
[1] "A" "A" "A" "B" "B"

# handy sequence function makes the numbers:
> sequence(lets)
[1] 1 2 3 1 2

# splat them together:
> paste(rep(LETTERS[1:length(lets)],lets),sequence(lets),sep="")
[1] "A1" "A2" "A3" "B1" "B2"

Then you can just use this as the column names of your new data frame.
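
Since the real data has about 30 groups, LETTERS alone runs out. One
possible workaround, which is purely an assumption about what naming
would be acceptable, is to continue with two-letter labels (AA, AB, ...)
after Z. group_labels is a hypothetical helper, not part of any package;
lets is hard-coded here to the group sizes found above.

```r
# Hypothetical helper: labels A..Z, then AA, AB, ..., ZZ
# (enough for up to 26 + 26^2 groups).
group_labels <- function(n) {
  two <- as.vector(outer(LETTERS, LETTERS, function(a, b) paste0(b, a)))
  c(LETTERS, two)[seq_len(n)]
}

lets <- c(3, 2)   # group sizes taken from the first row

# Same rep/sequence/paste recipe, with the longer label set.
nms <- paste(rep(group_labels(length(lets)), lets), sequence(lets), sep = "")
nms

# Labels for the 27th and 28th groups.
group_labels(28)[27:28]
```

With a data frame of parsed values in hand, the result can be assigned
with names(newDF) <- nms, where newDF stands for whatever that data
frame is called.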

 I think the morning coffee has got through the blood-brain barrier now.

Barry

Re: parsing strings between [ ] in columns

Gabor Grothendieck
Here is a solution using strapply in the gsubfn package.

First we define a proto object p containing a single method (i.e.
function) called fun.  fun takes the contents of one [...] construct,
splits it into the numeric vector v using strsplit, and assigns names
to it.  strapply automatically maintains a variable, count, in the
proto object; fun uses it to determine which letter to use.

Using strapply, we apply fun in p to each substring matching the regexp
"\\[([01, ]*)\\]".  This regexp matches [ followed by a string of
characters made up of 0, 1, comma and space, followed by ], and p$fun
is applied to each such occurrence.  (Modify the regexp appropriately
if the true problem has different characteristics.)

Finally, simplify = rbind will cause the resulting vectors to be
rbind'ed together.  (If the different rows of myDF do not have the
same structure, omit the simplify = rbind argument of strapply to
get a list instead.)


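library(gsubfn)  # also loads the proto package

p <- proto(fun = function(this, x) {
        # x is the text captured inside one [...] group, e.g. "1, 0, 0"
        v <- as.numeric(strsplit(x, ",")[[1]])
        # count is maintained by strapply and selects the group letter
        names(v) <- paste(LETTERS[count], seq_along(v), sep = "")
        v
})
strapply(as.character(myDF[[1]]), "\\[([01, ]*)\\]", p, simplify = rbind)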


Here is what the output looks like:
       
> strapply(as.character(myDF[[1]]), "\\[([01, ]*)\\]", p, simplify = rbind)
     A1 A2 A3 B1 B2
[1,]  1  0  0  0  1
[2,]  1  1  0  0  1
[3,]  1  0  0  1  1
[4,]  0  0  1  0  1

See http://gsubfn.googlecode.com and the gsubfn vignette for more info.
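
To see what that regexp picks out without gsubfn, base R's gregexpr and
regmatches can extract the same [...] pieces. This is only a sketch to
illustrate the pattern, not the strapply method itself; it happens to
give the intermediate A/B layout from the original question. It assumes
every row has the same number of groups and at most 26 of them; groups
and g are illustrative names.

```r
x <- c("[[1, 0, 0], [0, 1]]", "[[1, 1, 0], [0, 1]]",
       "[[1, 0, 0], [1, 1]]", "[[0, 0, 1], [0, 1]]")

# Each match is one inner [...] group made up of 0, 1, comma and space.
groups <- regmatches(x, gregexpr("\\[[01, ]*\\]", x))

# One row per string, one column per group.
g <- do.call(rbind, groups)
colnames(g) <- LETTERS[seq_len(ncol(g))]
g
```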

