splitting strings efficiently


splitting strings efficiently

Andrew Roberts
Folks,

I have a data frame with 4861469 rows, one column of which holds an IP
address of the form xxx.xxx.xxx.xxx. I want to assign a site to each
row based on IP ranges. To do this I split each IP address, held as a
character string, into its A, B, C and D components in a loop. It works
but is horribly slow. I can't quite see how one of the
l/s/m/t/apply functions could be brought to bear on the problem. Does
anyone have any thoughts?

for(i in 1:4861469)
   {
   lst <-unlist(strsplit(data$ComputerName[i], "\\."))
   data$IPA[i] <-lst[[1]]
   data$IPB[i] <-lst[[2]]
   data$IPC[i] <-lst[[3]]
   data$IPD[i] <-lst[[4]]
   rm(lst)
   }

Andrew

Andrew Roberts
Children's Orthopaedic Surgeon
RJAH, Oswestry, UK


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: splitting strings efficiently

Enrico Schumann

Hi Andrew,

you can use strsplit for a character vector; you do not have to call it
for every element data$ComputerName[i].

If I understand correctly, maybe something like this helps

 > ip <- "123.456.789.321"  ## example data
 > df <- data.frame(ip = rep(ip, 9), stringsAsFactors=FALSE)
 > df
                ip
1 123.456.789.321
2 123.456.789.321
3 123.456.789.321
4 123.456.789.321
5 123.456.789.321
6 123.456.789.321
7 123.456.789.321
8 123.456.789.321
9 123.456.789.321

 >
 > res <- unlist(strsplit(df[["ip"]], "\\."))
 > ii <- seq(1, nrow(df)*4, by = 4)
 > res[ii]   ## A
[1] "123" "123" "123" "123" "123" "123" "123"
[8] "123" "123"
 > res[ii+1] ## B
[1] "456" "456" "456" "456" "456" "456" "456"
[8] "456" "456"
 > res[ii+2] ## C
[1] "789" "789" "789" "789" "789" "789" "789"
[8] "789" "789"
 > res[ii+3] ## D
[1] "321" "321" "321" "321" "321" "321" "321"
[8] "321" "321"


Regards,
Enrico



--
Enrico Schumann
Lucerne, Switzerland
http://nmof.net/


Re: splitting strings efficiently

jholtman
Just a quick followup to the previous post using 4M entries:  (20
seconds would seem like a reasonable time for the operation)

>  ip <- "123.456.789.321"  ## example data
>  df <- data.frame(ip = rep(ip, 4e6), stringsAsFactors=FALSE)
>  system.time(x <- strsplit(df$ip, '\\.'))
   user  system elapsed
  19.47    0.12   20.86
>  str(x)
List of 4000000
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
 $ : chr [1:4] "123" "456" "789" "321"
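
A possible refinement, offered as a sketch rather than a measured result: because the separator is a literal dot, strsplit can be called with fixed = TRUE, which skips the regular-expression engine and is usually faster. The vector ips below is made-up example data, not taken from the thread.

```r
## Made-up example data (not from the original post)
ips <- rep(c("10.1.2.3", "192.168.0.1"), times = 5)

## Regular-expression split, as in the posts above
parts_regex <- strsplit(ips, "\\.")

## Literal split: fixed = TRUE treats "." as a plain character
parts_fixed <- strsplit(ips, ".", fixed = TRUE)

## Both calls should return the same pieces
same <- identical(parts_regex, parts_fixed)
```

Whether the difference matters on 4M rows would have to be checked with system.time on the real data.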







--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


Re: splitting strings efficiently

Andrew Roberts
In reply to this post by Enrico Schumann
Thanks Enrico & Jim,

The following finished the job in under a minute!

res <- unlist(strsplit(data[["ComputerName"]], "\\."))
ii <- seq(1, nrow(data)*4, by = 4)
data$IPA <-res[ii]   ## A
data$IPB <-res[ii+1] ## B
data$IPC <-res[ii+2] ## C
data$IPD <-res[ii+3] ## D
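
Since the stated goal is to assign a site by IP range, the components are probably easier to compare as integers than as character strings. The following is only a sketch: the example addresses, the site names and the range rule on the third component are all hypothetical, and only the column names follow the post above.

```r
## Made-up example data; the column name ComputerName follows the original post
data <- data.frame(ComputerName = c("10.1.2.3", "10.1.40.7", "10.2.0.9"),
                   stringsAsFactors = FALSE)

## Split once for the whole column and arrange as one row per address
parts <- matrix(unlist(strsplit(data$ComputerName, ".", fixed = TRUE)),
                ncol = 4, byrow = TRUE)

## Store the components as integers so that range comparisons are numeric
data$IPA <- as.integer(parts[, 1])
data$IPB <- as.integer(parts[, 2])
data$IPC <- as.integer(parts[, 3])
data$IPD <- as.integer(parts[, 4])

## Hypothetical site rule: third component below 32 is "SiteX", otherwise "SiteY"
data$Site <- ifelse(data$IPC < 32, "SiteX", "SiteY")
```

With real site definitions, the ifelse line would be replaced by whatever range table applies.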

Andrew


Re: splitting strings efficiently

Martin Morgan
In reply to this post by jholtman
On 01/08/2012 11:37 AM, jim holtman wrote:
> Just a quick followup to the previous post using 4M entries:  (20
> seconds would seem like a reasonable time for the operation)
>
>>   ip<- "123.456.789.321"  ## example data
>>   df<- data.frame(ip = rep(ip, 4e6), stringsAsFactors=FALSE)
>>   system.time(x<- strsplit(df$ip, '\\.'))

or if the IP addresses really are repeated multiple times

df <- data.frame(ip=rep(ip, 4e6), stringsAsFactors=TRUE)  ## df$ip is a factor (must be requested explicitly in R >= 4.0.0)

 > system.time(x <- local({
+     ip0 <- strsplit(levels(df$ip), "\\.")
+     ip0[match(df$ip, levels(df$ip))]
+ }))
    user  system elapsed
   0.352   0.000   0.352

although the speed-up in the example is best-case.
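
The same idea can be sketched for a plain character column, using unique() and match() in place of factor levels. The vector ips is made-up example data; as noted above, the gain depends on how often addresses repeat.

```r
## Made-up example data with repeats (character, not factor)
ips <- rep(c("10.1.2.3", "192.168.0.1", "10.1.2.3"), times = 4)

## Split each distinct address only once
u <- unique(ips)
u_parts <- strsplit(u, ".", fixed = TRUE)

## Map the split results back onto the original order
parts <- u_parts[match(ips, u)]
```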

Martin



--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793


Re: splitting strings efficiently

drflxms
In reply to this post by Andrew Roberts
Hi Andrew,

I am aware that this is an R mailing list, but for such tasks (I deal a
lot with huge genomic datasets) I tend to use awk and sed to preprocess
the data when I run into performance problems.
Otherwise, for handling strings in R I recommend the stringr package, but
I don't know about its performance...

Felix
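
For reference, a minimal sketch of what the stringr route might look like, assuming the package is installed: str_split_fixed returns a character matrix with a fixed number of columns. The example addresses are made up, and the base R fallback is only there so the snippet runs without stringr. No claim is made here about relative speed.

```r
## Made-up example data
ips <- c("10.1.2.3", "192.168.0.1")

if (requireNamespace("stringr", quietly = TRUE)) {
  ## One row per address, four columns; fixed(".") matches a literal dot
  parts <- stringr::str_split_fixed(ips, stringr::fixed("."), n = 4)
} else {
  ## Base R fallback producing the same shape
  parts <- matrix(unlist(strsplit(ips, ".", fixed = TRUE)),
                  ncol = 4, byrow = TRUE)
}
```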


Re: splitting strings efficiently

MacQueen, Don
In reply to this post by Enrico Schumann
See suggestion inserted below.
It assumes and requires that every input IP address has the required four
elements.

-Don

--
Don MacQueen

Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062





On 1/8/12 5:11 AM, "Enrico Schumann" <[hidden email]> wrote:

>
>Hi Andrew,
>
>you can use strsplit for a character vector; you do not have to call it
>for every element data$ComputerName[i].
>
>If I understand correctly, maybe something like this helps
>
> > ip <- "123.456.789.321"  ## example data
> > df <- data.frame(ip = rep(ip, 9), stringsAsFactors=FALSE)
> > df
>                ip
>1 123.456.789.321
>2 123.456.789.321
>3 123.456.789.321
>4 123.456.789.321
>5 123.456.789.321
>6 123.456.789.321
>7 123.456.789.321
>8 123.456.789.321
>9 123.456.789.321
>
> >
> > res <- unlist(strsplit(df[["ip"]], "\\."))


At this point, I would do

> res <- matrix(res,ncol=4, byrow=TRUE)
> res
      [,1]  [,2]  [,3]  [,4]
 [1,] "123" "456" "789" "321"
 [2,] "123" "456" "789" "321"
 [3,] "123" "456" "789" "321"
 [4,] "123" "456" "789" "321"
 [5,] "123" "456" "789" "321"
 [6,] "123" "456" "789" "321"
 [7,] "123" "456" "789" "321"
 [8,] "123" "456" "789" "321"
 [9,] "123" "456" "789" "321"

Then each column of the matrix is one element of the IP address.
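
Because the matrix approach relies on every address having exactly four elements, it may be worth checking that before reshaping. The following is a sketch with made-up data, including one deliberately malformed address; the column names IPA to IPD follow the original post, and filling bad rows with NA is just one possible choice.

```r
## Made-up example data; the last entry is deliberately malformed
ips <- c("10.1.2.3", "192.168.0.1", "10.1.2")

parts <- strsplit(ips, ".", fixed = TRUE)

## TRUE where the address split into exactly four pieces
ok <- lengths(parts) == 4

## Start from an all-NA matrix and fill only the well-formed rows
res <- matrix(NA_character_, nrow = length(ips), ncol = 4,
              dimnames = list(NULL, c("IPA", "IPB", "IPC", "IPD")))
res[ok, ] <- matrix(unlist(parts[ok]), ncol = 4, byrow = TRUE)
```

Rows where ok is FALSE can then be inspected separately instead of silently shifting every later component.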


