Transform a data.frame with "; " sep column and another one in a a new one with the same two column but with repetitions

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Transform a data.frame with "; " sep column and another one in a a new one with the same two column but with repetitions

João Azevedo Patrício
Hi,

I've been trying to solve this issue but with no success.

I have some data like this:

1 > TC WC
2 > 0 Instruments & Instrumentation; Nuclear Science & Technology;
Physics, Particles & Fields; Spectroscopy
3 > 0 Nanoscience & Nanotechnology; Materials Science,
Multidisciplinary; Physics, Applied
4 > 2 Physics, Nuclear; Physics, Particles & Fields
5 > 0 Chemistry, Inorganic & Nuclear
6 > 2 Chemistry, Physical; Materials Science, Multidisciplinary;
Metallurgy & Metallurgical Engineering

And I need to have this:

1 > TC WC
2 > 0 Instruments & Instrumentation
2 > 0 Nuclear Science & Technology
2 > 0 Physics, Particles & Fields
2 > 0 Spectroscopy
3 > 0 Nanoscience & Nanotechnology
3 > 0 Materials Science, Multidisciplinary
3 > 0 Physics, Applied
4 > 2 Physics, Nuclear
4 > 2 Physics, Particles & Fields
5 > 0 Chemistry, Inorganic & Nuclear
6 > 2 Chemistry, Physical
6 > 2 Materials Science, Multidisciplinary
6 > 2 Metallurgy & Metallurgical Engineering

This means repeat the row for each element in WC and keeping the same
value in TC. The goal is to check how many TC (sum) there are by WC,
when WC is multiple.

i've tried to separate the column using strsplt but then I cannot keep
the track of TC.

thanks in advance.
--
João Azevedo Patrício
Tel.: +31 91 400 53 63
Portugal
@ http://tripaforra.bl.ee

"Take 2 seconds to think before you act"

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Transform a data.frame with "; " sep column and another one in a a new one with the same two column but with repetitions

arun kirshna


Hi,
Try:
dat1 <- read.table(text="'1 > TC' 'WC'
'2 > 0'  'Instruments & Instrumentation; Nuclear Science & Technology;Physics, Particles & Fields; Spectroscopy'
'3 > 0' 'Nanoscience & Nanotechnology; Materials Science,Multidisciplinary; Physics, Applied'
'4 > 2'    'Physics, Nuclear; Physics, Particles & Fields'
'5 > 0'    'Chemistry, Inorganic & Nuclear'
'6 > 2'    'Chemistry, Physical; Materials Science, Multidisciplinary;Metallurgy & Metallurgical Engineering'",sep="",header=F, stringsAsFactors=F)

library(data.table)
Using `cSplit()` from
https://gist.github.com/mrdwab/11380733

cSplit(dat1, "V2", ";", "long")
        V1                                     V2
 1: 1 > TC                                     WC
 2:  2 > 0          Instruments & Instrumentation
 3:  2 > 0           Nuclear Science & Technology
 4:  2 > 0            Physics, Particles & Fields
 5:  2 > 0                           Spectroscopy
 6:  3 > 0           Nanoscience & Nanotechnology
 7:  3 > 0    Materials Science,Multidisciplinary
 8:  3 > 0                       Physics, Applied
 9:  4 > 2                       Physics, Nuclear
10:  4 > 2            Physics, Particles & Fields
11:  5 > 0         Chemistry, Inorganic & Nuclear
12:  6 > 2                    Chemistry, Physical
13:  6 > 2   Materials Science, Multidisciplinary
14:  6 > 2 Metallurgy & Metallurgical Engineering



A.K.


On Friday, July 4, 2014 9:53 AM, João Azevedo Patrício <[hidden email]> wrote:
Hi,

I've been trying to solve this issue but with no success.

I have some data like this:

1 > TC    WC
2 > 0    Instruments & Instrumentation; Nuclear Science & Technology;
Physics, Particles & Fields; Spectroscopy
3 > 0    Nanoscience & Nanotechnology; Materials Science,
Multidisciplinary; Physics, Applied
4 > 2    Physics, Nuclear; Physics, Particles & Fields
5 > 0    Chemistry, Inorganic & Nuclear
6 > 2    Chemistry, Physical; Materials Science, Multidisciplinary;
Metallurgy & Metallurgical Engineering

And I need to have this:

1 > TC    WC
2 > 0    Instruments & Instrumentation
2 > 0    Nuclear Science & Technology
2 > 0    Physics, Particles & Fields
2 > 0    Spectroscopy
3 > 0    Nanoscience & Nanotechnology
3 > 0    Materials Science, Multidisciplinary
3 > 0    Physics, Applied
4 > 2    Physics, Nuclear
4 > 2    Physics, Particles & Fields
5 > 0    Chemistry, Inorganic & Nuclear
6 > 2    Chemistry, Physical
6 > 2    Materials Science, Multidisciplinary
6 > 2    Metallurgy & Metallurgical Engineering

This means repeat the row for each element in WC and keeping the same
value in TC. The goal is to check how many TC (sum) there are by WC,
when WC is multiple.

i've tried to separate the column using strsplt but then I cannot keep
the track of TC.

thanks in advance.
--
João Azevedo Patrício
Tel.: +31 91 400 53 63
Portugal
@ http://tripaforra.bl.ee

"Take 2 seconds to think before you act"

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Transform a data.frame with "; " sep column and another one in a a new one with the same two column but with repetitions

João Azevedo Patrício
Em 04-07-2014 15:15, arun escreveu:

>
> Hi,
> Try:
> dat1 <- read.table(text="'1 > TC' 'WC'
> '2 > 0'  'Instruments & Instrumentation; Nuclear Science & Technology;Physics, Particles & Fields; Spectroscopy'
> '3 > 0' 'Nanoscience & Nanotechnology; Materials Science,Multidisciplinary; Physics, Applied'
> '4 > 2'    'Physics, Nuclear; Physics, Particles & Fields'
> '5 > 0'    'Chemistry, Inorganic & Nuclear'
> '6 > 2'    'Chemistry, Physical; Materials Science, Multidisciplinary;Metallurgy & Metallurgical Engineering'",sep="",header=F, stringsAsFactors=F)
>
> library(data.table)
> Using `cSplit()` from
> https://gist.github.com/mrdwab/11380733
>
> cSplit(dat1, "V2", ";", "long")
>          V1                                     V2
>   1: 1 > TC                                     WC
>   2:  2 > 0          Instruments & Instrumentation
>   3:  2 > 0           Nuclear Science & Technology
>   4:  2 > 0            Physics, Particles & Fields
>   5:  2 > 0                           Spectroscopy
>   6:  3 > 0           Nanoscience & Nanotechnology
>   7:  3 > 0    Materials Science,Multidisciplinary
>   8:  3 > 0                       Physics, Applied
>   9:  4 > 2                       Physics, Nuclear
> 10:  4 > 2            Physics, Particles & Fields
> 11:  5 > 0         Chemistry, Inorganic & Nuclear
> 12:  6 > 2                    Chemistry, Physical
> 13:  6 > 2   Materials Science, Multidisciplinary
> 14:  6 > 2 Metallurgy & Metallurgical Engineering
>
>
>
> A.K.
>
>
> On Friday, July 4, 2014 9:53 AM, João Azevedo Patrício <[hidden email]> wrote:
> Hi,
>
> I've been trying to solve this issue but with no success.
>
> I have some data like this:
>
> 1 > TC    WC
> 2 > 0    Instruments & Instrumentation; Nuclear Science & Technology;
> Physics, Particles & Fields; Spectroscopy
> 3 > 0    Nanoscience & Nanotechnology; Materials Science,
> Multidisciplinary; Physics, Applied
> 4 > 2    Physics, Nuclear; Physics, Particles & Fields
> 5 > 0    Chemistry, Inorganic & Nuclear
> 6 > 2    Chemistry, Physical; Materials Science, Multidisciplinary;
> Metallurgy & Metallurgical Engineering
>
> And I need to have this:
>
> 1 > TC    WC
> 2 > 0    Instruments & Instrumentation
> 2 > 0    Nuclear Science & Technology
> 2 > 0    Physics, Particles & Fields
> 2 > 0    Spectroscopy
> 3 > 0    Nanoscience & Nanotechnology
> 3 > 0    Materials Science, Multidisciplinary
> 3 > 0    Physics, Applied
> 4 > 2    Physics, Nuclear
> 4 > 2    Physics, Particles & Fields
> 5 > 0    Chemistry, Inorganic & Nuclear
> 6 > 2    Chemistry, Physical
> 6 > 2    Materials Science, Multidisciplinary
> 6 > 2    Metallurgy & Metallurgical Engineering
>
> This means repeat the row for each element in WC and keeping the same
> value in TC. The goal is to check how many TC (sum) there are by WC,
> when WC is multiple.
>
> i've tried to separate the column using strsplt but then I cannot keep
> the track of TC.
>
> thanks in advance.
Thanks is simply fantastic!
After that I just have to do an aggregate by WC and it gives me the n of
TC by WC.

thanks!

my code looks like this:

isi <- read.table("filename", header = TRUE, sep=";") ##get citations
and web of science categories file
cSplit(isi, "WC", ";", "long") ## split by WC
isisplit <- cSplit(isi, "WC", ";", "long") ## create file with split WC info
wccitations <- aggregate (isisplit$TC, by=list(Category=isisplit$WC),
FUN = sum) ## creates a table with the list of WCategories and the
specific citations sum for  each
wcproduction <- table(isisplit$WC) ## creates a table with the number of
pubs by WCategories

--
João Azevedo Patrício
Tel.: +31 91 400 53 63
Portugal
@ http://tripaforra.bl.ee

"Take 2 seconds to think before you act"

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Transform a data.frame with "; " sep column and another one in a a new one with the same two column but with repetitions

John McKown
In reply to this post by João Azevedo Patrício
On Fri, Jul 4, 2014 at 7:50 AM, João Azevedo Patrício
<[hidden email]> wrote:

> Hi,
>
> I've been trying to solve this issue but with no success.
>
> I have some data like this:
>
> 1 > TC  WC
> 2 > 0   Instruments & Instrumentation; Nuclear Science & Technology;
> Physics, Particles & Fields; Spectroscopy
> 3 > 0   Nanoscience & Nanotechnology; Materials Science, Multidisciplinary;
> Physics, Applied
> 4 > 2   Physics, Nuclear; Physics, Particles & Fields
> 5 > 0   Chemistry, Inorganic & Nuclear
> 6 > 2   Chemistry, Physical; Materials Science, Multidisciplinary;
> Metallurgy & Metallurgical Engineering
>
> And I need to have this:
>
> 1 > TC  WC
> 2 > 0   Instruments & Instrumentation
> 2 > 0   Nuclear Science & Technology
> 2 > 0   Physics, Particles & Fields
> 2 > 0   Spectroscopy
> 3 > 0   Nanoscience & Nanotechnology
> 3 > 0   Materials Science, Multidisciplinary
> 3 > 0   Physics, Applied
> 4 > 2   Physics, Nuclear
> 4 > 2   Physics, Particles & Fields
> 5 > 0   Chemistry, Inorganic & Nuclear
> 6 > 2   Chemistry, Physical
> 6 > 2   Materials Science, Multidisciplinary
> 6 > 2   Metallurgy & Metallurgical Engineering
>
> This means repeat the row for each element in WC and keeping the same value
> in TC. The goal is to check how many TC (sum) there are by WC, when WC is
> multiple.
>
> i've tried to separate the column using strsplt but then I cannot keep the
> track of TC.
>
> thanks in advance.
> --
> João Azevedo Patrício

Best that I've come up with, which seems to give the result desired
from the example data given.

splitAtSemiColon <- function(input) {
    z <- strsplit(input$WC,';');
    result <- data.table(TC=rep(input$TC,sapply(z,length)), WC=unlist(z));
    return(result);
}

flatted.data <- splitAtSemiColon(original.data);

<transcript>
> print(original.data,right=FALSE)
  TC
1 0
2 0
3 2
4 0
5 2
  WC
1 Instruments & Instrumentation; Nuclear Science & Technology;
Physics, Particles & Fields; Spectroscopy
2 Nanoscience & Nanotechnology; Materials Science, Multidisciplinary;
Physics, Applied
3 Physics, Nuclear; Physics, Particles & Fields
4 Chemistry, Inorganic & Nuclear
5 Chemistry, Physical; Materials Science, Multidisciplinary;
Metallurgy & Metallurgical Engineering
>
>> print(splitAtSemiColon,right=FALSE);
function(x) {
    z=strsplit(x$WC,';');
    result3=data.frame(TC=rep(x$TC,sapply(z,length)),WC=unlist(z));
    return(result3);
}
> print(splitAtSemiColon(original.data),right=FALSE);
   TC WC
1  0  Instruments & Instrumentation
2  0   Nuclear Science & Technology
3  0   Physics, Particles & Fields
4  0   Spectroscopy
5  0  Nanoscience & Nanotechnology
6  0   Materials Science, Multidisciplinary
7  0   Physics, Applied
8  2  Physics, Nuclear
9  2   Physics, Particles & Fields
10 0  Chemistry, Inorganic & Nuclear
11 2  Chemistry, Physical
12 2   Materials Science, Multidisciplinary
13 2   Metallurgy & Metallurgical Engineering

Note that I still have a problem in that the WC data can have leading
and/or trailing blanks due to the say that strsplit works. The easiest
way to fix this is to use the strtrim() function from the stringr
package.


--
There is nothing more pleasant than traveling and meeting new people!
Genghis Khan

Maranatha! <><
John McKown

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Transform a data.frame with "; " sep column and another one in a a new one with the same two column but with repetitions

João Azevedo Patrício
Em 05-07-2014 03:35, John McKown escreveu:

> On Fri, Jul 4, 2014 at 7:50 AM, João Azevedo Patrício
> <[hidden email]> wrote:
>> Hi,
>>
>> I've been trying to solve this issue but with no success.
>>
>> I have some data like this:
>>
>> 1 > TC  WC
>> 2 > 0   Instruments & Instrumentation; Nuclear Science & Technology;
>> Physics, Particles & Fields; Spectroscopy
>> 3 > 0   Nanoscience & Nanotechnology; Materials Science, Multidisciplinary;
>> Physics, Applied
>> 4 > 2   Physics, Nuclear; Physics, Particles & Fields
>> 5 > 0   Chemistry, Inorganic & Nuclear
>> 6 > 2   Chemistry, Physical; Materials Science, Multidisciplinary;
>> Metallurgy & Metallurgical Engineering
>>
>> And I need to have this:
>>
>> 1 > TC  WC
>> 2 > 0   Instruments & Instrumentation
>> 2 > 0   Nuclear Science & Technology
>> 2 > 0   Physics, Particles & Fields
>> 2 > 0   Spectroscopy
>> 3 > 0   Nanoscience & Nanotechnology
>> 3 > 0   Materials Science, Multidisciplinary
>> 3 > 0   Physics, Applied
>> 4 > 2   Physics, Nuclear
>> 4 > 2   Physics, Particles & Fields
>> 5 > 0   Chemistry, Inorganic & Nuclear
>> 6 > 2   Chemistry, Physical
>> 6 > 2   Materials Science, Multidisciplinary
>> 6 > 2   Metallurgy & Metallurgical Engineering
>>
>> This means repeat the row for each element in WC and keeping the same value
>> in TC. The goal is to check how many TC (sum) there are by WC, when WC is
>> multiple.
>>
>> i've tried to separate the column using strsplt but then I cannot keep the
>> track of TC.
>>
>> thanks in advance.
>> --
>> João Azevedo Patrício
> Best that I've come up with, which seems to give the result desired
> from the example data given.
>
> splitAtSemiColon <- function(input) {
>      z <- strsplit(input$WC,';');
>      result <- data.table(TC=rep(input$TC,sapply(z,length)), WC=unlist(z));
>      return(result);
> }
>
> flatted.data <- splitAtSemiColon(original.data);
>
> <transcript>
>> print(original.data,right=FALSE)
>    TC
> 1 0
> 2 0
> 3 2
> 4 0
> 5 2
>    WC
> 1 Instruments & Instrumentation; Nuclear Science & Technology;
> Physics, Particles & Fields; Spectroscopy
> 2 Nanoscience & Nanotechnology; Materials Science, Multidisciplinary;
> Physics, Applied
> 3 Physics, Nuclear; Physics, Particles & Fields
> 4 Chemistry, Inorganic & Nuclear
> 5 Chemistry, Physical; Materials Science, Multidisciplinary;
> Metallurgy & Metallurgical Engineering
>>> print(splitAtSemiColon,right=FALSE);
> function(x) {
>      z=strsplit(x$WC,';');
>      result3=data.frame(TC=rep(x$TC,sapply(z,length)),WC=unlist(z));
>      return(result3);
> }
>> print(splitAtSemiColon(original.data),right=FALSE);
>     TC WC
> 1  0  Instruments & Instrumentation
> 2  0   Nuclear Science & Technology
> 3  0   Physics, Particles & Fields
> 4  0   Spectroscopy
> 5  0  Nanoscience & Nanotechnology
> 6  0   Materials Science, Multidisciplinary
> 7  0   Physics, Applied
> 8  2  Physics, Nuclear
> 9  2   Physics, Particles & Fields
> 10 0  Chemistry, Inorganic & Nuclear
> 11 2  Chemistry, Physical
> 12 2   Materials Science, Multidisciplinary
> 13 2   Metallurgy & Metallurgical Engineering
>
> Note that I still have a problem in that the WC data can have leading
> and/or trailing blanks due to the say that strsplit works. The easiest
> way to fix this is to use the strtrim() function from the stringr
> package.
>
>
Yes also have that problem. Tried to work it ou using "sub" but didn't
work at all.

--
João Azevedo Patrício
Tel.: +31 91 400 53 63
Portugal
@ http://tripaforra.bl.ee

"Take 2 seconds to think before you act"

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Transform a data.frame with "; " sep column and another one in a a new one with the same two column but with repetitions

João Azevedo Patrício
In reply to this post by João Azevedo Patrício
Em 05-07-2014 00:43, John McKown escreveu:

> I messed up my original response by not including r-help in the
> distribution. And now I won't look as bad because, after a short nap,
> I have new, much shorted (but more difficult, for me, to understand)
> answer.
>
> #
> # The original data is in the variable "x".
> z=data.frame(TC=x$TC,
> WC=I(mapply(strsplit,x$WC,MoreArgs=list(';'),USE.NAMES=FALSE)));
> result=data.frame(TC=rep(x$TC,sapply(z$WC,length)),WC=unlist(z$WC));
> #
>
> There may be a way to eliminate the temporary variable "z". Maybe I
> need another nap!
>
> The heart of this is the mapply, which results in a list where each
> entry in the list is another list. And the entries in embedded list
> are the list of results from the output of strsplit() on the WC
> information.
>
> If this needs to be a function, then
>
> splitUp <- function(x) {
>      z=data.frame(TC=x$TC,
> WC=I(mapply(strsplit,x$WC,MoreArgs=list(';'),USE.NAMES=FALSE)));
>      result=data.frame(TC=rep(x$TC,sapply(z$WC,length)),WC=unlist(z$WC));
>      return(result);
> }
>
> Then invoke it with:
>
> flattened.result <- splitUp(original.data.frame);
>
> On Fri, Jul 4, 2014 at 7:50 AM, João Azevedo Patrício
> <[hidden email]> wrote:
>> Hi,
>>
>> I've been trying to solve this issue but with no success.
>>
>> I have some data like this:
>>
>> 1 > TC  WC
>> 2 > 0   Instruments & Instrumentation; Nuclear Science & Technology;
>> Physics, Particles & Fields; Spectroscopy
>> 3 > 0   Nanoscience & Nanotechnology; Materials Science, Multidisciplinary;
>> Physics, Applied
>> 4 > 2   Physics, Nuclear; Physics, Particles & Fields
>> 5 > 0   Chemistry, Inorganic & Nuclear
>> 6 > 2   Chemistry, Physical; Materials Science, Multidisciplinary;
>> Metallurgy & Metallurgical Engineering
>>
>> And I need to have this:
>>
>> 1 > TC  WC
>> 2 > 0   Instruments & Instrumentation
>> 2 > 0   Nuclear Science & Technology
>> 2 > 0   Physics, Particles & Fields
>> 2 > 0   Spectroscopy
>> 3 > 0   Nanoscience & Nanotechnology
>> 3 > 0   Materials Science, Multidisciplinary
>> 3 > 0   Physics, Applied
>> 4 > 2   Physics, Nuclear
>> 4 > 2   Physics, Particles & Fields
>> 5 > 0   Chemistry, Inorganic & Nuclear
>> 6 > 2   Chemistry, Physical
>> 6 > 2   Materials Science, Multidisciplinary
>> 6 > 2   Metallurgy & Metallurgical Engineering
>>
>> This means repeat the row for each element in WC and keeping the same value
>> in TC. The goal is to check how many TC (sum) there are by WC, when WC is
>> multiple.
>>
>> i've tried to separate the column using strsplt but then I cannot keep the
>> track of TC.
>>
>> thanks in advance.
>> --
>> João Azevedo Patrício
I've been testing it and the results is coming nicely.

It grabs a CSV taken from ISI Web Of science, works it out and produces
a table organized by WC (web of science category) with number of papers
per area, citations and impact factor.

my code is like this right now:

 > isi <- read.table("file.csv", header = TRUE, sep=";") ##get citations
and web of science categories file
 > isisplit=data.frame(TC=isi$TC,
+ WC=I(mapply(strsplit,isi$WC,MoreArgs=list(';'),USE.NAMES=FALSE)));
 >
result=data.frame(TC=rep(isi$TC,sapply(isisplit$WC,length)),WC=unlist(isisplit$WC));
 > isisplit$WC <- str_trim(isisplit$WC)
 > wccitations <- aggregate (isisplit$TC, by=list(Category=isisplit$WC),
FUN = sum) ## creates a table with the list of WCategories and the
specific + citations
 > colnames(wccitations) <- c("WC", "TC")
 > wcproduction <- table(isisplit$WC) ## creates a table with the number
of pubs by WCategories
 > wcproduction <- as.data.table(wcproduction)
 > colnames(wcproduction) <- c("WC", "PUB")
 >wc <- data.frame(WC = wccitations$WC, PUB = wcproduction$PUB, TC =
wccitations$TC, IMP = round((wcproduction$PUB/wccitations$TC), digits =
+ 2))
 > wc[wc == Inf] = 0 ## removes inf in impact by impact 0
 > write.table(wc, file = "file.csv", sep = ";", dec = ",")


--
João Azevedo Patrício
Tel.: +31 91 400 53 63
Portugal
@ http://tripaforra.bl.ee

"Take 2 seconds to think before you act"

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.