comparing two strings from data


comparing two strings from data

Yasin Gocgun
Hi,

I have two columns that contain numbers along with letters (as shown below)
and have different lengths. Each entry in the first column is likely to be
found in the second column at most once.

For each entry of the first column, if that entry is found in the second
column, I would like to get the corresponding index. For instance, if the
first entry of the first column is the 5th entry in the second column, I would
like to keep this index, 5.

AST2017000005534   TUR2017000001428
CTS2017000079930   CTS2017000071989
CTS2017000079931   CTS2017000072015

In a loop, when I use the following code to get those indices,


data_2 = read.csv("excel_data.csv")
column_1 = data_2$data1
column_2 = data_2$data2

match_list <- array(0,dim=c(310,1));  # 310 is the length of the first column

for (indx in 1: 310){
    for(indx2 in 1:713){ # 713 is the length of the second column
        if(column_1[indx] == column_2[indx2] ){
            match_list[indx,1] = indx2;
            break;
        }
    }
}


R provides the following error:

Error in Ops.factor(column_1[indx], column_2[indx2]) :
  level sets of factors are different

Can someone explain to me how I can resolve this issue?

Thank you,

Yasin

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: comparing two strings from data

glsnow
The error occurs because the read.csv function converted both columns
to factors.  The simplest fix is to set stringsAsFactors=FALSE in the
call to read.csv so that they are compared as strings.  You could
also call as.character on each of the columns if you don't want to
read the data in again.
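
A minimal sketch of the as.character route, for anyone who wants to see the error and the fix side by side. The two small vectors are made up for illustration (the shared ID in the second one is an assumption, not taken from the poster's file). Note that since R 4.0.0 read.csv defaults to stringsAsFactors=FALSE, so the factor conversion mainly affects older R versions or calls that set it to TRUE.

```r
# Illustrative data only: two factor columns with different level sets.
col1 <- factor(c("AST2017000005534", "CTS2017000079930"))
col2 <- factor(c("TUR2017000001428", "CTS2017000079930", "CTS2017000072015"))

# Comparing factors with different level sets raises an error;
# tryCatch captures the message instead of stopping the script.
cmp <- tryCatch(col1[2] == col2[2],
                error = function(e) conditionMessage(e))
print(cmp)

# Converting to character makes == compare the strings themselves.
col1_chr <- as.character(col1)
col2_chr <- as.character(col2)
same <- col1_chr[2] == col2_chr[2]
print(same)  # TRUE
```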

Also, look at the match function; I think it will give you what you
want without the explicit looping.





--
Gregory (Greg) L. Snow Ph.D.
[hidden email]


Re: comparing two strings from data

Boris Steipe
In reply to this post by Yasin Gocgun
It's generally a very good idea to examine the structure of data after you have read it in. str(data_2) would have shown you that read.csv() turned your strings into factors, and that's why the == operator no longer does what you think it does.

use ...

data_2 <- read.csv("excel_data.csv", stringsAsFactors = FALSE)

... to turn this off. Also, the %in% operator will achieve more directly what you are trying to do. No need for loops.
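
A small sketch of the two checks mentioned above, using an inline CSV so it runs without the original file. The three rows are illustrative (the shared ID in data2 is an assumption). It also shows that %in% reports whether each entry is present, while match() returns the position, which is what the original question asks for.

```r
# Illustrative stand-in for excel_data.csv, read from an inline string.
csv_text <- "data1,data2
AST2017000005534,TUR2017000001428
CTS2017000079930,CTS2017000079930
CTS2017000079931,CTS2017000072015"

data_2 <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Inspect the structure: both columns should be reported as chr.
str(data_2)

# %in% gives one TRUE/FALSE per entry of data1 ...
found <- data_2$data1 %in% data_2$data2
print(found)      # FALSE TRUE FALSE

# ... whereas match() gives the index in data2, or NA when absent.
positions <- match(data_2$data1, data_2$data2)
print(positions)  # NA 2 NA
```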

B.





Re: comparing two strings from data

Eric Berger
Combining and completing the advice from Greg and Boris, the complete
solution is two lines:

data_2 <- read.csv("excel_data.csv", stringsAsFactors = FALSE)
match_list <- match( data_2$data1, data_2$data2 )

The vector match_list will hold the matching position where one exists and
NA otherwise. Its length will be the same as the length of data_2$data1.
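
One caveat worth sketching, under the assumption that the two columns sit side by side in one CSV and the shorter one is padded with blank cells: read.csv then returns both columns at the same length, with empty strings filling the shorter one. The inline data and the nzchar() filter below are illustrative, not part of the original advice.

```r
# Illustrative CSV: data1 has two real entries, data2 has four.
csv_text <- "data1,data2
B2,A1
C3,B2
,C3
,D4"

data_2 <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Drop the empty-string padding before matching, so the result has
# one element per real entry of the first column.
column_1 <- data_2$data1[nzchar(data_2$data1)]

match_list <- match(column_1, data_2$data2)
print(match_list)  # 2 3
```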

You should get experience in reading the help information for R functions.
In this case, type ?match to get information about the 'match' function.

HTH,
Eric



Re: comparing two strings from data

Eric Berger
One additional comment. If you want 0 instead of NA when there is no match
then the match statement should read:

match_list <- match( data_2$data1, data_2$data2, nomatch=0)
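
A short sketch comparing the two conventions on made-up vectors (the values are illustrative). One thing to keep in mind when choosing: if the result is later used as a subscript, a 0 is silently dropped, while an NA keeps its slot and yields NA.

```r
# Illustrative data: "X9" has no counterpart in column_2.
column_1 <- c("B2", "X9", "C3")
column_2 <- c("A1", "B2", "C3")

with_na   <- match(column_1, column_2)               # 2 NA 3
with_zero <- match(column_1, column_2, nomatch = 0)  # 2 0 3

# Used as subscripts, the two results behave differently.
print(column_2[with_na])    # "B2" NA   "C3"
print(column_2[with_zero])  # "B2" "C3"
```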




Re: comparing two strings from data

jdnewmil-2
In reply to this post by Yasin Gocgun
data_2 <- read.csv("excel_data.csv",stringsAsFactors=FALSE)
column_1 <- data_2$data1
column_2 <- data_2$data2
result <- match( column_1, column_2 )

Please read the Posting Guide mentioned at the bottom of this and every posting, in particular about posting plain text so that what we see will be what you saw when you sent the message. You should also read about how to create reproducible examples, e.g. about using dput as mentioned in [1] and [2], and verifying the example before sending it [3].

[1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

[2] http://adv-r.had.co.nz/Reproducibility.html

[3] https://cran.r-project.org/web/packages/reprex/index.html (read the vignette)
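
As a rough illustration of the dput suggestion, the sketch below builds a tiny made-up data frame (the names and values are hypothetical), prints its dput() form for pasting into a plain-text message, and checks that the printed text recreates the same object.

```r
# Hypothetical miniature of the poster's data.
example_data <- data.frame(
  data1 = c("B2", "X9", "C3"),
  data2 = c("A1", "B2", "C3"),
  stringsAsFactors = FALSE
)

# dput() prints R code that rebuilds the object; paste this into the email.
dput(example_data)

# Round trip: capture the printed text and evaluate it again.
dput_text <- capture.output(dput(example_data))
restored  <- eval(parse(text = dput_text))
print(all.equal(restored, example_data))
```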
--
Sent from my phone. Please excuse my brevity.
