matching country name tables from different sources


Werner
Hi,
 
  Before I reinvent the wheel, I wanted to ask whether there is a simple way to do this.
 
  I want to merge a large number of tables from different data sources in R, and the matching criterion is the country name. The tables are of different sizes, and the country names sometimes differ slightly.
 
  Has anyone done this, or can anyone recommend commands I should look at to automate this task as much as possible?
 
  Thanks a lot for your effort in advance.
 
  All the best,
    Werner
 

               

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Re: matching country name tables from different sources

Gabor Grothendieck
If they were the same you could use merge.   To figure out
the correspondence automatically or semiautomatically, try this:

# Longest common subsequence (LCS) of two strings, returned as a string.
# Define this first: the matching code below calls it.
lcs2 <- function(s1, s2) {
     longest <- function(x, y) if (nchar(x) > nchar(y)) x else y
     # Make sure args are strings
     a <- as.character(s1); an <- nchar(a) + 1
     b <- as.character(s2); bn <- nchar(b) + 1

     # If either arg is an empty string, the LCS is empty
     if (nchar(a) == 0 || nchar(b) == 0) return("")

     # m[i, j] holds the LCS of the first i-1 chars of a
     # and the first j-1 chars of b
     m <- matrix("", nrow = an, ncol = bn)

     for (i in 2:an)
          for (j in 2:bn)
               m[i, j] <- if (substr(a, i-1, i-1) == substr(b, j-1, j-1))
                    paste(m[i-1, j-1], substr(a, i-1, i-1), sep = "")
               else
                    longest(m[i-1, j], m[i, j-1])

     # Return the LCS itself; its length is the similarity score
     m[an, bn]
}

x <- c("Canada", "US", "Mexico")
y <- c("Kanada", "United States", "Mehico")
result <- outer(x, y, function(x, y) mapply(lcs2, x, y))
result <- nchar(result)   # matrix of LCS lengths
# A longer LCS means more similar names, so take which.max per row.
# If you are lucky the indices are unique, giving a one-to-one match.
# In this case they are.
apply(result, 1, which.max)  # 1 2 3
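
Once a correspondence between the two spellings has been checked by hand, it can be stored as a small lookup table so that merge can be used after all. The following is only a sketch under assumptions: the data frames a and b, the columns gdp and pop, their values, and the name crosswalk are all invented for illustration.

```r
# Toy tables; the gdp and pop values are made up for illustration only
a <- data.frame(country = c("Canada", "US", "Mexico"), gdp = c(1, 2, 3))
b <- data.frame(country = c("Kanada", "United States", "Mehico"),
                pop = c(10, 20, 30))

# idx[i] is the row of b matched to row i of a, for example the
# hand-checked output of apply(result, 1, which.max)
idx <- c(1, 2, 3)

# Crosswalk: one row per country, holding both spellings
crosswalk <- data.frame(country = a$country, country_b = b$country[idx])

# Attach the second spelling to a, then join b on it
merged <- merge(merge(a, crosswalk, by = "country"),
                b, by.x = "country_b", by.y = "country")
merged
```

Keeping the crosswalk as its own table means a wrong automatic match can be corrected in one place before any data are merged.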




Re: matching country name tables from different sources

Werner
Thanks for the nice code, Gabor!
 
  Unfortunately, it does not seem to work for my purpose: it confuses lots of countries when I compare two lists of over 150 countries each.
  Do you have any other suggestions?
 
 


Re: matching country name tables from different sources

bogdan romocea
In reply to this post by Werner
See
http://en.wikipedia.org/wiki/Levenshtein_distance
http://thread.gmane.org/gmane.comp.lang.r.general/31499



Re: matching country name tables from different sources

McGehee, Robert
In reply to this post by Werner
I would throw a tolower() around s1 and s2 so that 'canada' matches with
'CANADA', and perhaps consider using a Levenshtein distance rather than
the longest common subsequence.

An algorithm for Levenshtein distance can be found here (courtesy of
Stephen Upton)
https://stat.ethz.ch/pipermail/r-help/2005-January/062254.html

Robert
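
As a rough illustration of this suggestion: current versions of R ship adist() in the utils package, which computes a generalized Levenshtein distance (it may postdate this thread, so treat its use here as an assumption). The sketch below lower-cases both lists first; the cutoff of 2 edits is an arbitrary choice for the toy data, not a recommendation.

```r
x <- c("Canada", "US", "Mexico")
y <- c("Kanada", "United States", "Mehico")

# Edit distance between every pair, ignoring case
d <- adist(tolower(x), tolower(y))
dimnames(d) <- list(x, y)

# Closest candidate in y for each name in x, and how far away it is
best <- as.integer(apply(d, 1, which.min))
dist_best <- d[cbind(seq_along(x), best)]

# Accept only near matches; leave the rest for manual review
accepted <- dist_best <= 2
data.frame(x = x, candidate = y[best], distance = dist_best,
           accepted = accepted)
```

Spelling variants such as Canada/Kanada come out one edit apart, but an abbreviation such as "US" is far from "United States" in edit distance and has to be handled separately, which is why the sketch flags it rather than accepting the nearest candidate.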


Re: matching country name tables from different sources

Gabor Grothendieck
In reply to this post by Werner
You can improve it somewhat by first accepting all the largest
matches and removing the rows and columns for those and
repeatedly doing that with what is left.
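
One way to read this suggestion is as a greedy assignment. The sketch below is a variant that accepts a single best pair per round rather than all tied maxima at once; the function name greedy_match and the toy score matrix are made up for illustration.

```r
# Greedy one-to-one matching on a similarity matrix (larger = more similar).
# Each round takes the largest remaining score, records the pair, and
# blanks out that row and column so neither name can be used again.
greedy_match <- function(score) {
  pairs <- matrix(integer(0), ncol = 2,
                  dimnames = list(NULL, c("row", "col")))
  while (any(is.finite(score))) {
    idx <- which(score == max(score), arr.ind = TRUE)[1, ]
    pairs <- rbind(pairs, idx)
    score[idx[1], ] <- -Inf
    score[, idx[2]] <- -Inf
  }
  rownames(pairs) <- NULL
  pairs[order(pairs[, "row"]), , drop = FALSE]
}

# Toy scores where the row-wise which.max picks column 1 twice
score <- matrix(c(5, 4, 0,
                  4, 1, 0,
                  0, 1, 5), nrow = 3, byrow = TRUE)
apply(score, 1, which.max)   # column 1 is chosen for rows 1 and 2
greedy_match(score)          # each column is used at most once
```

With tables of different sizes the loop stops when the shorter side is exhausted, so the leftover names are simply unmatched.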


Re: matching country name tables from different sources

Gabor Grothendieck
One other thing to try could be soundex.  It's normally used for
last names but it might work here too.  Google to find the
soundex encoding rules.  Reviewing the country names might
suggest minor modifications to the soundex algorithm to
improve it for your case.


Re: matching country name tables from different sources

SAULEAU Erik-André
In reply to this post by Werner
 Dear all,

Yes, but the problem with soundex, for example, is that it does not work when
the error occurs in the first position (Canada vs. Kanada), because it keeps
the first character. It seems that you need an approximate string
matching algorithm (for example, a very good one is from Porter, Jaro and
Winkler at the US Census Bureau; or have a look at Navarro's book on the
classification of algorithms).

HTH and a happy new year, Erik.



Re: matching country name tables from different sources

Roger Bivand
In reply to this post by McGehee, Robert
On Tue, 10 Jan 2006, McGehee, Robert wrote:

> I would throw a tolower() around s1 and s2 so that 'canada' matches with
> 'CANADA', and perhaps consider using a Levenshtein distance rather than
> the longest common subsequence.
>
> An algorithm for Levenshtein distance can be found here (courtesy of
> Stephen Upton)
> https://stat.ethz.ch/pipermail/r-help/2005-January/062254.html

Or even ?agrep - uses Levenshtein edit distance and has an argument for
ignoring case. First hit in RSiteSearch("fuzzy match"), by the way.
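
A minimal sketch of how agrep might be used here, assuming a made-up vector of country names. Note that agrep looks for the pattern approximately anywhere inside each string, so a short name can also hit a longer name that contains it.

```r
countries <- c("Canada", "United States", "Mexico", "Niger", "Nigeria")

# Allow one edit (k -> c) and ignore case
hit <- agrep("kanada", countries, max.distance = 1,
             ignore.case = TRUE, value = TRUE)
hit

# Caveat: "Niger" is also found inside "Nigeria"
both <- agrep("Niger", countries, value = TRUE)
both
```

Because of the substring behaviour, more than one hit for a pattern is a signal to check that name by hand.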


--
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: [hidden email]


Re: matching country name tables from different sources

Gabor Grothendieck
In reply to this post by SAULEAU Erik-André
I was aware of that, which is why I mentioned that it is usually used
for matching last names rather than countries and noted the possible need to
modify the algorithm slightly.  Soundex is a relatively simple algorithm,
so it's not too hard.  For example, one could just code the first
letter too.
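
To make the last point concrete, here is a sketch of a simplified American Soundex with an optional switch for coding the first letter as well. The function and its code_first argument are hypothetical, written for this illustration, and handle only unaccented Latin letters.

```r
# Simplified American Soundex for a single string.
# code_first = TRUE is the hypothetical tweak: the first letter is replaced
# by its digit too, so "C" and "K" (both in class 2) no longer differ.
soundex <- function(s, code_first = FALSE) {
  s <- gsub("[^a-z]", "", tolower(s))
  if (!nzchar(s)) return("")
  ch <- strsplit(s, "")[[1]]
  classes <- c(b = 1, f = 1, p = 1, v = 1,
               c = 2, g = 2, j = 2, k = 2, q = 2, s = 2, x = 2, z = 2,
               d = 3, t = 3, l = 4, m = 5, n = 5, r = 6)
  # h and w do not separate equal codes, so drop them (except in position 1)
  keep <- !(ch %in% c("h", "w"))
  keep[1] <- TRUE
  ch <- ch[keep]
  d <- unname(classes[ch])
  d[is.na(d)] <- 0                   # vowels and y act as separators
  d <- d[c(TRUE, d[-1] != d[-length(d)])]   # collapse runs of equal codes
  rest <- d[-1]
  rest <- rest[rest != 0]
  rest <- c(rest, 0, 0, 0)[1:3]      # pad or truncate to three digits
  first <- if (code_first) d[1] else toupper(ch[1])
  paste(c(first, rest), collapse = "")
}

soundex("Canada")                     # keeps the initial letter
soundex("Kanada")
soundex("Canada", code_first = TRUE)  # initial letter coded as a digit
soundex("Kanada", code_first = TRUE)
```

With the first letter coded, Canada and Kanada receive the same key, but the change also makes unrelated names more likely to collide, so the resulting matches would still need review.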

On 1/11/06, SAULEAU Erik-André <[hidden email]> wrote:

>
>
>  dear all,
>
> yes but the problem with soundex for example is that it does not work when
> an error occur in the first place (Canada vs Kanada) as it keeps the fist
> character. It seems that you have to look after an approximate string
> matching algorithm (for example, a very good one if from Porter-Jaro and
> Winkler at the US Census bureau or have o look to the book of Navarro about
> classification of algorithm).
>
> HTH and an happy new year, erik.
>
>
> -----Message d'origine-----
> De: Gabor Grothendieck
> A: Werner Wernersen
> Cc: [hidden email]
> Date: 10/01/2006 21:16
> Objet: Re: [R] matching country name tables from different sources
>
>
> One other thing to try could be soundex.  ITs normally used for
> last names but it might work here too.  Google to find the
> soundex encoding rules.  Reviewing the country names might
> suggest minor modifications to the soundex algorithm to
> improve it for your case.
>
> On 1/10/06, Gabor Grothendieck <[hidden email]> wrote:
> > You can improve it somewhat by first accepting all the largest
> > matches and removing the rows and columns for those and
> > repeatedly doing that with what is left.
> >
> > On 1/10/06, Werner Wernersen <[hidden email]> wrote:
> > > Thanks for the nice code, Gabor!
> > >
> > > Unfortunately, it does not seem to work for my purpose: it confuses
> > > many countries when I compare two lists of over 150 countries each.
> > > Do you have any other suggestions?
> > >
> > >
> > >
> > > Gabor Grothendieck <[hidden email]> wrote:
> > > If they were the same you could use merge. To figure out
> > > the correspondence automatically or semiautomatically, try this:
> > >
> > > x <- c("Canada", "US", "Mexico")
> > > y <- c("Kanada", "United States", "Mehico")
> > > # lcs2 is defined further down; source that definition first
> > > result <- outer(x, y, function(x,y) mapply(lcs2, x, y))
> > > result[] <- sapply(result, nchar)
> > > # try both which.max and which.min and if you are lucky
> > > # one of them will give unique values and that is the one to use
> > > # In this case which.max does.
> > > apply(result, 1, which.max) # 1 2 3
> > >
> > > # calculate longest common subsequence between 2 strings
> > > lcs2 <- function(s1,s2) {
> > >      longest <- function(x,y) if (nchar(x) > nchar(y)) x else y
> > >      # Make sure args are strings
> > >      a <- as.character(s1); an <- nchar(a)+1
> > >      b <- as.character(s2); bn <- nchar(b)+1
> > >
> > >      # If either arg is an empty string, the common subsequence is empty
> > >      if (nchar(a)==0 || nchar(b)==0) return("")
> > >
> > >      # Initialize matrix for calculations
> > >      m <- matrix("", nrow=an, ncol=bn)
> > >
> > >      for (i in 2:an)
> > >          for (j in 2:bn)
> > >              m[i,j] <- if (substr(a,i-1,i-1)==substr(b,j-1,j-1))
> > >                  paste(m[i-1,j-1], substr(a,i-1,i-1), sep = "")
> > >              else
> > >                  longest(m[i-1,j], m[i,j-1])
> > >
> > >      # Return the longest common subsequence itself (not a distance)
> > >      m[an,bn]
> > > }
> > >
> > >
> > >
> > > > On 1/10/06, Werner Wernersen wrote:
> > > > > Hi,
> > > > >
> > > > > Before I reinvent the wheel I wanted to kindly ask you for your
> > > > > opinion if there is a simple way to do it.
> > > > >
> > > > > I want to merge a larger number of tables from different data
> > > > > sources in R and the matching criterion is the country name. The
> > > > > tables are of different size and sometimes the country names do
> > > > > differ slightly.
> > > > >
> > > > > Has anyone done this or any recommendation on what commands I
> > > > > should look at to automate this task as much as possible?
> > > > >
> > > > > Thanks a lot for your effort in advance.
> > > > >
> > > > > All the best,
> > > > > Werner
> > > > >
> > > >
> > > >
> > > > ---------------------------------
> > > > Telefonieren Sie ohne weitere Kosten mit Ihren Freunden von PC zu
> PC!
> > > >
> > > > [[alternative HTML version deleted]]
> > > >
> > > > ______________________________________________
> > > > [hidden email] mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide!
> > > http://www.R-project.org/posting-guide.html
> > > >
> > >
> > >
> > >
> > > ________________________________
> > > Telefonieren Sie ohne weitere Kosten mit Ihren Freunden von PC zu
> PC!
> > > Jetzt Yahoo! Messenger installieren!
> > >
> > >
> >
>
>
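
The two suggestions quoted above (approximate string matching, and
repeatedly accepting the best remaining match while dropping its row and
column) could be combined roughly as follows. This is only a sketch: it
assumes base R's adist() (generalized Levenshtein distance, utils
package) as the comparator, which is a stand-in and not the comparator
from the US Census Bureau mentioned above, and greedy_match is a made-up
name.

```r
# Greedy one-to-one matching on a matrix of normalised edit distances.
# adist() is from the utils package that ships with R.
greedy_match <- function(x, y) {
  d <- adist(x, y, ignore.case = TRUE)
  # divide by the longer name so short and long names are comparable
  d <- d / outer(nchar(x), nchar(y), pmax)
  out <- rep(NA_integer_, length(x))
  for (k in seq_len(min(length(x), length(y)))) {
    # position of the smallest remaining distance (first one on ties)
    best <- which(d == min(d, na.rm = TRUE), arr.ind = TRUE)[1, ]
    out[best["row"]] <- best["col"]
    d[best["row"], ] <- NA   # remove the matched row ...
    d[, best["col"]] <- NA   # ... and the matched column
  }
  out                        # out[i] is the index in y matched to x[i]
}

x <- c("Canada", "US", "Mexico")
y <- c("Kanada", "United States", "Mehico")
greedy_match(x, y)   # 1 2 3
```

On lists of 150 or more countries, abbreviations and ties will probably
still need a manual check of the pairs with the largest distances.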
