|
I've imported a .csv file where character strings that contained accented characters were written as HTML character entities. Is there a function that works on a vector to translate them back to accented (latin1) characters?

Some examples:

> grep("&", author$lname, value=TRUE)
[1] "Frère de Montizon" "Lumière"
[3] "Lumière"           "Niépce"
[5] "Süssmilch"         "Schüpbach"
> grep("&", author$birthplace, value=TRUE)
[1] "Marbach, Württemberg"
[2] "Côte-d'Or"
[3] "Chalon-sur-Saône, Saône-et-Loire"
[4] "Groß Särchen, Germany"
> apropos("HTML")

thx,
-Michael

--
Michael Friendly     Email: friendly AT yorku DOT ca
Professor, Psychology Dept.
York University      Voice: 416 736-2100 x66249 Fax: 416 736-5814
4700 Keele Street    Web: http://www.datavis.ca
Toronto, ONT  M3J 1P3 CANADA

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code. |
|
It's not quite an R solution, but I just pasted your examples into a script window in R and saved it as chars.html. Then I opened it in Firefox and pasted the results here (with returns inserted to match your original).

> grep("&", author$lname, value=TRUE)
[1] "Frère de Montizon" "Lumière"
[3] "Lumière"           "Niépce"
[5] "Süssmilch"         "Schüpbach"
> grep("&", author$birthplace, value=TRUE)
[1] "Marbach, Württemberg"
[2] "Côte-d'Or"
[3] "Chalon-sur-Saône, Saône-et-Loire"
[4] "Groß Särchen, Germany"
> apropos("HTML")

For a CSV file you would want to preserve the lines by adding <br> to the end of each line first.

----------------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352

> -----Original Message-----
> From: [hidden email] [mailto:r-help-bounces@r-project.org] On Behalf Of Michael Friendly
> Sent: Friday, August 10, 2012 11:15 AM
> To: R-help
> Subject: [R] translating HTML character entities to accented characters
>
> [...] |
|
Thanks, David

I need an all-R solution for this, because the author.csv file is exported from a database that enforces the HTML encoding and the import into R may have to be repeated several times as the database is updated.

-Michael

On 8/10/2012 12:40 PM, David L Carlson wrote:
> [...] |
|
This may work for your needs with a little fine tuning. Special and accented characters can be represented in HTML with a character name or a numeric value. For example, " can be represented as &quot; or as &#34; and it appears from your example that both are used. I've attached a dput(HTMLChars) to the end of this message with the concordances. The following works on your data, but I haven't included any error checking. Assuming the lines of your .csv file have been read into a character vector called txt and the data.frame HTMLChars is loaded:

# Search for &Name; (alnum, so that names with digits such as &frac12; also match)
lsta <- unique(unlist(regmatches(txt, gregexpr("&[[:alnum:]]+;", txt))))
lsta <- data.frame(Name=lsta)
matches <- merge(HTMLChars, lsta)
# seq_len() skips the loop when nothing matched; fixed=TRUE matches the entity literally
for (i in seq_len(nrow(matches))) {
   txt <- gsub(matches$Name[i], matches$Character[i], txt, fixed=TRUE)
}

# Search for &#Number;
lstn <- unique(unlist(regmatches(txt, gregexpr("&#[[:digit:]]+;", txt))))
lstn <- data.frame(Number=lstn)
matches <- merge(HTMLChars, lstn)
for (i in seq_len(nrow(matches))) {
   txt <- gsub(matches$Number[i], matches$Character[i], txt, fixed=TRUE)
}

txt now contains the converted characters.

dput(HTMLChars)
structure(list(Character = c("\"", "'", "&", "<", ">", "", "¡",
"¢", "£", "¤", "¥", "¦", "§", "¨", "©", "ª", "«", "¬", "",
"®", "¯", "°", "±", "²", "³", "´", "µ", "¶", "·", "¸", "¹", "º",
"»", "¼", "½", "¾", "¿", "×", "÷", "À", "Á", "Â", "Ã", "Ä", "Å",
"Æ", "Ç", "È", "É", "Ê", "Ë", "Ì", "Í", "Î", "Ï", "Ð", "Ñ", "Ò",
"Ó", "Ô", "Õ", "Ö", "Ø", "Ù", "Ú", "Û", "Ü", "Ý", "Þ", "ß", "à",
"á", "â", "ã", "ä", "å", "æ", "ç", "è", "é", "ê", "ë", "ì", "í",
"î", "ï", "ð", "ñ", "ò", "ó", "ô", "õ", "ö", "ø", "ù", "ú", "û",
"ü", "ý", "þ"), Number = c("&#34;", "&#39;", "&#38;", "&#60;",
"&#62;", "&#160;", "&#161;", "&#162;", "&#163;", "&#164;", "&#165;",
"&#166;", "&#167;", "&#168;", "&#169;", "&#170;", "&#171;", "&#172;",
"&#173;", "&#174;", "&#175;", "&#176;", "&#177;", "&#178;", "&#179;",
"&#180;", "&#181;", "&#182;", "&#183;", "&#184;", "&#185;", "&#186;",
"&#187;", "&#188;", "&#189;", "&#190;", "&#191;", "&#215;", "&#247;",
"&#192;", "&#193;", "&#194;", "&#195;", "&#196;", "&#197;", "&#198;",
"&#199;", "&#200;", "&#201;", "&#202;", "&#203;", "&#204;", "&#205;",
"&#206;", "&#207;", "&#208;", "&#209;", "&#210;", "&#211;", "&#212;",
"&#213;", "&#214;", "&#216;", "&#217;", "&#218;", "&#219;", "&#220;",
"&#221;", "&#222;", "&#223;", "&#224;", "&#225;", "&#226;", "&#227;",
"&#228;", "&#229;", "&#230;", "&#231;", "&#232;", "&#233;", "&#234;",
"&#235;", "&#236;", "&#237;", "&#238;", "&#239;", "&#240;", "&#241;",
"&#242;", "&#243;", "&#244;", "&#245;", "&#246;", "&#248;", "&#249;",
"&#250;", "&#251;", "&#252;", "&#253;", "&#254;"), Name = c("&quot;",
"&apos;", "&amp;", "&lt;", "&gt;", "&nbsp;", "&iexcl;", "&cent;",
"&pound;", "&curren;", "&yen;", "&brvbar;", "&sect;", "&uml;",
"&copy;", "&ordf;", "&laquo;", "&not;", "&shy;", "&reg;", "&macr;",
"&deg;", "&plusmn;", "&sup2;", "&sup3;", "&acute;", "&micro;",
"&para;", "&middot;", "&cedil;", "&sup1;", "&ordm;", "&raquo;",
"&frac14;", "&frac12;", "&frac34;", "&iquest;", "&times;", "&divide;",
"&Agrave;", "&Aacute;", "&Acirc;", "&Atilde;", "&Auml;", "&Aring;",
"&AElig;", "&Ccedil;", "&Egrave;", "&Eacute;", "&Ecirc;", "&Euml;",
"&Igrave;", "&Iacute;", "&Icirc;", "&Iuml;", "&ETH;", "&Ntilde;",
"&Ograve;", "&Oacute;", "&Ocirc;", "&Otilde;", "&Ouml;", "&Oslash;",
"&Ugrave;", "&Uacute;", "&Ucirc;", "&Uuml;", "&Yacute;", "&THORN;",
"&szlig;", "&agrave;", "&aacute;", "&acirc;", "&atilde;", "&auml;",
"&aring;", "&aelig;", "&ccedil;", "&egrave;", "&eacute;", "&ecirc;",
"&euml;", "&igrave;", "&iacute;", "&icirc;", "&iuml;", "&eth;",
"&ntilde;", "&ograve;", "&oacute;", "&ocirc;", "&otilde;", "&ouml;",
"&oslash;", "&ugrave;", "&uacute;", "&ucirc;", "&uuml;", "&yacute;",
"&thorn;")), .Names = c("Character", "Number", "Name"), row.names = c(NA,
100L), class = "data.frame")

-------
David

> -----Original Message-----
> From: Michael Friendly [mailto:[hidden email]]
> Sent: Friday, August 10, 2012 12:14 PM
> To: [hidden email]
> Cc: 'R-help'
> Subject: Re: [R] translating HTML character entities to accented characters
>
> [...] |
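As a possible complement to the lookup-table approach, and only as a sketch: a decimal numeric entity spells out the Unicode code point of its character, so entities of the form &#NNN; can be decoded without any table. The helper name `decode_numeric_entities` is hypothetical (it does not appear in the thread), and named entities such as &eacute; would still need a concordance like HTMLChars.

```r
# Hypothetical helper, not from the thread: decode decimal numeric
# entities such as "&#233;" by treating the number as a Unicode code point.
decode_numeric_entities <- function(x) {
  m <- gregexpr("&#[0-9]+;", x)
  # For each element of x, replace every matched entity with its character.
  regmatches(x, m) <- lapply(regmatches(x, m), function(ents) {
    codes <- as.integer(gsub("[^0-9]", "", ents))  # "&#233;" -> 233
    intToUtf8(codes, multiple = TRUE)              # 233 -> "\u00e9"
  })
  x
}

decode_numeric_entities(c("Ni&#233;pce", "S&#252;ssmilch", "plain"))
```

The result is UTF-8 encoded; if latin1 is really required, something like iconv(x, "UTF-8", "latin1") could be applied afterwards.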
|
Beautiful, David. Thanks so much!

I packaged this as a function, html2latin1(), with this simple test

> grep("&", author$givennames, value=TRUE)
 [1] "Adolphe d'"          "Émile"
 [3] "Louis Jacques Mandé" "René"
 [5] "André Michel"        "Léon"
 [7] "Émile"               "Maurice d'"
 [9] "Louis Ézéchiel"      "Louis-Léger"
[11] "Pierre-François"
> grep("&", author$givennames)
 [1]   5  33  36  37  59  79  84 108 117 140 153
> html2latin1(author$givennames)[grep("&", author$givennames)]
 [1] "Adolphe d'"          "Émile"               "Louis Jacques Mandé" "René"
 [5] "André Michel"        "Léon"                "Émile"               "Maurice d'"
 [9] "Louis Ézéchiel"      "Louis-Léger"         "Pierre-François"

html2latin1 <- function(txt) {
   # Search for &Name; (alnum, so that names with digits such as &frac12; also match)
   lsta <- unique(unlist(regmatches(txt, gregexpr("&[[:alnum:]]+;", txt))))
   lsta <- data.frame(Name=lsta)
   matches <- merge(HTMLChars, lsta)
   # seq_len() skips the loop when nothing matched; fixed=TRUE matches the entity literally
   for (i in seq_len(nrow(matches))) {
      txt <- gsub(matches$Name[i], matches$Character[i], txt, fixed=TRUE)
   }

   # Search for &#Number;
   lstn <- unique(unlist(regmatches(txt, gregexpr("&#[[:digit:]]+;", txt))))
   lstn <- data.frame(Number=lstn)
   matches <- merge(HTMLChars, lstn)
   for (i in seq_len(nrow(matches))) {
      txt <- gsub(matches$Number[i], matches$Character[i], txt, fixed=TRUE)
   }
   txt
}

And this seems to work for the whole file:

# Passing file names directly avoids leaving unclosed connections behind
authorfile <- readLines("author.csv")
authorfilet <- html2latin1(authorfile)
writeLines(authorfilet, "authort.csv")

best,
-Michael

On 8/12/2012 4:36 PM, David L Carlson wrote:
> [...] |
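A variant of the same idea, sketched under stated assumptions rather than taken from the thread: the concordance can be held in a named character vector and each entity replaced literally. Both `entity_map` (deliberately abbreviated here) and `decode_entities` are hypothetical names; in practice the pairs would come from the full HTMLChars data frame.

```r
# Hypothetical, abbreviated lookup table (entity -> character).
# The full HTMLChars concordance from the thread would supply all pairs.
entity_map <- c("&eacute;" = "\u00e9", "&egrave;" = "\u00e8",
                "&uuml;"   = "\u00fc", "&ocirc;"  = "\u00f4",
                "&szlig;"  = "\u00df", "&auml;"   = "\u00e4",
                "&#39;"    = "'")

# Replace each entity literally (fixed = TRUE, so no regex interpretation).
# "&amp;" is handled last so that text such as "&amp;eacute;" is decoded
# only once, to "&eacute;".
decode_entities <- function(txt, map = entity_map) {
  for (ent in names(map)) {
    txt <- gsub(ent, map[[ent]], txt, fixed = TRUE)
  }
  gsub("&amp;", "&", txt, fixed = TRUE)
}

decode_entities(c("Ni&eacute;pce", "Gro&szlig; S&auml;rchen", "no entities"))
```

Because the loop runs over the names of the map, input without any entities passes through unchanged.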
