While profiling some C code, I rolled my own nchar function which
appears to be much faster than base R's (25 times faster for a 10M
length vector). Obviously base::nchar provides significantly more
features than my barebones function (C snippet below); however, for
argument type = "bytes" it seems that the R_nchar and do_nchar
functions do not actually do anything more than this function.

My suspicion is that I have overlooked some subtlety in the base R
code, or that my benchmarks are not representative. Alternatively,
the action in `do_nchar` of preparing the potential error message
before being passed to `R_nchar` may be quite costly indeed. Or the
function cannot be unswitched from the more complex width and chars
arguments by the compiler.

If I haven't missed something, would a patch be warranted?

SEXP Cnchar(SEXP x) {
  R_xlen_t N = xlength(x);
  SEXP ans = PROTECT(allocVector(INTSXP, N));
  int * restrict ansp = INTEGER(ans);

  // Ignoring NA to avoid the branch has a very small
  // impact on performance.
  for (R_xlen_t i = 0; i < N; ++i) {
    SEXP sxi = STRING_ELT(x, i);
    if (sxi == NA_STRING) {
      ansp[i] = NA_INTEGER;
      continue;
    }
    ansp[i] = length(sxi);
  }
  UNPROTECT(1);
  return ans;
}

x <- rep_len(c(as.character(c(5L, 1:1e6)), NA_character_, 1e6:15e5), 1e7)
Cnchar(x)
90 ms
nchar(x, type = "bytes")
2500 ms

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

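The "unswitching" remark in the message above refers to hoisting a loop-invariant branch out of the loop. A minimal sketch in plain C, not taken from R's sources: `count_mode`, `count_nonspace`, `lengths_switched` and `lengths_unswitched` are hypothetical names, and `COUNT_NONSPACE` only stands in for a costlier per-element mode such as width or chars.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mode selector, loosely mirroring nchar()'s `type` argument. */
typedef enum { COUNT_BYTES, COUNT_NONSPACE } count_mode;

/* Count the non-space bytes in s (stand-in for a costlier mode). */
static int count_nonspace(const char *s) {
    int n = 0;
    for (; *s; ++s)
        if (*s != ' ') ++n;
    return n;
}

/* Switched: the loop-invariant `mode` is tested on every iteration. */
void lengths_switched(const char **x, size_t n, count_mode mode, int *out) {
    for (size_t i = 0; i < n; ++i) {
        switch (mode) {
        case COUNT_BYTES:    out[i] = (int) strlen(x[i]);   break;
        case COUNT_NONSPACE: out[i] = count_nonspace(x[i]); break;
        }
    }
}

/* Unswitched by hand: `mode` is tested once, and each loop body is simple. */
void lengths_unswitched(const char **x, size_t n, count_mode mode, int *out) {
    if (mode == COUNT_BYTES) {
        for (size_t i = 0; i < n; ++i) out[i] = (int) strlen(x[i]);
    } else {
        for (size_t i = 0; i < n; ++i) out[i] = count_nonspace(x[i]);
    }
}
```

Both functions return the same lengths; an optimizing compiler may perform this transformation itself, but whether it does so for a given function depends on the compiler and on how complex the loop body is.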
Thanks for the report, you are probably running into the overhead of the
eager creation of the error message. On my system, with your
micro-benchmark, it is about 10x. I've tested simply by commenting it out
and re-running the benchmark. I'll fix (this is not a good task for a
contributed patch).

Best,
Tomas

On 3/30/21 8:02 AM, Hugh Parsonage wrote:
> [...]

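To illustrate what "eager creation of the error message" can cost, here is a minimal sketch in plain C. It is not R's `do_nchar` code: `is_valid`, `lengths_eager` and `lengths_lazy` are hypothetical names, and the validity check is an arbitrary stand-in for whatever can fail per element.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical check standing in for whatever can fail per element. */
static int is_valid(const char *s) { return s[0] != '!'; }

/* Eager: the message is formatted for every element, although it is
   only needed when an element turns out to be invalid. */
int lengths_eager(const char **x, size_t n, int *out,
                  char *err, size_t errlen) {
    char msg[64];
    for (size_t i = 0; i < n; ++i) {
        snprintf(msg, sizeof msg, "invalid element %zu", i + 1);
        if (!is_valid(x[i])) {
            snprintf(err, errlen, "%s", msg);
            return -1;
        }
        out[i] = (int) strlen(x[i]);
    }
    return 0;
}

/* Lazy: same results, but formatting happens only on the failure path. */
int lengths_lazy(const char **x, size_t n, int *out,
                 char *err, size_t errlen) {
    for (size_t i = 0; i < n; ++i) {
        if (!is_valid(x[i])) {
            snprintf(err, errlen, "invalid element %zu", i + 1);
            return -1;
        }
        out[i] = (int) strlen(x[i]);
    }
    return 0;
}
```

With short strings the per-element work is tiny, so a formatting call on every iteration can plausibly dominate the run time, which would be consistent with the slowdown discussed in this thread.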
For reference, fixed in R-devel (80153).

Tomas

On 3/30/21 10:20 AM, Tomas Kalibera wrote:
> [...]