exporting tables from an access database using parallel foreach


exporting tables from an access database using parallel foreach

vivek
Hi
I want to extract data from a Microsoft Access database that has many tables
with more than 1e7 rows each. I find that the following code works to export a
table to an .rds file:
#####################
setwd('C:/sFolder')
library(RODBC)
ch<-odbcConnect("sample")

# Number of rows in the table is not known in advance
rowN<-1e6  # chunk size: rows per fetch
db<-sqlFetch(ch,"Table1",max=rowN,as.is=TRUE)
file<-paste0('Table1',1,'.rds')
saveRDS(db,file)  # called for its side effect; saveRDS() returns NULL

rm(db);gc()   # garbage collection to free up the memory

# To successively obtain more chunks from the Access database
for (i in 2:10) {
  df<-sqlFetchMore(ch,"Table1",max=rowN,as.is=TRUE)
  file<-paste0('Table1',i,'.rds')
  saveRDS(df,file)
  if (nrow(df)<rowN)
    break
  rm(df);gc()
}
odbcCloseAll()
##############################

I would like to know the following:
1. Is there any way to extract data from a table by just specifying a row
number range? I have extracted data before; instead of repeating the
operations, I would just like to obtain data from, say, the 8e6 to 9e6 row
range. I cannot do this now: I have to call sqlFetchMore successively. I
would like to know if it is possible to jump straight to the 8e6 to 9e6
row range.
2. Is it possible to use the foreach package in the extraction step (in
place of the for loop above)? I am planning to use foreach in parallel
later for processing the data in the multiple files. I just wonder whether
parallel processing is possible for the data extraction as well.
Thanks,
Vivek Sutradhara
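
On question 2, a minimal sketch of the later processing step: reading the
saved chunk files in parallel with foreach and doParallel. The file names
follow the naming scheme above; the column name "value" and the filter
inside the loop are placeholders, not part of the original post. Note that
this parallelises work on the already-saved files; the ODBC extraction
itself goes through a single connection handle that cannot be shared
across workers.

```r
# Sketch only: parallel processing of the saved .rds chunks with foreach.
# Assumes Table11.rds ... Table110.rds exist in the working directory and
# that each chunk has a numeric column named "value" (placeholder).
library(foreach)
library(doParallel)

cl <- makeCluster(4)        # one worker per core
registerDoParallel(cl)

files <- paste0('Table1', 1:10, '.rds')
result <- foreach(f = files, .combine = rbind) %dopar% {
  df <- readRDS(f)
  df[df$value > 0, , drop = FALSE]   # placeholder filter; replace with real logic
}

stopCluster(cl)
```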


______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: exporting tables from an access database using parallel foreach

John McKown
A possibility could be to not use ODBC, but the CRAN package openxlsx (
https://cran.revolutionanalytics.com/web/packages/openxlsx/index.html ).
Then use the read.xlsx() function.
<quote>
Description: Read data from an Excel file or Workbook object into a
data.frame.

Usage: read.xlsx(xlsxFile, sheet = 1, startRow = 1, colNames = TRUE,
rowNames = FALSE, detectDates = FALSE, skipEmptyRows = TRUE, rows = NULL,
cols = NULL, check.names = FALSE, namedRegion = NULL)

Arguments:
xlsxFile: An xlsx file or Workbook object.
sheet: The name or index of the sheet to read data from.
startRow: First row to begin looking for data. Empty rows at the top of a
file are always skipped, regardless of the value of startRow.
colNames: If TRUE, the first row of data will be used as column names.
rowNames: If TRUE, the first column of data will be used as row names.
detectDates: If TRUE, attempt to recognise dates and perform conversion.
skipEmptyRows: If TRUE, empty rows are skipped; otherwise empty rows after
the first row containing data will return a row of NAs.
rows: A numeric vector specifying which rows in the Excel file to read. If
NULL, all rows are read.
cols: A numeric vector specifying which columns in the Excel file to read.
If NULL, all columns are read.
check.names: Logical. If TRUE, the names of the variables in the data
frame are checked to ensure that they are syntactically valid variable
names.
namedRegion: A named region in the Workbook. If not NULL, the startRow,
rows and cols parameters are ignored.
</quote>




--

Schrodinger's backup: The condition of any backup is unknown until a
restore is attempted.

Yoda of Borg, we are. Futile, resistance is, yes. Assimilated, you will be.

He's about as useful as a wax frying pan.

10 to the 12th power microphones = 1 Megaphone

Maranatha! <><
John McKown



Re: exporting tables from an access database using parallel foreach

vivek
Hi John,
Thanks a lot for your quick reply, and thanks for drawing my attention to
the openxlsx package. I will certainly look into it when I work with Excel.
But right now, my problems are with Microsoft Access.

There are huge tables there which I am not able to export to Excel, CSV or
text files with native Access methods. The only solution that has worked so
far is to incrementally extract data with the help of RODBC. This was a
huge leap in my attempts to export the tables. Once I have the data in the
form of .rds files (which are compressed as well), I have found that it is
much easier to work with them.

But my wishes have suddenly expanded, and I want to find out if it is
possible to go beyond the normal capabilities of RODBC (the sqlFetch
command does not have a provision for specifying a row number range). I am
a newbie with parallel methods (using the 4 cores on my PC), but I am
hoping to progress with that for processing the data from the multiple
chunks (the first step will be just to filter and gather the data of
relevance).

I hope that I have explained what I am looking for.
Thanks,
Vivek




Re: exporting tables from an access database using parallel foreach

John McKown
In reply to this post by vivek
My apologies: you wrote "Access" and I read "Excel". I really should not
play a game on my smartphone while speed-reading emails.







Re: exporting tables from an access database using parallel foreach

Jeff Newmiller
In reply to this post by vivek
Row numbers are not a standard feature in SQL, and as far as I know the Access Jet engine does not support them. You are supposed to use the key columns to partition your data, but that may require knowing how many records fall within convenient bin sizes if the data are not uniformly distributed. You can find that out using SQL GROUP BY queries.

Note that the resource you appear to be limited by is the database engine. Parallel processing (more CPUs) is unlikely to yield any improvement, and is in fact likely to slow you down.

This looks like a good topic for the R-sig-db mailing list if you have further questions about R and databases, or find a SQL support forum if you need to learn more about SQL in general.
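
A hedged sketch of the key-column approach in RODBC terms. The column name
`ID`, the bin width, and the exact key range are assumptions for
illustration, not known from the original post:

```r
# Sketch only: partition on an indexed key column instead of row numbers.
# "ID" is an assumed numeric key column in Table1.
library(RODBC)
ch <- odbcConnect("sample")

# GROUP BY query to see how many records fall in each 1e6-wide key bin:
bins <- sqlQuery(ch,
  "SELECT Int(ID / 1000000) AS bin, COUNT(*) AS n
     FROM Table1
    GROUP BY Int(ID / 1000000)")

# Fetch one key range directly, with no sequential sqlFetchMore() calls:
chunk <- sqlQuery(ch,
  "SELECT * FROM Table1 WHERE ID BETWEEN 8000000 AND 8999999")

odbcClose(ch)
```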


--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: exporting tables from an access database using parallel foreach

vivek
Hi John and Jeff,
Thanks a lot for your help. I agree that row numbers are not a standard
feature in SQL. What I am looking for is some kind of a hack. After all,
the sqlFetch command is able to return a specific number of rows, and the
sqlFetchMore command is able to take up the baton from that row onwards to
further return rows corresponding to the max parameter.

I wonder if it is possible to hop straight to a certain row number (without
going through sqlFetch and sqlFetchMore, and without loading any data into
memory) and then return the contents of a certain number of rows. The
question is: is there a "catch" for accessing a row location, and what
could be the "hook" for that? I am also interested in the rows recently
updated in a table after a certain date. Is it possible to identify them in
a quick way? Running SQL queries on such large tables appears to take too
long.
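
For the date question, a sketch under a strong assumption: Access does not
track row-modification times by itself, so the table must already carry its
own timestamp column. The column name `LastModified` and the cutoff date
are hypothetical; an index on that column is what would make the query fast.

```r
# Sketch only: fetch rows updated after a given date. "LastModified" is a
# hypothetical timestamp column; Access date literals are written as #...#.
library(RODBC)
ch <- odbcConnect("sample")
recent <- sqlQuery(ch,
  "SELECT * FROM Table1 WHERE LastModified >= #2015-11-01#")
odbcClose(ch)
```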

 I understand that there is no provision to do this by available methods.
But, is it possible to get under the hood and find some hack?

Jeff, I will take your suggestion and try my luck at the R-sig-db mailing
list.
Thanks,
Vivek




Re: exporting tables from an access database using parallel foreach

John McKown

​Now you're talking about the internals of Microsoft Access. And you're
_way_ beyond my knowledge. Is there such knowledge? I'm sure there is. But,
unfortunately, once you get into that depth, you can get into real trouble
when (not if) MS decides to change the internals out from under you without
any warning at all. If you really need this, try looking at the
"MDB Tools" software at either https://github.com/brianb/mdbtools or
http://mdbtools.sourceforge.net/ . I don't think this does exactly what you
want, but it may give you the information you need to read the MDB file
yourself directly in R code. <shudder/>

​What you would really want is something like the ROWID in SQLite. That is
a "system" maintained column in every table in SQLite. It is a 64-bit
unique number. Basically, it starts at 1 and increments every time you add
a new row.
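For anyone unfamiliar with it, here is a minimal illustration of SQLite's
ROWID from R (a side note only, assuming the DBI and RSQLite packages are
installed; this is not part of the Access workflow):

```r
library(DBI)

# Create a throwaway in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "tbl", data.frame(x = letters[1:10]))

# Jump straight to a row range via the system-maintained rowid
dbGetQuery(con, "SELECT rowid, x FROM tbl WHERE rowid BETWEEN 4 AND 6")

dbDisconnect(con)
```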

What would be "best", IMO, would be if you could alter your Access database
to have a "serial" column which would be your "row number". You could then
get there "directly" by using a SELECT similar to:

SELECT * FROM table WHERE serial BETWEEN first_row AND last_row
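In R that would let you hop straight to a row range with sqlQuery() instead
of sqlFetch()/sqlFetchMore(). A sketch, reusing the DSN from the original
post and assuming a hypothetical numeric column named serial:

```r
library(RODBC)

ch <- odbcConnect("sample")

# Pull rows 8e6 to 9e6 directly; sprintf() avoids scientific notation
# (paste0(8e6) would produce "8e+06", which SQL cannot parse)
qry <- sprintf("SELECT * FROM Table1 WHERE serial BETWEEN %d AND %d",
               8000000L, 9000000L)
df <- sqlQuery(ch, qry, as.is = TRUE)

odbcClose(ch)
```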



>
> Jeff, I will take your suggestion and try my luck at the R-sig-db mailing
> list.
> Thanks,
> Vivek
>
>
>
--

Schrodinger's backup: The condition of any backup is unknown until a
restore is attempted.

Yoda of Borg, we are. Futile, resistance is, yes. Assimilated, you will be.

He's about as useful as a wax frying pan.

10 to the 12th power microphones = 1 Megaphone

Maranatha! <><
John McKown


Re: exporting tables from an access database using parallel foreach

Rainer Hurling
On 22.11.15 at 02:38, John McKown wrote:

> On Sat, Nov 21, 2015 at 11:55 AM, Vivek Sutradhara <[hidden email]>
> wrote:
>
>> Hi John and Jeff,
>> Thanks a lot for your help. I agree that row numbers are not a standard
>> feature in SQL. What I am looking for is some kind of a hack. After all,
>> the sqlFetch command is able to return a specific number of rows. And the
>> sqlFetchMore command is able to take up the baton from that row onwards to
>> further return rows corresponding to the max parameter.
>>
>> I wonder if it is possible to straight away hop to a certain row number
>> (without going through sqlFetch and sqlFetchMore and without loading any
>> data into memory) and then return the contents corresponding to a certain
>> number of rows. The question is : is there a "catch" for accessing a row
>> location, and what could be the "hook" for that? I am interested in the
>> recently updated rows to a table after a certain date. Is it possible to
>> identify them in a quick way? Running SQL queries on such large tables
>> appears to take too long a time.
>>
>>  I understand that there is no provision to do this by available methods.
>> But, is it possible to get under the hood and find some hack?
>>
>
> ​Now you're talking about the internals of Microsoft Access. And you're
> _way_ beyond my knowledge. Is there such knowledge? I'm sure there is. But,
> unfortunately, once you get into that depth, you can get into real trouble
> when (not if) MS decides to change the internals out from under you without
> any warning at all. If you really need this, try looking at the
> "MDB Tools" software at either https://github.com/brianb/mdbtools or
> http://mdbtools.sourceforge.net/ . I don't think this does exactly what you
> want, but it may give you the information you need to read the MDB file
> yourself directly in R code. <shudder/>

MDB Tools gives direct access to the physical .mdb files lying around in
a filesystem. A database file xxx.mdb does not have to be 'active' within
an MS Access 'server' for mdb-tools to read its contents. The idea behind
it is that one should be able to read the contents of .mdb files even when
there is no MS Access instance to connect to and/or no Windows installation
running.

To my knowledge, mdb-tools is not available for Windows platforms, only
for Unix-like systems such as Linux, and maybe OS X.

The R package 'Hmisc' is able to use mdb-tools if they are also present
on the system.

Unfortunately, it seems that mdb-tools also has no direct way to select
rows by their number.
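For completeness, a sketch of reading an .mdb file through Hmisc (assuming
both Hmisc and mdb-tools are installed, on a Unix-like system; the file
name here is illustrative):

```r
library(Hmisc)

# List the table names without importing any data
mdb.get("sample.mdb", tables = TRUE)

# Import a single table as a data frame
db <- mdb.get("sample.mdb", tables = "Table1")
```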



Re: exporting tables from an access database using parallel foreach

vivek
Hi Rainer and John,
Thanks once again for your continued help. I have actually tried out
mdb-tools and was able to get it going on Ubuntu. Unfortunately, to my
disappointment, it was not helpful in my specific case, because the Access
database is on a server which can be accessed only from Windows and not
from Ubuntu (for administrative reasons beyond my control).

When I copied the mdb file over to Ubuntu, I saw only synonyms and no
tables (because I had no contact with the server).

Thanks for the suggestion of a serial number column. I will check and see
if any change is possible.
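In case it helps anyone finding this thread later, here is how the chunked
export from my original post could look once a serial column is available
(a sketch only; the table, DSN and column names are illustrative, and note
that an ODBC channel cannot be shared across parallel foreach workers, so
each worker would need its own odbcConnect()):

```r
library(RODBC)

ch <- odbcConnect("sample")
chunk <- 1000000L  # rows per .rds file

for (i in 1:10) {
  first <- (i - 1L) * chunk + 1L
  last  <- i * chunk
  qry <- sprintf("SELECT * FROM Table1 WHERE serial BETWEEN %d AND %d",
                 first, last)
  df <- sqlQuery(ch, qry, as.is = TRUE)
  if (nrow(df) == 0L) break          # past the last row
  saveRDS(df, paste0("Table1_", i, ".rds"))
  rm(df); gc()                       # free memory between chunks
}

odbcClose(ch)
```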
