seek(), skip by bits (not by bytes) in binary file

seek(), skip by bits (not by bytes) in binary file

Ben quant
Hello,

Has a function been built that will skip to a certain bit in a binary file?

As of 2009 the answer was 'no':
http://r.789695.n4.nabble.com/read-binary-file-seek-td900847.html
https://stat.ethz.ch/pipermail/r-help/2009-May/199819.html

If you feel I don't need to (like in the links above), please provide some
help. (Note this is my first time working with binary files.)

I'm still working on the script, but here is where I am right now. The for
loop is being used because:

1) I have to get down to the correct position, then get the info I want/need.
The stuff I am reading through (x) is not fully understood and it is a mix
of various chars, floats, integers, etc. of various sizes, so I don't
know how many bytes to read in unless I read them bit by bit. (The
information and structure of the information changes daily so I'm skipping
over it.)
2) If I skip all in one readBin() my 'n' value is often up to 20 times too
big (I get an error) and/or R won't let me "allocate a vector of size...."
etc. So I split it up into chunks (divide by 20 etc.) and read each chunk,
then trash each part that is readBin()'d. Then on the last line I get the
data that I want (data1).

Here is my working code:

# I have to read past 'junk' bits in the to.read file. 'junk' is a huge
# integer, so I divide it up and loop through to.read in parts (jb_part).
# (Note: readBin() with "raw" and size=1 counts bytes, not bits.)
  divr = 20
  mod = junk %% divr

  jb_part = as.integer(junk/divr)
  jb_part_mod = jb_part + mod # catch the remainder/modulus

  # connect to the binary file
  to.read = file(paste(dbs_path,"/",dbs_file,sep=""),"rb")
# loop in chunks to where I want to be
  for(i in 1:(divr-1)){
    x = readBin(to.read,"raw",n=jb_part,size=1)
    x = NULL # trash the result b/c I don't want it
  }
# read a little more to include the remainder/modulus bits left over by
# dividing by 20 above
  x = readBin(to.read,'raw',n=jb_part_mod,size=1)
  x = NULL # trash it

# finally get the data that I want
data1 = readBin(to.read,double(),n=some_number,size=size_to_use)

This works, but it is SLOW!  Any ideas on how to get down to the correct
bit a bit quicker (pun intended)? :)

Thanks for any help!

Ben

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: seek(), skip by bits (not by bytes) in binary file

jholtman
I am not sure why reading through 'bit-by-bit' gets you to where you
want to be.  I assume that the file has some structure, even though it
may be changing daily.  You mentioned the various types of data that
it might contain; are they all in 'byte' sized chunks?  If you really
have data that begins in the middle of a byte and then extends over
several bytes, you will have to write some functions that will pull
out this data and then reconstruct it into an object (e.g., integer,
numeric, ...) that R understands.  Can you provide some more
definition of what the data actually looks like and how you would find
the "pattern" of the data?  Almost all systems read, at the lowest
level, byte-sized chunks, and if you really have to get down to the bit
level to reconstruct the data, then you have to write the unpack/pack
functions.  This can all be done once you understand the structure of
the data.  So some examples would be useful if you want someone to
propose a solution.
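
A minimal sketch of the kind of unpack function described above, for a field that starts partway through one byte and runs into the next. The helper name unpack_bits, the bit numbering (least-significant bit first within each byte) and the little-endian layout are assumptions made for illustration; the real layout of the .dbs files is not known from this thread. Only base R's rawToBits() and packBits() are used.

```r
# Hypothetical helper: pull an unsigned integer of `nbits` bits out of a raw
# vector, starting at the 0-based bit offset `bit_offset`.
# Assumption: bits are numbered least-significant-first within each byte and
# the field is little-endian. nbits is capped at 31 so the result always fits
# in a non-negative R integer.
unpack_bits <- function(bytes, bit_offset, nbits) {
  stopifnot(is.raw(bytes),
            nbits >= 1, nbits <= 31,
            bit_offset >= 0,
            bit_offset + nbits <= 8 * length(bytes))
  all_bits <- rawToBits(bytes)                    # one raw 00/01 per bit, LSB first
  field    <- all_bits[bit_offset + seq_len(nbits)]
  padded   <- c(field, raw(32 - nbits))           # packBits() needs 32 bits per integer
  packBits(padded, type = "integer")
}

# Two bytes holding a 12-bit field that begins at bit 4 of the first byte.
value <- unpack_bits(as.raw(c(0xF0, 0xAB)), bit_offset = 4, nbits = 12)
print(value)   # 2751, i.e. 0xABF
```

If the format numbers bits from the most-significant end, or stores multi-byte fields big-endian, the indexing into all_bits would need to be reversed accordingly.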

On Tue, Jun 19, 2012 at 11:54 AM, Ben quant <[hidden email]> wrote:

> Hello,
>
> Has a function been built that will skip to a certain bit in a binary file?
> [...]



--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

Re: seek(), skip by bits (not by bytes) in binary file

Ben quant
Other people at my firm who know a lot about binary files couldn't figure
out the parts of the file that I am skipping over. Part of the issue is
that there are several different files (dbs extension files) like this that
I have to process and the structures do change depending on the source of
these files.

In short, the problem is over my head and I was hoping to go right to the
correct bit and read, which would make things much easier. I guess not...
Thanks for your help though.

Anyone else?

thanks,

ben

On Tue, Jun 19, 2012 at 10:10 AM, jim holtman <[hidden email]> wrote:

> I am not sure why reading through 'bit-by-bit' gets you to where you
> want to be. [...]

Re: seek(), skip by bits (not by bytes) in binary file

Jeff Newmiller
If the structure really changes day by day, then you have to decipher how it is constructed in order to find the correct bit to go to.

If you think you already know which bit to go to, then the way you know is "the 3rd bit of the 71st byte", which means that the existing seek function should be sufficient to get that byte and pick apart the bits to get the ones you want.

I recommend using the hexView package for this kind of task.
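
One way to act on "the 3rd bit of the 71st byte" in base R, as a rough sketch: seek() to the byte, read it as raw, then pick the bit out. The temporary file and the choice to count bits from the least-significant end are assumptions made for the example; a real format may number bits from the other end.

```r
# Build a small throwaway file so the example is self-contained:
# 100 zero bytes, with only the 71st byte carrying a set bit.
path  <- tempfile(fileext = ".bin")
bytes <- as.raw(rep(0, 100))
bytes[71] <- as.raw(0x04)           # binary 00000100: 3rd bit from the low end
writeBin(bytes, path)

con <- file(path, "rb")
invisible(seek(con, where = 70))    # seek() offsets are 0-based: 71st byte = offset 70
b <- readBin(con, "raw", n = 1)     # read exactly one byte
close(con)

# Option 1: expand the byte into its 8 bits (least-significant first).
third_bit <- as.integer(rawToBits(b)[3])

# Option 2: test the same bit with a mask (4 = 2^2).
third_bit_set <- bitwAnd(as.integer(b), 4L) > 0

print(third_bit)
print(third_bit_set)
```

Both options give the same answer; the mask form is handy when several flags in one byte have to be checked.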
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<[hidden email]>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.



Ben quant <[hidden email]> wrote:

>Other people at my firm who know a lot about binary files couldn't figure
>out the parts of the file that I am skipping over. [...]

Re: seek(), skip by bits (not by bytes) in binary file

Ben quant
This post got me thinking and this works (fast!) to get the first 10
integers that I want:

# I'm still testing this...
# ...once I find the values of 'junk' and 'size_to_use', which I already
# had/have.

to.read = file(file_path_name,"rb")
seek(to.read,where=junk) # jump straight to byte offset 'junk'
data1 = readBin(to.read,integer(),n=10,size=size_to_use)
close(to.read)

Seems kinda silly that I didn't think of this before...I looked into using
seek() before...
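
A self-contained sketch of why the seek() version is equivalent to the read-and-discard loop, using a made-up temporary file rather than a real .dbs file. The 1000-byte filler, the 4-byte integer size and the variable names slow and fast are assumptions for the demonstration only.

```r
# Write a throwaway file: 1000 filler bytes, then the ten integers we want.
path <- tempfile(fileext = ".bin")
con  <- file(path, "wb")
writeBin(as.raw(sample(0:255, 1000, replace = TRUE)), con)
writeBin(1:10, con, size = 4)
close(con)

junk        <- 1000   # byte offset of the wanted data (assumed known)
size_to_use <- 4      # bytes per integer in this made-up file

# Slow route: read the filler into memory only to throw it away.
con <- file(path, "rb")
invisible(readBin(con, "raw", n = junk))
slow <- readBin(con, integer(), n = 10, size = size_to_use)
close(con)

# Fast route: move the file position directly; nothing is allocated.
con <- file(path, "rb")
invisible(seek(con, where = junk, origin = "start"))
fast <- readBin(con, integer(), n = 10, size = size_to_use)
close(con)

print(identical(slow, fast))
```

seek() counts in bytes, so a position known in bits would first have to be split into a byte offset (bits %/% 8) and a bit within that byte (bits %% 8). The ?seek help page also carries a warning about file positioning on Windows, so it is worth checking the result on that platform.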

Anyway, thanks for helping me think it through.

PS - I still don't know how to use "the 3rd bit of the 71st byte" ...or was
that an example of how to think about the problem?

Thanks!
Ben


On Tue, Jun 19, 2012 at 11:07 AM, Jeff Newmiller
<[hidden email]>wrote:

> If the structure really changes day by day, then you have to decipher how
> it is constructed in order to find the correct bit to go to.
> [...]