File manipulation is harder than it ought to be.

January 29, 2005 at 6:42 pm (PT) in Programming, Rants/Raves

I am amazed that programming languages (well, the typical ones, at least) don’t make it easier to manipulate files.

A common way to read files in C is to declare a struct that matches the file format and call fread to read the file directly into it. Isn’t that easy enough?

Not really. This approach is fine in isolation, but it’s non-portable:

  • Different architectures or compilers may lay out structs differently. The compiler may insert padding bytes between fields to satisfy alignment requirements. Luckily compilers aren’t allowed to do it willy-nilly, and some compilers offer #pragmas to control this.
  • Different architectures have different integer sizes. Appropriate typedefs can often mitigate this, but it’s still imperfect since each new platform requires a small porting effort.
  • Different architectures use different endianness. If a file format is defined to store integers in big-endian byte order but your architecture is little-endian, then reading a field out of the struct without first swapping its bytes gives you the wrong value.
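
To make the first and third points concrete, here is a minimal sketch in C. The struct naive_record and the helper names are hypothetical, invented only for illustration. It measures how much padding the compiler added to a struct whose fields total 6 bytes, and contrasts decoding 4 bytes as big-endian with reinterpreting them in whatever order the host uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record: the fields themselves add up to 6 bytes. */
struct naive_record {
    uint8_t  tag;
    uint32_t value;
    uint8_t  flag;
};

/* Size a padding-free on-disk layout of those fields would have. */
enum { PACKED_SIZE = sizeof(uint8_t) + sizeof(uint32_t) + sizeof(uint8_t) };

/* Padding bytes the compiler inserted; the amount is implementation-specific. */
static size_t padding_bytes(void)
{
    return sizeof(struct naive_record) - PACKED_SIZE;
}

/* Interpret 4 bytes as a big-endian integer, regardless of the host. */
static uint32_t decode_be32(const unsigned char *bytes)
{
    assert(bytes != NULL);
    return ((uint32_t)bytes[0] << 24) | ((uint32_t)bytes[1] << 16) |
           ((uint32_t)bytes[2] << 8)  |  (uint32_t)bytes[3];
}

/* Reinterpret the same 4 bytes in host order, which is what reading
   straight into a struct field amounts to. */
static uint32_t reinterpret_host32(const unsigned char *bytes)
{
    uint32_t v;
    assert(bytes != NULL);
    memcpy(&v, bytes, sizeof v);
    return v;
}
```

For the bytes 00 00 01 00, decode_be32 gives 256 everywhere, while reinterpret_host32 gives 256 on a big-endian host and 65536 on a little-endian one.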

The typical way to solve these problems is to read a file a byte at a time, copying each byte into the appropriate location within the struct. This is tedious.
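
As a rough sketch of what that tedium looks like, assume a hypothetical 12-byte header: a 1-byte version, 3 reserved bytes, and two big-endian 32-bit integers. The struct and function names below are made up for this example, and error handling is kept to the bare minimum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

enum { HEADER_SIZE = 12 };  /* 1 + 3 + 4 + 4 bytes on disk */

/* In-memory form; its layout no longer has to match the file. */
struct my_file_format {
    uint8_t  version;
    uint32_t num_elements;
    uint32_t data_offset;
};

/* Assemble a big-endian 32-bit value one byte at a time. */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Copy each field from its fixed file offset into the struct. */
static void decode_header(const unsigned char *buf, struct my_file_format *out)
{
    assert(buf != NULL && out != NULL);
    out->version      = buf[0];
    /* buf[1] through buf[3] are reserved and skipped. */
    out->num_elements = load_be32(buf + 4);
    out->data_offset  = load_be32(buf + 8);
}

/* Read and decode the header; returns 0 on success, -1 on a short read. */
static int read_header(FILE *fp, struct my_file_format *out)
{
    unsigned char buf[HEADER_SIZE];
    if (fread(buf, 1, sizeof buf, fp) != sizeof buf)
        return -1;
    decode_header(buf, out);
    return 0;
}
```

This is portable, but every offset is written by hand, and a matching write routine would have to repeat all of them.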

Programming languages should provide a mechanism for programmers to declare a struct that must conform to some external format requirement. Programmers should be able to attribute the struct, prohibiting implicit padding bytes and specifying the size and endianness of each field. For example:

file_struct myFileFormat
{
    uint8 version;
    uint8[3]; // Reserved.
    uint32BE numElements;
    uint32BE dataOffset;
};

When retrieving fields from such a struct, the compiler should generate code that automatically performs the necessary byte swaps and internal type promotions.
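
Lacking such a feature, one way to approximate the generated code by hand in C is to keep the header as raw bytes and go through accessor functions. This is only a sketch: the names my_file_format_raw, header_version and friends are hypothetical, and the offsets assume the 12-byte layout above with no padding.

```c
#include <assert.h>
#include <stdint.h>

/* Raw on-disk image of the header. Read sizeof h->bytes bytes into
   h->bytes rather than relying on sizeof the struct itself. */
struct my_file_format_raw {
    unsigned char bytes[12];
};

/* File offsets of the fields, written down once. */
enum { OFF_VERSION = 0, OFF_NUM_ELEMENTS = 4, OFF_DATA_OFFSET = 8 };

/* Byte-swapping load: what the compiler would emit for a uint32BE read. */
static uint32_t get_be32(const struct my_file_format_raw *h, int offset)
{
    assert(offset >= 0 && offset + 4 <= (int)sizeof h->bytes);
    const unsigned char *p = h->bytes + offset;
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Byte-swapping store: the counterpart for a uint32BE write. */
static void set_be32(struct my_file_format_raw *h, int offset, uint32_t v)
{
    assert(offset >= 0 && offset + 4 <= (int)sizeof h->bytes);
    unsigned char *p = h->bytes + offset;
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

static uint8_t header_version(const struct my_file_format_raw *h)
{
    return h->bytes[OFF_VERSION];
}

static uint32_t header_num_elements(const struct my_file_format_raw *h)
{
    return get_be32(h, OFF_NUM_ELEMENTS);
}

static uint32_t header_data_offset(const struct my_file_format_raw *h)
{
    return get_be32(h, OFF_DATA_OFFSET);
}

static void header_set_data_offset(struct my_file_format_raw *h, uint32_t v)
{
    set_be32(h, OFF_DATA_OFFSET, v);
}
```

It works on any host, but the accessors are still boilerplate that a compiler could derive from the declaration alone.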


6 Comments »

  1. When you refer to the tedious typical way, does that apply to code like this?

    version = readUINT8(stream);
    readUINT8Array(stream, reserved, 3);
    numElements = BEToHost(readUINT32(stream));
    dataOffset = BEToHost(readUINT32(stream));

    — Chris Y @ February 1, 2005, 1:26 pm (PT)

  2. IMO, yes. Even if you package all of it up into functions such as:

    int readFormat(FILE* fp, struct format_t* data);
    int writeFormat(FILE* fp, struct format_t* data);

    There is nothing cohesive tying the struct definition and the code together. Furthermore, there’s nothing cohesive tying the read and write routines together.

    After some thought, maybe something to try in C++ is to create some Datum class that has various child classes, each corresponding to things such as uint32BE, uint32LE, uint16BE, uint16LE, etc. Each could have virtual read/write methods, and you could make a container class that invokes each of them.

    — James @ February 1, 2005, 1:49 pm (PT)

  3. I think equivalent functionality is already implementable in Java:

    Add no-argument constructors to the relevant classes if they lack one. Use custom annotations to mark up fields with endianness, sort order, etc. Then a single method could retrieve the necessary info from any Object (using reflection) and spit out its byte-array form. Another method could reconstruct any Object given a byte array and a Class.

    — Chris Y @ February 10, 2005, 3:30 pm (PT)

  4. Yeah, reflection certainly makes it easier. I’m still not sure yet exactly what a good C++ approach would be.

    — James @ February 13, 2005, 3:17 am (PT)

  5. A good way to avoid most of these problems is to save information as text instead of binary. As an added bonus, your files can be human readable. I usually use text files unless for some reason speed is important.

    — Billy @ February 13, 2005, 3:18 am (PT)

  6. … which is fine if you’re defining your own file format, but it’s not helpful if you’re reading from or writing to, say, some existing image format.

    — James @ February 13, 2005, 12:14 pm (PT)
