10. Persistence

Contents:
Persistence Issues
Streamed Data
Record-Oriented Approach
Relational Databases
Resources

There must be at least 500,000,000 rats in the United States. Of course, I'm speaking only from memory.

- Edgar Wilson Nye

It would be an ideal world indeed if we never had to worry about fatal bugs or power failures.[1] For now, we have to contend with the fact that the attention span of a computer is only as long as its cord and that our data is too precious to be left within the confines of electronic memory. The ability of a system or module to make an application's data live longer than its application is called persistence.[2]

[1] Or end-users, as a letter to Byte magazine once complained!
[2] We'll use the term "system" to mean a C implementation, such as a DBM library or a database, and "module" to refer to a Perl module.

Considering that databases amount to a multi-billion-dollar industry and that DBI (Database Interface) and its associated Perl modules are second only to CGI in CPAN's download statistics, it would not be a stretch to say that persistence is the most important of all technologies. In this chapter, we first study the many factors to be considered in making our data persistent. We then play with most of the freely available Perl persistence modules and hold them up against this checklist of factors, to understand their strengths and weaknesses, what they provide, and where they expect the developer to step in. In the next chapter, we will use some of these modules to create an object persistence framework that stores objects transparently in files and databases.

10.1 Persistence Issues

Data ranges from simple comma-delimited records to complex self-referential structures. Users vary in their level of paranoia and in their ability (and need) to share persistent data. Application programmers juggle solutions that are varying combinations of simple, robust, and efficient. The following list examines these differences in slightly greater detail:

Serialization

Ordinary arrays and hashes can be written to a file using tabs, commas, and so on. Nested structures such as arrays of hashes or arrays of arrays have to be flattened, or serialized, before they can be dumped into a file. If you have ever packed the wiring for your holiday lights, you know that not only do you have to strive for a tight packing, you have to do it in a way that it can be easily and efficiently unscrambled the next time you need to use it. Further, data items can be typeglobs, can contain pointers to native C data structures, or can be references to other data items (making the structures cyclic or self-referential). In this chapter, we will study three modules that serialize data: FreezeThaw, Data::Dumper, and Storable.
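To make the idea of flattening concrete, here is a minimal sketch that uses two of the modules just named, both as shipped with standard Perl distributions. The sample data and variable names are invented for illustration; the modules themselves are studied properly later in the chapter.

```perl
use strict;
use warnings;
use Data::Dumper;
use Storable qw(freeze thaw);

# A nested structure: an array of hashes (sample data, invented for illustration).
my $employees = [
    { name => 'alice', skills => [ 'perl', 'c' ] },
    { name => 'bob',   skills => [ 'sql' ] },
];

# Data::Dumper flattens the structure into Perl source text ...
local $Data::Dumper::Terse  = 1;    # emit a bare expression, no "$VAR1 ="
local $Data::Dumper::Indent = 1;
my $text = Dumper($employees);

# ... which eval turns back into an equivalent structure.
# Only eval text you produced yourself; it is executable code.
my $from_text = eval $text;
die "could not rebuild structure: $@" if $@;

# Storable produces a binary string and copes with self-referential data.
my $node = { name => 'self' };
$node->{me} = $node;                # the hash now points back at itself
my $from_binary = thaw(freeze($node));

print "second skill of first employee: $from_text->[0]{skills}[1]\n";
print $from_binary->{me} == $from_binary ? "cycle preserved\n" : "cycle lost\n";
```

The text form is readable and editable by hand; the binary form is usually more compact and faster to process. Which one matters more depends on the application.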

Boundaries

Ordinary files, being byte streams, neither offer nor impose any kind of boundaries; you have to decide how you keep each data item distinct and recognizable on disk. DBM and ISAM systems impose a record-oriented structure. Relational databases provide record and column boundaries; if your data can be slotted into such a grid structure, you are in luck; otherwise, you have what is commonly called an "impedance mismatch." Newer technologies, such as object-relational and object-oriented databases, attempt to make this "restriction" or "failure" a nonissue.[3]

[3] E.F. Codd, considered the father of relational database theory, has consistently maintained that this mismatch is not an inherent part of the theory itself; it is an artifact of the RDBMS implementation technology.
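One way to impose your own boundaries on a byte stream is to prefix each item with its length, so that items may safely contain commas, tabs, or newlines. The following is only a sketch of that one technique, using Perl's built-in pack and unpack; the sample items are invented.

```perl
use strict;
use warnings;

# Items may contain commas or newlines, so a delimiter alone is unsafe.
my @items = ("plain", "has,comma", "two\nlines", "");

# "N/a*" writes a 32-bit big-endian length followed by that many bytes.
my $stream = join '', map { pack 'N/a*', $_ } @items;

# Reading back: the group "(N/a*)*" repeats until the stream is exhausted.
my @restored = unpack '(N/a*)*', $stream;

print scalar(@restored), " items restored from ", length($stream), " bytes\n";
```

The cost is four bytes of overhead per item; in exchange, the reader never has to guess where one item ends and the next begins.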

Concurrency

Multiple applications or users may want concurrent access to persistent data stores. Some systems ignore this issue altogether; others offer different types of locking schemes.
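For plain files, the simplest locking scheme available from Perl is the advisory lock provided by flock. The sketch below assumes a hypothetical counter file (a temporary file stands in for it here) and a helper we have named increment_counter; it works only if every cooperating program takes the lock in the same way.

```perl
use strict;
use warnings;
use Fcntl qw(:flock SEEK_SET);
use File::Temp qw(tempfile);

# A temporary file stands in for a shared counter file.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "0\n";
close $tmp or die "close: $!";

# Hypothetical helper: read, increment, and rewrite the counter under a lock.
sub increment_counter {
    my ($file) = @_;
    open my $fh, '+<', $file or die "open $file: $!";
    flock($fh, LOCK_EX) or die "flock: $!";     # blocks until we own the lock
    my $count = <$fh>;
    chomp $count;
    $count++;
    seek($fh, 0, SEEK_SET) or die "seek: $!";
    truncate($fh, 0) or die "truncate: $!";
    print {$fh} "$count\n";
    close $fh or die "close: $!";               # closing releases the lock
    return $count;
}

increment_counter($path) for 1 .. 3;
my $final = increment_counter($path);
print "counter is now $final\n";
```

Because the lock is advisory, a program that opens the file without calling flock is not stopped from writing to it.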

Access privileges

Most persistence solutions leave it to the operating system to enforce file-level privileges (create, update, read, or delete). Databases offer a finer level of access restriction.

Random access and insertion

Databases make it easy to insert a new record or update a single attribute. With streams, you have no option but to serialize the entire data set again and rewrite it to the file.
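The cost of the streamed approach shows up as soon as one value changes. In this sketch, which assumes Storable's store and retrieve functions and a hypothetical file name, updating a single entry means reading the whole structure, changing it in memory, and writing the whole structure back.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/inventory.sto";    # hypothetical file name

# Initial save: the whole structure goes out as one stream.
store({ apples => 10, pears => 4 }, $file);

# Changing one value: read everything, modify, write everything.
my $inventory = retrieve($file);
$inventory->{pears} = 5;
store($inventory, $file);

my $check = retrieve($file);
print "pears: $check->{pears}, apples: $check->{apples}\n";
```

For a two-entry hash this is harmless; for a structure of many megabytes, the full rewrite on every small change becomes the dominant cost.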

Queries

DBM and ISAM files allow you to selectively fetch records on the basis of the primary key, and databases allow you to selectively fetch records on the basis of any field. The more data you have, the less you can afford the luxury of examining each record to see whether it matches your criteria.
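The difference between fetching by key and searching on another field can be sketched with a DBM file tied to a hash. We assume SDBM_File here because it is bundled with standard Perl builds; the employee records and their tab-separated layout are invented for illustration.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use SDBM_File;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
tie(my %emp, 'SDBM_File', "$dir/employees", O_RDWR | O_CREAT, 0644)
    or die "cannot tie DBM file: $!";

# The key is the employee id; the value is a tab-separated record.
$emp{1001} = join "\t", 'alice', 'engineering';
$emp{1002} = join "\t", 'bob',   'sales';
$emp{1003} = join "\t", 'carol', 'engineering';

# Fetch by primary key: the DBM library locates the record for us.
my ($name) = split /\t/, $emp{1002};

# A "query" on a non-key field: we have to visit every record ourselves.
my @engineers;
while (my ($id, $rec) = each %emp) {
    my ($n, $dept) = split /\t/, $rec;
    push @engineers, $n if $dept eq 'engineering';
}
@engineers = sort @engineers;

print "1002 is $name; engineering: @engineers\n";
untie %emp;
```

The loop over every record is exactly the luxury that becomes unaffordable as the data grows, and it is the work a relational database takes over.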

Transactions

Important commercial applications require "ACID" properties from persistence solutions [3]:

Atomicity: A series of actions happens as one unit or not at all.

Consistency: The transaction must leave the system in a consistent state. Consistency is the responsibility of the application; a transaction monitor or a database does not know enough about a specific application domain to judge what is consistent and what is not.

Isolation: Reads and writes from independent transactions must be isolated from each other; the result should be identical to what would result if the applications were forced to operate on the data in serial order, one at a time.

Durability: Once a transaction finishes, its results must be firmly committed to disk.

Currently, only databases provide this facility, and there are very few transactional file systems going around. The 2.0 release of the Berkeley DB library provides concurrency, transactions, and recovery, but the Perl wrappers have not been updated to take advantage of it, as of this writing.

Meta-data

If you have access to information that describes your data - meta-data - you can afford to hardcode less. Databases make meta-data explicitly available, while the other solutions simply translate from disk to in-memory Perl structures and let Perl provide the meta-information.

Machine independence

You may want to retrieve data from a file that has been created on a different type of machine. You have to contend with differences in integer and floating-point representation: size as well as byte order.
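One common way around byte-order differences is to write integers in a fixed, agreed-upon layout rather than the machine's native one. The sketch below assumes pack's "N" format (32 bits, most significant byte first) for integers and a decimal string for a floating-point value; the sample numbers are invented.

```perl
use strict;
use warnings;

my $count = 258;        # 0x00000102
my $ratio = 0.15625;    # sample values, invented for illustration

# "N" always yields 4 bytes, most significant first, on every platform.
my $portable_int = pack 'N', $count;

# A native format such as "L" follows the local machine's byte order,
# so these 4 bytes may differ from one architecture to another.
my $native_int = pack 'L', $count;

# One portable option for floating-point values is a decimal string.
my $portable_float = sprintf '%.17g', $ratio;

print "portable int bytes: ", unpack('H*', $portable_int), "\n";
print "native int bytes:   ", unpack('H*', $native_int),   "\n";

# Reading the portable forms back gives the original values.
my $int_back   = unpack 'N', $portable_int;
my $float_back = 0 + $portable_float;
print "restored: $int_back and $float_back\n";
```

On a little-endian machine the two hex dumps differ, which is precisely why a file written with native formats may be misread elsewhere.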

Portability and transparency

Finally, requirements change, and an application that accounts for some of the factors listed above may have to account for more factors - or worse, a different set of factors. There have been several attempts to provide a layer of uniformity between different solutions; for example, DBI and ODBC are two efforts that specify a consistent API across competing relational database implementations. We will be more ambitious in the next chapter: we will build ourselves a set of modules that hide the API differences between file and database storage. Keep in mind that the more transparency you look for, the greater the impact on performance.

In the following pages we examine a variety of Perl modules that enable us to persistently store our data. We classify them by the boundary constraints: streamed (no boundaries), record-oriented, and grid-oriented (relational databases).