September 2007 Archives

How to create an 11GB database table with 0 rows

There are a decent number of ZIP codes in the US (42,296 to be precise; variations bring the table size up to 79,834). When putting them into a database table, it helps to create an index, but you’re not likely to run into much trouble otherwise.

The ZIP+4 database is a different story. Back in April, there were roughly 43.5 million entries in that particular table. The raw text file is 9GB, and doing a simple task like adding an index requires you to check to make sure you’re not going to run out of disk space, and have plenty of time on your hands.

The September database has apparently gotten a little bigger, as my estimate was a few GB short, and I ran out of disk space while importing it. The sad thing is that it was probably really close to being done, too.

But the aftermath of that problem is where it gets amusing, in an odd sort of way:

Because of the way databases work, the import basically started a transaction, imported practically all of the data, and then stopped when it ran out of hard drive space.

What it didn’t do was delete all of the data that had gotten imported. It’s still there, though inaccessible because it was in an incomplete transaction.

Aborting that transaction doesn’t delete the data. That doesn’t happen until vacuuming, which is normally done periodically by PostgreSQL (since 8.0 or so), and which can be forced if needed (like when you import 11GB of data and immediately delete it all).

Because of Point-in-Time-Recovery, a copy of all 11GB of that non-existent data is currently in the process of being copied to my backup server, and is also taking up a rather large amount of disk space in the mean time.

As a result, “SELECT * FROM usps_zip9s” takes about 15 minutes, and returns nothing.

And now I get to start over. Or, at least I will once I get the table cleared out (11GB) and the PITR files transfered (another 11GB). This time, I’m going to import the data in chunks, just in case.

Also Seen in a Letter

Together we welcomed a Boy Scout group who brought in rabbits and made cages for them out of bamboo, a new missionary family from Texas, and a team of eight from Korea.

Introducing the garden path sentence. The most obvious/immediate reading of the above is that the Boy Scout group made cages out of bamboo, an unsuspecting southern family, and a bunch of Asians.

Never underestimate the be-preparedness of Boy Scouts!

Ok. [deep breath]. I think I can go back to reading that letter now, and I think I’m going to change the order of who got welcomed so that the Boy Scouts and their rabbits are last.

As Seen in a Letter

Here is the question of the week: Can you scare the daylights out of evil spirits by shooting a machine gun at them? We recently attended a funeral here and apparently at least one of those present thought the answer to the question was yes. His answer was incorrect, but I do have to say that it did make for a lively graveside service.

I bet!

As Seen in a Letter

Another new staff member is a recent graduate of _ State University. [New Staff Member] is developing his ministry partner team in _ and is cursing right along.

“cursing” and “cruising” are very different words. :-)

One of the stats I keep track of for work is how quickly we get mailings out the door. It’s actually the only stat that I make public, incidentally — partly for personal accountability, and partly because it’s a selling point.

It’s a stat that’s easy enough to calculate right in the database:

turnaround_time := SELECT mailed - submitted;

Catch 1: I don’t want to include time spent waiting for someone to approve a preview or pay for a mailing.

That makes the calculation a little more complicated, but not by much — just subtract the time spent waiting from the total:

turnaround_time := SELECT mailed - submitted - (GREATER(approved, paid) - responded);

Catch 2: I don’t want to include weekends or postal holidays in the turnaround time either, since I don’t work on either, as a general rule.

This is where it gets messy — PostgreSQL doesn’t have a built-in way of calculating business days, and it certainly doesn’t have a database of postal holidays.

So, a while back, I wrote a database function to do that calculation, given two dates and a set of holidays.

Assuming, for the time being, that no two postal holidays will happen on successive days* (or on weekends), the logic looks something like this:

Part 1:

  • If the start time is on a holiday, move the start time to midnight on the next day
  • If the start time is on a weekend, move the start time to midnight on Monday
  • If the start time is on a holiday, move the start time to midnight on the next day

Part 2:

  • If the end time is on a holiday, move the end time to one second before the beginning of the day
  • If the end time is on a weekend, move the end time to one second before midnight on Saturday
  • If the end time is on a holiday, move the end time to one second before the beginning of the day

Part 3:

  • Figure out how much time there is between the adjusted start and end times
  • Subtract two days for every seven
  • If the weekday of the end time is less than the weekday of the start time (i.e. there’s a weekend in between, but not a full week), subtract another two days

Part 4:

  • Subtract one day for every holiday between the start time and the end time

Part 5:

  • Return 0 seconds if the remaining time is negative (e.g. if start and end times are both on a weekend)
  • Return the remaining time otherwise

As far as I know, that logic is complete, and should work.

But it wasn’t working.

Specifically, when the start and end dates were exactly a week apart, I was getting seven work days instead of five, meaning that the “subtract two days for every seven” line had a bug in it.

Here’s the code that should have worked:

gap := gap - ((((EXTRACT(EPOCH FROM gap)/86400.0) / 7.0 - 0.5)::INTEGER * 2) || ' days')::INTERVAL;

That’s a mouthful, so here’s the expanded version:

  • Starting with gap (i.e. the end time minus the start time)…

  • Figure out how many seconds there are in gap (that’s the EPOCH bit — it’s a cheat to turn a PostgreSQL interval into an integer that I can do arithmetic on).

  • Figure out how many days that represents (60 seconds x 60 minutes x 24 hours = 86,400). The .0 keeps the number from becoming an integer just yet (i.e. don’t round off the number).

  • Divide that number by 7 to get the number of weeks. Again, don’t round yet.

  • I only want to take whole weeks into account, so turn the number into an integer now. Since PostgreSQL does this by rounding to the nearest whole number, subtract 0.5 first, which is equivalent to saying “round down”.

  • Multiply that number by two, so that I get two days per week.

  • Subtract that number from the initial gap.

  • Turn it back into a PostgreSQL interval by appending the word “days” and type-casting it into an interval (the two vertical lines, or “pipes”, are the PostgreSQL command to append text).

On the face of it, everything looks fine. So, now comes the debugging.

Taking two arbitrary dates that are a week apart, not on weekends, and which definitely don’t have holidays in my database, I run it through each of the steps in the long line above:

select extract(epoch from (‘1950-01-09’::timestamp - ‘1950-01-02’::timestamp)::interval);
=> 604800

select extract(epoch from (‘1950-01-09’::timestamp - ‘1950-01-02’::timestamp)::interval) / 86400.0;
=> 7

select extract(epoch from (‘1950-01-09’::timestamp - ‘1950-01-02’::timestamp)::interval) / 86400.0 / 7.0;
=> 1

select extract(epoch from (‘1950-01-09’::timestamp - ‘1950-01-02’::timestamp)::interval) / 86400.0 / 7.0 - 0.5; => 0.5

So far, everything’s correct. But watch the next line:

select (extract(epoch from (‘1950-01-09’::timestamp - ‘1950-01-02’::timestamp)::interval) / 86400.0 / 7.0 - 0.5)::integer;
=> 0

That’s not right — 0.5 is supposed to round to 1, not 0. See:

select 0.5::integer;
=> 1

It’s the nasty floating point math, striking again. You know, where 2 + 2 can equal 5 for sufficiently large values of 2.

In pretty much every programming language known to man, you have integers and floating point numbers. Integers are whole numbers, and are stored by being converted to binary and stored in a container that’s a specific number of bytes long. A one-byte container (i.e. eight binary digits, or “bits”) lets you store a range of 256 numbers, because that’s how many combinations of 0 and 1 you can fit into eight characters.

Floating point numbers let you get around that 1:1 limitation by storing numbers in something approximating scientific notation, dividing the bits between both parts of the number. This gives you a huge range of numbers, but it has the drawback that it’s not quite as precise, and sometimes bites you when you’re expecting one thing and get another.

I would’ve expected a warning sign to show up before the last step in my troubleshooting, but apparently I was wrong.

To fix the problem, I added a ::NUMERIC(8,2) type cast after subtracting 0.5 and before turning it into an integer. This tells PostgreSQL to turn it into a specific precision number (eight total digits, two of which are after the decimal point — eight was chosen as an arbitrary and excessively large number for this function), and then round it, which ensures that there won’t be any floating point mishaps this time.

So, now my published turnaround times are actually correct! Given how few mailings this (and two other holiday-related implementation bugs that I corrected along the way) affected, the numbers didn’t change much, but it means I’m not seeing some oddball numbers on my work list any more.

* You can adjust the function to allow for successive holidays by changing “if” to “while”. I haven’t done that yet, because I don’t need it, and because I’m afraid of accidentally creating an infinite loop inside the database on a remote production server. You can also account for whole weeks of holidays by grouping the first and second threesomes of statements in while loops.

Update: As it turns out, there was another bug in the code mentioned above. In order to fake a “round down” function, you need to subtract 0.499… (with as many nines as you need for precision), not 0.5, which I discovered when fixing the case when the two times were a week apart by day, but not by hour (e.g. noon on Monday through 9am the next Monday).

My code for that line is now:

gap := gap - ((((EXTRACT(EPOCH FROM (t2::DATE::TIMESTAMP - t1::DATE::TIMESTAMP))/86400.0) / 7.0 - 0.499)::NUMERIC(8,3)::INTEGER * 2) || ' days')::INTERVAL;

The ::DATE::TIMESTAMP bit is the equivalent of converting the dates to midnight, which means that two Mondays a week apart will always have the two days subtracted, even if it hasn’t quite been a full week.

An Important Postal Announcement

Take note, all ye people who mail animals via the post.

Effective the beginning of this month, you may no longer mail an animal that is going to be used in an animal fighting venture. You also are prohibited from mailing any sharp instrument that is designed or intended to be attached to the leg of a bird for use in said venture.

And, finally, you may not print messages on a box that will be mailed (whether or not it contains a prohibited animal) that could be construed as condoning said animal fighting ventures. Unless you’re talking about birds, and the fighting would be happening in a state where it’s legal. (You still can’t mail the bird being promoted, regardless of whether or not it’s legal.)

If you’re curious, an animal in these rules refers to any live bird, live dog, or other mammal, not including humans. You’re apparently able to mail the latter.

Animal fighting specifically excludes hunting, so you can mail your hunting dogs, at least according to the new rules posted this month. Whether or not this is disallowed elsewhere in the mailing manual is outside my area of expertise.

A less than stellar introduction to Ubuntu

All three hard drives in my development/gateway server have been getting progressively worse over the past few months, so I finally got around to installing a new hard drive and putting an up to date operating system on it last night.

I’ve always done Red Hat and Fedora in the past, but thought I might try something different this time, since Fedora gets stale really quickly nowadays.

Ubuntu, on the other hand, has a “Long Term Support” thing where they pledge to keep supporting a given version for five years. Nice!

I downloaded the ISO, burned it to a CD, and installed it, which went nice and smoothly, until it was finished. Then, upon rebooting, I got the boot loader, then “Error 18”. And that was that.

So much for experimentation. A minimal version of Fedora 7 is now on the system instead, which I’ll build up over the weekend. That involved an 8MB bootable CD, with everything else getting downloaded as needed (and only if needed), which made for a rather pleasant install, since I need almost nothing on that server. Network install is a nice trick!

USPS Survey

I’m filling out a survey that was sent to me by the postal service. It’s one of the rare times when most of the survey questions are actually applicable, and where I don’t have to choose “other”, “not applicable”, or “what in the world are you talking about?” on most of the questions.

One of them, though, was a little amusing:

About how many pieces of First-Class Mail (R) does your business send monthly?

  • None
  • 1-25 pieces
  • 26-100 pieces
  • 101-500 pieces
  • More than 500 pieces
  • Don’t know

Considering that my daily output is more than 500 pieces, they ended their list waaaaay too soon. :-)

Recent Entries

Who's on first?
I’m going through the settings of a hard-core E-Mail program, and came across this lovely setting: Copy To Address to…
USPS Extreme Cost-Saving Measures
In the July 2010 edition of the PCC Insider (a USPS publication) is found the following statement: There are 26,000…
At 1:15am this morning
At a hotel, after a day of eventful travel and 3.5 hours of sleep the night before, it’s (almost) needless…
My Amazon.com Wish List