Steve Simms

Home

Postal

How to create an 11GB database table with 0 rows

Submitted by Steve Simms on Wed, 09/26/2007 - 3:48pm.
  • Postal
  • Programming

There are a decent number of ZIP codes in the US (42,296 to be precise; variations bring the table size up to 79,834). When putting them into a database table, it helps to create an index, but you're not likely to run into much trouble otherwise.

The ZIP+4 database is a different story. Back in April, there were roughly 43.5 million entries in that particular table. The raw text file is 9GB, and doing a simple task like adding an index requires you to check to make sure you're not going to run out of disk space, and have plenty of time on your hands.

The September database has apparently gotten a little bigger, as my estimate was a few GB short, and I ran out of disk space while importing it. The sad thing is that it was probably really close to being done, too.

But the aftermath of that problem is where it gets amusing, in an odd sort of way:

Because of the way databases work, the import basically started a transaction, imported practically all of the data, and then stopped when it ran out of hard drive space.

What it didn't do was delete all of the data that had gotten imported. It's still there, though inaccessible because it was in an incomplete transaction.

Aborting that transaction doesn't delete the data. That doesn't happen until vacuuming, which is normally done periodically by PostgreSQL (since 8.0 or so), and which can be forced if needed (like when you import 11GB of data and immediately delete it all).

Because of Point-in-Time-Recovery, a copy of all 11GB of that non-existent data is currently in the process of being copied to my backup server, and is also taking up a rather large amount of disk space in the mean time.

As a result, "SELECT * FROM usps_zip9s" takes about 15 minutes, and returns nothing.

And now I get to start over. Or, at least I will once I get the table cleared out (11GB) and the PITR files transfered (another 11GB). This time, I'm going to import the data in chunks, just in case.

An Important Postal Announcement

Submitted by Steve Simms on Thu, 09/13/2007 - 11:01am.
  • Postal

Take note, all ye people who mail animals via the post.

Effective the beginning of this month, you may no longer mail an animal that is going to be used in an animal fighting venture. You also are prohibited from mailing any sharp instrument that is designed or intended to be attached to the leg of a bird for use in said venture.

And, finally, you may not print messages on a box that will be mailed (whether or not it contains a prohibited animal) that could be construed as condoning said animal fighting ventures. Unless you’re talking about birds, and the fighting would be happening in a state where it’s legal. (You still can’t mail the bird being promoted, regardless of whether or not it’s legal.)

If you’re curious, an animal in these rules refers to any live bird, live dog, or other mammal, not including humans. You’re apparently able to mail the latter.

Animal fighting specifically excludes hunting, so you can mail your hunting dogs, at least according to the new rules posted this month. Whether or not this is disallowed elsewhere in the mailing manual is outside my area of expertise.

Postal Oddities

Submitted by Steve Simms on Wed, 04/25/2007 - 8:20pm.
  • Business
  • Postal

This probably isn’t news to anyone, but the postal service is a little odd. Or, at least, they’re odd who determine which city names are acceptable and which ones aren’t.

For each zip code, the postal service has a list of city names that are associated with that zip code, with one of three designations:

  • D (default) — this is the normal city name for this zip code.

  • A (acceptable) — this name is acceptable as well. It’s often a shortened version of the first record, or a smaller township within the main zip code.

  • N (not acceptable) — this name may not be used for bulk mailings (unless there’s a barcode on the envelope, in which case they don’t care). I’m not entirely sure why not in many cases. In some cases, it’s because the town has a different zip code, but is similarly named to this one, and would cause confusion.

For a while, I thought that certain abbreviations were always taboo. For example, “Fort Collins” may not be abbreviated as “Ft Collins”. (I mentioned this one the last time I wrote on this topic.) “Saint” becoming “St” was another example, as well as “Mount” becoming “Mt”.

I can’t think of a good reason why these names should be verboten — it’s not as though the machines couldn’t very easily figure out which one’s which (they do for first-class mail, and the fact that there are “A” records means that it doesn’t need to be a 1-to-1 correlation). But, okay, I can accept that.

But it’s not that simple. If you then look at Ft. Myer, VA, it’s reversed. You can write “Ft. Myer”. But write “Fort Myer,” and it’s “No postal discount for you!”. Why? No clue.

Saint Petersburg, FL can go either way. “St. Petersburg” is allowed. They must be easygoing there. Not so in Saint Cloud, FL. “St. Cloud” is forbidden. I was thinking at one point earlier today that it might be related to the length — longer names can be abbreviated, while shorter ones can’t. No go.

The directions aren’t any better. Most of the time, you’re not allowed to abbreviate North, West, South, or East. Some times, you have to. Westhampton Beach, NY is an amusing example, though I actually know why two (and only two) of the following are allowed:

  • West Hampton Beach
  • Westhampton Beach
  • Westhampton Dunes
  • W Hampton Bch
  • W Hampton Beach

The second and fourth ones are allowed. The others are forbidden. In this case, it’s explainable: the town name doesn’t have a space in it, so the names with spaces are forbidden. But, the post office guarantees that there will be an acceptable name that’s no longer than 13 characters (the entire last line of an address can fit in 28 spaces, if you’re space-constrained — 13 for the city, space, 2 for the state, two spaces, 5-digit zip code, hyphen, 4-digit add-on). So, it had to abbreviate the “West” and “Beach” in order to get it to fit. The space would’ve been added so that the city didn’t become “Whampton Bch”, which just sounds violent. But it was begrudging, so they weren’t also going to allow “Beach” to be expanded.

Returning to the normal oddities, there’s Bowdoin, ME. “W. Bowdoin” is also allowed. But “West Bowdoin” isn’t. It’s usually the other way around.

Spaces also cause trouble. “Mc Allister, MT” must have the space. Most places aren’t allowed to have it. Missing or extraneous spaces account for a significant number of the addresses in my database that don’t standardize properly without changes.

The street-level information, by comparison, is kinda-sorta more or less straightforward, most of the time. Unless you’re in Puerto Rico, where it’s a mess. Otherwise, there are only about five rulesets that they call “unusual addressing situations.” Though there’s still enough odd data to keep that code from being simple. Street addresses are also where people get creative when writing their addresses (e.g. is “8 E Tammy Ct” address 8, apartment E? Or is it address 8 on “East Tammy Court”? Or is it address 8 on a street named “E Tammy Court”, as differentiated from “D Tammy Court”? Or maybe the lots got subdivided, and lots 8A through 8E are all separate addresses on Tammy Court. And don’t get me started on fractions). That one needs quite a few database lookups before finding one that works.

And the script I was running just finished, so I’m back to programming all these odd rules into the code.

Back from the National Postal Forum

Submitted by Steve Simms on Thu, 03/29/2007 - 1:39pm.
  • Business
  • Postal

Now my pile of work awaits. At the very last session, they challenged us (who were left) not to go back, face the mountain of work and other projects, and just file everything away, never to be seen again. Instead, in the next week, do something, to keep the momentum going, and hopefully to build steam for a more systematic approach to address quality.

I decided to one-up that, and actually spent most of last night (wisely or not) doing some tests and writing some validation rules (previously, I had a page that would report validation errors, but it wouldn’t auto-correct any of them).

The stats were interesting — looking just at the city, state, and zip (five-digit) line for customers who had sent mail in the past six months, and ignoring international addresses, 4.6% of the addresses in the database didn’t have a valid combination. Some were missing zip codes, many had incorrectly-spelled or improperly-abbreviated city names (there are standards for that), some had the wrong state, and roughly half of the errors were due to the zip code not having any relation to the city and state, whether because a digit was wrong, two digits were transposed, or the people moved to a new location and updated the address, city, and state, but not the zip.

In all, 90% of my customers have at least one address that doesn’t have a valid city+state+zip. I’m curious to see what the percentages will be when I add street-level checking.

The nice thing is that, without needing to look at the street addresses, I was able to write rules to correct roughly 70% of these errors last night. I think I can get that a little higher before needing to use the street-level data to verify guesses (e.g. in the case where there isn’t a zip code at all, if the city only has one valid zip code, I can use that, but I did need to go to bed at some point last night).

The biggest challenges are going to be when there are two or more errors in the city+state+zip. Take, for example, “N Augusta, GA 29841”. First of all, it should be “North Augusta”. Secondly, it isn’t actually in Georgia — it’s in South Carolina — though the mistake is understandable. Since “N” is always going to be “North” at the beginning of the city, that one’s not too hard.

A more complicated example is “Grey Town, OH 43423.” This one has three errors — there shouldn’t be a space between “Grey” and “Town,” this particular instance of “Grey” should actually be “Gray” (both exist in the US, though the latter is more common), and the last two digits of the zip code need to be flipped. All of them are simple errors (flipped digits in the zip code is probably the most common error — it takes an uber-postal geek or a local to be able to recognize that kind of problem without software), and all of them I can fix individually in code, but processing time goes up drastically when I try to catch several problems at once.

Anyway, time to work, so I’ll end with a few final stats. 75% of the corrections I made were attributable to the following five rules, in order:

  1. Expanding abbreviations at the beginning of the city (e.g. “Ft Collins” should be “Fort Collins”)
  2. One letter of the city was wrong (e.g. “Fort Collens”)
  3. One letter of the city was missing (e.g. “Fort Colins”)
  4. City was spelled incorrectly and I added it to a “common misspells” list (e.g. “Cincinatti”)
  5. City had one more letter than it should have had (e.g. “Fort Colllins”)

Yes, the post office is smart enough to figure any of these out, but it doesn’t give supporters the best of impressions if your prayer letter is addressed to them at “Ft Collens, CA 90254”, and takes an extra two weeks to get delivered because the post office’s machines obligingly sent it to zip code 90254 (Hermosa Beach, CA) instead of 80524 (Fort Collins, CO). The City+State+Zip also need to be correct on every envelope before you can get any kind of discount for sending standard or non-profit mail.

Once I get a little caught up on work, I’ll probably write about why I’m planning to write my own software to do all of this matching rather than using an off-the-shelf product (there are several vendors who write this kind of software, and I have demo discs from a number of them).

Syndicate content
  • About
  • Humor
  • Programming
  • Tech
  • Sound
  • Contact Info
  • What's a Deef?
My Amazon.com Wish List

Welcome to Deef's Net, Steve's online home.

If you think of his professional web sites as his office and workplace, this is where he hangs out in the evenings, chats about what happened during the day, and explores some hobbies.

Recent Posts

  • Highlights from 2008
  • August E-Mail Statistics
  • A new kind of job
  • A Public Service Postal Announcement
  • You know your computer is slow when...
© 2000-2008, Stephen Simms. All Rights Reserved.