Now my pile of work awaits. At the very last session, they challenged us (who were left) not to go back, face the mountain of work and other projects, and just file everything away, never to be seen again. Instead, in the next week, do something, to keep the momentum going, and hopefully to build steam for a more systematic approach to address quality.
I decided to one-up that, and actually spent most of last night (wisely or not) doing some tests and writing some validation rules (previously, I had a page that would report validation errors, but it wouldn’t auto-correct any of them).
The stats were interesting — looking just at the city, state, and zip (five-digit) line for customers who had sent mail in the past six months, and ignoring international addresses, 4.6% of the addresses in the database didn’t have a valid combination. Some were missing zip codes, many had incorrectly-spelled or improperly-abbreviated city names (there are standards for that), some had the wrong state, and roughly half of the errors were due to the zip code not having any relation to the city and state, whether because a digit was wrong, two digits were transposed, or the people moved to a new location and updated the address, city, and state, but not the zip.
In all, 90% of my customers have at least one address that doesn’t have a valid city+state+zip. I’m curious to see what the percentages will be when I add street-level checking.
The nice thing is that, without needing to look at the street addresses, I was able to write rules to correct roughly 70% of these errors last night. I think I can get that a little higher before needing to use the street-level data to verify guesses (e.g. in the case where there isn’t a zip code at all, if the city only has one valid zip code, I can use that, but I did need to go to bed at some point last night).
The biggest challenges are going to be when there are two or more errors in the city+state+zip. Take, for example, “N Augusta, GA 29841”. First of all, it should be “North Augusta”. Secondly, it isn’t actually in Georgia — it’s in South Carolina — though the mistake is understandable. Since “N” is always going to be “North” at the beginning of the city, that one’s not too hard.
A more complicated example is “Grey Town, OH 43423.” This one has three errors — there shouldn’t be a space between “Grey” and “Town,” this particular instance of “Grey” should actually be “Gray” (both exist in the US, though the latter is more common), and the last two digits of the zip code need to be flipped. All of them are simple errors (flipped digits in the zip code is probably the most common error — it takes an uber-postal geek or a local to be able to recognize that kind of problem without software), and all of them I can fix individually in code, but processing time goes up drastically when I try to catch several problems at once.
Anyway, time to work, so I’ll end with a few final stats. 75% of the corrections I made were attributable to the following five rules, in order:
- Expanding abbreviations at the beginning of the city (e.g. “Ft Collins” should be “Fort Collins”)
- One letter of the city was wrong (e.g. “Fort Collens”)
- One letter of the city was missing (e.g. “Fort Colins”)
- City was spelled incorrectly and I added it to a “common misspells” list (e.g. “Cincinatti”)
- City had one more letter than it should have had (e.g. “Fort Colllins”)
Yes, the post office is smart enough to figure any of these out, but it doesn’t give supporters the best of impressions if your prayer letter is addressed to them at “Ft Collens, CA 90254”, and takes an extra two weeks to get delivered because the post office’s machines obligingly sent it to zip code 90254 (Hermosa Beach, CA) instead of 80524 (Fort Collins, CO). The City+State+Zip also need to be correct on every envelope before you can get any kind of discount for sending standard or non-profit mail.
Once I get a little caught up on work, I’ll probably write about why I’m planning to write my own software to do all of this matching rather than using an off-the-shelf product (there are several vendors who write this kind of software, and I have demo discs from a number of them).







