Normalisation of data

This post used to be called Don’t confuse validation with filtering.  This was meant to be about normalising data, but at the time, filtering was at the forefront of my mine due to work that did some of the things discussed here.  Filtering, validation, and normalising are all essential parts of handling data.  Finally, I want to stress that while normalisation is good, it’s not easy.  Making assumptions is fine as long as you make educated assumptions.  The reason you don’t see normalisation as much as you do is because it’s difficult to get right.  I’ve attempted to be careful in my examples, and have suggested seeking confirmation from the user before continuing.

Brandon Savage posted an interesting article today about how validation blind spots hurt real people.  He gave four succinct rules.

  1. Accept valid data in any form provided by the user.
  2. Where possible, use well-developed validation libraries.
  3. Do not place artificial limits on valid data.
  4. Do place valid limits on specific data.

All good advice.  However, it misses out on filtering.  You see, filtering is different from validation.  Filtering is the art of taking bad data and turning it into good data.  It might sound difficult, but it’s actually easier than you think, and will ease the pain in using your software.  Let’s see some examples.

Usernames

Passwords are secret.  Usernames, however, are usually not.  Even if they are private, like an email address, they are usually not the secure part of the application.  Couple this with mobile devices that auto-capitalize, if you’re requiring case-sensitive usernames, you are making things more difficult then they should be for your users.  It’s a pain having to go back, lower case a single letter at the beginning, and then retype a fairly long password just to log in because your site decided that ‘Jasonlotito’ and ‘jasonlotito’ are not the same user.

At the same time, if I do enter my email address, and you ask for a username, then just go with what I give you.  I presume that their is only one email address per account.  If I’ve forgotten my username for your site, but I know I have an email address tied to an account, just let me use that.  You’ve seen I’ve entered my email address, you can then look up by email address, get my username from there, and proceed to validate.

Credit cards

Credit card numbers follow simple rules.  If I enter my credit card number, you know what card I’m using, so asking me is silly.  You don’t even need to know the full number on the card to figure out the card type.  Once I’ve entered enough numbers in for you to know, simply change the pull down to the correct type, and change the icon.  It will give visual feedback that yes, we know what card you are using.  Their is no reason not to do this.

Also, if I put in dashes, or dots, or spaces, or whatever, you can deal with it.  Ignore them.  Simply use the numbers.  You don’t need to exclude characters, simply white list numbers.

While we are here, expiration dates are easy, and yet, somehow, people get them wrong.  Maybe I’m missing out on some details here, but every credit card I’ve ever seen lists expiration dates as MM/YY.  This is a sign that maybe your forms should follow suit.  This means your month should be ’03’, not March (03).  Just ‘3’ would suffice.  The year should be equally clear: ’15’, not ‘2015’.  I promise you, no one is assuming 1915, or even 2115.

Phone numbers

Filtering phone numbers is hard.  I know.  I’ve had to do it.  The system needed a filtered number because we had to call back in real time from an automated system located in the US.  This mean we needed a properly formatted number that could be dialed.  This isn’t easy, because when you ask for a phone number, people are going to start adding in country codes (but they might).  My suggestion is, unless you are having to do some automated system is to simple not filter numbers unless you can be sure, 100%, of what you are doing.  Even then, you probably don’t need to do it.

Simply ask the user for their phone number, and store whatever they put.  If you need to call them, you can find their country code, or area code if you need to.

Email addresses

Email addresses can be tricky.  You want to make sure they enter in the right email address. A simple mistake can cause problems.  I’ve, on occasion, type in jasonlottio instead of jasonlotito, and the problem was painful to resolve.  On the username side, you can’t do much.  However, after you hit the ampersand, you can start some simple checks.  If someone types in gmail or hotmail or yahoo and leaves off the .com, you can make assumptions.  You can validate these assumptions with the user on the next page.  Transposing letters (htomail) might be something you should look out for, too.  Compile a list of common email providers, and then check how close they are to the users non-common provider name.  If they are close, verify with the user if they didn’t misspell the name.

Assumption

The key to all these filtering techniques is assumption.  It’s not always perfect, but it’s better than failing.  After all, if I’m in Canada, and my phone number is a 10 digit phone number, you can make the assumption that to call me long distance, you’ll need to add a 1.  You can assume that if my credit card number starts with 4111, I’m using a Visa.  Their are lots of assumptions you can make that will make things easier for the user.  Go ahead.  It’s okay.

6 thoughts on “Normalisation of data”

  1. “Filtering is the art of taking bad data and turning it into good data.”

    You’re confusing filtering with normalization, which is generally frowned upon. It’s best to make sure data is valid (validate) and only allow valid data (filter) than to try to modify bad data to make it good (normalize). Normalization flaws tend to create worse vulnerabilities than filtering flaws.

    There’s a really good book that covers this specific topic, but I’m blanking on the name of it.

    1. Normalisation is a better word, but I disagree with it being frowned on. Going back to the Credit Card example, penalising a user because he didn’t change the default option of Visa and entered a MasterCard number is frustrating. Normalising a phone number is better than demanding a user enter the correct phone as it would be dialled from a call center in India, for example. Normalising a username or email address to a lowercase string is better than denying a user entry because their phone auto-capitalises the first letter.

      Taking data and filtering out the parts that don’t make sense and having it make sense is what computers are for. Indeed, I don’t suggest doing this for every occasion, and not something you should apply blindly.

      The examples I posed (phone numbers, credit cards, emails, etc) are all real world problems I’ve had to deal with. Pushing the responsibility back onto the user might be the easiest way of doing things, but doesn’t encourage better software.

      1. There’s nothing wrong with advocating for format-independent validation or case insensitivity. If (123) 456-7890 is a valid phone number, then so is 123-456-7890. No one’s going to disagree with that.

        I do wish you would change the title of this post. I expected an explanation of the subtle difference between validation and filtering, and if I didn’t already know the difference very well, I would be left utterly confused.

        Maybe make the title something about why stricter is not always better. Your examples are good ones, because they demonstrate a strictness that applies solely to the format of the data. That’s not only unhelpful, it’s annoying to users for exactly the reasons you mention.

        1. I’ve changed the title at your suggestion to clarify the topic. I’ve also added a bit to the beginning making it clear normalising data is not something you just *do*. It’s fairly difficult, and usually less straightforward then validating and filtering data.

          On a side note, I’d been working with Filters in Zend Framework lately that normalise data, which is where the idea of the name came from (besides a play on Brandon’s title).

          Thanks for the feedback in more than 140 characters. =)

  2. “Their are lots of assumptions you can make that will make things easier for the user. Go ahead. It’s okay.” I’m not sure if that’s such a good idea. Unless you know your userbase very very very well (3 verys)..

    Assumptions the app makes are “magic” when they work, and the app is broken when it’s not, normalizing google.co to google.com could be a problem, as the email address may be google.co.uk, or something else. This level of assumption should be done on a case by case basis, once you know your audience very well.

    Just my inflated $0.02 (USD :))

    1. Very true. And I hope no one assumes that it’s simple to implement. Keep in mind I did suggest that you verify with the user the correctness of the email address. If someone types in ‘jasonlotito@gmail’ for example, I wouldn’t assume ‘gmail.com’ and put it into the database straight away. Rather, I’d fix it, and validate it with the user. “We noticed your email address was incomplete. Is this correct?” That sort of thing.

      However, beyond those extremes, a lot of normalising can be done to help improve the user experience. We shouldn’t be afraid to work on that.

Leave a Reply

Your email address will not be published. Required fields are marked *