I usually watch the news in the evenings. Unfortunately, the “all maps all the time” channels grow boring fast. So instead, I spent my evenings over the last week looking at barcodes. (I know, sounds equally boring. But I was fascinated!)
Barcodes are everywhere. EAN/UPC codes are on every product and package (and on every book at the library). QR codes are common for web-related content. Even my Halloween candy bars have a QR code that takes you to their nutritional breakdown. (See! Chocolate is nutritious!)
While UPC and QR codes are common, there are dozens of other types of barcodes. Visually, they all look different. (Wikipedia has a great list of the most common formats.) They basically break down into 1D, 2D, and 3D barcodes.
- 1D: One-dimensional barcodes are your typical zebra stripes. (Unless you know exactly what to look for, different formats all look pretty much the same to a human.) Most are designed to be read by a laser. Basically, a laser scans a line and measures the dark and light reflections. The dark/light pattern identifies the code and encoded data. (This is why the grocery store scanner sometimes has trouble reading a product code. If the laser can’t see the whole barcode, isn’t aligned so that it crosses the entire barcode, or the code is distorted by wrinkles or folds, then it won’t scan.)
Most 1D barcodes either store numbers or letters-and-numbers. While they can be read by old tech (lasers), they take up a lot of real estate on the product packaging.
- 2D: 2D barcodes are usually square or rectangular and have dot dithering in them. Most are designed to be read by a camera. For example, a camera may stare at an assembly line and watch products as they pass on a conveyor belt. The camera frames are quickly passed to a barcode decoder that “sees” the 2D barcode and processes it.
Humans can usually identify these barcode types on sight. For example, QR codes have big black squares in three of the corners. Aztec has one big square in the middle. PDF417 (usually found on ID cards) is rectangular with thick black strips on the ends and lots of dot dithering in the middle.
In contrast to 1D, most 2D barcodes can store much more information in a smaller space.
- 3D: 3D barcodes typically look like 2D barcodes, but use color as a third dimension. While they are used in very niche markets, I haven’t encountered any in typical day-to-day use.
Barcodes are everywhere, and each serves a different purpose. I recently received a package in the mail that had five (5) different barcode formats on it: EAN/UPC (also called EAN-13) to identify the item, ITF-14 describing the box shape and weight, Interleaved 2 of 5 as a tracking number, Codabar as a different tracking number, and a 2D barcode called ‘MaxiCode’ that is used by UPS. I suspect that everyone uses different barcode formats so that a barcode from one vendor doesn’t cause confusion with a different vendor.
My main interest last week was the Data Matrix 2D barcode format. These are usually teeny tiny squares with dot dithering. Along the left and bottom edges is a solid “L” line, while the top and right edges are every-other-square dot dithered.
(This encodes: “ABCDefgh1234”.)
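The border described above is easy to express in code. Here is a minimal Python sketch, assuming the symbol has already been cropped, rotated upright, and sampled into a grid with one value per module (True = dark). The function names and the `blank_symbol` helper are hypothetical, made up for this illustration; a real decoder has to find and straighten the symbol first, and larger symbols add internal alignment bars that this ignores.

```python
def has_data_matrix_border(grid):
    """Return True if a module grid (True = dark) has the Data Matrix border.

    Assumes a single-region symbol that is cropped, upright, and sampled
    to exactly one value per module.
    """
    rows, cols = len(grid), len(grid[0])
    # Solid "L": the left column and the bottom row are entirely dark.
    left_solid = all(grid[r][0] for r in range(rows))
    bottom_solid = all(grid[rows - 1][c] for c in range(cols))
    # Every-other-square edges: the top row alternates starting dark at the
    # left, and the right column alternates starting dark at the bottom.
    top_alternates = all(grid[0][c] == (c % 2 == 0) for c in range(cols))
    right_alternates = all(
        grid[r][cols - 1] == ((rows - 1 - r) % 2 == 0) for r in range(rows)
    )
    return left_solid and bottom_solid and top_alternates and right_alternates


def blank_symbol(size):
    """Build a hypothetical test symbol: border only, all data modules light."""
    grid = [[False] * size for _ in range(size)]
    for i in range(size):
        grid[i][0] = True                               # left edge (solid)
        grid[size - 1][i] = True                        # bottom edge (solid)
        grid[0][i] = (i % 2 == 0)                       # top edge (alternating)
        grid[i][size - 1] = ((size - 1 - i) % 2 == 0)   # right edge (alternating)
    return grid
```

Calling `has_data_matrix_border(blank_symbol(10))` should return True, and darkening any light module on the top edge should make it return False.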
As soon as I started looking for them, I saw them everywhere. They are much more common than QR codes. Some of the places I’ve seen them:
- Letters from banks and insurance companies, and utility bills
- Prescription pill bottles
- Mouthwash bottles, dental floss containers, and toothpaste tubes
- My voter ballot stub (so I can check to see if my vote was received)
- Letters with prepaid postage stamps
Most of the time, the Data Matrix barcode only contains numbers or a few characters. The encoded data usually has meaning to the manufacturers, but not to regular people. (The box for my Raspberry Pi has a Data Matrix code that says “WP”. I have no idea what that stands for, or whether it is related to the computer or the box it came in.) Other times, the data contains product and serial number information. For example, someone uploaded a picture of a computer chip to FotoForensics:
The text on the chip is low-contrast, but the barcode is clear enough for decoding. It says “9JF0959V00099_100-000000065”. Those two numbers (9JF0959V00099 and 100-000000065) are also printed in human-readable text. However, I don’t know if either is a serial number, part number, batch number, or something else.
My mouthwash (Listerine) has three of these Data Matrix codes on the bottle; one on each sticker. These do not appear to be unique identifiers. Instead, I think the codes identify the type of sticker. As the bottles go down the assembly line, they probably have cameras that double-check that the right stickers are on the right bottles.
The papers that I get from banks, investment firms, insurance companies, and utilities all seem to have unique identifiers. One of my friends said that he and his wife both get the same letters from their banks. Even though the text is the same, the Data Matrix numbers are different on every single page. They don’t seem to contain personal or account information; they are just numbers. However, I bet the bank can scan in that code and identify the exact mailing, page, and recipient. I suspect that this is for mailing verification; they make sure the letter contains the correct pages to the correct person before sealing the envelope. And if they have a printer disaster (with pages spewing all over the floor), they can identify which page belongs to which intended recipient.
From numbers to GS1
As far as I can tell, none of my prescription bottles contain personal information in the barcode. (Good! Otherwise, it could be a HIPAA violation.) While some bottles just contained numbers, others contained data in a GS1 format. GS1 is a standardized data format that only a few industries seem to use.
Personally, I think GS1 is a nightmare format. It’s mostly numeric. The initial numbers identify the type of data field, then comes the value. The problem is that the length of each value varies based on the data field. For example:
- An initial “01” defines the Global Trade Item Number (GTIN) field. The GTIN length is always 14 digits, including the final checksum digit.
- An initial “11” defines the production date. The value is exactly 6 digits that denote the date (YYMMDD; let’s ignore the obvious “Y2K” date issue).
- The initial sequence “10” defines the batch or lot number. This is a variable length field containing letters and numbers.
The problem here is that nothing in the data defines the value’s type or length. You cannot decode an arbitrary GS1 sequence unless you know every single field type and the length of every single field. (If you implement this, then you need a long hard-coded list of prefixes, lengths, and field names.)
Making matters more complicated, GS1 changes the rules based on the barcode format. For example:
- Numeric-only barcodes (like EAN/UPC or DataBar) can only have one variable length field and it must be at the end.
- Formats that support binary characters, like QR Codes, can use the ASCII GS character (0x1d) to mark the end of a variable length data set.
- With Data Matrix, you must start with an FNC1 code (not an ASCII character) and use the FNC1 character to identify the end of variable length fields… Unless it’s the last field, where the code can leave off the final FNC1.
I’m not making up the level of complexity. No wonder this standard isn’t widely adopted.
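To make the lookup-table problem concrete, here is a minimal Python sketch of a GS1 field splitter. It is not a full implementation: the table holds only five two-digit prefixes (the real list is far longer and includes three- and four-digit prefixes), and it assumes the barcode reader has already turned any FNC1 separators into the ASCII GS character (0x1d). The "17" and "21" entries come from the GS1 table rather than from the examples above, and the names `AI_TABLE` and `parse_gs1` are made up for this illustration.

```python
GS = "\x1d"  # ASCII group separator, ends a variable-length field

# Tiny excerpt of the prefix table: prefix -> (field name, fixed length).
# A length of None marks a variable-length field.
AI_TABLE = {
    "01": ("GTIN", 14),
    "11": ("production date (YYMMDD)", 6),
    "17": ("expiration date (YYMMDD)", 6),
    "10": ("batch/lot number", None),
    "21": ("serial number", None),
}


def parse_gs1(data):
    """Split a GS1 string into (prefix, field name, value) tuples.

    Assumes every prefix is two digits long and appears in AI_TABLE.
    """
    fields = []
    pos = 0
    while pos < len(data):
        if data[pos] == GS:           # skip a separator between fields
            pos += 1
            continue
        prefix = data[pos:pos + 2]
        if prefix not in AI_TABLE:
            raise ValueError(f"unknown prefix {prefix!r} at offset {pos}")
        name, length = AI_TABLE[prefix]
        pos += 2
        if length is None:
            # Variable length: runs to the next separator or the end of data.
            end = data.find(GS, pos)
            if end == -1:
                end = len(data)
        else:
            # Fixed length: the table is the only source of the length.
            end = pos + length
            if end > len(data):
                raise ValueError(f"field {prefix} is truncated")
        fields.append((prefix, name, data[pos:end]))
        pos = end
    return fields
```

Feeding it a made-up string such as `"01" + "00312345678906" + "11" + "230115" + "10" + "ABC123" + "\x1d" + "21" + "XYZ789"` yields four fields. Without the table entry for a prefix, the parser has no way to know where that field ends, which is exactly the complaint.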
On the plus side, pill bottles are really interesting. They usually include the GTIN, batch number, serial number, and date information. Embedded in the GTIN is a special-case code (a GTIN value that begins with a “3” after any leading zeros). This means that the rest of the GTIN contains an identifier from the Food and Drug Administration’s National Drug Code (NDC). You can look up the GTIN value in the NDC registry and identify the exact type of medication: proprietary name, generic name, manufacturer, type of drug (syringe, pill, etc.), and even the shape and size (e.g., 1 pill, white oval with the number “7” printed on it).
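The GTIN checksum and the NDC extraction can be sketched in a few lines of Python. The check digit uses the standard GS1 weighting (3, 1, 3, 1, ... across the first 13 digits). The NDC part rests on an assumption that should be checked against real bottles: that the 14 digits are laid out as one packaging digit, then "03", then the 10 NDC digits, then the check digit. The function names are made up, and the sample value below is invented, not a real drug code.

```python
def gtin14_check_digit(first13):
    """Compute the GS1 check digit for the first 13 digits of a GTIN-14."""
    # Weights alternate 3, 1, 3, 1, ... starting from the leftmost digit.
    total = sum(
        int(digit) * (3 if index % 2 == 0 else 1)
        for index, digit in enumerate(first13)
    )
    return (10 - total % 10) % 10


def is_valid_gtin14(gtin):
    """Return True if gtin is 14 ASCII digits with a correct check digit."""
    if len(gtin) != 14 or not (gtin.isascii() and gtin.isdigit()):
        return False
    return gtin14_check_digit(gtin[:13]) == int(gtin[13])


def ndc_digits_from_gtin14(gtin):
    """Return the 10 NDC digits if the GTIN looks like a US drug code.

    Assumed layout: packaging digit + "03" + 10-digit NDC + check digit.
    Returns None if the check digit is wrong or the "03" marker is missing.
    """
    if not is_valid_gtin14(gtin) or gtin[1:3] != "03":
        return None
    return gtin[3:13]
```

For the invented value "00312345678906", `ndc_digits_from_gtin14` returns "1234567890". The barcode drops the hyphens that the NDC registry uses, so the 10 digits may need to be tried with more than one hyphen placement when searching.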
Data Matrix usually stores simple numbers or letters, but it can also store large binary sequences. (Large data sets usually need 2-4 Data Matrix blocks next to each other.) These are common for non-GS1 data sets, like prepaid postage stamps.
OMG… Postage stamps are amazing. Their Data Matrix codes come in two types: square (2×2) and rectangle (1×2 or 2×1). For example:
Both formats contain digital information and human-readable printed text. While there is some overlap, there is also information only found in the digital content and other information only available in printed text.
The small format (1×2 or 2×1) is called IBI Lite. The postage meter or printer is called a “Postal Security Device” (PSD). The IBI Lite barcode contains a code that represents the PSD’s make and model (in this example, “17”), unique device serial number (13501011), and the amount of the stamp, down to tenths of a penny ($0.293). The data also contains the piece counter. In this case, the piece counter says 4,012,763. (Has this PSD really printed over four million stamps?)
In contrast to the IBI Lite, the full (2×2) format includes:
- The PSD make and model: 02 1W. Unlike IBI Lite, this is not a short code; this is the full information that is also visible in the printed text.
- The PSD serial number: 1365457
- A pseudo-unique stamp serial number: 3b768b7
- The postage amount: $0.450
- The date the postage was intended to be mailed: 2013-01-11. (The PSD allows you to print stamps for mailing on a later date, in case you print stamps on Monday but plan to head to the post office on Wednesday.)
- The total postage printed. In this example, the PSD had already printed $91,961.780 worth of stamps.
- With some stamps (not this example), it also includes the current pre-paid balance. Not only do you know how much the sender has spent, you also know how much they have remaining.
- The zip code where the PSD was licensed. This is not necessarily the zip code where the sender is located and it may not match the printed zip code number. In this example, it’s 33860 (central Florida).
- Some stamp formats (not this example) include the recipient’s zip code.
Both formats also include a cryptographic signature that can only be validated by the post office. The signature prevents people from generating arbitrary stamps for free.
I suspect that the serial numbers, total postage spent, and other information are designed to identify fraud. For example, if the date and dollar amount do not match what the USPS has on file, then they can flag it as an irregularity.
I’ve been scanning and decoding the various barcodes from my junk mail, as well as bills. I’m just stunned by the information. For example:
- Most of the bulk and large corporate mailings have spent hundreds of thousands on stamps. I had one bulk mailer who had already spent over $1 million and had a remaining pre-paid balance of over $90,000. (If they can afford this, then junk mail must be very lucrative!)
- Going through my recycle bin, I found two bulk mailings that appeared to have unrelated content, but they had the same unique PSD serial number embedded in the barcode. The mailings were related: the same postal meter printed the stamps!
- The people who mow my lawn used the same postal printer two months in a row. (Common for a small company.) If I assume all stamps are the same amount and they print the customer invoices in the same order each month… then I know exactly how many customers they invoice each month. (For a small company, they are much larger than I expected!)
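The customer-count estimate in that last item is just a subtraction and a division. Here is a rough Python sketch using the total-postage-printed value from two consecutive monthly invoices. The dollar figures in the usage note are invented for illustration, and the result is only meaningful under the assumptions already stated: every stamp costs the same, the meter prints nothing but invoices, and my invoice sits at the same place in the print run each month.

```python
from decimal import Decimal


def stamps_between(total_earlier, total_later, stamp_price):
    """Estimate how many stamps a meter printed between two readings.

    All arguments are Decimal dollar amounts. Raises ValueError if the
    difference is not a whole number of stamps, which means the
    same-price assumption does not hold.
    """
    difference = total_later - total_earlier
    if difference < 0:
        raise ValueError("later reading is smaller than earlier reading")
    count, remainder = divmod(difference, stamp_price)
    if remainder != 0:
        raise ValueError("difference is not a whole number of stamps")
    return int(count)
```

With invented readings of $1,200.150 last month and $1,267.650 this month at $0.450 per stamp, `stamps_between(Decimal("1200.150"), Decimal("1267.650"), Decimal("0.450"))` gives 150 stamps, so roughly 150 invoices per month. Decimal is used instead of float so that tenths of a penny divide exactly.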
I described all of this to a few coworkers. After taking it all in, one of them remarked: You know you’re burnt out on politics when you find black-and-white barcodes fascinating.