Clustering Data and Exact Matches

One record to rule them all…

In my full-time job I’m both a Customer Insight Manager and a data developer.

We have recently developed a “Single Customer View” (SCV), or “Single Customer Record”.

Our customer data is UK business data. It is possible for businesses to have records within our two different billing systems with variations of their name, different locations, different people responsible for paying our business, different accounts etc. So despite having a master Customer reference in each billing system, the reality is that business may be represented across multiples of these.

This makes marketing and analytics hard.

At it’s heart, our SCV clusters customer records from two different systems, and produces one master record that represents the customer. This produces a table like so:

SCV grouped data

Now our SCV uses several rules of matching data, in combination with each other.

The process can basically be described as:

  • Key matching fields are converted into match codes at various sensitivities.
  • Match codes are then joined together to create clusters.
  • Several passes are made of the matching rules, so super-clusters can be created.

With this top-level SCV we can now refer to a single customer entity, despite it’s data being spread across different systems and in multiple top-level records.

That’s great for marketing, but can we use for X… ?

Operationally, we have a scenario where our spread of customer data over multiple business IDs causes us a problem. An online system can only do things on a per-business ID basis and so customers cannot administer their whole online accounts with us. They end up needing multiple logins.

We have a merging facility, but this requires an awful lot of human effort to do all the necessary checks.

I was asked to quickly estimate how many records linked together by our SCV we could just merge automatically, based on them having a 100% match across all key fields. This included things such as the business name, their HQ address’s post code.

Getting a view of this quickly, could easily have been a pain. The data looks a little like this:

SCV grouped with key fields

In this example, SCV 1 has an exact match across both records. SCV 2 does not.

In order to make this assessment, I brought the data into SAS and started processing it.

The Business Name and Post Code fields were ran through an MD5 checksum routine, and converted to numbers. In SAS base, I used this code to do this.


The data now looks a little like this:

SCV data MD5 numeric checksums

Now it’s a matter of grouping the data by the SCV Cluster ID, and taking the MIN/MAX values of each of the numeric fields, to assess whether the cluster has a 100% similarity across it’s source records.

In SAS this is like so:

,MIN(BusinessNameNumeric) as MinBusinessNameNum
,MAX(BusinessNameNumeric) as MaxBusinessNameNum
,MIN(PostCodeNumeric) as MinPostCodeNum
,MAX(PostCodeNumeric) as MaxPostCodeNum
FROM SCVNumericFields;

/* Join the Max/Min values back on to source data */ 
FROM SCVNumericFields as A
LEFT JOIN SCVAccess1 as B ON A.SCVClusterID = B.SCVClusterID;

/* Create assessment fields */
DATA SCVAccess3;
SET SCVAccess2;
format BusinessNameSame 1.; BusinessNameSame = 0;
format PostCodeSame 1.; PostCodeSame = 0;
format AllSame 1.; AllSame = 0;
IF MinBusinessNameNum = MaxBusinessNameNum THEN BusinessNameSame = 1;
IF MinPostCodeNum = MaxPostCodeNum THEN PostCodeSame = 1;
IF BusinessNameSame = 1 AND PostCodeSame = 1 THEN AllSame = 1;

Now I can filter my data based on whether the whole cluster is the same or not. I could even check for partial similarities of individual fields, using the fields BusinessNameSame or PostCodeNameSame separately.

The data looks like so:

Final SCV data with filtering fields

You can see here that the first cluster shows it’s eligible for merging all the records, and the second does not.

Finishing Up

I don’t think any of this is really rocket science. A lot of the manipulation here is data-dev  / SAS base 101 really.

What this gave me though was the ability to answer a business question pretty quickly, and give some good estimates back to senior management.

It turns out about 20% of records could be automatically merged with no real detriment to the customer. This is a sizeable win. These customers will see a true benefit when they go to use our services that rely on this data being all together.

Although largely unnecessary, a human is still in the loop for assessing the data before it goes through bulk processing.

I don’t really intend to write so much about my data dev work here, but this was something that happened that was fresh in my mind and it seemed worth writing up!

AVAssetImageGeneratorCompletionHandler and Swift

Posting this here for my own probable future reference.

I’m picking up Swift in small bursts. Usually where I have a small screen that doesn’t rely on any of the Objective C based library I have within GoVJ.

I had a UITableViewController that I wanted to load thumbnails into from a collection of videos held as an array of AVAssets.

// Configure the cell

let assetUrl = demoContent.objectAtIndex(indexPath.row) as! NSURL
let asset: AVAsset = AVAsset(URL:assetUrl) as AVAsset
let imageGenerator = AVAssetImageGenerator(asset: asset);
imageGenerator.maximumSize = CGSize(width: 640,height: 480)
imageGenerator.apertureMode = AVAssetImageGeneratorApertureModeProductionAperture;

imageGenerator.appliesPreferredTrackTransform = true;

imageGenerator.requestedTimeToleranceAfter = kCMTimeZero

imageGenerator.requestedTimeToleranceBefore = kCMTimeZero

So far so good, everything is basically the same as Objective C.

I then create a CMTime value, that refers to the middle of the video file;

// Create thumbNail at middle of loop

let tVal = NSValue(CMTime: CMTimeMultiplyByFloat64(asset.duration, 0.5))

I then call generateCGImagesAsynchronouslyForTimes. I want to be able to chuck my thumbNail setting code into it’s block handler. This is fairly straight forwards in objective C, and I understand what I’m doing there in creating the block.

However Swift’s block formation eluded me, in particular the way ‘in’ follows the parameters and then you type the code you want to execute.

This is what I have now to set my thumbNails:

imageGenerator.generateCGImagesAsynchronouslyForTimes([tVal], completionHandler: {(_, im:CGImage?, _, _, e:NSError?) in 

if let img = im {
dispatch_async(dispatch_get_main_queue()) {
cell.demoContentThumbNail.image = UIImage(CGImage: img)
} else {
print("failed in generating thumbnail")

return cell

Through the eyes of a child

This weekend we upgraded my wife’s iPad. This meant that my older iPad 3 and her original iPad mini 1 were available for our kids.

They had been using 2012 nexus tablets. Arguably these were from a similar era to the iPads but they lag so much when doing certain tasks now. The iPads do too at times but nowhere near as much. It’s a statement on how well iOS9 can run on older devices perhaps but that’s besides the point of this post.

What was really awesome was setting up both children iCloud accounts, putting them on family sharing and being able to let them just explore their new devices and play.

My eldest was really pleased to be able to iMessage. Both kids spent a good while sending silly photos and picture messages to each other and me and my wife.

I started typing messages back from my Mac. When I told my eldest that was how I was chating he was amazed and had to check it out. We had a chat about how messages are routed through the net, and how it all syncs up on whatever device I’m on.

Log of chat
Log of chat

We played with FaceTime too, which he thinks is awesome.

The thing that really struck me, was seeing all this anew from a child’s point of view. The discovery process they went through with the UI and the capabilities of the device were all found easily through play.

The power of these devices to them is very much in the sharing and communication available to them straight away.

There’s a lot to be said for that.