Mining the Medicare Benefits Scheme

A few weeks ago, the Australian Department of Health released an anonymized sample of more than 30 years’ worth of Medicare claim data to the public.

The intention is to encourage researchers to identify Medicare rebates that aren’t delivering value, or areas where doctors may be making mistakes in treatment advice.

It also happens to be a great dataset to demonstrate some of the big data tools developed over the past few years.

Objective

I’m not a medically-inclined person, so I needed someone to get me started with an interesting question to answer.

I turned to my brother, who’s currently training in orthopaedic surgery. He suggested I find the age distribution for patients who had undergone subacromial decompression, rotator cuff repair or shoulder replacement (shoulder surgery, in layman’s terms).

I couldn’t have asked for a better question – this was concise, to the point, and directly relevant to someone’s field of practice.

Understanding the structure

Now that I had an actual question to look at, I needed to understand how the data was structured.

The Medicare data set is hosted on the government’s data.gov.au portal, and contains anonymized records of approximately 10% of patients who had claimed under the Medicare Benefits Schedule (MBS) and Pharmaceutical Benefits Scheme (PBS) programs between 1984 and 2014.

My question didn’t concern pharmaceuticals, so I could safely leave the PBS data to one side.

The MBS data, though, contained everything I needed – yearly tables of patient claims (identified by an anonymous PIN), and a table of all patients sampled (with their gender and year of birth).

Given the size of each claim table – around 4 GB per year uncompressed – I didn’t really want to download or analyse these on my local machine. I needed a distributed cluster of machines, and my platform of choice was Microsoft Azure (though Amazon Web Services would also have been a perfectly acceptable choice).

Starting the cluster

Before I could start my analysis, I needed to get both the patient claim tables and the patient PIN table onto Azure Blob Storage (Microsoft’s distributed object store, which can act as a Hadoop-compatible filesystem).

After a few unsuccessful attempts, I concluded that Microsoft’s Azure toolset wasn’t able to instruct Blob Storage to pull files directly from the data.gov.au repository. Writing a custom C# program to do the transfer was another option, but that seemed like overkill for a once-off job.

Instead, I decided it would be easier to spin up a cluster and pull the files down directly from the shell.

My cluster consisted of four Standard DS3 v2 virtual machines, each running Ubuntu Server with a 4-core CPU and 14 GB of RAM. The choice wasn’t scientific – it just seemed to be a good balance between memory, processor speed and cost.

The great thing about distributed computing is horizontal scalability – achieving better performance simply by adding more nodes to the cluster. That would be absolutely necessary if I were dealing with terabytes of data, but the MBS dataset wasn’t quite at that level.

With the cluster running, I was able to:

  1. SSH into one of the virtual machines
  2. mount Azure File Storage as a network filesystem
  3. use bash/wget to grab the files for each year
  4. use Microsoft’s AzCopy locally to move the files from File Storage to Blob Storage

Configuring a distributed query platform

With my cluster up and running and the MBS dataset uploaded to Azure Blob Storage, I needed some way to distribute my queries across the cluster.

Since Azure Blob Storage can be used as an HDFS-compatible filesystem, I could choose from any of the Hadoop-based platforms (Hive, Spark, etc). I wanted to avoid using a NoSQL database like MongoDB or ElasticSearch, as this would require a lot of preprocessing to load the data, and wouldn’t fully leverage the benefits of the Hadoop ecosystem.

The yearly MBS tables were basically all just large spreadsheets, so a simple query tool like Apache Drill was the logical choice. With Drill, we can use SQL to directly query flat files (CSV, Parquet, JSON, etc). Drill also doesn’t require Hadoop, though with the size of our MBS dataset, we would definitely want to leverage Hadoop if we wanted reasonable query response times.
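
As a taste of what that looks like, here is a minimal, hypothetical example of Drill querying a plain CSV file through its default dfs storage plugin (the file path is made up). With Drill’s default CSV settings, fields come back as a columns array; with header extraction enabled – which the MBS queries later in this post assume – you can reference columns by name instead.

-- Hypothetical example: query a local CSV file in place with Drill's default
-- dfs storage plugin; fields arrive as a "columns" array by default
SELECT columns[0] AS first_field,
       columns[1] AS second_field
FROM dfs.`/tmp/example.csv`
LIMIT 10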

Microsoft has a great series of articles on installing Drill and configuring Blob Storage as an HDFS connector, which I don’t intend to repeat here. The only real difference in my approach was that I didn’t need a separate ZooKeeper cluster (which also avoids the need for a load-balancer to route through to the private IP addresses).

[Image: drill-cluster]

Once we were up and running across 4 nodes, we could connect the Drill cluster to Azure Blob Storage by adding our container address/keys to Drill’s core-site.xml, and creating a storage plugin connected to wasb://medicare@yyy.blob.core.windows.net/ (where medicare was the name of our Blob Storage container).

Note there were two directories inside the medicare container – /mbs, containing the yearly MBS files, and /pin containing the single PIN lookup file.

Configuring a local query tool

The Drill command line client is perfectly acceptable for executing SQL queries. My personal preference, however, is for a “notebook” tool like Apache Zeppelin. With Zeppelin, I can create notebooks to run multiple queries against multiple datasources, save and visualize the results, and commit directly to version control.

For this particular exercise, I connected Zeppelin to Drill by configuring the main Drillbit node (the virtual machine with ZooKeeper installed) as a JDBC datasource. Once connected, I could use Zeppelin to query the files inside the /mbs and /pin directories, referencing the tables as wasb.`/mbs` and wasb.`/pin`.
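
Before running anything heavier, something like the following trivial query (a sketch, reusing the claim column names that appear later in this post) would confirm from a Zeppelin paragraph that the wasb-backed tables are reachable:

-- Preview a handful of claim rows from the Blob Storage-backed MBS table
SELECT pin, dos, item
FROM wasb.`/mbs`
LIMIT 5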

[Image: zeppelin-subacromial]

Running the queries

With my local query tool configured, all the technical work was complete. Now, I needed to consider how I was going to query the data.

Each row in the yearly claim table represents a claim made by a patient for a particular MBS “item” – a specific medical consultation, procedure or service, represented by a numeric code.

The other columns – the dollar amount claimed, the practitioner’s field of medicine, etc. – weren’t really relevant to my question.

All I needed was to find all shoulder surgery items, extract the patient’s PIN (letting me find the patient’s year of birth) and the year of the operation (letting me calculate the patient’s actual age when the operation was performed).

Searching the online MBS item database, I found the word “subacromial” was associated with two MBS item numbers:

  1. 48903 – SHOULDER, decompression of subacromial space by acromioplasty, excision of coraco-acromial ligament and distal clavicle, or any combination
  2. 48909 – SHOULDER, repair of rotator cuff, including decompression of subacromial space by acromioplasty, excision of coraco-acromial ligament and distal clavicle, or any combination, not being a service associated with a service to which item 48903 applies

These two items look almost identical – the difference being that rotator cuff repairs (48909) usually require all of the items under 48903, and practitioners are prevented from “double-dipping” by claiming both 48903 and 48909.

With the relevant MBS item numbers, the SQL query itself was quite straightforward:

SELECT age, sex, COUNT(age)
FROM
  (SELECT
     pin.sex,
     -- age at time of service = year of service minus year of birth
     ((TO_NUMBER(mbs.yos,'######')) - (TO_NUMBER(pin.yob,'######'))) AS age
   FROM (SELECT BYTE_SUBSTR(dos,6,4) AS yos, pin
         FROM wasb.`/mbs` WHERE item=48909) mbs
   LEFT JOIN wasb.`/pin` AS pin
     ON pin.pin = mbs.pin) t
GROUP BY age, sex
ORDER BY age ASC

Breaking this down step by step, we:

  1. select all mbs records pertaining to item 48909
  2. take the pin and dos (date of service) columns
  3. use the BYTE_SUBSTR function to convert the dos column to a yos (year of service) column
  4. left join the results with the pin table
  5. use the TO_NUMBER function to convert the yos and yob columns to numbers
  6. subtract the yob (year of birth) column from the yos (year of surgery) column to create an age column
  7. return a list of age/gender groups with their respective counts

Step 3 was needed as date of service was stored as a string in the form 14JUL2014, and we needed the last 4 characters (2014) in numeric format.
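
A quick way to sanity-check that extraction in isolation (a sketch using a literal value, assuming Drill handles the literal the same way it handles the dos column):

-- '14JUL2014' is nine characters, so four characters from position 6 gives '2014',
-- which TO_NUMBER then converts to a number
SELECT TO_NUMBER(BYTE_SUBSTR('14JUL2014', 6, 4), '######') AS yos
FROM (VALUES(1)) t
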
class="highlight">
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
The same steps were then repeated for item 48903. I could have searched for both items within the same query, but I knew I wanted separate charts for each item, so separate queries were simpler. 
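
For reference, a combined version might have looked something like this (an untested sketch, grouping by item as well so the two procedures remain separable):

SELECT item, age, sex, COUNT(age)
FROM
  (SELECT
     mbs.item,
     pin.sex,
     ((TO_NUMBER(mbs.yos,'######')) - (TO_NUMBER(pin.yob,'######'))) AS age
   FROM (SELECT item, BYTE_SUBSTR(dos,6,4) AS yos, pin
         FROM wasb.`/mbs` WHERE item IN (48903, 48909)) mbs
   LEFT JOIN wasb.`/pin` AS pin
     ON pin.pin = mbs.pin) t
GROUP BY item, age, sex
ORDER BY item, age ASC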

Displaying the results

Zeppelin itself has some fantastic built-in graphing tools, allowing us to visualize the results as a column chart, combining both genders:

<img src="http://www.avinium.com/wp-content/uploads/2016/08/zeppelin-chart1.png" alt="zeppelin-chart1" width="850" height="299" class="alignnone size-full wp-image-169" srcset="http://www.avinium.com/wp-content/uploads/2016/08/zeppelin-chart1.png 850w, http://www.avinium.com/wp-content/uploads/2016/08/zeppelin-chart1-300x106.png 300w, http://www.avinium.com/wp-content/uploads/2016/08/zeppelin-chart1-768x270.png 768w" sizes="(max-width: 850px) 100vw, 850px" />
  
<span class="subscript">Even though I only used MBS claim tables for 2000 to 2014, it would be fairly trivial to include earlier claims &#8211; simply upload the earlier claim tables and re-run the query.</span>

Alternatively, visualizing male and female results side-by-side:

<img src="http://www.avinium.com/wp-content/uploads/2016/08/zeppelin-chart2.png" alt="zeppelin-chart2" width="850" height="299" class="alignnone size-full wp-image-170" srcset="http://www.avinium.com/wp-content/uploads/2016/08/zeppelin-chart2.png 850w, http://www.avinium.com/wp-content/uploads/2016/08/zeppelin-chart2-300x106.png 300w, http://www.avinium.com/wp-content/uploads/2016/08/zeppelin-chart2-768x270.png 768w" sizes="(max-width: 850px) 100vw, 850px" />

For these types of results, I would normally bin values to create age brackets – 16-20, 21-25, and so on. I don’t believe this is possible in Zeppelin, so the easiest way would have been to export the results as a CSV file and plot a histogram in Excel.
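
Another option would be to push the binning into the Drill query itself. A rough, untested sketch using standard FLOOR arithmetic (note it produces 5-year bins starting at multiples of 5, i.e. 15-19, 20-24, rather than the 16-20 brackets above):

SELECT sex,
       -- 5-year bucket: 15 covers ages 15-19, 20 covers 20-24, and so on
       CAST(FLOOR(age / 5) * 5 AS INTEGER) AS bracket_start,
       COUNT(*) AS claims
FROM
  (SELECT
     pin.sex,
     ((TO_NUMBER(mbs.yos,'######')) - (TO_NUMBER(pin.yob,'######'))) AS age
   FROM (SELECT BYTE_SUBSTR(dos,6,4) AS yos, pin
         FROM wasb.`/mbs` WHERE item=48909) mbs
   LEFT JOIN wasb.`/pin` AS pin
     ON pin.pin = mbs.pin) t
GROUP BY sex, CAST(FLOOR(age / 5) * 5 AS INTEGER)
ORDER BY bracket_start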

Big data in a couple of hours

From start to finish, this exercise required around 3 hours (half of which was spent unsuccessfully trying to download directly from HTTP to Azure Blob Storage). 

The queries themselves took around 3 minutes to run (and if I really needed sub-second response time, I could have simply added more nodes to the cluster).

Hopefully this gives you an appreciation for just how quickly the incredibly powerful combination of Azure/HDFS, Drill and Zeppelin lets us start diving into large datasets. My brother certainly appreciated the insight!

If anyone else wants to suggest a query, feel free to leave it in the comments or [shoot us an e-mail](mailto:contact@avinium.com).

<span class="subscript">Image courtesy of <a href="https://www.flickr.com/photos/10154402@N03/16387828665/in/photolist-qY8Uw6-2E9Zba-mF3BLM-pxJ3o5-5XmXRG-qu4nVM-59WyAA-anyTbb-89oTap-nXDwwV-5XwYAf-hfdkf-7hnzNv-maX5Qk-8HZ4zg-8HZ4zX-maX634-HMZN7J-8HZ4zB-aa4vSQ-aa1Pga-5jYNP-7yym5K-pKEPHP-5xPy8D-8shRVu-9cfTVu-5KEvJb-GH1xv-aHiCQx-BJf7W-9HPkWM-7FSaME-8Quaba-7NhVWA-foejTq-5ijgRH-69fQmu-aa1G5t-dfq9JU-gx9JL4-74U8kN-6RTbh9-7M3vfj-aa4pQf-5GtS8x-4oQ6Cp-fcafXF-5fRwSJ-7ZgE2z">Bruce Guenter</a>, available under Creative Commons Attribution 2.0 Generic licence.</span>

written by Nick Fisher