Friday, August 15, 2014

Swimming in Healthy #BigData Lakes….

Ingesting CSV files.

I have recently been on a mission to import the NPI data file that is generated each month by the Center for Medicare and Medicaid Services ( part of the Department of Health and Human Services). The NPPES monthly release is a zip file that contains all the NPI records for doctors and facilities in the USA.

The zip file is about 455MB in size. The main NPI file within that zip file is a 330 column, 4.3 million line file that is over 5.2GB in size. This is so large it will choke Excel.

If you want to take a look at the file check out this link:

If you are dealing with size of file you are going to want to get to grips with a bunch of unix commands such as:

  • head
  • tail
  • sed
  • split

You may even need to polish your bash scripting. That is what I had to resort to.

Another useful toolset is CSVKIT. This is a library that provides a bunch of utilities for manipulating .csv files. Check out CSVKIT here:

Big data platform

My big data platform for this exercise was Sqrrl. I had already had fun importing this monster file in to MongoDB.

The first challenge was to build the single server environment. This involved configuring:

  • Hadoop
  • Zookeeper
  • Accumulo
  • Sqrrl

This is a non-trivial task. But with those services running and communicating with each other the next step was to master the bulk upload of the .csv file using the Sqrrl shell.

In theory this is a simple task. But that assumes that the data is clean. That can be a big assumption.

The first step in the upload is to create a Field Description File (FDF). This is easy to do. You can create this in a text file. The start of the NPI Field Description File looks like this:


Starting from the first column (column zero) you include the field name that you want in the sqrrl dataset, and the field type (Integer, String, DateTime)

I imported the NPI Header file that comes in the monthly zip file and manipulated it with some transformations in Excel and then exported the end result to a text file.

With the FDF file created for the 330 columns of data the next step was to deal with the main NPI .csv file itself.

The first step was to check the cleanliness of the file. I used CSVKits csvclean and csvstat. CSVCLEAN didn’t report any errors so things were looking good.

The version of Sqrrl I was using didn’t appear to have a setting to ignore the header line in a CSV file when using the FDF file. So I used a sed command to remove the first line:

sed ’1d’ {input.file} >{new_output_file}

With the header line removed I now had 5.2GB and 4.3M+ lines of CSV data.

Before the file can be worked on by sqrrl it needs to be uploaded to Hadoop’s HDFS file system. to do that you use Haddop’s utility.

hadoop fs -put {source_file} {folder_in_HDFS}

Now we are in a position to use the Sqrrl shell and process the file.

first use the sqrrl shell to create the target dataset, if it doesn’t exist.

sqrrl shell -u {Sqrrl user} -p {sqrrl user password} -s {sqrrl_host:port} -e “startload –csv {folder_and_CSV_in_HDFS} –field-descriptions-file {FDF_file_in_local_file_system} –uuid-fields {Columns_to_use_for_UUID} –uuid-delimiter _ –do-upates -w -d {target_dataset}”

The theory goes that the data now is ingested from Hadoop in to sqrrl and deposited in to accumulo. But now the adventure really started.

When the ingestion fails…

The batch upload process would run for about 40,000 lines of the 4.3 million lines and fail. The MapReduce logs give very little information. The indication is that the data has some corrupt information in it. This seemed strange since CSVKit gave it a clean bill of health.

I therefore had to resort to a process of elimination to find the root cause. The challenge is that scanning 330 columns across even just a few lines of data is challenging.

The first step was to use the head command to split off a few lines at the top of the file. So I grabbed 100 lines and exported them to a new file.

I then uploaded the file to hadoop and ran the startload command. It worked. This indicated that the system was runnning and the FDF file format was okay. Going in to the sqrrl shell I was able to use select statements to pull back information from the uploaded data.

I then retried the startload command but still no dice. There is a problem with the data. When the startload command fails it tells you how many lines it successfully processed before failing. For some reason this number varied each time it ran. However the failure occurred at a point about 40,000 lines.

So my next step was to use the “wc -l” command to determine how many lines were in the file. The answer: 4336701. Breaking that file up in to chunks that were smaller than the number of lines were failure occured would mean creating a couple of hundred files. It was time to resort to some bash scripting.

I therefore basically wrote a script that took the starting CSV file, split it in to 30,000 line chunks and deposited those chunks in a temporary folder.
I then used “ls” to get the list of files and then iterated over each of them to:

  • hadoop fs -put {upload_file} {sqrrl_processing_folder}
  • sqrrl shell to run startload to process the upload file using the local FDF file.

This took a number of hours to run.

I then grabbed the jobnumbers for each sqrrl startload command and sued sed to create a shell script that would run a checkload on each job.

When I scanned the files there were two of the upload files that failed. At least I had found the culprits but each file was still 30,000 lines in size.

I basically then ran a process to split each problem file in to a series of sub files and submit them through the same process of uploading to hadoop and sqrrl.

Each cycle there would be one file upload that would fail. The problem was obviously in the data but which line? When I eventually split the lines down until there was only 1 line in each file I was able to isolate the problem.

So what was the problem?

To unix experts this will probably be a “Dohh!!” moment.

When I examined the problem lines there was one field where there was a “\” as either the only character in a field, or it was the LAST character in the field. The effect of this was to escape the quotes character that closed the field. This is what was causing the import to choke.

I could easily go in and modify the line by hand and re-submit but I wanted to work out how to do it via a command. So I basically came up with a sed command that would substitute any occurrence of ‘\”‘ in the file and replace with ‘ “‘. This worked using sed ‘s/\”/ “/’.

Problem solved! What an adventure!

[category News, Health, bigdata]
[tag health cloud Genomics, sqrrl, accumulo, hadoop, bigdata, nosql]

Mark Scrimshire
Health & Cloud Technology Consultant

Mark is available for challenging assignments at the intersection of Health and Technology using Big Data, Mobile and Cloud Technologies. If you need help to move, or create, your health applications in the cloud let’s talk.
Stay up-to-date: Twitter @ekivemark
Disclosure: I began as a Patient Engagement Advisor and am now CTO to Personiform, Inc. and their platform. Medyear is a powerful free tool that helps you collect, organize and securely share health information, however you want. Manage your own health records today. Medyear: The Power Grid for your Health.

via WordPress