ICD9, the ninth revision of the International Statistical Classification of Diseases, is a coding system commonly used in medical databases to record patient conditions. A handy feature is that it’s hierarchical. Just as the Dewey Decimal System lets someone quickly find a book on geometry (library aisle 500 for “Natural sciences and mathematics”, bookcase 510 for “Mathematics”, and shelf 516 for “Geometry”), the ICD9 hierarchy lets researchers drill down into disease categories, or move up from a patient’s specific diagnosis to a more general condition.
Introduction
I’ll be using the ICD9 taxonomy in later notebooks as a convenient way to reduce the number of features for a classifier. The problem I’ll be working on is predicting whether a patient has notes with a particular label (e.g. “substance abuse”) from their ICD9 codes. The difficulty is that I don’t have very many labeled notes and there are thousands of ICD9 codes. A classifier trained on such a dataset is likely to overfit: it can essentially memorize the training set and score well there, but it won’t generalize to new patients. By using parent conditions (e.g. “Intestinal infectious diseases” rather than “Salmonella gastroenteritis”) we can group together patients who have meaningfully similar conditions, which leaves far fewer distinct features to learn from.
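To make the grouping idea concrete, here is a minimal sketch of replacing specific codes with a parent category before counting features. The `PARENT_OF` mapping, the `group_codes` helper, and the patient data are all made up for illustration; in practice the mapping would come from the ICD9 hierarchy rather than being typed out by hand.

```python
from collections import Counter

# Hand-written mapping for illustration only (an assumption, not library output).
PARENT_OF = {
    '003.0': '001-009',  # Salmonella gastroenteritis -> Intestinal infectious diseases
    '004.8': '001-009',
}

# Hypothetical patients and their ICD9 codes (made-up data).
patients = {
    'patient_a': ['003.0'],
    'patient_b': ['004.8'],
    'patient_c': ['003.0', '004.8'],
}


def group_codes(codes, parent_of):
    """Count parent categories, keeping a code as-is if it has no known parent."""
    return Counter(parent_of.get(code, code) for code in codes)


# Distinct features before grouping: every specific code seen in the data.
specific = {code for codes in patients.values() for code in codes}

# Per-patient counts after grouping, and the distinct grouped features.
grouped = {pid: group_codes(codes, PARENT_OF) for pid, codes in patients.items()}
general = {cat for counts in grouped.values() for cat in counts}

print(len(specific), 'specific codes ->', len(general), 'grouped feature(s)')
```

With only two codes the saving is trivial, but the same collapse applied across thousands of codes is what makes a small labeled dataset workable.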
Happily, I found a very convenient Python library for navigating this hierarchy, located here. The notebook below walks through a few simple operations with it, and in a later post I’ll show how I combined it with scikit-learn to help select medical notes for further annotation.
Notebook
import sys
sys.path.append('icd9')
from icd9 import ICD9
# feel free to replace with your path to the json file
tree = ICD9('icd9/codes.json')
# list of top level codes (e.g., '001-139', ...)
toplevelnodes = tree.children
toplevelcodes = [node.code for node in toplevelnodes]
print('\t'.join(toplevelcodes))
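# look up a node by its code
node = tree.find('003')
node.description
# codes associated with this node
node.codes
# a more specific code, one level further down
code = tree.find('003.0')
code.description
code
# description of one of the leaves under '003'
node.leaves[2].description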
def print_tree(node):
    if node is not None:
        print('Parents:')
        for c in node.parents:
            print('- {}: {}'.format(c.code, c.description))
        print('\n-> {}: {}\n'.format(node.code, node.description))
        print('Children:')
        for c in node.children:
            print('- {}: {}'.format(c.code, c.description))
print_tree(node)
print_tree(tree.find('003.0'))
print_tree(tree.find('004.8'))
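The `parents` list used in `print_tree` could also be used to roll a specific code up to a chosen level of the hierarchy. The sketch below uses a small stand-in class so that it runs without the library. `StubNode`, `ancestor_code`, and the 'ROOT' label are hypothetical names introduced here, and the ordering of `parents` (root first, the node itself last) is an assumption that should be checked against the real library before relying on it.

```python
class StubNode:
    """Stand-in exposing only `code`, `parent` and `parents` (not the real library class)."""

    def __init__(self, code, parent=None):
        self.code = code
        self.parent = parent

    @property
    def parents(self):
        # Assumed ordering: root first, the node itself last.
        chain, node = [], self
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain[::-1]


def ancestor_code(node, level):
    """Return the code `level` steps below the root, or the node's own code if it is shallower."""
    chain = node.parents
    return chain[min(level, len(chain) - 1)].code


# A tiny hand-built branch of the hierarchy for illustration.
root = StubNode('ROOT')
chapter = StubNode('001-139', root)
block = StubNode('001-009', chapter)
category = StubNode('003', block)
leaf = StubNode('003.0', category)

print(ancestor_code(leaf, 2))  # prints 001-009
```

Picking the level is a trade-off: a shallow level gives very few, very broad features, while a deep level keeps more detail but leaves more features to fit.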