In the previous post I showed a convenient way to navigate the ICD9 hierarchy with Python. Now let’s use that to extract the full taxonomy of ICD9 codes for the patients we’ll be using to train a classifier. In this post we’ll extract the original ICD9 codes for all patients of interest from the MIMIC database, look up each code’s place in the ICD9 hierarchy, and save the results for later analysis.

The full notebook is available here, but the bulk of the work happens in the accessory file structured_data_utils.py, which we import and access as sdu. In the section below I walk through using the routines in this library.

Reload any changes made to the structured data utils code

In [46]:
reload(sdu)
Out[46]:
<module 'structured_data_utils' from '/mnt/cbds_homes/ecarlson/Notebooks/mit_frequent_fliers/mit-team-code/software/notebooks/structured_data_utils.py'>

Use pandas to extract the list of unique notes and patients - the primary thing we’re looking for is the MIMIC III row id, which is used to get the MIMIC encounter ID, and from there the ICD9 diagnoses.

In [47]:
found_notes = comb_dat.loc[comb_dat['row_id_m3'].notnull()].\
    groupby(['subject_id', 'md5', 'row_id_m3']).count()['total_m3_distance'].index.tolist()
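
To see what that one-liner produces, here is a minimal sketch on a made-up frame. The column names match the ones above, but every value is invented for illustration, and the drop_duplicates variant is only an alternative for comparison, not what the notebook does.

```python
import pandas as pd

# Toy stand-in for comb_dat; all values are invented for illustration.
comb_dat = pd.DataFrame({
    'subject_id': [1, 1, 2, 3],
    'md5': ['aa', 'aa', 'bb', 'cc'],
    'row_id_m3': [10.0, 10.0, 20.0, None],
    'total_m3_distance': [0.1, 0.2, 0.3, 0.4],
})

# Keep only rows that were matched to a MIMIC III note.
matched = comb_dat.loc[comb_dat['row_id_m3'].notnull()]

# Same pattern as above: the group keys become a MultiIndex, and
# .index.tolist() turns it into a list of (subject_id, md5, row_id_m3) tuples.
found_notes = (
    matched.groupby(['subject_id', 'md5', 'row_id_m3'])
    .count()['total_m3_distance']
    .index.tolist()
)

# Alternative that yields the same triplets without an aggregation step.
alt = list(
    matched[['subject_id', 'md5', 'row_id_m3']]
    .drop_duplicates()
    .itertuples(index=False, name=None)
)

print(found_notes)
```

Each element of found_notes is a tuple, which is why the loop below reads the row id as idx[2].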

Iterate through the rows, building up a dictionary of dictionaries. note_info is a dictionary where the keys are the unique subject_id-md5-row_id triplet from the pandas line above. The values are another dictionary with 2 keys:

  • meta - note metadata, including the patient id (subject_id), encounter id (hadm_id), and associated timestamps
  • diagnoses - a list of the diagnoses associated with this encounter, including the original poorly formatted ICD9 code from MIMIC, the reformatted version (clean_icd9_code), and the label of the code
In [48]:
note_info = {}
for idx in found_notes:
    note_meta = sdu.get_note_metadata(conn, idx[2])
    note_diag = sdu.get_hadm_diagnoses(conn, note_meta['hadm_id'])
    dat = {'meta': note_meta, 'diagnoses': note_diag}
    note_info[idx] = dat

Print one element out to see how it looks

In [49]:
next(iter(note_info.values()))
Out[49]:
{'diagnoses': [{'clean_icd9_code': '410.71',
   'hadm_id': 172993,
   'icd9_code': '41071',
   'known_icd9_code': False,
   'long_title': 'Subendocardial infarction, initial episode of care',
   'seq_num': 1,
   'short_title': 'Subendo infarct, initial',
   'subject_id': 11590},
  {'clean_icd9_code': '398.91',
   'hadm_id': 172993,
   'icd9_code': '39891',
   'known_icd9_code': True,
   'long_title': 'Rheumatic heart failure (congestive)',
   'seq_num': 2,
   'short_title': 'Rheumatic heart failure',
   'subject_id': 11590},
  {'clean_icd9_code': '396.3',
   'hadm_id': 172993,
   'icd9_code': '3963',
   'known_icd9_code': True,
   'long_title': 'Mitral valve insufficiency and aortic valve insufficiency',
   'seq_num': 3,
   'short_title': 'Mitral/aortic val insuff',
   'subject_id': 11590},
  {'clean_icd9_code': '397.0',
   'hadm_id': 172993,
   'icd9_code': '3970',
   'known_icd9_code': True,
   'long_title': 'Diseases of tricuspid valve',
   'seq_num': 4,
   'short_title': 'Tricuspid valve disease',
   'subject_id': 11590},
  {'clean_icd9_code': '042',
   'hadm_id': 172993,
   'icd9_code': '042',
   'known_icd9_code': True,
   'long_title': 'Human immunodeficiency virus [HIV] disease',
   'seq_num': 5,
   'short_title': 'Human immuno virus dis',
   'subject_id': 11590},
  {'clean_icd9_code': '403.91',
   'hadm_id': 172993,
   'icd9_code': '40391',
   'known_icd9_code': False,
   'long_title': 'Hypertensive chronic kidney disease, unspecified, with chronic kidney disease stage V or end stage renal disease',
   'seq_num': 6,
   'short_title': 'Hyp kid NOS w cr kid V',
   'subject_id': 11590},
  {'clean_icd9_code': '518.81',
   'hadm_id': 172993,
   'icd9_code': '51881',
   'known_icd9_code': True,
   'long_title': 'Acute respiratory failure',
   'seq_num': 7,
   'short_title': 'Acute respiratry failure',
   'subject_id': 11590},
  {'clean_icd9_code': '414.01',
   'hadm_id': 172993,
   'icd9_code': '41401',
   'known_icd9_code': True,
   'long_title': 'Coronary atherosclerosis of native coronary artery',
   'seq_num': 8,
   'short_title': 'Crnry athrscl natve vssl',
   'subject_id': 11590},
  {'clean_icd9_code': '272.0',
   'hadm_id': 172993,
   'icd9_code': '2720',
   'known_icd9_code': True,
   'long_title': 'Pure hypercholesterolemia',
   'seq_num': 9,
   'short_title': 'Pure hypercholesterolem',
   'subject_id': 11590}],
 'meta': {'cgid': 17770,
  'chartdate': datetime.datetime(2154, 6, 3, 0, 0),
  'charttime': datetime.datetime(2154, 6, 3, 17, 30),
  'hadm_id': 172993,
  'storetime': datetime.datetime(2154, 6, 3, 17, 51),
  'subject_id': 11590}}

Now we can use this list and the ICD9 Python library to extract all of the parents for each code. Not all codes in MIMIC III are known to the library (likely due to slightly different ICD versions), so we handle that possibility by skipping the hierarchy lookup for unknown codes and logging each one once. If it’s a known code, we look up the parents and add each level of the hierarchy to the note_codes list. In both cases the original MIMIC code is recorded with level 'source'. We’ll also keep a list of the metadata.

In [ ]:
note_codes = []
note_meta = []
unknown_codes = set()
for k, note_dat in note_info.items():
    subject_id, md5, row_id = k

    meta = note_dat['meta'].copy()
    meta['subject_id'] = subject_id
    meta['md5'] = md5
    meta['note_row_id'] = row_id
    note_meta.append(meta)

    diagnoses = note_dat['diagnoses']
    if diagnoses is not None:
        for diag in diagnoses:
            new_code = {
                'subject_id': subject_id,
                'md5': md5,
                'note_row_id': row_id,
                'level': 'source',
                'code': diag['icd9_code']
            }
            note_codes.append(new_code)

            if diag['known_icd9_code']:
                levels = sdu.get_icd9_levels(diag['clean_icd9_code'])
                for ind, lev_code in enumerate(levels):
                    new_code = {
                        'subject_id': subject_id,
                        'md5': md5,
                        'note_row_id': row_id,
                        'level': ind,
                        'code': lev_code
                    }
                    note_codes.append(new_code)

            else:
                if diag['icd9_code'] not in unknown_codes:
                    unknown_codes.add(diag['icd9_code'])
                    logger.info('Unknown code ({}) for subject ({})'.format(diag['icd9_code'], subject_id))
In [51]:
len(unknown_codes)
Out[51]:
375
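
The expansion step is easier to follow in isolation. The sketch below swaps sdu.get_icd9_levels for a hand-written lookup table (TOY_LEVELS and expand_code are hypothetical names, and the path is typed in by hand rather than read from the ICD9 library), so it only illustrates the shape of the rows that end up in note_codes.

```python
# Hypothetical stand-in for sdu.get_icd9_levels: maps a cleaned code to its
# path from the top-level chapter downward. Hand-written for illustration.
TOY_LEVELS = {
    '398.91': ['390-459', '393-398', '398', '398.9'],
}


def expand_code(raw_code, clean_code, known):
    """Return one 'source' row plus one row per hierarchy level."""
    # The original MIMIC code is always recorded, known or not.
    rows = [{'level': 'source', 'code': raw_code}]
    if known:
        # Level 0 is the top of the hierarchy; deeper levels count upward.
        for depth, level_code in enumerate(TOY_LEVELS[clean_code]):
            rows.append({'level': depth, 'code': level_code})
    return rows


rows = expand_code('39891', '398.91', known=True)
unknown_rows = expand_code('41071', '410.71', known=False)

for row in rows:
    print(row)
```

An unknown code contributes only its 'source' row, so it is still available for analysis even though it has no hierarchy levels.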

Inspecting the records, we see that for a particular note (row id 1414073) the code recorded a known ICD9 code (39891), then its root parent (390-459), and then the path down from that root through the children 393-398, 398, … Each row stores its hierarchy level, counted from the root node. In a future post we’ll use this info to select a cutoff depth for classification based on ICD9.

In [52]:
note_codes_df = pd.DataFrame.from_records(note_codes)
note_codes_df.head(5)
Out[52]:
      code   level                               md5  note_row_id  subject_id
0    41071  source  be74552c73a0f9895c4f372763054d26    1414073.0       11590
1    39891  source  be74552c73a0f9895c4f372763054d26    1414073.0       11590
2  390-459       0  be74552c73a0f9895c4f372763054d26    1414073.0       11590
3  393-398       1  be74552c73a0f9895c4f372763054d26    1414073.0       11590
4      398       2  be74552c73a0f9895c4f372763054d26    1414073.0       11590
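
As a rough idea of how the level column could be used later, the sketch below pulls out the codes at a single depth. The helper codes_at_level is hypothetical and is only one possible way to apply a cutoff; the rows are copied from the head() output above with the md5 and subject_id columns left out.

```python
import pandas as pd

# Rows copied from the head() output above (md5 and subject_id omitted).
note_codes_df = pd.DataFrame.from_records([
    {'code': '41071', 'level': 'source', 'note_row_id': 1414073.0},
    {'code': '39891', 'level': 'source', 'note_row_id': 1414073.0},
    {'code': '390-459', 'level': 0, 'note_row_id': 1414073.0},
    {'code': '393-398', 'level': 1, 'note_row_id': 1414073.0},
    {'code': '398', 'level': 2, 'note_row_id': 1414073.0},
])


def codes_at_level(df, level):
    """Return the distinct codes recorded at one hierarchy level."""
    # 'level' mixes the string 'source' with integers, so compare as strings.
    # This also keeps working after a CSV round trip, where every value in
    # the column may be read back as text.
    mask = df['level'].astype(str) == str(level)
    return sorted(df.loc[mask, 'code'].unique())


print(codes_at_level(note_codes_df, 1))
print(codes_at_level(note_codes_df, 'source'))
```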
In [57]:
output_path = pl.Path(path_config['repo_data_dir']).joinpath('notes_icd9_codes_{}.csv'.format(time_str))
logger.info(output_path)
note_codes_df.to_csv(output_path.as_posix(), index=False)
2016-10-24 16:41:53,258 - root - INFO - ../../data/notes_icd9_codes_2016-10-24-16-35.csv
In [54]:
note_meta_df = pd.DataFrame.from_records(note_meta)
note_meta_df.head(5)
Out[54]:
      cgid   chartdate            charttime   hadm_id                               md5  note_row_id            storetime  subject_id
0  17770.0  2154-06-03  2154-06-03 17:30:00  172993.0  be74552c73a0f9895c4f372763054d26    1414073.0  2154-06-03 17:51:00       11590
1  17698.0  2183-07-28  2183-07-28 05:41:00  116105.0  2bd0c96855c6107be79d0150e1f121e7    1449706.0  2183-07-28 05:53:00       14342
2      NaN  2170-02-13                  NaT  122710.0  bd4bf8040238e3e2cdd7466692defe73      47105.0                  NaT        8217
3  18469.0  2175-06-07  2175-06-07 05:39:00  196691.0  6d20d9b6d3cfdc3fc9e8a72fbab0f697    1573953.0  2175-06-07 06:27:00       23829
4  17079.0  2125-04-27  2125-04-27 20:51:00  133059.0  d35003faa86241e60396014264b14a4d    1264491.0  2125-04-27 21:03:00         305
In [58]:
output_path = pl.Path(path_config['repo_data_dir']).joinpath('mimic3_note_metadata_{}.csv'.format(time_str))
logger.info(output_path)
note_meta_df.to_csv(output_path.as_posix(), index=False)
2016-10-24 16:41:54,930 - root - INFO - ../../data/mimic3_note_metadata_2016-10-24-16-35.csv
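
The names pl, path_config, and time_str are defined earlier in the notebook and are not shown in this post. Judging from the logged file names, pl is presumably pathlib and time_str a minute-resolution timestamp. A minimal sketch under those assumptions (make_output_path is a hypothetical helper, and the strftime format is inferred, not taken from the notebook):

```python
import datetime
import pathlib as pl


def make_output_path(data_dir, prefix, now):
    """Build a timestamped CSV path such as <data_dir>/<prefix>_<time>.csv."""
    # Assumed format, inferred from the logged file names above.
    time_str = now.strftime('%Y-%m-%d-%H-%M')
    return pl.Path(data_dir).joinpath('{}_{}.csv'.format(prefix, time_str))


path = make_output_path('../../data', 'notes_icd9_codes',
                        datetime.datetime(2016, 10, 24, 16, 35))
print(path.as_posix())
```

Computing the timestamp once and reusing it for both files keeps the codes CSV and the metadata CSV from the same run easy to pair up.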

Supporting code

structured_data_utils.py download
import logging
import pandas as pd
import sys 

sys.path.append('./icd9/')
from icd9 import ICD9

# feel free to replace with your path to the json file
tree = ICD9('icd9/codes.json')

logger = logging.getLogger()


def get_note_metadata(conn, row_id):
    """Retrieve note metadata from MIMIC III database

    Parameters
    ----------
    conn : sqlalchemy connection
    row_id : MIMIC III note row id to retrieve

    Returns
    -------
    dict : subject_id, hadm_id, chartdate, charttime, storetime, cgid corresponding to note
    """

    query = """
select subject_id, hadm_id, chartdate, charttime, storetime, cgid
from mimiciii.noteevents
where row_id={};"""
    
    res = conn.execute(query.format(int(row_id))).fetchone()
    
    if res is None:
        return None

    return dict(res)


def clean_icd9_code(icd9_str):
    """Convert a MIMIC III-style ICD9 code to a standard code for lookup

    Parameters
    ----------
    icd9_str : str
        MIMIC III code (e.g. '39891')

    Returns
    -------
    str :
        Standard ICD9 code, e.g. 398.91
    """
    if icd9_str is None:
        return None
    
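    if '.' not in icd9_str:
        # E codes carry four characters before the decimal point (e.g. E849.0);
        # all other codes, including V codes, carry three (e.g. 398.91, V15.82).
        # A four-character E code such as 'E887' is already complete.
        prefix_len = 4 if icd9_str.startswith('E') else 3
        if len(icd9_str) > prefix_len:
            icd9_str = icd9_str[:prefix_len] + '.' + icd9_str[prefix_len:]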
        
    return icd9_str


def print_icd9_tree(node):
    """Print the ICD9 tree of a given ICD9 code

    Parameters
    ----------
    node : str
        Properly formatted ICD9 code (e.g. '398.91')

    """
    if isinstance(node, str):
        icd9_str = clean_icd9_code(node)
        node = tree.find(icd9_str)
    
    if node is not None:    
        print('Parents:')
        for c in node.parents:
            print('- {}: {}'.format(c.code, c.description))    

        print('\n-> {}: {}\n'.format(node.code, node.description))

        print('Children:')
        for c in node.children:
            print('- {}: {}'.format(c.code, c.description))


def get_hadm_diagnoses(conn, hadm_id):
    """Retrieve all ICD9 diagnoses for a given encounter ID

    Parameters
    ----------
    conn : sqlalchemy connection
    hadm_id : int or str
        MIMIC III encounter ID

    Returns
    -------
    list of dict : each dict contains the ICD9 code, short title, and long title from MIMIC III

    """
    if hadm_id is None:
        return None
    
    query = """
select a.subject_id, a.hadm_id, a.seq_num, a.icd9_code, diags.short_title, diags.long_title
from mimiciii.diagnoses_icd as a
left join mimiciii.d_icd_diagnoses as diags on a.icd9_code = diags.icd9_code
where a.hadm_id = {}
order by a.seq_num
"""
    res = conn.execute(query.format(int(hadm_id))).fetchall()
    
    if res is not None:
        res = [dict(r.items()) for r in res]
        for r in res:
            r['clean_icd9_code'] = clean_icd9_code(r['icd9_code'])
            r['known_icd9_code'] = tree.find(r['clean_icd9_code']) is not None                
    
    return res


def get_icd9_levels(icd9_code, max_depth=5):
    """Retrieve parents in the ICD9 hierarchy of the given code

    Parameters
    ----------
    icd9_code : str
        Properly formatted ICD9 code
    max_depth : int
        Maximum depth to retrieve

    Returns
    -------
    list
        Parents of the given ICD9 code, starting from top-most parent and descending down to max_depth
    """
    icd9_str = clean_icd9_code(icd9_code)
    node = tree.find(icd9_str)
    
    levels = None
    
    if node is not None:
        levels = [c.code for c in node.parents[1:max_depth]]

    return levels
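
The decimal-placement rule used when cleaning MIMIC codes can be checked on its own. The function below is a standalone sketch of that rule (insert_decimal is a hypothetical name, and the sample codes are chosen for illustration): three characters before the point for numeric and V codes, four for E codes, and no point at all when nothing follows the prefix.

```python
def insert_decimal(code):
    """Place the decimal point in an undotted ICD9 code string."""
    if code is None or '.' in code:
        return code
    # E codes: four characters before the point; everything else: three.
    prefix_len = 4 if code.startswith('E') else 3
    if len(code) > prefix_len:
        return code[:prefix_len] + '.' + code[prefix_len:]
    # Nothing after the prefix, so the code is already complete.
    return code


for raw in ['39891', '3970', '042', 'V1582', 'E8490', 'E887']:
    print(raw, '->', insert_decimal(raw))
```

The last input is the case worth watching: a four-character E code has no decimal part and should come back unchanged.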
