In the previous post I showed a convenient way to navigate the ICD9 hierarchy with Python. Now let’s use that to extract the full taxonomy of ICD9 codes for the patients we’ll be using to train a classifier. In this post we’ll extract the original ICD9 codes for all patients of interest from the MIMIC database, look up each code’s position in the ICD9 hierarchy, and save the results for later analysis.
The full notebook is available here, but the bulk of the work happens in the accessory file structured_data_utils.py, which we import and access as sdu. In the section below I walk through using the routines in this library.
Reload any changes made to the structured data utils code
reload(sdu)
Use pandas to extract the list of unique notes and patients - the primary thing we’re looking for is the MIMIC III row id, which is used to get the MIMIC encounter ID, and from there the ICD9 diagnoses.
found_notes = comb_dat.loc[comb_dat['row_id_m3'].notnull()].\
groupby(['subject_id', 'md5', 'row_id_m3']).count()['total_m3_distance'].index.tolist()
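To make the result of that expression concrete, here is a minimal sketch in plain Python rather than pandas, run on a few made-up rows. The column names match the ones above; the values and the helper name unique_note_keys are hypothetical. It keeps only rows that have a MIMIC III row id and returns the sorted, de-duplicated (subject_id, md5, row_id_m3) tuples, which is the shape of the list the pandas line produces.

```python
# Toy stand-in for comb_dat: values are made up for illustration only.
rows = [
    {"subject_id": 2, "md5": "bb", "row_id_m3": 20},
    {"subject_id": 1, "md5": "aa", "row_id_m3": 10},
    {"subject_id": 1, "md5": "aa", "row_id_m3": 10},    # duplicate match
    {"subject_id": 3, "md5": "cc", "row_id_m3": None},  # no MIMIC III match
]


def unique_note_keys(records):
    """Return sorted unique (subject_id, md5, row_id_m3) tuples with a row id."""
    keys = {
        (r["subject_id"], r["md5"], r["row_id_m3"])
        for r in records
        if r["row_id_m3"] is not None
    }
    return sorted(keys)


found_notes_demo = unique_note_keys(rows)
print(found_notes_demo)  # [(1, 'aa', 10), (2, 'bb', 20)]
```

Each tuple can then be unpacked as subject_id, md5, row_id, and the third element is what gets passed to the note metadata lookup.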
Iterate through the rows, building up a dictionary of dictionaries. note_info is a dictionary whose keys are the unique subject_id-md5-row_id triplets from the pandas line above. Each value is another dictionary with 2 keys:
meta - note metadata, including the patient id (subject_id), encounter id (hadm_id), and associated timestamps
diagnoses - a list of the diagnoses associated with this encounter, including the original poorly formatted ICD9 code from MIMIC, the reformatted version (clean_icd9_code), and the label of the code
note_info = {}
for idx in found_notes:
    note_meta = sdu.get_note_metadata(conn, idx[2])
    note_diag = sdu.get_hadm_diagnoses(conn, note_meta['hadm_id'])
    dat = {'meta': note_meta, 'diagnoses': note_diag}
    note_info[idx] = dat
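The shape of the resulting dictionary is easier to see without a database. The sketch below is an illustration under stated assumptions: fake_get_note_metadata and fake_get_hadm_diagnoses are hypothetical stand-ins for the sdu routines and return made-up values. It also adds a guard for a row id with no matching note, since indexing a None metadata result would raise a TypeError.

```python
# Hypothetical stand-ins for sdu.get_note_metadata and sdu.get_hadm_diagnoses.
# They return made-up values so the example runs without a database.
def fake_get_note_metadata(row_id):
    if row_id == 10:
        return {"subject_id": 1, "hadm_id": 100}
    return None  # mimics a row id with no matching note


def fake_get_hadm_diagnoses(hadm_id):
    return [{"icd9_code": "39891", "clean_icd9_code": "398.91",
             "known_icd9_code": True}]


found_notes_demo = [(1, "aa", 10), (2, "bb", 20)]

note_info_demo = {}
for key in found_notes_demo:
    meta = fake_get_note_metadata(key[2])
    if meta is None:
        # Indexing None with ['hadm_id'] would raise a TypeError; skip instead.
        continue
    diagnoses = fake_get_hadm_diagnoses(meta["hadm_id"])
    note_info_demo[key] = {"meta": meta, "diagnoses": diagnoses}

# Look at the first entry without building a list of all keys.
first_key = next(iter(note_info_demo))
print(first_key, note_info_demo[first_key]["diagnoses"][0]["clean_icd9_code"])
```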
Print one element out to see how it looks
note_info[[k for k in note_info.keys()][0]]
Now we can use this list and the ICD9 Python library to extract all of the parents for each code. Not all codes in MIMIC III are known to the library (likely due to slightly different ICD versions), so we handle that possibility by skipping the parent lookup for unknown codes; the original code is still recorded with level 'source'. If it’s a known code, we look up the parents and add the code and each of its parents to the note_codes list. We’ll also keep a list of the metadata.
note_codes = []
note_meta = []
unknown_codes = set()
for k, note_dat in note_info.items():
    subject_id, md5, row_id = k
    meta = note_dat['meta'].copy()
    meta['subject_id'] = subject_id
    meta['md5'] = md5
    meta['note_row_id'] = row_id
    note_meta.append(meta)
    diagnoses = note_dat['diagnoses']
    if diagnoses is not None:
        for diag in diagnoses:
            new_code = {
                'subject_id': subject_id,
                'md5': md5,
                'note_row_id': row_id,
                'level': 'source',
                'code': diag['icd9_code']
            }
            note_codes.append(new_code)
            if diag['known_icd9_code']:
                levels = sdu.get_icd9_levels(diag['clean_icd9_code'])
                for ind, lev_code in enumerate(levels):
                    new_code = {
                        'subject_id': subject_id,
                        'md5': md5,
                        'note_row_id': row_id,
                        'level': ind,
                        'code': lev_code
                    }
                    note_codes.append(new_code)
            else:
                if diag['icd9_code'] not in unknown_codes:
                    unknown_codes.add(diag['icd9_code'])
                    logger.info('Unknown code ({}) for subject ({})'.format(diag['icd9_code'], subject_id))
len(unknown_codes)
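The expansion step can be tried in isolation with a hand-written hierarchy. This is a sketch under assumptions, not the ICD9 library’s real output: TOY_PATHS is an abbreviated, made-up path that reuses the codes discussed in this post, "ROOT" is a placeholder for the top node, and toy_levels and expand are hypothetical helpers. The slice [1:max_depth] drops the root placeholder, and enumerate assigns the depth stored in the level field.

```python
# Hand-written, abbreviated toy paths from a root placeholder down to a code.
# This stands in for the ICD9 library lookup and is not its real output.
TOY_PATHS = {
    "398.91": ["ROOT", "390-459", "393-398", "398", "398.91"],
}


def toy_levels(clean_code, max_depth=5):
    """Return the path below the root, or None for an unknown code."""
    path = TOY_PATHS.get(clean_code)
    if path is None:
        return None
    return path[1:max_depth]


def expand(diagnoses):
    """Emit one 'source' record per diagnosis plus one per hierarchy level."""
    records, unknown = [], set()
    for diag in diagnoses:
        records.append({"level": "source", "code": diag["icd9_code"]})
        levels = toy_levels(diag["clean_icd9_code"])
        if levels is None:
            unknown.add(diag["icd9_code"])
            continue
        for depth, code in enumerate(levels):
            records.append({"level": depth, "code": code})
    return records, unknown


demo_diagnoses = [
    {"icd9_code": "39891", "clean_icd9_code": "398.91"},
    {"icd9_code": "XXXXX", "clean_icd9_code": "XXX.XX"},  # made-up unknown code
]
records, unknown = expand(demo_diagnoses)
for rec in records:
    print(rec)
print(unknown)
```

The known code yields one source record followed by one record per level, while the unknown code yields only its source record and is collected in the unknown set.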
Inspecting the records, we see that for a particular note (row id 1414073), the code found a known ICD9 code (39891), then found a root parent (390-459) and the path from it through children 393-398, 398, … We keep track of the hierarchy level from the root node; in a future post we’ll use this info to select a cutoff depth for classification based on ICD9.
note_codes_df = pd.DataFrame.from_records(note_codes)
note_codes_df.head(5)
output_path = pl.Path(path_config['repo_data_dir']).joinpath('notes_icd9_codes_{}.csv'.format(time_str))
logger.info(output_path)
note_codes_df.to_csv(output_path.as_posix(), index=False)
note_meta_df = pd.DataFrame.from_records(note_meta)
note_meta_df.head(5)
output_path = pl.Path(path_config['repo_data_dir']).joinpath('mimic3_note_metadata_{}.csv'.format(time_str))
logger.info(output_path)
note_meta_df.to_csv(output_path.as_posix(), index=False)
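The cells above rely on time_str and path_config, which are defined elsewhere in the notebook. The following sketch shows one way a timestamped output file could be produced. Everything specific in it is an assumption: the timestamp format, the temporary directory standing in for repo_data_dir, and the use of the standard library’s csv.DictWriter in place of DataFrame.to_csv.

```python
import csv
import pathlib
import tempfile
from datetime import datetime

# Assumptions: the timestamp format and the temporary output directory are
# illustrative; the post does not show how time_str or path_config are built.
time_str_demo = datetime.now().strftime("%Y%m%d_%H%M%S")
out_dir = pathlib.Path(tempfile.mkdtemp())
output_path_demo = out_dir.joinpath("notes_icd9_codes_{}.csv".format(time_str_demo))

# Made-up records with the same keys as the note_codes entries.
demo_records = [
    {"subject_id": 1, "md5": "aa", "note_row_id": 10, "level": "source", "code": "39891"},
    {"subject_id": 1, "md5": "aa", "note_row_id": 10, "level": 0, "code": "390-459"},
]

# csv.DictWriter stands in for DataFrame.to_csv(..., index=False) here.
with output_path_demo.open("w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=list(demo_records[0]))
    writer.writeheader()
    writer.writerows(demo_records)

# Read the file back to confirm the round trip.
with output_path_demo.open(newline="") as fh:
    rows_back = list(csv.DictReader(fh))
print(output_path_demo.name, len(rows_back))
```

Note that the level column mixes the string 'source' with integer depths, so values read back with the csv module are all strings and need explicit conversion before any numeric comparison.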
Supporting code
import logging
import pandas as pd
import sys
sys.path.append('./icd9/')
from icd9 import ICD9
# feel free to replace with your path to the json file
tree = ICD9('icd9/codes.json')
logger = logging.getLogger()
def get_note_metadata(conn, row_id):
    """Retrieve note metadata from MIMIC III database

    Parameters
    ----------
    conn : sqlalchemy connection
    row_id : MIMIC III note row id to retrieve

    Returns
    -------
    dict : subject_id, hadm_id, chartdate, charttime, storetime, cgid corresponding to note
    """
    query = """
    select subject_id, hadm_id, chartdate, charttime, storetime, cgid
    from mimiciii.noteevents
    where row_id={};"""
    res = conn.execute(query.format(int(row_id))).fetchone()
    if res is None:
        return None
    return dict(res)
def clean_icd9_code(icd9_str):
    """Convert a MIMIC III-style ICD9 code to a standard code for lookup

    Parameters
    ----------
    icd9_str : str
        MIMIC III code (e.g. '39891')

    Returns
    -------
    str :
        Standard ICD9 code, e.g. 398.91
    """
    if icd9_str is None:
        return None
    if '.' not in icd9_str:
        if icd9_str.startswith('E'):
            # E codes have three digits before the decimal point (e.g. E812.0),
            # so a four-character E code is left unchanged
            if len(icd9_str) > 4:
                icd9_str = icd9_str[:4] + '.' + icd9_str[4:]
        elif len(icd9_str) > 3:
            icd9_str = icd9_str[:3] + '.' + icd9_str[3:]
    return icd9_str
def print_icd9_tree(node):
    """Print the ICD9 tree of a given ICD9 code

    Parameters
    ----------
    node : str
        Properly formatted ICD9 code (e.g. '398.91')
    """
    if isinstance(node, str):
        icd9_str = clean_icd9_code(node)
        node = tree.find(icd9_str)
    if node is not None:
        print('Parents:')
        for c in node.parents:
            print('- {}: {}'.format(c.code, c.description))
        print('\n-> {}: {}\n'.format(node.code, node.description))
        print('Children:')
        for c in node.children:
            print('- {}: {}'.format(c.code, c.description))
def get_hadm_diagnoses(conn, hadm_id):
    """Retrieve all ICD9 diagnoses for a given encounter ID

    Parameters
    ----------
    conn : sqlalchemy connection
    hadm_id : int or str
        MIMIC III encounter ID

    Returns
    -------
    list of dict : each dict contains the ICD9 code, short title, and long title from MIMIC III
    """
    if hadm_id is None:
        return None
    query = """
    select a.subject_id, a.hadm_id, a.seq_num, a.icd9_code, diags.short_title, diags.long_title
    from mimiciii.diagnoses_icd as a
    left join mimiciii.d_icd_diagnoses as diags on a.icd9_code = diags.icd9_code
    where a.hadm_id = {}
    order by a.seq_num
    """
    res = conn.execute(query.format(int(hadm_id))).fetchall()
    if res is not None:
        res = [dict(r.items()) for r in res]
        for r in res:
            r['clean_icd9_code'] = clean_icd9_code(r['icd9_code'])
            r['known_icd9_code'] = tree.find(r['clean_icd9_code']) is not None
    return res
def get_icd9_levels(icd9_code, max_depth=5):
    """Retrieve parents in the ICD9 hierarchy of the given code

    Parameters
    ----------
    icd9_code : str
        Properly formatted ICD9 code
    max_depth : int
        Maximum depth to retrieve

    Returns
    -------
    list
        Parents of the given ICD9 code, starting from the top-most parent and descending down to max_depth
    """
    icd9_str = clean_icd9_code(icd9_code)
    node = tree.find(icd9_str)
    levels = None
    if node is not None:
        levels = [c.code for c in node.parents[1:max_depth]]
    return levels
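A few worked inputs make the code-cleaning rules easier to check. The block below is a standalone restatement of the normalization (E codes are split after four characters, all other codes after three) so that it runs on its own. In this restatement a four-character E code such as 'E887' is left untouched. The example inputs are chosen for illustration.

```python
def clean_icd9_code(icd9_str):
    """Standalone restatement of the normalization rules for illustration."""
    if icd9_str is None:
        return None
    if '.' not in icd9_str:
        if icd9_str.startswith('E'):
            # E codes: the decimal point goes after 'E' plus three digits
            if len(icd9_str) > 4:
                icd9_str = icd9_str[:4] + '.' + icd9_str[4:]
        elif len(icd9_str) > 3:
            # Numeric and V codes: the decimal point goes after three characters
            icd9_str = icd9_str[:3] + '.' + icd9_str[3:]
    return icd9_str


examples = ['39891', '4019', '398', 'E8120', 'E887', 'V1582', '398.91', None]
for raw in examples:
    print(repr(raw), '->', repr(clean_icd9_code(raw)))
```

Codes that already contain a decimal point, and codes with no digits after the category, pass through unchanged.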