In the previous post I showed a convenient way to navigate the ICD9 hierarchy with Python; now let's use that to extract the full taxonomy of ICD9 codes for the patients we'll use to train a classifier. In this post we'll extract the original ICD9 codes for all patients of interest from the MIMIC database, expand them into the full ICD9 hierarchy, and save the results for later analysis.
The full notebook is available here, but the bulk of the work happens in the accessory file structured_data_utils.py, which we import and access as sdu. In the selection below I walk through using the routines in this library.
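The cells below also lean on a few names set up earlier in the notebook: the database connection conn, the comb_dat DataFrame of matched notes, a logger, path_config, and a time_str used to stamp the output files. As a rough, illustrative sketch of that setup (the actual connection and config details live in the notebook):
import logging
import pathlib as pl
from datetime import datetime
from importlib import reload

import pandas as pd

import structured_data_utils as sdu

logger = logging.getLogger(__name__)
time_str = datetime.now().strftime('%Y%m%d_%H%M%S')  # stamp for output file names (format is illustrative)
# conn (a connection to the MIMIC-III database), comb_dat (the matched notes
# DataFrame), and path_config (local paths) are created earlier in the notebook.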
Reload any changes made to the structured data utils code
reload(sdu)
Use pandas to extract the list of unique notes and patients - the primary thing we’re looking for is the MIMIC III row id, which is used to get the MIMIC encounter ID, and from there the ICD9 diagnoses.
found_notes = comb_dat.loc[comb_dat['row_id_m3'].notnull()].\
    groupby(['subject_id', 'md5', 'row_id_m3']).count()['total_m3_distance'].index.tolist()
Iterate through the rows, building up a dictionary of dictionaries. note_info is a dictionary where the keys are the unique subject_id-md5-row_id triplets from the pandas line above. The values are another dictionary with 2 keys:
meta - note metadata, including the patient id (subject_id), encounter id (hadm_id), and associated timestamps
diagnoses - a list of the diagnoses associated with this encounter, including the original poorly formatted ICD9 code from MIMIC, the reformatted version (clean_icd9_code), and the label of the code
note_info = {}
for idx in found_notes:
    note_meta = sdu.get_note_metadata(conn, idx[2])
    note_diag = sdu.get_hadm_diagnoses(conn, note_meta['hadm_id'])
    dat = {'meta': note_meta, 'diagnoses': note_diag}
    note_info[idx] = dat
Print one element out to see how it looks
note_info[list(note_info.keys())[0]]
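The printed entry has the structure described above; roughly like this, where the values are made up and the exact metadata and label field names (and the cleaned code format) come from sdu, so they may differ:
{'meta': {'subject_id': 109,                # hypothetical values throughout
          'hadm_id': 172335,
          'charttime': '2141-09-18 09:00:00'},
 'diagnoses': [{'icd9_code': '39891',       # raw MIMIC formatting
                'clean_icd9_code': '398.91',
                'known_icd9_code': True,
                'label': 'Rheumatic heart failure'}]}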
Now we can use this data and the ICD9 python library to extract all of the parents for each code. Not all codes in MIMIC III are known to the library (likely due to slightly different ICD versions), so we need to handle that possibility by skipping unknown codes. If it's a known code we look up the parents, and we add the code and each of its parents to the note_codes list. We'll also keep a list of the metadata.
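Before the loop itself, it helps to see what the hierarchy lookup produces. For the example discussed further down (MIMIC code 39891), get_icd9_levels returns the ancestors of the cleaned code in order from the root of the ICD9 tree down toward the code, so the enumeration index doubles as the depth. Something like this (the input formatting and the exact list depend on the library):
levels = sdu.get_icd9_levels('398.91')  # cleaned form of MIMIC's 39891
# level 0 is the root chapter, deeper indices are progressively more specific,
# e.g. '390-459', then '393-398', then '398', and so on down to the code itself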
note_codes = []
note_meta = []
unknown_codes = set()
for k, note_dat in note_info.items():
    subject_id, md5, row_id = k
    # collect the note-level metadata, tagged with the identifying triplet
    meta = note_dat['meta'].copy()
    meta['subject_id'] = subject_id
    meta['md5'] = md5
    meta['note_row_id'] = row_id
    note_meta.append(meta)
    diagnoses = note_dat['diagnoses']
    if diagnoses is not None:
        for diag in diagnoses:
            # always keep the original MIMIC code, marked as the 'source' level
            new_code = {
                'subject_id': subject_id,
                'md5': md5,
                'note_row_id': row_id,
                'level': 'source',
                'code': diag['icd9_code']
            }
            note_codes.append(new_code)
            if diag['known_icd9_code']:
                # add one row per hierarchy level, from the root (level 0) down
                levels = sdu.get_icd9_levels(diag['clean_icd9_code'])
                for ind, lev_code in enumerate(levels):
                    new_code = {
                        'subject_id': subject_id,
                        'md5': md5,
                        'note_row_id': row_id,
                        'level': ind,
                        'code': lev_code
                    }
                    note_codes.append(new_code)
            else:
                # log each unknown code only once
                if diag['icd9_code'] not in unknown_codes:
                    unknown_codes.add(diag['icd9_code'])
                    logger.info('Unknown code ({}) for subject ({})'.format(diag['icd9_code'], subject_id))
len(unknown_codes)
Inspecting the records, we see that for a particular note (row id 1414073), the code found a known ICD9 code (39891), then found its root parent (390-459) and the path from it down through children 393-398, 398, … We keep track of the hierarchy level from the root node; in a future post we'll use this to select a cutoff depth for classification based on the ICD9 hierarchy.
note_codes_df = pd.DataFrame.from_records(note_codes)
note_codes_df.head(5)
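That cutoff will come in a later post, but to illustrate how the level column can be used, here's a minimal sketch (the cutoff value is hypothetical) that keeps only the codes sitting at one fixed depth in the hierarchy:
cutoff_level = 2  # hypothetical depth from the root
codes_at_cutoff = note_codes_df[note_codes_df['level'] == cutoff_level]
codes_at_cutoff.head(5)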
output_path = pl.Path(path_config['repo_data_dir']).joinpath('notes_icd9_codes_{}.csv'.format(time_str))
logger.info(output_path)
note_codes_df.to_csv(output_path.as_posix(), index=False)
note_meta_df = pd.DataFrame.from_records(note_meta)
note_meta_df.head(5)
output_path = pl.Path(path_config['repo_data_dir']).joinpath('mimic3_note_metadata_{}.csv'.format(time_str))
logger.info(output_path)
note_meta_df.to_csv(output_path.as_posix(), index=False)
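For the later analysis, the two CSVs can be read straight back into pandas; a minimal example, assuming the same path_config and time_str:
codes_path = pl.Path(path_config['repo_data_dir']).joinpath('notes_icd9_codes_{}.csv'.format(time_str))
meta_path = pl.Path(path_config['repo_data_dir']).joinpath('mimic3_note_metadata_{}.csv'.format(time_str))
note_codes_df = pd.read_csv(codes_path.as_posix())
note_meta_df = pd.read_csv(meta_path.as_posix())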