One tricky thing to deal with in the transition from MIMIC II to MIMIC III has been that IDs have changed and are sometimes missing, so work that a collaborating team did to annotate patient notes in MIMIC II is not translatable to MIMIC III. This notebook shows one way to discover note relationships between the two datasets.

Introduction

I’ve been working with the fantastic MIMIC dataset for several years now, and have been excited by the new MIMIC III release. It cleans up a lot of the old structure, adds content, and generally is just a lot nicer to use. (Thanks to all involved!) In the process of developing the new system the old way of identifying patients was reset, making it difficult to translate work from MIMIC II to MIMIC III.

As an example, in the previous dataset the team had created annotations such as “Patient 1’s 3rd ICU stay has notes with indications of advanced heart failure” and “Patient 1’s 7th ICU stay has notes with indication of substance abuse”. In the new dataset the patient numbers are consistent but the ICU stay indications are removed and times are all shifted differently, so we can’t directly apply our old annotations to the nte notes. Ideally one could just take a hash of the notes (e.g. MD5) and create a mapping, but the de-identification process changed as well, so the same original note text will be different in the two output datasets we have access to.

Approach

The first part of the notebook goes through the following steps

  1. Connect to MIMIC II and MIMIC III databases using SQLAlchemy
  2. Load notes labeled by our annotators
  3. Verify that the annotated notes match the MIMIC II source (expected)
  4. Check whether they match the MIMIC III data (they did not)

I did find that subject IDs were consistent between datasets - an observation that greatly reduces the potential search space of matches for each note. Using this observation I was able to take each annotated note and only try to match it to each of the same subject’s notes in MIMIC III. To identify matches, I first tried MD5 hashes - this did not work as the de-identification procedure changed between datasets (the same source note was presented differently in the two datasets, causing the hash to differ). I finally settled on a distance heuristic combining the overall note length, the similarity of the beginning of the note (first several hundred characters, excluding whitespace), and the similarity of the end of the note (last several hundred characters). When there were multiple possible matches for a note, the note with the aggregate lowest distance was chosen as the most probable match.

New Approach:

  1. Gather all notes from MIMIC II for each subject
  2. Gather all notes from MIMIC III for the same subjects
  3. For each subject’s notes in MIMIC II, calculate the following metrics for that subject MIMIC III notes:
    • Difference in note length
    • Similarity of the text beginning (as evaluated by SequenceMatcher)
    • Similarity of the text ending
  4. Any notes with more than 10%-20% difference in any of these metrics are considered to not match
  5. If there are multiple potential matches remaining, the notes with the minimum overall distance are considered best matches.

In this case thresholds for the heuristic distance measure and overall matches were chosen by inspection, and that resulted in reasonable results. A larger problem, or a problem with less constrained population for potential note matches, would require a different approach - e.g. pre-clustering (by topic, word frequency, etc), making use of extracted patient demographics (sex, age, race), or likely diagnoses. One could also apply more rigorous optimization for the thresholds to balance false positives and false negatives, or could apply automated approaches to developing distance thresholds (e.g. based on semisupervised learning).

Notebook

Introduction

Notes that were evaluated were pulled from both MIMIC 2 and MIMIC 3 due to availablity of data at the time of extraction. This causes difficulty when trying to pair the notes with structured data for the purposes of relating the notes to the overall patient context.

This notebook takes annotated note files as input, collects the necessary information from MIMIC 2 and 3 to find the note in MIMIC 3, and creates a new annotation file with the MIMIC 3 data.

Authors

  • Eric Carlson
In [1]:
from datetime import datetime
import configparser
import hashlib
from importlib import reload
import logging
import numpy as np
import os
import pandas as pd
import pathlib as pl
import sys
import yaml

from IPython import display

import etc_utils as eu
import mimic_extraction_utils as meu
In [2]:
import sqlalchemy as sa
from sqlalchemy.engine import reflection
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import create_engine, MetaData, inspect
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.automap import automap_base
In [3]:
reload(eu)
#%cat etc_utils.py
Out[3]:
<module 'etc_utils' from '/Users/ecarlson/code/etc/ektar-pelican/content/notebooks/mit_freq_fliers/etc_utils.py'>
In [4]:
reload(meu)
Out[4]:
<module 'mimic_extraction_utils' from '/Users/ecarlson/code/etc/ektar-pelican/content/notebooks/mit_freq_fliers/mimic_extraction_utils.py'>

Configure pandas and matplot lib for nice web printing

In [5]:
pd.options.display.max_rows = 1000
pd.options.display.max_columns = 50
pd.options.display.max_colwidth = 100
In [6]:
%matplotlib inline

Load config files, configure logging

In [7]:
work_desc = "mimic_iii_note_conversion"
In [24]:
time_str, path_config, creds = eu.load_config_v2(creds_file='../../private/mit_freq_fliers/credentials.yaml')
print('Time string: {}'.format(time_str))

print('Paths:')
for k, item in path_config.items():
    print('{}: {}'.format(k, item))
Time string: 2016-12-31-11-20
Paths:
repo_data_dir: /Users/ecarlson/code/etc/mit_frequent_fliers/mit-team-code/data
input_dir: /Users/ecarlson/data/mit_frequent_fliers
results_dir: /Users/ecarlson/results/mit_frequent_fliers
log_dir: /Users/ecarlson/logs/mit_frequent_fliers
In [9]:
logger = logging.getLogger()

eu.configure_logging(logger, work_desc=work_desc, log_directory=path_config['log_dir'], time_str=time_str)
Logging to /Users/ecarlson/logs/mit_frequent_fliers/2016-12-31-11-04_mimic_iii_note_conversion.log
In [10]:
[k for k in creds.keys()]
Out[10]:
['mimic3_v1_4', 'mimic2_v2_6', 'postgres']
In [11]:
[k for k in creds['mimic3_v1_4'].keys()]
Out[11]:
['dbname', 'username', 'sa_driver', 'password', 'hostname']

Connect to databases

In [12]:
db3_path = '{sa_driver}://{username}:{password}@{hostname}/{dbname}'.format(**creds['mimic3_v1_4'])
engine3 = create_engine(db3_path)
sm3 = sessionmaker(bind=engine3)
s3 = sm3()
conn3 = s3.connection()

meta3 = MetaData(schema="mimiciii", bind=engine3)
meta3.reflect(bind=engine3)
In [13]:
base3 = automap_base(metadata=meta3)
base3.prepare(engine3, reflect=True)
In [14]:
for cls_name in sorted(base3.classes.keys()):
    print(cls_name)
admissions
callout
caregivers
chartevents
cptevents
d_cpt
d_icd_diagnoses
d_icd_procedures
d_items
d_labitems
datetimeevents
diagnoses_icd
drgcodes
icustays
inputevents_cv
inputevents_mv
labevents
microbiologyevents
noteevents
outputevents
patients
prescriptions
procedureevents_mv
procedures_icd
services
transfers
In [15]:
note_tb = base3.classes['noteevents']
In [16]:
s3.query(note_tb.category).count()
Out[16]:
2083180
In [17]:
db2_path = '{sa_driver}://{username}:{password}@{hostname}/{dbname}'.format(**creds['mimic2_v2_6'])
engine2 = create_engine(db2_path)
sm2 = sessionmaker(bind=engine2)
s2 = sm2()
conn2 = s2.connection()

meta2 = MetaData(schema="mimic2v26", bind=engine2)
meta2.reflect(bind=engine2)
In [18]:
base2 = automap_base(metadata=meta2)
base2.prepare(engine2, reflect=True)
In [19]:
for cls_name in sorted(base2.classes.keys()):
    print(cls_name)
admissions
censusevents
d_caregivers
d_careunits
d_chartitems
d_codeditems
d_demographicitems
d_ioitems
d_labitems
d_meditems
d_patients
demographicevents
drgevents
icustayevents
poe_order
procedureevents
In [20]:
conn2.execute('select count(*) from mimic2v26.noteevents').fetchall()
Out[20]:
[(1239502,)]

Load labeled notes

In [21]:
categories = ['Advanced.Cancer', 'Advanced.Heart.Disease', 'Advanced.Lung.Disease',
       'Alcohol.Abuse',
       'Chronic.Neurological.Dystrophies', 'Chronic.Pain.Fibromyalgia',
       'Dementia', 'Depression', 'Developmental.Delay.Retardation',
       'Non.Adherence', 'None',
       'Obesity', 'Other.Substance.Abuse', 
       'Schizophrenia.and.other.Psychiatric.Disorders', 'Unsure',]
In [25]:
data_path = pl.Path(path_config['repo_data_dir'])
In [26]:
[p for p in data_path.glob('*csv')]
Out[26]:
[PosixPath('/Users/ecarlson/code/etc/mit_frequent_fliers/mit-team-code/data/AllnursingFinal27Sep16.csv'),
 PosixPath('/Users/ecarlson/code/etc/mit_frequent_fliers/mit-team-code/data/combined_label_data_2016-10-24-16-35.csv'),
 PosixPath('/Users/ecarlson/code/etc/mit_frequent_fliers/mit-team-code/data/dischargeSummariesClean.csv'),
 PosixPath('/Users/ecarlson/code/etc/mit_frequent_fliers/mit-team-code/data/mimic3_note_equivs_2016-10-22-03-39.csv'),
 PosixPath('/Users/ecarlson/code/etc/mit_frequent_fliers/mit-team-code/data/mimic3_note_metadata_2016-10-24-16-35.csv'),
 PosixPath('/Users/ecarlson/code/etc/mit_frequent_fliers/mit-team-code/data/notes_icd9_codes_2016-10-24-16-35.csv'),
 PosixPath('/Users/ecarlson/code/etc/mit_frequent_fliers/mit-team-code/data/nursingNotesClean.csv')]
In [27]:
nursing_notes_path = data_path.joinpath('nursingNotesClean.csv')
discharge_notes_path = data_path.joinpath('dischargeSummariesClean.csv')
In [28]:
nursing_notes = pd.read_csv(nursing_notes_path.as_posix())
disch_notes = pd.read_csv(discharge_notes_path.as_posix()).rename(columns={'subject.id':'subject_id'})
In [29]:
display.display(nursing_notes.head(5))
print(nursing_notes.loc[0,'text'])
Hospital.Admission.ID subject_id icustay_id charttime realtime category title text cohort rand Subject.ID ICU.ID Note.Type Chart.time Category Real.time None Obesity Non.Adherence Developmental.Delay.Retardation Advanced.Heart.Disease Advanced.Lung.Disease Schizophrenia.and.other.Psychiatric.Disorders Alcohol.Abuse Other.Substance.Abuse Chronic.Pain.Fibromyalgia Chronic.Neurological.Dystrophies Advanced.Cancer Depression Dementia Unsure operator
0 2 24807 30800.0 7/9/33 0:56 7/9/33 1:22 Nursing/Other MICU/SICU NURSING PROGRESS NOTE. \nMICU/SICU NURSING PROGRESS NOTE.\n SEE CAREVIEW FOR OBJECTIVE DATA.\n Neuro: Arouses w… 1 0.330879 24807 30800.0 Nursing/Other 7/9/33 0:56 Nursing/Other 7/9/33 1:22 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ETM
1 2 24807 30800.0 7/9/33 0:56 7/9/33 1:22 Nursing/Other MICU/SICU NURSING PROGRESS NOTE. \nMICU/SICU NURSING PROGRESS NOTE.\n SEE CAREVIEW FOR OBJECTIVE DATA.\n Neuro: Arouses w… 1 0.330879 24807 30800.0 Nursing/Other 7/9/33 0:56 Nursing/Other 7/9/33 1:22 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 JTW
2 18 5097 6331.0 6/9/24 18:12 6/9/24 18:29 Nursing/Other NaN \n\nNURSING ACCEPTANCE NOTE.\n\nPT TRANSFERRED FROM [**Hospital 254**] HOSPITAL TRANSFERRED WIT 0 0.315757 5097 6331.0 Nursing/Other 6/9/24 18:12 Nursing/Other 6/9/24 18:29 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 JTW
3 18 5097 6331.0 6/9/24 18:12 6/9/24 18:29 Nursing/Other NaN \n\nNURSING ACCEPTANCE NOTE.\n\nPT TRANSFERRED FROM [**Hospital 254**] HOSPITAL TRANSFERRED WIT 0 0.315757 5097 6331.0 Nursing/Other 6/9/24 18:12 Nursing/Other 6/9/24 18:29 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 ETM
4 45 23014 28601.0 5/1/08 9:51 5/1/08 9:57 Nursing/Other NaN \n\nNEURO; PT ON PROPFOL AT 25 MCG/KG/MIN, GTT OFF PER SICU AND NEURO TEAM, PT UNRESPONSIVE, WIT 1 0.435921 23014 28601.0 Nursing/Other 5/1/08 9:51 Nursing/Other 5/1/08 9:57 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 JF
MICU/SICU NURSING PROGRESS NOTE.
     SEE CAREVIEW FOR OBJECTIVE DATA.
     Neuro: Arouses with verbal stimuli, ms is variable and changes from a&o x 3 to a&o x 1 with moaning episodes, repetitive speach and repetative motions with rue.  Pupils are rt 4mm, lt 3mm and sluggish bilat.  Head ct completed and was negative.  Was given versed 1 mg during ct and pt became much more calm and was able to answer questions appropriatly. Reports able to feel in all 4 extrem., Partial movement of rt ue and lt ue.
     Respiratory:  Lung sounds are coarse throughout, diminished in lt base.  RR 12-24 and non-labored except when having moaning episodes.  O2 saturation is 94-100% on 4l nc.  Cxr in ed showed persistent lll pneumonia and atelectasis in rt upper field.  Expectorating thick tan sputum in abundant amounts.
     CV:  Sinus rythm, rate 70-98 with no ectopy noted.  Nbp 104- 134 systolic.  Good pulses all 4 extrem.  
      GI/GU:  Abdomen si softly distended with + bs.  Pt is able to take regular meals, is full feed, but must be bolt upright.  Able to take meds with custard or jello.  No bm this shift but reports bm yesterday.  Foley catheter is 30fr placed at os facility, occn leaks around catheter.  Patetn and draing clear amber urine.  ? uti, leuks and  blood in ua.
      Integument:  Multiple skin issues.  Stage 3 decubiti below lt glutael, rash on lt and rty gluteals extending down to just above knee joint, raised and pink area in rt lower abdomen, blister type area on rt upper thigh.  Lidocaine patches on mid thorax and lumbar spine.  
       Plan:  Monitor respiratory status and tx pnuemonia with abx, Tx uti, Assess present pain control methods and find alteratives.  Obtain first step bed if to be pt here for any lenght of time.

In [30]:
display.display(disch_notes.head(5))
print(disch_notes.loc[0,'text'][:500])
Hospital.Admission.ID batch.id subject_id category text cohort Subject.ID ICU.ID Note.Type Chart.time Category Real.time None Obesity Non.Adherence Developmental.Delay.Retardation Advanced.Heart.Disease Advanced.Lung.Disease Schizophrenia.and.other.Psychiatric.Disorders Alcohol.Abuse Other.Substance.Abuse Chronic.Pain.Fibromyalgia Chronic.Neurological.Dystrophies Advanced.Cancer Depression Dementia Unsure operator
0 100020 dis08apr16 9973 Discharge \r\r\r\rAdmission Date: [**2512-1-8**] Discharge Date: [**2512-1-18**]\r\rDate o… 1 9973 100020 Discharge 100020 Discharge 12/10/42 0:00 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 ETM
1 100020 dis08apr16 9973 Discharge \r\r\r\rAdmission Date: [**2512-1-8**] Discharge Date: [**2512-1-18**]\r\rDate o… 1 9973 100020 Discharge 100020 Discharge 12/10/42 0:00 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 JTW
2 100103 dis08apr16 3365 Discharge \r\r\rAdmission Date: [**3006-11-24**] Discharge Date: [**3006-12-4**]\r\rDate of Birth:… 1 3365 100103 Discharge 100103 Discharge 12/23/79 0:00 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ETM
3 100103 dis08apr16 3365 Discharge \r\r\rAdmission Date: [**3006-11-24**] Discharge Date: [**3006-12-4**]\r\rDate of Birth:… 1 3365 100103 Discharge 100103 Discharge 12/23/79 0:00 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 JTW
4 100137 dis08apr16 27290 Discharge \r\r\r\rAdmission Date: [**2751-12-21**] Discharge Date: [**2751-12-29**]\r\rDat… 0 27290 100137 Discharge 100137 Discharge 1/19/15 0:00 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 JFF
At OSH

Gather those patient’s data from the database and export

From extracts above, see that nursing notes are from MIMIC II (indicated by dates, also from Slack discussion). Discharge notes seem to be a combination, with chartdate in MIMIC III format, but MIMIC II dates in the notes themselves.

Joy (via Slack):

When we first pulled the notes,  only MIMIC II was available. However, MIMIC II did not have very good notes pulled from the raw clinical data. In particular, lots of discharge notes were missing. Nursing notes were more decent so we started annotating the nursing notes first. Then we got Leo's people to pull discharge notes from MIMIC III for us when it became ready

Approach:

  1. Extract list of all patients (subject_id) from notes files
  2. Extract those patients’ note metadata: note id, text md5sum, dates, type, icustayid, hadm_id
  3. Extract those patients’ icustayid info, including dates
  4. For each note in the notes file, try to match against a note in one of the databases
  5. Find the MIMIC III id data (subject_id, hadm_id, icustay_id, note_id)
  6. Output consistent file with annotations and MIMIC III metadata

Create union of all subject IDs in either set:

In [28]:
subject_ids = set(nursing_notes['subject_id']) | set(disch_notes['subject_id'])
len(subject_ids)
Out[28]:
850

Gather metadata from MIMIC II

Extract existing metadata, as well as information that can be used for matching:

  • md5 hash of original text - only useful if unchanged
  • text length, very rough matching
  • beginning and end of string, stripped of whitespace, template words (e.g. Admission Date), and de-id templates (e.g. [ ])
In [29]:
query = """
select subject_id, hadm_id, icustay_id, realtime, charttime, category, 
    md5(text) as "md5", length(text) as "length", 
    left(strip_text, 50) as "str_start", 
    right(strip_text, 50) as "str_end"
from (
    select *, regexp_replace(text, '\s|\[\*\*[^\*]+\*\*\]|Admission Date|Discharge Date|Date of Birth', '', 'g') as strip_text
    from mimic2v26.noteevents
    where category in ('Nursing/Other', 'DISCHARGE_SUMMARY')
    and subject_id in ({})
    ) as a
"""
In [30]:
m2_notes_meta = pd.read_sql(query.format(','.join([str(sid) for sid in subject_ids])), engine2)
In [31]:
m2_notes_meta.loc[m2_notes_meta['category'] == 'Nursing/Other','category'] = \
  'Nursing/other'
m2_notes_meta.loc[m2_notes_meta['category'] == 'DISCHARGE_SUMMARY','category'] = \
  'Discharge summary'    
In [32]:
m2_notes_meta.head(5)
Out[32]:
subject_id hadm_id icustay_id realtime charttime category md5 length str_start str_end
0 68 26139.0 80.0 2729-01-10 17:27:00 2729-01-10 17:15:00 Nursing/other 96be337b87d1cb0d5b75f4a78d3ea086 743 MICUnpn0700-1900EVENTS:CRRTd/cat11:00,familymeetin N:contsupportivecare,DNR/DNI,followlytes,cr,ph,I/O
1 68 26139.0 80.0 2729-01-10 17:47:00 2729-01-10 17:44:00 Nursing/other c292663acd8a7f0588c0d1d3540e7972 201 RespiratorytherapistPatientawakeandalertfullycoope kygreen,isonaerosolsetup,nebtxgivenviaaerosolmask.
2 68 26139.0 80.0 2729-01-05 18:41:00 2729-01-05 18:38:00 Nursing/other 6a4f1cba58521271a70c5bcab0731f83 308 focus;addendumtoabovenoteACID/BASE-PATIENTGIVEN1AM T.PATIENTUNABLETORAISESPUTUMSPEC.CUPLEFTATBEDSIDE.
3 68 20075.0 NaN None 2728-08-06 00:00:00 Discharge summary c3d30dddf890e3e0df9e55f002282c71 12695 :::Sex:FService:MEDICINEAllergies:Nevirapine/Abaca gnedelectronicallyby:DR.on:FRI12:26PM(EndofReport)
4 68 17608.0 78.0 2728-12-13 17:08:00 2728-12-13 16:51:00 Nursing/other 27572b36bd4c26c322f50cf65d095d16 1666 NSG0700-1900Events:Attemptedeatingbreakfastbutbeca ine..?doubleorsinglelumen.Continuetomonitoroutput.

Gather metadata from MIMIC III

In [33]:
query = """
select row_id, subject_id, hadm_id, chartdate, charttime, storetime, category,
    md5(text) as "md5", length(text) as "length", 
    left(strip_text, 50) as "str_start", 
    right(strip_text, 50) as "str_end"
from (
    select *, regexp_replace(text, '\s|\[\*\*[^\*]+\*\*\]|Admission Date|Discharge Date|Date of Birth', '', 'g') as strip_text
    from mimiciii.noteevents
    where category in ('Nursing/other', 'Discharge summary')
    and subject_id in ({})
    ) as a
"""
In [34]:
m3_notes_meta = pd.read_sql(query.format(','.join([str(sid) for sid in subject_ids])), engine3)
In [35]:
m3_notes_meta.head(5)
Out[35]:
row_id subject_id hadm_id chartdate charttime storetime category md5 length str_start str_end
0 206 5239 129387.0 2189-02-25 None None Discharge summary 5270bc4c100a7596abe47a8a35fb7b2b 11294 ::Service:CCUHISTORYOFPRESENTILLNESS:Thepatientisa DR.,12-462DictatedBy:MEDQUIST36D:01:50T:01:55JOB#:
1 207 5239 129387.0 2189-02-26 None None Discharge summary 3ee0b8238f19cbf2542d030fb832bdd9 6622 ::Service:CCUPleasenotethatthisintervaldictationta able.,M.D.DictatedBy:MEDQUIST36D:02:31T:02:38JOB#:
2 208 5239 125055.0 2189-03-25 None None Discharge summary 84ac577c59d21102c57e30c24bc9ac59 11012 ::Service:CCUHISTORYOFPRESENTILLNESS:Thepatientisa DR.,12-953DictatedBy:MEDQUIST36D:14:52T:15:19JOB#:
3 209 5239 125055.0 2189-03-27 None None Discharge summary 97b65f9c91ec1c9484eae1fb59c9f83a 1533 ::Service:CCUADDENDUM:DISCHARGEMEDICATIONS:1.Enala 12.953DictatedBy:MEDQUIST36D:13:26T:13:45JOB#:111R
4 13 15472 169182.0 2175-07-29 None None Discharge summary 06da02c60a769cd6dbc20cf0456553e7 8715 :::Sex:MService:HISTORYOFPRESENTILLNESS:Thisisa61- o2.5.,M.D.DictatedBy:MEDQUIST36D:11:01T:12:51JOB#:

Try to match to notes files

Add hash column

In [36]:
disch_notes['md5'] = disch_notes['text'].apply(lambda x: hashlib.md5(x.encode('utf-8')).hexdigest())
disch_notes['length'] = disch_notes['text'].apply(len)
disch_notes['str_start'] = disch_notes['text'].apply(lambda x: meu.clean_text(x)[:50])
disch_notes['str_end'] = disch_notes['text'].apply(lambda x: meu.clean_text(x)[-50:])
nursing_notes['md5'] = nursing_notes['text'].apply(lambda x: hashlib.md5(x.encode('utf-8')).hexdigest())
nursing_notes['length'] = nursing_notes['text'].apply(len)
nursing_notes['str_start'] = nursing_notes['text'].apply(lambda x: meu.clean_text(x)[:50])
nursing_notes['str_end'] = nursing_notes['text'].apply(lambda x: meu.clean_text(x)[-50:])

Verify unique notes present in annotation output

In [37]:
len(set(disch_notes['md5'])) == len(disch_notes['md5'])
Out[37]:
False
In [38]:
len(nursing_notes['md5']) == len(set(nursing_notes['md5']))
Out[38]:
False

Notes files Wouldn’t be unique because multiple annotators, that’s ok

Verify database also had unique notes

In [39]:
len(m2_notes_meta['md5']) == len(set(m2_notes_meta['md5']))
Out[39]:
False
In [40]:
len(m3_notes_meta['md5']) == len(set(m3_notes_meta['md5']))
Out[40]:
False
In [41]:
len(m2_notes_meta['md5'])
Out[41]:
44877
In [42]:
len(set(m2_notes_meta['md5']))
Out[42]:
44668

We see that there are duplicate entries, inspect…

In [43]:
g2 = m2_notes_meta.groupby('md5')
res = g2['length'].agg(['count', 'max', 'min']).sort_values('count', ascending=False).head(10)
In [44]:
res.head()
Out[44]:
count max min
md5
e1c06d85ae7b8b032bef47e42e4c08f9 24 2 2
e7ffa42fc2f47fd0e3eb1bc54283375e 11 111 111
911c1ca19932190220e77e42d5195265 10 90 90
a7affdb1bc220e7f415d87957a4982ae 8 88 88
6ebc51b75c7f15aff9040dbee1ea19e8 7 114 114

The most common repeat is only length 2, what is it?

In [45]:
repeat_rows = m2_notes_meta.loc[m2_notes_meta['md5'] == 'e1c06d85ae7b8b032bef47e42e4c08f9']
repeat_rows.head(2)
Out[45]:
subject_id hadm_id icustay_id realtime charttime category md5 length str_start str_end
122 154 18251.0 NaN None 2515-11-20 00:00:00 Discharge summary e1c06d85ae7b8b032bef47e42e4c08f9 2
3089 1883 5991.0 NaN None 3074-02-01 00:00:00 Discharge summary e1c06d85ae7b8b032bef47e42e4c08f9 2
In [46]:
conn2.execute("select text from mimic2v26.noteevents where subject_id=154 and category='DISCHARGE_SUMMARY'").fetchall()
Out[46]:
[('\n\n',)]

Just empty (2 new lines). What about the next most common?

In [47]:
repeat_rows = m2_notes_meta.loc[m2_notes_meta['md5'] == 'e7ffa42fc2f47fd0e3eb1bc54283375e']
repeat_rows.head(2)
Out[47]:
subject_id hadm_id icustay_id realtime charttime category md5 length str_start str_end
4704 2905 25976.0 3599.0 2710-05-16 14:48:00 2710-05-16 14:47:00 Nursing/other e7ffa42fc2f47fd0e3eb1bc54283375e 111 respiratorycareptontheventchangesmadetolwell.seere ell.seerespiratorypageofcarevueformoreinformation.
4713 2905 25976.0 3599.0 2710-05-18 14:54:00 2710-05-18 14:53:00 Nursing/other e7ffa42fc2f47fd0e3eb1bc54283375e 111 respiratorycareptontheventchangesmadetolwell.seere ell.seerespiratorypageofcarevueformoreinformation.
In [48]:
conn2.execute("select text from mimic2v26.noteevents where subject_id=2905 and md5(text)='e7ffa42fc2f47fd0e3eb1bc54283375e'").fetchall()
Out[48]:
[('\nrespiratory care\npt on the vent changes made tol well.  see respiratory page of carevue for more information.\n',),
 ('\nrespiratory care\npt on the vent changes made tol well.  see respiratory page of carevue for more information.\n',)]

Un-informative repeated data.

In [49]:
dat = [(r['subject_id'], r['md5']) for (ind, r) in m2_notes_meta.loc[:, ['subject_id', 'md5']].iterrows()]
In [50]:
len(dat)
Out[50]:
44877
In [51]:
len(set(dat))
Out[51]:
44732

Even including subject_id has repeats. Based on these, we should be able to left join the database data onto the file data, but there could be repeated rows which may confuse analysis. To guarantee no duplicated rows we’ll join including subject id, and also drop duplicates in the database dataframes.

In [52]:
print(m2_notes_meta.shape)
m2_notes_meta.drop_duplicates(['subject_id', 'md5'], inplace=True)
print(m2_notes_meta.shape)
(44877, 10)
(44732, 10)
In [53]:
print(m3_notes_meta.shape)
m3_notes_meta.drop_duplicates(['subject_id', 'md5'], inplace=True)
print(m3_notes_meta.shape)
(47843, 11)
(47693, 11)
In [54]:
nursing_notes[['subject_id', 'icustay_id', 'charttime', 'md5']].head(5)
Out[54]:
subject_id icustay_id charttime md5
0 24807 30800.0 3033-07-09 00:56:00 dd77b97f8b5da793773ceb4c56f32753
1 24807 30800.0 3033-07-09 00:56:00 dd77b97f8b5da793773ceb4c56f32753
2 5097 6331.0 3424-06-09 18:12:00 cd3717e9e2d408a1f75ac3052c677434
3 5097 6331.0 3424-06-09 18:12:00 cd3717e9e2d408a1f75ac3052c677434
4 23014 28601.0 3108-05-01 09:51:00 544d2097e111b56c294d09cc40dbd4b8

Verify that md5 from nursing notes from python will match a md5 from the database from postgres…

In [55]:
(m2_notes_meta['md5'] == 'dd77b97f8b5da793773ceb4c56f32753').sum()
Out[55]:
1

Extract subset of relevant columns, no duplicates

In [56]:
nurs_notes_meta = nursing_notes.loc[:, ['subject_id', 'Hospital.Admission.ID', 'icustay_id', 
                                        'realtime', 'charttime', 
                                        'md5', 'length', 'str_start', 'str_end']]
nurs_notes_meta.drop_duplicates(inplace=True)
nurs_notes_meta.rename(columns={'Hospital.Admission.ID': 'hadm_id'}, inplace=True)
nurs_notes_meta['category'] = 'Nursing/other'
nurs_notes_meta.reset_index(inplace=True)
nurs_notes_meta.head(5)
Out[56]:
index subject_id hadm_id icustay_id realtime charttime md5 length str_start str_end category
0 0 24807 2 30800.0 3033-07-09 01:22:00 3033-07-09 00:56:00 dd77b97f8b5da793773ceb4c56f32753 1758 MICU/SICUNURSINGPROGRESSNOTE.SEECAREVIEWFOROBJECTI .Obtainfirststepbediftobepthereforanylenghtoftime. Nursing/other
1 2 5097 18 6331.0 3424-06-09 18:29:00 3424-06-09 18:12:00 cd3717e9e2d408a1f75ac3052c677434 1057 NURSINGACCEPTANCENOTE.PTTRANSFERREDFROMHOSPITALTRA ONARYFIBROSISIWITHPOSSLUNGBIOSPY.A/PSTABLETRANSFER Nursing/other
2 4 23014 45 28601.0 3108-05-01 09:57:00 3108-05-01 09:51:00 544d2097e111b56c294d09cc40dbd4b8 869 NEURO;PTONPROPFOLAT25MCG/KG/MIN,GTTOFFPERSICUANDNE ATABGWHENPTRETURNS,WILLHAVERPTCHESTX-RAYUPONRETURN Nursing/other
3 5 9882 146 12307.0 3042-06-29 05:26:00 3042-06-29 04:32:00 b68ce445af04e1d28f24c72d5832282f 3070 Pmicunursingadmit/progress7p-7amrisa54yomanwithmul g,wantstobeupdatedwithanychanges-seemsconcernedand Nursing/other
4 6 7670 190 9557.0 2780-04-13 15:19:00 2780-04-13 14:52:00 ff751b526a693998c7dcf8dce1751fdf 1311 MICUnursingadmnotePtisyomaleadmfromwithUGIB.Seecar 700.TxforHct<28.ProtonixBID.NPOfornow.PtisDNR/DNI. Nursing/other
In [57]:
disch_notes_meta = disch_notes.loc[:, ['subject_id', 'Hospital.Admission.ID', 'icustay_id', 'Real.time', 'chartdate', 
                                       'md5', 'length', 'str_start', 'str_end']]
disch_notes_meta.drop_duplicates(inplace=True)
disch_notes_meta.rename(columns={'Hospital.Admission.ID': 'hadm_id'}, inplace=True)
disch_notes_meta['realtime'] = disch_notes['Real.time'].apply(meu.fix_date)
disch_notes_meta.loc[:, 'chartdate'] = disch_notes['chartdate'].apply(meu.fix_date)
disch_notes_meta.drop(['Real.time',], axis=1, inplace=True)
disch_notes_meta['category'] = 'Discharge summary'
disch_notes_meta.reset_index(inplace=True)
disch_notes_meta.head(5)
Out[57]:
index subject_id hadm_id icustay_id chartdate md5 length str_start str_end realtime category
0 0 9973 100020 NaN 2142-12-10 56f2598342cce321539d8975809d487c 16084 :::Sex:MService:MEDICINEAllergies:Percocet/Bactrim Time:4:00.3.GASTROINTESTINALCLINICat:(Completedby: 2142-12-10 Discharge summary
1 2 3365 100103 NaN 2142-12-10 eaea5c3c7577135a83f1f0fb583e0d53 7922 :::Sex:FService:CARDIOTHORACICSURGERYCHIEFCOMPLAIN ward.,M.D.DictatedBy:MEDQUIST36D:18:27T:18:47JOB#: 2142-12-10 Discharge summary
2 4 27290 100137 NaN 2142-12-10 e7433c0b75ea00346390f029bb830774 11585 :::Sex:MService:MEDICINEAllergies:IodineAttending: rin2weeks.PleasecallandmakeanapptwithyourPCPweeks. 2142-12-10 Discharge summary
3 6 5525 100473 NaN 2142-12-10 7d5ada7d4355a1551b9749a50c9a5471 18230 :::Sex:MService:MEDICINEAllergies:NeupogenAttendin within1weekofdischargefromrehabilitation.Phone:.MD 2142-12-10 Discharge summary
4 7 41515 100485 NaN 2142-12-10 6f6b2bf63a8c26780a37495b29086a59 8634 :::Sex:FService:CARDIOTHORACICAllergies:Penicillin undcheck6in2weeksasinstructedbynurse()Completedby: 2142-12-10 Discharge summary
In [58]:
ann_notes_meta = pd.concat([nurs_notes_meta, disch_notes_meta], axis=0)
print(ann_notes_meta.shape)
ann_notes_meta.head(5)
(1894, 12)
Out[58]:
category chartdate charttime hadm_id icustay_id index length md5 realtime str_end str_start subject_id
0 Nursing/other NaT 3033-07-09 00:56:00 2 30800.0 0 1758 dd77b97f8b5da793773ceb4c56f32753 3033-07-09 01:22:00 .Obtainfirststepbediftobepthereforanylenghtoftime. MICU/SICUNURSINGPROGRESSNOTE.SEECAREVIEWFOROBJECTI 24807
1 Nursing/other NaT 3424-06-09 18:12:00 18 6331.0 2 1057 cd3717e9e2d408a1f75ac3052c677434 3424-06-09 18:29:00 ONARYFIBROSISIWITHPOSSLUNGBIOSPY.A/PSTABLETRANSFER NURSINGACCEPTANCENOTE.PTTRANSFERREDFROMHOSPITALTRA 5097
2 Nursing/other NaT 3108-05-01 09:51:00 45 28601.0 4 869 544d2097e111b56c294d09cc40dbd4b8 3108-05-01 09:57:00 ATABGWHENPTRETURNS,WILLHAVERPTCHESTX-RAYUPONRETURN NEURO;PTONPROPFOLAT25MCG/KG/MIN,GTTOFFPERSICUANDNE 23014
3 Nursing/other NaT 3042-06-29 04:32:00 146 12307.0 5 3070 b68ce445af04e1d28f24c72d5832282f 3042-06-29 05:26:00 g,wantstobeupdatedwithanychanges-seemsconcernedand Pmicunursingadmit/progress7p-7amrisa54yomanwithmul 9882
4 Nursing/other NaT 2780-04-13 14:52:00 190 9557.0 6 1311 ff751b526a693998c7dcf8dce1751fdf 2780-04-13 15:19:00 700.TxforHct<28.ProtonixBID.NPOfornow.PtisDNR/DNI. MICUnursingadmnotePtisyomaleadmfromwithUGIB.Seecar 7670
In [59]:
m2_copy = m2_notes_meta.copy()
m2_copy.columns = [c if c in ['subject_id', 'md5'] else c+'_m2' for c in m2_copy.columns]
meta_with_m2 = pd.merge(ann_notes_meta, m2_copy, how='left', on=['subject_id', 'md5'])
In [60]:
meta_with_m2.head(5)
Out[60]:
category chartdate charttime hadm_id icustay_id index length md5 realtime str_end str_start subject_id hadm_id_m2 icustay_id_m2 realtime_m2 charttime_m2 category_m2 length_m2 str_start_m2 str_end_m2
0 Nursing/other NaT 3033-07-09 00:56:00 2 30800.0 0 1758 dd77b97f8b5da793773ceb4c56f32753 3033-07-09 01:22:00 .Obtainfirststepbediftobepthereforanylenghtoftime. MICU/SICUNURSINGPROGRESSNOTE.SEECAREVIEWFOROBJECTI 24807 2.0 30800.0 3033-07-09 01:22:00 3033-07-09 00:56:00 Nursing/other 1758.0 MICU/SICUNURSINGPROGRESSNOTE.SEECAREVIEWFOROBJECTI .Obtainfirststepbediftobepthereforanylenghtoftime.
1 Nursing/other NaT 3424-06-09 18:12:00 18 6331.0 2 1057 cd3717e9e2d408a1f75ac3052c677434 3424-06-09 18:29:00 ONARYFIBROSISIWITHPOSSLUNGBIOSPY.A/PSTABLETRANSFER NURSINGACCEPTANCENOTE.PTTRANSFERREDFROMHOSPITALTRA 5097 18.0 6331.0 3424-06-09 18:29:00 3424-06-09 18:12:00 Nursing/other 1057.0 NURSINGACCEPTANCENOTE.PTTRANSFERREDFROMHOSPITALTRA ONARYFIBROSISIWITHPOSSLUNGBIOSPY.A/PSTABLETRANSFER
2 Nursing/other NaT 3108-05-01 09:51:00 45 28601.0 4 869 544d2097e111b56c294d09cc40dbd4b8 3108-05-01 09:57:00 ATABGWHENPTRETURNS,WILLHAVERPTCHESTX-RAYUPONRETURN NEURO;PTONPROPFOLAT25MCG/KG/MIN,GTTOFFPERSICUANDNE 23014 45.0 28601.0 3108-05-01 09:57:00 3108-05-01 09:51:00 Nursing/other 869.0 NEURO;PTONPROPFOLAT25MCG/KG/MIN,GTTOFFPERSICUANDNE ATABGWHENPTRETURNS,WILLHAVERPTCHESTX-RAYUPONRETURN
3 Nursing/other NaT 3042-06-29 04:32:00 146 12307.0 5 3070 b68ce445af04e1d28f24c72d5832282f 3042-06-29 05:26:00 g,wantstobeupdatedwithanychanges-seemsconcernedand Pmicunursingadmit/progress7p-7amrisa54yomanwithmul 9882 146.0 12307.0 3042-06-29 05:26:00 3042-06-29 04:32:00 Nursing/other 3070.0 Pmicunursingadmit/progress7p-7amrisa54yomanwithmul g,wantstobeupdatedwithanychanges-seemsconcernedand
4 Nursing/other NaT 2780-04-13 14:52:00 190 9557.0 6 1311 ff751b526a693998c7dcf8dce1751fdf 2780-04-13 15:19:00 700.TxforHct<28.ProtonixBID.NPOfornow.PtisDNR/DNI. MICUnursingadmnotePtisyomaleadmfromwithUGIB.Seecar 7670 190.0 9557.0 2780-04-13 15:19:00 2780-04-13 14:52:00 Nursing/other 1311.0 MICUnursingadmnotePtisyomaleadmfromwithUGIB.Seecar 700.TxforHct<28.ProtonixBID.NPOfornow.PtisDNR/DNI.
In [61]:
m3_copy = m3_notes_meta.copy()
m3_copy.columns = [c if c in ['subject_id', 'md5'] else c+'_m3' for c in m3_copy.columns]
meta_m2_m3 = pd.merge(meta_with_m2, m3_copy, how='left', on=['subject_id', 'md5'], suffixes=['', '_m3'])
meta_m2_m3.head(5)
Out[61]:
category chartdate charttime hadm_id icustay_id index length md5 realtime str_end str_start subject_id hadm_id_m2 icustay_id_m2 realtime_m2 charttime_m2 category_m2 length_m2 str_start_m2 str_end_m2 row_id_m3 hadm_id_m3 chartdate_m3 charttime_m3 storetime_m3 category_m3 length_m3 str_start_m3 str_end_m3
0 Nursing/other NaT 3033-07-09 00:56:00 2 30800.0 0 1758 dd77b97f8b5da793773ceb4c56f32753 3033-07-09 01:22:00 .Obtainfirststepbediftobepthereforanylenghtoftime. MICU/SICUNURSINGPROGRESSNOTE.SEECAREVIEWFOROBJECTI 24807 2.0 30800.0 3033-07-09 01:22:00 3033-07-09 00:56:00 Nursing/other 1758.0 MICU/SICUNURSINGPROGRESSNOTE.SEECAREVIEWFOROBJECTI .Obtainfirststepbediftobepthereforanylenghtoftime. NaN NaN NaT NaN NaN NaN NaN NaN NaN
1 Nursing/other NaT 3424-06-09 18:12:00 18 6331.0 2 1057 cd3717e9e2d408a1f75ac3052c677434 3424-06-09 18:29:00 ONARYFIBROSISIWITHPOSSLUNGBIOSPY.A/PSTABLETRANSFER NURSINGACCEPTANCENOTE.PTTRANSFERREDFROMHOSPITALTRA 5097 18.0 6331.0 3424-06-09 18:29:00 3424-06-09 18:12:00 Nursing/other 1057.0 NURSINGACCEPTANCENOTE.PTTRANSFERREDFROMHOSPITALTRA ONARYFIBROSISIWITHPOSSLUNGBIOSPY.A/PSTABLETRANSFER NaN NaN NaT NaN NaN NaN NaN NaN NaN
2 Nursing/other NaT 3108-05-01 09:51:00 45 28601.0 4 869 544d2097e111b56c294d09cc40dbd4b8 3108-05-01 09:57:00 ATABGWHENPTRETURNS,WILLHAVERPTCHESTX-RAYUPONRETURN NEURO;PTONPROPFOLAT25MCG/KG/MIN,GTTOFFPERSICUANDNE 23014 45.0 28601.0 3108-05-01 09:57:00 3108-05-01 09:51:00 Nursing/other 869.0 NEURO;PTONPROPFOLAT25MCG/KG/MIN,GTTOFFPERSICUANDNE ATABGWHENPTRETURNS,WILLHAVERPTCHESTX-RAYUPONRETURN NaN NaN NaT NaN NaN NaN NaN NaN NaN
3 Nursing/other NaT 3042-06-29 04:32:00 146 12307.0 5 3070 b68ce445af04e1d28f24c72d5832282f 3042-06-29 05:26:00 g,wantstobeupdatedwithanychanges-seemsconcernedand Pmicunursingadmit/progress7p-7amrisa54yomanwithmul 9882 146.0 12307.0 3042-06-29 05:26:00 3042-06-29 04:32:00 Nursing/other 3070.0 Pmicunursingadmit/progress7p-7amrisa54yomanwithmul g,wantstobeupdatedwithanychanges-seemsconcernedand NaN NaN NaT NaN NaN NaN NaN NaN NaN
4 Nursing/other NaT 2780-04-13 14:52:00 190 9557.0 6 1311 ff751b526a693998c7dcf8dce1751fdf 2780-04-13 15:19:00 700.TxforHct<28.ProtonixBID.NPOfornow.PtisDNR/DNI. MICUnursingadmnotePtisyomaleadmfromwithUGIB.Seecar 7670 190.0 9557.0 2780-04-13 15:19:00 2780-04-13 14:52:00 Nursing/other 1311.0 MICUnursingadmnotePtisyomaleadmfromwithUGIB.Seecar 700.TxforHct<28.ProtonixBID.NPOfornow.PtisDNR/DNI. NaN NaN NaT NaN NaN NaN NaN NaN NaN
In [62]:
meta_m2_m3.shape
Out[62]:
(1894, 29)
In [63]:
(meta_m2_m3['category'] == 'Nursing/other').sum()
Out[63]:
999
In [64]:
(meta_m2_m3['category'] == 'Discharge summary').sum()
Out[64]:
895
In [65]:
meta_m2_m3['length_m3'].notnull().sum()
Out[65]:
0
In [66]:
meta_m2_m3['length_m2'].notnull().sum()
Out[66]:
998
In [67]:
((meta_m2_m3['category'] == 'Nursing/other') & meta_m2_m3['length_m2'].notnull()).sum()
Out[67]:
998

From this, almost all nursing notes were able to match to MIMIC 2, but nothing was able to match to MIMIC 3, and discharge summaries couldn’t be matched at all. Look into why discharge notes aren’t matching to MIMIC 3…

In [68]:
nurs_notes_meta.loc[0, 'md5'] in list(m2_notes_meta['md5'])
Out[68]:
True
In [69]:
nurs_notes_meta.loc[0, 'md5'] in list(m3_notes_meta['md5'])
Out[69]:
False
In [70]:
disch_notes_meta.loc[0, 'md5'] in list(m2_notes_meta['md5'])
Out[70]:
False
In [71]:
disch_notes_meta.loc[0, 'md5'] in list(m3_notes_meta['md5'])
Out[71]:
False
In [72]:
disch_notes_meta.loc[0, 'subject_id'] in list(m3_notes_meta['subject_id'])
Out[72]:
True
In [73]:
subject_id = disch_notes_meta.loc[0, 'subject_id']
note_length = disch_notes_meta.loc[0, 'length']
display.display(disch_notes_meta.loc[0,:])
m3_notes_meta.loc[(m3_notes_meta['subject_id'] == subject_id) &
                 (np.abs(m3_notes_meta['length']-note_length) < 200)]
index                                                          0
subject_id                                                  9973
hadm_id                                                   100020
icustay_id                                                   NaN
chartdate                                    2142-12-10 00:00:00
md5                             56f2598342cce321539d8975809d487c
length                                                     16084
str_start     :::Sex:MService:MEDICINEAllergies:Percocet/Bactrim
str_end       Time:4:00.3.GASTROINTESTINALCLINICat:(Completedby:
realtime                                     2142-12-10 00:00:00
category                                       Discharge summary
Name: 0, dtype: object
Out[73]:
row_id subject_id hadm_id chartdate charttime storetime category md5 length str_start str_end
834 6876 9973 100020.0 2142-12-10 None None Discharge summary 10577fde1d173468a939ce3cf19f0926 16132 :::Sex:MService:MEDICINEAllergies:Percocet/Bactrim 2.,MDPhone:Date/Time:4:00.3.CLINICat:(Completedby:
In [74]:
db_note = conn3.execute("""
select text 
from mimiciii.noteevents 
where subject_id=9973 
and category='Discharge summary'
and md5(text)='10577fde1d173468a939ce3cf19f0926'
""").fetchone()[0]
print(db_note[:500])
Admission Date:  [**2142-11-30**]              Discharge Date:   [**2142-12-10**]

Date of Birth:  [**2084-5-2**]             Sex:   M

Service: MEDICINE

Allergies:
Percocet / Bactrim Ds / Lisinopril

Attending:[**First Name3 (LF) 898**]
Chief Complaint:
hypotension

Major Surgical or Invasive Procedure:
none

History of Present Illness:
Mr. [**Known lastname 25925**] is a 58 yo m w/ multiple sclerosis and seizure
disorder who presented to an OSH for delusions and AMS x 2 days.
At OSH, he was n
In [75]:
print(disch_notes.loc[0, 'text'][:500])



Admission Date:  [**2512-1-8**]              Discharge Date:   [**2512-1-18**]

Date of Birth:  [**2453-6-10**]             Sex:   M

Service: MEDICINE

Allergies:
Percocet / Bactrim Ds / Lisinopril

Attending:[**First Name3 (LF) 886**]
Chief Complaint:
hypotension

Major Surgical or Invasive Procedure:
none

History of Present Illness:
Mr. [**Known patient lastname 25575**] is a 58 yo m w/ multiple sclerosis and seizure
disorder who presented to an OSH for delusions and AMS x 2 days.
At OSH

From this we see that the notes in our annotation files nearly match the notes in MIMIC III, but de-identification processes have changed.

In [76]:
note_ind = 0
subject_id = disch_notes_meta.loc[note_ind, 'subject_id']
hadm_id = disch_notes_meta.loc[note_ind, 'hadm_id']
note_length = disch_notes_meta.loc[note_ind, 'length']
chartdate = disch_notes_meta.loc[note_ind, 'chartdate']
display.display(disch_notes_meta.loc[note_ind,:])
m3_notes_meta.loc[(m3_notes_meta['subject_id'] == subject_id)]
m3_notes_meta.loc[(m3_notes_meta['subject_id'] == subject_id) &
                 (m3_notes_meta['hadm_id'] == hadm_id)]

# m3_notes_meta.loc[(m3_notes_meta['subject_id'] == subject_id) &
#                  (m3_notes_meta['chartdate'] == chartdate)]
index                                                          0
subject_id                                                  9973
hadm_id                                                   100020
icustay_id                                                   NaN
chartdate                                    2142-12-10 00:00:00
md5                             56f2598342cce321539d8975809d487c
length                                                     16084
str_start     :::Sex:MService:MEDICINEAllergies:Percocet/Bactrim
str_end       Time:4:00.3.GASTROINTESTINALCLINICat:(Completedby:
realtime                                     2142-12-10 00:00:00
category                                       Discharge summary
Name: 0, dtype: object
Out[76]:
row_id subject_id hadm_id chartdate charttime storetime category md5 length str_start str_end
834 6876 9973 100020.0 2142-12-10 None None Discharge summary 10577fde1d173468a939ce3cf19f0926 16132 :::Sex:MService:MEDICINEAllergies:Percocet/Bactrim 2.,MDPhone:Date/Time:4:00.3.CLINICat:(Completedby:
In [77]:
type('test')
Out[77]:
str
In [78]:
m3_copy = m3_notes_meta.copy()
join_cols = ['subject_id', 'category']
m3_copy.columns = [c if c in join_cols else c+'_m3' for c in m3_copy.columns]
meta_m2_m3 = pd.merge(meta_with_m2, m3_copy, how='left', on=join_cols, suffixes=['', '_m3'])
meta_m2_m3['len_diff_pct'] = np.abs(meta_m2_m3['length'] - meta_m2_m3['length_m3'])/meta_m2_m3['length']
meta_m2_m3['str_start_diff'] = meta_m2_m3.apply(lambda r: meu.similar(r['str_start'], r['str_start_m3']), axis=1)
meta_m2_m3['str_end_diff'] = meta_m2_m3.apply(lambda r: meu.similar(r['str_end'], r['str_end_m3']), axis=1)
meta_m2_m3 = meta_m2_m3.loc[meta_m2_m3['len_diff_pct'].isnull() |
                           ((meta_m2_m3['len_diff_pct'] < .1) &
                           (meta_m2_m3['str_start_diff'] < .1) & (meta_m2_m3['str_end_diff'] < .2))]
#meta_m2_m3 = meta_m2_m3.loc[(meta_m2_m3['len_diff_pct'] < .05)].sort_values('len_diff_pct', ascending=False)
meta_m2_m3.head(5)
Out[78]:
category chartdate charttime hadm_id icustay_id index length md5 realtime str_end str_start subject_id hadm_id_m2 icustay_id_m2 realtime_m2 charttime_m2 category_m2 length_m2 str_start_m2 str_end_m2 row_id_m3 hadm_id_m3 chartdate_m3 charttime_m3 storetime_m3 md5_m3 length_m3 str_start_m3 str_end_m3 len_diff_pct str_start_diff str_end_diff
35 Nursing/other NaT 3033-07-09 00:56:00 2 30800.0 0 1758 dd77b97f8b5da793773ceb4c56f32753 3033-07-09 01:22:00 .Obtainfirststepbediftobepthereforanylenghtoftime. MICU/SICUNURSINGPROGRESSNOTE.SEECAREVIEWFOROBJECTI 24807 2.0 30800.0 3033-07-09 01:22:00 3033-07-09 00:56:00 Nursing/other 1758.0 MICU/SICUNURSINGPROGRESSNOTE.SEECAREVIEWFOROBJECTI .Obtainfirststepbediftobepthereforanylenghtoftime. 1584558.0 152277.0 2145-06-29 2145-06-29 01:56:00 2145-06-29 02:22:00 31191f6b369f41481ac3af0aada9cece 1757.0 MICU/SICUNURSINGPROGRESSNOTE.SEECAREVIEWFOROBJECTI .Obtainfirststepbediftobepthereforanylenghtoftime. 0.000569 0.0 0.00
78 Nursing/other NaT 3424-06-09 18:12:00 18 6331.0 2 1057 cd3717e9e2d408a1f75ac3052c677434 3424-06-09 18:29:00 ONARYFIBROSISIWITHPOSSLUNGBIOSPY.A/PSTABLETRANSFER NURSINGACCEPTANCENOTE.PTTRANSFERREDFROMHOSPITALTRA 5097 18.0 6331.0 3424-06-09 18:29:00 3424-06-09 18:12:00 Nursing/other 1057.0 NURSINGACCEPTANCENOTE.PTTRANSFERREDFROMHOSPITALTRA ONARYFIBROSISIWITHPOSSLUNGBIOSPY.A/PSTABLETRANSFER 1329048.0 123560.0 2156-06-23 2156-06-23 19:12:00 2156-06-23 19:29:00 1523a8cad35a8522efcb9204ba422e78 1068.0 NURSINGACCEPTANCENOTE.PTTRANSFERREDFROMHOSPITALTRA ONARYFIBROSISIWITHPOSSLUNGBIOSPY.A/PSTABLETRANSFER 0.010407 0.0 0.00
304 Nursing/other NaT 3042-06-29 04:32:00 146 12307.0 5 3070 b68ce445af04e1d28f24c72d5832282f 3042-06-29 05:26:00 g,wantstobeupdatedwithanychanges-seemsconcernedand Pmicunursingadmit/progress7p-7amrisa54yomanwithmul 9882 146.0 12307.0 3042-06-29 05:26:00 3042-06-29 04:32:00 Nursing/other 3070.0 Pmicunursingadmit/progress7p-7amrisa54yomanwithmul g,wantstobeupdatedwithanychanges-seemsconcernedand 1392593.0 120908.0 2151-05-26 2151-05-26 05:32:00 2151-05-26 06:26:00 6c81cf3fcc35e16ef82e798c84e0d927 3084.0 Pmicunursingadmit/progress7p-7amrisa54yomanwithmul g,wantstobeupdatedwithanychanges-seemsconcernedand 0.004560 0.0 0.00
339 Nursing/other NaT 2780-04-13 14:52:00 190 9557.0 6 1311 ff751b526a693998c7dcf8dce1751fdf 2780-04-13 15:19:00 700.TxforHct<28.ProtonixBID.NPOfornow.PtisDNR/DNI. MICUnursingadmnotePtisyomaleadmfromwithUGIB.Seecar 7670 190.0 9557.0 2780-04-13 15:19:00 2780-04-13 14:52:00 Nursing/other 1311.0 MICUnursingadmnotePtisyomaleadmfromwithUGIB.Seecar 700.TxforHct<28.ProtonixBID.NPOfornow.PtisDNR/DNI. 1365393.0 131440.0 2178-05-10 2178-05-10 15:52:00 2178-05-10 16:19:00 d739b3e262bd69531192583048a28624 1322.0 MICUnursingadmnotePtisyomaleadmfromwithUGIB.Seecar ue1700.TxforHct<28.Protonix.NPOfornow.PtisDNR/DNI. 0.008391 0.0 0.06
392 Nursing/other NaT 2673-04-02 17:40:00 212 30507.0 9 291 3ea27b7ed36b5017c384ec9a443c21fa 2673-04-02 17:42:00 .0/5/5.PT.PROBABLYTOBEWEANEDANDEXTUBATEDLATER.,RRT RESPIRATORYCARE:PT.ISS/PCABGTODAY.FROMORTOCSRU.7.5 24573 212.0 30507.0 2673-04-02 17:42:00 2673-04-02 17:40:00 Nursing/other 291.0 RESPIRATORYCARE:PT.ISS/PCABGTODAY.FROMORTOCSRU.7.5 .0/5/5.PT.PROBABLYTOBEWEANEDANDEXTUBATEDLATER.,RRT 1581483.0 102522.0 2126-03-06 2126-03-06 17:40:00 2126-03-06 17:42:00 7f53514ef2bbb3c01b722391d1807930 289.0 RESPIRATORYCARE:PT.ISS/PCABGTODAY.FROMORTOCSRU.7.5 .0/5/5.PT.PROBABLYTOBEWEANEDANDEXTUBATEDLATER.,RRT 0.006873 0.0 0.00
In [79]:
meta_m2_m3.shape
Out[79]:
(1892, 32)
In [80]:
ann_notes_meta.shape
Out[80]:
(1894, 12)
In [81]:
(meta_m2_m3['category'] == 'Nursing/other').sum()
Out[81]:
962
In [82]:
(meta_m2_m3['category'] == 'Discharge summary').sum()
Out[82]:
930
In [83]:
meta_m2_m3['length_m3'].notnull().sum()
Out[83]:
1885
In [84]:
meta_m2_m3['length_m2'].notnull().sum()
Out[84]:
961
In [85]:
((meta_m2_m3['category'] == 'Nursing/other') & meta_m2_m3['length_m2'].notnull()).sum()
Out[85]:
961
In [86]:
ann_notes_meta.loc[ann_notes_meta['category'] == 'Nursing/other', 'md5'].unique().shape
Out[86]:
(999,)
In [87]:
ann_notes_meta.loc[ann_notes_meta['category'] == 'Discharge summary', 'md5'].unique().shape
Out[87]:
(850,)
In [88]:
ann_notes_meta.loc[ann_notes_meta['category'] == 'Nursing/other', 'md5'].unique().shape
Out[88]:
(999,)
In [89]:
meta_m2_m3.loc[meta_m2_m3['category'] == 'Discharge summary', 'md5'].unique().shape
Out[89]:
(848,)
In [90]:
meta_m2_m3.loc[meta_m2_m3['category'] == 'Nursing/other', 'md5'].unique().shape
Out[90]:
(962,)

Now almost all notes are matched to MIMIC 3, but have multiple potential matches. We need to choose the best match in each group…

In [91]:
meta_m2_m3.groupby(['subject_id', 'category', 'md5'])['length'].count().sort_values(ascending=False).head(10)
Out[91]:
subject_id  category           md5                             
808         Discharge summary  74daffdc6966b4cfefd5715014a80bdc    4
5239        Discharge summary  68d6e187035ae3f48b6ca643e342efee    4
5205        Discharge summary  cc71c046e575c30e49c6aa2a59f7fb2d    3
16112       Discharge summary  71549cf46bfab3260a09492929b37764    3
5205        Discharge summary  85fb9e679596b05e7864d121f3385d6f    3
                               a194fd48b16e3e58a92cac5ee605c946    3
16112       Discharge summary  1687ca2c1a63d73ea81d646dcb74daeb    3
                               e7156262a57f9d54e35a4c393d7db7f4    3
5727        Discharge summary  eef853d03d5d51db83892eb122195ad3    3
4064        Discharge summary  4bd226631baf9aad8d00ff0a11c871ed    3
Name: length, dtype: int64
In [92]:
meta_m2_m3.query('subject_id==808 & md5 == "74daffdc6966b4cfefd5715014a80bdc"')
Out[92]:
category chartdate charttime hadm_id icustay_id index length md5 realtime str_end str_start subject_id hadm_id_m2 icustay_id_m2 realtime_m2 charttime_m2 category_m2 length_m2 str_start_m2 str_end_m2 row_id_m3 hadm_id_m3 chartdate_m3 charttime_m3 storetime_m3 md5_m3 length_m3 str_start_m3 str_end_m3 len_diff_pct str_start_diff str_end_diff
85774 Discharge summary 2142-12-10 NaN 139077 NaN 690 10538 74daffdc6966b4cfefd5715014a80bdc 2142-12-10 00:00:00 :Date/Time:11:30Provider:,M.D.Phone:Date/Time:1:00 :::Sex:FService:MEDICINEAllergies:Patientrecordeda 808 NaN NaN NaN NaN NaN NaN NaN NaN 3776.0 139077.0 2181-05-16 None None 206805bd6a6857e866c557f683554609 10543.0 :::Sex:FService:MEDICINEAllergies:Patientrecordeda :Date/Time:11:30Provider:,M.D.Phone:Date/Time:1:00 0.000474 0.0 0.0
85775 Discharge summary 2142-12-10 NaN 139077 NaN 690 10538 74daffdc6966b4cfefd5715014a80bdc 2142-12-10 00:00:00 :Date/Time:11:30Provider:,M.D.Phone:Date/Time:1:00 :::Sex:FService:MEDICINEAllergies:Patientrecordeda 808 NaN NaN NaN NaN NaN NaN NaN NaN 3777.0 125152.0 2181-06-06 None None 7b21d1732ea01398d359ac49e502d610 11472.0 :::Sex:FService:MEDICINEAllergies:Patientrecordeda ne:Date/Time:1:00Provider:,.D.Phone:Date/Time:9:40 0.088632 0.0 0.1
85781 Discharge summary 2142-12-10 NaN 139077 NaN 692 10538 74daffdc6966b4cfefd5715014a80bdc 2142-12-10 00:00:00 :Date/Time:11:30Provider:,M.D.Phone:Date/Time:1:00 :::Sex:FService:MEDICINEAllergies:Patientrecordeda 808 NaN NaN NaN NaN NaN NaN NaN NaN 3776.0 139077.0 2181-05-16 None None 206805bd6a6857e866c557f683554609 10543.0 :::Sex:FService:MEDICINEAllergies:Patientrecordeda :Date/Time:11:30Provider:,M.D.Phone:Date/Time:1:00 0.000474 0.0 0.0
85782 Discharge summary 2142-12-10 NaN 139077 NaN 692 10538 74daffdc6966b4cfefd5715014a80bdc 2142-12-10 00:00:00 :Date/Time:11:30Provider:,M.D.Phone:Date/Time:1:00 :::Sex:FService:MEDICINEAllergies:Patientrecordeda 808 NaN NaN NaN NaN NaN NaN NaN NaN 3777.0 125152.0 2181-06-06 None None 7b21d1732ea01398d359ac49e502d610 11472.0 :::Sex:FService:MEDICINEAllergies:Patientrecordeda ne:Date/Time:1:00Provider:,.D.Phone:Date/Time:9:40 0.088632 0.0 0.1
In [93]:
meta_m2_m3['total_m3_distance'] = meta_m2_m3['len_diff_pct'] + meta_m2_m3['str_start_diff'] + meta_m2_m3['str_end_diff']
In [94]:
indexes = meta_m2_m3.groupby(['subject_id', 'category', 'md5']).apply(lambda r: r['total_m3_distance'].idxmin())
In [95]:
best_matches = meta_m2_m3.loc[indexes]
In [96]:
best_matches.shape
Out[96]:
(1810, 33)
In [97]:
ann_notes_meta.shape
Out[97]:
(1894, 12)
In [98]:
(best_matches['category'] == 'Nursing/other').sum()
Out[98]:
955
In [99]:
(best_matches['category'] == 'Discharge summary').sum()
Out[99]:
848
In [100]:
best_matches['length_m3'].notnull().sum()
Out[100]:
1803
In [101]:
best_matches['length_m2'].notnull().sum()
Out[101]:
954
In [102]:
((best_matches['category'] == 'Nursing/other') & best_matches['length_m2'].notnull()).sum()
Out[102]:
954
In [103]:
ann_notes_meta.loc[ann_notes_meta['category'] == 'Nursing/other', 'md5'].unique().shape
Out[103]:
(999,)
In [104]:
ann_notes_meta.loc[ann_notes_meta['category'] == 'Discharge summary', 'md5'].unique().shape
Out[104]:
(850,)
In [105]:
best_matches.loc[best_matches['category'] == 'Discharge summary', 'md5'].unique().shape
Out[105]:
(848,)
In [106]:
best_matches.loc[best_matches['category'] == 'Nursing/other', 'md5'].unique().shape
Out[106]:
(955,)
In [107]:
best_matches.head()
Out[107]:
category chartdate charttime hadm_id icustay_id index length md5 realtime str_end str_start subject_id hadm_id_m2 icustay_id_m2 realtime_m2 charttime_m2 category_m2 length_m2 str_start_m2 str_end_m2 row_id_m3 hadm_id_m3 chartdate_m3 charttime_m3 storetime_m3 md5_m3 length_m3 str_start_m3 str_end_m3 len_diff_pct str_start_diff str_end_diff total_m3_distance
52342.0 Nursing/other NaT 2728-12-13 16:51:00 17608.0 78.0 1162.0 1666.0 27572b36bd4c26c322f50cf65d095d16 2728-12-13 17:08:00 ine..?doubleorsinglelumen.Continuetomonitoroutput. NSG0700-1900Events:Attemptedeatingbreakfastbutbeca 68.0 17608.0 78.0 2728-12-13 17:08:00 2728-12-13 16:51:00 Nursing/other 1666.0 NSG0700-1900Events:Attemptedeatingbreakfastbutbeca ine..?doubleorsinglelumen.Continuetomonitoroutput. 1261174.0 170467.0 2173-12-16 2173-12-16 16:51:00 2173-12-16 17:08:00 a71a1009932ab1fce83b8867234ec9e4 1664.0 NSG0700-1900Events:Attemptedeatingbreakfastbutbeca ine..?doubleorsinglelumen.Continuetomonitoroutput. 0.001200 0.00 0.0 0.001200
87204.0 Discharge summary 2142-12-10 NaN 164029.0 NaN 1195.0 7751.0 f5f69772c32f1b0ac05b7cf408f7a6db 2142-12-10 00:00:00 rativeassistant,,atifyouneedtoconfirmorreschedule. :::Sex:FService:MEDICINEAllergies:PenicillinsAtten 109.0 NaN NaN NaN NaN NaN NaN NaN NaN 15322.0 164029.0 2140-01-21 None None 1770e41dbbe1a89a144bcebfad1fc87e 7731.0 :::Sex:FService:MEDICINEAllergies:PenicillinsAtten rativeassistant,,atifyouneedtoconfirmorreschedule. 0.002580 0.00 0.0 0.002580
76242.0 Nursing/other NaT 3353-03-09 02:13:00 26693.0 132.0 1761.0 2107.0 27d1f5907fa14b6702837a845f84c54e 3353-03-09 02:47:00 needtocontinuehemodialysisscheduleqtues/thurs/sat. ficunsgnote:21:30-7:00thisisa22y.o.womanadmwithlup 109.0 26693.0 132.0 3353-03-09 02:47:00 3353-03-09 02:13:00 Nursing/other 2107.0 ficunsgnote:21:30-7:00thisisa22y.o.womanadmwithlup needtocontinuehemodialysisscheduleqtues/thurs/sat. 1261605.0 193281.0 2140-04-08 2140-04-08 03:13:00 2140-04-08 03:47:00 6e88c96f2f3f7dab74d3732eec27dd35 2174.0 nsgnote:21:30-7:00thisisa22y.o.womanadmwithlupusdx needtocontinuehemodialysisscheduleqtues/thurs/sat. 0.031799 0.08 0.0 0.111799
36788.0 Nursing/other NaT 3352-12-19 16:48:00 12757.0 130.0 818.0 1294.0 3e0fff775cfb678fdfa06ece68ebfab5 3352-12-19 17:05:00 wnwhendialysiscompleted,LabetalolgttforSBP>180prn. NPN7A-7PSeecarevueforspecifics:Briefly22yroldfemal 109.0 12757.0 130.0 3352-12-19 17:05:00 3352-12-19 16:48:00 Nursing/other 1294.0 NPN7A-7PSeecarevueforspecifics:Briefly22yroldfemal wnwhendialysiscompleted,LabetalolgttforSBP>180prn. 1261597.0 164029.0 2140-01-19 2140-01-19 16:48:00 2140-01-19 17:05:00 d3350e7e8b6a9073feae2c0f11ea82e3 1293.0 NPN7A-7PSeecarevueforspecifics:Briefly22yroldfemal wnwhendialysiscompleted,LabetalolgttforSBP>180prn. 0.000773 0.00 0.0 0.000773
42257.0 Nursing/other NaT 3351-03-18 02:36:00 14234.0 129.0 932.0 3083.0 8efc0a2ff698b75ce183e3183c1bf204 3351-03-18 03:24:00 byremovingclonodinepatches.tore-evalanti-hypertens MICUA7P-7ASHIFTSUMMARYOFEVENTS:PTADMITTEDAT1930INH 109.0 14234.0 129.0 3351-03-18 03:24:00 3351-03-18 02:36:00 Nursing/other 3083.0 MICUA7P-7ASHIFTSUMMARYOFEVENTS:PTADMITTEDAT1930INH byremovingclonodinepatches.tore-evalanti-hypertens 1261589.0 128755.0 2138-04-17 2138-04-17 03:36:00 2138-04-17 04:24:00 b26423d357dc3009f6e063c405ce742e 3126.0 MICUA7P-7ASHIFTSUMMARYOFEVENTS:PTADMITTEDATINHYPER byremovingclonodinepatches.tore-evalanti-hypertens 0.013947 0.08 0.0 0.093947

Save output

In [108]:
cols_to_keep = [
    'subject_id', 'hadm_id', 'icustay_id',
    'category', 'chartdate', 'charttime', 'realtime',
    'length', 'md5', 'str_start', 'str_end', 
    'row_id_m3', 'chartdate_m3', 'charttime_m3', 'storetime_m3',
    'md5_m3', 'str_start_m3', 'str_end_m3',
    'len_diff_pct', 'str_start_diff', 'str_end_diff', 'total_m3_distance'
]
output_df = best_matches[cols_to_keep]
output_df.reset_index(inplace=True, drop=True)
In [109]:
output_df.head()
Out[109]:
subject_id hadm_id icustay_id category chartdate charttime realtime length md5 str_start str_end row_id_m3 chartdate_m3 charttime_m3 storetime_m3 md5_m3 str_start_m3 str_end_m3 len_diff_pct str_start_diff str_end_diff total_m3_distance
0 68.0 17608.0 78.0 Nursing/other NaT 2728-12-13 16:51:00 2728-12-13 17:08:00 1666.0 27572b36bd4c26c322f50cf65d095d16 NSG0700-1900Events:Attemptedeatingbreakfastbutbeca ine..?doubleorsinglelumen.Continuetomonitoroutput. 1261174.0 2173-12-16 2173-12-16 16:51:00 2173-12-16 17:08:00 a71a1009932ab1fce83b8867234ec9e4 NSG0700-1900Events:Attemptedeatingbreakfastbutbeca ine..?doubleorsinglelumen.Continuetomonitoroutput. 0.001200 0.00 0.0 0.001200
1 109.0 164029.0 NaN Discharge summary 2142-12-10 NaN 2142-12-10 00:00:00 7751.0 f5f69772c32f1b0ac05b7cf408f7a6db :::Sex:FService:MEDICINEAllergies:PenicillinsAtten rativeassistant,,atifyouneedtoconfirmorreschedule. 15322.0 2140-01-21 None None 1770e41dbbe1a89a144bcebfad1fc87e :::Sex:FService:MEDICINEAllergies:PenicillinsAtten rativeassistant,,atifyouneedtoconfirmorreschedule. 0.002580 0.00 0.0 0.002580
2 109.0 26693.0 132.0 Nursing/other NaT 3353-03-09 02:13:00 3353-03-09 02:47:00 2107.0 27d1f5907fa14b6702837a845f84c54e ficunsgnote:21:30-7:00thisisa22y.o.womanadmwithlup needtocontinuehemodialysisscheduleqtues/thurs/sat. 1261605.0 2140-04-08 2140-04-08 03:13:00 2140-04-08 03:47:00 6e88c96f2f3f7dab74d3732eec27dd35 nsgnote:21:30-7:00thisisa22y.o.womanadmwithlupusdx needtocontinuehemodialysisscheduleqtues/thurs/sat. 0.031799 0.08 0.0 0.111799
3 109.0 12757.0 130.0 Nursing/other NaT 3352-12-19 16:48:00 3352-12-19 17:05:00 1294.0 3e0fff775cfb678fdfa06ece68ebfab5 NPN7A-7PSeecarevueforspecifics:Briefly22yroldfemal wnwhendialysiscompleted,LabetalolgttforSBP>180prn. 1261597.0 2140-01-19 2140-01-19 16:48:00 2140-01-19 17:05:00 d3350e7e8b6a9073feae2c0f11ea82e3 NPN7A-7PSeecarevueforspecifics:Briefly22yroldfemal wnwhendialysiscompleted,LabetalolgttforSBP>180prn. 0.000773 0.00 0.0 0.000773
4 109.0 14234.0 129.0 Nursing/other NaT 3351-03-18 02:36:00 3351-03-18 03:24:00 3083.0 8efc0a2ff698b75ce183e3183c1bf204 MICUA7P-7ASHIFTSUMMARYOFEVENTS:PTADMITTEDAT1930INH byremovingclonodinepatches.tore-evalanti-hypertens 1261589.0 2138-04-17 2138-04-17 03:36:00 2138-04-17 04:24:00 b26423d357dc3009f6e063c405ce742e MICUA7P-7ASHIFTSUMMARYOFEVENTS:PTADMITTEDATINHYPER byremovingclonodinepatches.tore-evalanti-hypertens 0.013947 0.08 0.0 0.093947
In [110]:
output_path = pl.Path(path_config['repo_data_dir']).joinpath('mimic3_note_equivs_{}.csv'.format(time_str))
output_path
Out[110]:
PosixPath('../../data/mimic3_note_equivs_2016-10-23-16-35.csv')
In [111]:
output_df.to_csv(output_path.as_posix(), index=False)

Inspect questionable matches

In [112]:
questionable_matches = output_df.sort_values('total_m3_distance', ascending=False).reset_index(drop=True)
questionable_matches.head(10)
Out[112]:
subject_id hadm_id icustay_id category chartdate charttime realtime length md5 str_start str_end row_id_m3 chartdate_m3 charttime_m3 storetime_m3 md5_m3 str_start_m3 str_end_m3 len_diff_pct str_start_diff str_end_diff total_m3_distance
0 23150.0 25167.0 28766.0 Nursing/other NaT 2770-07-11 02:37:00 2770-07-11 05:04:00 145.0 6322d6d5756e7026dca157d84ce31813 CCUNPN2200-0700ADDENDUM:Pt.alertandconversingappro questionsre:POC.+gag,toleratedsipswatersoNGTd/c’d. 1564811.0 2170-07-14 2170-07-14 03:37:00 2170-07-14 06:04:00 c48ed6b8e6dd6d2cba585e1a9f803411 CCUNPN2200-0700ADDENDUM:Pt.alertandconversingappro 30.Askingquestionsre:POC.+gag,sipswatersoNGTd/c’d. 0.034483 0.00 0.18 0.214483
1 15209.0 3851.0 18883.0 Nursing/other NaT 3480-08-22 18:17:00 3480-08-22 18:43:00 1794.0 75230c2e09fca188032535033b290646 S/MICUNursingProgressNote7a-7pSeeCarevueforAdditio withHusbandwhoisherhealthcareproxy.Home:.Daughter: 1459958.0 2128-09-12 2128-09-12 19:17:00 2128-09-12 19:43:00 b26fce2c35098b9cede70247753937ae S/MICUNursingProgressNote7a-7pSeeCarevueforAdditio ivesathomewithHusbandwhoisherhealthcareproxy.:.er: 0.011706 0.00 0.20 0.211706
2 14467.0 15534.0 17978.0 Nursing/other NaT 3184-09-18 03:08:00 3184-09-18 04:45:00 2218.0 655d960f1d1bde683d2442f8464e3356 NURSINGADMISSION/PROGRESSNOTE:SEECAREVUEFOROJECTIV SUPPORTTOPTANDSPOUSE.SOCIALWORKCONSULTTOBEORDERED. 1450790.0 2118-08-09 2118-08-09 04:08:00 2118-08-09 05:45:00 2a0b8dbd546e25668b6a6b7c8fc08de9 NURSINGADMISSION/PROGRESSNOTE:SEECAREVUEFOROJECTIV EEMOTIONALSUPPORTTOPTANDSPOUSE.CONSULTTOBEORDERED. 0.006763 0.00 0.20 0.206763
3 4271.0 26247.0 5316.0 Nursing/other NaT 3110-09-29 04:36:00 3110-09-29 05:28:00 2059.0 52741b5215f9d83fdeed8b9b798abee6 AdmissionNote0000-0700Pt43y/omalewithPMHHepCcirrho elongingslefton10thfloorinhisroompriortoprocedure. 1317603.0 2192-10-11 2192-10-11 04:36:00 2192-10-11 05:28:00 cfc3be2e62869d6157d96e972a70f0a5 AdmissionNote0000-0700Pt43y/omalewithPMHHepCcirrho enthehasbelongingsleftoninhisroompriortoprocedure. 0.009713 0.00 0.18 0.189713
4 3506.0 3911.0 4353.0 Nursing/other NaT 3429-06-17 16:51:00 3429-06-17 16:54:00 148.0 d4f45b44f1c0808eec9b55f9623da36f RespCarePttransferredbacktofromRehabforfever,onmec pportA/C,trached,CVLplacedtoday,hemodialysistoday. 1307799.0 2195-08-12 2195-08-12 17:51:00 2195-08-12 17:54:00 4ba80f4e53c5510e499c06392e980bb1 RespCarePttransferredbacktofromRehabforfever,onmec hventsupportA/C,,CVLplacedtoday,hemodialysistoday. 0.047297 0.00 0.14 0.187297
5 976.0 14833.0 1206.0 Nursing/other NaT 3087-11-10 05:15:00 3087-11-10 05:31:00 2444.0 e75826f4da986bcae0048b2416d8dbcc CCUNSGADMIT/PROGRESSNOTE7P-7A/PERICARDIALEFFS-“MYB NCARE/PULMTOILET-KEEPPTANDFAMILYAWAREOFPLANOFCARE. 1274589.0 2179-09-30 2179-09-30 05:15:00 2179-09-30 05:31:00 ba633ab219b1b7c1cfbdea7177eb772d CCUNSGADMIT/PROGRESSNOTE7P-7A/PERICARDIALEFFS-“MYB ONS.SKINCARE/PULMKEEPPTANDFAMILYAWAREOFPLANOFCARE. 0.036416 0.00 0.14 0.176416
6 28226.0 30107.0 42448.0 Nursing/other NaT 2659-07-25 02:22:00 2659-07-25 02:36:00 435.0 3b8097352e459832483e90474b800fed PTTOROOM8N/SICUAFROMERFORACUTEABDPAINDXNECROTIZING SPLANMEDTREATRESTGISYSTEMANTIBIOTICSSUPPORTIVECARE 1631224.0 2193-04-29 2193-04-29 03:22:00 2193-04-29 03:36:00 3013f152c940ff3c13a694d533c2756d PTTON/SICUAFROMERFORACUTEABDPAINDXNECROTIZINGPANCR SPLANMEDTREATRESTGISYSTEMANTIBIOTICSSUPPORTIVECARE 0.075862 0.10 0.00 0.175862
7 23549.0 20671.0 29257.0 Nursing/other NaT 3088-10-22 17:03:00 3088-10-22 17:06:00 204.0 eb787359974fbdc61fd516f681ba91fb resp.carept.admittedfromrehabtoe.d.toficu.ventdepe 500/12/40%+5.toleratingwell.seertflowsheetformore. 1570383.0 2168-10-31 2168-10-31 18:03:00 2168-10-31 18:06:00 fa14961f4d9da0c9f57a77609ef635cd resp.carept.admittedfromrehabtoe.d.to.ventdependen 500/12/40%+5.toleratingwell.seertflowsheetformore. 0.093137 0.08 0.00 0.173137
8 22098.0 6383.0 27484.0 Nursing/other NaT 2766-04-07 04:10:00 2766-04-07 04:28:00 1056.0 04bd732c66af3b1c7fd65f80813575c9 NursingProgressNote1900-0700ReviewofSystems:Neuro: 3rdunitPRBC’sinfused.Probabletransfertofloortoday. 1552122.0 2162-05-27 2162-05-27 05:10:00 2162-05-27 05:28:00 2df8873036d77a1d0345b0112b19c378 NursingProgressNote1900-0700ReviewofSystems:Neuro: labswhen3rdunitPRBC’sinfused.transfertofloortoday. 0.005682 0.00 0.16 0.165682
9 46116.0 156119.0 NaN Discharge summary 2142-12-10 NaN 2142-12-10 00:00:00 19473.0 3926099a8a27e4eb6286b3d32e766865 :::Sex:FService:MEDICINEAllergies:PenicillinsAtten ding:LMBldg()BasementCampus:WESTBestParking:Garage 54790.0 2153-09-01 None None 3cce112cb2b1dc44f3548c55fde20969 :::Sex:FService:MEDICINEAllergies:PenicillinsAtten :,MDBuilding:LMBldg()Campus:WESTBestParking:Garage 0.001900 0.00 0.16 0.161900
In [113]:
def compare_texts(out_row):
    query = """
    select text 
    from mimiciii.noteevents 
    where subject_id={subject_id} and row_id={row_id_m3}
    """.format(**out_row)
    mimic3_txt = conn3.execute(query).fetchone()[0]

    if out_row['category'] == 'Nursing/other':
        ann_txt = nursing_notes.loc[nursing_notes['md5'] == out_row['md5']].iloc[0]['text']
    else:
        ann_txt = disch_notes.loc[nursing_notes['md5'] == out_row['md5']].iloc[0]['text']
    
    print('MIMIC 3 text:\n{}'.format(mimic3_txt))
    print('Text from annotations:\n{}'.format(ann_txt))
In [114]:
compare_texts(questionable_matches.iloc[0])
MIMIC 3 text:
CCU NPN 2200-0700
ADDENDUM:
Pt. alert and conversing appropriately ~0530. Asking questions re: POC. + gag, [** **] sips water so NGT d/c'd.

Text from annotations:

CCU NPN 2200-0700
ADDENDUM: 
Pt. alert and conversing appropriately ~0530. Asking questions re: POC. + gag, tolerated sips water so NGT d/c'd. 

In [115]:
compare_texts(questionable_matches.iloc[1])
MIMIC 3 text:
S/MICU Nursing Progress Note 7a-7p
See Carevue for Additional Objective Data
ROS:
Resp:4l NC with RR:[**10-9**]/min with SpO2:93-98% BS:crackles at bases with intermittent fine exp wheezes anteriorly. treated with albuterol inhaler x1. No cough or sputum production noted

ID:T max:100. po, unclear source of infection Urine cx, BC:png.
Started on Zosyn and received dose of vanco in EW. Abd U/S: negative

CV:SBP on admission 72-80 responded transiently to IVF. SBP down to 62/ and Levophed started. Titrated up to .057 mcg/kg/min, BP: stabilized to 92-119/ HR:94-103 SR-ST with rare PVC

GU:BUN/Cr:29/1.6 received total of 5 L IVF via boluses in EW and MICU with U/O:>130-250cc/hr clear yellow urine. I/O's +3500cc since presentation in EW. Na:127 (up),K:3.9, Ca:6.8, received 2 amps CaGluconate. BS:62-69, asymptomatic. Received OJ for BS:62

GI:Abd soft, non-tender, +BS, no stool production, NPO. No overt S7Sx of bleeding. hct:23-24. Ordered for transfusion, consent obtained at 18:45

Neuro:Alert with periods of lethary, orientation x1-2. PERRL, speech clear, follows commands. Husband reports that she has had periods of confusion over the past several days. Head CT:negative

Skin:Large sacral decub with pink base with areas of white/yellow. Plastics consulted this morning, performed debridement at bedside followed by W to Dry dressing. Also noted ecchymosis on R heel and .25 cent sized breakdown on R achilles, crusted over, no drainage. placed on first step therapeutic bed

Access:R brachial PICC, L femoral multi lumen, #20 PIV

Dispo:full code

Social: Lives at home with Husband [**Name (NI) **] who is her healthcare proxy. [**Name (NI) **]:[**Telephone/Fax (1) 9306**]. [**Name2 (NI) **]er [**First Name4 (NamePattern1) 9309**] [**Last Name (NamePattern1) 9310**]:[**Telephone/Fax (1) 9307**]

Text from annotations:

S/MICU Nursing Progress Note 7a-7p
See Carevue for Additional Objective Data
ROS:
Resp:4l NC with RR:[**9-18**]/min with SpO2:93-98% BS:crackles at bases with intermittent fine exp wheezes anteriorly. treated with albuterol inhaler x1. No cough or sputum production noted

ID:T max:100. po, unclear source of infection Urine cx, BC:png.
Started on Zosyn and received dose of vanco in EW. Abd U/S: negative

CV:SBP on admission 72-80 responded transiently to IVF. SBP down to 62/ and Levophed started. Titrated up to .057 mcg/kg/min, BP: stabilized to 92-119/ HR:94-103 SR-ST with rare PVC

GU:BUN/Cr:29/1.6 received total of 5 L IVF via boluses in EW and MICU with U/O:>130-250cc/hr clear yellow urine. I/O's +3500cc since presentation in EW. Na:127 (up),K:3.9, Ca:6.8, received 2 amps CaGluconate. BS:62-69, asymptomatic. Received OJ for BS:62

GI:Abd soft, non-tender, +BS, no stool production, NPO. No overt S7Sx of bleeding. hct:23-24. Ordered for transfusion, consent obtained at 18:45

Neuro:Alert with periods of lethary, orientation x1-2. PERRL, speech clear, follows commands. Husband reports that she has had periods of confusion over the past several days. Head CT:negative

Skin:Large sacral decub with pink base with areas of white/yellow. Plastics consulted this morning, performed debridement at bedside followed by W to Dry dressing. Also noted ecchymosis on R heel and .25 cent sized breakdown on R achilles, crusted over, no drainage. placed on first step therapeutic bed

Access:R brachial PICC, L femoral multi lumen, #20 PIV

Dispo:full code

Social: Lives at home with Husband [**Name (NI) **] who is her healthcare proxy. Home:[**Telephone/Fax (1) 6887**]. Daughter [**First Name4 (NamePattern1) 6889**] [**Last Name (NamePattern1) 6890**]:[**Telephone/Fax (1) 6888**] 

Supporting code

mimic_extraction_utils.py download
import re
import pandas as pd
from difflib import SequenceMatcher

def clean_text(in_text):
    e = re.compile(r"\s|\[\*\*[^\*]+\*\*\]|Admission Date|Discharge Date|Date of Birth")
    out_text = e.sub("", in_text)
    return(out_text)

def fix_date(date_str):
    """Dates in the discharge file are a funky format, standardize
    """
    
    months = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 
              'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']
    month_inds = {m: m_ind for (m, m_ind) in zip(months, range(1, len(months)+1))}
    
    m = re.match('(\d{2})-(\w{3})-(\d{4}) (\d\d:\d\d:\d\d)', '10-DEC-2142 00:00:00')
    if len(m.groups()) > 0:
        (day, month, year, time) = m.groups()
        new_str = pd.Timestamp('{}-{}-{} {}'.format(year, month_inds[month], day, time))
    else:
        new_str = date_str
    return new_str

def similar(a, b):
    if isinstance(a, str) and isinstance(b, str):
        return 1-SequenceMatcher(None, a, b).ratio()
    else:
        return 9999
etc_utils.py download
import pandas as pd

from datetime import timedelta, datetime
import matplotlib.pyplot as plt

import time
import pathlib as pl
import yaml
import os

import logging
logger = logging.getLogger()

def load_config_v2(config_file='config.yaml', 
                   creds_file='credentials.yaml'):
    time_str = time.strftime('%Y-%m-%d-%H-%M')

    with open(config_file, 'r') as conf_file:
        conf = yaml.load(conf_file)

    defaults = {
        'log_dir': './',
        'input_dir': './',
        'results_dir': './',
        'repo_data_dir': './'
    }
    path_config = dict()
    if 'paths' in conf:
        for k in defaults.keys():
            path_config[k] = os.path.expanduser(conf['paths'].get(k, defaults[k]))
    
    creds = None

    with open(creds_file, 'r') as creds_file:
        creds_dat = yaml.load(creds_file)
        creds = creds_dat.copy()
        
    return time_str, path_config, creds

def load_config(config_file='config.yaml', 
                creds_file='credentials.yaml'):
    time_str = time.strftime('%Y-%m-%d-%H-%M')

    with open(config_file, 'r') as conf_file:
        conf = yaml.load(conf_file)

    defaults = {
        'log_dir': './',
        'input_dir': './',
        'results_dir': './',
        'repo_data_dir': './'
    }
    path_config = dict()
    if 'paths' in conf:
        for k in defaults.keys():
            path_config[k] = os.path.expanduser(conf['paths'].get(k, defaults[k]))
    
    pg_creds = None

    with open(creds_file, 'r') as creds_file:
        creds_dat = yaml.load(creds_file)

    if 'postgres' in creds_dat:
        pg_creds = creds_dat['postgres']
        
    return time_str, path_config, pg_creds

def configure_logging(logger, work_desc, log_directory=None, time_str=None):
    if time_str is None:
        time_str = time.strftime('%Y-%m-%d-%H-%M')
    
    if log_directory is None:
        log_directory = pl.Path('.')
    
    
    if isinstance(log_directory, str):
        log_directory = pl.Path(log_directory)
    
    logger.setLevel(logging.DEBUG)

    log_file = log_directory.joinpath(time_str + '_' + work_desc.replace(' ', '_') + '.log').as_posix()
    print('Logging to {}'.format(log_file))
    logger.info('Logging to {}'.format(log_file))

    # create file handler which logs even debug messages
    fh = logging.FileHandler(log_file)
    fh.setLevel(logging.DEBUG)

    # create console handler with a higher log level
    ch = logging.StreamHandler()
    ch.setLevel(logging.INFO)

    # create formatter and add it to the handlers
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    fh.setFormatter(formatter)
    ch.setFormatter(formatter)
    # add the handlers to the logger
    logger.addHandler(fh)
    logger.addHandler(ch)
environment.yml download
# run: conda env create --file environment.yml
name: freq_fliers_notebooks
dependencies:
- anaconda
- jupyter
- jupyter-nbconvert
- matplotlib
- nltk
- numpy>=1.11.1
- pandas>=0.18.1
- pip
- psycopg2
- python==3.5.2
- scipy
- scikit-learn>=0.18
- sqlalchemy>=1.1.3
- pip:
  - pycorenlp
  - imbalanced-learn
Makefile download
all: last_output.txt

last_output.txt: *ipynb
	mkdir -p nbconvert/html
	jupyter-nbconvert *ipynb --to python --output-dir=nbconvert/python
#	mkdir -p nbconvert/python
#	jupyter-nbconvert "$?" --to html --output-dir=nbconvert/html
	date > last_output.txt
	git rev-parse HEAD >> last_output.txt

clean:
	rm -rf nbconvert
	rm -f last_output.txt

Comments

comments powered by Disqus