One tricky thing to deal with in the transition from MIMICII to MIMICIII has been that IDs have
changed and are sometimes missing, so work that a collaborating team did to annotate patient notes
in MIMICII does not translate directly to MIMICIII. This notebook shows one way to discover
note relationships between the two datasets.
I’ve been working with the fantastic MIMIC dataset for several years now, and have been excited
by the new MIMICIII release. It cleans up a lot of the old structure, adds content, and generally
is just a lot nicer to use. (Thanks to all involved!) In the process of developing the new
system the old way of identifying patients was reset, making it difficult to translate work from
MIMICII to MIMICIII.
As an example, in the previous dataset the team had created annotations such as “Patient 1’s 3rd
ICU stay has notes with indications of advanced heart failure” and “Patient 1’s 7th ICU stay has
notes with indication of substance abuse”. In the new dataset the patient numbers are consistent
but the ICU stay indications are removed and times are all shifted differently, so we can’t directly
apply our old annotations to the new notes. Ideally one could just
take a hash of the notes (e.g. MD5) and create a mapping, but the de-identification process changed
as well, so the same original note text will be different in the two output datasets we have
access to.
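To make the hashing problem concrete, here is a minimal sketch using one line from the discharge summary excerpts shown later in this notebook. The helper name `md5_hex` is just for illustration; the two strings are the same source line as it appears in MIMICIII and in the annotation file.

```python
import hashlib


def md5_hex(text):
    """MD5 of the UTF-8 encoded text, as a hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Same source line, de-identified differently in the two datasets.
mimic3_line = "Attending:[**First Name3 (LF) 898**]"
annotated_line = "Attending:[**First Name3 (LF) 886**]"

hashes_match = md5_hex(mimic3_line) == md5_hex(annotated_line)
print(hashes_match)  # False: a single changed placeholder changes the hash
```

Any difference in a de-identification placeholder is enough to break an exact-hash mapping, which is why the rest of the notebook moves to fuzzy matching.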
The first part of the notebook goes through the following steps:
Connect to MIMICII and MIMICIII databases using SQLAlchemy
Load notes labeled by our annotators
Verify that the annotated notes match the MIMICII source (expected)
Check whether they match the MIMICIII data (they did not)
I did find that subject IDs were consistent between datasets - an observation that greatly reduces
the potential search space of matches for each note. Using this observation I was able to take
each annotated note and only try to match it to each of the same subject’s notes in MIMICIII. To
identify matches, I first tried MD5 hashes - this did not work as the de-identification procedure
changed between datasets (the same source note was presented differently in the two datasets,
causing the hash to differ). I finally settled on a distance heuristic combining the overall note
length, the similarity of the beginning of the note (first several hundred characters, excluding
whitespace), and the similarity of the end of the note (last several hundred characters). When there
were multiple possible matches for a note, the note with the aggregate lowest distance was chosen
as the most probable match.
New Approach:
Gather all notes from MIMICII for each subject
Gather all notes from MIMICIII for the same subjects
For each subject’s notes in MIMICII, calculate the following metrics against that subject’s MIMICIII
notes:
Difference in note length
Similarity of the text beginning (as evaluated by SequenceMatcher)
Similarity of the text ending
Any note pair with more than a 10%-20% difference in any of these metrics is considered not to match
If there are multiple potential matches remaining, the notes with the minimum overall distance are
considered best matches.
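The steps above can be sketched roughly as follows. This is a simplified illustration rather than the exact notebook code: the function names, the 300-character window, the 0.2 threshold, and the choice to sum the three components into one overall distance are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """SequenceMatcher ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def note_distance(old, new, n=300):
    """Return (length_diff, start_diff, end_diff), each in [0, 1]."""
    # Compare beginnings and endings with all whitespace removed.
    old_s = "".join(old.split())
    new_s = "".join(new.split())
    longest = max(len(old), len(new), 1)
    length_diff = abs(len(old) - len(new)) / longest
    start_diff = 1.0 - similarity(old_s[:n], new_s[:n])
    end_diff = 1.0 - similarity(old_s[-n:], new_s[-n:])
    return length_diff, start_diff, end_diff


def best_match(old_note, candidates, threshold=0.2, n=300):
    """Pick the candidate with the lowest summed distance.

    candidates: dict mapping a note id to note text, restricted to the
    same subject. Returns None if every candidate exceeds the threshold
    on at least one metric.
    """
    best_id, best_total = None, None
    for note_id, text in candidates.items():
        parts = note_distance(old_note, text, n)
        if max(parts) > threshold:
            continue  # too different on at least one metric
        total = sum(parts)
        if best_total is None or total < best_total:
            best_id, best_total = note_id, total
    return best_id
```

Calling `best_match(old_text, {row_id: text, ...})` with one subject's MIMICIII notes returns the most probable `row_id`, or `None` when nothing is close enough.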
In this case thresholds for the heuristic distance measure and overall matches were chosen by
inspection, which gave reasonable results. A larger problem, or a problem with a less
constrained population for potential note matches, would require a different approach - e.g.
pre-clustering (by topic, word frequency, etc), making use of extracted patient demographics (sex,
age, race), or likely diagnoses. One could also apply more rigorous optimization for the thresholds
to balance false positives and false negatives, or could apply automated approaches to developing
distance thresholds (e.g. based on semisupervised learning).
Notes that were evaluated were pulled from both MIMIC 2 and MIMIC 3 due to availability of data at the time of extraction. This causes difficulty when trying to pair the notes with structured data for the purposes of relating the notes to the overall patient context.
This notebook takes annotated note files as input, collects the necessary information from MIMIC 2 and 3 to find the note in MIMIC 3, and creates a new annotation file with the MIMIC 3 data.
MICU/SICU NURSING PROGRESS NOTE.
SEE CAREVIEW FOR OBJECTIVE DATA.
Neuro: Arouses with verbal stimuli, ms is variable and changes from a&o x 3 to a&o x 1 with moaning episodes, repetitive speach and repetative motions with rue. Pupils are rt 4mm, lt 3mm and sluggish bilat. Head ct completed and was negative. Was given versed 1 mg during ct and pt became much more calm and was able to answer questions appropriatly. Reports able to feel in all 4 extrem., Partial movement of rt ue and lt ue.
Respiratory: Lung sounds are coarse throughout, diminished in lt base. RR 12-24 and non-labored except when having moaning episodes. O2 saturation is 94-100% on 4l nc. Cxr in ed showed persistent lll pneumonia and atelectasis in rt upper field. Expectorating thick tan sputum in abundant amounts.
CV: Sinus rythm, rate 70-98 with no ectopy noted. Nbp 104- 134 systolic. Good pulses all 4 extrem.
GI/GU: Abdomen si softly distended with + bs. Pt is able to take regular meals, is full feed, but must be bolt upright. Able to take meds with custard or jello. No bm this shift but reports bm yesterday. Foley catheter is 30fr placed at os facility, occn leaks around catheter. Patetn and draing clear amber urine. ? uti, leuks and blood in ua.
Integument: Multiple skin issues. Stage 3 decubiti below lt glutael, rash on lt and rty gluteals extending down to just above knee joint, raised and pink area in rt lower abdomen, blister type area on rt upper thigh. Lidocaine patches on mid thorax and lumbar spine.
Plan: Monitor respiratory status and tx pnuemonia with abx, Tx uti, Assess present pain control methods and find alteratives. Obtain first step bed if to be pt here for any lenght of time.
Gather those patients’ data from the database and export
From the extracts above, we see that the nursing notes are from MIMICII (indicated by the dates, and also by the Slack discussion). Discharge notes seem to be a combination, with chartdate in MIMICIII format, but MIMICII dates in the notes themselves.
Joy (via Slack):
When we first pulled the notes, only MIMIC II was available. However, MIMIC II did not have very good notes pulled from the raw clinical data. In particular, lots of discharge notes were missing. Nursing notes were more decent so we started annotating the nursing notes first. Then we got Leo's people to pull discharge notes from MIMIC III for us when it became ready
Approach:
Extract list of all patients (subject_id) from notes files
Extract those patients’ note metadata: note id, text md5sum, dates, type, icustayid, hadm_id
Extract those patients’ icustayid info, including dates
For each note in the notes file, try to match against a note in one of the databases
Find the MIMICIII id data (subject_id, hadm_id, icustay_id, note_id)
Output consistent file with annotations and MIMICIII metadata
Extract existing metadata, as well as information that can be used for matching:
md5 hash of original text - only useful if unchanged
text length, very rough matching
beginning and end of string, stripped of whitespace, template words (e.g. Admission Date), and de-id templates (e.g. [])
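For notes that live in the annotation files rather than the database, the same features can be computed in Python. The sketch below mirrors the SQL used in the queries that follow; the function name `note_features` is hypothetical, and hashing the UTF-8 bytes is an assumption about how the database encodes the text.

```python
import hashlib
import re

# Same pattern as the regexp_replace call in the SQL queries: whitespace,
# de-id placeholders such as [**...**], and a few template words.
STRIP = re.compile(
    r"\s|\[\*\*[^\*]+\*\*\]|Admission Date|Discharge Date|Date of Birth"
)


def note_features(text, n=50):
    """Return the matching features for one note as a dict."""
    stripped = STRIP.sub("", text)
    return {
        "md5": hashlib.md5(text.encode("utf-8")).hexdigest(),
        "length": len(text),
        "str_start": stripped[:n],   # like left(strip_text, 50)
        "str_end": stripped[-n:],    # like right(strip_text, 50)
    }
```

Because dates and placeholders are stripped before taking `str_start` and `str_end`, those two features can agree between datasets even when the MD5 does not.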
In [29]:
# MIMICII: note metadata plus matching features for the annotated subjects.
# Raw string so the regex backslashes reach PostgreSQL unchanged.
query = r"""
select subject_id, hadm_id, icustay_id, realtime, charttime, category,
       md5(text) as "md5", length(text) as "length",
       left(strip_text, 50) as "str_start", right(strip_text, 50) as "str_end"
from (
    select *, regexp_replace(text, '\s|\[\*\*[^\*]+\*\*\]|Admission Date|Discharge Date|Date of Birth', '', 'g') as strip_text
    from mimic2v26.noteevents
    where category in ('Nursing/Other', 'DISCHARGE_SUMMARY')
      and subject_id in ({})
) as a
"""
# MIMICIII: the same features, with row_id as the note identifier.
query = r"""
select row_id, subject_id, hadm_id, chartdate, charttime, storetime, category,
       md5(text) as "md5", length(text) as "length",
       left(strip_text, 50) as "str_start", right(strip_text, 50) as "str_end"
from (
    select *, regexp_replace(text, '\s|\[\*\*[^\*]+\*\*\]|Admission Date|Discharge Date|Date of Birth', '', 'g') as strip_text
    from mimiciii.noteevents
    where category in ('Nursing/other', 'Discharge summary')
      and subject_id in ({})
) as a
"""
conn2.execute("select text from mimic2v26.noteevents where subject_id=2905 and md5(text)='e7ffa42fc2f47fd0e3eb1bc54283375e'").fetchall()
Out[48]:
[('\nrespiratory care\npt on the vent changes made tol well. see respiratory page of carevue for more information.\n',),
('\nrespiratory care\npt on the vent changes made tol well. see respiratory page of carevue for more information.\n',)]
Even when subject_id is included alongside the md5 hash, there are repeated notes. Based on these, we should be able to left join the database data onto the file data, but there could be repeated rows which may confuse analysis. To guarantee no duplicated rows we’ll join on subject id as well as the hash, and also drop duplicates in the database dataframes.
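As a plain-Python stand-in for the DataFrame operations described here, the sketch below shows why duplicates have to be dropped before the left join. The helper names are hypothetical, and the `charttime` and `label` values are made up for illustration; only the subject id and hash come from the output above.

```python
from collections import defaultdict


def drop_duplicates(rows, keys):
    """Keep the first row seen for each combination of key values."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def left_join(left_rows, right_rows, keys):
    """Left join: one output row per matching right row, or one bare
    left row when nothing matches."""
    index = defaultdict(list)
    for r in right_rows:
        index[tuple(r[k] for k in keys)].append(r)
    joined = []
    for row in left_rows:
        matches = index.get(tuple(row[k] for k in keys)) or [{}]
        joined.extend({**m, **row} for m in matches)
    return joined


KEYS = ("subject_id", "md5")
md5 = "e7ffa42fc2f47fd0e3eb1bc54283375e"
db_rows = [
    {"subject_id": 2905, "md5": md5, "charttime": "t1"},  # made-up time
    {"subject_id": 2905, "md5": md5, "charttime": "t2"},  # made-up time
]
file_rows = [{"subject_id": 2905, "md5": md5, "label": "example"}]

naive = left_join(file_rows, db_rows, KEYS)
deduped = left_join(file_rows, drop_duplicates(db_rows, KEYS), KEYS)
print(len(naive), len(deduped))  # 2 1
```

Without the de-duplication step a single annotated note turns into two joined rows, which would double-count its annotations downstream.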
From this, almost all nursing notes were able to match to MIMIC 2, but nothing was able to match to MIMIC 3, and discharge summaries couldn’t be matched at all. Look into why discharge notes aren’t matching to MIMIC 3…
db_note = conn3.execute("""
    select text
    from mimiciii.noteevents
    where subject_id=9973
      and category='Discharge summary'
      and md5(text)='10577fde1d173468a939ce3cf19f0926'
""").fetchone()[0]
print(db_note[:500])
Admission Date: [**2142-11-30**] Discharge Date: [**2142-12-10**]
Date of Birth: [**2084-5-2**] Sex: M
Service: MEDICINE
Allergies:
Percocet / Bactrim Ds / Lisinopril
Attending:[**First Name3 (LF) 898**]
Chief Complaint:
hypotension
Major Surgical or Invasive Procedure:
none
History of Present Illness:
Mr. [**Known lastname 25925**] is a 58 yo m w/ multiple sclerosis and seizure
disorder who presented to an OSH for delusions and AMS x 2 days.
At OSH, he was n
In [75]:
print(disch_notes.loc[0,'text'][:500])
Admission Date: [**2512-1-8**] Discharge Date: [**2512-1-18**]
Date of Birth: [**2453-6-10**] Sex: M
Service: MEDICINE
Allergies:
Percocet / Bactrim Ds / Lisinopril
Attending:[**First Name3 (LF) 886**]
Chief Complaint:
hypotension
Major Surgical or Invasive Procedure:
none
History of Present Illness:
Mr. [**Known patient lastname 25575**] is a 58 yo m w/ multiple sclerosis and seizure
disorder who presented to an OSH for delusions and AMS x 2 days.
At OSH
From this we see that the notes in our annotation files nearly match the notes in MIMICIII, but de-identification processes have changed.
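One quick way to confirm that the differences sit inside the de-identification placeholders is to mask them and compare again. This is a small sketch using a line from the two excerpts above; `mask_deid` is a hypothetical helper, and the pattern is the same placeholder pattern used in the SQL queries.

```python
import re

# Matches de-id placeholders such as [**Known lastname 25925**].
DEID = re.compile(r"\[\*\*[^\*]+\*\*\]")


def mask_deid(text):
    """Replace every de-id placeholder with a fixed token."""
    return DEID.sub("[**]", text)


mimic3 = "Mr. [**Known lastname 25925**] is a 58 yo m w/ multiple sclerosis and seizure"
annotated = "Mr. [**Known patient lastname 25575**] is a 58 yo m w/ multiple sclerosis and seizure"

print(mimic3 == annotated)                        # False
print(mask_deid(mimic3) == mask_deid(annotated))  # True
```

Masking alone is not a complete solution: in one of the questionable matches shown below, MIMICIII has `[** **]` where the annotated text has the ordinary word "tolerated", so the masked texts still differ. That is part of the motivation for a similarity-based heuristic over an exact comparison.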
def compare_texts(out_row):
    """Print the MIMIC 3 note and the annotated note for one matched row."""
    query = """
        select text
        from mimiciii.noteevents
        where subject_id={subject_id} and row_id={row_id_m3}
    """.format(**out_row)
    mimic3_txt = conn3.execute(query).fetchone()[0]
    # Look up the annotated text in the frame that matches the note category.
    if out_row['category'] == 'Nursing/other':
        ann_txt = nursing_notes.loc[nursing_notes['md5'] == out_row['md5']].iloc[0]['text']
    else:
        ann_txt = disch_notes.loc[disch_notes['md5'] == out_row['md5']].iloc[0]['text']
    print('MIMIC 3 text:\n{}'.format(mimic3_txt))
    print('Text from annotations:\n{}'.format(ann_txt))
In [114]:
compare_texts(questionable_matches.iloc[0])
MIMIC 3 text:
CCU NPN 2200-0700
ADDENDUM:
Pt. alert and conversing appropriately ~0530. Asking questions re: POC. + gag, [** **] sips water so NGT d/c'd.
Text from annotations:
CCU NPN 2200-0700
ADDENDUM:
Pt. alert and conversing appropriately ~0530. Asking questions re: POC. + gag, tolerated sips water so NGT d/c'd.
In [115]:
compare_texts(questionable_matches.iloc[1])
MIMIC 3 text:
S/MICU Nursing Progress Note 7a-7p
See Carevue for Additional Objective Data
ROS:
Resp:4l NC with RR:[**10-9**]/min with SpO2:93-98% BS:crackles at bases with intermittent fine exp wheezes anteriorly. treated with albuterol inhaler x1. No cough or sputum production noted
ID:T max:100. po, unclear source of infection Urine cx, BC:png.
Started on Zosyn and received dose of vanco in EW. Abd U/S: negative
CV:SBP on admission 72-80 responded transiently to IVF. SBP down to 62/ and Levophed started. Titrated up to .057 mcg/kg/min, BP: stabilized to 92-119/ HR:94-103 SR-ST with rare PVC
GU:BUN/Cr:29/1.6 received total of 5 L IVF via boluses in EW and MICU with U/O:>130-250cc/hr clear yellow urine. I/O's +3500cc since presentation in EW. Na:127 (up),K:3.9, Ca:6.8, received 2 amps CaGluconate. BS:62-69, asymptomatic. Received OJ for BS:62
GI:Abd soft, non-tender, +BS, no stool production, NPO. No overt S7Sx of bleeding. hct:23-24. Ordered for transfusion, consent obtained at 18:45
Neuro:Alert with periods of lethary, orientation x1-2. PERRL, speech clear, follows commands. Husband reports that she has had periods of confusion over the past several days. Head CT:negative
Skin:Large sacral decub with pink base with areas of white/yellow. Plastics consulted this morning, performed debridement at bedside followed by W to Dry dressing. Also noted ecchymosis on R heel and .25 cent sized breakdown on R achilles, crusted over, no drainage. placed on first step therapeutic bed
Access:R brachial PICC, L femoral multi lumen, #20 PIV
Dispo:full code
Social: Lives at home with Husband [**Name (NI) **] who is her healthcare proxy. [**Name (NI) **]:[**Telephone/Fax (1) 9306**]. [**Name2 (NI) **]er [**First Name4 (NamePattern1) 9309**] [**Last Name (NamePattern1) 9310**]:[**Telephone/Fax (1) 9307**]
Text from annotations:
S/MICU Nursing Progress Note 7a-7p
See Carevue for Additional Objective Data
ROS:
Resp:4l NC with RR:[**9-18**]/min with SpO2:93-98% BS:crackles at bases with intermittent fine exp wheezes anteriorly. treated with albuterol inhaler x1. No cough or sputum production noted
ID:T max:100. po, unclear source of infection Urine cx, BC:png.
Started on Zosyn and received dose of vanco in EW. Abd U/S: negative
CV:SBP on admission 72-80 responded transiently to IVF. SBP down to 62/ and Levophed started. Titrated up to .057 mcg/kg/min, BP: stabilized to 92-119/ HR:94-103 SR-ST with rare PVC
GU:BUN/Cr:29/1.6 received total of 5 L IVF via boluses in EW and MICU with U/O:>130-250cc/hr clear yellow urine. I/O's +3500cc since presentation in EW. Na:127 (up),K:3.9, Ca:6.8, received 2 amps CaGluconate. BS:62-69, asymptomatic. Received OJ for BS:62
GI:Abd soft, non-tender, +BS, no stool production, NPO. No overt S7Sx of bleeding. hct:23-24. Ordered for transfusion, consent obtained at 18:45
Neuro:Alert with periods of lethary, orientation x1-2. PERRL, speech clear, follows commands. Husband reports that she has had periods of confusion over the past several days. Head CT:negative
Skin:Large sacral decub with pink base with areas of white/yellow. Plastics consulted this morning, performed debridement at bedside followed by W to Dry dressing. Also noted ecchymosis on R heel and .25 cent sized breakdown on R achilles, crusted over, no drainage. placed on first step therapeutic bed
Access:R brachial PICC, L femoral multi lumen, #20 PIV
Dispo:full code
Social: Lives at home with Husband [**Name (NI) **] who is her healthcare proxy. Home:[**Telephone/Fax (1) 6887**]. Daughter [**First Name4 (NamePattern1) 6889**] [**Last Name (NamePattern1) 6890**]:[**Telephone/Fax (1) 6888**]