Background
The Data Science and Advanced Analytics (DSAA) team at Unity Health Toronto has developed and evaluated advanced patient monitoring and decision support systems to improve the efficiency, accuracy, and timeliness of clinical decision-making on the General Internal Medicine (GIM) inpatient ward at St. Michael’s Hospital. The GIM dataset was created through this work and comprises de-identified health-related data associated with patients who were admitted under the GIM service at St. Michael’s Hospital.
Funding for the creation and de-identification of the dataset was provided by Unity Health Toronto. The dataset was originally created internally and was provided to T-CAIREM under a data transfer agreement so that it could be made available to researchers. It is currently the largest Canadian healthcare dataset made openly available to researchers.
Data
The General Internal Medicine (GIM) dataset comprises de-identified health-related data associated with over 22,000 patient encounters for 14,000 unique patients who were admitted under the GIM service at St. Michael’s Hospital between 2011 and 2019. All patients admitted under a GIM service with an admission of at least 30 hours were included. The dataset is provided in both a pre-processed format and as raw data tables, all available as CSV files.
All data tables are divided into three sets: training (data collected prior to December 1, 2017), validation (data collected between December 1, 2017 and December 1, 2018), and testing (data collected after December 1, 2018). The dataset includes both static and time-varying tables. Please note that the division into sets occurs at the level of encounters, not individual patients. As such, the same patient may be represented in more than one of the training, validation, and testing sets.
Variables
The following variables were selected for inclusion in consultation with a staff physician:
- 136 Numeric Values, including:
  - 9 vital signs
  - 100 labs
  - 7 shift assessment variables
  - 7 intake-outtake variables
  - 1 ulcer variable, 1 alcohol scale, 1 diabetes variable
- 165 Clinical Orders, including:
  - Imaging
  - Telemetry
  - Consults
  - Cardio
  - Diet
  - Respiration
  - Activities
  - Codes
  - Protocols
  - Transfusions
  - Wound Care
  - Neuro
- Medication Administrations (grouped by AHFS Class)
Collection and Pre-Processing
Data was extracted directly from the following source systems:
- Admit-Discharge-Transfer (ADT) System: Identify patient encounters under the GIM service.
- Electronic Medical Records (EMR): Demographics, laboratory results, clinical orders, vitals and ICD-10 codes.
- Medication Administration Check (MAK): Documentation for all inpatient medication administrations, including the type of medication, dose, timing, administration route, and administration timestamp.
The dataset is provided in its original, raw form as well as in a pre-processed form which aggregates data into fixed time windows. Pre-processing is done as follows:
- Time-varying data is binned into 8 hour windows
- Numeric data is averaged within each window, trimmed, and normalized. Two variables are added: an indicator for measurement, and a time since last measurement
- Missing numeric data is carried forward with mean imputation
- Orders are given as indicator variables
- Missing orders are imputed as zero
- Medications are grouped into classes and then classes are given as indicator variables
For more details, please review the explanations for each individual data table. Please note that the use of mean imputation may pose challenges with using the binned data.
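The numeric steps above can be illustrated with a minimal sketch for one variable in one encounter. It is not the pipeline used to build the dataset. The function name `bin_measurements`, the `fill_value` parameter, and the sample `heart_rate` values are hypothetical. The sketch assumes that "carried forward with mean imputation" means the last observed window mean is carried forward and that a mean (here `fill_value`) is used only when no earlier measurement exists. Trimming and normalization are omitted.

```python
from statistics import mean

WINDOW_HOURS = 8  # window length stated in the dataset description


def bin_measurements(measurements, n_windows, fill_value):
    """Aggregate (hours_since_admission, value) pairs into fixed windows.

    Returns one dict per window holding the window value, a measurement
    indicator, and the number of windows since the last measurement.
    """
    # Group raw values by window index.
    by_window = {}
    for hours, value in measurements:
        idx = int(hours // WINDOW_HOURS)
        if 0 <= idx < n_windows:
            by_window.setdefault(idx, []).append(value)

    rows = []
    last_value = None     # most recent window mean, used for carry-forward
    last_measured = None  # index of the most recent window with data
    for idx in range(n_windows):
        measured = idx in by_window
        if measured:
            last_value = mean(by_window[idx])
            last_measured = idx
        rows.append({
            "window": idx,
            # Assumption: carry forward when possible, else use fill_value.
            "value": last_value if last_value is not None else fill_value,
            "measured": int(measured),
            "windows_since_measured": (
                None if last_measured is None else idx - last_measured
            ),
        })
    return rows


# Hypothetical heart-rate readings: (hours since admission, value).
heart_rate = [(9.0, 80.0), (11.5, 90.0), (28.0, 100.0)]
rows = bin_measurements(heart_rate, n_windows=4, fill_value=75.0)
for row in rows:
    print(row)
```

In this example the first window has no reading and no earlier value to carry forward, so it receives `fill_value`. The `measured` indicator lets a downstream model distinguish such imputed windows from observed ones, which is one way to handle the mean-imputation caveat noted above.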
De-identification
The following steps were taken by individuals at Unity Health to de-identify the data:
- Patient IDs and encounter numbers were removed from the data. Encounter numbers were replaced with a unique random 6-digit number.
- Addresses, postal codes, and names were stripped from the data.
- Any variable containing the year or month has been removed from the data. Pre-processed data includes a time window indicating the number of 8-hour blocks since admission, while raw data includes a “time since admission” variable for each measurement.
In addition, T-CAIREM staff further de-identified the data by grouping individuals' ages into five-year categories, capping these categories at 20 on the lower end and 100 on the upper end.
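As an illustration of the age grouping, the sketch below shows one way such a capped five-year binning could work. The function name `age_category` and the choice to represent each category by its lower bound are assumptions; the labels actually used in the dataset may differ.

```python
def age_category(age, width=5, lower=20, upper=100):
    """Return the lower bound of the five-year category containing `age`.

    Assumption: ages below `lower` fall into the lowest category and ages
    above `upper` fall into the highest category.
    """
    capped = min(max(age, lower), upper)  # clamp to the [lower, upper] range
    return (capped // width) * width      # floor to the start of the bin


# Hypothetical ages, including values beyond both caps.
for age in (17, 23, 67, 104):
    print(age, "->", age_category(age))
```

Under these assumptions, ages 17 and 23 both map to the category starting at 20, and 104 maps to the category starting at 100.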
Use
The data included in this dataset has been used internally at St. Michael’s Hospital to build systems for improving patient monitoring and decision making. Some of this work has been referenced in publications[1][2].
Governance and Consent
Research Ethics Board (REB) approval has been obtained for both the creation of the dataset and the de-identified version of the dataset. Individual consent has been provided for the collection and analysis of data. Consent has not been given for the secondary use of the de-identified dataset, and is not required under PHIPA or TCPS2.
This dataset is governed by Unity Health REB, protocol #21-206. Transfer of the data is governed by a Data Transfer Agreement between the University of Toronto and Unity Health. Access by authorized users is governed by a Data Sharing Agreement and Code of Conduct, as well as the Health Data Nexus Contributor Review Health Data License 1.0.
Data is owned by Unity Health Toronto, with access provided by T-CAIREM at the University of Toronto. The dataset will be updated and maintained jointly by Unity Health and T-CAIREM, until either party chooses to remove support for the dataset. For more information or any questions about the information in this dataset,
Version History
- 1.0.0: Original hosting of the dataset.
- 1.0.1: Update to the sharing policy to conform to T-CAIREM guidelines.
Citation
DOI: https://doi.org/10.57764/1w7f-kb56
Version: 1.0.1
BibTeX Citation:
References
- [1] Verma AA, Murray J, Greiner R, Cohen JP, Shojania KG, Ghassemi M, Straus SE, Pou-Prom C, Mamdani M. Implementing machine learning in medicine. CMAJ. 2021 Aug 30;193(34):E1351-7.
- [2] Nestor B, McCoy LG, Verma A, Pou-Prom C, Murray J, Kuzulugil S, Dai D, Mamdani M, Goldenberg A, Ghassemi M. Preparing a clinical support model for silent mode in general internal medicine. In Machine Learning for Healthcare Conference 2020 Sep 18 (pp. 950-972). PMLR.