Explanation and Background

Exploration of the contents of the dataset

Background

The Data Science and Advanced Analytics (DSAA) team at Unity Health Toronto has developed and evaluated advanced patient monitoring and decision support systems to improve the efficiency, accuracy, and timeliness of clinical decision-making on the General Internal Medicine (GIM) inpatient ward at St. Michael’s Hospital. The GIM dataset was created through this work and comprises de-identified health-related data associated with patients who were admitted under the GIM service at St. Michael’s Hospital.

Funding for the creation and de-identification of the dataset was provided by Unity Health Toronto. The dataset was originally created internally, and was provided to T-CAIREM under a data transfer agreement to make the dataset available. It is currently the largest Canadian healthcare dataset made openly available to researchers.

Data

The General Internal Medicine (GIM) dataset comprises de-identified health-related data associated with over 22,000 patient encounters for 14,000 unique patients who were admitted under the GIM service at St. Michael’s Hospital between 2011 and 2019. All patients admitted under a GIM service with an admission of at least 30 hours were included. The dataset is provided in both a pre-processed format and as raw data tables, all available as CSV files.

All data tables are divided into three sets: training (data collected prior to December 1, 2017), validation (data collected between December 1, 2017 and December 1, 2018), and testing (data collected after December 1, 2018). The dataset includes both static and time-varying tables. Please note that the division into sets occurs at the level of encounters, not individual patients. As such, the same patient may appear in more than one of the training, validation, and testing sets.

Variables

The following variables were selected for inclusion in consultation with a staff physician:

  • 136 Numeric Values including:
    • 9 vital signs
    • 100 labs
    • 7 shift assessment variables
    • 7 intake-outtake variables
    • 1 ulcer variable, 1 alcohol scale, 1 diabetes variable
  • 165 Clinical Orders including:
    • Imaging
    • Telemetry
    • Consults
    • Cardio
    • Diet
    • Respiration
    • Activities
    • Codes
    • Protocols
    • Transfusions
    • Wound Care
    • Neuro
  • Medication Administrations (grouped by AHFS Class)

Collection and Pre-Processing

Data was extracted directly from the following source systems:

  • Admit-Discharge-Transfer (ADT) System: Identify patient encounters under the GIM service.
  • Electronic Medical Records (EMR): Demographics, laboratory results, clinical orders, vitals and ICD-10 codes.
  • Medication Administration Check (MAK): Documentation for all inpatient medication administrations, including the type of medication, dose, timing, administration route, and administration timestamp.

The dataset is provided in its original, raw form as well as in a pre-processed form which aggregates data into fixed time windows. Pre-processing is done as follows:

  • Time-varying data is binned into 8-hour windows
  • Numeric data is averaged within each window, trimmed, and normalized. Two variables are added: an indicator for measurement, and a time since last measurement
    • Missing numeric data is carried forward with mean imputation
  • Orders are given as indicator variables
    • Missing orders are imputed as zero
  • Medications are grouped into classes and then classes are given as indicator variables

For more details, please review the explanations for each individual data table. Please note that the use of mean imputation may pose challenges when working with the binned data.
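As a rough illustration of the binning steps described above, the following Python sketch aggregates one variable for one encounter into 8-hour windows. It is not the DSAA team's pipeline: the function name, argument names, and output keys are hypothetical, trimming and normalization are left out, and the reading of "carried forward with mean imputation" (carry the last window mean forward, and fall back to a supplied mean before the first measurement) is an assumption.

```python
from collections import defaultdict
from statistics import mean

WINDOW_HOURS = 8  # window width stated in the documentation


def bin_measurements(rows, n_windows, fallback_mean):
    """Aggregate (hours_since_admission, value) pairs into fixed windows.

    Returns one dict per window holding the window mean, a measured flag,
    and the number of windows since the last measurement.
    All names here are hypothetical, chosen for illustration only.
    """
    by_window = defaultdict(list)
    for hours, value in rows:
        if hours < 0:
            continue  # pre-admission results are skipped in this sketch
        by_window[int(hours // WINDOW_HOURS)].append(value)

    out = []
    last_value = None  # most recent window mean, used for carry-forward
    since_last = None  # windows elapsed since a measurement (None if never)
    for w in range(n_windows):
        if w in by_window:
            last_value = mean(by_window[w])
            since_last = 0
            measured = 1
        else:
            measured = 0
            since_last = None if since_last is None else since_last + 1
        # Assumption: carry the last value forward; before any measurement
        # exists, substitute the supplied mean.
        value = last_value if last_value is not None else fallback_mean
        out.append({
            "window": w,
            "value": value,
            "measured": measured,
            "windows_since_last": since_last,
        })
    return out
```

Calling `bin_measurements([(1.0, 80), (3.0, 90), (17.5, 70)], n_windows=4, fallback_mean=75)` gives a measured first window, a carried-forward second window, a measured third window, and a carried-forward fourth window.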

De-identification

The following steps were taken by individuals at Unity Health to de-identify the data:

  1. Patient IDs and encounter numbers were removed from the data. Encounter numbers were replaced with a unique random six-digit number.
  2. Addresses, postal codes, and names were stripped from the data.
  3. Any variable containing the year or month has been removed from the data. Pre-processed data includes a time window indicating the number of 8-hour blocks since admission, while raw data includes a “time since admission” variable for each measurement.
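The replacement of encounter numbers in step 1 can be pictured with the sketch below. It shows one possible way to draw unique random six-digit numbers and is not a description of the procedure used at Unity Health. The function name and the `seed` argument are hypothetical, and the seed exists only to make the example repeatable.

```python
import random


def assign_random_ids(original_ids, seed=None):
    """Map each distinct original ID to a unique random six-digit number.

    Hypothetical helper for illustration; sampling without replacement
    guarantees that no two originals share a new number.
    """
    rng = random.Random(seed)
    unique = list(dict.fromkeys(original_ids))  # de-duplicate, keep order
    new_ids = rng.sample(range(100000, 1000000), k=len(unique))
    return dict(zip(unique, new_ids))
```

In a real de-identification workflow the mapping itself would need to be discarded or kept under strict control, since anyone holding it could reverse the replacement.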

In addition, T-CAIREM staff further de-identified the data by grouping individuals' ages into five-year categories, capping these categories at 20 on the lower end and 100 on the upper end.
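A minimal sketch of this kind of age grouping is shown below. The category labels and the exact treatment of the boundaries (everything under 20 in one category, everything from 100 up in another) are assumptions made for illustration; the labels used in the dataset may differ.

```python
def age_category(age, low=20, high=100, width=5):
    """Return a five-year age category label, capped at both ends.

    Labels and boundary handling are hypothetical, not taken from the dataset.
    """
    if age < low:
        return f"<{low}"
    if age >= high:
        return f"{high}+"
    start = low + ((age - low) // width) * width
    return f"{start}-{start + width - 1}"
```

Under these assumptions, an age of 67 falls in the `65-69` category and an age of 104 falls in `100+`.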

Use

The data included in this dataset has been used internally at St. Michael’s Hospital to build systems for improving patient monitoring and decision making. Some of this work has been referenced in publications[1][2].

Research Ethics Board (REB) approval has been obtained for both the creation of the dataset and the de-identified version of the dataset. Individual consent has been provided for the collection and analysis of data. Consent has not been given for the secondary use of the de-identified dataset, and is not required under PHIPA or TCPS2.

This dataset is governed by Unity Health REB, protocol #21-206. Transfer of the data is governed by a Data Transfer Agreement between the University of Toronto and Unity Health. Access by authorized users is governed by a Data Sharing Agreement and Code of Conduct, as well as the Health Data Nexus Contributor Review Health Data License 1.0.

Data is owned by Unity Health Toronto, with access provided by T-CAIREM at the University of Toronto. The dataset will be updated and maintained jointly by Unity Health and T-CAIREM, until either party chooses to remove support for the dataset. For more information or any questions about the information in this dataset,

Version History

  • 1.0.0: Original hosting of the dataset.
  • 1.0.1: Update to the sharing policy to conform to T-CAIREM guidelines.

Citation

DOI: https://doi.org/10.57764/1w7f-kb56

Version: 1.0.1

BibTeX Citation:

References

  • [1] Verma AA, Murray J, Greiner R, Cohen JP, Shojania KG, Ghassemi M, Straus SE, Pou-Prom C, Mamdani M. Implementing machine learning in medicine. CMAJ. 2021 Aug 30;193(34):E1351-7.
  • [2] Nestor B, McCoy LG, Verma A, Pou-Prom C, Murray J, Kuzulugil S, Dai D, Mamdani M, Goldenberg A, Ghassemi M. Preparing a clinical support model for silent mode in general internal medicine. In Machine Learning for Healthcare Conference 2020 Sep 18 (pp. 950-972). PMLR.

1 - Encounters (Static)

Description

This is a master table containing all the essential information associated with a patient encounter (visit). Each row represents a distinct encounter in the GIM ward. The Encounters table links to all other tables on the ENCOUNTER_NUM column.
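The link on ENCOUNTER_NUM can be used to combine the Encounters table with another static table. The sketch below uses only the Python standard library and small inline CSV snippets. Apart from ENCOUNTER_NUM, every column name and value in it is made up for illustration, and the helper names are hypothetical.

```python
import csv
import io


def index_by_encounter(csv_text):
    """Map ENCOUNTER_NUM -> row dict for a table with one row per encounter."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["ENCOUNTER_NUM"]: row for row in reader}


def left_join(encounters, other):
    """Keep every encounter and add columns from `other` where a match exists."""
    joined = []
    for enc_num, row in encounters.items():
        merged = dict(row)
        merged.update(other.get(enc_num, {}))
        joined.append(merged)
    return joined


# Made-up sample data; only the ENCOUNTER_NUM column name comes from the docs.
encounters_csv = "ENCOUNTER_NUM,example_col\n123456,a\n654321,b\n"
other_csv = "ENCOUNTER_NUM,other_col\n123456,x\n"

joined = left_join(index_by_encounter(encounters_csv),
                   index_by_encounter(other_csv))
```

This approach suits the static tables, which have one row per encounter. The time-varying tables hold many rows per encounter and would need grouping by ENCOUNTER_NUM instead of a one-to-one lookup.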

2 - Demographic Variables (Static)

Description

This table includes patient demographics for the encounters listed in the Encounters dataset. Each row represents a distinct encounter in the GIM ward. Demographics are consistent across encounters for each patient, with the exception of age (which naturally changes over time).

The lack of demographic information related to race and ethnicity presents a potential difficulty with data fairness and equality when drawing conclusions based on the data in this dataset. Please be conscious of how the lack of this data affects any analysis. Please also note that the term “Sex” is provided as a binary field consisting of “Male” and “Female” entries, based on how patients are entered in electronic medical records. The term “Sex” should be approached with caution in this data, as it may not reflect the lived experiences of individuals included in the dataset and may provide misleading information on individuals' medical information.

Figures

Figure 1: Patients by Sex

Figure 2: Patients by Age

Figure 3: Patients by Province

Figure 4: Patients by Language

Figure 5: Patients by Marital Status

Figure 6: Patients by Housing Status

Figure 7: Patients by Religion

3 - Baseline Values (Static)

Description

This table includes the mean values, collected prior to admission to the GIM ward, for several of the variables listed in the Numeric time-varying tables. Each row represents a distinct encounter in the GIM ward.

4 - Numeric Variables (Time-Varying)

Raw Data

Description

This table includes numeric results for laboratory measurements and vitals and the time the measurement was taken. Each row represents a distinct encounter in the GIM ward and a distinct test at a distinct time. The tests listed in this table are equivalent to the columns in the pre-processed data tables. Result times less than 0 are measurements that would have taken place before the patient was in the GIM ward (e.g. while in the emergency department).
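When working with the raw table, it can be useful to separate pre-admission results from those recorded on the ward. The sketch below assumes each row has already been read into a tuple whose first element is the time since admission in hours. That layout and the function name are hypothetical.

```python
def split_pre_admission(rows):
    """Split rows at time zero.

    Each row is a tuple (hours_since_admission, test_name, value); this
    layout is an assumption, not the dataset's actual column order.
    """
    pre = [r for r in rows if r[0] < 0]    # before arrival on the GIM ward
    post = [r for r in rows if r[0] >= 0]  # at or after admission
    return pre, post
```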

Pre-Processed Data

Description

This table contains average values of numeric variables measured over 8-hour windows. Each row represents a distinct encounter in the GIM ward and an 8-hour time window from admission. The table also includes an indicator variable indicating whether the measurement was taken and a counter describing the number of windows since the last measurement.

Figures

Figure 1: Numeric Values

Figure 2: Number of Tested Encounters for Top Eight Tests (as a proportion of total encounters in the data table)

Figure 3: Average Number of Tests (on encounters including the tested variable)

5 - Clinical Orders (Time-Varying)

Raw Data

Description

This table includes the start and stop times for all clinical orders in the dataset. Each row represents a distinct encounter in the GIM ward and a distinct time. The clinical orders listed in this table are equivalent to the columns in the pre-processed data table. Start and stop times less than 0 indicate orders that would have been in effect before the patient was in the GIM ward (e.g. while in the emergency department).

Pre-Processed Data

Description

This table contains an indicator for each clinical order showing whether the order was included within each 8-hour time window. Each row represents a distinct encounter in the GIM ward and an 8-hour time window from admission.
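One way to picture the relationship between the raw start and stop times and the windowed indicators is the sketch below. It flags every 8-hour window that overlaps the order interval. Whether the dataset flags overlapping windows or only the window in which an order starts is not stated, so the overlap rule is an assumption, and the function name is hypothetical.

```python
WINDOW_HOURS = 8  # window width stated in the documentation


def order_active_windows(start, stop, n_windows):
    """Return one 0/1 flag per window, set where [start, stop) overlaps it.

    `start` and `stop` are hours since admission; negative values fall
    before admission. Windows with no overlap stay at zero, mirroring the
    documented zero imputation for missing orders.
    """
    flags = [0] * n_windows
    for w in range(n_windows):
        w_start = w * WINDOW_HOURS
        w_end = w_start + WINDOW_HOURS
        if start < w_end and stop > w_start:
            flags[w] = 1
    return flags
```

For example, an order running from 2 hours before admission to 10 hours after it would be flagged in the first two windows only.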

Figures

Figure 1: Clinical Orders

6 - Medication Administration (Time-Varying)

Raw Data

Description

This table includes administered medications and the time of administration. Each row represents a distinct encounter in the GIM ward and a distinct time. The medications listed in this table (by AHFS code) are equivalent to the columns in the pre-processed data tables. Start and end times less than 0 indicate administrations that would have taken place before the patient was in the GIM ward (e.g. while in the emergency department).

Pre-Processed Data

Description

This table contains an indicator for each medication class (by AHFS code) showing whether a medication in that class was administered within the 8-hour time window. Each row represents a distinct encounter in the GIM ward and an 8-hour time window from admission.
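The step from raw administrations to per-window class indicators might look like the sketch below. The input layout, the function name, and the class labels used in the example are hypothetical; the class labels are placeholders and not real AHFS codes.

```python
WINDOW_HOURS = 8  # window width stated in the documentation


def medication_indicators(administrations, classes, n_windows):
    """Build one {class: 0/1} dict per window.

    `administrations` is a list of (hours_since_admission, medication_class)
    pairs. Administrations before admission or beyond the last window are
    ignored in this sketch.
    """
    rows = [{c: 0 for c in classes} for _ in range(n_windows)]
    for hours, med_class in administrations:
        w = int(hours // WINDOW_HOURS)
        if 0 <= w < n_windows and med_class in rows[w]:
            rows[w][med_class] = 1
    return rows
```

Called with `[(1.5, "class_a"), (9, "class_b"), (-3, "class_a")]`, two classes, and two windows, the sketch flags `class_a` in the first window and `class_b` in the second, and drops the pre-admission administration.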

7 - Outcomes & Alternate Outcomes (Time-Varying)

Outcomes

Description

This table contains the type of outcome for the encounter (over 8-hour windows) as well as a variety of indicator variables showing whether individual outcomes happen over the following 24, 48, or 72 hours. Each row represents a distinct encounter in the GIM ward and an 8-hour time window from admission.
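With 8-hour windows, horizons of 24, 48, and 72 hours correspond to 3, 6, and 9 windows. The sketch below derives a look-ahead indicator from a per-window event flag for a single encounter. It assumes that "the following" hours start with the next window and exclude the current one, which the documentation does not confirm. The function name is hypothetical.

```python
WINDOW_HOURS = 8  # window width stated in the documentation


def lookahead_labels(event_by_window, horizon_hours):
    """For each window, flag whether an event occurs in the windows after it.

    `event_by_window` holds 0/1 flags, one per 8-hour window, for one
    encounter. The horizon is converted to a whole number of windows.
    """
    steps = horizon_hours // WINDOW_HOURS
    n = len(event_by_window)
    return [
        int(any(event_by_window[w + 1 : w + 1 + steps]))
        for w in range(n)
    ]
```

For an encounter with five windows and an event only in the last one, a 24-hour horizon flags the three windows before the event, while a 48-hour horizon flags all four.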

Attention should be paid when attempting to use this data to predict patient outcomes. There is a significant disparity in frequency between release and the other outcomes, which must be taken into account in any analysis.
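If the disparity noted above is read as class imbalance, one common mitigation is to weight each outcome class inversely to its frequency during model training. The sketch below computes such weights as n / (k × count), where n is the number of samples and k the number of classes. It is one option among many and not a recommendation from the dataset authors. The function name and the outcome labels in the example are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {label: weight} with weight = n / (k * count).

    Rare classes receive weights above 1, common classes below 1.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}
```

With eight samples of a common outcome and two of a rare one, the common class receives a weight of 0.625 and the rare class a weight of 2.5.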

Figures

Figure 1: Patient Outcomes

Alternate Outcomes

Description

This table contains indicator variables showing whether alternate outcomes (sepsis or respiratory failure) happen over the following 24, 48, or 72 hours, for each 8-hour window.

Figures

Figure 2: Patient Alternate Outcomes