import numpy as np
import pandas as pd
How do we link health registry data to environmental exposures?
I will use this blog post to showcase the basic use of registry data and its linkage to environmental data, as I typically do in my work as a member of HELICAL’s Work Package 1, whose main objective is to help understand the relationship between environmental exposures and vasculitis onset.
I will show a simplified example of how I use healthcare registry data. I use individual-level data only as a basis for aggregation: from each patient’s residence and date of onset or diagnosis, I obtain incidence counts per spatial unit (zip code, province, electoral district) and time unit (day, week, month).
To illustrate the linkage process, I will generate toy environmental and healthcare record datasets and perform the linkage as I usually would.
In general, I fetch different datasets, publicly available or self-generated, of daily observations of several environmental variables.
A toy example would be the following table, spanning only 5 days for two different regions, A and B:
environment_df = pd.DataFrame({
'Date': np.repeat(pd.date_range('2021-01-01', '2021-01-05'), 2),
'Region': np.tile(['A', 'B'], 5),
'Temperature': np.random.normal(20, 5, 10).round(2),
'NO₂': np.random.normal(5, 1, 10).round(2),
'Fungal Sp. 1': np.random.normal(1000, 100, 10).astype(int),
'Bacterial Sp. 2': np.random.normal(750, 75, 10).astype(int)
})
# this is just for styling the table on the blog
(environment_df.style
# .hide(axis='index')
.format({'Temperature': '{:.2f}',
'NO₂': '{:.2f}',
'Date': '{:%Y-%m-%d}'})
.set_table_attributes("class='dataframe table-hover td-left-th-center'")
)
 | Date | Region | Temperature | NO₂ | Fungal Sp. 1 | Bacterial Sp. 2 |
---|---|---|---|---|---|---|
0 | 2021-01-01 | A | 17.58 | 3.85 | 927 | 783 |
1 | 2021-01-01 | B | 32.16 | 5.01 | 951 | 630 |
2 | 2021-01-02 | A | 24.76 | 5.15 | 982 | 797 |
3 | 2021-01-02 | B | 24.16 | 3.47 | 835 | 854 |
4 | 2021-01-03 | A | 17.96 | 5.84 | 1005 | 781 |
5 | 2021-01-03 | B | 18.68 | 4.40 | 811 | 723 |
6 | 2021-01-04 | A | 22.51 | 6.73 | 1126 | 632 |
7 | 2021-01-04 | B | 14.68 | 5.24 | 981 | 821 |
8 | 2021-01-05 | A | 26.34 | 5.69 | 1028 | 674 |
9 | 2021-01-05 | B | 17.50 | 4.66 | 999 | 644 |
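Since the toy values are drawn at random without a fixed seed, they change on every run. A minimal sketch of a reproducible variant follows, assuming NumPy's seeded `default_rng` generator. The helper name `make_environment` and the seed value are illustrative choices, not part of the original code.

```python
import numpy as np
import pandas as pd

def make_environment(seed=42):
    """Build the toy environmental table with a seeded generator.

    Hypothetical helper; the seed value is arbitrary.
    """
    rng = np.random.default_rng(seed)
    dates = pd.date_range('2021-01-01', '2021-01-05')
    regions = ['A', 'B']
    # One row per (Date, Region) pair, dates varying slowest.
    grid = pd.MultiIndex.from_product([dates, regions], names=['Date', 'Region'])
    n = len(grid)
    return pd.DataFrame({
        'Temperature': rng.normal(20, 5, n).round(2),
        'NO₂': rng.normal(5, 1, n).round(2),
        'Fungal Sp. 1': rng.normal(1000, 100, n).astype(int),
        'Bacterial Sp. 2': rng.normal(750, 75, n).astype(int),
    }, index=grid).reset_index()

environment_seeded = make_environment()
```

Calling `make_environment` twice with the same seed returns identical tables, which keeps every table derived from it consistent across runs.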
A minimal healthcare records dataset of the kind I use contains, at the individual level, the patient’s region of residence and the recorded (vasculitis) onset date.
healthcare_records = pd.DataFrame({
'Patient ID': range(1, 16),
'Region': np.random.choice(['A', 'B'], 15),
'Onset Date': np.random.choice(pd.date_range('2021-01-01', '2021-01-05'), 15)
})
# this is just for styling the table on the blog
(healthcare_records
.style
.hide(axis='index')
.format({'Onset Date': '{:%Y-%m-%d}'})
.set_table_attributes("class='dataframe table-hover'")
)
Patient ID | Region | Onset Date |
---|---|---|
1 | B | 2021-01-05 |
2 | B | 2021-01-05 |
3 | B | 2021-01-01 |
4 | B | 2021-01-02 |
5 | A | 2021-01-05 |
6 | A | 2021-01-05 |
7 | B | 2021-01-01 |
8 | B | 2021-01-05 |
9 | A | 2021-01-03 |
10 | B | 2021-01-04 |
11 | B | 2021-01-03 |
12 | A | 2021-01-02 |
13 | B | 2021-01-02 |
14 | B | 2021-01-02 |
15 | B | 2021-01-03 |
I then go from individual-level records to population-level records by aggregating by date and region, so that the data table I use looks like the following:
daily_cases = (healthcare_records
.groupby(['Onset Date', 'Region'])
.size()
.rename('Cases')
.astype(int)
.reset_index()
.rename(columns={'Onset Date': 'Date'})
)
# this is just for styling the table on the blog
(daily_cases
.style
.hide(axis='index')
.format({'Date': '{:%Y-%m-%d}'})
.set_table_attributes("class='dataframe table-hover'")
)
Date | Region | Cases |
---|---|---|
2021-01-01 | B | 2 |
2021-01-02 | A | 1 |
2021-01-02 | B | 3 |
2021-01-03 | A | 1 |
2021-01-03 | B | 2 |
2021-01-04 | B | 1 |
2021-01-05 | A | 2 |
2021-01-05 | B | 3 |
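The same aggregation can be done at coarser time units, such as weekly or monthly counts. Here is a minimal sketch, assuming `pd.Grouper` is used for the time binning. The `records` table is hand-written for illustration and `count_cases` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical individual-level records spanning two weeks (not real data).
records = pd.DataFrame({
    'Patient ID': range(1, 7),
    'Region': ['A', 'A', 'B', 'B', 'B', 'A'],
    'Onset Date': pd.to_datetime([
        '2021-01-04', '2021-01-06', '2021-01-05',
        '2021-01-12', '2021-01-13', '2021-01-14',
    ]),
})

def count_cases(records, freq):
    """Count cases per region and per time bin of the given pandas frequency."""
    return (records
        .groupby([pd.Grouper(key='Onset Date', freq=freq), 'Region'])
        .size()
        .rename('Cases')
        .reset_index()
        .rename(columns={'Onset Date': 'Date'})
    )

# 'W' bins are weeks ending on Sunday, labelled with that Sunday.
weekly_cases = count_cases(records, 'W')
# 'MS' bins are calendar months, labelled with the first day of the month.
monthly_cases = count_cases(records, 'MS')
```

Only the frequency string changes between time units. The region column plays the role of the spatial unit and could equally be a zip code or a province.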
The final linkage, which produces the table on which most of the analyses will be made, merges the environmental observations and the daily incidence counts into a single table based on the `Date` and `Region` columns, such that:
Region | Date | Temperature | NO₂ | Fungal Sp. 1 | Bacterial Sp. 2 | Cases |
---|---|---|---|---|---|---|
A | 2021-01-01 | 15.31 | 5.12 | 981 | 758 | 1 |
A | 2021-01-02 | 20.83 | 3.09 | 1086 | 782 | 4 |
A | 2021-01-03 | 22.09 | 4.66 | 903 | 801 | 2 |
A | 2021-01-04 | 25.52 | 4.47 | 823 | 716 | 0 |
A | 2021-01-05 | 27.61 | 4.41 | 1059 | 763 | 0 |
B | 2021-01-01 | 20.98 | 5.58 | 1008 | 692 | 2 |
B | 2021-01-02 | 18.13 | 4.02 | 1206 | 779 | 1 |
B | 2021-01-03 | 17.26 | 5.11 | 879 | 748 | 4 |
B | 2021-01-04 | 21.32 | 5.40 | 1124 | 882 | 0 |
B | 2021-01-05 | 18.66 | 5.42 | 897 | 660 | 1 |
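The merge on `Date` and `Region` described above can be sketched as follows. This is one possible version, assuming a left join from the environmental table so that every region-day is kept, with region-days that have no recorded case filled with zero, in line with the zero counts shown in the table. The small input tables are hand-written for illustration and the name `linked` is hypothetical.

```python
import pandas as pd

dates = pd.to_datetime(['2021-01-01', '2021-01-02'])

# Hypothetical environmental table: exactly one row per (Date, Region).
environment_df = pd.DataFrame({
    'Date': dates.repeat(2),
    'Region': ['A', 'B'] * 2,
    'Temperature': [17.5, 21.0, 19.2, 18.4],
})

# Hypothetical daily counts: (2021-01-02, A) is absent because no case occurred.
daily_cases = pd.DataFrame({
    'Date': pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-02']),
    'Region': ['A', 'B', 'B'],
    'Cases': [1, 2, 3],
})

linked = (environment_df
    # Left join keeps every environmental row; validate raises if keys repeat.
    .merge(daily_cases, on=['Date', 'Region'], how='left', validate='one_to_one')
    # Region-days without a matching count get zero cases.
    .fillna({'Cases': 0})
    .astype({'Cases': int})
    .set_index(['Region', 'Date'])
    .sort_index()
)
```

An inner join would silently drop the region-days without cases, which would bias any incidence analysis, so the left join followed by a zero fill is the safer default under these assumptions.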