Registry Data usage (ESR6)

How do we link health registry data to environmental exposures?

WP1

python

data

registry

health

environment

exposure

Author

ESR6: Alejandro Fontal

Introduction

I will use this blog post to showcase the basic usage of registry data and its linkage to environmental data, as I typically do in my work as a member of HELICAL's Work Package 1, whose main objective is to help understand the relationship between environmental exposures and vasculitis onset.

I will show a simplified example of how I use healthcare registry data. I use individual-level data only as a basis for aggregation: from each patient's residence and date of onset/diagnosis, I obtain incidence counts per spatial unit (zip code, province, electoral district) and time unit (day, week, month).
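The aggregation to different time units can be sketched as below. This is a minimal illustration, not the exact pipeline: the `records` table, its values, and the helper `incidence_counts` are hypothetical, and the frequency strings are standard pandas offset aliases.

```python
import pandas as pd

# Hypothetical individual-level records: one row per patient.
records = pd.DataFrame({
    'Region': ['A', 'A', 'B', 'B', 'B'],
    'Onset Date': pd.to_datetime([
        '2021-01-04', '2021-01-06', '2021-01-05', '2021-01-12', '2021-02-01',
    ]),
})


def incidence_counts(df, freq):
    """Count cases per region and per time unit (e.g. 'D', 'W-SUN', 'MS')."""
    return (df
            .groupby(['Region', pd.Grouper(key='Onset Date', freq=freq)])
            .size()
            .rename('Cases')
            .reset_index())


weekly_cases = incidence_counts(records, 'W-SUN')   # weeks labelled by their closing Sunday
monthly_cases = incidence_counts(records, 'MS')     # months labelled by their first day
```

Swapping the `'Region'` key for a zip-code or district column would change the spatial unit in the same way.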

To illustrate the linkage process, I will generate a toy environmental dataset and a toy healthcare records dataset, and link them as I usually would.


import numpy as np
import pandas as pd

Environmental dataset

In general, I gather datasets of daily observations, either publicly available or self-generated, covering several kinds of environmental variables:

Weather
Pollution
Biological air diversity
Chemical composition (via LIDAR or in situ sampling)
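When the variables come from separate sources, they first have to be combined into one daily table per region. A minimal sketch of that step, assuming each source already provides `Date` and `Region` columns (the `weather` and `pollution` tables and their values are hypothetical):

```python
import pandas as pd

dates = pd.date_range('2021-01-01', '2021-01-03')

# Hypothetical per-source tables; column names and values are illustrative only.
weather = pd.DataFrame({
    'Date': dates.repeat(2),
    'Region': ['A', 'B'] * 3,
    'Temperature': [17.5, 19.0, 18.2, 20.1, 16.9, 18.4],
})
pollution = pd.DataFrame({
    'Date': dates.repeat(2)[:5],          # the last observation for region B is missing
    'Region': (['A', 'B'] * 3)[:5],
    'NO₂': [3.9, 5.0, 5.2, 3.5, 5.8],
})

# An outer merge keeps every region-day seen in either source;
# gaps in one source show up as NaN instead of dropping the row.
environment = (weather
               .merge(pollution, on=['Date', 'Region'], how='outer')
               .sort_values(['Region', 'Date'])
               .reset_index(drop=True))
```

Keeping the gaps as `NaN` leaves the decision on how to handle missing measurements to the analysis stage.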

A toy example would be the following table, spanning only 5 days for two different regions, A and B:


environment_df = pd.DataFrame({
    'Date': np.repeat(pd.date_range('2021-01-01', '2021-01-05'), 2),
    'Region': np.tile(['A', 'B'], 5),
    'Temperature': np.random.normal(20, 5, 10).round(2),
    'NO₂': np.random.normal(5, 1, 10).round(2),
    'Fungal Sp. 1': np.random.normal(1000, 100, 10).astype(int),
    'Bacterial Sp. 2': np.random.normal(750, 75, 10).astype(int)
})


# this is just for styling the table on the blog

(environment_df.style
#  .hide(axis='index')
 .format({'Temperature': '{:.2f}',
          'NO₂': '{:.2f}',
          'Date': '{:%Y-%m-%d}'})
 .set_table_attributes("class='dataframe table-hover td-left-th-center'")
)

	Date	Region	Temperature	NO₂	Fungal Sp. 1	Bacterial Sp. 2
0	2021-01-01	A	17.58	3.85	927	783
1	2021-01-01	B	32.16	5.01	951	630
2	2021-01-02	A	24.76	5.15	982	797
3	2021-01-02	B	24.16	3.47	835	854
4	2021-01-03	A	17.96	5.84	1005	781
5	2021-01-03	B	18.68	4.40	811	723
6	2021-01-04	A	22.51	6.73	1126	632
7	2021-01-04	B	14.68	5.24	981	821
8	2021-01-05	A	26.34	5.69	1028	674
9	2021-01-05	B	17.50	4.66	999	644
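Since the toy values are drawn at random, they change on every run. One way to make them reproducible is a seeded generator, sketched here with a reduced set of columns; the helper `make_environment` and the seed value are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd


def make_environment(seed):
    """Build a toy environmental table from a seeded generator (seed is arbitrary)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'Date': np.repeat(pd.date_range('2021-01-01', '2021-01-05'), 2),
        'Region': np.tile(['A', 'B'], 5),
        'Temperature': rng.normal(20, 5, 10).round(2),
        'NO₂': rng.normal(5, 1, 10).round(2),
    })


# Two calls with the same seed give identical tables.
first = make_environment(seed=0)
second = make_environment(seed=0)
```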

Healthcare records dataset

A minimal healthcare records dataset of the kind I use contains, for each individual patient, the region of residence and the recorded (vasculitis) onset date.


healthcare_records = pd.DataFrame({
    'Patient ID': range(1, 16),
    'Region': np.random.choice(['A', 'B'], 15),
    'Onset Date': np.random.choice(pd.date_range('2021-01-01', '2021-01-05'), 15)
})

# this is just for styling the table on the blog

(healthcare_records
 .style
 .hide(axis='index')
 .format({'Onset Date': '{:%Y-%m-%d}'})
 .set_table_attributes("class='dataframe table-hover'")
)

Patient ID	Region	Onset Date
1	B	2021-01-05
2	B	2021-01-05
3	B	2021-01-01
4	B	2021-01-02
5	A	2021-01-05
6	A	2021-01-05
7	B	2021-01-01
8	B	2021-01-05
9	A	2021-01-03
10	B	2021-01-04
11	B	2021-01-03
12	A	2021-01-02
13	B	2021-01-02
14	B	2021-01-02
15	B	2021-01-03

I then go from individual-level records to population-level counts by aggregating by date and region, so that the table I work with looks like the following:


daily_cases = (healthcare_records
               .groupby(['Onset Date', 'Region'])
               .size()
               .rename('Cases')
               .astype(int)
               .reset_index()
               .rename(columns={'Onset Date': 'Date'})
)

# this is just for styling the table on the blog

(daily_cases
 .style
 .hide(axis='index')
 .format({'Date': '{:%Y-%m-%d}'})
 .set_table_attributes("class='dataframe table-hover'")
)

Date	Region	Cases
2021-01-01	B	2
2021-01-02	A	1
2021-01-02	B	3
2021-01-03	A	1
2021-01-03	B	2
2021-01-04	B	1
2021-01-05	A	2
2021-01-05	B	3
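Region-days without any case do not appear in a grouped table like this one. If the counts are needed on their own, with explicit zeros, one option is to reindex against the full date-by-region grid. A minimal sketch with a hypothetical three-day study period:

```python
import pandas as pd

# Daily counts as produced by a groupby: (date, region) pairs without cases are absent.
daily_cases = pd.DataFrame({
    'Date': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-02']),
    'Region': ['B', 'A', 'B'],
    'Cases': [2, 1, 3],
})

# Full grid of every date in the study period crossed with every region.
full_index = pd.MultiIndex.from_product(
    [pd.date_range('2021-01-01', '2021-01-03'), ['A', 'B']],
    names=['Date', 'Region'],
)

# Reindexing adds the missing region-days and gives them a count of 0.
complete_cases = (daily_cases
                  .set_index(['Date', 'Region'])
                  .reindex(full_index, fill_value=0)
                  .reset_index())
```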

Linkage

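The final linkage produces the table on which most of the analyses are run. It merges the environmental observations and the daily incidence counts into a single table on the Date and Region columns, and region-days without recorded cases get a count of 0. Note that the toy data are drawn at random without a fixed seed, so the values in the table below come from a different draw than the tables above: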


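(environment_df
 # keep every environmental row, including region-days without cases
 .merge(daily_cases, on=['Date', 'Region'], how='left')
 # unmatched region-days had no recorded cases; other columns are left untouched
 .fillna({'Cases': 0})
 .assign(Cases=lambda df: df.Cases.astype(int))
 .sort_values(['Region', 'Date'])
 .set_index(['Region', 'Date'])
)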

Region	Date	Temperature	NO₂	Fungal Sp. 1	Bacterial Sp. 2	Cases
A	2021-01-01	15.31	5.12	981	758	1
A	2021-01-02	20.83	3.09	1086	782	4
A	2021-01-03	22.09	4.66	903	801	2
A	2021-01-04	25.52	4.47	823	716	0
A	2021-01-05	27.61	4.41	1059	763	0
B	2021-01-01	20.98	5.58	1008	692	2
B	2021-01-02	18.13	4.02	1206	779	1
B	2021-01-03	17.26	5.11	879	748	4
B	2021-01-04	21.32	5.40	1124	882	0
B	2021-01-05	18.66	5.42	897	660	1
