Linking Demographic and Socioeconomic Data to the Electronic Health Record

Resource Chapters

Linking Demographic and Socioeconomic Data to the Electronic Health Record

Contributors

Monique L. Anderson, MD, MHS

Sohayla Pruitt, MA

Contributing Editors

Karen Staman, MS

Liz Wing, MA

Introduction

Tools for ResearchPragmatic clinical trials (PCTs) focus on how treatments work in the real world, and because PCTs draw from healthcare systems consisting of large populations, they offer unique opportunities to study treatment differences by demographic characteristics such as race/ethnicity, sex, age, or socioeconomic status (SES).

Traditional randomized controlled trials (RCTs) generally report a single measure of treatment effect, but the same treatment can have variable responses in different populations [1-5]. PCTs have the potential to yield information on subgroup differences to determine whether there is heterogeneity of treatment effect (HTE), defined as nonrandom variability in the direction of magnitude of a treatment effect. Yet efforts to examine treatment effect by race/ethnicity in PCTs may be hampered by large amounts of missing data. Additionally, associations by race or ethnicity are often confounded by the complex interplay of socio-environmental factors [6], including SES. Various geospatial strategies can be used to estimate an individual’s race or ethnicity from information in the electronic health record (EHR) using the individual's last name and/or address. The most accurate estimation method is Bayesian Improved Surname Geocoding (BISG) [7,8]. BISG uses both patient addresses and surname analysis to assign probabilities for membership in a racial or ethnic group. U.S. Census Bureau and American Community Survey files also can be used to associate demographic and socioeconomic information in the EHR.

This tool introduces the methodology used at Duke Medicine for linking and enriching demographic and socioeconomic data within its enterprise-wide EHR system.


Geocoding Methodology Deployed Within the EHR

[Back to top]

In 2012 Duke Medicine deployed an automated address standardization and geocoding process that was initially applied to all addresses stored in Duke Medicine’s Enterprise Data Warehouse (EDW). This same process was subsequently scheduled to run nightly on any new addresses added to the EDW. The following section describes the software, hardware, and other specifics related to the automation methodology.

Software

SAS Dataflux Data Management Studio (SAS Institute Inc., Cary, NC) is used to verify, standardize, and geocode address information in the EDW. Both the U.S. Postal Service (USPS) address database and the TomTom street-level geocoding data pack are licensed through the SAS DMS software and are refreshed on a quarterly basis.

Hardware

The EDW geocoding server is a Windows 2008 Server virtual machine located in the secure Duke Medicine data center. The server hardware is managed by the Duke Health Technology Solutions data management group.

Automation Methodology

New addresses are identified nightly by extract-transform-load procedures that pull data from EHRs and place them in the EDW. The IBM Tivoli utility (IBM Corporation, Armonk, NY) is used to schedule these regular nightly processes. Whenever possible, addresses are matched to verified addresses contained in the USPS database, using a combination of elements available in the EDW, such as full street address, city, state, and/or 5-digit ZIP code. When an address is verified, the address elements—now parsed, cleaned, filled in, and standardized—are returned to the EDW from the USPS database. These now include address data elements that may not have been originally obtained in the EDW. Unverified addresses are flagged and reevaluated when a new USPS quarterly data pack is released.

For all USPS-verified addresses, the TomTom data pack, as loaded into DMS, is used to assign latitude, longitude, and other geospatial identifying information to the addresses (e.g., U.S. Census Block Group Federal Information Processing Standards [FIPS] Code). The TomTom data pack uses a cascade approach to apply geocoding depending on available data, which means the program attempts to find a match using the highest degree of accuracy first and moves to lower accuracy methods until a match is found. The geocoding methods in descending order of accuracy are:

  1. Highest: Delivery point (also known as “rooftop geocoding”) is essentially the precise latitude and longitude of the doorstop or mailbox.
  2. Middle: The 9-digt ZIP code that includes a 4-digit extension (AABB) representing the ZIP Code Sector and the ZIP Code Segment.
    • ZIP Code Sector (AABB): geographic area of a ZIP code area such as city blocks or a large building.
    • ZIP Code Segment (AABB): a very specific location such as a floor in a large building or a specific city block.  The accuracy of an address at the “Plus 4” level falls within a radius of 10-20 households on the same side of the street.
  3. Lowest: TomTom provides coordinates at the ZIP code centroid level.

The EDW geocoding process attempts to retrieve all rooftop coordinates at the first pass, and then uses the 9-digit ZIP code coordinates when rooftop coordinates cannot be identified. The EDW, however, does not retain geocoding at the ZIP code centroid level.

Results

Table 1 lists the data elements that resulted from the automation process.

Field Name Field Type (Length) Definition
ADDRESS_KEY NUMBER Table primary key
FULL_STREET_ADDRESS VARCHAR2(128) Full string address information (delimited with $); Business key for table
Elements from the clinical data record match to the USPS database for verification and standardization:
STREET_ADDRESS VARCHAR2(128) Parsed from FULL_STREET_ADDRESS to include all text to the left of the $ delimiter
CITY_STATE_ZIP VARCHAR2(128) Parsed from FULL_STREET_ADDRESS to include all text to the right of the $ delimiter
Elements that are returned from the USPS database as part of the verification and standardization process, and are prefixed with ‘STD_’:
VERIFIED_FLAG VARCHAR2(1) Y/N address verified by USPS knowledge base
STD_STREET_LINE1 VARCHAR2(128) USPS Standardized street  line 1 address (Value present only if address was verified and standardized by USPS).
STD_STREET_LINE2 VARCHAR2(128) USPS Standardized street line 2 address (Value present only if address was verified and standardized by USPS).
STD_CITY VARCHAR2(60) USPS Standardized city (Value present only if address was verified and standardized by USPS).
STD_STATE VARCHAR2(2) USPS Standardized state (Value present only if address was verified and standardized by USPS).
STD_ZIP VARCHAR2(5) USPS Standardized 5-digit zip code (Value present only if address was verified and standardized by USPS).
STD_ZIP4 VARCHAR2(4) USPS Standardized +4 zip code (Value present only if address was verified and standardized by USPS).
STD_POSTAL_CODE VARCHAR2(48) USPS Standardized postal code. This can contain the 5 digit zip code or the full 9-digit zip code (Value present only if address was verified and standardized by USPS).
STD_COUNTRY_CODE VARCHAR2(3) USPS Standardized 3-character country code (Value present only if address was verified and standardized by USPS).
STD_STREET_NUMBER VARCHAR2(10) USPS Standardized and parsed street number (Value present only if address was verified and standardized by USPS).
STD_STREET_PRE_DIRECTION VARCHAR2(6) USPS Standardized and parsed pre-direction of street (Value present only if address was verified and standardized by USPS).
STD_STREET_NAME VARCHAR2(60) USPS Standardized and parsed name of street (Value present only if address was verified and standardized by USPS).
STD_STREET_SUFFIX VARCHAR2(10) USPS Standardized and parsed street suffix (Value present only if address was verified and standardized by USPS).
STD_STREET_POST_DIRECTION VARCHAR2(10) USPS Standardized and parsed post direction of street (Value present only if address was verified and standardized by USPS).
STD_STREET_2ND_ABBREV VARCHAR2(60) USPS Standardized and parsed second abbreviation--i.e. Apt., Unit, etc. (Value present only if address was verified and standardized by USPS).
STD_STREET_2ND_NUMBER VARCHAR2(20) USPS Standardized and parsed second number--i.e. Apt Number (Value present only if address was verified and standardized by USPS).
STD_COUNTY_NAME VARCHAR2(60) USPS Standardized county name (Value present only if address was verified and standardized by USPS).
STD_STATE_FIPS VARCHAR2(2) U.S. Census State FIPS code (Value present only if address was verified and standardized by USPS).
Data elements resulting from geocoding where the ‘STD_’ elements have been fed to the TomTom cascade methodology:
GEO_RESULT_CODE VARCHAR2(7) Level of geocoding achieved: 'DP' = highest accuracy based on delivery point, and 'PLUS4' = zip+4 match, (null) = geocoding not successful
GEO_LATITUDE NUMBER(10,6) Y COORDINATE using WGS84 DATUM (Value present only if DP or PLUS4 rooftop geocoding accuracy was achieved).
GEO_LONGITUDE NUMBER(10,6) X COORDINATE using WGS84 DATUM (Value present only if DP or PLUS4 rooftop geocoding accuracy was achieved).
GEO_COUNTY_FIPS_CODE VARCHAR2(5) U.S. Census County FIPS Code: first two digits = state FIPS code; last three = county code (Value present only if DP or PLUS4 rooftop geocoding accuracy was achieved).
GEO_TRACT_FIPS_CODE VARCHAR2(11) U.S. Census Tract FIPS Code: first two digits = state FIPS, next three digits = county code, next six digits = tract code (Value present only if DP or PLUS4 geocoding accuracy was achieved).
GEO_BLOCKGRP_FIPS_CODE VARCHAR2(12) U.S. Census Tract FIPS Code: first two digits = state FIPS, next three digits = county code, next six digits = tract code, last digit = block group code (Value present only if DP or PLUS4 geocoding accuracy was achieved).
GEO_BLOCK_FIPS_CODE VARCHAR2(15) U.S. Census Block FIPS Code: first two digits = state FIPS, next three digits = county code, next six digits = tract code, last four digits = block code (Value present only if DP or PLUS4 geocoding accuracy was achieved).
GEO_MCD_FIPS_CODE VARCHAR2(5) U.S. Census Minor Civil Divisions Code (Value present only if DP or PLUS4 rooftop geocoding accuracy was achieved).
GEO_MSA_CMSA_CODE VARCHAR2(4) U.S. Census Metro Statistical Area Code (Value present only if DP or PLUS4 rooftop geocoding accuracy was achieved).
GEO_PMSA_FIPS_CODE VARCHAR2(4) U.S. Census Primary Metro Statistical Area Code (Value present only if DP or PLUS4 rooftop geocoding accuracy was achieved).
GEO_CBSA_FIPS_CODE VARCHAR2(5) U.S. Census Consolidated Metropolitan Statistical Area Code (Value present only if DP or PLUS4 rooftop geocoding accuracy was achieved).

Table 1. Data elements

Geocoding Performance Statistics
  • Approximately 95% of patients with a physical address (i.e., a street address rather than a post office box) have a verified and standardized current address.
  • Approximately 92% of patients with a physical address have both a standardized and geocoded address at the delivery point or ZIP+4 level of accuracy.
  • A full verification-standardization-geocoding refresh cycle involves approximately 5 million addresses and takes approximately 10 hours.
  • Nightly verification-standardization-geocoding of iteratively added “new” addresses takes less than 1 hour. Approximately 1,000 to 3,000 new addresses are added each day.
Geocoding Methodology Rationale

Each geocoding method has its own relationship with “ground truth” accuracy and its own particular limitations and assumptions. A slide set is available that describes the conceptual differences among the three best-known high-accuracy geocoding methodologies—parcel, interpolated street, and delivery point/rooftop [9]. Regardless of the method used, the typical variance is approximately 12 meters and is sufficient for neighborhood-level analyses.


On-Demand Geospatial Visualization of Patient Health and Healthcare Within the EHR

[Back to top]

In 2012 Duke Medicine deployed an on-demand, geospatial visualization tool within its EDW. The following section describes the software and capabilities related to the geospatial visualization technology.

Software

The geospatial mapping tool, DEDUCE-Geo, was integrated within the Duke Enterprise Data Unified Content Explorer (DEDUCE) [10]. DEDUCE is a .NET application built at Duke that expedites access to EHR data stored in the EDW without the user having to understand how to execute complicated SQL queries or be familiar with the underlying data model of the EDW. DEDUCE-Geo uses both ArcGIS Server (Esri, Redlands, CA) and JavaScript to execute the geospatial visualization of a cohort defined within DEDUCE.

Capabilities

After DEDUCE users create a patient cohort, they are able to geospatially visualize the cohort on a map with dot distribution (i.e., point maps) or thematic maps that summarize patient counts per geographic unit (Figure 1). Thematic maps are available for a number of geographic boundaries, including county, 3-digit ZIP codes, 5-digit ZIP codes, census tracts, and census block groups. The counts of patients can be normalized by an all-Duke-patient count, 2010 population count, or square miles.

Visualization of DEDUCE Cohorts in DEDUCE-Geo
Figure 1. Visualization of DEDUCE cohorts in DEDUCE-Geo


Demographic and Socioeconomic Data Linkage Within the EHR

[Back to top]

In 2013 the U.S. Census Bureau’s 2010 Summary File 1 and the American Community Survey 2007-2011 5-year estimate data for North Carolina at the block group level was extracted, transformed, and loaded into Duke Medicine’s EDW. Automated processes were built to retain the raw counts of each statistical measure but also to provide computed meaningful derived values representing the percentage of population or households with a given statistical attribute (e.g., the percentage of population with public assistance income versus the number of people with public assistance income in a given block group).

As part of the geocoding process, every address record in the EDW that was successfully geocoded to the delivery point of ZIP+4 unit of geography was also assigned a block group FIPS code that can easily be joined in order to attribute each geocoded patient with the demographic and socioeconomic variables that exist in the two datasets.

Results

Approximately 10,000 new demographic and socioeconomic variables are now available to describe the characteristics present within each patient’s block group neighborhood. In 2014, these variables were made available within the DEDUCE Cohort Manager application [10] so that patient cohorts could be (a) further refined/filtered by the demographic characteristics present within their Census block group (e.g., those with median household incomes less than $49,999) (Figure 2) or (b) extracted from the EDW alongside any number of selected variables for more advanced statistical analysis or visualization outside of DEDUCE (Figure 3).

Demographic and Socioeconomic Data Filtering in DEDUCE
Figure 2. Demographic and socioeconomic data filtering in DEDUCE

Exporting Geospatially Related Data
Figure 3. Exporting geospatially related data

Integration of SES Measures

[Back to top]

The SES measures listed in Table 2 were created within the EDW, based on a report from the Agency for Healthcare Research and Quality [11].

Construct Measure Definition
Occupation
  Unemployment Percentage of persons aged ≥16 years in the labor force who are unemployed and actively seeking work
Income
  Below US poverty line Percentage of persons below the federally defined poverty line
  Median income Median household income
Wealth
  Property values Median value of owner-occupied homes
Education
  Low education Percentage of persons aged >25 years with less than a 12th-grade education
  High education Percentage of persons aged >25 years with at least 4 years of college
Housing
  Crowded households Percentage of households containing ≥ 1 person per room

Table 2. SES measures integrated into the EDW

The following statistical data elements and mathematical transformations from the American Community Survey 2007-2011 were applied in order to create the measures shown in Table 2.

Unemployment Measure
B23025_005 Population 16 years and over: In labor force--Civilian labor force, Unemployed
B23025_001 Population 16 years and over
Transformation Unemployment = (B23025_005/B23025_001)*100
Below US Poverty Line Measure
B17021_002 Population for whom poverty status is determined: Income in the past 12 months below poverty level
B17021_001 Population for whom poverty status is determined
Transformation Below US Poverty Line = (B17021_002/B17021_001)*100
Median Income Measure
B19013_001 Median household income in the past 12 months (in 2011 inflation-adjusted dollars): Total
Transformation 0-100 Normalized Median Income = (B19013_001/(MAX(B19013_001)))*100
Property Values Measure
B25077_001 Median value quartile (dollars) of owner-occupied housing units
Transformation 0-100 Normalized Median Income = (B25077_001 /(MAX(B25077_001)))*100
Low Education Measure
B15002_003 Population 25 years and over: Male--No schooling completed
B15002_004 Population 25 years and over: Male--Nursery to 4th grade
B15002_005 Population 25 years and over: Male--5th and 6th grade
B15002_006 Population 25 years and over: Male--7th and 8th grade
B15002_007 Population 25 years and over: Male--9th grade
B15002_008 Population 25 years and over: Male--10th grade
B15002_009 Population 25 years and over: Male--11th grade
B15002_010 Population 25 years and over: Male--12th grade, no diploma
B15002_020 Population 25 years and over: Female--No schooling completed
B15002_021 Population 25 years and over: Female--Nursery to 4th grade
B15002_022 Population 25 years and over: Female--5th and 6th grade
B15002_023 Population 25 years and over: Female--7th and 8th grade
B15002_024 Population 25 years and over: Female--9th grade
B15002_025 Population 25 years and over: Female--10th grade
B15002_026 Population 25 years and over: Female--11th grade
B15002_027 Population 25 years and over: Female--12th grade, no diploma
B15002_001 Population 25 years and over
Transformation Low Education = ((B15002_003 + B15002_004 + B15002_005+ B15002_006+ B15002_007+ B15002_008+ B15002_009+ B15002_010+ B15002_020+ B15002_021+ B15002_022+ B15002_023+ B15002_024+ B15002_025+ B15002_026+ B15002_027)/B15002_001)*100
High Education Measure
B15002_015 Population 25 years and over: Male--Bachelor's degree
B15002_016 Population 25 years and over: Male--Master's degree
B15002_017 Population 25 years and over: Male--Professional school degree
B15002_018 Population 25 years and over: Male--Doctorate degree
B15002_032 Population 25 years and over: Female--Bachelor's degree
B15002_033 Population 25 years and over: Female--Master's degree
B15002_034 Population 25 years and over: Female--Professional school degree
B15002_035 Population 25 years and over: Female--Doctorate degree
B15002_001 Population 25 years and over
Transformation High Education = ((B15002_015+ B15002_016+ B15002_017+ B15002_018+ B15002_032+ B15002_033+ B15002_034+ B15002_035)/B15002_001)*100
Crowded Households Measure
B25014_006 Occupied housing units: Owner occupied--1.51 to 2.00 occupants per room
B25014_007 Occupied housing units: Owner occupied--2.01 or more occupants per room
B25014_012 Occupied housing units: Renter occupied--1.51 to 2.00 occupants per room
B25014_013 Occupied housing units: Renter occupied--2.01 or more occupants per room
B25014_001 Occupied housing units
Transformation Crowded Households = ((B25014_006+B25014_007+B25014_012+B25014_013)/B25014_001)*100

Given that most of the above SES measures represent percentages and range in values from 0-100, the Median Income and Median Property values were subjected to mathematical transformation in order to standardize their values to a range of 0-100.

The following mathematical equation was then applied to each NC block group using the measures described above:

SES Index Score = 50+(-0.07*Crowded Households)+(0.08*Property Values 0-100)+(-0.10*Below US Poverty Line)+ (0.11*Median Household Income 0-100)+(0.10*High Education)+(-0.11*Low Education)+(-0.08*Unemployment)

Note: 160 block groups contained null values in one or more of the above measures. For these, the SES index could not be calculated.

The SES index ranged in values from 35 to 78 and was further broken into quartiles where:

  • Quartile 1 / Low SES = 35-48
  • Quartile 2 / Medium-Low SES = 49-51
  • Quartile 3 / Medium-High SES = 52-55
  • Quartile 4 / High SES = 56-78

Visualization of Patient Cohort SES

[Back to top]

The availablility of the SES measure, index, and quartile variables for every geocoded patient in Duke's EDW has enabled on-demand filtering, export, and visualization of patient data according to these measures. The SES measures were served through the DEDUCE cohort charting capability, so that for any cohort of patients, a histogram distribution could be visualized on-demand. For example, within a cohort of high utilizers of emergency department services over a given month at Duke hospitals, a number of demographic and socioeconomic variables can be visualized graphically (Figure 4).

Figure 4. Demographic and socioeconomic histograms in DEDUCE
Figure 4. Demographic and socioeconomic histograms in DEDUCE

In addition to chart visualizations, several SES measures were selected to be made available as base maps within the DEDUCE-Geo mapping application. For example, the same cohort data presented as bar graphs above could be visualized on top of a layer that represents the SES index. (Note that this capability, while not presently available, is planned for release in upcoming months.)

EHR data linkage fig 5
Figure 5. Visualization of DEDUCE cohorts SES in DEDUCE-Geo


Bibliography

[Back to top]


1. Kimmel SE, French B, Kasner SE, et al. A pharmacogenetic versus a clinical algorithm for warfarin dosing. N Engl J Med 2013;369:2283–2293. PMID: 24251361 PMCID: PMC3942158. doi: 10.1056/NEJMoa1310669.

2. Naggie S, Cooper C, Saag M, et al. Ledipasvir and Sofosbuvir for HCV in Patients Coinfected with HIV-1. N Engl J Med 2015;373:705–713. PMID: 26196665. doi: 10.1056/NEJMoa1501315.

3. Exner DV, Dries DL, Domanski MJ, et al. Lesser response to angiotensin-converting-enzyme inhibitor therapy in black as compared with white patients with left ventricular dysfunction. N Engl J Med 2001;344:1351–1357. PMID: 11333991. doi: 10.1056/NEJM200105033441802.

4. Muir AJ, Bornstein JD, Killenberg PG, et al. Peginterferon alfa-2b and ribavirin for the treatment of chronic hepatitis C in blacks and non-Hispanic whites. N Engl J Med 2004;350:2265–2271. PMID: 15163776. doi: 10.1056/NEJMoa032502.

5. Ou S-HI, Ziogas A, Zell JA. Prognostic factors for survival in extensive stage small cell lung cancer (ED-SCLC): the importance of smoking history, socioeconomic and marital statuses, and ethnicity. J Thorac Oncol 2009;4:37–43. PMID: 19096304. doi: 10.1097/JTO.0b013e31819140fb.

6. Braveman PA, Cubbin C, Egerter S, et al. Socioeconomic status in health research: one size does not fit all. JAMA 2005;294:2879–2888. PMID: 16352796. doi: 10.1001/jama.294.22.2879.

7. Elliott MN, Fremont A, Morrison PA, et al. A new method for estimating race/ethnicity and associated disparities where administrative records lack self-reported race/ethnicity. 2008;43:1722–1736. PMID:18479410. doi: 10.1111/j.1475-6773.2008.00854.x.

8. Fiscella K, Fremont AM. Use of geocoding and surname analysis to estimate race and ethnicity. 2006;41:1482–1500. PMID:16899020. doi: 10.1111/j.1475-6773.2006.00551.x.

9. Esri. ArcGIS [computer program]. Redlands, CA. Esri.Available at: http://www.esri.com/. Accessed October 26, 2015.

10. Horvath MM, Rusincovitch SA, Brinson S, et al. Modular design, application architecture, and usage of a self-service model for enterprise data delivery: The Duke Enterprise Data Unified Content Explorer (DEDUCE). 2014;52:231–242.PMID:25051403. doi: 10.1016/j.jbi.2014.07.006.

11. Agency for Healthcare Research and Quality. Creation of New Race-Ethnicity Codes and Socioeconomic Status (SES) Indicators for Medicare Beneficiaries, 2009 (archived). Available at: http://archive.ahrq.gov/research/findings/final-reports/medicareindicators/. Accessed September 23.

Citation:

Anderson M, Pruitt S. Resource Chapters: Linking Demographic and Socioeconomic Data to the Electronic Health Record. In: Rethinking Clinical Trials: A Living Textbook of Pragmatic Clinical Trials. Bethesda, MD: NIH Health Care Systems Research Collaboratory. Available at: http://rethinkingclinicaltrials.org/resources/linking-electronic-health-record-data-to-socioeconomic-status-methods-and-documentation/. Updated December 4, 2017.