Linking Demographic and Socioeconomic Data to the Electronic Health Record
Sohayla Pruitt, MA
Karen Staman, MS
Liz Wing, MA
Pragmatic clinical trials (PCTs) focus on how treatments work in the real world, and because PCTs draw from healthcare systems consisting of large populations, they offer unique opportunities to study treatment differences by demographic characteristics such as race/ethnicity, sex, age, or socioeconomic status (SES).
Traditional randomized controlled trials (RCTs) generally report a single measure of treatment effect, but the same treatment can have variable responses in different populations [1-5]. PCTs have the potential to yield information on subgroup differences to determine whether there is heterogeneity of treatment effect (HTE), defined as nonrandom variability in the direction of magnitude of a treatment effect. Yet efforts to examine treatment effect by race/ethnicity in PCTs may be hampered by large amounts of missing data. Additionally, associations by race or ethnicity are often confounded by the complex interplay of socio-environmental factors , including SES. Various geospatial strategies can be used to estimate an individual’s race or ethnicity from information in the electronic health record (EHR) using the individual's last name and/or address. The most accurate estimation method is Bayesian Improved Surname Geocoding (BISG) [7,8]. BISG uses both patient addresses and surname analysis to assign probabilities for membership in a racial or ethnic group. U.S. Census Bureau and American Community Survey files also can be used to associate demographic and socioeconomic information in the EHR.
This tool introduces the methodology used at Duke Medicine for linking and enriching demographic and socioeconomic data within its enterprise-wide EHR system.
Also see Monique Anderson’s Grand Rounds Presentations on this topic:
Geocoding Methodology Deployed Within the EHR
In 2012 Duke Medicine deployed an automated address standardization and geocoding process that was initially applied to all addresses stored in Duke Medicine’s Enterprise Data Warehouse (EDW). This same process was subsequently scheduled to run nightly on any new addresses added to the EDW. The following section describes the software, hardware, and other specifics related to the automation methodology.
SAS Dataflux Data Management Studio (SAS Institute Inc., Cary, NC) is used to verify, standardize, and geocode address information in the EDW. Both the U.S. Postal Service (USPS) address database and the TomTom street-level geocoding data pack are licensed through the SAS DMS software and are refreshed on a quarterly basis.
The EDW geocoding server is a Windows 2008 Server virtual machine located in the secure Duke Medicine data center. The server hardware is managed by the Duke Health Technology Solutions data management group.
New addresses are identified nightly by extract-transform-load procedures that pull data from EHRs and place them in the EDW. The IBM Tivoli utility (IBM Corporation, Armonk, NY) is used to schedule these regular nightly processes. Whenever possible, addresses are matched to verified addresses contained in the USPS database, using a combination of elements available in the EDW, such as full street address, city, state, and/or 5-digit ZIP code. When an address is verified, the address elements—now parsed, cleaned, filled in, and standardized—are returned to the EDW from the USPS database. These now include address data elements that may not have been originally obtained in the EDW. Unverified addresses are flagged and reevaluated when a new USPS quarterly data pack is released.
For all USPS-verified addresses, the TomTom data pack, as loaded into DMS, is used to assign latitude, longitude, and other geospatial identifying information to the addresses (e.g., U.S. Census Block Group Federal Information Processing Standards [FIPS] Code). The TomTom data pack uses a cascade approach to apply geocoding depending on available data, which means the program attempts to find a match using the highest degree of accuracy first and moves to lower accuracy methods until a match is found. The geocoding methods in descending order of accuracy are:
- Highest: Delivery point (also known as “rooftop geocoding”) is essentially the precise latitude and longitude of the doorstop or mailbox.
- Middle: The 9-digt ZIP code that includes a 4-digit extension (AABB) representing the ZIP Code Sector and the ZIP Code Segment.
- ZIP Code Sector (AABB): geographic area of a ZIP code area such as city blocks or a large building.
- ZIP Code Segment (AABB): a very specific location such as a floor in a large building or a specific city block. The accuracy of an address at the “Plus 4” level falls within a radius of 10-20 households on the same side of the street.
- Lowest: TomTom provides coordinates at the ZIP code centroid level.
The EDW geocoding process attempts to retrieve all rooftop coordinates at the first pass, and then uses the 9-digit ZIP code coordinates when rooftop coordinates cannot be identified. The EDW, however, does not retain geocoding at the ZIP code centroid level.
Table 1 lists the data elements that resulted from the automation process.
|Field Name||Field Type (Length)||Definition|
|ADDRESS_KEY||NUMBER||Table primary key|
|FULL_STREET_ADDRESS||VARCHAR2(128)||Full string address information (delimited with $); Business key for table|
|Elements from the clinical data record match to the USPS database for verification and standardization:|
|STREET_ADDRESS||VARCHAR2(128)||Parsed from FULL_STREET_ADDRESS to include all text to the left of the $ delimiter|
|CITY_STATE_ZIP||VARCHAR2(128)||Parsed from FULL_STREET_ADDRESS to include all text to the right of the $ delimiter|
|Elements that are returned from the USPS database as part of the verification and standardization process, and are prefixed with ‘STD_’:|
|VERIFIED_FLAG||VARCHAR2(1)||Y/N address verified by USPS knowledge base|
|STD_STREET_LINE1||VARCHAR2(128)||USPS Standardized street line 1 address (Value present only if address was verified and standardized by USPS).|
|STD_STREET_LINE2||VARCHAR2(128)||USPS Standardized street line 2 address (Value present only if address was verified and standardized by USPS).|
|STD_CITY||VARCHAR2(60)||USPS Standardized city (Value present only if address was verified and standardized by USPS).|
|STD_STATE||VARCHAR2(2)||USPS Standardized state (Value present only if address was verified and standardized by USPS).|
|STD_ZIP||VARCHAR2(5)||USPS Standardized 5-digit zip code (Value present only if address was verified and standardized by USPS).|
|STD_ZIP4||VARCHAR2(4)||USPS Standardized +4 zip code (Value present only if address was verified and standardized by USPS).|
|STD_POSTAL_CODE||VARCHAR2(48)||USPS Standardized postal code. This can contain the 5 digit zip code or the full 9-digit zip code (Value present only if address was verified and standardized by USPS).|
|STD_COUNTRY_CODE||VARCHAR2(3)||USPS Standardized 3-character country code (Value present only if address was verified and standardized by USPS).|
|STD_STREET_NUMBER||VARCHAR2(10)||USPS Standardized and parsed street number (Value present only if address was verified and standardized by USPS).|
|STD_STREET_PRE_DIRECTION||VARCHAR2(6)||USPS Standardized and parsed pre-direction of street (Value present only if address was verified and standardized by USPS).|
|STD_STREET_NAME||VARCHAR2(60)||USPS Standardized and parsed name of street (Value present only if address was verified and standardized by USPS).|
|STD_STREET_SUFFIX||VARCHAR2(10)||USPS Standardized and parsed street suffix (Value present only if address was verified and standardized by USPS).|
|STD_STREET_POST_DIRECTION||VARCHAR2(10)||USPS Standardized and parsed post direction of street (Value present only if address was verified and standardized by USPS).|
|STD_STREET_2ND_ABBREV||VARCHAR2(60)||USPS Standardized and parsed second abbreviation--i.e. Apt., Unit, etc. (Value present only if address was verified and standardized by USPS).|
|STD_STREET_2ND_NUMBER||VARCHAR2(20)||USPS Standardized and parsed second number--i.e. Apt Number (Value present only if address was verified and standardized by USPS).|
|STD_COUNTY_NAME||VARCHAR2(60)||USPS Standardized county name (Value present only if address was verified and standardized by USPS).|
|STD_STATE_FIPS||VARCHAR2(2)||U.S. Census State FIPS code (Value present only if address was verified and standardized by USPS).|
|Data elements resulting from geocoding where the ‘STD_’ elements have been fed to the TomTom cascade methodology:|
|GEO_RESULT_CODE||VARCHAR2(7)||Level of geocoding achieved: 'DP' = highest accuracy based on delivery point, and 'PLUS4' = zip+4 match, (null) = geocoding not successful|
|GEO_LATITUDE||NUMBER(10,6)||Y COORDINATE using WGS84 DATUM (Value present only if DP or PLUS4 rooftop geocoding accuracy was achieved).|
|GEO_LONGITUDE||NUMBER(10,6)||X COORDINATE using WGS84 DATUM (Value present only if DP or PLUS4 rooftop geocoding accuracy was achieved).|
|GEO_COUNTY_FIPS_CODE||VARCHAR2(5)||U.S. Census County FIPS Code: first two digits = state FIPS code; last three = county code (Value present only if DP or PLUS4 rooftop geocoding accuracy was achieved).|
|GEO_TRACT_FIPS_CODE||VARCHAR2(11)||U.S. Census Tract FIPS Code: first two digits = state FIPS, next three digits = county code, next six digits = tract code (Value present only if DP or PLUS4 geocoding accuracy was achieved).|
|GEO_BLOCKGRP_FIPS_CODE||VARCHAR2(12)||U.S. Census Tract FIPS Code: first two digits = state FIPS, next three digits = county code, next six digits = tract code, last digit = block group code (Value present only if DP or PLUS4 geocoding accuracy was achieved).|
|GEO_BLOCK_FIPS_CODE||VARCHAR2(15)||U.S. Census Block FIPS Code: first two digits = state FIPS, next three digits = county code, next six digits = tract code, last four digits = block code (Value present only if DP or PLUS4 geocoding accuracy was achieved).|
|GEO_MCD_FIPS_CODE||VARCHAR2(5)||U.S. Census Minor Civil Divisions Code (Value present only if DP or PLUS4 rooftop geocoding accuracy was achieved).|
|GEO_MSA_CMSA_CODE||VARCHAR2(4)||U.S. Census Metro Statistical Area Code (Value present only if DP or PLUS4 rooftop geocoding accuracy was achieved).|
|GEO_PMSA_FIPS_CODE||VARCHAR2(4)||U.S. Census Primary Metro Statistical Area Code (Value present only if DP or PLUS4 rooftop geocoding accuracy was achieved).|
|GEO_CBSA_FIPS_CODE||VARCHAR2(5)||U.S. Census Consolidated Metropolitan Statistical Area Code (Value present only if DP or PLUS4 rooftop geocoding accuracy was achieved).|
Table 1. Data elements
Geocoding Performance Statistics
- Approximately 95% of patients with a physical address (i.e., a street address rather than a post office box) have a verified and standardized current address.
- Approximately 92% of patients with a physical address have both a standardized and geocoded address at the delivery point or ZIP+4 level of accuracy.
- A full verification-standardization-geocoding refresh cycle involves approximately 5 million addresses and takes approximately 10 hours.
- Nightly verification-standardization-geocoding of iteratively added “new” addresses takes less than 1 hour. Approximately 1,000 to 3,000 new addresses are added each day.
Geocoding Methodology Rationale
Each geocoding method has its own relationship with “ground truth” accuracy and its own particular limitations and assumptions. Regardless of the method used, the typical variance is approximately 12 meters and is sufficient for neighborhood-level analyses.
On-Demand Geospatial Visualization of Patient Health and Healthcare Within the EHR
In 2012 Duke Medicine deployed an on-demand, geospatial visualization tool within its EDW. The following section describes the software and capabilities related to the geospatial visualization technology.
After DEDUCE users create a patient cohort, they are able to geospatially visualize the cohort on a map with dot distribution (i.e., point maps) or thematic maps that summarize patient counts per geographic unit (Figure 1). Thematic maps are available for a number of geographic boundaries, including county, 3-digit ZIP codes, 5-digit ZIP codes, census tracts, and census block groups. The counts of patients can be normalized by an all-Duke-patient count, 2010 population count, or square miles.
Demographic and Socioeconomic Data Linkage Within the EHR
In 2013 the U.S. Census Bureau’s 2010 Summary File 1 and the American Community Survey 2007-2011 5-year estimate data for North Carolina at the block group level was extracted, transformed, and loaded into Duke Medicine’s EDW. Automated processes were built to retain the raw counts of each statistical measure but also to provide computed meaningful derived values representing the percentage of population or households with a given statistical attribute (e.g., the percentage of population with public assistance income versus the number of people with public assistance income in a given block group).
As part of the geocoding process, every address record in the EDW that was successfully geocoded to the delivery point of ZIP+4 unit of geography was also assigned a block group FIPS code that can easily be joined in order to attribute each geocoded patient with the demographic and socioeconomic variables that exist in the two datasets.
Approximately 10,000 new demographic and socioeconomic variables are now available to describe the characteristics present within each patient’s block group neighborhood. In 2014, these variables were made available within the DEDUCE Cohort Manager application  so that patient cohorts could be (a) further refined/filtered by the demographic characteristics present within their Census block group (e.g., those with median household incomes less than $49,999) (Figure 2) or (b) extracted from the EDW alongside any number of selected variables for more advanced statistical analysis or visualization outside of DEDUCE (Figure 3).
Integration of SES Measures
The SES measures listed in Table 2 were created within the EDW, based on a report from the Agency for Healthcare Research and Quality .
|UnemploymentPercentage of persons aged ≥16 years in the labor force who are unemployed and actively seeking work|
|Below US poverty linePercentage of persons below the federally defined poverty line|
|Median incomeMedian household income|
|Property valuesMedian value of owner-occupied homes|
|Low educationPercentage of persons aged >25 years with less than a 12th-grade education|
|High educationPercentage of persons aged >25 years with at least 4 years of college|
|Crowded householdsPercentage of households containing ≥ 1 person per room|
Table 2. SES measures integrated into the EDW
The following statistical data elements and mathematical transformations from the American Community Survey 2007-2011 were applied in order to create the measures shown in Table 2.
|B23025_005Population 16 years and over: In labor force--Civilian labor force, Unemployed|
|B23025_001Population 16 years and over|
|TransformationUnemployment = (B23025_005/B23025_001)*100|
Below US Poverty Line Measure
|B17021_002Population for whom poverty status is determined: Income in the past 12 months below poverty level|
|B17021_001Population for whom poverty status is determined|
|TransformationBelow US Poverty Line = (B17021_002/B17021_001)*100|
Median Income Measure
|B19013_001Median household income in the past 12 months (in 2011 inflation-adjusted dollars): Total|
|Transformation0-100 Normalized Median Income = (B19013_001/(MAX(B19013_001)))*100|
Property Values Measure
|B25077_001Median value quartile (dollars) of owner-occupied housing units|
|Transformation0-100 Normalized Median Income = (B25077_001 /(MAX(B25077_001)))*100|
Low Education Measure
|B15002_003Population 25 years and over: Male--No schooling completed|
|B15002_004Population 25 years and over: Male--Nursery to 4th grade|
|B15002_005Population 25 years and over: Male--5th and 6th grade|
|B15002_006Population 25 years and over: Male--7th and 8th grade|
|B15002_007Population 25 years and over: Male--9th grade|
|B15002_008Population 25 years and over: Male--10th grade|
|B15002_009Population 25 years and over: Male--11th grade|
|B15002_010Population 25 years and over: Male--12th grade, no diploma|
|B15002_020Population 25 years and over: Female--No schooling completed|
|B15002_021Population 25 years and over: Female--Nursery to 4th grade|
|B15002_022Population 25 years and over: Female--5th and 6th grade|
|B15002_023Population 25 years and over: Female--7th and 8th grade|
|B15002_024Population 25 years and over: Female--9th grade|
|B15002_025Population 25 years and over: Female--10th grade|
|B15002_026Population 25 years and over: Female--11th grade|
|B15002_027Population 25 years and over: Female--12th grade, no diploma|
|B15002_001Population 25 years and over|
|TransformationLow Education = ((B15002_003 + B15002_004 + B15002_005+ B15002_006+ B15002_007+ B15002_008+ B15002_009+ B15002_010+ B15002_020+ B15002_021+ B15002_022+ B15002_023+ B15002_024+ B15002_025+ B15002_026+ B15002_027)/B15002_001)*100|
High Education Measure
|B15002_015Population 25 years and over: Male--Bachelor's degree|
|B15002_016Population 25 years and over: Male--Master's degree|
|B15002_017Population 25 years and over: Male--Professional school degree|
|B15002_018Population 25 years and over: Male--Doctorate degree|
|B15002_032Population 25 years and over: Female--Bachelor's degree|
|B15002_033Population 25 years and over: Female--Master's degree|
|B15002_034Population 25 years and over: Female--Professional school degree|
|B15002_035Population 25 years and over: Female--Doctorate degree|
|B15002_001Population 25 years and over|
|TransformationHigh Education = ((B15002_015+ B15002_016+ B15002_017+ B15002_018+ B15002_032+ B15002_033+ B15002_034+ B15002_035)/B15002_001)*100|
Crowded Households Measure
|B25014_006Occupied housing units: Owner occupied--1.51 to 2.00 occupants per room|
|B25014_007Occupied housing units: Owner occupied--2.01 or more occupants per room|
|B25014_012Occupied housing units: Renter occupied--1.51 to 2.00 occupants per room|
|B25014_013Occupied housing units: Renter occupied--2.01 or more occupants per room|
|B25014_001Occupied housing units|
|TransformationCrowded Households = ((B25014_006+B25014_007+B25014_012+B25014_013)/B25014_001)*100|
Given that most of the above SES measures represent percentages and range in values from 0-100, the Median Income and Median Property values were subjected to mathematical transformation in order to standardize their values to a range of 0-100.
The following mathematical equation was then applied to each NC block group using the measures described above:
SES Index Score = 50+(-0.07*Crowded Households)+(0.08*Property Values 0-100)+(-0.10*Below US Poverty Line)+ (0.11*Median Household Income 0-100)+(0.10*High Education)+(-0.11*Low Education)+(-0.08*Unemployment)
Note: 160 block groups contained null values in one or more of the above measures. For these, the SES index could not be calculated.
The SES index ranged in values from 35 to 78 and was further broken into quartiles where:
- Quartile 1 / Low SES = 35-48
- Quartile 2 / Medium-Low SES = 49-51
- Quartile 3 / Medium-High SES = 52-55
- Quartile 4 / High SES = 56-78
Visualization of Patient Cohort SES
The availablility of the SES measure, index, and quartile variables for every geocoded patient in Duke's EDW has enabled on-demand filtering, export, and visualization of patient data according to these measures. The SES measures were served through the DEDUCE cohort charting capability, so that for any cohort of patients, a histogram distribution could be visualized on-demand. For example, within a cohort of high utilizers of emergency department services over a given month at Duke hospitals, a number of demographic and socioeconomic variables can be visualized graphically (Figure 4).
In addition to chart visualizations, several SES measures were selected to be made available as base maps within the DEDUCE-Geo mapping application. For example, the same cohort data presented as bar graphs above could be visualized on top of a layer that represents the SES index. (Note that this capability, while not presently available, is planned for release in upcoming months.)
1. Kimmel SE, French B, Kasner SE, et al. A pharmacogenetic versus a clinical algorithm for warfarin dosing. N Engl J Med 2013;369:2283–2293. PMID: 24251361 PMCID: PMC3942158. doi: 10.1056/NEJMoa1310669.↑
3. Exner DV, Dries DL, Domanski MJ, et al. Lesser response to angiotensin-converting-enzyme inhibitor therapy in black as compared with white patients with left ventricular dysfunction. N Engl J Med 2001;344:1351–1357. PMID: 11333991. doi: 10.1056/NEJM200105033441802.↑
4. Muir AJ, Bornstein JD, Killenberg PG, et al. Peginterferon alfa-2b and ribavirin for the treatment of chronic hepatitis C in blacks and non-Hispanic whites. N Engl J Med 2004;350:2265–2271. PMID: 15163776. doi: 10.1056/NEJMoa032502.↑
5. Ou S-HI, Ziogas A, Zell JA. Prognostic factors for survival in extensive stage small cell lung cancer (ED-SCLC): the importance of smoking history, socioeconomic and marital statuses, and ethnicity. J Thorac Oncol 2009;4:37–43. PMID: 19096304. doi: 10.1097/JTO.0b013e31819140fb.↑
7. Elliott MN, Fremont A, Morrison PA, et al. A new method for estimating race/ethnicity and associated disparities where administrative records lack self-reported race/ethnicity. 2008;43:1722–1736. PMID:18479410. doi: 10.1111/j.1475-6773.2008.00854.x.↑
10. Horvath MM, Rusincovitch SA, Brinson S, et al. Modular design, application architecture, and usage of a self-service model for enterprise data delivery: The Duke Enterprise Data Unified Content Explorer (DEDUCE). 2014;52:231–242.PMID:25051403. doi: 10.1016/j.jbi.2014.07.006.↑
11. Agency for Healthcare Research and Quality. Creation of New Race-Ethnicity Codes and Socioeconomic Status (SES) Indicators for Medicare Beneficiaries, 2009 (archived). Available at: http://archive.ahrq.gov/research/findings/final-reports/medicareindicators/. Accessed September 23.↑