Programming Resources
This section includes two types of tools useful for beginning work with ICD-10-CM coded data – Perl Regular Expressions and Standardized Validation Datasets.
Perl Regular Expressions for ICD-10-CM Injury and Drug Overdose Indicators
Notes About Regular Expressions:
Most injury codes in ICD-10-CM have 7 characters. As outlined in the indicator definitions, the regular expressions match codes that are missing the 7th character (encounter type) but exclude codes truncated any further. Some codes have only 3-6 characters by design (T30-T32, Y07, Y09); this is accounted for and noted where necessary.
Capture Groups – parentheses () – are utilized to improve readability of the regular expressions. Each regular expression is formatted as follows:
Capture Group 1 – search pattern for the first 6 characters of the ICD-10-CM code
• Where applicable, the pipe symbol – alternation | – is used within Capture Group 1 to gather multiple 6-character sub-expressions representing the various codes included in the indicator.
• BLUE and YELLOW colors are used to show the alternation pattern within Capture Group 1.
Capture Group 2 – search pattern for the 7th character of the ICD-10-CM code
• The same 7th character inclusion criteria are generally shared by all codes included in the indicator.
• PURPLE color is used to show Capture Group 2.
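The two-group layout described above can be sketched as follows. This is a minimal illustration, not one of the toolkit's indicator definitions: the code prefixes, the choice of "A" as the accepted 7th character, and the names ILLUSTRATIVE_PATTERN and matches_indicator are all assumptions made for the example, and codes are assumed to be stored without the decimal point. Python's re module is used here because it accepts the same Perl-style syntax for this pattern.

```python
import re

# Hypothetical pattern for illustration only; it is NOT a toolkit indicator definition.
# Codes are assumed to be stored without the decimal point (e.g. "T401X1A").
ILLUSTRATIVE_PATTERN = re.compile(
    r"^(T401X[1-4]|T402X[1-4])"  # Capture Group 1: first 6 characters, alternatives joined by |
    r"(A|$|\b)"                  # Capture Group 2: 7th character "A" (assumed), or 7th character missing
)


def matches_indicator(code: str) -> bool:
    """Return True if the code satisfies the illustrative two-group pattern."""
    return ILLUSTRATIVE_PATTERN.match(code) is not None


for code in ["T401X1A", "T401X1", "T401X1D", "T401X"]:
    print(code, matches_indicator(code))
```

In this sketch a full 7-character code ending in "A" and a 6-character code with no 7th character both match, while a code with a different 7th character or a code truncated below 6 characters does not.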
The regular expressions should be updated annually to include any CMS modifications to the code set that affect the indicator definitions.
If you are using SAS, PRX functions can be used in the DATA step to harness Perl pattern matching features.
If you are using R with these regular expressions, replace ‘\’ with ‘\\’ and use the options ‘perl = TRUE’ and ‘ignore.case = TRUE’ when applicable. In addition, R users can simplify Capture Group 2 by replacing “$|\b” with “$”.
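For readers working in a language other than SAS or R, the same considerations (backslash escaping, case-insensitive matching, and the two forms of Capture Group 2) might look like this in Python. This is a hedged sketch: the prefix and the names PREFIX, with_boundary, end_only, and is_match are hypothetical, and the 7th character "A" is an assumption made for the example.

```python
import re

# Hypothetical 6-character prefix, for illustration only.
PREFIX = r"^(T401X[1-4])"

# A raw string (r"...") keeps the single backslash in \b, so no doubling is needed.
# re.IGNORECASE plays the role of a case-insensitive matching option.
with_boundary = re.compile(PREFIX + r"(A|$|\b)", re.IGNORECASE)
end_only = re.compile(PREFIX + r"(A|$)", re.IGNORECASE)


def is_match(pattern: re.Pattern, value: str) -> bool:
    """Return True if the pattern matches at the start of the value."""
    return pattern.match(value) is not None


padded = "T401X1 "  # a 6-character code followed by trailing padding
print(is_match(with_boundary, padded))       # \b accepts the padded value
print(is_match(end_only, padded))            # $ alone does not
print(is_match(end_only, padded.strip()))    # $ alone works once the field is trimmed
```

The two forms of Capture Group 2 behave identically on trimmed fields that hold a single code; in this sketch they differ only when a 6-character code is followed by a non-word character such as padding.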
Introduction
This toolkit includes several ICD-10-CM validation datasets:
- General Injury hospitalization dataset with dedicated external cause fields
- General Injury hospitalization dataset with no dedicated external cause fields
- General Injury ED visit dataset with dedicated external cause fields
- General Injury ED visit dataset with no dedicated external cause fields
- Supplemental Drug Overdose validation dataset
Purpose
Developing statistical programs that are 100% accurate can be challenging. Often programmers can identify errors within their programs by examining the output. However, if the output seems reasonable, errors in the program may go undetected. A validation dataset can be used to ensure that your statistical program obtains consistent and accurate results. A validation dataset is a “fake” dataset for which answers to specific questions are either known a priori or agreed upon by a “key of three”. This toolkit contains a “key of three” for the general injury validation dataset only. See the Glossary of Terms/Abbreviations and Concepts for further explanation of this concept.
Analysts working with ICD-10-CM coded injury and overdose data can run their statistical analysis programs on the validation datasets and compare their results to an “answer key”, to make sure their program is functioning in the expected manner. Using a validation dataset can help analysts reconcile any programming errors and can also ensure that people using different statistical packages or different approaches to programming are obtaining accurate results. The validation datasets included in this toolkit contain fictitious data; therefore, they should not be used for any purposes other than validation of statistical programming.
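The comparison against an answer key can be sketched as below. This is a minimal example under stated assumptions: the record identifiers, the True/False indicator flags, and the function name compare_to_answer_key are hypothetical and do not reflect the actual layout of the toolkit's datasets or answer key.

```python
def compare_to_answer_key(program_flags: dict, answer_key: dict) -> dict:
    """Return {record_id: (expected, actual)} for every record that disagrees.

    A record missing from the program's output is reported with actual=None.
    """
    mismatches = {}
    for record_id, expected in answer_key.items():
        actual = program_flags.get(record_id)  # None if the program produced no result
        if actual != expected:
            mismatches[record_id] = (expected, actual)
    return mismatches


# Hypothetical answer key and program output, for illustration only.
answer_key = {"rec1": True, "rec2": False, "rec3": True}
program_flags = {"rec1": True, "rec2": True}

print(compare_to_answer_key(program_flags, answer_key))
```

An empty result means the program agrees with the answer key on every record; any entries returned point to the specific records worth reviewing.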
Background
Jurisdictions are frequently asked to collaborate and share data with organizations such as CSTE and CDC for a wide range of projects. With the advent of ICD-10-CM, statistical programs must be updated to reflect the new coding schema with all its added complexities. Not all jurisdictions have the capacity or time to internally validate statistical programs. Currently, there are very few resources publicly available to help practitioners validate statistical programs for analyzing ICD-10-CM coded data.
The intent behind the ICD-10-CM Standardized Validation Dataset project is to provide a simple way for analysts to check the accuracy of their statistical analysis programs, providing the programmer confidence in the results. The tools give jurisdictions an opportunity to work with other jurisdictions on a common dataset and to share programming ideas and techniques. These datasets could also serve as training tools for students and new staff.
Since its inception, the utility of this approach has been demonstrated through various projects. If you would like additional information on how the datasets were developed, please contact the CSTE staff lead.
Standardized General Injury Validation Dataset Materials:
- Nonfatal ED Visits Combined (with no dedicated external cause fields)
- Nonfatal ED Visits Separate (with dedicated external cause fields)
- Nonfatal Hospitalizations Combined (with no dedicated external cause fields)
- Nonfatal Hospitalizations Separate (with dedicated external cause fields)
- Data Dictionary (to be used with above datasets)
- Answer Key
