Injury Surveillance Toolkit

Programming Resources

This section includes two types of tools useful for beginning work with ICD-10-CM coded data – Perl Regular Expressions and Standardized Validation Datasets.

Perl Regular Expressions for ICD-10-CM Injury and Drug Overdose Indicators

A regular expression is a sequence of characters that defines a search pattern. There are different syntaxes, or “flavors,” for writing regular expressions; Perl is one of the most common. The injury and drug overdose indicator definitions, operationalized as Perl regular expressions, can be used in statistical programs to identify the presence of included ICD-10-CM codes in ED and hospitalization datasets.

Notes About Regular Expressions:

Most injury codes in ICD-10-CM have 7 characters. As outlined in the indicator definitions, the regular expressions match codes that are missing the 7th character (encounter type) but exclude codes truncated any further than this. Some codes have only 3 to 6 characters by design (T30-T32, Y07, Y09). This is accounted for and noted where necessary.

Capture Groups – parentheses () – are utilized to improve readability of the regular expressions. Each regular expression is formatted as follows:

Capture Group 1 – search pattern for the first 6 characters of the ICD-10-CM code

• Where applicable, the pipe symbol – alternation | – is used within Capture Group 1 to gather multiple 6-character sub-expressions representing the various codes included in the indicator.

• BLUE and YELLOW colors are used to show the alternation pattern within Capture Group 1.

Capture Group 2 – search pattern for the 7th character of the ICD-10-CM code

• The same 7th character inclusion criteria are generally shared by all codes included in the indicator.

• PURPLE color is used to show Capture Group 2.
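The two-group layout described above can be sketched in Python, whose `re` module accepts this Perl-style syntax. The pattern below is a hypothetical illustration, not one of the toolkit's indicator definitions: the codes (T40.1X1 through T40.1X4 and T40.601 through T40.604, written without the decimal point) and the accepted 7th character `A` are assumptions chosen only to show the structure.

```python
import re

# Hypothetical pattern, for illustration only (not a toolkit definition).
# Capture Group 1: first 6 characters, with | separating two sub-expressions.
# Capture Group 2: 7th character "A", or a missing 7th character ($ or \b).
PATTERN = re.compile(r"(T401X[1-4]|T4060[1-4])(A|$|\b)")

def matches_indicator(code: str) -> bool:
    """Return True if the code (no decimal point) satisfies the pattern."""
    return PATTERN.search(code) is not None

# A complete code, a code missing its 7th character, a code truncated
# further, a code with a different 7th character, and a second sub-expression.
examples = ["T401X1A", "T401X1", "T401X", "T401X1D", "T40602A"]
results = {code: matches_indicator(code) for code in examples}
for code, hit in results.items():
    print(code, hit)
```

In this sketch the code missing only its 7th character still matches, while the code truncated to 5 characters and the code with a 7th character other than `A` do not.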

The regular expressions should be updated annually to include any CMS modifications to the code set that affect the indicator definitions.

If you are using SAS, PRX functions can be used in the DATA step to harness Perl pattern matching features.

If you are using R with these regular expressions, replace ‘\’ with ‘\\’ and use the options ‘perl = TRUE’ and ‘ignore.case = TRUE’ when applicable. In addition, R users can simplify the aforementioned Capture Group 2 by replacing “$|\b” with “$”.
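Analysts working in Python rather than SAS or R can make analogous adjustments: `re.IGNORECASE` plays a role similar to R's `ignore.case = TRUE`, and raw strings avoid the need to double the backslashes. The following is a minimal sketch only; the pattern, the field names `dx1` to `dx3`, and the sample records are all hypothetical.

```python
import re

# Hypothetical pattern; re.IGNORECASE lets lower-case codes match too.
PATTERN = re.compile(r"(T401X[1-4])(A|$|\b)", re.IGNORECASE)

# Hypothetical records with three diagnosis fields each.
records = [
    {"id": 1, "dx1": "S0101XA", "dx2": "t401x1a", "dx3": ""},
    {"id": 2, "dx1": "T401X2D", "dx2": "", "dx3": ""},
    {"id": 3, "dx1": "J189", "dx2": "T401X4", "dx3": ""},
]

def flag_record(record: dict, fields=("dx1", "dx2", "dx3")) -> bool:
    """Return True if any listed diagnosis field matches the pattern."""
    return any(PATTERN.search(record.get(f) or "") for f in fields)

flagged_ids = [r["id"] for r in records if flag_record(r)]
print(flagged_ids)
```

Checking every diagnosis field, rather than only the first, matters for datasets that have no dedicated external cause fields, since the relevant code may appear in any position.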

Introduction

This toolkit includes several ICD-10-CM validation datasets:

General Injury hospitalization dataset with dedicated external cause fields
General Injury hospitalization dataset with no dedicated external cause fields
General Injury ED visit dataset with dedicated external cause fields
General Injury ED visit dataset with no dedicated external cause fields
Supplemental Drug Overdose validation dataset

Purpose

Developing statistical programs that are 100% accurate can be challenging. Programmers can often identify errors in their programs by examining the output. However, if the output seems reasonable, errors in the program may go undetected. A validation dataset can be used to ensure that your statistical program obtains consistent and accurate results. A validation dataset is a “fake” dataset for which answers to specific questions are either known a priori or agreed upon by a “key of three.” This toolkit contains a “key of three” for the general injury validation dataset only. See the Glossary of Terms/Abbreviations and Concepts for further explanation of this concept.

Analysts working with ICD-10-CM coded injury and overdose data can run their statistical analysis programs on the validation datasets and compare their results to an “answer key” to make sure their programs function as expected. Using a validation dataset can help analysts identify and correct programming errors and can also ensure that people using different statistical packages or different approaches to programming are obtaining accurate results. The validation datasets included in this toolkit contain fictitious data; therefore, they should not be used for any purpose other than validation of statistical programming.
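The comparison step described above can be sketched as follows. This is an illustration only: the indicator names, the counts, and the dictionary layout are hypothetical and do not reflect the actual answer key distributed with the toolkit.

```python
def compare_to_answer_key(computed: dict, answer_key: dict) -> dict:
    """Return indicators whose computed count differs from the key.

    Each mismatch maps to a (computed, expected) pair; an indicator
    missing from the computed results is reported with None.
    """
    mismatches = {}
    for indicator, expected in answer_key.items():
        observed = computed.get(indicator)
        if observed != expected:
            mismatches[indicator] = (observed, expected)
    return mismatches

# Hypothetical counts, invented for this example.
answer_key = {"all_injury": 120, "fall": 35, "drug_overdose": 12}
computed = {"all_injury": 120, "fall": 33}

mismatches = compare_to_answer_key(computed, answer_key)
print(mismatches)
```

An empty result would suggest the program agrees with the key on every indicator checked; any entry points to an indicator definition worth re-examining.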

Background

Jurisdictions are frequently asked to collaborate and share data with organizations such as CSTE and CDC for a wide range of projects. With the advent of ICD-10-CM, statistical programs must be updated to reflect the new coding schema with all its added complexities. Not all jurisdictions have the capacity or time to internally validate statistical programs. Currently, there are very few resources publicly available to help practitioners validate statistical programs for analyzing ICD-10-CM coded data.  

The intent behind the ICD-10-CM Standardized Validation Dataset project is to provide a simple way for analysts to check the accuracy of their statistical analysis programs, giving programmers confidence in their results. The tools give jurisdictions an opportunity to work with other jurisdictions on a common dataset and to share programming ideas and techniques. These datasets could also serve as training tools for students and new staff.

Since the project's inception, the utility of this approach has been demonstrated through various projects. If you would like additional information on how the datasets were developed, please contact the CSTE staff lead.

Standardized General Injury Validation Dataset Materials:

Nonfatal ED Visits Combined (with no dedicated external cause fields)
Nonfatal ED Visits Separate (with dedicated external cause fields)
Nonfatal Hospitalizations Combined (with no dedicated external cause fields)
Nonfatal Hospitalizations Separate (with dedicated external cause fields)
Data Dictionary (to be used with above datasets)
Answer Key

Supplemental Drug Overdose Validation Dataset Materials: