Data cleaning 101

This paper from the Robert Wood Johnson Medical School outlines a step-by-step data cleaning process for verifying that data values are correct or, at the very least, conform to a set of rules.

Contents

  • A Sample Data Set
  • Description Of The File Patients.txt
  • Checking For Invalid Character Values
  • Using A Data Step To Identify Invalid Character Values
  • Using Proc Print With A Where Statement To List Invalid Data Values
  • Using A Where Statement With Proc Print To List Out-Of-Range Data
  • Using User Defined Formats To Detect Invalid Values
  • Checking For Invalid Numeric Values
  • Using Proc Means, Proc Tabulate, And Proc Univariate To Look For Outliers
  • Using A Data Step To Check For Invalid Values
  • Using Formats For Range Checking
  • Extending Proc Univariate To Look For Lowest And Highest Values By Percentage
  • Creating Another Way To Find Lowest And Highest Values
  • Checking A Range Using An Algorithm Based On Standard Deviation

Sources

Cody, R. (n.d.). Data cleaning 101. Robert Wood Johnson Medical School, Department of Environmental and Community Medicine. Retrieved from http://www.ats.ucla.edu/stat/sas/library/nesug99/ss123.pdf
