The focus on analytics, Business Intelligence (BI), and big data tools has grown over the last 10 years. But new tools don’t negate the adage ‘garbage in, garbage out,’ so, while these tools can make it easier to obtain insights, a data quality program is key to any successful analytics or BI program. What is data quality? What causes poor data quality? How much should I spend on data quality? These are the questions to consider when setting up a data management or data analytics program.
At its simplest, data quality can be broken into two categories: completeness and accuracy. Completeness refers to ensuring that all expected data are received. Accuracy evaluates the validity of the data. Completeness and accuracy can be subjective and should be guided by the organization’s needs such as maximizing profit or legal and safety requirements. That is, having data that are “fit for purpose.”
How the data are obtained influences the data quality program. Data collected directly from the owner (primary collection), such as through a survey, tend to need a more robust data quality program because no one else has cleaned the data. Data obtained from a vendor or organization (secondary collection), such as the government, tend to need less cleaning to be usable because the data have already gone through a data quality process. Primary data collections have several common causes of data quality problems including:
Vague instruction: Providing clear instructions on what and how to report helps ensure that the data can be used for the intended purpose(s). For example, if you instruct 100 children to write an essay titled ‘my summer vacation,’ you’ll get a lot of information, but it will be inconsistent, and it’s unclear if you’d be able to divine any valuable insights. Alternatively, if you ask the student to include the location, dates, and activities, you’ll gather more consistent information and be able to draw intended conclusions.
"At its simplest, data quality can be broken into two categories: completeness and accuracy"
Imprecise or unclear definitions: Clear definitions remove ambiguity and the need for interpretation. For example, ‘loan amount’ can be interpreted as amount authorized, amount of the original loan, or the current balance of the loan. Each of these interpretations may provide a significantly different number and could result in errors in the subsequent analysis.
Lack of incentives: Having incentives for the data provider to avoid or catch errors can have a significant impact on data quality. Unfortunately, incentives are often esoteric (such as helping society) or nonexistent. Possible strategies include: 1) make it easy to provide accurate data including instructions and a user friendly interface, 2) explain how the esoteric benefits help the provider, and 3) provide data or resulting analysis that is useful to the provider.
Human error: Possibly the most common problem is typos, recording data in the wrong column or row, truncation, transposing values, invalid values, or incorrect formats. Human error includes not reading instructions or definitions.
Quality assurance edits used to check for both completeness and accuracy can be broken into three categories: validity, reasonableness, and warning.
Validity edits identify definite errors and often result in the submission’s being rejected. They are frequently used to validate formats, ensure completeness, and highlight obvious errors. Format edits are used to reject data that do not conform to the specified format, such as text in a date or numerical field or an email without an @ symbol. It can also check for the validity of content (e.g., ‘FP’ isn’t invalid in U.S. ).
Reasonableness edits look for information that is highly unlikely or an extreme outlier but there are rare instances in which it might be possible. Reasonableness edits don’t generally cause a data submission to be rejected but may require an explanation. The reasonableness can be based on the statistical probability of the value, business rules, or acceptable tolerances. Reasonableness edits can be lenient or strict depending on purpose. Stricter edits generally result in more edit failures which generally lead to higher operational costs.
Warning edits can be used to highlight content that is in the tail of a bell curve. In other words it is reasonable but not common. These edits generally require the data submitter only to acknowledge that the data are correct. Examples could include stock price increase of greater than 25 percent in one day or a height over 6’5”. Both of these examples occur, but it is worth double-checking to ensure you didn’t have a typo.
Curating quality data requires time and money for both setup and operations. Time spent developing clear guidance and edit checks can save time and money by avoiding excessive data cleanup or as a worst case scenario unusable data. Data governance, data standards, and quality assurance edits all help to minimize data quality problems. Using industry-defined terms and formats can reduce errors because it reduces the need to define and transform data.
Regardless of how you are obtaining data (secondary or primary data collection) a data quality program should ensure that you don’t introduce errors during transmission or manipulation of the data. This can include edits that check volume or row counts and hash totals to ensure you aren’t dropping data, truncating data values, or incorrectly converting formats. After ensuring that you aren’t introducing errors, the question of how much to devote to data quality depends on what is “fit for purpose” in your organization. Stricter edit checks generally result in higher quality data but carry higher setup and operational costs. There needs to be a balance between the ability to make informed decisions based on clean and accurate data and on the cost of curating the data asset—don’t spend more on improving an asset than the asset will save or generate in value.