12.07.2015 Views

data consistency, completeness and cleaning - The INCLEN Trust


DATA CONSISTENCY, COMPLETENESS AND CLEANING
By B.K. Tyagi and P. Philip Samuel
CRME, Madurai


DATA QUALITY (DATA CONSISTENCY, COMPLETENESS)
High-quality data needs to pass a set of quality criteria. Those include:
• Accuracy: An aggregated value over the criteria of integrity, consistency, and density
• Integrity: An aggregated value over the criteria of completeness and validity
• Completeness: Achieved by correcting data containing anomalies
• Validity: Approximated by the amount of data satisfying integrity constraints
• Consistency: Concerns contradictions and syntactical anomalies
• Uniformity: Directly related to irregularities and to compliance with the set 'unit of measure'
• Density: The quotient of missing values in the data and the total number of values; this ought to be known
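The density criterion above is a simple ratio, so it can be illustrated with a short calculation. The following is a minimal Python sketch, not part of the original slides; the function name `missing_density` and the sample values are assumptions made for illustration, and "missing" is assumed to mean `None` or an empty string.

```python
from typing import Optional, Sequence


def missing_density(values: Sequence[Optional[str]]) -> float:
    """Return the share of missing entries among all entries.

    Assumption: a value counts as missing when it is None or blank.
    """
    if not values:
        raise ValueError("values must not be empty")
    missing = sum(1 for v in values if v is None or str(v).strip() == "")
    return missing / len(values)


# Hypothetical gender column with two missing entries out of eight
gender = ["F", "M", None, "M", "", "F", "M", "F"]
density = missing_density(gender)
print(f"{density:.0%} of the values are missing")
```

Computing this figure per variable before cleaning gives a baseline against which the cleaned data set can be compared.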


DATA CLEANSING
• Data auditing: The data is audited with the use of statistical methods to detect anomalies and contradictions. This eventually gives an indication of the characteristics of the anomalies and their locations.
• Workflow specification: The detection and removal of anomalies is performed by a sequence of operations on the data known as the workflow. It is specified after the process of auditing the data and is crucial in achieving the end product of high-quality data. In order to achieve a proper workflow, the causes of the anomalies and errors in the data have to be closely considered.
• Workflow execution: In this stage, the workflow is executed after its specification is complete and its correctness is verified. The implementation of the workflow should be efficient, even on large sets of data, which inevitably poses a trade-off because the execution of a data-cleansing operation can be computationally expensive.
• Post-processing and controlling: After executing the cleansing workflow, the results are inspected to verify correctness. Data that could not be corrected during execution of the workflow is manually corrected, if possible. The result is a new cycle in the data-cleansing process where the data is audited again.


DATA QUALITY
Data quality is not linear and has many dimensions, such as Accuracy, Completeness, Consistency, Timeliness and Auditability. Having data quality on only one dimension is as good as 'no quality'.
None of the data quality dimensions is complete by itself, and the dimensions often overlap.


DATA ACCURACY
• The address of a customer in the customer database is the real address.
• The temperature recorded by the thermometer is the real temperature.
• The bank balance in the customer's account is the real value the customer is entitled to from the bank.


DATA COMPLETENESS
Data completeness is defined as 'expected completeness'. Data may be unavailable and still be considered complete, as long as it meets the expectations of the user. Every data requirement has 'mandatory' and 'optional' aspects.
For example: a customer's mailing address is mandatory and it is available; because the customer's office address is optional, it is OK if it is not available.
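The mandatory/optional distinction can be expressed as a small check. This is only a sketch under stated assumptions: the field names `mailing_address` and `office_address` and the helper names are hypothetical, chosen to mirror the example above.

```python
# Hypothetical field names mirroring the customer example
MANDATORY_FIELDS = ("mailing_address",)
OPTIONAL_FIELDS = ("office_address",)


def missing_mandatory(record: dict) -> list:
    """Return the mandatory fields that are absent or blank in a record."""
    return [
        field for field in MANDATORY_FIELDS
        if not str(record.get(field) or "").strip()
    ]


def is_complete(record: dict) -> bool:
    """A record is 'complete' when every mandatory field is filled.

    Optional fields are deliberately ignored.
    """
    return not missing_mandatory(record)


customer_a = {"mailing_address": "12 Main Road, Madurai", "office_address": None}
customer_b = {"mailing_address": "", "office_address": "4 Park Street"}

print(is_complete(customer_a), missing_mandatory(customer_a))
print(is_complete(customer_b), missing_mandatory(customer_b))
```

The first record counts as complete even though its office address is missing; the second does not, because the mandatory field is blank.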


DATA CONSISTENCY
Consistency of data means that data across the enterprise should be in sync with each other.
Examples of data inconsistency are:
• An agent is inactive, but he still has his disbursement account active.
• A credit card is cancelled and inactive, but the card billing status shows 'due'.
Data can be accurate (i.e., it represents what happened in the real world), but still inconsistent:
• An airline promotion campaign closure date is Jan 31, and there is a passenger ticket booked under the campaign on Feb 2.
Data is inconsistent when it is in sync within the narrow domain of one part of an organization, but not in sync across the organization. For example:
• The collection management system has the cheque status as 'cleared', but in the accounting system the money is not shown as credited to the bank account. The reason for this kind of inconsistency is that system interfaces are synchronized only during the end-of-day batch runs.
Data can be complete, but inconsistent:
• Data for all the packets dispatched from NEW DELHI to CHENNAI are available, but some of the packets are also shown with 'under bar-coding' status.
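Inconsistencies of this kind can often be found with simple cross-field rules. The sketch below encodes two of the examples above (the cancelled credit card and the late campaign ticket); all field names, IDs and the year 2015 are assumptions made for illustration.

```python
from datetime import date


def inconsistent_cards(cards):
    """Return IDs of cards that are cancelled but still billed as 'due'."""
    return [
        c["card_id"] for c in cards
        if c["status"] == "cancelled" and c["billing_status"] == "due"
    ]


def tickets_after_closure(tickets, closure):
    """Return IDs of campaign tickets booked after the campaign closed."""
    return [t["ticket_id"] for t in tickets if t["booked_on"] > closure]


# Hypothetical records
cards = [
    {"card_id": "C1", "status": "active", "billing_status": "due"},
    {"card_id": "C2", "status": "cancelled", "billing_status": "due"},
    {"card_id": "C3", "status": "cancelled", "billing_status": "settled"},
]
campaign_closure = date(2015, 1, 31)  # the year is assumed
tickets = [
    {"ticket_id": "T1", "booked_on": date(2015, 1, 20)},
    {"ticket_id": "T2", "booked_on": date(2015, 2, 2)},
]

print(inconsistent_cards(cards))
print(tickets_after_closure(tickets, campaign_closure))
```

Each rule only flags records for review; whether the card status or the billing status is the wrong one still has to be decided from the source records.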


DATA TIMELINESS
'Data delayed' is 'data denied'.
The timeliness of data is extremely important. This is reflected in:
• Companies being required to publish their quarterly results within a given time frame.
• Customer service providing up-to-date information to customers.
• Credit systems checking credit card account activity.
Timeliness depends on user expectation. Online availability of data could be required for a room allocation system in hospitality, but overnight data is fine for a billing system.


DATA AUDITABILITY
Data auditability means that any transaction, report, accounting entry, bank statement, etc. can be traced back to its originating transaction. This requires a common identifier, which should stay with a transaction as it undergoes transformation, aggregation and reporting.


DATA CLEANSING
• Data cleansing, data cleaning, or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, irrelevant, etc. parts of the data and then replacing, modifying, or deleting this dirty data.
• After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores.
• Data cleansing differs from data validation in that validation almost invariably means data is rejected from the system at entry and is performed at entry time, rather than on batches of data.
• The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records).
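The difference between strict and fuzzy validation can be sketched with Python's standard library. This is an illustration under assumptions, not a method from the slides: the postal code rule assumes a six-digit code whose first digit is not zero, the list of known cities is hypothetical, and the similarity cutoff of 0.8 is an arbitrary choice.

```python
import re
from difflib import get_close_matches

# Assumption: a valid postal code is six digits and does not start with 0
POSTAL_CODE = re.compile(r"[1-9][0-9]{5}")

# Hypothetical list of known, valid entities
KNOWN_CITIES = ["NEW DELHI", "CHENNAI", "MADURAI"]


def strict_postal_check(code: str) -> bool:
    """Strict validation: accept the value only if it matches the pattern."""
    return POSTAL_CODE.fullmatch(code.strip()) is not None


def fuzzy_correct(value: str, known=KNOWN_CITIES, cutoff: float = 0.8):
    """Fuzzy validation: return the closest known entry, or None if no
    known entry is similar enough to the value."""
    matches = get_close_matches(value.strip().upper(), known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(strict_postal_check("625002"), strict_postal_check("62500"))
print(fuzzy_correct("Chenai"), fuzzy_correct("XYZ"))
```

Strict checks reject a value outright, while fuzzy matching proposes a correction; a proposed correction is best reviewed before it overwrites the original value.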


Data Cleaning is the First Step in Data Processing
• Data cleaning is the process of detecting and correcting (or removing) incomplete, incorrect, inaccurate and irrelevant parts of a data set by replacing, modifying or deleting the bad data
• It is the first and most important step in any data processing
• It aims to have access to reliable data to avoid false and misdirected conclusions


Data Descriptive Document
• A document should be developed alongside the raw data containing the following information:
  – Variable name
  – Variable type
  – Missing values
  – Variable description
  – Variable value


Using Excel for Character Data
• Select the variable of interest, for example gender
• From the main tool bar go to Data, from there select Filter and then "AutoFilter"
• Click on the auto-filter arrows and a box will show all the available values of our variable
• Check the variable values in the data description document to determine the valid values
• Use auto-filter to select the questionable values
• Excel can give you the case ID of each questionable value
• Refer to the case ID, then check and correct the questionable value by going back to the medical record


Another Approach: Using Frequencies


Checking for Invalid Character Values (1)
• Run frequencies on all character variables that represent a limited number of categories such as gender, residence, hospital's department, occupation, etc.

GENDER            Frequency
2                 1
F                 300
M                 440
X                 1
f                 3
Missing values    5


Checking for Invalid Character Values (2)
• Three categories do not fit our valid data values

GENDER            Frequency
2                 1
F                 300
M                 440
X                 1
f                 3
Missing values    5


Checking for Invalid Character Values (3)
• The 2 and the X are inappropriate values.
• Depending on the situation, f could be considered an error or not

GENDER            Frequency    Note
2                 1            Occurs once
F                 300
M                 440
X                 1            Occurs once
f                 3
Missing values    5
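The frequency approach can also be carried out outside Excel. The following Python sketch counts each distinct value of a character variable and lists the case IDs of values that are not in the valid set; the valid set `{"F", "M"}`, the case IDs and the sample records are assumptions for illustration only.

```python
from collections import Counter

# Assumption: the data description document lists F and M as valid codes
VALID_GENDER = {"F", "M"}


def frequencies(values):
    """Count each distinct value; None and empty strings count as missing."""
    return Counter("Missing" if v in (None, "") else v for v in values)


def questionable_cases(records, valid=VALID_GENDER):
    """Return (case_id, value) pairs whose non-missing value is not valid."""
    return [
        (case_id, value) for case_id, value in records
        if value not in (None, "") and value not in valid
    ]


# Hypothetical (case ID, gender) records
records = [(1, "F"), (2, "M"), (3, "2"), (4, "f"), (5, None), (6, "X"), (7, "M")]

freq = frequencies(value for _, value in records)
for value, count in sorted(freq.items()):
    print(f"{value:<10}{count}")
print(questionable_cases(records))
```

The list of case IDs plays the same role as the auto-filter result: it tells you which medical records to pull for checking.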


Correcting Invalid Character Values
• If the lower case values were entered into the file by mistake but the value, aside from the case, was correct, we consider this value correct and change each of these lower case values to upper case
• For the 2 and X values, we need to identify the location of these errors and correct them after checking the medical records
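The case correction described above can be automated safely if it is only applied when the upper-case form is a valid code. This is a minimal sketch; the function name `normalise_case` and the valid set are assumptions, and values such as 2 and X are deliberately left unchanged so that they can be checked against the medical records.

```python
def normalise_case(value, valid=frozenset({"F", "M"})):
    """Upper-case a value only if the result is a valid code.

    Anything else (including None) is returned unchanged for manual review.
    """
    if value is None:
        return None
    candidate = value.strip().upper()
    return candidate if candidate in valid else value


raw = ["F", "f", "M", "X", "2", None]
cleaned = [normalise_case(v) for v in raw]
print(cleaned)
```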


Checking Missing Data
• Check each of the cases with missing data (here on gender)
• See whether there is information in the case that allows that variable to be entered (e.g. the patient's name will generally indicate gender)


Checking for Invalid Numeric Values
• The techniques for checking invalid numeric data are quite different from the techniques used with character data
  – Examine minimum and maximum values for each numeric variable
  – Internal consistency methods: if we see that most of the data values fall within a certain range of values, then any values that fall far enough outside the range may be data errors
  – Run a univariate analysis, focusing especially on
    • Number of non-missing observations, number of observations not equal to zero and number of observations greater than zero, which are of most interest at this stage
    • Extremes, which show the five lowest and five highest values for numeric variables
    • Quantiles
    • Mean
    • Standard deviation, to decide what constitutes reasonable cutoffs for low and high data values
    • Range
    • Graphic displays: a stem-and-leaf plot, a box plot and a normal probability plot
• Check the medical records for the extreme values and write a note to the data center about the findings to help in further cleaning of these data
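Most of the univariate checks listed above can be sketched with Python's standard `statistics` module. The example below is an illustration under assumptions: the age values and case IDs are hypothetical, `None` marks a missing value, and the cutoff of two standard deviations around the mean is one possible choice rather than a rule from the slides.

```python
import statistics


def numeric_summary(values, n_extremes=5):
    """Summarise a numeric variable; None marks a missing value."""
    present = sorted(v for v in values if v is not None)
    return {
        "n_non_missing": len(present),
        "n_nonzero": sum(1 for v in present if v != 0),
        "n_positive": sum(1 for v in present if v > 0),
        "minimum": present[0],
        "maximum": present[-1],
        "range": present[-1] - present[0],
        "mean": statistics.mean(present),
        "sd": statistics.stdev(present),
        "quartiles": statistics.quantiles(present, n=4),
        "lowest": present[:n_extremes],
        "highest": present[-n_extremes:],
    }


def outside_cutoffs(records, low, high):
    """Return (case_id, value) pairs whose value falls outside [low, high]."""
    return [
        (case_id, v) for case_id, v in records
        if v is not None and not (low <= v <= high)
    ]


# Hypothetical (case ID, age in years) records
ages = [(1, 34), (2, 45), (3, 29), (4, 51), (5, 38),
        (6, 42), (7, 430), (8, None), (9, 36), (10, 0)]

summary = numeric_summary([v for _, v in ages])
low = summary["mean"] - 2 * summary["sd"]   # assumed cutoff rule
high = summary["mean"] + 2 * summary["sd"]
flagged = outside_cutoffs(ages, low, high)

print(summary["lowest"], summary["highest"])
print(flagged)
```

In this sample the standard deviation rule flags the age of 430 but not the age of 0, because the outlier itself inflates the standard deviation. This is one reason to look at the lowest and highest values directly as well as at the cutoffs.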


Dates: Hospitalization (1)
• We can create a variable by subtracting the date of admission from the date of discharge, and call it total hospitalization 1
• This variable will detect any wrong data entry for dates, such as case number 6014


Dates: Hospitalization (2)
• We can create a variable by adding the days the patient spent in ICU, ward and private room, and call it total hospitalization 2


Dates: Hospitalization (3)
• To check for inconsistency we can create a variable, let's call it difference, by subtracting total hospitalization 2 (created by summing the days spent in ICU, ward and private room) from total hospitalization 1 (created from the dates of admission and discharge)
• We need to check any value other than zero by using the auto-filter command and recheck the medical records
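The three derived variables can be computed in a few lines of Python. This is a sketch under stated assumptions: the field names, case IDs and dates are hypothetical, and it assumes that the day counts for ICU, ward and private room use the same convention as the date difference (for example, the day of discharge is not counted as an extra day).

```python
from datetime import date


def check_hospitalization(records):
    """Return (case_id, total1, total2, difference) for suspicious cases.

    total1: discharge date minus admission date, in days
    total2: days in ICU + ward + private room
    A case is flagged when total1 is negative or the two totals differ.
    """
    flagged = []
    for r in records:
        total1 = (r["discharge"] - r["admission"]).days
        total2 = r["icu_days"] + r["ward_days"] + r["private_days"]
        difference = total1 - total2
        if total1 < 0 or difference != 0:
            flagged.append((r["case_id"], total1, total2, difference))
    return flagged


# Hypothetical records
records = [
    {"case_id": 101, "admission": date(2015, 3, 1), "discharge": date(2015, 3, 8),
     "icu_days": 2, "ward_days": 5, "private_days": 0},
    {"case_id": 102, "admission": date(2015, 3, 10), "discharge": date(2015, 3, 5),
     "icu_days": 0, "ward_days": 5, "private_days": 0},
    {"case_id": 103, "admission": date(2015, 4, 1), "discharge": date(2015, 4, 11),
     "icu_days": 1, "ward_days": 6, "private_days": 2},
]

for case in check_hospitalization(records):
    print(case)
```

A negative total hospitalization 1 points to a date entry error (here the dates of case 102 appear to be swapped), while a non-zero difference with plausible dates points to a miscounted stay. In both situations the medical record decides which value is wrong.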
