27.03.2014 Views

Fraud Detection in Selection Exams Using Knowledge Engineering ...

Fraud Detection in Selection Exams Using Knowledge Engineering ...

Fraud Detection in Selection Exams Using Knowledge Engineering ...

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

<strong>Fraud</strong> <strong>Detection</strong> <strong>in</strong> <strong>Selection</strong> <strong>Exams</strong> Us<strong>in</strong>g<br />

<strong>Knowledge</strong> Eng<strong>in</strong>eer<strong>in</strong>g Tools<br />

Marcus de Melo Braga<br />

Post-Graduate Program <strong>in</strong> <strong>Knowledge</strong> Eng<strong>in</strong>eer<strong>in</strong>g<br />

Universidade Federal de Santa Catar<strong>in</strong>a (UFSC)<br />

Florianópolis, Brazil<br />

marcus@egc.ufsc.br<br />

Mario Antonio Ribeiro Dantas<br />

Department of Informatics and Statistics<br />

Universidade Federal de Santa Catar<strong>in</strong>a (UFSC)<br />

Florianópolis, Brazil<br />

mario@<strong>in</strong>f.ufsc.br<br />

Abstract—This paper proposes a method for fraud detection <strong>in</strong><br />

automated selection exams, us<strong>in</strong>g knowledge eng<strong>in</strong>eer<strong>in</strong>g tools for<br />

identify<strong>in</strong>g groups of answers with a strong <strong>in</strong>dication of fraud,<br />

based on probabilistic evidence. Founded on an analysis of the<br />

wrong answers of the various candidates, the proposed method<br />

enables identification of suspicions, and evidence of fraud<br />

attempts through f<strong>in</strong>d<strong>in</strong>g candidates with a significant number of<br />

identical wrong answers. This method of us<strong>in</strong>g knowledge<br />

eng<strong>in</strong>eer<strong>in</strong>g tools can be employed <strong>in</strong> various types of selection<br />

exam<strong>in</strong>ations, adapt<strong>in</strong>g to the characteristics and features of each<br />

one and contribut<strong>in</strong>g to the achievement of fair exams and free of<br />

fraud attempts.<br />

Keywords: knowledge eng<strong>in</strong>eer<strong>in</strong>g tools; fraud detection;<br />

selection exams; multiple-choice questions.<br />

can state that we did not f<strong>in</strong>d any study of application of KE<br />

tools to detect fraud <strong>in</strong> selection exam<strong>in</strong>ations. Therefore, we<br />

propose a new possibility for application of these tools.<br />

Some related studies apply others’ software techniques for<br />

fraud detection. In [3], artificial immune systems are used to<br />

identify fraud attempts <strong>in</strong> onl<strong>in</strong>e credit cards operations. Datam<strong>in</strong><strong>in</strong>g<br />

techniques can also be applied for fraud detection <strong>in</strong><br />

e-Commerce [4], mobile communication networks [5], and <strong>in</strong><br />

conventional telephony [6]. Other studies specifically propose<br />

statistical methods for fraud detection <strong>in</strong> several areas [7], [8].<br />

The model presented <strong>in</strong> this paper for the application of KE<br />

tools <strong>in</strong> fraud detection <strong>in</strong> admission exams is structured <strong>in</strong>to<br />

eight sections: Section II makes a brief <strong>in</strong>troduction to selection<br />

exam<strong>in</strong>ations processes, present<strong>in</strong>g their ma<strong>in</strong> characteristics.<br />

Section III presents the ma<strong>in</strong> types of fraud that can occur <strong>in</strong> a<br />

contest aimed at giv<strong>in</strong>g an overview of the method to be<br />

proposed <strong>in</strong> this paper. Section IV exam<strong>in</strong>es the various types<br />

of questions that can be adopted <strong>in</strong> an exam<strong>in</strong>ation. Section V<br />

presents a statistical study of the probability analysis of<br />

identical wrong answers (IWA) to a specific type of question:<br />

multiple-choice. In Section VI, we present a proposed method<br />

for fraud detection <strong>in</strong> automated selection processes.<br />

Section VII presents and discusses the results, and f<strong>in</strong>ally <strong>in</strong><br />

Section VIII, we give conclusions and suggestions for future<br />

work.<br />

INTRODUCTION<br />

The need for detection of various types of fraud <strong>in</strong> several<br />

fields of application has grown <strong>in</strong> recent years, possibly due to<br />

the great developments <strong>in</strong> <strong>in</strong>formation and communication<br />

technologies (ICT) that provide citizens with diverse resources<br />

and technological facilities for access to databases of public<br />

and private organizations.<br />

The grow<strong>in</strong>g complexity of equipment and services<br />

provided by new technologies h<strong>in</strong>ders the identification of<br />

possible failure po<strong>in</strong>ts or some susceptible to fraud attempts. In<br />

several areas of activities, there is a lack of study and research<br />

that enables the creation of methods and techniques to prevent<br />

II - SELECTION EXAMS<br />

fraud and to reduce f<strong>in</strong>ancial losses caused by this type of<br />

crime. <strong>Knowledge</strong> Eng<strong>in</strong>eer<strong>in</strong>g (KE) provides us with tools<br />

that can assist <strong>in</strong> detect<strong>in</strong>g and prevent<strong>in</strong>g fraud, help<strong>in</strong>g us to<br />

fight this type of crime.<br />

The application of knowledge eng<strong>in</strong>eer<strong>in</strong>g tools <strong>in</strong> fraud<br />

detection is already common <strong>in</strong> many fields of study. It occurs<br />

ma<strong>in</strong>ly <strong>in</strong> the follow<strong>in</strong>g areas: mobile communications<br />

(cellular), f<strong>in</strong>ance, electronic commerce, healthcare, credit<br />

cards and government organizations [1], [2]. A systematic<br />

review on the subject of this research has not identified a study<br />

published <strong>in</strong> the databases Scopus (www.scopus.com) or ISI<br />

Web of <strong>Knowledge</strong> (apps.isiknowledge.com) associat<strong>in</strong>g the<br />

keywords "detection," "fraud," "exam," and "exam<strong>in</strong>ation."<br />

Based on the review conducted <strong>in</strong> the two largest databases, we<br />

This study was supported by CAPES, the Brazilian Government entity for the tra<strong>in</strong><strong>in</strong>g of human resources.<br />

A. Def<strong>in</strong>ition<br />

A selection exam<strong>in</strong>ation process can be def<strong>in</strong>ed as an<br />

activity consist<strong>in</strong>g of several sub-processes or steps, with the<br />

aim of select<strong>in</strong>g candidates through the application of<br />

questionnaires, tests, and other selection tools to fill a limited<br />

number of vacancies.<br />

B. Phases of a selection exam<br />

The phases, steps or sub-processes of a selection<br />

exam<strong>in</strong>ation are necessary to meet statutory deadl<strong>in</strong>es for<br />

publication, call for entries, preparation of tests, allocation of<br />

material and human resources, realization of exams, test


ead<strong>in</strong>g, test process<strong>in</strong>g, and disclosure of results for the<br />

subsequent fill<strong>in</strong>g of vacancies by admission or hir<strong>in</strong>g. Figure 1<br />

shows the complete cycle of the phases of a selection process.<br />

Figure 1. Phases of a selection exam.<br />

The limitation of the vacancies is precisely the ma<strong>in</strong><br />

motivation of the contest; if there were no limits to the<br />

vacancies or the number of qualified candidates for completion<br />

was not significant, selection would have been unnecessary.<br />

The process <strong>in</strong>itiates with the disclosure activities. The<br />

purpose of the announcement phase is to enroll the largest<br />

number of candidates, giv<strong>in</strong>g chances to all <strong>in</strong>terested <strong>in</strong><br />

attend<strong>in</strong>g the event. After comply<strong>in</strong>g with legal requirements<br />

for the announcement, the next phase opens up the <strong>in</strong>scription<br />

or registration for all applicants who may be qualified for the<br />

selection exam. Once the <strong>in</strong>scriptions are f<strong>in</strong>ished, <strong>in</strong> the next<br />

phase, we make the arrangements for the preparation of the<br />

tests that will be applied to candidates; this is done through<br />

enjo<strong>in</strong><strong>in</strong>g experts or specialized companies for the author<strong>in</strong>g of<br />

tests and exam<strong>in</strong>ations. At the phase of resource allocation, the<br />

candidates are distributed by each exam place; all the materials<br />

and human resources are allocated for the next phase, such as<br />

rooms, equipment, proctors, and coord<strong>in</strong>ators. In the test<br />

application stage, there are major activities related to the<br />

security aspects of a selection process, <strong>in</strong>clud<strong>in</strong>g review of all<br />

the candidates with metal detectors dur<strong>in</strong>g their access to the<br />

exams places, plann<strong>in</strong>g and proctor<strong>in</strong>g the application of the<br />

tests, and the electronic monitor<strong>in</strong>g of all the exam<strong>in</strong>ation<br />

locations.<br />

Once the tests are completed, all material is collected <strong>in</strong> a<br />

safe place − usually the <strong>in</strong>stitution's <strong>in</strong>formation technology<br />

department − where the test read<strong>in</strong>g phase is done with<br />

automatic readers, scann<strong>in</strong>g responses of all candidates’ tests<br />

for later assessment. In the test process<strong>in</strong>g phase, all test<br />

answers are processed us<strong>in</strong>g the templates and automated<br />

process<strong>in</strong>g systems for gett<strong>in</strong>g the results of the tests, creat<strong>in</strong>g<br />

the scores for the candidates, and generat<strong>in</strong>g their respective<br />

places (classifications). F<strong>in</strong>ally, after verify<strong>in</strong>g and audit<strong>in</strong>g the<br />

whole process of assessment, f<strong>in</strong>al reports are issued for<br />

disclosure of the results toward the registry or hir<strong>in</strong>g of selected<br />

candidates.<br />

III. FRAUD IN SELECTION EXAMS<br />

Currently, security is one of the most relevant po<strong>in</strong>ts <strong>in</strong><br />

manag<strong>in</strong>g selection processes, given the considerable number<br />

of technological tools that tempt people to perform activities of<br />

fraud, especially <strong>in</strong> contests where there is a great competition<br />

of candidates per position.<br />

Security issues <strong>in</strong> exams can now be handled from the time<br />

of enrollment by consult<strong>in</strong>g special database lists ma<strong>in</strong>ta<strong>in</strong>ed<br />

by various Brazilian governmental <strong>in</strong>stitutions that conta<strong>in</strong> the<br />

names of all candidates already identified <strong>in</strong> prosecution of<br />

fraud attempts, throughout the national territory. Despite<br />

hav<strong>in</strong>g their entries approved, such candidates can be identified<br />

from the time of enrollment and allocated <strong>in</strong> special rooms to<br />

be monitored <strong>in</strong> their activities dur<strong>in</strong>g application of the tests.<br />

At the stage of resource allocation, some security measures<br />

can also be taken, try<strong>in</strong>g to f<strong>in</strong>d candidates with relatives <strong>in</strong> the<br />

same contest and separat<strong>in</strong>g them <strong>in</strong> different rooms or places<br />

for application of the tests, avoid<strong>in</strong>g their proximity.<br />

However, it is at the phase of test application <strong>in</strong> which<br />

security issues should have greater weight because it is<br />

precisely at that po<strong>in</strong>t that the largest number of fraud attempts<br />

occurs. All measures already mentioned − such as the use of<br />

metal detectors and electronic monitor<strong>in</strong>g of electronic<br />

transmissions − are key <strong>in</strong> this phase.<br />

At the stage of test read<strong>in</strong>g, it is also necessary to adopt<br />

preventive measures, with rigid control of access to the<br />

process<strong>in</strong>g data center and precautions aga<strong>in</strong>st possible read<strong>in</strong>g<br />

process<strong>in</strong>g errors.<br />

F<strong>in</strong>ally, <strong>in</strong> the phase of disclosure of results, safety<br />

measures must be adopted with respect to the risks of<br />

<strong>in</strong>formation leakage or premature disclosure of the results,<br />

avoid<strong>in</strong>g harm<strong>in</strong>g the reliability of the whole process.<br />

Our practical experience <strong>in</strong> conduct<strong>in</strong>g selection exams<br />

over two decades shows that occurrences of fraud attempts are<br />

fairly common on any type of competition or selection process.<br />

The attempts of fraud range from traditional and frequent<br />

acts of try<strong>in</strong>g to cheat or plagiarize dur<strong>in</strong>g the test, copy<strong>in</strong>g the<br />

responses of a competitor <strong>in</strong> the same room, and even the most<br />

dar<strong>in</strong>g attempts, carried out by means of electronic<br />

transmission. All these attempts must be considered and<br />

scrut<strong>in</strong>ized by the authorities responsible for conduct<strong>in</strong>g the<br />

selection process, and all the omissions <strong>in</strong> these security cases<br />

are <strong>in</strong>admissible. Absolutely noth<strong>in</strong>g can justify the absence of<br />

mechanisms for its prevention or precaution.<br />

IV. TYPES OF QUESTIONS<br />

For a better understand<strong>in</strong>g of the proposal for fraud<br />

detection presented <strong>in</strong> this article, it is necessary to describe the<br />

types of questions usually adopted <strong>in</strong> exam<strong>in</strong>ation tests.<br />

At first, the ma<strong>in</strong> types of questions used <strong>in</strong> tests can be<br />

grouped <strong>in</strong>to three categories:


Multiple-choice questions.<br />

True/false questions.<br />

Open questions.<br />

Open questions significantly h<strong>in</strong>der the attempts of fraud<br />

s<strong>in</strong>ce they require a response not coded and difficult to be<br />

transmitted − although not impossible. Open questions are the<br />

most difficult to process when adopted <strong>in</strong> contests with a large<br />

number of participants. Another disadvantage that this type of<br />

question presents is how to standardize criteria for evaluation,<br />

aim<strong>in</strong>g to avoid disparities <strong>in</strong> the assessment by reduc<strong>in</strong>g the<br />

subjectivity and <strong>in</strong>consistency among the different evaluators.<br />

True/false questions <strong>in</strong>hibit the adoption of the fraud<br />

detection strategy proposed <strong>in</strong> this paper. With this type of test<br />

question, it becomes difficult to identify the "signature" or the<br />

statistical evidence of fraud, as will be shown below. For this<br />

type of question <strong>in</strong> particular, other strategies should be<br />

established for the automatic detection of fraud, tak<strong>in</strong>g<br />

advantage of their peculiarities.<br />

Multiple-choice questions are still frequently adopted <strong>in</strong><br />

selection processes for their tradition and simplicity [9]. One of<br />

their ma<strong>in</strong> advantages is the ease of representation on pr<strong>in</strong>ted<br />

answer sheets that can be read by automatic readers. This<br />

procedure enables the process<strong>in</strong>g of the large volumes of data<br />

collected <strong>in</strong> medium and large exam<strong>in</strong>ations.<br />

The method for fraud detection proposed <strong>in</strong> this study<br />

assumes that the type of questions adopted <strong>in</strong> the contest is<br />

multiple-choice. This type of question presents an <strong>in</strong>terest<strong>in</strong>g<br />

feature that can be exploited for the automatic detection of<br />

fraud: We observe that when two or more candidates try to<br />

cheat on a test, the probability that their errors are equal is high;<br />

that is, the wrong answers are the same and the alternative<br />

letters answered <strong>in</strong>correctly are also equal. Tak<strong>in</strong>g this<br />

observation as a basis, the method proposed <strong>in</strong> this paper<br />

makes a study of the statistical probability of simultaneous<br />

occurrence of errors on the same questions and the same<br />

alternatives to substantiate the mathematical evidence of a<br />

fraud attempt, further propos<strong>in</strong>g KE tools to identify the<br />

occurrence of this problem, detect<strong>in</strong>g and fight<strong>in</strong>g some fraud<br />

activities <strong>in</strong> selection exams.<br />

In the follow<strong>in</strong>g section, we create a probability analysis of<br />

the simultaneity of the same wrong answers with the same<br />

wrong alternative letters <strong>in</strong> a test of multiple-choice questions<br />

us<strong>in</strong>g a total of 40 questions for a case study.<br />

V. PROBABILITY ANALYSIS<br />

To illustrate the method proposed <strong>in</strong> this paper, we assume<br />

that the test be<strong>in</strong>g exam<strong>in</strong>ed has 40 multiple-choice questions<br />

with 5 alternatives (ABCDE). A sample of the possible<br />

answers can be seen <strong>in</strong> Figure 2.<br />

Figure 2. An example of answers of multiple-choice questions.<br />

The first l<strong>in</strong>e of the figure shows the position of each of the<br />

40 questions <strong>in</strong> the test paper. The bottom l<strong>in</strong>e, displayed with<br />

highlighted background, corresponds to the template of these<br />

multiple-choice questions. The 10 l<strong>in</strong>es between are examples<br />

of candidates' responses.<br />

To further facilitate visualization, we can elim<strong>in</strong>ate all the<br />

correct answers of the candidates, s<strong>in</strong>ce they are of no <strong>in</strong>terest<br />

to our study. Figure 3 shows only the wrong answers, already<br />

highlighted.<br />

Figure 3. Only the wrong answers are displayed.<br />

The research question for the analysis of probabilities is:<br />

What is the probability that two or more candidates err <strong>in</strong> a<br />

multiple-choice test with 40 questions (such as ABCDE) <strong>in</strong> the<br />

same question numbers and answer<strong>in</strong>g the same wrong<br />

alternative letters? Our hypothesis is that after a certa<strong>in</strong> number<br />

of wrong answers exactly alike, the probability of occurrence is<br />

close to zero, provid<strong>in</strong>g formal evidence that there is a strong<br />

suspicion of a fraud attempt.<br />

This probabilistic evidence substantiates the fraud detection<br />

method proposed <strong>in</strong> this study, apply<strong>in</strong>g KE tools for<br />

identify<strong>in</strong>g high similarities <strong>in</strong> the wrong answers of the<br />

candidates <strong>in</strong> a selection exam that meets these characteristics<br />

and type of question.<br />

C. Calculat<strong>in</strong>g the probability<br />

For didactic purposes, it is necessary to def<strong>in</strong>e the concept<br />

of IWA. In this study, IWA are those special occurrences when<br />

two or more candidates miss the same questions, answer<strong>in</strong>g the<br />

same wrong alternative letters, result<strong>in</strong>g <strong>in</strong> a remarkable<br />

co<strong>in</strong>cidence. One can <strong>in</strong>tuitively predict that this co<strong>in</strong>cidence<br />

has a significantly low probability of occurrence (i.e., it is<br />

almost impossible), and this fact sets up a strong <strong>in</strong>dication of<br />

anomaly. We can <strong>in</strong>terpret this occurrence as a "signature" of


the fraud attempt, a feature that makes it unique, s<strong>in</strong>gular, and<br />

− rightfully so − it should be <strong>in</strong>vestigated.<br />

The probability of the simultaneous occurrence of IWA <strong>in</strong><br />

an exam<strong>in</strong>ation test can be expressed as follows:<br />

Let:<br />

a - the number of alternatives <strong>in</strong> each question and<br />

k - the number of IWA,<br />

then the probability (P) can be expressed as:<br />

P =<br />

1 <br />

<br />

2<br />

<br />

a <br />

This equation represents a simplification of the problem<br />

and is based on the follow<strong>in</strong>g assumptions:<br />

The questions are <strong>in</strong>dependent; that is, their responses<br />

do not <strong>in</strong>terfere with the responses to other questions.<br />

The probability of each alternative (ABCDE) is equal.<br />

Without this simplification, the solution would be far more<br />

complex, requir<strong>in</strong>g a more detailed study of all proposed<br />

questions of a specific test to determ<strong>in</strong>e the probability of each<br />

question and each alternative <strong>in</strong>dividually. The proposed<br />

equation also implies that the probability of IWA does not<br />

depend on the size of the universe (i.e., number of candidates<br />

who have taken the test).<br />

This equation demonstrates clearly that as the number of<br />

IWA <strong>in</strong>creases, the value of its probability decreases<br />

significantly. We can perform the calculation <strong>in</strong> a small <strong>in</strong>terval<br />

to analyze the behavior of the calculated probability. Table 1<br />

shows the calculation of probabilities for the range of 1 to 10<br />

IWA, consider<strong>in</strong>g a test with multiple-choice questions with<br />

five alternatives ("ABCDE," with a = 5).<br />

k<br />

TABLE I. PROBABILITIES OF IWA IN A MULTIPLE-<br />

CHOICE TEST WITH 5 ALTERNATIVES<br />

IWA P (%)<br />

1 4.000000000000<br />

2 0.160000000000<br />

3 0.006400000000<br />

4 0.000256000000<br />

5 0.000010240000<br />

6 0.000000409600<br />

7 0.000000016384<br />

8 0.000000000655<br />

9 0.000000000026<br />

10 0.000000000001<br />

From Table 1, we conclude that, if the number of IWA is<br />

equal to or greater than 3, the probability is already m<strong>in</strong>imal<br />

(0.0064%).<br />

VI. PROPOSED METHOD<br />

We have applied a SQL (Structured Query Language) script<br />

and relational databases as a traditional Software Eng<strong>in</strong>eer<strong>in</strong>g<br />

tool for help<strong>in</strong>g to detect some IWA. Nevertheless, the results<br />

us<strong>in</strong>g this technique were not always accurate due to different<br />

comb<strong>in</strong>ations of IWA and the absence of some IWA <strong>in</strong> some<br />

candidate’ responses.<br />

There are many <strong>Knowledge</strong> Eng<strong>in</strong>eer<strong>in</strong>g tools that can be<br />

applied to detect this k<strong>in</strong>d of fraud attempt with better<br />

accuracy. Among others, the ma<strong>in</strong> KE tools that can be used <strong>in</strong><br />

this case are:<br />

Artificial neural networks (ANN).<br />

Artificial immune systems (AIS).<br />

Data m<strong>in</strong><strong>in</strong>g.<br />

Case-based reason<strong>in</strong>g (CBR).<br />

The proposed method is founded on apply<strong>in</strong>g case-based<br />

reason<strong>in</strong>g (CBR) as a data-m<strong>in</strong><strong>in</strong>g tool. CBR is an artificial<br />

<strong>in</strong>telligence technique applied <strong>in</strong> solv<strong>in</strong>g problems and<br />

achiev<strong>in</strong>g learn<strong>in</strong>g, based on past experience such as the use of<br />

known cases to solve new cases [10].<br />

CBR methodology fits the purpose of fraud detection<br />

proposed <strong>in</strong> this paper by present<strong>in</strong>g some features relevant to<br />

the problem presented. Among these characteristics, we can<br />

highlight the follow<strong>in</strong>g [11]:<br />

A CBR system should be able to handle <strong>in</strong>complete<br />

and highly specific queries.<br />

It can suggest appropriate cases, even when not all<br />

attributes are provided.<br />

It presents retrieved cases <strong>in</strong> a rational way; that is,<br />

avoid<strong>in</strong>g excessive responses and respect<strong>in</strong>g a limit of<br />

retrieved cases specified by the user.<br />

These characteristics favor the application of CBR<br />

methodology <strong>in</strong> this type of fraud detection, s<strong>in</strong>ce it allows<br />

retrieval of similar cases differently from traditional<br />

<strong>in</strong>formation retrieval, done through use of queries <strong>in</strong> relational<br />

databases.<br />

The proposed method consists <strong>in</strong> design<strong>in</strong>g a system of<br />

CBR for the identification of candidates with IWA above the<br />

m<strong>in</strong>imum value of statistical probability (IWA = 3) for the<br />

retrieval of similar cases <strong>in</strong> an IWA file. The system starts out<br />

with a file conta<strong>in</strong><strong>in</strong>g all the wrong answers of several<br />

candidates, mak<strong>in</strong>g an automatic sequential search of all<br />

similar cases until there is a recovery of more than one case<br />

where this similarity occurs. All cases recovered with high<br />

similarity will be listed for further <strong>in</strong>vestigation of attempted<br />

fraud.<br />

As the volume of responses <strong>in</strong> a test can be quite high, it is<br />

necessary to adopt some criteria to restrict the sample studied<br />

<strong>in</strong> order to optimize queries. The first criterion to be adopted is<br />

to restrict the sample studied by elim<strong>in</strong>at<strong>in</strong>g the responses of<br />

candidates whose number of hits <strong>in</strong> the test is less than a<br />

m<strong>in</strong>imum pass<strong>in</strong>g grade. Thus, all responses that have a


number of errors exceed<strong>in</strong>g a certa<strong>in</strong> percentage will be<br />

automatically discarded s<strong>in</strong>ce these candidates failed to pass<br />

the competition. This measure considerably reduces the<br />

number of cases to be searched.<br />

Another measure of restra<strong>in</strong>t refers to the retrieval of cases<br />

<strong>in</strong> the database. It makes no sense to retrieve cases <strong>in</strong> which the<br />

number of IWA is less than a certa<strong>in</strong> percentage; for example,<br />

60%. The idea is to elim<strong>in</strong>ate <strong>in</strong>stances <strong>in</strong> which the number of<br />

IWA does not reach a certa<strong>in</strong> percentage, which underm<strong>in</strong>es<br />

the suspicion of fraud. Thus, we consider only the responses of<br />

the candidates whose percentage of IWA is equal to or exceeds<br />

60%.<br />

Look<strong>in</strong>g closely at Figure 4, we note that there are some<br />

<strong>in</strong>stances of IWA between the l<strong>in</strong>es, as can be seen <strong>in</strong> Table 2.<br />

the sequential search. Some CBR tools allow a case to be used<br />

as a query, facilitat<strong>in</strong>g the research process.<br />

In a real application, we need the candidate identification<br />

number <strong>in</strong> addition to the answers of the multiple choice<br />

questions. This will enable us to identify those candidates who<br />

have a certa<strong>in</strong> number of IWA.<br />

To automate the fraud detection <strong>in</strong> an IWA file of multiplechoice<br />

questions, it is necessary to make a data-m<strong>in</strong><strong>in</strong>g tool<br />

us<strong>in</strong>g an automatic scann<strong>in</strong>g procedure on the IWA case base.<br />

As some CBR systems do not allow the creation of scripts to<br />

program this automatic scann<strong>in</strong>g, one way to implement such a<br />

data-m<strong>in</strong><strong>in</strong>g procedure would be through the creation of a<br />

similarity function, "Sim ()" <strong>in</strong> a relational database software,<br />

enabl<strong>in</strong>g the creation of an SQL script to perform the automatic<br />

scann<strong>in</strong>g of all the IWA stored <strong>in</strong> the database, calculat<strong>in</strong>g their<br />

similarities, identify<strong>in</strong>g which answers show a high similarity<br />

and, f<strong>in</strong>ally, retriev<strong>in</strong>g them for <strong>in</strong>spection.<br />

The similarity metric that best fits the proposed model is<br />

the average of similarities implemented <strong>in</strong> the CBR tool.<br />

Several simulations were conducted to f<strong>in</strong>d the best option for<br />

the case studied, <strong>in</strong>clud<strong>in</strong>g the codification of the alternatives<br />

as numeric values (ABCDE as 12345). None of them has better<br />

results than the model adopted.<br />

Figure 4. Relevant similar cases <strong>in</strong> the case base.<br />

TABLE II. OCCURRENCES OF IWA IN THE FIGURE 4<br />

L<strong>in</strong>es with IWA Nº of IWA % of IWA<br />

2 and 8 5 100%<br />

6 and 10 4 29%<br />

7 and 9 1 10%<br />

Although we have identified three <strong>in</strong>stances (2-8, 6-10, 7-9)<br />

<strong>in</strong> which there is similarity among the cases of wrong answers,<br />

not all are relevant. If we consider the percentage of IWA <strong>in</strong><br />

each of the events, we see that only one of them is of <strong>in</strong>terest:<br />

l<strong>in</strong>e 2 with l<strong>in</strong>e 8 (Fig. 4).<br />

The other events are not depicted with evidence of fraud<br />

s<strong>in</strong>ce the percentage of IWA is less than a significant<br />

percentage. We can establish a m<strong>in</strong>imum threshold value for<br />

this percentage (e.g., 60%), whereas <strong>in</strong> some cases the<br />

candidate who received a given template (crib note) can<br />

recognize that one of the answers is wrong and <strong>in</strong>creased his or<br />

her number of correct answers or may even have tried to<br />

answer some questions on his or her own (chance), and it still<br />

is wrong, lead<strong>in</strong>g to wrong answers on the same issues but with<br />

different alternatives from the orig<strong>in</strong>al source of cheat or crib.<br />

The proposed method provides an algorithm to scan the<br />

responses meet<strong>in</strong>g the restriction criteria predeterm<strong>in</strong>ed. The<br />

process is sequential and beg<strong>in</strong>s with the first l<strong>in</strong>e, retriev<strong>in</strong>g all<br />

its similar cases, then mov<strong>in</strong>g on to the next l<strong>in</strong>e (2) and do<strong>in</strong>g<br />

new queries only from this start<strong>in</strong>g po<strong>in</strong>t, reduc<strong>in</strong>g <strong>in</strong> every<br />

turn the universe to be searched and optimiz<strong>in</strong>g the process of<br />

VII. RESULTS AND DISCUSSION<br />

The model simulated <strong>in</strong> the CBR tool has produced some<br />

excellent results <strong>in</strong> recover<strong>in</strong>g similar cases of IWA. For test<strong>in</strong>g<br />

the model, a query was created that had one more wrong<br />

answer than the similar cases <strong>in</strong> the case base, just <strong>in</strong> the <strong>in</strong>itial<br />

position of the str<strong>in</strong>g (Fig. 5).<br />

Figure 5. Graphical representation of the query.<br />

This <strong>in</strong>sertion of one more wrong answer <strong>in</strong> the <strong>in</strong>itial<br />

position would make impossible the retrieval of similar cases<br />

by means of a SQL query <strong>in</strong> a relational database, s<strong>in</strong>ce the<br />

blank character (or space) has a smaller value for its <strong>in</strong>ternal<br />

representation than the letter "A"; this fact <strong>in</strong>creases the<br />

distance between the cases registered <strong>in</strong> the database and the<br />

query made. However, the model created with the CBR system<br />

proved excellent performance by retriev<strong>in</strong>g all similar cases.<br />

Table 3 shows the similarity values calculated by the CBR tool<br />

<strong>in</strong> all the retrieved cases of the case base.<br />

TABLE III.<br />

VALUES OF THE SIMILARITIES OF THE<br />

RETRIEVED CASES<br />

Case 2 8 4 6 10 5 1 3 7 9<br />

Sim() 1,00 1,00 0,50 0,25 0,25 0,20 0,00 0,00 0,00 0,00<br />

The query on the case base done with the data shown <strong>in</strong><br />

Figure 5 gives us the results returned by the CBR system used<br />

<strong>in</strong> this study, as presented <strong>in</strong> Figure 6.


mak<strong>in</strong>g the analysis of the behavior of other CBR software or<br />

others KE tools for a greater volume of data.<br />

ACKNOWLEDGMENT<br />

We want to thank Dr. Marcelo Sobottka of the Department<br />

of Mathematics at the Federal University of Santa Catar<strong>in</strong>a<br />

(Brazil) for his valuable help <strong>in</strong> determ<strong>in</strong><strong>in</strong>g the probability of<br />

IWA <strong>in</strong> a test of multiple-choice questions.<br />

REFERENCES<br />

Figure 6. Cases retrieved bythe CBR tool.<br />

Even when the query was submitted with one more wrong<br />

answer purposely <strong>in</strong>troduced <strong>in</strong> the first column, it is clear that<br />

case numbers 2 and 8 were recovered with maximum<br />

similarity. The rema<strong>in</strong><strong>in</strong>g cases that did not show a significant<br />

number of IWA had low similarity, prov<strong>in</strong>g the efficiency and<br />

effectiveness of the proposed model.<br />

VIII. CONCLUSIONS AND FUTURE WORK<br />

The methodology of CBR showed excellent performance <strong>in</strong><br />

the detection of IWA, identify<strong>in</strong>g all the cases <strong>in</strong> the case base<br />

used as a test. While there are some differences between the<br />

values of the query with the values found <strong>in</strong> the case base, the<br />

CBR software used <strong>in</strong> this study correctly retrieved all cases<br />

that showed the highest similarity.<br />

We conclude that the method proposed <strong>in</strong> this study can be<br />

applied <strong>in</strong> detect<strong>in</strong>g and prevent<strong>in</strong>g fraud <strong>in</strong> selection<br />

exam<strong>in</strong>ations that use multiple-choice questions, allow<strong>in</strong>g the<br />

adoption of countermeasures <strong>in</strong> a timely manner, thanks to its<br />

ease of model<strong>in</strong>g and application.<br />

One limitation of this study is the lack of a more detailed<br />

analysis of the behavior of the CBR software with respect to its<br />

performance with a case base with thousands of answers from a<br />

large group of applicants. However, based on other authors’<br />

similar experiences with us<strong>in</strong>g bases of more than 200,000<br />

cases [12], the CBR systems have a satisfactory performance.<br />

Thus, it is expected that the CBR solution here proposed<br />

presents a good performance even <strong>in</strong> the case of larger bases.<br />

Future research could explore the application of the model<br />

proposed <strong>in</strong> this study for other k<strong>in</strong>ds of questions and tests,<br />

adapt<strong>in</strong>g the methodology of CBR to the specificity of these<br />

applications. Others <strong>in</strong>vestigators can also expand this study,<br />

[1] N. Laleh and M. A. Azgomi, "A taxonomy of frauds and fraud detection<br />

techniques," <strong>in</strong> Communications <strong>in</strong> Computer and Information Science,<br />

vol. 31, pt. 9, S. K. Prasad, S. Routray, R. Khurana, S. Sahni, Eds.<br />

ICISTM 2009, Berl<strong>in</strong>: Spr<strong>in</strong>ger-Verlag, 2009, pp. 256–267.<br />

[2] R. J. Bolton and D. J. Hand, "Statistical fraud detection: A review,"<br />

Statistical Science, vol. 17, no. 3, pp. 235–255.<br />

[3] A. Brabazon, J. Cahill, P. Keenan and D. Walsh, “Identify<strong>in</strong>g onl<strong>in</strong>e<br />

credit card fraud us<strong>in</strong>g artificial immune systems,” Proceed<strong>in</strong>gs of the<br />

IEEE Congress on Evolutionary Computation (CEC 2010).<br />

[4] M. Lek , B. Anandarajah, N. Cerpa and R. Jamieson, “Data m<strong>in</strong><strong>in</strong>g<br />

prototype for detect<strong>in</strong>g e-commerce fraud,” 9 th European Conference on<br />

Information Systems, Bled, Slovenia 2001, pp. 160–165.<br />

[5] B. Kusaksizoglu, “<strong>Fraud</strong> detection <strong>in</strong> mobile communications networks<br />

us<strong>in</strong>g data m<strong>in</strong><strong>in</strong>g,” Ph.D. thesis, Department of Computer Eng<strong>in</strong>eer<strong>in</strong>g,<br />

Bahcesehir University Istanbul, 2006.<br />

[6] C. Schommer. (2009). Discover<strong>in</strong>g fraud behavior <strong>in</strong> call detailed<br />

records [Onl<strong>in</strong>e]. Available: http://wiki.uni.lu/m<strong>in</strong>e/docs/<br />

Schommer.DataM<strong>in</strong><strong>in</strong>gAnd<strong>Fraud</strong>.pdf.<br />

[7] J. Li, K. Y. Huang, J. J<strong>in</strong> and J. Shi, “A survey on statistical methods for<br />

health care fraud detection,” Health Care Management Science, vol. 11,<br />

No. 3, 2008, pp. 275–287.<br />

[8] C. C. Albrecht. (2008). <strong>Fraud</strong> and forensic account<strong>in</strong>g <strong>in</strong> a digital<br />

environment [Onl<strong>in</strong>e]. Available: http://www.theifp.org/research-grants/<br />

IFP-Whitepaper-4.pdf.<br />

[9] D. McSherry, "A case-based reason<strong>in</strong>g approach to automat<strong>in</strong>g the<br />

construction of multiple choice questions," Lecture Notes on Computer<br />

Science, vol. 6176, I. Bich<strong>in</strong>daritz and S. Montana, Eds. ICCBR 2010,<br />

Spr<strong>in</strong>ger-Verlag, 2010, pp. 406–420.<br />

[10] M. M. Richter, “Case-Based Reason<strong>in</strong>g Technology, From Foundations<br />

to Applications”, Lecture Notes <strong>in</strong> Computer Science, Spr<strong>in</strong>ger-Verlag,<br />

Berl<strong>in</strong>, 1998, vol. 1400, pp. 1-15.<br />

[11] W. Wilke, M. Lenz, and S. Wess, "Intelligent sales support with CBR,"<br />

<strong>in</strong> Case-Based Reason<strong>in</strong>g Technology from Foundations to Applications,<br />

M. Lenz, B. Bartsch-Spörl, H. D. Burkhard, S. Wess, Eds. Berl<strong>in</strong>:<br />

Spr<strong>in</strong>ger-Verlag, 1998, pp. 91-113.<br />

[12] M. Lenz and H. D. Burkhard, "Case retrieval nets: Basic ideas and<br />

extensions," <strong>in</strong> Proceed<strong>in</strong>gs of the 20 th Annual German Conference on<br />

Artificial Intelligence: Advances <strong>in</strong> Artificial Intelligence, London,<br />

Spr<strong>in</strong>ger-Verlag press, 1996, pp. 227-239.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!