12.07.2015 Views

The Reuters Corpus Volume 1 - from Yesterday's News to ...

The Reuters Corpus Volume 1 - from Yesterday's News to ...

The Reuters Corpus Volume 1 - from Yesterday's News to ...

SHOW MORE
SHOW LESS
  • No tags were found...

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

egion code and at least one <strong>to</strong>pic code. <strong>The</strong>refore, thenext stage after the application of TIS was <strong>to</strong> check eachs<strong>to</strong>ry <strong>to</strong> see whether it satisfied this requirement. If so, thes<strong>to</strong>ry was sent directly <strong>to</strong> a holding queue (see Section4.2.3). If not, the s<strong>to</strong>ry was sent <strong>to</strong> a human edi<strong>to</strong>r. Thisedi<strong>to</strong>r would then assign <strong>to</strong> the s<strong>to</strong>ry all codes they feltwere appropriate, ensuring that the s<strong>to</strong>ry received at leas<strong>to</strong>ne <strong>to</strong>pic code and one region code. <strong>The</strong>y were also free<strong>to</strong> delete or modify some of the au<strong>to</strong>matically assignedcodes. Once this manual editing was complete, the s<strong>to</strong>rywas sent <strong>to</strong> the holding queue for final review.4.2.3. Manual Correction in the Holding QueueEvery 6 hours the contents of holding queue werereviewed by a further edi<strong>to</strong>r, whose responsibility was <strong>to</strong>correct any outstanding coding errors. Finally, once thiswas complete and the s<strong>to</strong>ries had passed through theholding queue, they were batched up and loaded on<strong>to</strong> theRBB database in blocks.4.3. Coding StatisticsSince all s<strong>to</strong>ries passed through the holding queue, itcan be argued that every s<strong>to</strong>ry in the collection wasmanually coded, in the sense of having the au<strong>to</strong>matedcoding checked by at least one edi<strong>to</strong>r. Moreover, s<strong>to</strong>riesthat violated any coding principle (e.g. those lacking atleast one <strong>to</strong>pic code and region code) were reviewed by atleast two edi<strong>to</strong>rs, the first of whom always added orchanged at least one of the original TIS-assigned codes.This process of manual review represents a significantinvestment in the maintenance of data quality standards.In addition, further quality control procedures wereapplied, whereby each month a senior edi<strong>to</strong>r would take asample of s<strong>to</strong>ries and assesses them for quality of coding,as well as language, punctuation, spelling etc. <strong>The</strong>outcome of this process was fed back in<strong>to</strong> the system, andthe edi<strong>to</strong>rs notified of any errors.Table 1 summarizes, for the year 1997, how manys<strong>to</strong>ries were manually edited and how many weremanually corrected in the holding queue 6 . <strong>The</strong> middle lineshows that a <strong>to</strong>tal of 505,720 s<strong>to</strong>ries went straight <strong>from</strong>TIS <strong>to</strong> the holding queue (bypassing the manual editingstage), and 334,975 of these (66.2%) were subsequentlymanually corrected. By contrast, the lower line shows thata <strong>to</strong>tal of 312,140 s<strong>to</strong>ries went <strong>from</strong> TIS via manualediting <strong>to</strong> the holding queue, but only 23,289 (13.4%) ofthese were subsequently corrected. It is possible that someof this difference could be attributed <strong>to</strong> the fact that theedi<strong>to</strong>rs making manual corrections could see which s<strong>to</strong>rieshad been au<strong>to</strong>-coded and which had been manually edited,but it nonetheless provides a significant degree ofconfidence in the degree of consistency between humanedi<strong>to</strong>rs.Uncorrected CorrectedUnedited 170,745 334,975Edited 288,851 23,289Table 1: Numbers of s<strong>to</strong>ries edited and/or corrected6 Note that RCV1 contains s<strong>to</strong>ries spanning parts of 1996and 1997, so the number of s<strong>to</strong>ries in the corpus is not thesame as the number of s<strong>to</strong>ries in Table 15. Measuring inter-coder consistencyA fundamental feature of many real-worldcategorization schemes is that the definition of codes canbe inherently quite imprecise, and as such open <strong>to</strong>interpretation by the various individuals that apply them.Various studies have shown that there can be considerablevariation in inter-indexer agreement for different data sets(Bruce and Weibe, 1998; Brants, 2000; Veronis, 1998). In<strong>Reuters</strong> case, each edi<strong>to</strong>r may have a slightly differentunderstanding of the concepts <strong>to</strong> which each code refers,and this can lead <strong>to</strong> inconsistencies in their application.Clearly, it would be of great benefit if some quantitativemeasure of inter-coder consistency could be applied <strong>to</strong> theRCV1 data. Evidently, the ideal approach would be <strong>to</strong>compare each s<strong>to</strong>ry against some benchmark standard,such as that discussed in Section 2. However, even in theabsence of such a resource, it was still possible <strong>to</strong> measuretwo aspects of coding consistency, using metadata presentin the original RBB source files (i.e. a superset of the datathat eventually became RCV1).When a human edi<strong>to</strong>r opens a s<strong>to</strong>ry, the action isrecorded by adding a flag <strong>to</strong> the ‘COMPRO’ field of thefile. <strong>The</strong> first letter after the colon is used <strong>to</strong> indicate acorrection (C) or an edit (E). Thus a COMPRO fieldcontaining the data ‘ED:ETA’ indicates a s<strong>to</strong>ry that hadfirst been through TIS (by default) and was then edited byan edi<strong>to</strong>r identified by the letters TA. Likewise, an articlewith the COMPRO field ‘ED:ETA ED:CBY’ indicates as<strong>to</strong>ry initially coded by TIS, then edited by edi<strong>to</strong>r TA andsubsequently corrected in the holding queue by edi<strong>to</strong>r BY.It is assumed that the last edi<strong>to</strong>r shown (i.e. the last <strong>to</strong>make any changes) is responsible for the final coding ofany given s<strong>to</strong>ry.One approximate measure of coding consistency is <strong>to</strong>calculate how frequently an individual edi<strong>to</strong>r’s coding iscorrected. Note that in this context any change <strong>to</strong> thecoding of a s<strong>to</strong>ry counts as a correction (rather thancounting corrections on the basis of individual codes).Since edi<strong>to</strong>rs were equally likely <strong>to</strong> be first or second <strong>to</strong>see a given s<strong>to</strong>ry, the correction rate for a given edi<strong>to</strong>r canbe calculated thus:NE = Number of s<strong>to</strong>ries coded by a given edi<strong>to</strong>rNF = Number of s<strong>to</strong>ries for which a given edi<strong>to</strong>rapplies the final codingNC = Number of times an edi<strong>to</strong>r is corrected by asecond edi<strong>to</strong>r, i.e. NE – NFCorrection Rate C = (NC/NE) * 100<strong>The</strong> results are shown in Table 2, sorted by C. Notethat this data refers <strong>to</strong> the original RBB source files, i.e. asuperset of the data that eventually became RCV1. Edi<strong>to</strong>rE101 is TIS, which generally gets corrected around 77%of the time. Since TIS was never solely responsible for anarticle (i.e. every s<strong>to</strong>ry was subsequently reviewed by atleast one human edi<strong>to</strong>r), this was not considered undulyproblematic. It should also be noted that edi<strong>to</strong>r E3 and wasnot an active BIP coder, and that edi<strong>to</strong>rs E71, E73 andE91 were still undergoing training at the time and hencewere expected <strong>to</strong> have higher correction rates.Based on this data, the average correction rate formanual edi<strong>to</strong>rs is 5.16%, i.e. slightly more than 1 in 20s<strong>to</strong>ries. However, this figure is likely <strong>to</strong> be an upper bound

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!