label, or with the special label NIL if that list of corpus positions is not annotated to be a constituent. Second, eliminate the non-varying ones.

However, such a generate-and-test methodology, which for every corpus position stores the list of words of length i for all i up to the length of the longest constituent in the corpus, is clearly not feasible for dealing with larger corpora. Instead, we can make use of the observation from the last section that a variation necessarily involves at least one constituent occurrence of a nucleus. We thus arrive at the following algorithm to calculate the set of nuclei for a window of length i (1 ≤ i ≤ length-of-longest-constituent-in-corpus):

1. Compute the set of nuclei:
   a) Find all constituents of length i and store them with their category label.
   b) For each distinct type of string stored as a constituent of length i, add the label NIL for each non-constituent occurrence of that string.

2. Compute the set of variation nuclei by determining which of the nuclei were stored in step 1 with more than one label.

In addition to calculating the variation nuclei in this way, we generate the variation n-grams for these variation nuclei as defined in Dickinson and Meurers (2003), i.e., we search for instances of the variation nuclei which occur within a similar context. The motivation for the step from variation nuclei to variation n-grams is that a variation nucleus occurring within a similar context is more likely to be an error. Regarding what constitutes a similar context, for the purpose of this paper we simply define it as a context of identical words surrounding the variation nucleus, just as in Dickinson and Meurers (2003). To increase the recall of the error detection method, one could experiment with a less strict notion of similarity, such as requiring the context surrounding the nucleus to consist of identical or related POS tags instead of identical words.
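To make the two steps concrete, here is a minimal Python sketch of the nucleus computation for a single window length i. The corpus representation (a list of sentences, each a pair of a token list and a mapping from constituent spans to category labels) and all names are illustrative assumptions made for this sketch, not the authors' actual implementation.

```python
from collections import defaultdict

NIL = "NIL"  # special label for occurrences not annotated as a constituent

def variation_nuclei(sentences, i):
    """Return all strings of length i that occur with more than one label.

    sentences: list of (tokens, spans) pairs, where tokens is a list of
    words and spans maps (start, end) token offsets to a category label.
    (This corpus format is an assumption for illustration.)
    """
    labels = defaultdict(set)   # word tuple -> set of observed labels
    constituent_starts = set()  # (sentence index, start) of length-i constituents

    # Step 1a: find all constituents of length i, store them with their label.
    for s_idx, (tokens, spans) in enumerate(sentences):
        for (start, end), label in spans.items():
            if end - start == i:
                labels[tuple(tokens[start:end])].add(label)
                constituent_starts.add((s_idx, start))

    # Step 1b: for each string type stored as a constituent of length i,
    # add NIL for every occurrence of that string that is not a constituent.
    for s_idx, (tokens, _) in enumerate(sentences):
        for start in range(len(tokens) - i + 1):
            key = tuple(tokens[start:start + i])
            if key in labels and (s_idx, start) not in constituent_starts:
                labels[key].add(NIL)

    # Step 2: keep only the nuclei stored with more than one label.
    return {key: labs for key, labs in labels.items() if len(labs) > 1}
```

Under this representation, a string bracketed as an NP in one sentence and left unbracketed in another would come back with the label set {NP, NIL}. Expanding such a nucleus into variation n-grams then amounts to checking whether the words immediately surrounding its occurrences are identical, as described above.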
4 A case study: Applying the method to the WSJ

To test the variation n-gram method as applied to syntactic annotation, we did a case study with the already mentioned WSJ corpus, which is part of the Penn Treebank 3 (Marcus et al., 1999). Before presenting the results of the case study, there are a couple of points to note about the nature of the corpus and the format we used it in.

Syntactic categories and syntactic functions
The syntactic annotation in the WSJ includes syntactic category and syntactic function information. The annotation of a constituent consists of both pieces of information joined by a hyphen, e.g.,
