01.04.2015 Views

Preselection and processing of attributes

We should keep only those attributes that we think could help the classifier distinguish good answers from not-so-good ones. We certainly need the identification-related attributes to assign the answers to the correct questions.

Consider the following attributes:


• The PostType attribute, for example, is only necessary to distinguish between questions and answers. Because we can also tell them apart later by checking the ParentId attribute, we keep ParentId for questions too and set it to 1.

• The CreationDate attribute could be interesting to determine the time span between posting the question and posting the individual answers, so we keep it.

• The Score attribute is, of course, important as an indicator of the community's evaluation.

• The ViewCount attribute, in contrast, is most likely of no use for our task. Even if it could help the classifier distinguish between good and bad answers, we will not have this information at the time an answer is submitted. We will ignore it.

• The Body attribute obviously contains the most important information. As it is encoded in HTML, we will have to decode it to plain text.

• The OwnerUserId attribute is useful only if we take user-dependent features into account, which we won't. Although we drop it here, we encourage you to use it (maybe in connection with users.xml) to build a better classifier.

• The Title attribute is also ignored here, although it could add some more information about the question.

• The CommentCount attribute is also ignored. Similar to ViewCount, it could help the classifier with posts that were posted a while ago (more comments suggest a more ambiguous post). It will, however, not help the classifier at the time an answer is posted.

• The AcceptedAnswerId attribute is similar to the Score attribute in that it is an indicator of a post's quality. As we will access this per answer, instead of keeping this attribute, we will create a new attribute, IsAccepted, which will be 0 or 1 for answers and ignored for questions (ParentId = 1).
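
The selection described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the book's own script: the row layout and the attribute names PostTypeId and Id are assumed from the usual shape of a posts.xml dump (the text calls the first one PostType), the sample rows are invented, questions are assumed to lack a ParentId in the raw data, and the helper names strip_html and preselect are hypothetical. The HTML decoding is deliberately crude (tags are dropped with a regular expression, then entities are decoded).

```python
import html
import re
import xml.etree.ElementTree as ET

# Value the text assigns to ParentId for questions.
QUESTION_PARENT_ID = 1

TAG_RE = re.compile(r"<[^>]+>")


def strip_html(body):
    """Crude HTML-to-text conversion: drop tags first, then decode entities."""
    return html.unescape(TAG_RE.sub("", body)).strip()


def preselect(rows):
    """Keep only the attributes discussed above.

    `rows` is a list of attribute dicts, one per post (hypothetical layout).
    """
    # AcceptedAnswerId is stored on the question, so collect those IDs first
    # and turn them into a per-answer IsAccepted flag below.
    accepted = {r["AcceptedAnswerId"] for r in rows if "AcceptedAnswerId" in r}

    posts = []
    for r in rows:
        # Assumption: questions carry no ParentId in the raw data.
        is_question = "ParentId" not in r
        post = {
            "Id": int(r["Id"]),
            "ParentId": QUESTION_PARENT_ID if is_question else int(r["ParentId"]),
            "CreationDate": r["CreationDate"],
            "Score": int(r["Score"]),
            "Text": strip_html(r["Body"]),
        }
        if not is_question:
            post["IsAccepted"] = int(r["Id"] in accepted)
        # ViewCount, OwnerUserId, Title and CommentCount are simply not copied.
        posts.append(post)
    return posts


# Invented sample data: one question with two answers.
SAMPLE = """<posts>
  <row Id="10" PostTypeId="1" AcceptedAnswerId="12"
       CreationDate="2011-01-01T10:00:00.000" Score="5" ViewCount="100"
       Body="&lt;p&gt;How do I sort a list?&lt;/p&gt;"
       OwnerUserId="7" Title="Sorting" CommentCount="0" />
  <row Id="12" PostTypeId="2" ParentId="10"
       CreationDate="2011-01-01T10:05:00.000" Score="3" ViewCount="0"
       Body="&lt;p&gt;Use &lt;code&gt;sorted(x)&lt;/code&gt; &amp;amp; enjoy.&lt;/p&gt;"
       OwnerUserId="8" CommentCount="1" />
  <row Id="13" PostTypeId="2" ParentId="10"
       CreationDate="2011-01-01T11:00:00.000" Score="0" ViewCount="0"
       Body="&lt;p&gt;No idea.&lt;/p&gt;"
       OwnerUserId="9" CommentCount="0" />
</posts>"""

rows = [dict(el.attrib) for el in ET.fromstring(SAMPLE).iter("row")]
posts = preselect(rows)
for post in posts:
    print(post)
```

Collecting the accepted IDs in a first pass keeps the second pass simple, at the cost of holding all rows in memory; for a full dump, a streaming parser would be the more practical choice.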
