08.04.2013 Views

Extraction and Integration of MovieLens and IMDb Data - APMD


Verónika Peralta

− Movie type calculation task (cleaning for movies.txt) calculates movie types as derived attributes by looking for some special characters in the movie title (see Table 9). The task is implemented as a sequence of SQL operations (one for each movie type) that update the Type attribute according to Table 9, for example:

    UPDATE [movies-cleaning]
    SET Type = 'M'
    WHERE Mid(MovieTitle, 1, 1) = '"'
    AND Mid(MovieTitle, Len(MovieTitle)-5, 6) = '(mini)';

    Movie type       Title is quoted   Special ending word   Generated attribute
    Cinema film      No                none                  C
    Series episode   Yes               none                  S
    Mini-series      Yes               (mini)                M
    TV series        No                (TV)                  TV
    Video            No                (V)                   V
    Video game       No                (VG)                  VG

Table 9 – Movie type identification
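The rules of Table 9 can also be read as a small look-up keyed on two observations about the title. The following is a minimal Python sketch of that reading, not the SQL used in the report. The function name `movie_type`, the `RULES` dictionary and the sample titles are hypothetical, and the sketch assumes, as in the SQL example, that a quoted title ending in (mini) maps to M. Combinations that Table 9 does not list return None.

```python
from typing import Optional

# Special ending words listed in Table 9.
ENDINGS = ("(mini)", "(TV)", "(VG)", "(V)")

# (title is quoted, special ending word) -> generated Type code.
# Assumption: quoted + "(mini)" gives M, as in the SQL example.
RULES = {
    (False, None): "C",      # cinema film
    (True, None): "S",       # series episode
    (True, "(mini)"): "M",   # mini-series
    (False, "(TV)"): "TV",
    (False, "(V)"): "V",
    (False, "(VG)"): "VG",
}


def movie_type(title: str) -> Optional[str]:
    """Return the Type code for a title, or None if no rule matches."""
    title = title.strip()
    quoted = title.startswith('"')
    # At most one ending can match, since none is a suffix of another.
    ending = next((e for e in ENDINGS if title.endswith(e)), None)
    return RULES.get((quoted, ending))


if __name__ == "__main__":
    for sample in ('"Some Show" (2001) (mini)', "Some Film (1999)"):
        print(sample, "->", movie_type(sample))
```

Keeping the rules in one dictionary makes unlisted combinations visible as None rather than silently assigning a type.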

− Null values elimination tasks (initial cleaning for biographies.txt and business.txt) eliminate tuples having null values for all attributes (except keys). Tasks are implemented as SQL operations like:

    INSERT INTO [business-cleaning] (MovieTitle, Budget, Revenue)
    SELECT MovieTitle, Budget, Revenue
    FROM [business-source]
    WHERE Budget IS NOT NULL OR Revenue IS NOT NULL;

The implementation for biographies.txt is analogous.
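The same filter can be stated independently of the column names: a tuple is kept when at least one non-key attribute is not null. The Python sketch below illustrates that condition only. The function name `drop_all_null_rows`, the choice of MovieTitle as the only key and the sample rows are assumptions made for the example.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[str]]


def drop_all_null_rows(rows: List[Row], keys: Sequence[str] = ("MovieTitle",)) -> List[Row]:
    """Keep rows that have at least one non-null attribute besides the keys."""
    kept = []
    for row in rows:
        non_key_values = [v for k, v in row.items() if k not in keys]
        if any(v is not None for v in non_key_values):
            kept.append(row)
    return kept


if __name__ == "__main__":
    # Hypothetical rows, for illustration only.
    source = [
        {"MovieTitle": "Film A", "Budget": "1000", "Revenue": None},
        {"MovieTitle": "Film B", "Budget": None, "Revenue": None},
    ]
    print(drop_all_null_rows(source))  # only the "Film A" row is kept
```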

− Splitting tasks (cleaning for several files) populate several attributes from a single attribute that concatenates several concepts. For example, the ProductionCompany column of the productioncompanies.txt file embeds company names and country codes. The tasks search for special characters (e.g. brackets, square brackets, braces) and split attribute values according to such characters. Tasks are implemented as a sequence of SQL operations, for example:

    INSERT INTO [productioncompanies-cleaning]
        (MovieTitle, ProductionCompany, CompanyInfo, Aux1)
    SELECT MovieTitle, ProductionCompany, Comments,
        InStr(1, ProductionCompany, '[')
    FROM [productioncompanies-source];

    UPDATE [productioncompanies-cleaning]
    SET CompanyName = Mid(ProductionCompany, 1, Aux1-2),
        CountryCode = Mid(ProductionCompany, Aux1+1, Len(ProductionCompany)-Aux1-1)
    WHERE Aux1 > 0;

    UPDATE [productioncompanies-cleaning]
    SET CompanyName = ProductionCompany
    WHERE Aux1 = 0;

The first operation calculates the position of the special character and stores it in the Aux1 attribute. The second operation performs the split. The third operation is necessary when some values are optional (in the example, when the country code is omitted). It sets the first attribute and leaves the second as Null.
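To check the index arithmetic of the split, the three operations can be mirrored in a few lines of Python. This is a sketch under assumptions: the function name `split_company` and the sample value are hypothetical, and the value is assumed to have the form "name [code]", with one space before the opening bracket and the closing bracket as the last character.

```python
from typing import Optional, Tuple


def split_company(value: str, marker: str = "[") -> Tuple[str, Optional[str]]:
    """Split 'name [code]' into (name, code); code is None when absent."""
    # Step 1: 1-based position of the marker, 0 if absent (like InStr).
    aux1 = value.find(marker) + 1
    # Step 3: no marker, so the whole value is the name.
    if aux1 == 0:
        return value, None
    # Step 2: name is Mid(value, 1, aux1-2), i.e. up to the space before '['.
    name = value[:max(aux1 - 2, 0)]
    # Code is Mid(value, aux1+1, Len(value)-aux1-1), i.e. between '[' and ']'.
    code = value[aux1:len(value) - 1]
    return name, code


if __name__ == "__main__":
    print(split_company("Acme Films [us]"))  # hypothetical sample value
    print(split_company("Acme Films"))
```

For the sample "Acme Films [us]", the bracket is at position 12, so the name takes the first 10 characters and the code the 2 characters between the brackets.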

The implementation of splitting tasks for business.txt, distributors.txt, releasedates.txt and runningtimes.txt is similar. Table 10 lists attributes to split and special characters for those files. Columns of biographies.txt were not split because of their format diversity.

As a special case, the splitting of movielinks.txt is performed by table look-up, i.e. the first attribute (link type) is looked up in a table; the remainder of the text corresponds to the second attribute. The look-up table contains the fifteen link types, namely: “alternate language version of”, “edited from”, “edited into”, “featured in”, “features”, “followed by”, “follows”, “referenced in”, “references”, “remade as”, “remake of”, “spin off”, “spoofed in”, “spoofs”, “version of”. The task is implemented as follows:
