Extraction and Integration of MovieLens and IMDb Data - APMD
Verónika Peralta
− Movie type calculation task (cleaning for movies.txt) derives the movie type by looking for
special characters in the movie title (see Table 9). The task is implemented as a
sequence of SQL operations (one for each movie type) that update the Type attribute according to Table
9, for example:
UPDATE [movies-cleaning]
SET Type = 'M'
WHERE Mid(MovieTitle, 1, 1) = '"'
AND Mid(MovieTitle, Len(MovieTitle)-5, 6) = '(mini)';
Movie type      Title is quoted  Special ending word  Generated attribute
Cinema film     No               none                 C
Series episode  Yes              none                 S
Mini-series     Yes              (mini)               M
TV series       No               (TV)                 TV
Video           No               (V)                  V
Video game      No               (VG)                 VG
Table 9 – Movie type identification
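The per-type UPDATE statements can be condensed into a single classification function. The following Python sketch is only an illustration, not the implementation used in the project: the function name `movie_type` and the sample titles are hypothetical, and it assumes, as in the UPDATE example above, that a quoted title ending in (mini) maps to M while any other quoted title maps to S.

```python
def movie_type(title: str) -> str:
    """Return the type code for a movie title, following the rules of Table 9.

    Assumption: a quoted title ending in "(mini)" is a mini-series (M),
    as in the UPDATE example; other quoted titles are series episodes (S).
    """
    title = title.strip()
    if title.startswith('"'):
        # Quoted titles: the "(mini)" ending separates M from S.
        return "M" if title.endswith("(mini)") else "S"
    # Unquoted titles: the special ending word decides the type.
    for ending, code in (("(TV)", "TV"), ("(VG)", "VG"), ("(V)", "V")):
        if title.endswith(ending):
            return code
    # No quotes and no special ending: cinema film.
    return "C"
```

For instance, `movie_type('"Sample Saga" (1990) (mini)')` yields `"M"`, whereas `movie_type('Sample Film (2000)')` yields `"C"`.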
− Null values elimination tasks (initial cleaning for biographies.txt and business.txt) eliminate tuples having
null values for all attributes (except keys). Tasks are implemented as SQL operations like:
INSERT INTO [business-cleaning] (MovieTitle, Budget, Revenue)
SELECT MovieTitle, Budget, Revenue
FROM [business-source]
WHERE Budget IS NOT NULL OR Revenue IS NOT NULL;
The implementation for biographies.txt is analogous.
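The filter condition generalizes to any number of non-key attributes: a tuple is kept as soon as one of them is not null. A minimal Python sketch of that rule follows; the function name `drop_all_null`, the dictionary representation of tuples and the sample values are assumptions made for illustration only.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def drop_all_null(rows: List[Row], key: str = "MovieTitle") -> List[Row]:
    """Keep only the rows having at least one non-null non-key attribute."""
    return [
        row
        for row in rows
        # Same test as "Budget IS NOT NULL OR Revenue IS NOT NULL",
        # applied to every attribute other than the key.
        if any(value is not None for name, value in row.items() if name != key)
    ]


# Hypothetical sample tuples, not taken from business.txt.
sample = [
    {"MovieTitle": "Sample Film (2000)", "Budget": "1000", "Revenue": None},
    {"MovieTitle": "Empty Film (2001)", "Budget": None, "Revenue": None},
]
cleaned = drop_all_null(sample)  # only the first tuple is kept
```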
− Splitting tasks (cleaning for several files) populate several attributes from a single attribute that
concatenates several concepts. For example, the ProductionCompany column of the productioncompanies.txt file embeds
company names and country codes. The tasks search for special characters (e.g. brackets, square brackets,
braces) and split attribute values at those characters. Tasks are implemented as a sequence of
SQL operations, for example:
INSERT INTO [productioncompanies-cleaning]
(MovieTitle, ProductionCompany, CompanyInfo, Aux1)
SELECT MovieTitle, ProductionCompany, Comments,
InStr(1, ProductionCompany, '[')
FROM [productioncompanies-source];
UPDATE [productioncompanies-cleaning]
SET CompanyName = Mid(ProductionCompany, 1, Aux1-2),
CountryCode = Mid(ProductionCompany, Aux1+1, Len(ProductionCompany)-Aux1-1)
WHERE Aux1 > 0;
UPDATE [productioncompanies-cleaning]
SET CompanyName = ProductionCompany
WHERE Aux1 = 0;
The first operation calculates the position of the special character and stores it in the Aux1 attribute. The
second operation performs the split. The third operation is necessary when some values are optional (in
the example, when the country code is omitted). It sets the first attribute and leaves the second as Null.
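The index arithmetic of the three operations can be checked on a small example. The Python sketch below mirrors the InStr/Mid logic for a value of the form "name [code]"; the function name `split_company` and the sample company are hypothetical, and the sketch assumes that a single space separates the name from the opening bracket and that the closing bracket is the last character.

```python
from typing import Optional, Tuple


def split_company(value: str) -> Tuple[str, Optional[str]]:
    """Split 'Company Name [cc]' into (company name, country code)."""
    # 1-based position of '[' (0 when absent), like InStr(1, value, '[').
    aux1 = value.find("[") + 1
    if aux1 == 0:
        # Third operation: no country code, keep the whole value as the name.
        return value, None
    # Mid(value, 1, Aux1-2): everything before the space preceding '['.
    name = value[: max(aux1 - 2, 0)]
    # Mid(value, Aux1+1, Len(value)-Aux1-1): text between '[' and ']'.
    code = value[aux1 : len(value) - 1]
    return name, code
```

With these assumptions, `split_company("Acme Films [us]")` returns `("Acme Films", "us")`, and a value without brackets is returned unchanged with `None` as its country code.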
The implementation of splitting tasks for business.txt, distributors.txt, releasedates.txt and
runningtimes.txt is similar. Table 10 lists attributes to split and special characters for those files. Columns
of biographies.txt were not split because of their format diversity.
As a special case, the splitting of movielinks.txt is performed by table look-up, i.e. the first attribute
(link type) is looked up in a table; the remainder of the text corresponds to the second attribute. The
look-up table contains the fifteen link types, namely: “alternate language version of”, “edited from”,
“edited into”, “featured in”, “features”, “followed by”, “follows”, “referenced in”, “references”, “remade
as”, “remake of”, “spin off”, “spoofed in”, “spoofs”, “version of”. The task is implemented as follows: