08.04.2013 Views

Extraction and Integration of MovieLens and IMDb Data - APMD

Extraction and Integration of MovieLens and IMDb Data - APMD

Extraction and Integration of MovieLens and IMDb Data - APMD

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

<strong>Extraction</strong> <strong>and</strong> <strong>Integration</strong> <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> <strong>Data</strong> – Technical Report<br />

ETL workflows consisted in the same tasks presented for the <strong>MovieLens</strong> data set but providing different<br />

implementations for specific cleaning <strong>and</strong> reconciliation tasks. The general workflow is illustrated in Figure 21<br />

(P.txt represents either one <strong>of</strong> the text files obtained after pre-processing source files or ratings.list, which does<br />

not need pre-processing). However, for certain files some tasks were not necessary, specifically, if there were no<br />

transformations to do, or no duplications nor orphans were detected. Table 8 shows the tasks that are omitted for<br />

each file; bulk copy <strong>and</strong> duplicate check tasks are executed for all files.<br />

In addition, as indexes by movie titles consumed much space <strong>and</strong> were less efficient, we auto generated new<br />

integer identifiers for movies. We included a new column (MovieNum), <strong>of</strong> auto-numeric type to the moviesgroup<br />

table (which stores the results <strong>of</strong> the grouping task). Each tuple inserted by the grouping task is set with a<br />

new sequential identifier generated automatically. We also maintained movie titles in tables referencing movies,<br />

letting users the possibility <strong>of</strong> choosing the identifier that is best adapted to their applications.<br />

22<br />

P.txt<br />

<strong>IMDb</strong>_Movies<br />

bulk copy<br />

cleaning<br />

duplicate<br />

check<br />

Figure 21 – ETL workflow for <strong>IMDb</strong> files<br />

duplicate<br />

reconciliation grouping<br />

cleaning<br />

referential<br />

check<br />

orphan<br />

elimination<br />

aggregation<br />

<strong>IMDb</strong>_P<br />

<strong>IMDb</strong>_P’<br />

Source file Cleaning<br />

Reconc., dupl. cleaning<br />

<strong>and</strong> grouping<br />

Ref. check<br />

Orphan<br />

elimination Aggregation<br />

movies.txt X X X<br />

actors.txt X X<br />

actresses.txt X X<br />

biographies.txt X X X<br />

business.txt X<br />

color-info.txt X X<br />

costume-designers.txt X X<br />

countries.txt X X<br />

directors.txt X X<br />

distributors.txt X<br />

genres.txt X X<br />

keywords.txt X<br />

language.txt X X<br />

locations.txt X<br />

movie-links.txt X X<br />

producers.txt X X<br />

production-companies.txt X<br />

production-designers.txt X X<br />

ratings.list X X X<br />

release-dates.txt X<br />

running-times.txt X<br />

sound-mix.txt X X<br />

writers.txt X X<br />

Table 8 – ETL tasks that where not necessary to execute<br />

In the following we describe such specific cleaning <strong>and</strong> reconciliation tasks, as well as special remarks:

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!