
Extraction and Integration of MovieLens and IMDb Data

Verónika Peralta

Laboratoire PRiSM, Université de Versailles

45, avenue des Etats-Unis

78035 Versailles Cedex

FRANCE

veronika.peralta@prism.uvsq.fr

Technical Report 1

July 2007

Abstract. This report describes the procedure followed for extracting and integrating MovieLens and IMDb data. We first focus on data extraction, describing source schemas, target schemas and the algorithms that perform data extraction, data cleaning and data transformation. We then turn to data integration, describing the integrated schema, the algorithm that matches movie titles and the construction of the integrated database. Finally, we present some statistics on the extracted and integrated data.

1. Introduction

This report describes the procedure followed for extracting and integrating IMDb and MovieLens data.

IMDb, the Internet Movie Database, is a huge collection of movie information (self-described as the earth’s biggest movie database). It started as a hobby project by an international group of movie fans and currently belongs to Amazon.com. IMDb tries to catalog every pertinent detail about a movie, from who was in it, to who made it, to trivia about it, to filming locations, and even where you can find reviews and fan sites on the web. The IMDb web site (http://www.imdb.com/ [4]) provides 49 text files in an ad-hoc format (called lists) containing different characteristics of movies (e.g. actors.list or running-times.list). The content of the lists is continually updated and enlarged; as of October 5th, 2006, the movie list included more than 858,000 movies.

MovieLens is a movie recommender project developed by the Department of Computer Science and Engineering at the University of Minnesota. MovieLens is a typical collaborative filtering system that collects movie preferences from users and then groups users with similar tastes. Based on the movie ratings expressed by all the users in a group, it attempts to predict for each individual their opinion on movies they have not yet seen. Two data sets are available at the MovieLens web site (http://movielens.umn.edu/ [2]). The first one consists of 100,000 ratings for 1682 movies by 943 users. The second one consists of approximately 1 million ratings for 3883 movies by 6040 users.

Our goal is to build a relational database about movies that includes movie descriptions and user ratings. To this end, we extract, transform and integrate data provided by the IMDb and MovieLens sites. This database will be used, in future work, to evaluate the performance and pertinence of different techniques and algorithms for querying large data sets, taking into account user preferences and data quality.

The extraction and integration procedure has 4 main steps: (i) extraction of MovieLens data, (ii) extraction and transformation of IMDb data, (iii) matching of MovieLens and IMDb movie titles, and (iv) construction of the integrated database. Figure 1 shows an overview of these steps.

1 This research was partially supported by the French Ministry of Research and New Technologies under the ACI program devoted to Data Masses (ACI-MD), project #MD-33.


Extraction and Integration of MovieLens and IMDb Data – Technical Report

Figure 1 – Extraction and integration steps (boxes: MovieLens data, IMDb data, Extraction of MovieLens data, Extraction of IMDb data, Matching of movie titles, Construction of the integrated database)

Data extraction and integration processes were executed in a Microsoft Access database and then migrated to an Oracle database. In the remainder of the report we describe each step: Sections 2 and 3 present the extraction of MovieLens and IMDb data respectively, describing source schemas, target schemas and the algorithms that perform data extraction, data cleaning and data transformation. Section 4 presents data integration, describing the integrated schema, the algorithm that matches movie titles and the construction of the integrated database. Matching details are the subject of a more detailed report [Peralta 2007a]. Finally, Section 7 concludes.

2. MovieLens data extraction

MovieLens data sets are provided in the form of tabular text files (delimiter-separated text). Data extraction consisted of loading the text data into a relational database, normalizing the data and eliminating duplicates. The following sub-sections describe source files, target schemas, extraction processes and cleaning processes.

2.1. MovieLens source schemas

MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota. They provide two data sets (with 100,000 and 1,000,209 evaluations respectively). Data was collected through the MovieLens web site [3] and was cleaned up, i.e. users who had fewer than 20 ratings or did not have complete demographic information were removed from the data sets.

We concentrated on the larger one; however, the smaller one is helpful for matching movie titles (because it provides the IMDb URL of each movie), so it was extracted too. The following sub-sections describe both schemas.

2.1.1. Source files of the large data set

The large data set consists of 3 text files, in tabular format, describing 1000209 anonymous ratings of 3883 movies made by 6040 MovieLens users who joined MovieLens in 2000. In the following, we describe the contents of each text file.

Ratings.dat

This file contains data about 1000209 ratings (1 evaluation, corresponding to 1 user and 1 movie, on each line) in the format UserID::MovieID::Rating::Timestamp, where:

− UserID is an integer, ranging from 1 to 6040, that identifies a user. Each user has rated at least 20 movies.

− MovieID is an integer, ranging from 1 to 3952, that identifies a movie.

− Rating is an integer, ranging from 1 to 5, made on a 5-star scale (whole-star ratings only).

− Timestamp is expressed in seconds since 1/1/1970 UTC.
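The record layout above can be read with a few lines of code. The following is a minimal sketch in Python, not the loading procedure used in this report (which relied on the Microsoft Access loading interface); the names Rating and parse_rating are hypothetical, and the sample line is the first one of the extract in Figure 2.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: int
    rated_at: datetime  # UTC time derived from the Unix timestamp


def parse_rating(line: str) -> Rating:
    """Parse one 'UserID::MovieID::Rating::Timestamp' line."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    value = int(rating)
    if not 1 <= value <= 5:
        raise ValueError(f"rating out of range: {value}")
    rated_at = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return Rating(int(user_id), int(movie_id), value, rated_at)


# First line of the Ratings.dat extract.
example = parse_rating("1::1193::5::978300760")
print(example)
```

Converting the timestamp explicitly in UTC avoids depending on the local time zone of the machine that runs the load.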

Figure 2 shows an extract of the file corresponding to 5 ratings.

1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291

Figure 2 – Extract of the Ratings.dat file

Movies.dat

This file contains data about 3883 movies (1 movie on each line) in the format MovieID::Title::Genres, where:

− MovieID is an integer, ranging from 1 to 3952, that identifies a movie.

− Title is a String that concatenates the movie title and the year of release (in parentheses).

− Genres is a pipe-separated list of genres. Provided genres are: Action, Adventure, Animation, Children's, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western.
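As an illustration of the Movies.dat layout, the sketch below splits a line into its fields and separates the release year from the title. It is only an example under assumptions: the regular expression TITLE_YEAR and the names Movie and parse_movie are hypothetical, and the pattern assumes the year is always a 4-digit number in parentheses at the end of the title, which the extract suggests but the source files may not always respect.

```python
import re
from typing import List, NamedTuple, Optional

# Hypothetical pattern: title text, a space, then a 4-digit year in parentheses.
TITLE_YEAR = re.compile(r"^(?P<title>.*)\s\((?P<year>\d{4})\)$")


class Movie(NamedTuple):
    movie_id: int
    title: str
    year: Optional[int]  # None when the title does not end with "(YYYY)"
    genres: List[str]


def parse_movie(line: str) -> Movie:
    """Parse one 'MovieID::Title::Genres' line."""
    movie_id, raw_title, raw_genres = line.strip().split("::")
    match = TITLE_YEAR.match(raw_title)
    if match:
        title, year = match.group("title"), int(match.group("year"))
    else:
        title, year = raw_title, None
    genres = raw_genres.split("|") if raw_genres else []
    return Movie(int(movie_id), title, year, genres)


movie = parse_movie("1::Toy Story (1995)::Animation|Children's|Comedy")
print(movie)
```

Titles that do not match the pattern are kept unchanged with no year, so that unexpected formats surface during cleaning instead of being silently altered.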

Figure 3 shows an extract of the file corresponding to 5 movies.

1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy

Figure 3 – Extract of the Movies.dat file

Users.dat

This file contains data about 6040 users (1 user on each line) in the format UserID::Gender::Age::Occupation::Zip-code, where:

− UserID is an integer, ranging from 1 to 6040, that identifies a user.

− Gender is denoted by "M" for male and "F" for female.

− Age is an integer identifying a range (the minimum age in the range). Provided ranges are: under 18, 18-24, 25-34, 35-44, 45-49, 50-55, 56 and over.

− Occupation is an integer, ranging from 0 to 20, indicating the following choices: 0: other or not specified, 1: academic/educator, 2: artist, 3: clerical/admin, 4: college/grad student, 5: customer service, 6: doctor/health care, 7: executive/managerial, 8: farmer, 9: homemaker, 10: K-12 student, 11: lawyer, 12: programmer, 13: retired, 14: sales/marketing, 15: scientist, 16: self-employed, 17: technician/engineer, 18: tradesman/craftsman, 19: unemployed, 20: writer.

− Zip-code is a five-digit integer indicating the user's ZIP code.
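The coded Age and Occupation fields can be decoded with two small lookup tables, as in the following sketch. The names AGE_RANGES, OCCUPATIONS and parse_user are hypothetical. The age code 1 for the "under 18" range is an assumption inferred from the Users.dat extract (Figure 4); the other codes follow the "minimum age in the range" rule stated above.

```python
# Age codes: minimum age of each range; code 1 for "under 18" is an assumption
# inferred from the Users.dat extract.
AGE_RANGES = {
    1: "under 18", 18: "18-24", 25: "25-34", 35: "35-44",
    45: "45-49", 50: "50-55", 56: "56 and over",
}

# Occupation labels indexed by their code (0 to 20), as listed in the text.
OCCUPATIONS = [
    "other or not specified", "academic/educator", "artist", "clerical/admin",
    "college/grad student", "customer service", "doctor/health care",
    "executive/managerial", "farmer", "homemaker", "K-12 student", "lawyer",
    "programmer", "retired", "sales/marketing", "scientist", "self-employed",
    "technician/engineer", "tradesman/craftsman", "unemployed", "writer",
]


def parse_user(line: str) -> dict:
    """Parse one 'UserID::Gender::Age::Occupation::Zip-code' line."""
    user_id, gender, age, occupation, zip_code = line.strip().split("::")
    if gender not in ("M", "F"):
        raise ValueError(f"unexpected gender: {gender!r}")
    return {
        "user_id": int(user_id),
        "gender": gender,
        "age_range": AGE_RANGES[int(age)],
        "occupation": OCCUPATIONS[int(occupation)],
        # Kept as text: leading zeros are significant (e.g. 02460).
        "zip_code": zip_code,
    }


user = parse_user("4::M::45::7::02460")
print(user)
```

Keeping the ZIP code as a string preserves leading zeros and tolerates non-numeric values.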

All demographic information was provided voluntarily by the users and was not checked for accuracy. Only users who have provided some demographic information are included in this data set. Figure 4 shows an extract of the file corresponding to 5 users.

1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
5::M::25::20::55455

Figure 4 – Extract of the Users.dat file


2.1.2. Source files of the small data set

The small data set consists of 5 text files, in tabular format, describing 100000 anonymous ratings of 1682 movies made by 943 users during the seven-month period from September 19th, 1997 through April 22nd, 1998. In the following, we describe the contents of each text file.

u.data

This file contains data about 100000 ratings (1 evaluation, corresponding to 1 user and 1 movie, on each line). It is a tab-separated list of UserID, MovieID, Rating and Timestamp, where:

− UserID is an integer, ranging from 1 to 943, that identifies a user. Each user has rated at least 20 movies.

− MovieID is an integer, ranging from 1 to 1682, that identifies a movie.

− Rating is an integer, ranging from 1 to 5, made on a 5-star scale (whole-star ratings only).

− Timestamp is expressed in seconds since 1/1/1970 UTC.

Data is randomly ordered. Figure 5 shows an extract of the file corresponding to 5 ratings.

196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596

Figure 5 – Extract of the u.data file

u.item

This file contains data about 1682 movies (1 movie on each line). It is a pipe-separated list of MovieID, MovieTitle, ReleaseDate, VideoReleaseDate, IMDbURL, unknown, Action, Adventure, Animation, Children’s, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War and Western, where:

− MovieID is an integer, ranging from 1 to 1682, that identifies a movie.

− MovieTitle is a String that concatenates the movie title and the year of release (in parentheses).

− ReleaseDate is a DD-Mon-YYYY date indicating the movie release date.

− VideoReleaseDate was intended to hold the video release date but is always NULL.

− IMDbURL indicates the URL of the movie on the IMDb site.

− The last 19 fields correspond to genres; a 1 indicates the movie is of that genre, a 0 indicates it is not; movies can be in several genres at once.
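The 19 genre flags of u.item can be turned back into genre names by pairing them with the column order given above. The sketch below is illustrative only; GENRES, MONTHS, parse_release_date and parse_item are hypothetical names, and the sample record reproduces the first movie of the extract in Figure 6.

```python
from datetime import date

# Genre columns in the order listed for u.item.
GENRES = [
    "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western",
]
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def parse_release_date(text):
    """Convert a DD-Mon-YYYY string to a date; empty input gives None."""
    if not text:
        return None
    day, month, year = text.split("-")
    return date(int(year), MONTHS[month], int(day))


def parse_item(line):
    """Parse one pipe-separated u.item line into a dictionary."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != 5 + len(GENRES):
        raise ValueError(f"expected {5 + len(GENRES)} fields, got {len(fields)}")
    movie_id, title, release, _video_release, url = fields[:5]
    genres = [name for name, flag in zip(GENRES, fields[5:]) if flag == "1"]
    return {
        "movie_id": int(movie_id),
        "title": title,
        "release_date": parse_release_date(release),
        "imdb_url": url,
        "genres": genres,
    }


# First record of the u.item extract (19 flags: Animation, Children's, Comedy set).
sample = "|".join(
    ["1", "Toy Story (1995)", "01-Jan-1995", "",
     "http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)"]
    + list("000111" + "0" * 13))
item = parse_item(sample)
print(item)
```

The month names are mapped by hand so that the result does not depend on the locale of the machine.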

Figure 6 shows an extract of the file corresponding to 5 movies.

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0

Figure 6 – Extract of the u.item file


u.genre

This file contains data about 19 movie genres (1 genre on each line). It is a pipe-separated list of GenreName and GenreID, where:

− GenreID is an integer, ranging from 0 to 18, that identifies a genre.

− GenreName is a String describing the genre. Genre names correspond to the genre columns of the u.item file.

Figure 7 shows an extract of the file corresponding to 5 genres.

unknown|0
Action|1
Adventure|2
Animation|3
Children's|4

Figure 7 – Extract of the u.genre file

u.user

This file contains data about 943 users (1 user on each line). It is a pipe-separated list of UserID, Age, Gender, Occupation and Zip-code, where:

− UserID is an integer, ranging from 1 to 943, that identifies a user.

− Age is an integer indicating the user’s age.

− Gender is denoted by "M" for male and "F" for female.

− Occupation is a String indicating the user's occupation.

− Zip-code is a five-digit integer indicating the user's ZIP code.

All demographic information was provided voluntarily by the users and was not checked for accuracy. Only users who have provided some demographic information are included in this data set. Figure 8 shows an extract of the file corresponding to 5 users.

1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213

Figure 8 – Extract of the u.user file

u.occupation

This file lists 21 user occupations (1 occupation on each line). Figure 9 shows an extract of the file corresponding to 5 occupations.

administrator
artist
doctor
educator
engineer

Figure 9 – Extract of the u.occupation file

2.2. MovieLens target schemas

Both MovieLens data sets were extracted to a Microsoft Access® database. The following sub-sections describe the target schemas for both data sets.


2.2.1. Target tables of the large data set

The target schema for the large data set consists of 6 tables describing movies, users and ratings. Figure 10 shows the tables of the target schema, their primary keys (attributes in bold) and foreign keys (lines between tables). Table 1 describes each target table, its attributes and constraints.

Figure 10 – MovieLens schema (large data set)

ML_Users: demographic information about users

Attributes:

− UserId: Numeric(4)

− Gender: Char(1); value in {M,F}

− AgeId: Numeric(2)

− OccupationId: Numeric(2)

− ZipCode: String(10)

Constraints:

− Primary key: UserId

− Foreign keys: AgeId ref. ML_Ages; OccupationId ref. ML_Occupations

ML_Movies: titles of movies

Attributes:

− MovieId: Numeric(4)

− MovieTitle: String(100)

Constraints:

− Primary key: MovieId

− Unique: MovieTitle

ML_Ratings: user ratings on movies

Attributes:

− UserId: Numeric(4)

− MovieId: Numeric(4)

− Rating: Numeric(1); value in {1,2,3,4,5}

− Timestamp: Numeric(9); represents seconds from an initial time (1/1/1970 UTC)

Constraints:

− Primary key: UserId, MovieId

− Foreign keys: UserId ref. ML_Users; MovieId ref. ML_Movies

ML_MovieGenres: genres of movies

Attributes:

− MovieId: Numeric(4)

− Genre: String(12)

Constraints:

− Primary key: MovieId, Genre

− Foreign keys: MovieId ref. ML_Movies

ML_Ages: age intervals

Attributes:

− AgeId: Numeric(2)

− MinAge: Numeric(2)

− MaxAge: Numeric(2)

Constraints:

− Primary key: AgeId

ML_Occupations: occupations

Attributes:

− OccupationId: Numeric(2)

− Occupation: String(25)

Constraints:

− Primary key: OccupationId

Table 1 – MovieLens schema (large data set)

2.2.2. Target tables of the small data set

The target schema for the small data set consists of 5 tables describing movies, movie genres, users and ratings. Figure 11 shows the tables of the target schema, their primary keys (attributes in bold) and foreign keys (lines between tables). Table 2 describes each target table, its attributes and constraints.

Figure 11 – MovieLens schema (small data set)


SML_Users: demographic information about users

Attributes:

− UserId: Numeric(4)

− Age: Numeric(2)

− Gender: Char(1); value in {M,F}

− Occupation: String(15); finite set

− ZipCode: String(5)

Constraints:

− Primary key: UserId

SML_Movies: information about movies

Attributes:

− MovieId: Numeric(4)

− MovieTitle: String(100)

− ReleaseYear: Numeric(4)

− Imdb: String(150); represents the URL of an IMDb page

Constraints:

− Primary key: MovieId

− Unique: MovieTitle

SML_Ratings: user ratings on movies

Attributes:

− UserId: Numeric(4)

− MovieId: Numeric(4)

− Rating: Numeric(1); value in {1,2,3,4,5}

− Timestamp: Numeric(9); represents seconds from an initial time (1/1/1970 UTC)

Constraints:

− Primary key: UserId, MovieId

− Foreign keys: UserId ref. SML_Users; MovieId ref. SML_Movies

SML_Genres: description of genres

Attributes:

− Genre: String(12)

− GenreId: Numeric(2)

Constraints:

− Primary key: GenreId

SML_MovieGenres: genres of movies

Attributes:

− MovieId: Numeric(4)

− GenreId: Numeric(2)

Constraints:

− Primary key: MovieId, GenreId

− Foreign keys: MovieId ref. SML_Movies; GenreId ref. SML_Genres

Table 2 – MovieLens schema (small data set)

2.3. MovieLens extraction, cleaning and transformation processes

Both data sets were quite consistent; however, we detected and solved some kinds of anomalies:

− Multi-valued attributes: Some source attributes are provided as comma-separated (or pipe-separated) lists of values. As a solution, we normalise tables.

− Null or dummy values in mandatory attributes: Some mandatory attributes (especially key attributes) contain Null or dummy values (e.g. MovieTitle = 'Unknown'). As a solution, such tuples are eliminated.

− Duplicates: Some tuples (or tuple keys) are duplicated. As a solution, we keep a unique tuple, reconciling values of further attributes if necessary.

− Orphans: Some tuples do not satisfy referential constraints, i.e. their values are not included in a foreign table. As a solution, we eliminate orphan tuples.

In order to solve such anomalies we implemented an ETL process for extracting, cleaning and transforming the data of each source file. ETL workflows consist of five major tasks, as illustrated in Figure 12 (rectangles represent tasks; continuous arrows represent control flow; dotted arrows represent data flow, indicating the tables that are input or output of tasks). ETL inputs are a text file containing the data to extract (P.txt in Figure 12) and (possibly) some referential tables (referential in Figure 12). ETL output is a relational table containing the extracted data (P in Figure 12) and (possibly) some aggregation tables (P’ in Figure 12). Two additional tables keep trace of the anomalies found (P-duplicates and P-orphans in Figure 12). We stored the result of each task in a temporary table (white tables in Figure 12) in order to ease traceability and maintainability. For certain specific files, ETL workflows can be simpler because not all the tasks are necessary for loading and cleaning all tables. In the following, we describe the semantics of each task (details are illustrated in Figure 13).

Figure 12 – High-level ETL process (tasks: extraction, cleaning, unduplication, referentiation, exportation; tables: P.txt, P-source, P-cleaning, P-group, P-ref, P, P’, P-duplicates, P-orphans, referential)

Bulk copy tasks load data from text files to relational tables. We used the Microsoft Access® loading interface, indicating file format and column format.

Cleaning tasks perform different data cleaning transformations, for example, elimination of Null values, denormalisation, calculation of derived attributes and so on. Specific cleaning tasks are described later for each table.

Figure 13 – Details of ETL tasks (extraction: bulk copy; cleaning: cleaning; unduplication: duplicate check, reconciliation, duplicate cleaning, grouping, copy; referentiation: referential check, orphan elimination, copy; exportation: validation, copy, aggregation)

Unduplication tasks eliminate duplicate data. Several sub-tasks are performed, distinguishing normal flow (when there are no duplicates) and error flow (when there are duplicates):

− Duplicate check tasks verify the existence of duplicate values of key (or unique) attributes. Note that key (or unique) constraints are not defined yet; unduplication is needed in order to define the appropriate constraints. Duplicate tuples are inserted into temporary tables (P-duplicates in Figure 13), registering the key (or unique) value, the number of times the key (or unique) value is repeated, and the conflicting values of other attributes (minimum and maximum values). Tasks are implemented as SQL operations of the form:

    INSERT INTO P-duplicates (key_attribute1,... key_attributeM, quantity,
        min_column1, max_column1,... min_columnN, max_columnN)
    SELECT key_attribute1,... key_attributeM, COUNT(*), MIN(column1),
        MAX(column1),... MIN(columnN), MAX(columnN)
    FROM P-cleaning
    GROUP BY key_attribute1,... key_attributeM
    HAVING COUNT(*)>1;

− Reconciliation tasks are executed only when some duplicates were found (in the corresponding duplicate check task). Inconsistencies are generally solved by arbitrarily selecting one of the values of the conflicting attributes (e.g. the minimum one) or by filling with Null values. Additional columns are added to the temporary table storing duplicates (P-duplicates in Figure 13) in order to store reconciled values. Tasks are implemented as a sequence of SQL operations, each one solving one inconsistency. The following operations are examples of reconciliation operations:

    UPDATE P-duplicates
    SET column1 = min_column1;

    UPDATE P-duplicates
    SET column2 = NULL
    WHERE min_column2 <> max_column2;

− Duplicate cleaning tasks set a unique value for each non-key attribute (those indicated during the reconciliation task) in duplicated tuples. Tasks are implemented as SQL operations of the form:

    UPDATE P-cleaning
    SET column1 = P-duplicates.column1,...
        columnN = P-duplicates.columnN
    WHERE P-cleaning.key_attribute1 = P-duplicates.key_attribute1 ...
        AND P-cleaning.key_attributeM = P-duplicates.key_attributeM;

− Grouping tasks keep a unique tuple for each key value. Note that there are no conflicting values anymore, so grouping by all attributes is analogous to grouping by key attributes. Tasks are implemented as SQL operations of the form:

    INSERT INTO P-group (key_attribute1,... key_attributeM, column1,... columnN)
    SELECT key_attribute1,... key_attributeM, column1,... columnN
    FROM P-cleaning
    GROUP BY key_attribute1,... key_attributeM, column1,... columnN;

− Copy tasks duplicate tables in order to provide traceability in the case no duplicates were found. Tasks are implemented as SQL operations of the form:

    INSERT INTO P-group
    SELECT *
    FROM P-cleaning;
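To make the unduplication sub-tasks concrete, the following sketch runs them on a tiny in-memory SQLite database through Python's sqlite3 module. It is an illustration under assumptions, not the Access or Oracle implementation used in this report: the table names use underscores (P_cleaning, P_duplicates, P_group) because hyphens are not valid in unquoted SQL identifiers, the sample rows are made up, and the duplicate cleaning step is written as a correlated subquery, which is one portable way to express the update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE P_cleaning (movie_id INTEGER, title TEXT);
INSERT INTO P_cleaning VALUES
    (1, 'Title A (1995)'),
    (2, 'Title B (1995)'),
    (2, 'Title B (1996)'),
    (3, 'Title C (1995)'),
    (3, 'Title C (1995)');

-- Duplicate check: one row per repeated key, with min/max of the other column.
CREATE TABLE P_duplicates (movie_id INTEGER, quantity INTEGER,
                           min_title TEXT, max_title TEXT, title TEXT);
INSERT INTO P_duplicates (movie_id, quantity, min_title, max_title)
SELECT movie_id, COUNT(*), MIN(title), MAX(title)
FROM P_cleaning
GROUP BY movie_id
HAVING COUNT(*) > 1;

-- Reconciliation: arbitrarily keep the minimum value as the reconciled one.
UPDATE P_duplicates SET title = min_title;

-- Duplicate cleaning: propagate the reconciled value to the duplicated tuples.
UPDATE P_cleaning
SET title = (SELECT d.title FROM P_duplicates d
             WHERE d.movie_id = P_cleaning.movie_id)
WHERE movie_id IN (SELECT movie_id FROM P_duplicates);

-- Grouping: keep a single tuple per key; the key constraint can now be defined.
CREATE TABLE P_group (movie_id INTEGER PRIMARY KEY, title TEXT);
INSERT INTO P_group (movie_id, title)
SELECT movie_id, title FROM P_cleaning GROUP BY movie_id, title;
""")

duplicates = conn.execute(
    "SELECT movie_id, quantity, min_title, max_title "
    "FROM P_duplicates ORDER BY movie_id").fetchall()
grouped = conn.execute(
    "SELECT movie_id, title FROM P_group ORDER BY movie_id").fetchall()
print(duplicates)
print(grouped)
```

In this sample, key 2 has conflicting titles and key 3 is an exact duplicate; after reconciliation both collapse to one tuple, so the primary key on P_group can be declared without violations.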

Referentiation tasks eliminate data not satisfying referential constraints. Several sub-tasks are performed, distinguishing normal and error flow:

− Referential check tasks verify the satisfaction of referential constraints. Orphan tuples (which do not match tuples of the referenced table) are stored in a temporary table (P-orphans in Figure 13). Tasks are implemented as SQL operations of the form:

    INSERT INTO P-orphans
    SELECT *
    FROM P-group
    WHERE NOT EXISTS (
        SELECT * FROM referential
        WHERE reference_attribute1 = foreign_attribute1 AND...
            reference_attributeN = foreign_attributeN);

− Orphan elimination tasks delete tuples not satisfying referential constraints (those found by the corresponding referential check task). In other words, tasks keep only tuples that join with referential tables. Tasks are implemented as SQL operations of the form:

    INSERT INTO P-ref
    SELECT P-group.*
    FROM P-group, referential
    WHERE reference_attribute1 = foreign_attribute1 ...
        AND reference_attributeN = foreign_attributeN;

− Copy tasks are analogous to those described for unduplication.
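The referential check and orphan elimination sub-tasks can be tried out in the same way. The sketch below again uses Python's sqlite3 module with made-up rows and underscore table names (P_group, P_orphans, P_ref), all of which are assumptions made for the example; the referential table here stands for a hypothetical list of known movie identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Referenced table: the movie identifiers that exist.
CREATE TABLE referential (movie_id INTEGER PRIMARY KEY);
INSERT INTO referential VALUES (1), (2);

-- Ratings after unduplication; the last one points to an unknown movie.
CREATE TABLE P_group (user_id INTEGER, movie_id INTEGER, rating INTEGER);
INSERT INTO P_group VALUES (1, 1, 5), (1, 2, 3), (2, 99, 4);

-- Referential check: record the tuples that match no referenced tuple.
CREATE TABLE P_orphans AS
SELECT * FROM P_group g
WHERE NOT EXISTS (
    SELECT 1 FROM referential r WHERE r.movie_id = g.movie_id);

-- Orphan elimination: keep only the tuples that join the referenced table.
CREATE TABLE P_ref AS
SELECT g.* FROM P_group g
JOIN referential r ON r.movie_id = g.movie_id;
""")

orphans = conn.execute("SELECT * FROM P_orphans").fetchall()
kept = conn.execute(
    "SELECT * FROM P_ref ORDER BY user_id, movie_id").fetchall()
print(orphans)
print(kept)
```

Every tuple of P_group ends up in exactly one of the two result tables, which is what makes the eliminated data traceable.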

Exportation tasks create result tables. Several sub-tasks are performed:

− Validate tasks perform manual user validations (e.g. control of some sampled tuples) or modifications (e.g. the insertion of new tuples). Specific validation tasks are described later for each table.

− Copy tasks are analogous to those described for unduplication.

− Aggregate tasks build aggregations (P’ in Figure 13) for storing the existing values of a specific attribute, i.e. master tables (e.g. movie genres). Tasks are implemented as SQL operations of the form:

    INSERT INTO P’
    SELECT column1,... columnK
    FROM P-ref
    GROUP BY column1,... columnK;
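As a small illustration of an aggregate task, the sketch below derives a master table of genres from a table of movie and genre pairs, using Python's sqlite3 module. The table name P_prime (standing for P’) and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE P_ref (movie_id INTEGER, genre TEXT);
INSERT INTO P_ref VALUES
    (1, 'Animation'), (1, 'Comedy'),
    (3, 'Comedy'), (3, 'Romance'),
    (4, 'Drama');

-- Aggregation: one row per distinct genre value found in P_ref.
CREATE TABLE P_prime (genre TEXT PRIMARY KEY);
INSERT INTO P_prime (genre)
SELECT genre FROM P_ref GROUP BY genre;
""")

genres = [row[0] for row in
          conn.execute("SELECT genre FROM P_prime ORDER BY genre")]
print(genres)
```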

The following sub-sections describe the specific tasks of each data set.

2.3.1. ETL process for the large data set

The large data set was quite consistent: no duplicates were found and no violations of referential constraints were detected. However, we detected and solved some minor anomalies:

− Genres were represented as multi-valued attributes (pipe-separated lists) in source files. Normalization routines were performed.

− ZIP codes contained letters and ZIP intervals. We ignored such errors and stored values as Strings.

− Master tables for ages and occupations were created from scratch in order to associate descriptions with age ranges and occupation ids.

Figure 14 shows the extraction workflow. Note that not all the tasks presented in previous section were<br />

necessary for extracting data from all text files. For ease <strong>of</strong> underst<strong>and</strong>ing, we omitted representing some data<br />

flows <strong>and</strong> temporary tables; copy tasks were also omitted.<br />

[Figure 14: workflow diagram. The source files movies.dat, ratings.dat and users.dat go through bulk copy, duplicate check, referential check, genre denormalisation and manual insert tasks, producing the tables ML_MovieGenres, ML_Movies, ML_Ratings, ML_Users, ML_Ages and ML_Occupations.]<br />

Figure 14 – ETL workflow for the large data set<br />

The semantics of the bulk copy, duplicate check and referential check tasks were described previously; the semantics of the specific validation and aggregation tasks are the following:<br />

− The genre denormalisation task builds a normalized table expressing the relationship between movies and genres. Each source tuple, which holds a list of genres, is decomposed into several tuples, one per genre. For example, a source tuple listing three genres (see Figure 3) yields 3 tuples, one per genre. The task is implemented as a sequence of SQL operations that split the genre list on the token ‘|’.<br />



− Manual insert tasks create new tables by executing a list of manually entered insert operations, for example:<br />

INSERT INTO ML_Ages (AgeId, MinAge, MaxAge)<br />

VALUES (18, 18, 24);<br />
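The genre denormalisation task of the large data set can be illustrated outside SQL. The following is a minimal Python sketch of the split on the token ‘|’; the function name split_genres and the sample genre names are illustrative assumptions, not part of the report's implementation.<br />

```python
def split_genres(movie_id, genre_list, token="|"):
    """Decompose one (movie, genre list) tuple into one tuple per genre."""
    # Empty fragments (e.g. from an empty list) are skipped.
    return [(movie_id, genre) for genre in genre_list.split(token) if genre]


# Hypothetical source tuple: movie 1 with a pipe-separated list of three genres.
tuples = split_genres(1, "Animation|Children's|Comedy")
```

Each resulting pair corresponds to one row of a movie-genre table such as ML_MovieGenres.<br />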

2.3.2. ETL process for the small data set<br />

We detected and solved several anomalies in the small data set:<br />

− We found 18 duplicate titles (with different MovieId). These tuples would cause duplicates when matching IMDb movies by title. We arbitrarily kept the smaller MovieId of each duplicate title.<br />

− There was a movie with title='Unknown' and null values for ReleaseYear and IMDbURL. The movie was eliminated.<br />

− Ratings of eliminated movies were also eliminated. Reassigning them to the MovieId of the kept movies would lead to duplicates with contradictory ratings.<br />

− Genres were denormalized in source files (one Boolean attribute for each genre). Normalization routines were performed.<br />
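The handling of duplicate titles and of their ratings can be sketched as follows. This is a minimal Python illustration under the rules stated above (keep the smaller MovieId per title, drop the ratings of eliminated movies); the function name and the in-memory tuple layout are assumptions made for the example.<br />

```python
def reconcile_titles(movies, ratings):
    """movies: list of (movie_id, title); ratings: list of (user_id, movie_id, rating)."""
    kept = {}
    for movie_id, title in movies:
        # Keep the smaller MovieId of each duplicated title.
        if title not in kept or movie_id < kept[title]:
            kept[title] = movie_id
    kept_ids = set(kept.values())
    clean_movies = [(i, t) for i, t in movies if i in kept_ids]
    # Ratings of eliminated movies are dropped rather than reassigned.
    clean_ratings = [r for r in ratings if r[1] in kept_ids]
    return clean_movies, clean_ratings
```

Dropping instead of reassigning avoids two ratings by the same user for the same kept movie.<br />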

Figure 15 shows the extraction workflow.<br />

[Figure 15: workflow diagram. The source files u.genre, u.item, u.data, u.user and u.occupation go through bulk copy, null-values elimination, duplicate check, year calculation, movie id reconciliation, referential check, duplicate cleaning, genre denormalization, grouping and orphan elimination tasks, producing the tables SML_Genres, SML_MovieGenres, SML_Movies, SML_Ratings and SML_Users.]<br />

Figure 15 – ETL workflow for the small data set<br />

The semantics of the bulk copy, duplicate check, duplicate cleaning, grouping, referential check and orphan elimination tasks were described previously; the semantics of the specific cleaning, reconciliation and aggregation tasks are the following:<br />

− The null-values elimination task deletes tuples having null or dummy values. Specifically, we deleted tuples having ‘Unknown’ as title. The task is implemented as follows:<br />

INSERT INTO movies-cleaning (MovieId, MovieTitle, ReleaseDate, IMDb)<br />

SELECT * FROM movies-source<br />

WHERE MovieTitle <> 'Unknown';<br />


− Year calculation task calculates movie release year from movie release date. A typical truncate function is<br />

used. The task is implemented as follows:<br />

UPDATE movies-cleaning<br />

SET ReleaseYear = Mid(ReleaseDate,8,4);<br />

− The MovieId reconciliation task keeps the smaller movie id when there are duplicated movie titles (which are supposed to be unique). The task is implemented as follows:<br />

UPDATE movie-duplicates<br />

SET MovieId = MinMovieId;<br />

− The genre denormalisation task builds a normalized table expressing the relationship between movies and genres. Each source tuple, which has several Boolean values, each one corresponding to a genre, is decomposed into several tuples, one for each True value. For example, from a source tuple whose Boolean values for genres 3, 4 and 5 are True (see Figure 6) we create 3 tuples, indicating that genres 3, 4 and 5 correspond to the movie. The task is implemented as a sequence of SQL operations, one for each genre; for example, the query for Adventure movies (genre 2) is:<br />

INSERT INTO SML_MovieGenres (MovieId, GenreId)<br />

SELECT MovieId, 2<br />

FROM Movies-ref<br />

WHERE Adventure=1;<br />
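The year calculation and genre denormalisation tasks can also be sketched in a few lines of Python. The sketch assumes that release dates look like "01-Jan-1995" (consistent with Mid(ReleaseDate,8,4)) and that genre ids start at 0, so that Adventure is genre 2; both function names are hypothetical.<br />

```python
def release_year(release_date):
    """Same effect as Mid(ReleaseDate, 8, 4): 4 characters from position 8 (1-based)."""
    return release_date[7:11]


def genre_tuples(movie_id, flags):
    """flags: sequence of 0/1 values; index i is assumed to be genre id i."""
    # One (MovieId, GenreId) tuple is produced for each True flag.
    return [(movie_id, genre_id) for genre_id, flag in enumerate(flags) if flag == 1]
```

A single pass over the flags replaces the sequence of per-genre INSERT queries, at the cost of leaving SQL.<br />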

2.3.3. Summary and statistics<br />

As a summary of the ETL tasks, the following tables show the number of extracted tuples, the number of anomalies, duplicates and orphans found, and the number of exported tuples for both data sets.<br />

| Source file | Extracted tuples | Cleaning anomalies | Duplicates | Orphans | Exported tuples | Resulting table |
|---|---|---|---|---|---|---|
| Movies.dat | 3883 | 0 | 0 | 0 | 3883 | ML_Movies |
| | | | | | 6407 | ML_MovieGenres |
| Ratings.dat | 1000209 | 0 | 0 | 0 | 1000209 | ML_Ratings |
| Users.dat | 6040 | 0 | 0 | 0 | 6040 | ML_Users |
| | | | | | 7 | ML_Ages |
| | | | | | 21 | ML_Occupations |

Table 3 – Statistics of data extraction, cleaning and transformation of the large data set<br />

| Source file | Extracted tuples | Cleaning anomalies | Duplicates | Orphans | Exported tuples | Resulting table |
|---|---|---|---|---|---|---|
| u.genre | 19 | 0 | 0 | 0 | 19 | SML_Genres |
| u.item | 1682 | 1 | 18 | 0 | 1663 | SML_Movies |
| | | | | | 2906 | SML_MovieGenres |
| u.data | 100000 | 0 | 0 | 617 | 99383 | SML_Ratings |
| u.user | 943 | 0 | 0 | 0 | 943 | SML_Users |
| u.occupation | 21 | | | | | |

Table 4 – Statistics of data extraction, cleaning and transformation of the small data set<br />

3. IMDb data extraction<br />

The IMDb data set consists of ad-hoc text files in different formats, including tabular lists, tagged text and hierarchically organized text. Data extraction consisted of loading the text data into a relational database, normalizing the data and eliminating duplicates. The following sub-sections describe source files, target schemas, extraction processes and cleaning processes.<br />

3.1. IMDb source schemas<br />

The IMDb data set [5] consists of 49 lists that catalog different details about movies. As of October 5th, 2006, list sizes varied from 25000 to 5000000 tuples. Each list has a list manager, who is responsible for updating, maintaining and publishing it (http://www.imdb.com/helpdesk/contact). List managers rely on IMDb users to submit corrections and additions to keep the lists as accurate and complete as possible. List updates are mostly submitted via the movie mail-server's central collection service, following specific submission protocols (http://www.imdb.com/mailing_lists).<br />

We extracted data from 23 lists, representing the most relevant tabular features about movies. We discarded the lists providing textual data (10 lists), such as movie plots or trivia, because they were not useful for our database-oriented goals. The remaining 16 lists may be extracted in the near future.<br />

We treated four file formats:<br />

Fixed-length columned files present data in several columns, each one starting at a fixed position. They are very simple to treat with a database loading interface like the one used for extracting the MovieLens data sets. Figure 16 shows an extract of the ratings.list file. It is the only file with this format:<br />

− ratings.list has 4 columns (rating distribution, number of votes, average rating, movie title).<br />

0000000015 178183 9.1 Godfather, The (1972)<br />

0000000115 214539 9.1 Shawshank Redemption, The (1994)<br />

0000000124 101253 9.0 Godfather: Part II, The (1974)<br />

0000000015 161827 8.9 Lord of the Rings: The Return of the King, The (2003)<br />

0000000124 87270 8.8 Casablanca (1942)<br />

0000000124 129696 8.8 Schindler's List (1993)<br />

Figure 16 – Extract of the ratings.list file<br />
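As an illustration of how one line of ratings.list maps onto its 4 columns, here is a minimal Python sketch. It splits on whitespace rather than on fixed character positions, because the exact column offsets are not given here; that choice, and the function name, are assumptions of the example.<br />

```python
def parse_rating_line(line):
    """Split one ratings.list line into (distribution, votes, rating, title)."""
    # The first three columns never contain spaces; the title takes the rest.
    distribution, votes, rating, title = line.split(None, 3)
    return distribution, int(votes), float(rating), title.strip()


row = parse_rating_line("0000000015 178183 9.1 Godfather, The (1972)")
```

The distribution column is kept as a string so that its leading zeros are preserved.<br />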

Tab-separated columned files also present data in several columns, but columns are separated by tabulations. The number of tabulations between two consecutive columns is variable (it follows the visual presentation), which prevents the direct use of database loading interfaces. Files need to be pre-processed in order to replace runs of several tabulations by a single one. In addition, as some values can be omitted (null values), a double tabulation is expected in some cases. In particular, most lists contain optional comments. As anomalies, we found occurrences of double spaces instead of a tabulation. Figure 17 shows an extract of the genres.list file, which presents movie titles and their genres separated by several tabulations. The files having this format are:<br />

− color-info.list has 3 columns (movie title, color, comments).<br />

− countries.list has 2 columns (movie title, country).<br />

− genres.list has 2 columns (movie title, genre).<br />

− keywords.list has 2 parts. The former contains a list of keywords (with the number of films including each keyword between brackets), organized in several tab-separated columns. The latter contains two columns (movie title, keyword).<br />

− language.list has 3 columns (movie title, language, comments).<br />

− locations.list has 3 columns (movie title, location, comments), where location is a comma-separated list of geographical names, for example: “Sevilla, Andalucía, Spain”.<br />

− movies.list has 3 columns (movie title, release year, comments).<br />

− production-companies.list has 3 columns (movie title, production company, comments), where production company may contain the code of one or several countries (between square brackets), for example: “Taxi Films [uy]”<br />

− release-dates.list has 3 columns (movie title, release date, comments), where release date embeds the release country.<br />

− running-times.list has 3 columns (movie title, running time, comments), where running time may embed the country.<br />

− sound-mix.list has 2 columns (movie title, sound mix).<br />

Walking with the Dead (1998) Short<br />

Walking with Walken (2001) Short<br />

Walkin' on Sunshine: The Movie (1997) Comedy<br />

Walkin' on Sunshine: The Movie (1997) Sci-Fi<br />

Walk in the Clouds, A (1995) Drama<br />

Walk in the Clouds, A (1995) Romance<br />

Figure 17 – Extract of the genres.list file<br />
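The pre-processing of tab-separated files can be sketched as follows. This minimal Python example collapses runs of tabulations, falls back to double spaces when no tabulation is present, and pads missing trailing columns with None. It assumes that only trailing columns (such as comments) are omitted; the function name is hypothetical.<br />

```python
import re


def split_columns(line, expected):
    """Split a line on runs of tabs and pad it to the expected column count."""
    parts = re.split(r"\t+", line.rstrip("\n"))
    if len(parts) == 1:
        # Anomaly described in the text: double spaces used instead of a tab.
        parts = re.split(r" {2,}", parts[0])
    parts = [p.strip() for p in parts]
    # Omitted trailing values (e.g. optional comments) become None.
    return parts + [None] * (expected - len(parts))
```

A null value in a middle column would need a stricter rule, since collapsing tabulations hides where the empty column was.<br />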


Tagged files precede data with tags, which indicate the nature of the data. Figure 18 shows an extract of the biographies.list file; tags are placed at the start of each line, indicating, for example, an artist's name (NM), real name (RN), date and place of birth (DB), etc. Tags are predefined for each list, but some tags are not explained. This type of file needs ad-hoc parsing. The files having this format are:<br />

− biographies.list includes as tags: artist name, real name, birth (date and place), decease (date, place and cause) and height. Other tags are ignored.<br />

− business.list includes as tags: movie title, budget (currency and amount), and revenue (currency, amount and country). Other tags are ignored.<br />

NM: Noth, Chris<br />

RN: Noth, Christopher David<br />

DB: 13 November 1954, Madison, Wisconsin, USA<br />

BG: Chris Noth was born in Madison, Wisconsin on 13 November 1954. In his<br />

BG: youth he travelled and lived in England, Yugoslavia and Spain…<br />

Figure 18 – Extract of the biographies.list file<br />
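The ad-hoc parsing of tagged files can be sketched as follows. This minimal Python example only uses the tags named in the text (NM, RN, DB) and assumes that every record starts with an NM line; the function name and the dictionary layout are assumptions of the example.<br />

```python
def parse_tagged(lines, wanted=("NM", "RN", "DB")):
    """Group tagged lines into one dictionary per artist (started by an NM tag)."""
    records, current = [], None
    for line in lines:
        tag, sep, value = line.partition(":")
        if not sep or len(tag) != 2:
            continue  # not a tagged line
        value = value.strip()
        if tag == "NM":
            current = {"NM": value}  # a new record begins
            records.append(current)
        elif current is not None and tag in wanted:
            current[tag] = value     # other tags (e.g. BG) are ignored
    return records


sample = [
    "NM: Noth, Chris",
    "RN: Noth, Christopher David",
    "DB: 13 November 1954, Madison, Wisconsin, USA",
    "BG: Chris Noth was born in Madison, Wisconsin on 13 November 1954.",
]
records = parse_tagged(sample)
```

Each dictionary can then be mapped to one row of a biographies table.<br />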

Hierarchical-structured files organize data into two hierarchical levels: the outer level describes a movie feature (e.g. an actor or a director) and the inner level lists all movies corresponding to that feature. Figure 19 shows an extract of the actors.list file, presenting the list of movies in which an actor played. Some files contain other attributes and comments, which follow movie titles (on the same line) separated by one or more tabulations. This type of file is the most difficult to extract; it needs ad-hoc extractors to iterate between hierarchical levels, and pre-processing to solve tabulation problems (see tab-separated columned files). The files having this format are:<br />

− actors.list lists actors and 4 columns (movie title, role, casting position, comments). Columns (except movie title) are optional.<br />

− actresses.list lists actresses and 4 columns (movie title, role, casting position, comments). Columns (except movie title) are optional.<br />

− costume-designers.list lists costume designers and 2 columns (movie title, comments).<br />

− directors.list lists directors and 2 columns (movie title, comments).<br />

− distributors.list lists movie titles and 2 columns (distribution company, comments), where distribution company may contain acronyms (between brackets) and the code of one or several countries (between square brackets), for example: “France 2 (FR 2) [fr]”<br />

− movie-links.list lists movie titles and 1 column (link to another movie title), which embeds the link type (e.g. “featured in”, “alternate language version of”, “referenced in”, etc).<br />

− producers.list lists producers and 2 columns (movie title, comments).<br />

− production-designers.list lists production designers and 2 columns (movie title, comments).<br />

− writers.list lists writers and 2 columns (movie title, comments).<br />

Outstanding Supporting Actor in a Comedy Series]<br />

Noth, Chris Abducted: A Father's Love (1996) (TV) [Larry Coster]<br />

Acting Class, The (2000) [Martin Ballsac]<br />

Apology (1986) (TV) [Roy Burnette]<br />

At Mother's Request (1987) (TV) [Steve Klein]<br />

Baby Boom (1987) (as Christopher Noth) [Yuppie Husband]<br />

Bad Apple (2004) (TV) [Tozzi]<br />

Figure 19 – Extract of the actors.list file<br />
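An extractor that iterates between the two hierarchical levels can be sketched as follows. This minimal Python example assumes that the person's name is followed by a tabulation and the first movie, that continuation lines start with tabulations, and that a blank line closes a block; these layout details and the function name are assumptions, and comments such as (TV) are left inside the title.<br />

```python
import re


def parse_hierarchical(lines):
    """Return (person, title, role) triples from a two-level list."""
    rows, person = [], None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            person = None          # a blank line closes the current block
            continue
        head, sep, rest = line.partition("\t")
        if head.strip():
            person = head.strip()  # outer level: a new person starts here
        entry = rest.strip() if sep else ""
        if person is None or not entry:
            continue
        # Inner level: the role, when present, is given between square brackets.
        role = re.search(r"\[([^\]]*)\]", entry)
        title = entry[:role.start()].strip() if role else entry
        rows.append((person, title, role.group(1) if role else None))
    return rows


sample = [
    "Noth, Chris\tAbducted: A Father's Love (1996) (TV)  [Larry Coster]",
    "\t\t\tActing Class, The (2000)  [Martin Ballsac]",
    "\t\t\tBaby Boom (1987)  (as Christopher Noth)  [Yuppie Husband]",
]
rows = parse_hierarchical(sample)
```

The person's name is carried over to every continuation line, which flattens the hierarchy into one tuple per movie.<br />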

Table 5 summarizes file formats, optional attributes and other particularities.<br />

| File | Format | Optional att. | Particularities |
|---|---|---|---|
| movies.list | Tab-separated columned | | |
| actors.list | Hierarchical-structured | comments, role, casting | |
| actresses.list | Hierarchical-structured | comments, role, casting | |
| biographies.list | Tagged | | Columns birth and decease are comma-separated lists of dates and places. |
| business.list | Tagged | | Columns budget and revenue embed currencies, amounts and countries. |
| color-info.list | Tab-separated columned | comments | |
| costume-designers.list | Hierarchical-structured | comments | |
| countries.list | Tab-separated columned | | |
| directors.list | Hierarchical-structured | comments | |
| distributors.list | Hierarchical-structured | comments | Column distribution company embeds company acronyms and country code. |
| genres.list | Tab-separated columned | | |
| keywords.list | Tab-separated columned | | Contains two lists: (i) list of keywords, (ii) list of movies and their keywords. |
| language.list | Tab-separated columned | comments | |
| locations.list | Tab-separated columned | comments | Column location is a comma-separated list of geographical places. |
| movie-links.list | Hierarchical-structured | | Column link embeds link type. |
| producers.list | Hierarchical-structured | comments | |
| production-companies.list | Tab-separated columned | comments | Column production company embeds country code. |
| production-designers.list | Hierarchical-structured | comments | |
| ratings.list | Fixed-length columned | | |
| release-dates.list | Tab-separated columned | comments | Column release date embeds country. |
| running-times.list | Tab-separated columned | comments | Column running time embeds country. |
| sound-mix.list | Tab-separated columned | | |
| writers.list | Hierarchical-structured | comments | |

Table 5 – List formats and particularities<br />
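Several columns of Table 5 embed more than one value. As an illustration, the following minimal Python sketch separates a company name from its acronym (between brackets) and its country codes (between square brackets), using the two examples quoted in the text; the function name and the exact patterns are assumptions.<br />

```python
import re


def split_company(value):
    """Return (name, acronym, country_codes) for a company column value."""
    # Country codes are lowercase codes between square brackets, e.g. [uy].
    codes = re.findall(r"\[([a-z]{2,})\]", value)
    name = re.sub(r"\s*\[[a-z]{2,}\]", "", value).strip()
    acronym = None
    # An acronym, if any, is the bracketed text closing the remaining name.
    match = re.search(r"\(([^)]*)\)\s*$", name)
    if match:
        acronym = match.group(1)
        name = name[:match.start()].strip()
    return name, acronym, codes
```

For instance, split_company("France 2 (FR 2) [fr]") separates the name "France 2", the acronym "FR 2" and the country code "fr".<br />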

Movies correspond to one of 6 types: cinema film, series episode, mini-series, TV series, video and video game. Movie identifiers (MovieTitle columns) are Strings that concatenate different data depending on the movie type:<br />

− Cinema film identifiers concatenate movie title and release year (between brackets). E.g.: Walk in the Clouds, A (1995)<br />

− Series episode identifiers concatenate movie title (quoted), release year (between brackets) and episode identification (between braces). The latter may include episode name, episode number (between brackets), or both. E.g.: "Sex and the City" (1998) {All or Nothing (#3.10)}<br />

− Mini-series identifiers concatenate movie title (quoted), release year (between brackets) and the word mini (between brackets). E.g.: "Bon Voyage" (2006) (mini)<br />

− TV series identifiers concatenate movie title, release year (between brackets) and the word TV (between brackets). E.g.: Aladdin on Ice (1995) (TV)<br />

− Video identifiers concatenate movie title, release year (between brackets) and the letter V (between brackets). E.g.: Elton John: Live in Barcelona (1992) (V)<br />

− Video game identifiers concatenate movie title, release year (between brackets) and the word VG (between brackets). E.g.: Harry Potter: Quidditch World Cup (2003) (VG)<br />
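The identifier conventions above can be turned into a small classifier. The following Python sketch returns a type code (C, TV, V, VG, S, M, as used for the Type attribute of the target schema) and the release year; treating every quoted title that is not a mini-series as type S is an assumption of the example, as is the function name.<br />

```python
import re


def classify_identifier(identifier):
    """Return (type_code, year) for a movie identifier string."""
    ident = identifier.strip()
    # The release year is the first bracketed group starting with 4 digits.
    year = re.search(r"\((\d{4})[^)]*\)", ident)
    if ident.startswith('"'):
        kind = "M" if ident.endswith("(mini)") else "S"
    elif ident.endswith("(TV)"):
        kind = "TV"
    elif ident.endswith("(VG)"):
        kind = "VG"
    elif ident.endswith("(V)"):
        kind = "V"
    else:
        kind = "C"
    return kind, (year.group(1) if year else None)
```

Identifiers without a recognisable four-digit year yield None for the year instead of raising an error.<br />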

3.2. IMDb target schema<br />

The target schema for the IMDb data set consists of 24 tables describing movies and the companies or persons related to movies. Figure 20 shows the tables of the target schema, their primary keys (underlined attributes) and foreign keys (arrows between tables). Additional dotted lines indicate is-a relationships among some tables and a (not implemented) relation describing persons. Table 6 describes each target table, its attributes and constraints.<br />
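To make the primary key and foreign key conventions concrete, here is a minimal sketch that declares two of the target tables in SQLite through Python's sqlite3 module. Column names, sizes and keys follow Table 6; the choice of SQLite, the VARCHAR spelling of String, the UNIQUE rendering of the alternative key and the sample rows are assumptions of the example, not the report's actual DBMS setup.<br />

```python
import sqlite3

DDL = """
CREATE TABLE IMDb_Movies (
    MovieNum    NUMERIC(7) PRIMARY KEY,
    MovieTitle  VARCHAR(250) UNIQUE,      -- alternative key
    Type        VARCHAR(2),
    Year        VARCHAR(45),
    FurtherInfo VARCHAR(30)
);
CREATE TABLE IMDb_Countries (
    MovieNum   NUMERIC(7) REFERENCES IMDb_Movies (MovieNum),
    MovieTitle VARCHAR(250),
    Country    VARCHAR(30),
    PRIMARY KEY (MovieNum, Country)
);
"""


def create_schema():
    """Create the two tables in an in-memory database with FK checks enabled."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores foreign keys otherwise
    conn.executescript(DDL)
    return conn
```

With foreign keys enabled, inserting a country for a MovieNum that is absent from IMDb_Movies is rejected, which is the constraint that the referential check tasks verify during extraction.<br />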


[Figure 20: schema diagram. Tables and attributes shown:<br />
IMDb_Movies (MovieNum, MovieTitle, Type, Year, FurtherInfo)<br />
IMDb_Ratings (MovieNum, MovieTitle, Distribution, Votes, Rating)<br />
IMDb_Genres (MovieNum, MovieTitle, Genre)<br />
IMDb_Countries (MovieNum, MovieTitle, Country)<br />
IMDb_ReleaseDates (MovieNum, MovieTitle, Country, ReleaseDate, ReleaseYear, ReleaseInfo)<br />
IMDb_Keywords (Keyword, MovieQuantity)<br />
IMDb_MovieKeywords (MovieNum, MovieTitle, Keyword)<br />
IMDb_Languages (MovieNum, MovieTitle, Language, LanguageInfo)<br />
IMDb_ProductionCompanies (MovieNum, MovieTitle, CompanyName, CountryCode, CompanyInfo)<br />
IMDb_Colors (MovieNum, MovieTitle, Color, ColorInfo)<br />
IMDb_MovieLinks (MovieNum, LinkMovieNum, MovieTitle, LinkMovieTitle, LinkType)<br />
IMDb_Business (MovieNum, MovieTitle, BudgetCurrency, Budget, RevenueCurrency, Revenue)<br />
IMDb_Distributors (MovieNum, MovieTitle, DistributionCompany, CountryCode, DistributionInfo)<br />
IMDb_Locations (MovieNum, MovieTitle, Location, Zone, LocationInfo)<br />
IMDb_RunningTimes (MovieNum, MovieTitle, Country, Duration, RunningInfo)<br />
IMDb_Sounds (MovieNum, MovieTitle, SoundMix)<br />
IMDb_Directors (MovieNum, MovieTitle, Director, DirectorAlias, DirectorInfo)<br />
IMDb_Writers (MovieNum, MovieTitle, Writer, WriterInfo)<br />
IMDb_Producers (MovieNum, MovieTitle, Producer, ProducerInfo)<br />
IMDb_CostumeDesigners (MovieNum, MovieTitle, CostumeDesigner, CostumeInfo)<br />
IMDb_ProductionDesigners (MovieNum, MovieTitle, ProductionDesigner, DesignerInfo)<br />
IMDb_Actresses (MovieNum, MovieTitle, Actress, PlayInfo, Role, Casting)<br />
IMDb_Actors (MovieNum, MovieTitle, Actor, PlayInfo, Role, Casting)<br />
IMDb_Biographies (Name, RealName, Birth, Decease, Height)<br />
IMDb_Person (Name)<br />
Arrows between tables are labelled MovieNum, Keyword and LinkMovieNum=MovieNum; the other line labels are Director, Writer, Producer, Actress, Actor, CostumeDesigner, ProductionDesigner and Name.]<br />

Figure 20 – IMDb schema<br />


Table Attributes Constraints<br />

<strong>IMDb</strong>_Movies<br />

Information about<br />

movies<br />

<strong>IMDb</strong>_Actors<br />

Actors <strong>of</strong> movies<br />

<strong>IMDb</strong>_Actresses<br />

Actresses <strong>of</strong> movies<br />

<strong>IMDb</strong>_Biographies<br />

Personal data about<br />

actors, actresses,<br />

directors, producers<br />

<strong>and</strong> other people<br />

involved in movies<br />

<strong>IMDb</strong>_Business<br />

Budget <strong>and</strong> revenue <strong>of</strong><br />

movies<br />

<strong>IMDb</strong>_Colors<br />

Colors <strong>of</strong> movies<br />

<strong>IMDb</strong>_Costume-<br />

Designers<br />

Costume designers <strong>of</strong><br />

movies<br />

<strong>IMDb</strong>_Countries<br />

Countries <strong>of</strong> movies<br />

<strong>IMDb</strong>_Directors<br />

Directors <strong>of</strong> movies<br />

− MovieNum: Numeric(7); autogenerated id<br />

− MovieTitle: String(250)<br />

− Type: String(2); {C=cinema, TV=television,<br />

V=video, VG=video game, S=serie,<br />

M=miniserie}<br />

− Year: String(45); an year, a range or a list <strong>of</strong><br />

years<br />

− FurtherInfo: String(30)<br />

− MovieNum: Numeric(7)<br />

− MovieTitle: String(250)<br />

− Actor: String(75)<br />

− PlayInfo: String(110); further info<br />

(uncredited, alias, special roles…)<br />

− Role: String(256); role name <strong>and</strong>/or<br />

description<br />

− Casting: Numeric(3); casting position<br />

− MovieNum: Numeric(7)<br />

− MovieTitle: String(250)<br />

− Actress: String(75)<br />

− PlayInfo: String(110); further info<br />

(uncredited, alias, special roles…)<br />

− Role: String(256); role name <strong>and</strong>/or<br />

description<br />

− Casting: Numeric(3); casting position<br />

− Name: String(70)<br />

− RealName: String(220)<br />

− Birth: String(130); date <strong>and</strong> place <strong>of</strong> birth<br />

− Decease: String(160); date, place <strong>and</strong> cause<br />

<strong>of</strong> decease<br />

− Height: String(15)<br />

− MovieNum: Numeric(7)<br />

− MovieTitle: String(250)<br />

− BudgetCurrency: String(3)<br />

− Budget: Numeric(15,2)<br />

− RevenueCurrency: String(3)<br />

− Revenue: Numeric(15,2)<br />

− MovieNum: Numeric(7)<br />

− MovieTitle: String(250)<br />

− Color: String(20)<br />

− ColorInfo: String(50); further info (versions)<br />

− MovieNum: Numeric(7)<br />

− MovieTitle: String(250)<br />

− CostumeDesigner: String(50)<br />

− CostumeInfo: String(100); further info<br />

(roles, alias, …)<br />

− MovieNum: Numeric(7)<br />

− MovieTitle: String(250)<br />

− Country: String(30)<br />

− MovieNum: Numeric(7)<br />

− MovieTitle: String(250)<br />

− Director: String(40)<br />

− DirectorAlias: String(80); director’s<br />

appearing name in the movie<br />

− DirectorInfo: String(100); further info e.g.<br />

uncredited movies, co-direction, direction <strong>of</strong><br />

episodes, segments or videos…<br />

Verónika Peralta<br />

Primary key: MovieNum<br />

Alternative key: MovieTitle<br />

Primary key: MovieNum, Actor<br />

Foreign keys:<br />

− MovieNum: ref. <strong>IMDb</strong>_Movies<br />

Primary key: MovieNum, Actress<br />

Foreign keys:<br />

− MovieNum: ref. <strong>IMDb</strong>_Movies<br />

Primary key: Name<br />

Primary key: MovieNum<br />

Foreign keys:<br />

− MovieNum: ref. <strong>IMDb</strong>_Movies<br />

Primary key: MovieNum, Color<br />

Foreign keys:<br />

− MovieNum: ref. <strong>IMDb</strong>_Movies<br />

Primary key: MovieNum, CostumeDesigner<br />

Foreign keys:<br />

− MovieNum: ref. <strong>IMDb</strong>_Movies<br />

Primary key: MovieNum, Country<br />

Foreign keys:<br />

− MovieNum: ref. <strong>IMDb</strong>_Movies<br />

Primary key: MovieNum, Director<br />

Foreign keys:<br />

− MovieNum: ref. <strong>IMDb</strong>_Movies<br />


IMDb_Distributors: Companies that distributed movies<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− DistributionCompany: String(110)<br />
− CountryCode: String(15)<br />
− DistributorInfo: String(140); further info (years, countries, media, …)<br />
Primary key: MovieNum, DistributionCompany, CountryCode<br />
Foreign keys: MovieNum ref. IMDb_Movies<br />

IMDb_Genres: Genres of movies<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− Genre: String(12)<br />
Primary key: MovieNum, Genre<br />
Foreign keys: MovieNum ref. IMDb_Movies<br />

IMDb_Keywords: Master of keywords<br />
− Keyword: String(60)<br />
− MovieQuantity: Numeric(5); the number of movies having the keyword<br />
Primary key: Keyword<br />

IMDb_Languages: Languages of movies<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− Language: String(30)<br />
− LanguageInfo: String(70); further info (versions)<br />
Primary key: MovieNum, Language<br />
Foreign keys: MovieNum ref. IMDb_Movies<br />

IMDb_Locations: Running locations of movies<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− Location: String(170)<br />
− Zone: String(50)<br />
− LocationInfo: String(256); further info (years, countries, media, …)<br />
Primary key: MovieNum, Location<br />
Foreign keys: MovieNum ref. IMDb_Movies<br />

IMDb_MovieKeywords: Keywords of movies<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− Keyword: String(60)<br />
Primary key: MovieNum, Keyword<br />
Foreign keys: MovieNum ref. IMDb_Movies; Keyword ref. IMDb_Keywords<br />

IMDb_MovieLinks: Links between movies<br />
− MovieNum: Numeric(7)<br />
− LinkMovieNum: Numeric(7); linked movie<br />
− MovieTitle: String(250)<br />
− LinkMovieTitle: String(250); linked movie<br />
− LinkType: String(30); references, remakes, etc.<br />
Primary key: MovieNum, LinkMovieNum, LinkType<br />
Foreign keys: MovieNum ref. IMDb_Movies; LinkMovieNum ref. IMDb_Movies<br />

IMDb_Producers: Producers of movies<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− Producer: String(60)<br />
− ProducerInfo: String(90); further info (role, awards, …)<br />
Primary key: MovieNum, Producer<br />
Foreign keys: MovieNum ref. IMDb_Movies<br />

IMDb_ProductionCompanies: Companies that produced movies<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− CompanyName: String(170)<br />
− CountryCode: String(15)<br />
− CompanyInfo: String(140); further info (years, places, participations, …)<br />
Primary key: MovieNum, CompanyName, CountryCode<br />
Foreign keys: MovieNum ref. IMDb_Movies<br />

IMDb_ProductionDesigners: Production designers of movies<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− ProductionDesginer: String(35)<br />
− DesignerInfo: String(140); further info (co-productions, roles, …)<br />
Primary key: MovieNum, ProductionDesginer<br />
Foreign keys: MovieNum ref. IMDb_Movies<br />

IMDb_Ratings: Global ratings on movies (not differentiated per user)<br />
− Distribution: String(10); unknown meaning<br />
− Votes: Numeric(8,2); quantity of votes (decimals are always zero)<br />
− Rating: Number(4,2), value between 0 and 10<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
Primary key: MovieNum<br />
Foreign keys: MovieNum ref. IMDb_Movies<br />


IMDb_ReleaseDates: Dates of movie releases<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− Country: String(40)<br />
− ReleaseDate: String(20)<br />
− ReleaseYear: Numeric(4)<br />
− ReleaseInfo: String(90); further info (release city/festival)<br />
Primary key: MovieNum, Country, ReleaseDate<br />
Foreign keys: MovieNum ref. IMDb_Movies<br />

IMDb_RunningTimes: Running times of movies<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− Country: String(40)<br />
− Duration: Numeric(5)<br />
− RunningInfo: String(75); further info (version, episodes, media, …)<br />
Primary key: MovieNum, Country<br />
Foreign keys: MovieNum ref. IMDb_Movies<br />

IMDb_Sounds: Sound mix of movies<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− SoundMix: String(40)<br />
Primary key: MovieNum, SoundMix<br />
Foreign keys: MovieNum ref. IMDb_Movies<br />

IMDb_Writers: Writers of movies<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− Writer: String(45)<br />
− WriterInfo: String(110); further info (role, awards, …)<br />
Primary key: MovieNum, Writer<br />
Foreign keys: MovieNum ref. IMDb_Movies<br />

Table 6 – IMDb schema<br />

3.3. IMDb extraction processes<br />

Unlike the MovieLens data sets, the IMDb data set was very hard to extract and load into a relational database. The reason is that the data set is designed to be handled by search engines rather than by DBMSs. As a consequence, data is not presented in tabular form: ad-hoc data extractors are needed to load tagged and hierarchical data, and pre-processors are needed to format data that is presented in a pseudo-tabular format. This sub-section describes the data extraction processes; the next sub-section presents the data cleaning and data transformation processes.<br />

Data extraction includes pre-processing tasks for: (i) normalizing hierarchical structures, (ii) substituting multiple tabulations and blanks, and (iii) normalizing tagged structures. The pre-processed files are then loaded into a relational database. Each task is described below.<br />

3.3.1. Normalization <strong>of</strong> hierarchical structures<br />

Many files organize data into two hierarchical levels. The outer level describes a movie feature (e.g. an actor) and the inner level lists all movies corresponding to that feature (see Figure 19). Consequently, only the first line corresponding to a movie feature contains it explicitly; the following lines contain only movie titles (and possibly optional attributes). In order to store data in tabular form, we need to iterate over the list of movies corresponding to each feature and insert the feature value in each line. We do so in a pre-processing stage.<br />

The pre-processing program is written in Java (jdk 1.4). It takes a list file as input, processes it line by line and stores the resulting lines in an output text file. We distinguish lines introducing a new feature from lines containing movies of the previous feature because the former start with a string (the feature value) whereas the latter start with a tabulation mark. A variable storing the current feature value is updated each time a new feature is found. Blank lines are ignored.<br />

Table 7 (column Hierarchies) indicates which source files need pre-processing for normalizing hierarchical structures.<br />
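As an illustration, the normalization of hierarchical structures can be sketched as follows. This is a minimal Python sketch, not the Java program used for this report (which is not reproduced here); the function name and the sample lines are hypothetical.<br />

```python
def normalize_hierarchy(lines):
    """Repeat the current feature value on every movie line.

    A line starting with a tabulation belongs to the feature read last;
    any other non-blank line opens a new feature and carries its first movie.
    """
    current = None  # feature value (e.g. an actor name) being processed
    out = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # blank lines are ignored
        if line.startswith("\t"):
            rest = line.lstrip("\t")  # movie title and optional attributes
        else:
            current, _, rest = line.partition("\t")
            rest = rest.lstrip("\t")
        if current is not None and rest:
            out.append(current + "\t" + rest)
    return out


# Hypothetical sample in the layout described above.
sample = [
    "Doe, John\tMovie A (1999)\n",
    "\t\t\tMovie B (2001)  [Himself]\n",
    "\n",
    "Roe, Jane\t\tMovie C (1985)\n",
]
normalized = normalize_hierarchy(sample)
```

Each output line carries the feature value explicitly, so the result can be loaded as a two-column table once multiple tabulations have been substituted.<br />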

3.3.2. Substitution <strong>of</strong> multiple tabulations <strong>and</strong> blanks<br />

An important problem, present in most source files, is the use of a variable number of tabulations for separating columns. This confuses database loading programs and causes data to be stored in the wrong attributes. In order to prepare source files for database loading programs, we pre-processed them, substituting multiple tabulations and blanks by a single tabulation.<br />

The pre-processing programs are written in Java (jdk 1.4). They take a list file as input, process it line by line and store the resulting lines in an output text file. The substitution function is quite simple: it searches for multiple occurrences of tabulations and/or blanks and substitutes them by a single tabulation.<br />
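The substitution function can be sketched with a regular expression. This is a minimal Python sketch (the original function was written in Java); it assumes that a "multiple occurrence" means a run of two or more tabulations and/or blanks, so that single blanks inside titles are preserved. The names are hypothetical.<br />

```python
import re

# Assumption: two or more consecutive tabulations/blanks form a column separator.
_SEPARATOR = re.compile(r"[ \t]{2,}")


def collapse_separators(line):
    """Replace each run of multiple tabulations and/or blanks by one tabulation."""
    return _SEPARATOR.sub("\t", line)


# Hypothetical sample lines.
row_a = collapse_separators("Movie A (1999)\t\t\tUSA")
row_b = collapse_separators("Movie B (2001)  (uncredited) \t [Himself]")
```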

Optional attributes need more attention. When there is a single optional attribute placed at the end of the line, the database loading interface fills in null values automatically. However, when there are several optional attributes, we need to determine which attribute each value corresponds to and (possibly) insert additional tabulations in order to signal the presence of a null value. Most files contain only comments as optional attributes, placed at the end of the line, so no further processing was needed. Conversely, the actors.list and actresses.list files have three optional attributes: comments, role and casting position. We recognized their values thanks to the special characters enclosing them: comments, roles and casting positions are enclosed in parentheses, square brackets and angle brackets respectively (see Figure 19). In all cases (including files having only comments), we did not remove the enclosing characters; we deferred that to the data cleaning steps.<br />
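The treatment of several optional attributes can be sketched as follows. This is a minimal Python sketch under stated assumptions: the line has already been reduced to single tabulations, its first two fields are the actor and the movie title, and every optional value still starts with its enclosing character. Function names and sample values are hypothetical.<br />

```python
def pad_optional(fields):
    """Return [comments, role, casting], with "" where a value is absent."""
    slots = {"(": 0, "[": 1, "<": 2}  # opening character -> column position
    padded = ["", "", ""]
    for value in fields:
        position = slots.get(value[:1])
        if position is None:
            raise ValueError("unexpected optional value: %r" % value)
        # several comments may follow each other; keep them all
        padded[position] = (padded[position] + " " + value).strip()
    return padded


def to_row(line):
    """Turn 'actor<TAB>title<TAB>optional...' into a fixed five-column row."""
    actor, title, *optional = line.split("\t")
    return "\t".join([actor, title] + pad_optional(optional))


# Hypothetical sample: only the role is present.
row = to_row("Doe, John\tMovie B (2001)\t[Himself]")
```

The empty strings between tabulations play the role of the additional tabulations that signal null values to the loading interface.<br />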

Table 7 indicates which source files needed pre-processing for substituting multiple tabulations or blanks (column Tabulations) and which ones needed the treatment of enclosing characters (column Enclosing characters, which shows attribute names enclosed in the corresponding special characters).<br />

3.3.3. Normalization <strong>of</strong> tagged structures<br />

Two files use tags for organizing data: biographies.list and business.list. In both files, each line starts with a two-letter tag that indicates the meaning of the line (the attribute to which it corresponds).<br />

In biographies.list, the description of a new person starts with a line containing the person’s name (tag NM). Subsequent lines describe different characteristics of the person. We extract real names (tag RN), date and place of birth (tag DB), date, place and cause of decease (tag DD) and height (tag HT). We ignore other tags and blank lines.<br />

Analogously, in business.list the description of a new movie starts with a line containing its title (tag MV). Subsequent lines describe different characteristics of the movie. We extract budget currency and amount (tag BT) and revenue currency and amount (tag RT). We ignore other tags and blank lines.<br />

File | Hierarchies | Tabulations | Enclosing characters | Tags<br />
movies.list | no | yes | – | no<br />
actors.list | yes | yes | (comments), [role], &lt;casting&gt; | no<br />
actresses.list | yes | yes | (comments), [role], &lt;casting&gt; | no<br />
biographies.list | no | no | – | yes<br />
business.list | no | no | – | yes<br />
color-info.list | no | yes | (comments) | no<br />
costume-designers.list | yes | yes | (comments) | no<br />
countries.list | no | yes | – | no<br />
directors.list | yes | yes | (comments) | no<br />
distributors.list | yes | yes | (comments) | no<br />
genres.list | no | yes | – | no<br />
keywords.list | no | yes | – | no<br />
language.list | no | yes | (comments) | no<br />
locations.list | no | yes | (comments) | no<br />
movie-links.list | yes | yes | – | no<br />
producers.list | yes | yes | (comments) | no<br />
production-companies.list | no | yes | (comments) | no<br />
production-designers.list | yes | yes | (comments) | no<br />
ratings.list | no | no | – | no<br />
release-dates.list | no | yes | (comments) | no<br />
running-times.list | no | yes | (comments) | no<br />
sound-mix.list | no | yes | – | no<br />
writers.list | yes | yes | (comments) | no<br />

Table 7 – Pre-processing of source files<br />

In order to store data in tabular form, we need to iterate over the lines corresponding to each person/movie and write a single line with all available characteristics. We do so in a pre-processing stage.<br />

The pre-processing program is written in Java (jdk 1.4). It takes a list file as input, processes it line by line and stores the resulting lines in an output text file. We store extracted data (corresponding to the desired tags) in local variables. In both files, each desired tag is contained in a single line (other, ignored tags are split over several lines). When we recognize a new person/movie (by reading the corresponding tag), we write the line corresponding to the previous one.<br />

Table 7 (column Tags) indicates which source files needed pre-processing for normalizing tagged structures.<br />
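The normalization of tagged structures can be sketched as follows for biographies.list. This is a minimal Python sketch, not the original Java program; it assumes that each line has the form "TAG: value" (the exact separator after the two-letter tag is an assumption), and the function name and sample lines are hypothetical.<br />

```python
WANTED = ("RN", "DB", "DD", "HT")  # tags kept for biographies.list


def flatten_tagged(lines, start_tag="NM", wanted=WANTED):
    """Emit one tab-separated line per person: name plus the wanted tags."""
    rows = []
    name, values = None, {}

    def flush():
        if name is not None:
            rows.append("\t".join([name] + [values.get(t, "") for t in wanted]))

    for raw in lines:
        line = raw.strip()
        if len(line) < 3 or line[2] != ":":
            continue  # blank lines and untagged lines are ignored
        tag, value = line[:2], line[3:].strip()
        if tag == start_tag:
            flush()  # a new person starts: store the previous one
            name, values = value, {}
        elif tag in wanted and tag not in values:
            values[tag] = value
    flush()  # store the last person
    return rows


# Hypothetical sample; the BG tag stands for any ignored tag.
sample = [
    "NM: Doe, John",
    "RN: Jonathan Doe",
    "BG: some text",
    "DB: 1 January 1950, Paris, France",
    "",
    "NM: Roe, Jane",
    "HT: 170 cm",
]
rows = flatten_tagged(sample)
```

The same function would apply to business.list by passing start_tag="MV" and wanted=("BT", "RT").<br />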

3.3.4. Loading into a relational database<br />

After the pre-processing tasks, files have a tabular form and can be handled by a database loading interface. We used the Microsoft Access® loading interface, indicating file format and column format.<br />

Note that the only file that did not need pre-processing is ratings.list, which is structured as fixed-length columns.<br />

3.4. <strong>IMDb</strong> cleaning <strong>and</strong> transformation processes<br />

Unlike the MovieLens data sets, the IMDb data set presented a great number of anomalies and consequently required several data cleaning and data transformation processes. The types of anomalies that we detected and solved are:<br />

− We found 6 duplicate titles in movies.list. We kept only one occurrence of each.<br />

− There were a great number of duplicate movie-feature pairs for most features (specific quantities for each source file are shown in Table 11). In most cases, duplicate tuples had inconsistent values for the remaining attributes. Our policy was to substitute inconsistent values by null values (especially for comments), i.e. we considered inconsistent data as unknown. Some features received special treatment: in the case of budget and revenues (business.list) we summed inconsistent values (they were considered partial data) and in the case of personal data (biographies.list) we kept the maximum value (in all inconsistencies, one of the values was null).<br />

− There were a great number of violations of referential constraints (specific quantities for each source file are shown in Table 11). We eliminated such tuples.<br />

− Some columns embedded multiple attributes. For example, the value “Taxi Films [uy]” in production-companies.list embeds a production company (“Taxi Films”) and the code of its country (“uy”, for Uruguay). Eight files contained this type of anomaly (see Table 5). We solved them by splitting columns.<br />

− Optional columns are delimited by special characters, e.g. square brackets or braces (the specific enclosing characters are listed in Table 7). We eliminated enclosing characters in most cases. The exceptions were occurrences of multiple comments, for example “(1963-1964) (USA) (TV) (original airing)”; we kept the parentheses because they delimit each comment.<br />

− Country names and codes appearing in different source files (distributors.list, production-companies.list, running-times.list, locations.list, countries.list and release-dates.list) were not consistent (for example, we found the names ‘U.K.’ and ‘United Kingdom’ referring to the same country). We deferred the problem to the data integration stage.<br />

− As we did not extract all tags in business.list and biographies.list, we found a significant number of tuples having null values for all columns except the key (specific quantities for each source file are shown in Table 11). We filtered out such tuples.<br />

− The type of a movie (film, series episode, etc.) was not explicitly provided; however, the IMDb site explains how titles are built (see Section 3.1). We calculated movie types following those explanations.<br />

− Release year was calculated from release date (release-dates.list) by using typical date manipulation functions.<br />

− The list of keywords included at the start of the keywords.list file, which is split into several columns, was hard to extract. We calculated the same information by grouping the keywords of all movies and counting the number of movies that reference each one.<br />
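The reconciliation policy for duplicates described in the list above can be sketched as follows. This is a minimal Python sketch of the policy only (the actual tasks are SQL operations discussed in Section 2.3); the function name, its signature and the sample data are hypothetical.<br />

```python
from collections import defaultdict


def reconcile(rows, combine=None):
    """Merge tuples sharing a key; `rows` is a list of (key, value) pairs.

    With combine=None, inconsistent values become None (unknown).
    Otherwise combine() is applied to the non-null values of the group.
    """
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    result = {}
    for key, values in groups.items():
        if len(set(values)) == 1:
            result[key] = values[0]  # consistent duplicates
        elif combine is None:
            result[key] = None  # default policy: unknown
        else:
            result[key] = combine([v for v in values if v is not None])
    return result


# Hypothetical samples for the three policies.
comments = reconcile([
    (("Movie A", "Doe"), "(uncredited)"),
    (("Movie A", "Doe"), "(voice)"),
    (("Movie B", "Doe"), "(voice)"),
])
budgets = reconcile([("Movie A", 100.0), ("Movie A", 50.0)], combine=sum)
heights = reconcile([("Doe", None), ("Doe", "180 cm")], combine=max)
```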


ETL workflows consisted of the same tasks presented for the MovieLens data set, but with different implementations for specific cleaning and reconciliation tasks. The general workflow is illustrated in Figure 21 (P.txt represents either one of the text files obtained after pre-processing source files or ratings.list, which does not need pre-processing). However, for certain files some tasks were not necessary, specifically when there were no transformations to perform or when no duplicates or orphans were detected. Table 8 shows the tasks that are omitted for each file; bulk copy and duplicate check tasks are executed for all files.<br />

In addition, as indexes on movie titles consumed much space and were less efficient, we automatically generated new integer identifiers for movies. We added a new column (MovieNum), of auto-numeric type, to the movies-group table (which stores the results of the grouping task). Each tuple inserted by the grouping task receives a new, automatically generated sequential identifier. We also kept movie titles in tables referencing movies, so that users can choose the identifier best suited to their applications.<br />

Figure 21 – ETL workflow for IMDb files (tasks shown: bulk copy, cleaning, duplicate check, reconciliation, duplicate cleaning, grouping, referential check, orphan elimination, aggregation; data stores shown: P.txt, IMDb_Movies, IMDb_P, IMDb_P’)<br />

Source file | Cleaning | Reconc., dupl. cleaning and grouping | Ref. check | Orphan elimination | Aggregation<br />

movies.txt X X X<br />

actors.txt X X<br />

actresses.txt X X<br />

biographies.txt X X X<br />

business.txt X<br />

color-info.txt X X<br />

costume-designers.txt X X<br />

countries.txt X X<br />

directors.txt X X<br />

distributors.txt X<br />

genres.txt X X<br />

keywords.txt X<br />

language.txt X X<br />

locations.txt X<br />

movie-links.txt X X<br />

producers.txt X X<br />

production-companies.txt X<br />

production-designers.txt X X<br />

ratings.list X X X<br />

release-dates.txt X<br />

running-times.txt X<br />

sound-mix.txt X X<br />

writers.txt X X<br />

Table 8 – ETL tasks that were not necessary to execute<br />

In the following, we describe these specific cleaning and reconciliation tasks, together with some special remarks:<br />

− The movie type calculation task (cleaning for movies.txt) calculates movie types as derived attributes by looking for special characters in the movie title (see Table 9). The task is implemented as a sequence of SQL operations (one for each movie type) that update the Type attribute according to Table 9, for example:<br />

UPDATE [movies-cleaning]<br />
SET Type = 'M'<br />
WHERE Mid(MovieTitle, 1, 1) = '"'<br />
AND Mid(MovieTitle, Len(MovieTitle)-5, 6) = '(mini)';<br />

Movie type | Title is quoted | Special ending word | Generated attribute<br />
Cinema film | No | none | C<br />
Series episode | Yes | none | S<br />
Mini-series | Yes | (mini) | M<br />
TV series | No | (TV) | TV<br />
Video | No | (V) | V<br />
Video game | No | (VG) | VG<br />

Table 9 – Movie type identification<br />
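The movie type rules can be sketched as a single function. This is a minimal Python sketch, not the SQL sequence used for this report; following the SQL example above, a quoted title ending in (mini) is taken to be a mini-series (type M) and any other quoted title a series (type S). The function name and the sample titles are hypothetical.<br />

```python
ENDINGS = {"(TV)": "TV", "(V)": "V", "(VG)": "VG"}


def movie_type(title):
    """Derive the Type code from the way an IMDb title is written."""
    quoted = title.startswith('"')
    ending = title.rsplit(" ", 1)[-1]  # last blank-separated word
    if quoted:
        return "M" if ending == "(mini)" else "S"
    return ENDINGS.get(ending, "C")  # unquoted titles default to cinema films


# Hypothetical sample titles.
types = [
    movie_type(t)
    for t in (
        "Some Film (1999)",
        '"Some Show" (1994)',
        '"Some Saga" (1998) (mini)',
        "Some Film (2001) (TV)",
        "Some Film (2002) (V)",
        "Some Game (2003) (VG)",
    )
]
```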

− Null values elimination tasks (initial cleaning for biographies.txt and business.txt) eliminate tuples having null values for all attributes (except keys). The tasks are implemented as SQL operations like:<br />

INSERT INTO [business-cleaning] (MovieTitle, Budget, Revenue)<br />
SELECT *<br />
FROM [business-source]<br />
WHERE Budget IS NOT NULL OR Revenue IS NOT NULL;<br />

The implementation for biographies.txt is analogous.<br />

− Splitting tasks (cleaning for several files) populate several attributes from a single column that concatenates several concepts. For example, the ProductionCompany column of the productioncompanies.txt file embeds company names and country codes. The tasks search for special characters (e.g. parentheses, square brackets, braces) and split attribute values at those characters. Tasks are implemented as a sequence of SQL operations, for example:<br />

INSERT INTO [productioncompanies-cleaning]<br />
(MovieTitle, ProductionCompany, CompanyInfo, Aux1)<br />
SELECT MovieTitle, ProductionCompany, Comments,<br />
InStr(1, ProductionCompany, '[')<br />
FROM [productioncompanies-source];<br />

UPDATE [productioncompanies-cleaning]<br />
SET CompanyName = Mid(ProductionCompany, 1, Aux1-2),<br />
CountryCode = Mid(ProductionCompany, Aux1+1, Len(ProductionCompany)-Aux1-1)<br />
WHERE Aux1 > 0;<br />

UPDATE [productioncompanies-cleaning]<br />
SET CompanyName = ProductionCompany<br />
WHERE Aux1 = 0;<br />

The first operation calculates the position of the special character and stores it in the Aux1 attribute. The second operation performs the split. The third operation is necessary when some values are optional (in the example, when the country code can be omitted): it sets the first attribute and leaves the second as Null.<br />

The implementation of splitting tasks for business.txt, distributors.txt, releasedates.txt and runningtimes.txt is similar. Table 10 lists the attributes to split and the special characters for those files. Columns of biographies.txt were not split because of the diversity of their formats.<br />

As a special case, the splitting of movielinks.txt is performed by table look-up, i.e. the first attribute (link type) is looked up in a table; the remainder of the text corresponds to the second attribute. The look-up table contains the fifteen link types, namely: “alternate language version of”, “edited from”, “edited into”, “featured in”, “features”, “followed by”, “follows”, “referenced in”, “references”, “remade as”, “remake of”, “spin off”, “spoofed in”, “spoofs”, “version of”. The task is implemented as follows:<br />


INSERT INTO [movielinks-cleaning]<br />
(MovieTitle, LinkMovieTitle, LinkType)<br />
SELECT MovieTitle, Mid(Link, Len(LinkType)+2, Len(Link)), LinkType<br />
FROM [movielinks-source] X, MovieTypes Y<br />
WHERE Mid(Link, 1, Len(LinkType)) = LinkType;<br />

File | Multi-purpose column | First attribute | Second attribute | Separator<br />
business.txt | Budget | BudgetCurrency | Budget | first blank<br />
business.txt | Revenue | RevenueCurrency | Revenue | first blank<br />
distributors.txt | DistributionCompany | DistributionCompany | CountryCode | [<br />
movielinks.txt | Link | LinkType | LinkMovieTitle | look up<br />
productioncompanies.txt | ProductionCompany | CompanyName | CountryCode (opt.) | [<br />
releasedates.txt | ReleaseDate | Country | ReleaseDate | colon<br />
runningtimes.txt | RunningTime | Country (optional) | Duration | colon<br />

Table 10 – Splitting attributes<br />
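Two of the splitting rules of Table 10 can be sketched as follows: splitting at a square bracket (production companies) and splitting by look-up (movie links). This is a minimal Python sketch, not the SQL used for this report; it assumes that the link type is followed by a single blank, and the function names and sample values (other than “Taxi Films [uy]”) are hypothetical.<br />

```python
LINK_TYPES = [
    "alternate language version of", "edited from", "edited into",
    "featured in", "features", "followed by", "follows", "referenced in",
    "references", "remade as", "remake of", "spin off", "spoofed in",
    "spoofs", "version of",
]


def split_company(value):
    """'Taxi Films [uy]' -> ('Taxi Films', 'uy'); the country code is optional."""
    position = value.find("[")
    if position < 0:
        return value, None  # no country code: second attribute stays null
    return value[:position].rstrip(), value[position + 1:].rstrip("] ")


def split_link(link):
    """Split a link into (link type, linked movie title) by table look-up."""
    for link_type in LINK_TYPES:
        if link.startswith(link_type + " "):
            return link_type, link[len(link_type) + 1:]
    raise ValueError("unknown link type in %r" % link)


company = split_company("Taxi Films [uy]")
link = split_link("follows Some Film (1990)")
```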

− Enclosing character elimination tasks (cleaning for several tables, listed in Table 7) delete the initial and final characters of a column, which enclose the value. The task is implemented as follows:<br />

INSERT INTO [languages-cleaning]<br />
(MovieTitle, Language, LanguageInfo)<br />
SELECT MovieTitle, Language, Mid(LanguageInfo, 2, Len(LanguageInfo)-2)<br />
FROM [languages-source];<br />

The implementation for other files is analogous.<br />

− The zone calculation task (cleaning for locations.txt) extracts the zone from the location attribute. Specifically, location values are comma-separated lists of geographic places, from the most precise to the largest (e.g. “Toronto, Ontario, Canada” and “Biltmore Hotel - 506 S. Grand Ave., Downtown, Los Angeles, California, USA”). The number of items in each location value is variable. The Zone attribute stores the last item of a location, which is generally a country name. In order to calculate it, we need to iterate over the location value, searching for the position of a comma and splitting the location value as described for the previous task.<br />

The task is implemented as a sequence of SQL operations, as follows:<br />

INSERT INTO [locations-cleaning]<br />
(MovieTitle, Location, Zone, LocationInfo)<br />
SELECT MovieTitle, Location, Location, Comments<br />
FROM [locations-source];<br />

UPDATE [locations-cleaning]<br />
SET Zone = Mid(Zone, InStr(1, Zone, ',')+2, Len(Zone))<br />
WHERE InStr(1, Zone, ',') > 0;<br />

The first operation sets the zone to the complete location. The second operation splits the zone at the position of the first comma, keeping the second part. Note that only tuples containing a comma are updated. The second operation is repeated until no tuple is updated.<br />
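The effect of the repeated update can be checked with a small Python sketch (not part of the original implementation; function names are hypothetical). The first function mirrors the iteration of the SQL operation, assuming that every comma is followed by a blank; the second obtains the same result in one step.<br />

```python
def zone_iterative(location):
    """Drop the text up to the first comma until no comma remains."""
    value = location
    while "," in value:
        # skip the comma and the blank that follows it
        value = value[value.find(",") + 2:]
    return value


def zone(location):
    """Return the last comma-separated item of a location value."""
    return location.rsplit(",", 1)[-1].strip()


examples = [
    "Toronto, Ontario, Canada",
    "Biltmore Hotel - 506 S. Grand Ave., Downtown, Los Angeles, California, USA",
    "Canada",
]
zones = [zone(e) for e in examples]
```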

− The release year calculation task (further cleaning for releasedates.txt) calculates release year from release date. A typical truncate function is used. The implementation of the task was presented in Section 2.3.2.<br />

− Reconciliation tasks set conflicting values to Null. Their implementation was discussed in Section 2.3.<br />

− Orphan elimination tasks keep the tuples that join with the IMDb_Movies table, as described in Section 2.3. At this point, we copy the auto-generated movie identifier (MovieNum) obtained from the IMDb_Movies table. The implementation for the genres.txt file is:<br />

INSERT INTO [genres-ref] (MovieNum, MovieTitle, Genre)<br />
SELECT Y.MovieNum, X.MovieTitle, X.Genre<br />
FROM [genres-group] X, IMDb_Movies Y<br />
WHERE X.MovieTitle = Y.MovieTitle;<br />

The implementation for other files is analogous. Note that we execute the task even if no orphans were found, in order to obtain movie identifiers.<br />
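The join performed by orphan elimination can be sketched as follows. This is a minimal Python sketch of the logic, not the SQL used for this report; the function name, the dictionary representation of IMDb_Movies and the sample data are hypothetical.<br />

```python
def eliminate_orphans(feature_rows, movies):
    """Keep rows whose title exists in `movies` and attach its identifier.

    `movies` maps MovieTitle -> MovieNum; `feature_rows` holds
    (MovieTitle, value) pairs. Returns (kept, orphans).
    """
    kept, orphans = [], []
    for title, value in feature_rows:
        if title in movies:
            kept.append((movies[title], title, value))
        else:
            orphans.append((title, value))
    return kept, orphans


# Hypothetical sample data.
movies = {"Some Film (1999)": 1, "Other Film (2001)": 2}
genres = [("Some Film (1999)", "Drama"), ("Lost Film (1950)", "Comedy")]
kept, orphans = eliminate_orphans(genres, movies)
```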

− Referential check task (for movielinks.txt) is executed twice in order to check that both, movie titles <strong>and</strong><br />

link movie titles, are included in the <strong>IMDb</strong>_Movies table, generating two orphan tables. Implementation<br />

was described in Section 2.3.<br />

− Orphan elimination task (for movielinks.txt) also considers that both, movie titles <strong>and</strong> link movie titles,<br />

are included in the <strong>IMDb</strong>_Movies table, i.e. it joins twice the <strong>IMDb</strong>_Movies table. The task is<br />

implemented as follows:<br />

INSERT INTO movielinks-ref (MovieNum, LinkMovieNum, MovieTitle,<br />

LinkMovieTitle, LinkType)<br />

SELECT X.MovieNum, Y.MovieNum, M.MovieTitle, M.LinkMovieTitle, M.LinkType<br />

FROM movielinks-group M, <strong>IMDb</strong>_Movies X, <strong>IMDb</strong>_Movies Y<br />

WHERE M.MovieTitle = X.MovieTitle<br />

AND M.LinkMovieTitle = Y.MovieTitle;<br />

− Aggregation task (for keywords.txt) stores the list <strong>of</strong> existing keywords <strong>and</strong> the number <strong>of</strong> movies related<br />

to each keyword. It is implemented as follows:<br />

INSERT INTO <strong>IMDb</strong>_Keywords (Keyword, MovieQuantity)<br />

SELECT Keyword, count(*)<br />

FROM <strong>IMDb</strong>_MovieKeywords<br />

GROUP BY Keyword;<br />
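The two join-based tasks above can be tried out on toy data. The following is a minimal sketch in Python with the standard sqlite3 module, not the report's actual implementation. It rests on several assumptions: table names are written with underscores (movielinks_group, movielinks_ref) because hyphenated names would need quoting, IMDb_MovieKeywords is reduced to two columns, and all sample rows are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE IMDb_Movies (MovieNum INTEGER PRIMARY KEY AUTOINCREMENT,
                          MovieTitle TEXT UNIQUE);
CREATE TABLE movielinks_group (MovieTitle TEXT, LinkMovieTitle TEXT, LinkType TEXT);
CREATE TABLE movielinks_ref (MovieNum INTEGER, LinkMovieNum INTEGER,
                             MovieTitle TEXT, LinkMovieTitle TEXT, LinkType TEXT);
CREATE TABLE IMDb_MovieKeywords (MovieNum INTEGER, Keyword TEXT);
CREATE TABLE IMDb_Keywords (Keyword TEXT PRIMARY KEY, MovieQuantity INTEGER);
""")

# Invented sample data: two known movies and two links, one of them an orphan.
con.executemany("INSERT INTO IMDb_Movies (MovieTitle) VALUES (?)",
                [("Star Wars (1977)",), ("Spaceballs (1987)",)])
con.executemany("INSERT INTO movielinks_group VALUES (?, ?, ?)", [
    ("Spaceballs (1987)", "Star Wars (1977)", "references"),
    ("Spaceballs (1987)", "Unknown Movie (1900)", "references"),  # orphan link
])

# Orphan elimination: joining IMDb_Movies twice keeps only links whose two
# titles are both known, and copies both auto-generated identifiers.
con.execute("""
INSERT INTO movielinks_ref (MovieNum, LinkMovieNum, MovieTitle, LinkMovieTitle, LinkType)
SELECT X.MovieNum, Y.MovieNum, M.MovieTitle, M.LinkMovieTitle, M.LinkType
FROM movielinks_group M, IMDb_Movies X, IMDb_Movies Y
WHERE M.MovieTitle = X.MovieTitle
  AND M.LinkMovieTitle = Y.MovieTitle
""")
kept = con.execute(
    "SELECT MovieNum, LinkMovieNum, LinkType FROM movielinks_ref").fetchall()

# Aggregation: one row per keyword with the number of related movies.
con.executemany("INSERT INTO IMDb_MovieKeywords VALUES (?, ?)",
                [(1, "space"), (2, "space"), (2, "parody")])
con.execute("""
INSERT INTO IMDb_Keywords (Keyword, MovieQuantity)
SELECT Keyword, count(*) FROM IMDb_MovieKeywords GROUP BY Keyword
""")
keyword_counts = dict(con.execute("SELECT Keyword, MovieQuantity FROM IMDb_Keywords"))
```

In this toy run only the link between the two known movies survives in movielinks_ref; the link to the unknown title is dropped by the join, which is the intended effect of orphan elimination.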

As a summary <strong>of</strong> ETL tasks, Table 11 shows the number <strong>of</strong> extracted tuples, the number <strong>of</strong> anomalies, duplicates<br />

<strong>and</strong> orphans found, <strong>and</strong> the number <strong>of</strong> exported tuples.<br />

Source file | Extracted tuples | Cleaning anomalies | Duplicates | Orphans | Exported tuples | Resulting table<br />
movies.list | 858.967 | 0 | 6 | - | 858.961 | <strong>IMDb</strong>_Movies<br />
actors.list | 5.014.897 | 0 | 40 | 486.875 | 4.527.982 | <strong>IMDb</strong>_Actors<br />
actresses.list | 2.743.802 | 0 | 35 | 327.935 | 2.415.832 | <strong>IMDb</strong>_Actresses<br />
biographies.list | 312.837 | 54.011 | 8 | - | 258.818 | <strong>IMDb</strong>_Biographies<br />
business.list | 57.364 | 31.847 | 1 | 0 | 25.516 | <strong>IMDb</strong>_Business<br />
color-info.list | 433.117 | 0 | 846 | 44.541 | 387.730 | <strong>IMDb</strong>_Colors<br />
costume-designers.list | 99.192 | 0 | 35 | 7.644 | 91.513 | <strong>IMDb</strong>_CostumeDesigners<br />
countries.list | 530.295 | 0 | 681 | 0 | 529.614 | <strong>IMDb</strong>_Countries<br />
directors.list | 513.344 | 0 | 295 | 34 | 513.015 | <strong>IMDb</strong>_Directors<br />
distributors.list | 403.271 | 0 | 15.023 | 0 | 388.248 | <strong>IMDb</strong>_Distributors<br />
genres.list | 637.976 | 0 | 606 | 57.519 | 579.851 | <strong>IMDb</strong>_Genres<br />
keywords.list | 1.124.778 | 0 | 507 | 330 | 32.704 | <strong>IMDb</strong>_Keywords<br />
keywords.list (same extraction) | | | | | 1.123.941 | <strong>IMDb</strong>_MovieKeywords<br />
language.list | 445.840 | 0 | 1.355 | 0 | 444.485 | <strong>IMDb</strong>_Languages<br />
locations.list | 216.951 | 0 | 45 | 0 | 216.906 | <strong>IMDb</strong>_Locations<br />
movie-links.list | 533.428 | 0 | 0 | 36.572 | 496.856 | <strong>IMDb</strong>_MovieLinks<br />
producers.list | 1.038.128 | 0 | 422.088 | 172.775 | 443.265 | <strong>IMDb</strong>_Producers<br />
production-companies.list | 478.346 | 0 | 831 | 0 | 477.515 | <strong>IMDb</strong>_ProductionCompanies<br />
production-designers.list | 103.347 | 0 | 12 | 7.509 | 95.826 | <strong>IMDb</strong>_ProductionDesigners<br />
ratings.list | 159.957 | 0 | 0 | 66 | 159.891 | <strong>IMDb</strong>_Ratings<br />
release-dates.list | 830.416 | 0 | 2.007 | 53.556 | 774.853 | <strong>IMDb</strong>_ReleaseDates<br />
running-times.list | 327.312 | 0 | 4.405 | 49.901 | 273.006 | <strong>IMDb</strong>_RunningTimes<br />
sound-mix.list | 231.318 | 0 | 496 | 8.512 | 222.310 | <strong>IMDb</strong>_Sounds<br />
writers.list | 775.660 | 0 | 25.260 | 130.812 | 619.588 | <strong>IMDb</strong>_Writers<br />

Table 11 – Statistics <strong>of</strong> data extraction, cleaning <strong>and</strong> transformation<br />
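The figures in Table 11 appear to satisfy a simple balance: exported tuples = extracted tuples minus anomalies, duplicates and orphans. The short Python check below illustrates this on a few rows copied from the table. It assumes that dots are thousands separators and that an empty orphans cell counts as 0; the function name exported is introduced here only for illustration.

```python
# (extracted, anomalies, duplicates, orphans, exported) copied from Table 11.
# Empty orphans cells (movies.list, biographies.list) are taken as 0 (assumption).
rows = {
    "movies.list": (858967, 0, 6, 0, 858961),
    "actors.list": (5014897, 0, 40, 486875, 4527982),
    "biographies.list": (312837, 54011, 8, 0, 258818),
    "business.list": (57364, 31847, 1, 0, 25516),
    "producers.list": (1038128, 0, 422088, 172775, 443265),
}

def exported(extracted, anomalies, duplicates, orphans):
    """Tuples left once every discarded category has been subtracted."""
    return extracted - anomalies - duplicates - orphans

# Names of the rows whose reported export count breaks the balance.
mismatches = [name for name, (e, a, d, o, x) in rows.items()
              if exported(e, a, d, o) != x]
```

For the rows listed here the balance holds, so mismatches stays empty.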




4. <strong>Data</strong> integration<br />

This section describes the construction <strong>of</strong> an integrated database from the extracted data. We start by describing the difficulty <strong>of</strong> matching movie identifiers <strong>and</strong> the matching process. We then describe the integrated schema <strong>and</strong> its construction.<br />

4.1. Matching processes<br />

We extracted 858.961 movies from <strong>IMDb</strong> <strong>and</strong> 3.883 movies from <strong>MovieLens</strong>. After integration, we found that all <strong>MovieLens</strong> movies were included in the <strong>IMDb</strong> data set. However, matching titles was not trivial: only 79% <strong>of</strong> matches had identical movie titles in both data sets. Other matching strategies, including manual look-up, were implemented in order to match the remaining titles. This section presents an overview <strong>of</strong> the matching process; details can be found in [Peralta 2007a].<br />

Both <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> identify movies by ad-hoc identifiers (called movie titles) that concatenate title <strong>and</strong> running year (see Sections 2.1.1 <strong>and</strong> 3.1). Note that we do not consider series episodes, mini-series, TV series, videos or video games (which have different identifiers) because the <strong>MovieLens</strong> data set only includes cinema movies. But even though identifiers are built in the same manner, we found discrepancies in titles. The most typical causes are inclusion <strong>of</strong> special characters, typing errors, transposition or omission <strong>of</strong> articles, use <strong>of</strong> different running years, translation <strong>of</strong> foreign titles <strong>and</strong> use <strong>of</strong> alternative titles. Table 12 illustrates differences in movie titles.<br />

<strong>IMDb</strong> title | <strong>MovieLens</strong> title | Cause<br />
¡Three Amigos! (1986) | Three Amigos! (1986) | Special characters<br />
Shall We Dance (1937) | Shall We Dance? (1937) | Special characters<br />
8 ½ Women (1999) | 8 1/2 Women (1999) | Special characters<br />
One Hundred <strong>and</strong> One Dalmatians (1961) | 101 Dalmatians (1961) | Special characters<br />
Two Moon Junction (1988) | Two Moon Juction (1988) | Typing errors<br />
La Bamba (1987) | Bamba, La (1987) | Transposed articles<br />
El Dorado (1966) | Dorado, El (1967) | Transposed articles<br />
Three Ages (1923) | Three Ages, The (1923) | Omitted articles<br />
Story <strong>of</strong> G.I. Joe (1945) | Story <strong>of</strong> G.I. Joe, The (1945) | Omitted articles<br />
Tarantella (1996) | Tarantella (1995) | Different years<br />
Supernova (2000/I) | Supernova (2000) | Different years<br />
Abre los ojos (1997) | Open Your Eyes (Abre los ojos) (1997) | English titles<br />
Caro diario (1994) | Dear Diary (Caro Diario) (1994) | English titles<br />
Huitième jour, Le (1996) | Eighth Day, The (Le Huitième jour ) (1996) | English titles<br />
Historia <strong>of</strong>icial, La (1985) | Official Story, The (La Historia Oficial) (1985) | English titles<br />
Cité des enfants perdus, La (1995) | City <strong>of</strong> Lost Children, The (1995) | English titles<br />
Star Wars (1977) | Star Wars: Episode IV - A New Hope (1977) | Alternative titles<br />
Santa Claus (1985) | Santa Claus: The Movie (1985) | Alternative titles<br />
Sunset Blvd. (1950) | Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) | Alternative titles<br />
Sugar Hill (1994) | Harlem (1993) | Alternative titles<br />


Table 12 – Examples <strong>of</strong> movies having different titles<br />

In order to match <strong>MovieLens</strong> titles with <strong>IMDb</strong> titles (included in ML_Movies <strong>and</strong> <strong>IMDb</strong>_Movies tables), we<br />

followed several matching strategies. Each strategy tries to match <strong>MovieLens</strong> titles not matched by previous<br />

strategies as shown in Figure 22. To this end, each strategy considers a different heuristic for building c<strong>and</strong>idate<br />

(alternative) titles for unmatched movies <strong>and</strong> compares them with <strong>IMDb</strong> titles. We implemented the following<br />

strategies:<br />

− Join by movie title: This strategy computes the exact match. We joined <strong>IMDb</strong>_Movies <strong>and</strong> ML_Movies<br />

tables by the MovieTitle attribute. We obtained 3086 matches (79%).<br />

− Match using the <strong>MovieLens</strong> small data set: This strategy uses the SML_Movies table (small data set) as a mapping table between unmatched <strong>MovieLens</strong> titles <strong>and</strong> <strong>IMDb</strong> titles. Specifically, the SML_Movies table contains the URL <strong>of</strong> the <strong>IMDb</strong> web page corresponding to each movie. We formatted the URL <strong>and</strong> replaced special characters (e.g. %20 represents a blank), obtaining a c<strong>and</strong>idate title. We joined this table with unmatched <strong>MovieLens</strong> movies (by the MovieTitle attribute) <strong>and</strong> with <strong>IMDb</strong>_Movies (by the newly built c<strong>and</strong>idate title). We had two difficulties: first, the intersection <strong>of</strong> both <strong>MovieLens</strong> collections is not very large; second, we could not replace all special characters. As a result, we obtained 83 new matches, bringing the total to 3169 matches (82%).<br />

Figure 22 – Matching process (diagram: ML_Movies, SML_Movies <strong>and</strong> <strong>IMDb</strong>_Movies feed a cascade <strong>of</strong> matching steps, match1 to match8; matched titles are inserted into I_Movies <strong>and</strong> unmatched titles are passed on to the next step)<br />

− Match extracting foreign title: Some <strong>MovieLens</strong> titles are translated into English but include, between brackets, the original title (see examples in Table 12). We extracted the original titles (text between brackets) <strong>of</strong> unmatched <strong>MovieLens</strong> movies, obtaining c<strong>and</strong>idate titles. Then, we joined them with <strong>IMDb</strong>_Movies. We obtained 92 new matches, bringing the total to 3261 matches (84%).<br />

− Match ignoring running year: Movie titles sometimes embed running years, sometimes embed diffusion (release) years <strong>and</strong> sometimes include additional characters (e.g. ‘1999/I’). This strategy consists in ignoring years when joining movie titles <strong>and</strong> manually verifying the obtained matches. We removed years from the movie titles <strong>of</strong> unmatched <strong>MovieLens</strong> movies <strong>and</strong> <strong>IMDb</strong>_Movies. Then, we joined these auxiliary titles, storing matches in a temporary table. A human validated the matches (e.g. those differing by only one year) <strong>and</strong> eliminated erroneous ones. As a result, we obtained 295 new matches, bringing the total to 3556 matches (92%).<br />

− Matching <strong>of</strong> the first 20 characters: This strategy consists in truncating titles to 20 characters, joining them <strong>and</strong> manually verifying the obtained matches. It targets some kinds <strong>of</strong> alternative titles (e.g. “Friday the 13th Part III (1982)” <strong>and</strong> “Friday the 13th Part 3: 3D (1982)”) as well as special characters or truncations (e.g. “Why Do Fools Fall In Love? (1998)” <strong>and</strong> “Why Do Fools Fall In Love (1998)”). We truncated the titles <strong>of</strong> unmatched <strong>MovieLens</strong> movies <strong>and</strong> <strong>IMDb</strong>_Movies. Then, we joined these auxiliary titles, storing matches in a temporary table. A human validated the matches by comparing whole titles <strong>and</strong> eliminated erroneous ones. As a result, we obtained 34 new matches, bringing the total to 3590 matches (92%).<br />

− Matching <strong>of</strong> the first 10 characters: This strategy repeats the previous one, but truncates titles to 10 characters. We obtained 46 new matches, bringing the total to 3636 matches (94%).<br />

− Manual look-up: This strategy consists in manually examining unmatched <strong>MovieLens</strong> titles, looking for possible matches in the <strong>IMDb</strong>_Movies table. We relied on intuition to find titles in the huge <strong>IMDb</strong> collection (e.g. transposing words for “La Bamba (1987)” <strong>and</strong> “Bamba, La (1987)”) <strong>and</strong> on knowledge <strong>of</strong> foreign languages (e.g. “Cité des enfants perdus, La (1995)” <strong>and</strong> “City <strong>of</strong> Lost Children, The (1995)”). We obtained 116 new matches, bringing the total to 3752 matches (97%).<br />

− Web look-up: The last strategy uses the search engine <strong>of</strong> the <strong>MovieLens</strong> web site to find unmatched movies. We then followed the link provided by the <strong>MovieLens</strong> site to the movie page at <strong>IMDb</strong> <strong>and</strong> copied its title. In this way, we matched the remaining 131 movie titles.<br />
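The automatic part of the strategies above can be sketched as a cascade in which each title is tried against successively looser candidates. The Python code below is an illustrative reading of the text, not the report's implementation (which joins database tables). The helper names strip_year, foreign_title and match_cascade are hypothetical, the regular expressions are assumptions about the title layout, and "Toy Story (1995)" is an invented example of an exact match. Matches found by ignoring the year or by comparing prefixes are only proposed for human validation, as in the report.

```python
import re

def strip_year(title):
    """Remove a trailing year such as '(1999)' or '(1999/I)'."""
    return re.sub(r"\s*\(\d{4}(/[IVX]+)?\)\s*$", "", title)

def foreign_title(title):
    """'Open Your Eyes (Abre los ojos) (1997)' -> 'Abre los ojos (1997)'."""
    m = re.match(r"^.*\((?P<orig>[^()]+)\)\s*(?P<year>\(\d{4}\))\s*$", title)
    return f"{m['orig'].strip()} {m['year']}" if m else None

def match_cascade(ml_titles, imdb_titles, prefix_len=20):
    imdb = set(imdb_titles)
    by_name, by_prefix = {}, {}
    for t in imdb_titles:
        by_name.setdefault(strip_year(t), []).append(t)
        by_prefix.setdefault(t[:prefix_len], []).append(t)
    matched, needs_review, unmatched = {}, {}, []
    for ml in ml_titles:
        if ml in imdb:                          # join by movie title
            matched[ml] = ml
        elif foreign_title(ml) in imdb:         # original title between brackets
            matched[ml] = foreign_title(ml)
        elif strip_year(ml) in by_name:         # same title, year ignored
            needs_review[ml] = by_name[strip_year(ml)]
        elif ml[:prefix_len] in by_prefix:      # same first characters
            needs_review[ml] = by_prefix[ml[:prefix_len]]
        else:                                   # left for manual or web look-up
            unmatched.append(ml)
    return matched, needs_review, unmatched

imdb_titles = ["Toy Story (1995)", "Abre los ojos (1997)", "Tarantella (1996)",
               "Supernova (2000/I)", "Why Do Fools Fall In Love (1998)"]
ml_titles = ["Toy Story (1995)", "Open Your Eyes (Abre los ojos) (1997)",
             "Tarantella (1995)", "Supernova (2000)",
             "Why Do Fools Fall In Love? (1998)", "Harlem (1993)"]
matched, needs_review, unmatched = match_cascade(ml_titles, imdb_titles)
```

On this sample, two titles are matched automatically, three are proposed for review and "Harlem (1993)" stays unmatched, which mirrors why the report ends with manual and web look-up.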

After matching titles, we found that 2 movies had the same corresponding movie at <strong>IMDb</strong>, i.e. we detected 2<br />

additional duplicates. Tuples with original title were kept; those with foreign title were removed.<br />
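Duplicates of this kind show up once the mapping is complete: two MovieLens titles that point to the same IMDb title. A minimal Python sketch of the detection step follows. The function name find_duplicates and the sample mapping are invented for illustration, and the rule for choosing which tuple to keep is left to the reader because it depends on data not shown here.

```python
from collections import defaultdict

def find_duplicates(matches):
    """Group MovieLens titles by matched IMDb title; keep groups larger than 1."""
    by_imdb = defaultdict(list)
    for ml_title, imdb_title in matches.items():
        by_imdb[imdb_title].append(ml_title)
    return {imdb: mls for imdb, mls in by_imdb.items() if len(mls) > 1}

# Invented mapping: MovieLens title -> matched IMDb title.
matches = {
    "Open Your Eyes (Abre los ojos) (1997)": "Abre los ojos (1997)",
    "Abre los ojos (1997)": "Abre los ojos (1997)",
    "Bamba, La (1987)": "La Bamba (1987)",
}
duplicates = find_duplicates(matches)
```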


Figure 23 – Integrated schema (diagram: the tables <strong>of</strong> the integrated schema with their attributes, primary keys <strong>and</strong> foreign keys)<br />


4.2. Integrated schema<br />


The integrated schema consists <strong>of</strong> 52 tables describing movies, the companies <strong>and</strong> persons related to movies, <strong>and</strong> the users that evaluated movies. Figure 23 shows the tables <strong>of</strong> the integrated schema, their primary keys (underlined attributes) <strong>and</strong> foreign keys (arrows between tables). Additional dotted lines relate some tables to a fictitious (not implemented) relation describing persons. Shadow tables were used for the construction <strong>of</strong> the referential but are not visible to queries. Table 13 describes each table, its attributes <strong>and</strong> constraints.<br />
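To make the notions of primary and foreign keys concrete, the sketch below declares three of the tables (I_Movies, I_Genres and I_MovieGenres) in an in-memory SQLite database through Python's standard sqlite3 module. This is only an illustration under stated assumptions: the report does not say which database system was used, the mapping of String(n) to VARCHAR(n) is a guess, and the inserted rows (including MovieId 1) are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
con.executescript("""
CREATE TABLE I_Movies (
    MovieId        NUMERIC(4)   PRIMARY KEY,
    TitleMovieLens VARCHAR(100),
    TitleImdb      VARCHAR(250) UNIQUE
);
CREATE TABLE I_Genres (Genre VARCHAR(12) PRIMARY KEY);
CREATE TABLE I_MovieGenres (
    MovieId NUMERIC(4)  REFERENCES I_Movies (MovieId),
    Genre   VARCHAR(12) REFERENCES I_Genres (Genre),
    PRIMARY KEY (MovieId, Genre)
);
""")

# Invented sample rows.
con.execute("INSERT INTO I_Movies VALUES (1, 'Bamba, La (1987)', 'La Bamba (1987)')")
con.execute("INSERT INTO I_Genres VALUES ('Drama')")
con.execute("INSERT INTO I_MovieGenres VALUES (1, 'Drama')")

# A genre row for a movie that is not in I_Movies violates the foreign key.
fk_rejected = False
try:
    con.execute("INSERT INTO I_MovieGenres VALUES (2, 'Drama')")
except sqlite3.IntegrityError:
    fk_rejected = True

genre_rows = con.execute("SELECT COUNT(*) FROM I_MovieGenres").fetchone()[0]
```

The second insert into I_MovieGenres is rejected because MovieId 2 does not exist in I_Movies, which is the kind of guarantee the foreign keys of the integrated schema are meant to give.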

I_Movies (Join between <strong>IMDb</strong> <strong>and</strong> <strong>MovieLens</strong>)<br />
Attributes:<br />
− MovieId: Numeric(4); <strong>MovieLens</strong>’ id<br />
− Title<strong>MovieLens</strong>: String(100)<br />
− TitleImdb: String(250)<br />
Constraints:<br />
Primary key: MovieId<br />
Unique: TitleImdb<br />

I_Actors (Master <strong>of</strong> actors)<br />
Attributes:<br />
− Actor: String(75)<br />
− MovieQuantity: Numeric(3); the number <strong>of</strong> played movies<br />
Constraints:<br />
Primary key: Actor<br />

I_Actresses (Master <strong>of</strong> actresses)<br />
Attributes:<br />
− Actress: String(75)<br />
− MovieQuantity: Numeric(3); the number <strong>of</strong> played movies<br />
Constraints:<br />
Primary key: Actress<br />

I_Ages (Age intervals)<br />
Attributes:<br />
− AgeId: Numeric(2)<br />
− MinAge: Numeric(2)<br />
− MaxAge: Numeric(2)<br />
Constraints:<br />
Primary key: AgeId<br />

I_Biographies (Biographies <strong>of</strong> actors, actresses, directors, producers <strong>and</strong> other people involved in movies)<br />
Attributes:<br />
− Name: String(70)<br />
− RealName: String(220)<br />
− Birth: String(130); date <strong>and</strong> place <strong>of</strong> birth<br />
− Decease: String(160); date, place <strong>and</strong> cause <strong>of</strong> decease<br />
− Height: String(15)<br />
Constraints:<br />
Primary key: Name<br />

I_Budgets (Master <strong>of</strong> budget intervals)<br />
Attributes:<br />
− BudgetUSD: Numeric(15,2); start <strong>of</strong> an interval (internal use)<br />
Constraints:<br />
Primary key: BudgetUSD<br />

I_Colors (Master <strong>of</strong> colors)<br />
Attributes:<br />
− Color: String(20)<br />
Constraints:<br />
Primary key: Color<br />

I_Countries (Master <strong>of</strong> countries, providing several aggregation criteria <strong>and</strong> international country codes)<br />
Attributes:<br />
− Country: String(40)<br />
− LongName: String(110)<br />
− DomainCode: String(2); internet domain<br />
− ISO2Code: String(2); ISO3166-1-alpha2 code<br />
− ISO3Code: String(3); ISO3166-1-alpha3 code<br />
− UNnumericalCode: Numeric(3); United Nations country code<br />
− IsCurrent: Numeric(1); 1 for current countries, 0 for old ones<br />
− IsSovereign: Numeric(1); 1 for sovereign UN nations, 2 for sovereign non-UN nations, 3 for sovereign non-recognized nations, 4 for dependent territories <strong>and</strong> 5 for areas <strong>of</strong> special sovereignty<br />
− Sovereign: String(40); name <strong>of</strong> sovereign nation (current country in the case <strong>of</strong> old countries)<br />
− Continent: String(20)<br />
− SecondaryContinent: String(20); for countries having territories in two continents (the one with the largest area is taken as the main continent)<br />
− Area: Numeric(8)<br />
− Inhabitants: Numeric(10)<br />
Constraints:<br />
Primary key: Country<br />


I_CountryCodes (Codes <strong>of</strong> countries)<br />
Attributes:<br />
− CountryCode: String(15)<br />
− Country: String(40)<br />
− Description: String(50); further info (regions, deprecations...)<br />
Constraints:<br />
Primary key: CountryCode<br />
Foreign keys:<br />
− Country: ref. I_Countries<br />

I_Currencies (Master <strong>of</strong> currencies)<br />
Attributes:<br />
− Currency: String(3)<br />
− ConversionUSD: Numeric(12,10); at October 2006<br />
Constraints:<br />
Primary key: Currency<br />

I_Directors (Master <strong>of</strong> directors)<br />
Attributes:<br />
− Director: String(40)<br />
− MovieQuantity: Numeric(3); the number <strong>of</strong> directed movies<br />
Constraints:<br />
Primary key: Director<br />

I_Distributors (Master <strong>of</strong> distribution companies)<br />
Attributes:<br />
− DistributionCompany: String(70)<br />
Constraints:<br />
Primary key: DistributionCompany<br />

I_Durations (Master <strong>of</strong> duration intervals)<br />
Attributes:<br />
− DurationMin: Numeric(5)<br />
− DurationMax: Numeric(5)<br />
− DurationInterval: Numeric(5)<br />
Constraints:<br />
Primary key: DurationMin<br />

I_Genres (Master <strong>of</strong> genres)<br />
Attributes:<br />
− Genre: String(12)<br />
Constraints:<br />
Primary key: Genre<br />

I_Keywords (Master <strong>of</strong> keywords)<br />
Attributes:<br />
− Keyword: String(60)<br />
− MovieQuantity: Numeric(5); the number <strong>of</strong> movies having the keyword<br />
Constraints:<br />
Primary key: Keyword<br />

I_Languages (Master <strong>of</strong> languages)<br />
Attributes:<br />
− Language: String(30)<br />
Constraints:<br />
Primary key: Language<br />

I_LinkTypes (Types <strong>of</strong> links between movies)<br />
Attributes:<br />
− LinkType: String(30); references, remakes, etc.<br />
Constraints:<br />
Primary key: LinkType<br />

I_MovieActors (Actors <strong>of</strong> movies)<br />
Attributes:<br />
− MovieId: Numeric(4)<br />
− Actor: String(75)<br />
− PlayInfo: String(110); further info (uncredited, alias, special roles…)<br />
− Role: String(256); role name <strong>and</strong>/or description<br />
− Casting: Numeric(3); casting position<br />
Constraints:<br />
Primary key: MovieId, Actor<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− Actor: ref. I_Actors<br />

I_MovieActresses (Actresses <strong>of</strong> movies)<br />
Attributes:<br />
− MovieId: Numeric(4)<br />
− Actress: String(75)<br />
− PlayInfo: String(110); further info (uncredited, alias, special roles…)<br />
− Role: String(256); role name <strong>and</strong>/or description<br />
− Casting: Numeric(3); casting position<br />
Constraints:<br />
Primary key: MovieId, Actress<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− Actress: ref. I_Actresses<br />

I_MovieBusiness (Budget <strong>and</strong> revenue <strong>of</strong> movies)<br />
Attributes:<br />
− MovieId: Numeric(4)<br />
− BudgetCurrency: String(3)<br />
− Budget: Numeric(15,2)<br />
− BudgetUSD: Numeric(15,2)<br />
− RevenueCurrency: String(3)<br />
− Revenue: Numeric(15,2)<br />
− RevenueUSD: Numeric(15,2)<br />
Constraints:<br />
Primary key: MovieId<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− BudgetCurrency: ref. I_Currencies<br />
− RevenueCurrency: ref. I_Currencies<br />

I_MovieColors (Colors <strong>of</strong> movies)<br />
Attributes:<br />
− MovieId: Numeric(4)<br />
− Color: String(20)<br />
− ColorInfo: String(50); further info (versions)<br />
Constraints:<br />
Primary key: MovieId, Color<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− Color: ref. I_Colors<br />

I_MovieCostumeDesigners (Costume designers <strong>of</strong> movies)<br />
Attributes:<br />
− MovieId: Numeric(4)<br />
− CostumeDesigner: String(50)<br />
− CostumeInfo: String(100); further info (roles, alias, …)<br />
Constraints:<br />
Primary key: MovieId, CostumeDesigner<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />


I_MovieCountries<br />

Countries <strong>of</strong> movies<br />

I_MovieDirectors<br />

Countries <strong>of</strong> movies<br />

I_MovieDistributors<br />

Companies that<br />

distributed movies<br />

I_MovieGenres<br />

Genres <strong>of</strong> movies<br />

I_MovieKeywords<br />

Keywords <strong>of</strong> movies<br />

I_MovieLanguages<br />

Languages <strong>of</strong> movies<br />

I_MovieLinks<br />

Links between movies<br />

I_MovieLocations<br />

Running locations <strong>of</strong><br />

movies<br />

I_MovieProducers<br />

Producers <strong>of</strong> movies<br />

I_MovieProduction-<br />

Companies<br />

Companies that produced<br />

movies<br />

I_MovieProduction-<br />

Designers<br />

Production designers <strong>of</strong><br />

movies<br />

− MovieId: Numeric(4)<br />

− Country: String(40)<br />

− MovieId: Numeric(4)<br />

− Director: String(40)<br />

− DirectorAlias: String(40); director’s<br />

appearing name in the movie<br />

− DirectorInfo: String(70); further info e.g.<br />

uncredited movies, co-direction, direction <strong>of</strong><br />

episodes, segments or videos…<br />

− MovieId: Numeric(4)<br />

− DistributionCompany: String(70)<br />

− CountryCode: String(15)<br />

− DistributorInfo: String(100); further info<br />

(years, countries, media, …)<br />

− MovieId: Numeric(4)<br />

− Genre: String(12)<br />

− MovieId: Numeric(4)<br />

− Keyword: String(60)<br />

− MovieId: Numeric(4)<br />

− Language: String(30)<br />

− LanguageInfo: String(70); further info<br />

(versions)<br />

− MovieId: Numeric(4)<br />

− LinkMovieId: Numeric(4)<br />

− LinkType: String(30); references,<br />

remakes, etc.<br />

− MovieId: Numeric(4)<br />

− Location: String(140)<br />

− Zone: String(30)<br />

− LocationInfo: String(256); further info<br />

(years, countries, media, …)<br />

− MovieId: Numeric(4)<br />

− Producer: String(60)<br />

− ProducerInfo: String(90); further info<br />

(role, awards, …)<br />

− MovieId: Numeric(4)<br />

− CompanyName: String(120)<br />

− CountryCode: String(15)<br />

− CompanyInfo: String(140); further info<br />

(years, places, participations, …)<br />

− MovieId: Numeric(4)<br />

− ProductionDesginer: String(35)<br />

− DesignerInfo: String(140); further info<br />

(co-productions, roles, …)<br />


Primary key: MovieId, Country<br />

Foreign keys:<br />

− MovieId: ref. I_Movies<br />

− Country: ref. I_Countries<br />

Primary key: MovieId, Director<br />

Foreign keys:<br />

− MovieId: ref. I_Movies<br />

− Director: ref. I_Directors<br />

Primary key: MovieId,<br />

DistributionCompany,<br />

CountryCode<br />

Foreign keys:<br />

− MovieId: ref. I_Movies<br />

− DistributionCompany: ref.<br />

I_Distributors<br />

− CountryCode: ref.<br />

I_CountryCodes<br />

Primary key: MovieId, Genre<br />

Foreign keys:<br />

− MovieId: ref. I_Movies<br />

− Genre: ref. I_Genres<br />

Primary key: MovieId, Keyword<br />

Foreign keys:<br />

− MovieId: ref. I_Movies<br />

− Keyword: ref. I_Keywords<br />

Primary key: MovieId, Language<br />

Foreign keys:<br />

− MovieId: ref. I_Movies<br />

− Language: ref. I_Languages<br />

Primary key: MovieId,<br />

LinkMovieId, LinkType<br />

Foreign keys:<br />

− MovieId: ref. I_Movies<br />

− LinkMovieId: ref. I_Movies<br />

− LinkType: ref. I_LinkTypes<br />

Primary key: MovieId, Location<br />

Foreign keys:<br />

− MovieId: ref. I_Movies<br />

− Zone: ref. I_Zones<br />

Primary key: MovieId, Producer<br />

Foreign keys:<br />

− MovieId: ref. I_Movies<br />

− Producer: ref. I_Producers<br />

Primary key: MovieId,<br />

CompanyName, CountryCode<br />

Foreign keys:<br />

− MovieId: ref. I_Movies<br />

− CountryCode: ref.<br />

I_CountryCodes<br />

Primary key: MovieId,<br />

ProductionDesginer<br />

Foreign keys:<br />

− MovieId: ref. I_Movies<br />


I_MovieRatings<br />
Global ratings on movies (not differentiated per user)<br />
− MovieId: Numeric(4)<br />
− Distribution: String(10); unknown meaning<br />
− Votes: Numeric(8,2); quantity of votes (decimals are always zero)<br />
− Rating: Number(4,2), value between 0 and 10<br />
Primary key: MovieId<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />

I_MovieReleaseDates<br />
Dates of different releases<br />
− MovieId: Numeric(4)<br />
− Country: String(40)<br />
− ReleaseDate: String(20)<br />
− ReleaseYear: Numeric(4)<br />
− ReleaseInfo: String(90); further info (release city/festival)<br />
Primary key: MovieId, Country, ReleaseDate<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− ReleaseYear: ref. I_Years<br />
− Country: ref. I_Countries<br />

I_MovieRunningTimes<br />
Running times of movies<br />
− MovieId: Numeric(4)<br />
− Country: String(40)<br />
− Duration: Numeric(5)<br />
− RunningInfo: String(75); further info (version, episodes, media, …)<br />
Primary key: MovieId, Country<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− Country: ref. I_Countries<br />

I_MovieSounds<br />
Sound mix of movies<br />
− MovieId: Numeric(4)<br />
− SoundMix: String(40)<br />
Primary key: MovieId, SoundMix<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− SoundMix: ref. I_Sounds<br />

I_MovieTypes<br />
Types of movies<br />
− MovieId: Numeric(4)<br />
− Type: String(2); {C=cinema, TV=television, V=video, VG=video game, S=series, M=miniseries}<br />
Primary key: MovieId<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− Type: ref. I_Types<br />

I_MovieWriters<br />
Writers of movies<br />
− MovieId: Numeric(4)<br />
− Writer: String(45)<br />
− WriterInfo: String(110); further info (role, awards, …)<br />
Primary key: MovieId, Writer<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− Writer: ref. I_Writers<br />

I_MovieYears<br />
Years of movies<br />
− MovieId: Numeric(4)<br />
− Year: Numeric(4)<br />
− YearInfo: String(30); further info (shot years)<br />
Primary key: MovieId, Year<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− Year: ref. I_Years<br />

I_Occupations<br />
User occupations<br />
− OccupationId: Numeric(2)<br />
− Occupation: String(25)<br />
Primary key: OccupationId<br />

I_Producers<br />
Master of producers<br />
− Producer: String(60)<br />
− MovieQuantity: Numeric(3); the number of produced movies<br />
Primary key: Producer<br />

I_Ratings<br />
Master of ratings<br />
− Rating: Number(4,2), value between 0 and 10 (internal use, for classifying ratings)<br />
Primary key: Rating<br />

I_Revenues<br />
Master of revenue intervals<br />
− RevenueUSD: Numeric(15,2); start of an interval (internal use)<br />
Primary key: RevenueUSD<br />

I_Sounds<br />
Master of sounds<br />
− SoundMix: String(40)<br />
Primary key: SoundMix<br />

I_Types<br />
Types of movies<br />
− Type: String(2); {C=cinema, TV=television, V=video, VG=video game, S=series, M=miniseries}<br />
− TypeDescription: String(10)<br />
Primary key: Type<br />

I_UserRatings<br />
User ratings on movies<br />
− UserId: Numeric(4)<br />
− MovieId: Numeric(4)<br />
− Rating: Number(1), value in {1,2,3,4,5}<br />
− Timestamp: Numeric(9); represents milliseconds from an initial time<br />
Primary key: UserId, MovieId<br />
Foreign keys:<br />
− UserId: ref. I_Users<br />
− MovieId: ref. I_Movies<br />


I_Users<br />
Descriptive information about users<br />
− UserId: Numeric(4)<br />
− Gender: Char(1); value in {M,F}<br />
− Age: Numeric(2)<br />
− Occupation: Numeric(2)<br />
− ZipCode: String(10)<br />
Primary key: UserId<br />
Foreign keys:<br />
− Age: ref. I_Ages<br />
− Occupation: ref. I_Occupations<br />

I_Votes<br />
Master of votes<br />
− Votes: Numeric(8,2); indicates the start of a voting interval (internal use, for classifying votes)<br />
Primary key: Votes<br />

I_Writers<br />
Master of writers<br />
− Writer: String(45)<br />
− MovieQuantity: Numeric(3); the number of written movies<br />
Primary key: Writer<br />

I_Years<br />
Master of years<br />
− Year: Numeric(4)<br />
− Decade: Numeric(4)<br />
Primary key: Year<br />

I_Zones<br />
Master of geographical zones<br />
− Zone: String(30); a country, region, sea, …<br />
− Country: String(40)<br />
Primary key: Zone<br />
Foreign keys:<br />
− Country: ref. I_Countries<br />

Table 13 – Integrated schema<br />

4.3. Construction of the integrated database<br />

Matching MovieLens and IMDb movie titles was the most difficult task in the construction of the integrated database. The remaining tasks basically consist of joining the IMDb feature tables with the I_Movies table (the master of movies resulting from the matching process) in order to keep only the features of movies included in the master (i.e. ensuring referential integrity). This allowed the creation of the appropriate foreign keys. In addition, we created master tables for most movie features. This section describes these tasks.<br />

Figure 24 shows an overview of the process for building the tables of the integrated database. In order to filter out features of movies not included in the master of movies, we use the orphan elimination task described in Sub-section 2.3, which simply joins the IMDb feature table (IMDb_P in Figure 24) with the I_Movies table. Then, some master tables were built using the aggregation task, also described in Sub-section 2.3, which groups tuples according to a feature attribute. The aggregation task was not executed for some feature tables (I_MovieCostumeDesigners, I_MovieProductionDesigners) because they have too few repeated values (for example, each entry of the I_MovieCostumeDesigners table describes a different costume designer). The master tables generated for each table can be deduced from Figure 23, because they coincide with the referential constraints (except those referencing the I_Movies table).<br />

[Figure 24: IMDb_P and I_Movies are the inputs of the orphan elimination task, which produces I_MovieP; the aggregation task then produces I_P]<br />

Figure 24 – Construction of tables of the integrated database<br />
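As an illustration only, the two tasks of Figure 24 can be sketched in Python. The function names eliminate_orphans and aggregate and the sample rows are hypothetical; the tasks themselves are the ones described in Sub-section 2.3, and this sketch does not claim to reproduce their original implementation.

```python
from collections import Counter

def eliminate_orphans(feature_rows, movie_ids):
    """Keep only (MovieId, feature) tuples whose MovieId is in the master of movies."""
    return [row for row in feature_rows if row[0] in movie_ids]

def aggregate(feature_rows):
    """Group tuples by feature value and count the distinct movies for each value."""
    counts = Counter(value for _movie_id, value in set(feature_rows))
    return sorted(counts.items())

# Hypothetical sample data: ids of I_Movies and an IMDb producers table (IMDb_P).
i_movies = {1, 2, 3}
imdb_producers = [(1, "Smith, A."), (2, "Smith, A."), (2, "Lee, B."), (9, "Doe, C.")]

i_movie_producers = eliminate_orphans(imdb_producers, i_movies)  # relationship table (I_MovieP)
i_producers = aggregate(i_movie_producers)  # master table (I_P): (Producer, MovieQuantity)
```

In this sketch the tuple for movie 9 is dropped because that movie is not in the master, and the master of producers keeps one row per producer with the number of movies produced.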

Three special data reconciliation tasks were performed for years, currencies and countries. For years, we took the union of the years appearing in the I_MovieYears and I_MovieReleaseDates tables and then calculated the decade as a derived attribute, using typical date manipulation functions (similar to the year calculation task described in Sub-section 2.3.2). For currencies, we obtained budget and revenue currencies from the I_MovieBusiness table. We obtained the dollar exchange rate (as of October 2006) for these currencies from different Internet sites, e.g. [1] (a better conversion could be obtained by considering exchange rates at the time each movie was running; we leave this for future work). For countries, we had country names in the I_MovieCountries and I_MovieReleaseDates tables, in addition to some zones of the I_MovieLocations table, and we had country codes in the I_MovieDistributors and I_MovieProductionCompanies tables. It was a big mess! Country names were written in different ways (e.g. ‘UK’ and ‘United Kingdom’) and sometimes referenced non-sovereign territories (e.g. French Polynesia) or countries that no longer exist (e.g. ‘URSS’); country codes were also ambiguous.<br />
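The year and currency reconciliation steps can be sketched as follows. This is a minimal sketch under stated assumptions: the decade is taken to be the first year of the decade (e.g. 1990), the function names are hypothetical, and the exchange rates are placeholder values, not the October 2006 rates used in the report.

```python
def build_years(movie_years, release_years):
    """Union of both year sources, with the decade as a derived attribute."""
    years = set(movie_years) | set(release_years)
    # Assumption: a decade is identified by its first year, e.g. 1994 -> 1990.
    return [(year, year - year % 10) for year in sorted(years)]

def to_usd(amount, currency, usd_rates):
    """Convert an amount to US dollars with a fixed rate; None if the rate is unknown."""
    rate = usd_rates.get(currency)
    return None if rate is None else round(amount * rate, 2)

i_years = build_years([1994, 1999], [1999, 2001])
usd_rates = {"USD": 1.0, "GBP": 2.0}  # hypothetical placeholder rates, not real quotes
budget_usd = to_usd(1000000, "GBP", usd_rates)
```

Using a single fixed rate per currency mirrors the simplification stated in the text; rates that vary with the movie's date would require a lookup keyed by currency and year.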


In order to build the master of countries (I_Countries) we downloaded the United Nations official country list [10], which includes the sovereignty status of countries (UN member sovereign states, non-UN member sovereign states, sovereign states without general international recognition, dependent territories and areas of special sovereignty). The sovereign countries of non-sovereign territories were looked up in Wikipedia [11]. We obtained different country codes (ISO 3166-1 alpha-2 code, ISO 3166-1 alpha-3 code, UN numerical code and Internet domain code) and further descriptive data (continent, area and inhabitants) from the Schmidt lists of countries [8] and continents [9]. As a country can have territory in two continents, we defined as its primary continent the one in which the country has the largest area. There were inconsistencies between the UN official country names and those obtained from Wikipedia and the Schmidt lists, but they were resolved manually without major problems.<br />
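The primary continent rule stated above amounts to picking the continent with the largest area. A minimal sketch, with a hypothetical function name and invented areas for an imaginary country:

```python
def primary_continent(area_by_continent):
    """Return the continent that holds the largest part of a country's area."""
    return max(area_by_continent, key=area_by_continent.get)

# Hypothetical areas (in km2) for an imaginary country spanning two continents.
example_country = {"Europe": 25000, "Asia": 750000}
continent = primary_continent(example_country)
```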

We created mapping tables in order to join the master with the I_MovieCountries and I_MovieReleaseDates tables. Country names included in these tables but not in the master were manually disambiguated and either related to one of the existing countries (when the name was an alternative name of that country) or added as a non-current country (when it was an old country name). The I_MovieCountries and I_MovieReleaseDates tables were updated, setting country names according to the mapping tables. Two special country names were created: (i) ‘_unknown’ for Null values, and (ii) ‘_multiple’ for special cases, such as ‘Korea’, ‘Serbia and Montenegro’ or ‘Soviet Union’, that correspond to multiple current countries.<br />
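The update of country names through a mapping table can be sketched as below. The mapping excerpt, the sample rows and the function name normalize_country are hypothetical; in the report the mapping itself was built by manual disambiguation, which this sketch does not automate.

```python
UNKNOWN, MULTIPLE = "_unknown", "_multiple"

# Hypothetical excerpt of a mapping table: raw name -> name in the master of countries.
name_mapping = {
    "UK": "United Kingdom",
    "United Kingdom": "United Kingdom",
    "Korea": MULTIPLE,
    "Soviet Union": MULTIPLE,
    "Serbia and Montenegro": MULTIPLE,
}

def normalize_country(raw_name, mapping):
    """Map a raw country name to its master name; Null values become '_unknown'."""
    if raw_name is None:
        return UNKNOWN
    if raw_name not in mapping:
        # Unmapped names need manual disambiguation, so fail loudly instead of guessing.
        raise KeyError(f"no mapping for country name: {raw_name!r}")
    return mapping[raw_name]

rows = [(1, "UK"), (2, None), (3, "Korea")]
updated = [(movie_id, normalize_country(name, name_mapping)) for movie_id, name in rows]
```

Raising an error for unmapped names is a design choice of this sketch: it surfaces exactly the names that would still have to be examined by hand.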

The master of country codes was created from the master of countries, selecting the attributes country name and domain code. Mapping tables were created analogously in order to join the master with I_MovieDistributors and I_MovieProductionCompanies. We manually disambiguated country codes in an analogous way. We found several cases of overseas territories (e.g. ‘French Guiana’), which were associated with the corresponding sovereign country; multiple countries (e.g. ‘us/ca’, referencing the United States and Canada), which were associated with the ‘_multiple’ special country name; and unknown country codes (e.g. ‘cshh’ and ‘ddde’), which were associated with the ‘_unknown’ special country name.<br />
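The three cases found for country codes can be illustrated with a small resolver. This is only a sketch: the function name resolve_code and the code table are hypothetical, and falling back automatically to ‘_unknown’ is an assumption, since the report resolved these cases manually.

```python
def resolve_code(code, code_to_country):
    """Resolve a raw country code to a country name of the master of country codes."""
    if "/" in code:
        # Codes such as 'us/ca' reference several countries at once.
        return "_multiple"
    # Codes missing from the table (e.g. 'cshh') fall back to the special name.
    return code_to_country.get(code, "_unknown")

# Hypothetical excerpt; 'gf' (French Guiana) is associated with its sovereign country.
code_to_country = {"us": "United States", "ca": "Canada", "fr": "France", "gf": "France"}
resolved = [resolve_code(c, code_to_country) for c in ["us", "gf", "us/ca", "cshh"]]
```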

The master of zones was also created from the I_MovieLocations table, adding an attribute for storing a country name when the zone coincides with a country. The table was joined with the master of countries, and non-referenced zones were manually examined in order to determine whether they correspond to a country (setting the country attribute to the appropriate value) or to other types of zones, e.g. ‘Persian Gulf’ (setting the country attribute to the ‘_multiple’ value).<br />

In summary, the construction of the integrated database basically consisted of intersecting the integrated movie titles with the IMDb tables describing movie features. In addition, we built master tables (sometimes integrating several IMDb tables) and integrated external data sources (country lists and currencies). To recapitulate, Table 14 shows the tables of the integrated database and their number of tuples.<br />

5. Conclusion<br />

In this report we described the procedure followed for extracting and integrating MovieLens and IMDb data. Specifically, we described the source and target schemas and the algorithms that perform data extraction, data cleaning, data transformation and data integration.<br />

The major difficulties were found in:<br />

− IMDb data extraction. We needed to develop ad-hoc pre-processing routines in order to normalize hierarchical and tagged structures and to replace repeated tabs and blanks.<br />

− IMDb data transformation and cleaning. We found many anomalies in the extracted data, ranging from duplicate and orphan tuples to multiple attributes embedded in single columns. We needed to develop several cleaning and transformation tasks.<br />

− Matching of MovieLens and IMDb titles. We implemented 8 different matching strategies, some of them including human interaction. This was the most time-consuming task.<br />

− Aggregation of some master tables, especially the master of countries. We extracted external data in order to resolve conflicting country names and codes and to deal with non-sovereign and non-current countries.<br />

The data extracted from the MovieLens and IMDb sites, as well as the integrated database, are available at the APMD project web site [7].<br />

In this report we did not deal with maintenance issues, but we would like to address some of them as future work. Specifically, as IMDb lists are continually updated, it may be interesting to adapt the extraction, cleaning and integration processes to incorporate new data. The problem seems to be more difficult for updates of MovieLens data (which are not announced for the moment) because manual reconciliation tasks will probably be necessary.<br />

Master tables Tuples Relationship tables Tuples Auxiliary tables Tuples<br />

I_Movies 3881<br />

I_Actors 63207 I_MovieActors 110548 I_Ages 7<br />

I_Actresses 31720 I_MovieActresses 48423 I_Budgets 6<br />

I_Biographies 258818 I_Occupations 21<br />

I_MovieBusiness 1740 I_Ratings 10<br />

I_Colors 2 I_MovieColors 3898 I_Revenues 6<br />

I_MovieCostumeDesigners 3373 I_UserRatings 1000194<br />

I_Countries 256 I_MovieCountries 4993 I_Users 6040<br />

I_CountryCodes 265 I_Votes 8<br />

I_Currencies 95<br />

I_Directors 2170 I_MovieDirectors 4141<br />

I_Distributors 2391 I_MovieDistributors 23060<br />

I_Durations 6<br />

I_Genres 25 I_MovieGenres 9288<br />

I_Keywords 14974 I_MovieKeywords 105128<br />

I_Languages 85 I_MovieLanguages 4796<br />

I_LinkTypes 15 I_MovieLinks 18160<br />

I_MovieLocations 14345<br />

I_Producers 8142 I_MovieProducers 16749<br />

I_MovieProductionCompanies 9517<br />

I_MovieProductionDesigners 3152<br />

I_MovieRatings 3861<br />

I_MovieReleaseDates 46554<br />

I_MovieRunningTimes 4839<br />

I_Sounds 28 I_MovieSounds 4928<br />

I_Types 6 I_MovieTypes 3881<br />

I_Writers 5458 I_MovieWriters 8766<br />

I_Years 90 I_MovieYears 3881<br />

I_Zones 133<br />

Table 14 – Statistics of data integration<br />

6. References<br />

[1] European Commission: “List of countries, territories and currencies”, Web Report, March 2007. URL: http://publications.europa.eu/code/en/en-5000500.htm, last accessed on April 5th, 2007.<br />

[2] GroupLens Research: “movielens: helping you to find the right movies”. Web site, URL: http://movielens.umn.edu, last accessed on July 9th, 2006.<br />

[3] GroupLens Research: “MovieLens Data Sets”. Web site, URL: http://grouplens.org/node/12#attachments, last accessed on October 5th, 2006.<br />

[4] Internet Movie Database, Inc.: “The Internet Movie Database”, Web site, URL: http://www.imdb.com/, last accessed on July 9th, 2007.<br />

[5] Internet Movie Database, Inc.: “Alternate interfaces”, Web Page, URL: http://www.imdb.com/interfaces, last accessed on October 5th, 2006.<br />

[6] Peralta, V.: “Matching of MovieLens and IMDb Movie Titles”. Technical Report, Laboratoire PRiSM, Université de Versailles, Versailles, France, March 2007.<br />

[7] Peralta, V.: “APMD Test Platform”. URL: http://apmd.prism.uvsq.fr/TestPlatform/, last accessed on May 30th, 2007.<br />

[8] Schmidt M.: “List of countries and territories”. Web page, URL: http://schmidt.devlib.org/data/countries.html, last accessed on April 10th, 2007.<br />

[9] Schmidt M.: “Continents”. Web page, URL: http://schmidt.devlib.org/data/continents.html, last accessed on April 10th, 2007.<br />

[10] United Nations Cartographic Section: “List of Territories”. PDF file, 2004. URL: http://www.un.org/Depts/Cartographic/english/geoname.pdf, last accessed on April 10th, 2007.<br />

[11] Wikipedia: “List of countries”. Web page, URL: http://en.wikipedia.org/wiki/List_of_countries, last accessed on April 10th, 2007.<br />
