Extraction and Integration of MovieLens and IMDb Data - APMD
Extraction and Integration of MovieLens and IMDb Data - APMD
Extraction and Integration of MovieLens and IMDb Data - APMD
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
<strong>Extraction</strong> <strong>and</strong> <strong>Integration</strong> <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> <strong>Data</strong><br />
Verónika Peralta<br />
Laboratoire PRiSM, Université de Versailles<br />
45, avenue des Etats-Unis<br />
78035 Versailles Cedex<br />
FRANCE<br />
veronika.peralta@prism.uvsq.fr<br />
Technical Report 1<br />
July 2007<br />
Abstract. This report describes the procedure followed for extracting <strong>and</strong> integrating <strong>MovieLens</strong><br />
<strong>and</strong> <strong>IMDb</strong> data. We firstly focus on data extraction, describing source schemas, target schemas<br />
<strong>and</strong> the algorithms that perform data extraction, data cleaning <strong>and</strong> data transformation. We then<br />
describe data integration, describing the integrated schema, the algorithm that match movie titles<br />
<strong>and</strong> the construction <strong>of</strong> the integrated database. Finally, we present some statistics <strong>of</strong> the extracted<br />
<strong>and</strong> integrated data.<br />
1. Introduction<br />
This report describes the procedure followed for extracting <strong>and</strong> integrating <strong>IMDb</strong> <strong>and</strong> <strong>MovieLens</strong> data.<br />
<strong>IMDb</strong>, the Internet Movie <strong>Data</strong>base, is a huge collection <strong>of</strong> movie information (auto-claimed to be the earth’s<br />
biggest movie database). It started as a hobby project by an international group <strong>of</strong> movie fans <strong>and</strong> currently<br />
belongs to the Amazon.com Company. <strong>IMDb</strong> tries to catalog every pertinent detail about a movie, from who was<br />
in it, to who made it, to trivia about it, to filming locations, <strong>and</strong> even where you can find reviews <strong>and</strong> fan sites on<br />
the web. The <strong>IMDb</strong> web site (http://www.imdb.com/ [4]) provides 49 text files in ad-hoc format (called lists)<br />
containing different characteristics about movies (e.g. actors.list or running-times.list). Lists content is<br />
continually updated <strong>and</strong> enlarged; at October 5 th 2006, the movie list included more than 858.000 movies.<br />
<strong>MovieLens</strong> is a movie recommender project, developed by the Department <strong>of</strong> Computer Science <strong>and</strong><br />
Engineering at the University <strong>of</strong> Minnesota. <strong>MovieLens</strong> is a typical collaborative filtering system that collects<br />
movie preferences from users <strong>and</strong> then groups users with similar tastes. Based on the movie ratings expressed by<br />
all the users in a group it attempts to predict for each individual their opinion on movies they have not yet seen.<br />
Two data sets are available at the <strong>MovieLens</strong> web site (http://movielens.umn.edu/ [2]). The first one consists <strong>of</strong><br />
100,000 ratings for 1682 movies by 943 users. The second one consists <strong>of</strong> approximately 1 million ratings for<br />
3883 movies by 6040 users.<br />
Our goal is to build a relational database about movies, which includes movie descriptions <strong>and</strong> user ratings. To<br />
this end, we extract, transform <strong>and</strong> integrate data provided by <strong>IMDb</strong> <strong>and</strong> <strong>MovieLens</strong> sites. This database will be<br />
used, in future works, to evaluate the performance <strong>and</strong> pertinence <strong>of</strong> different techniques <strong>and</strong> algorithms for<br />
query large data sets taking into account user preferences <strong>and</strong> data quality.<br />
The extraction <strong>and</strong> integration procedure has 4 main steps: (i) extraction <strong>of</strong> <strong>MovieLens</strong> data, (ii) extraction <strong>and</strong><br />
transformation <strong>of</strong> <strong>IMDb</strong> data, (iii) matching <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> movie titles, <strong>and</strong> (iv) construction <strong>of</strong> the<br />
integrated database. Figure 1 shows an overview <strong>of</strong> these steps.<br />
1 This research was partially supported by the French Ministry <strong>of</strong> Research <strong>and</strong> New Technologies under the ACI program<br />
devoted to <strong>Data</strong> Masses (ACI-MD), project #MD-33.
<strong>Extraction</strong> <strong>and</strong> <strong>Integration</strong> <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> <strong>Data</strong> – Technical Report<br />
2<br />
<strong>MovieLens</strong><br />
data<br />
<strong>IMDb</strong> data<br />
<strong>Extraction</strong> <strong>of</strong><br />
<strong>MovieLens</strong> data<br />
<strong>Extraction</strong> <strong>of</strong><br />
<strong>IMDb</strong> data<br />
Matching <strong>of</strong><br />
movie titles<br />
Figure 1 – <strong>Extraction</strong> <strong>and</strong> integration steps<br />
Construction <strong>of</strong><br />
the integrated<br />
database<br />
<strong>Data</strong> extraction <strong>and</strong> integration processes were executed in a Micros<strong>of</strong>t Access database <strong>and</strong> then migrated to an<br />
Oracle database. In the remaining <strong>of</strong> the report we describe each step: Sections 2 <strong>and</strong> 3 present the extraction <strong>of</strong><br />
<strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> data respectively, describing source schemas, target schemas <strong>and</strong> the algorithms that<br />
perform data extraction, data cleaning <strong>and</strong> data transformation. Section 4 presents data integration, describing the<br />
integrated schema, the algorithm that matches movie titles <strong>and</strong> the construction <strong>of</strong> the integrated database.<br />
Matching details are subject <strong>of</strong> a more detailed report [Peralta 2007a]. Finally, Section 7 concludes.<br />
2. <strong>MovieLens</strong> data extraction<br />
<strong>MovieLens</strong> data sets are provided in the form <strong>of</strong> tabular text files (comma separated text). <strong>Data</strong> extraction<br />
consisted in the loading <strong>of</strong> text data to a relational database, normalization <strong>of</strong> data <strong>and</strong> duplicates elimination.<br />
The following sub-sections describe source files, target schemas, extraction processes <strong>and</strong> cleaning processes.<br />
2.1. <strong>MovieLens</strong> source schemas<br />
<strong>MovieLens</strong> data sets were collected by the GroupLens Research Project at the University <strong>of</strong> Minnesota. They<br />
provide two data sets (with 100.000 <strong>and</strong> 1.000.209 evaluations respectively). <strong>Data</strong> was collected through the<br />
<strong>MovieLens</strong> web site [3] <strong>and</strong> was cleaned up, i.e. users who had less than 20 ratings or did not have complete<br />
demographic information were removed from data sets.<br />
We concentrated in the largest one, however, the smallest one is helpful in for the matching <strong>of</strong> movie titles<br />
(because it provides the <strong>IMDb</strong> URL <strong>of</strong> each movie) so it was extracted too. The following sub-sections describe<br />
both schemas.<br />
2.1.1. Source files <strong>of</strong> the large data set<br />
The large data set consists in 3 text files, with tabular format, describing 1000209 anonymous ratings <strong>of</strong> 3883<br />
movies made by 6040 <strong>MovieLens</strong> users who joined <strong>MovieLens</strong> in 2000. In the following, we describe the<br />
contents <strong>of</strong> each text file.<br />
Ratings.dat<br />
This file contains data about 1000209 ratings (1 evaluation, corresponding to 1 user <strong>and</strong> 1 movie, in each line) in<br />
the format: UserID::MovieID::Rating::Timestamp where:<br />
− UserID is an integer, ranging from 1 to 6040, that identifies a user. Each user has rated at least 20 movies.<br />
− MovieID is an integer, ranging from 1 to 3952, that identifies a movie.<br />
− Rating is an integer, ranging from 1 to 5, made on a 5-star scale (whole-star ratings only).<br />
− Timestamp is represented in seconds since 1/1/1970 UTC<br />
Figure 2 shows an extract <strong>of</strong> the file corresponding to 5 ratings.
Movies.dat<br />
1::1193::5::978300760<br />
1::661::3::978302109<br />
1::914::3::978301968<br />
1::3408::4::978300275<br />
1::2355::5::978824291<br />
Figure 2 – Extract <strong>of</strong> the Ratings.dat file<br />
Verónika Peralta<br />
This file contains data about 3883 movies (1 movie in each line) in the format: MovieID::Title::Genres where:<br />
− MovieID is an integer, ranging from 1 to 3952, that identifies a movie.<br />
− Title is a String that concatenates movie title <strong>and</strong> year <strong>of</strong> release (between brackets).<br />
− Genres is a pipe-separated list <strong>of</strong> genres. Provided genres are: Action, Adventure, Animation, Children's,<br />
Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi,<br />
Thriller, War, Western.<br />
Figure 3 shows an extract <strong>of</strong> the file corresponding to 5 movies.<br />
Users.dat<br />
1::Toy Story (1995)::Animation|Children's|Comedy<br />
2::Jumanji (1995)::Adventure|Children's|Fantasy<br />
3::Grumpier Old Men (1995)::Comedy|Romance<br />
4::Waiting to Exhale (1995)::Comedy|Drama<br />
5::Father <strong>of</strong> the Bride Part II (1995)::Comedy<br />
Figure 3 – Extract <strong>of</strong> the Movies.dat file<br />
This file contains data about 6040 users (1 user in each line) in the format:<br />
UserID::Gender::Age::Occupation::Zip-code where:<br />
− UserID is an integer, ranging from 1 to 6040, that identifies a user.<br />
− Gender is denoted by a "M" for male <strong>and</strong> "F" for female<br />
− Age is an integer identifying a range (the minimum age in the range). Provided ranges are: under 18, 18-<br />
24, 25-34, 35-44, 45-49, 50-55, over 56.<br />
− Occupation is an integer, ranging from 0 to 20, indicating the following choices: 0: other or not<br />
specified, 1: academic/educator, 2: artist, 3: clerical/admin, 4: college/grad student, 5: customer service,<br />
6: doctor/health care, 7: executive/managerial, 8: farmer, 9: homemaker, 10: K-12 student, 11: lawyer, 12:<br />
programmer, 13: retired, 14: sales/marketing, 15: scientist, 16: self-employed, 17: technician/engineer,<br />
18: tradesman/craftsman, 19: unemployed, 20: writer<br />
− Zip-code is a five-digits integer indicating user ZIP-code.<br />
All demographic information was provided voluntarily by the users <strong>and</strong> was not checked for accuracy. Only<br />
users who have provided some demographic information are included in this data set. Figure 4 shows an extract<br />
<strong>of</strong> the file corresponding to 5 users.<br />
1::F::1::10::48067<br />
2::M::56::16::70072<br />
3::M::25::15::55117<br />
4::M::45::7::02460<br />
5::M::25::20::55455<br />
Figure 4 – Extract <strong>of</strong> the Users.dat file<br />
3
<strong>Extraction</strong> <strong>and</strong> <strong>Integration</strong> <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> <strong>Data</strong> – Technical Report<br />
2.1.2. Source files <strong>of</strong> the small data set<br />
The small data set consists in 5 text files, with tabular format, describing 100000 anonymous ratings <strong>of</strong> 1682<br />
movies made by 943 users during the seven-month period from September 19 th , 1997 through April 22 nd , 1998.<br />
In the following, we describe the contents <strong>of</strong> each text file.<br />
u.data<br />
This file contains data about 100000 ratings (1 evaluation, corresponding to 1 user <strong>and</strong> 1 movie, in each line). It<br />
is a tab-separated list <strong>of</strong> UserID, MovieID, Rating <strong>and</strong> Timestamp, where:<br />
4<br />
− UserID is an integer, ranging from 1 to 943, that identifies a user. Each user has rated at least 20 movies.<br />
− MovieID is an integer, ranging from 1 to 1682, that identifies a movie.<br />
− Rating is an integer, ranging from 1 to 5, made on a 5-star scale (whole-star ratings only).<br />
− Timestamp is represented in seconds since 1/1/1970 UTC<br />
<strong>Data</strong> is r<strong>and</strong>omly ordered. Figure 5 shows an extract <strong>of</strong> the file corresponding to 5 ratings.<br />
u.item<br />
196 242 3 881250949<br />
186 302 3 891717742<br />
22 377 1 878887116<br />
244 51 2 880606923<br />
166 346 1 886397596<br />
Figure 5 – Extract <strong>of</strong> the u.data file<br />
This file contains data about 1682 movies (1 movie in each line). This is a pipe-separated list <strong>of</strong> MovieID,<br />
MovieTitle, ReleaseDate, VideoReleaseDate, <strong>IMDb</strong>URL, unknown, Action, Adventure, Animation, Children’s,<br />
Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi,<br />
Thriller, War <strong>and</strong> Western, where<br />
− MovieID is an integer, ranging from 1 to 1682, that identifies a movie.<br />
− MovieTitle is a String that concatenates movie title <strong>and</strong> year <strong>of</strong> release (between brackets).<br />
− ReleaseDate is a DD-Mon-YYYY date indicating movie release date<br />
− VideoReleaseDate was destinated to video release date but it is always NULL<br />
− <strong>IMDb</strong>URL indicates the URL <strong>of</strong> the movie in the <strong>IMDb</strong> site.<br />
− The last 19 fields correspond to genres; a 1 indicates the movie is <strong>of</strong> that genre, a 0 indicates it is not;<br />
movies can be in several genres at once.<br />
Figure 6 shows an extract <strong>of</strong> the file corresponding to 5 movies.<br />
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|<br />
0|0|0|0|0|0|0|0|0|0|0<br />
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|<br />
0|0|0|0|0|0|0|1|0|0<br />
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|<br />
0|0|0|0|0|0|0|0|0|1|0|0<br />
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|<br />
0|1|0|0|0|0|0|0|0|0|0|0<br />
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|<br />
0|0|0|0|0|1|0|0<br />
Figure 6 – Extract <strong>of</strong> the u.item file
u.genre<br />
Verónika Peralta<br />
This file contains data about 19 movie genres (1 genre in each line). This is a pipe-separated list <strong>of</strong> GenreID <strong>and</strong><br />
GenreName, where<br />
− GenreID is an integer, ranging from 0 to 18, that identifies a genre.<br />
− GenreName is a String describing the genre. Genre names corresponds to genre columns <strong>of</strong> the u.item<br />
file.<br />
Figure 7 shows an extract <strong>of</strong> the file corresponding to 5 genres.<br />
u.user<br />
unknown|0<br />
Action|1<br />
Adventure|2<br />
Animation|3<br />
Children's|4<br />
Figure 7 – Extract <strong>of</strong> the u.genre file<br />
This file contains data about 943 users (1 user in each line). This is a pipe-separated list <strong>of</strong> UserID, Age, Gender,<br />
Occupation <strong>and</strong> Zip-code, where:<br />
− UserID is an integer, ranging from 1 to 943, that identifies a user<br />
− Age is an integer indicating user’s age<br />
− Gender is denoted by a "M" for male <strong>and</strong> "F" for female<br />
− Occupation is an String indicating user occupation.<br />
− Zip-code is a five-digits integer indicating user ZIP-code.<br />
All demographic information was provided voluntarily by the users <strong>and</strong> was not checked for accuracy. Only<br />
users who have provided some demographic information are included in this data set. Figure 8 shows an extract<br />
<strong>of</strong> the file corresponding to 5 users.<br />
u.occupation<br />
1|24|M|technician|85711<br />
2|53|F|other|94043<br />
3|23|M|writer|32067<br />
4|24|M|technician|43537<br />
5|33|F|other|15213<br />
Figure 8 – Extract <strong>of</strong> the u.user file<br />
This file lists 21 user occupations (1 occupation in each line). Figure 9 shows an extract <strong>of</strong> the file corresponding<br />
to 5 occupations.<br />
2.2. <strong>MovieLens</strong> target schemas<br />
administrator<br />
artist<br />
doctor<br />
educator<br />
engineer<br />
Figure 9 – Extract <strong>of</strong> the u.occupations file<br />
Both <strong>MovieLens</strong> data set were extracted to a Micros<strong>of</strong>t Access® database. The following sub-sections describe<br />
the target schemas for both data sets.<br />
5
<strong>Extraction</strong> <strong>and</strong> <strong>Integration</strong> <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> <strong>Data</strong> – Technical Report<br />
2.2.1. Target tables <strong>of</strong> the large data set<br />
The target schema for the large data set consists in 6 tables describing movies, users <strong>and</strong> ratings. Figure 10<br />
shows the tables <strong>of</strong> the target schema, their primary keys (attributes in bold) <strong>and</strong> foreign keys (lines between<br />
tables). Table 1 describes each target table, its attributes <strong>and</strong> constraints.<br />
6<br />
Figure 10 – <strong>MovieLens</strong> schema (large data set)<br />
Table Attributes Constraints<br />
ML_Users<br />
Demographic information<br />
about users<br />
ML_Movies<br />
Titles <strong>of</strong> movies<br />
ML_Ratings<br />
User ratings on movies<br />
ML_MovieGenres<br />
Genres <strong>of</strong> movies<br />
ML_Ages<br />
Age intervals<br />
ML_Occupations<br />
Occupations<br />
− UserId: Numeric(4)<br />
− Gender: Char(1); value in {M,F}<br />
− AgeId: Numeric(2)<br />
− OccupationId: Numeric(2)<br />
− ZipCode: String(10)<br />
− MovieId: Numeric(4)<br />
− MovieTitle: String(100)<br />
− UserId: Numeric(4)<br />
− MovieId: Numeric(4)<br />
− Rating: Number(1), value in {1,2,3,4,5}<br />
− Timestamp: Numeric(9); represents<br />
milliseconds from an initial time<br />
− MovieId: Numeric(4)<br />
− Genre: String(12)<br />
− AgeId: Numeric(2)<br />
− MinAge: Numeric(2)<br />
− MaxAge: Numeric(2)<br />
− OccupationId: Numeric(2)<br />
− Occupation: String (25)<br />
2.2.2. Target tables <strong>of</strong> the small data set<br />
Table 1 – <strong>MovieLens</strong> schema (large data set)<br />
Primary key: UserId<br />
Foreign keys:<br />
− Age: ref. ML_Ages<br />
− Occupation: ref. ML_Occupation<br />
Primary key: MovieId<br />
Unique: MovieTitle<br />
Primary key: UserId, MovieId<br />
Foreign keys:<br />
− UserId: ref. ML_Users<br />
− MovieId: ref. ML_Movies<br />
Primary key: MovieId, Genre<br />
Foreign keys:<br />
− MovieId: ref. SML_Movies<br />
Primary key: AgeId<br />
Primary key: OccupationId<br />
The target schema for the small data set consists in 5 tables describing movies, movie genres, users <strong>and</strong> ratings.<br />
Figure 11 shows the tables <strong>of</strong> the target schema, their primary keys (attributes in bold) <strong>and</strong> foreign keys (lines<br />
between tables). Table 2 describes each target table, its attributes <strong>and</strong> constraints.<br />
Figure 11 – <strong>MovieLens</strong> schema (small data set)
Table Attributes Constraints<br />
SML_Users<br />
Demographic information<br />
about users<br />
SML_Movies<br />
Information about movies<br />
SML_Ratings<br />
User ratings on movies<br />
SML_Genres<br />
Description <strong>of</strong> genres<br />
SML_MovieGenres<br />
Genres <strong>of</strong> movies<br />
− UserId: Numeric(4)<br />
− Age: Numeric(2)<br />
− Gender: Char(1); value in {M,F}<br />
− Occupation: String(15); finite set<br />
− ZipCode: String(5)<br />
− MovieId: Numeric(4)<br />
− MovieTitle: String(100)<br />
− ReleaseYear: Numeric(4)<br />
− Imdb: String(150); represents the URL <strong>of</strong> an<br />
<strong>IMDb</strong> page<br />
− UserId: Numeric(4)<br />
− MovieId: Numeric(4)<br />
− Rating: Number(1), value in {1,2,3,4,5}<br />
− Timestamp: Numeric(9); represents<br />
milliseconds from an initial time<br />
− Genre: String(12)<br />
− GenreId: Numeric(2)<br />
− MovieId: Numeric(4)<br />
− GenreId: Numeric(2)<br />
Table 2 – <strong>MovieLens</strong> schema (small data set)<br />
2.3. <strong>MovieLens</strong> extraction, cleaning <strong>and</strong> transformation processes<br />
Primary key: UserId<br />
Primary key: MovieId<br />
Unique: MovieTitle<br />
Verónika Peralta<br />
Primary key: UserId, MovieId<br />
Foreign keys:<br />
− UserId: ref. SML_Users<br />
− MovieId: ref. SML_Movies<br />
Primary key: GenreId<br />
Primary key: MovieId, GenreId<br />
Foreign keys:<br />
− MovieId: ref. SML_Movies<br />
− GenreId: ref. SML_Genres<br />
Both data sets were quite consistent; however, we detected <strong>and</strong> solved some kinds <strong>of</strong> anomalies:<br />
− Multi-valued attributes: Some source attributes are provided as comma-separated (or pipe-separated) lists<br />
<strong>of</strong> values. As a solution, we normalise tables.<br />
− Null or dummy values in m<strong>and</strong>atory attributes: Some m<strong>and</strong>atory attributes (especially key attributes)<br />
contains Null or dummy values (e.g. MovieTitle = 'Unknown'). As a solution, such tuples are eliminated.<br />
− Duplicates: Some tuples (or tuple keys) are duplicated. As a solution, we keep a unique tuple, reconciling<br />
values <strong>of</strong> further attributes if necessary.<br />
− Orphans: Some tuples do not satisfy referential constraints, i.e. their values are not included in a foreign<br />
table. As a solution, we eliminate orphan tuples.<br />
In order to solve such anomalies we implemented an ETL process for extracting, cleaning <strong>and</strong> transforming data<br />
<strong>of</strong> each source file. ETL workflows consist <strong>of</strong> five major tasks as illustrated in Figure 12 (rectangles represent<br />
P.txt<br />
extraction<br />
P-source<br />
cleaning<br />
P-cleaning<br />
unduplication<br />
P-group<br />
referenciation<br />
P-duplicates P-orphans<br />
Figure 12 – High-level ETL process<br />
referential<br />
P-ref<br />
exportation<br />
P<br />
P’<br />
7
<strong>Extraction</strong> <strong>and</strong> <strong>Integration</strong> <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> <strong>Data</strong> – Technical Report<br />
tasks; continuous arrows represent control flow; dotted arrows represent data flow, indicating the tables that are<br />
input or output <strong>of</strong> tasks. ETL inputs are a text file containing the data to extract (P.txt in Figure 12) <strong>and</strong><br />
(possibly) some referential tables (referential in Figure 12). ETL output is a relational table containing the<br />
extracted data (P in Figure 12) <strong>and</strong> (possibly) some aggregation tables (P’ in Figure 12). Two additional tables<br />
keep trace <strong>of</strong> the anomalies found (P-duplicates <strong>and</strong> P-orphans in Figure 12). We stored the result <strong>of</strong> each task in<br />
a temporal table (white tables in Figure 12) in order to ease traceability <strong>and</strong> maintainability. For certain specific<br />
files, ETL workflows can be simpler car not all the tasks are necessary for loading <strong>and</strong> cleaning all tables. In the<br />
following, we describe the semantics <strong>of</strong> each task (details are illustrated in Figure 13).<br />
Bulk copy tasks load data from text files to relational tables. We used Micros<strong>of</strong>t Access® loading interface,<br />
indicating file format <strong>and</strong> column format.<br />
Cleaning tasks perform different data cleaning transformations, for example, elimination <strong>of</strong> Null values,<br />
denormalisation, calculation <strong>of</strong> derived attributes <strong>and</strong> so on. Specific cleaning tasks are described later for each<br />
table.<br />
Unduplication tasks eliminate duplicate data. Several sub-tasks are performed, distinguishing normal flow<br />
(when there are no duplicates) <strong>and</strong> error flow (when there are duplicates):<br />
8<br />
− Duplicate check tasks verify the existence <strong>of</strong> duplicate values <strong>of</strong> key (or unique) attributes. Note that key<br />
(or unique) constraints are not defined yet; unduplication is needed in order to define the appropriate<br />
constraints. Duplicate tuples are inserted into temporal tables (P-duplicates in Figure 13), registering the<br />
key (or unique) value, the number <strong>of</strong> times the key (or unique) value is repeated, <strong>and</strong> the conflicting<br />
values <strong>of</strong> other attributes (minimum <strong>and</strong> maximum values). Tasks are implemented as SQL operations <strong>of</strong><br />
the form:<br />
INSERT INTO P-duplicates (key_attribute1,... key_attributeM, quantity,<br />
min_column1, max_column1,... min_columnN, max_columnN)<br />
SELECT key_attribute1,... key_attributeM, COUNT(*), MIN(column1),<br />
MAX(column1),... MIN(columnN), MAX(columnN)<br />
FROM P-cleaning<br />
GROUP BY key_attribute1,... key_attributeM<br />
HAVING COUNT(*)>1;<br />
− Reconciliation tasks are executed only when some duplicates where found (in the corresponding duplicate<br />
check task). Inconsistencies are generally solved by arbitrarily selecting one <strong>of</strong> the values <strong>of</strong> conflictive<br />
attributes (e.g. the minimum one) or filling with Null values. Additional columns are added to the<br />
extraction<br />
P.txt<br />
cleaning<br />
referentiation<br />
referential<br />
bulk copy<br />
cleaning<br />
referential<br />
check<br />
P-group<br />
P-source<br />
P-source P-cleaning<br />
error<br />
ok<br />
P-orphans<br />
copy<br />
unduplication<br />
duplicate<br />
check<br />
P-cleaning<br />
error<br />
ok<br />
orphan<br />
elimination<br />
P-ref<br />
P-duplicates<br />
reconciliation<br />
copy<br />
Figure 13 – Details <strong>of</strong> ETL tasks<br />
exportation<br />
validation<br />
P-ref<br />
duplicate<br />
cleaning<br />
copy<br />
aggregation<br />
grouping<br />
P-group<br />
P<br />
P’
Verónika Peralta<br />
temporal table storing duplicates (P-duplicates in Figure 13) in order to store reconciled values. Tasks are<br />
implemented as a sequence <strong>of</strong> SQL operations, each one solving one inconsistency. The following<br />
operations are examples <strong>of</strong> reconciliation operations:<br />
UPDATE P-duplicates<br />
SET column1 = min_column1;<br />
UPDATE P-duplicates<br />
SET column2 = NULL<br />
WHERE min_column2 max_column2;<br />
− Duplicate cleaning tasks set a unique value for each non-key attribute (those indicated during<br />
reconciliation task) in duplicated tuples. Tasks are implemented as SQL operations <strong>of</strong> the form:<br />
UPDATE P-cleaning<br />
SET column1 = P-duplicates.column1,...<br />
columnN = P-duplicates.columnN<br />
WHERE P-cleaning.key_attribute1 = P-duplicates.key_attribute1 ...<br />
AND P-cleaning.key_attributeM = P-duplicates.key_attributeM<br />
− Grouping tasks keep a unique tuple for each key value. Note that there are conflicting values anymore, so<br />
grouping by all attributes is analogous to grouping by key attributes. Tasks are implemented as SQL<br />
operations <strong>of</strong> the form:<br />
INSERT INTO grouped_table (key_attribute1,... key_attributeM, column1,...<br />
columnN)<br />
SELECT key_attribute1,... key_attributeM, column1,... columnN<br />
FROM cleaning_table<br />
GROUP BY key_attribute1,... key_attributeM, column1,... column;<br />
− Copy tasks duplicate tables in order to provide traceability in the case no duplicates were found. Tasks are<br />
implemented as SQL operations <strong>of</strong> the form:<br />
INSERT INTO P-group (<br />
SELECT *<br />
FROM P-cleaning;<br />
Referentiation tasks eliminate data not satisfying referential constraints. Several sub-tasks are performed,<br />
distinguishing normal <strong>and</strong> error flow:<br />
− Referential check tasks verify the satisfaction <strong>of</strong> referential constraints. Orphan tuples (which do not<br />
match tuples <strong>of</strong> the referenced table) are stored in a temporal table (P-orphans in Figure 13). Tasks are<br />
implemented as SQL operations <strong>of</strong> the form:<br />
INSERT INTO P-orphans (<br />
SELECT *<br />
FROM P-group<br />
WHERE NOT EXISTS (<br />
SELECCT * FROM referential<br />
WHERE reference_attribute1 = foreign_attribute1 AND...<br />
reference_attributeN = foreign_attributeN));<br />
− Orphan elimination tasks deletes tuples not satisfying referential constraints (those found by the<br />
corresponding referential check task). In other words, tasks keep only tuples that join referential tables.<br />
Tasks are implemented as SQL operations <strong>of</strong> the form:<br />
INSERT INTO P-ref (<br />
SELECT P-group.*<br />
FROM P-group, referential<br />
WHERE reference_attribute1 = foreign_attribute1 ...<br />
AND reference_attributeN = foreign_attributeN));<br />
− Copy tasks are analogous to those described for unduplication.<br />
Exportation tasks create result tables. Several sub-tasks are performed:<br />
− Validate tasks perform manual user validations (e.g. control <strong>of</strong> some sampled tuples) or modifications<br />
(e.g. the insertion <strong>of</strong> new tuples). Specific validation tasks are described later for each table.<br />
− Copy tasks are analogous to those described for unduplication.<br />
9
<strong>Extraction</strong> <strong>and</strong> <strong>Integration</strong> <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> <strong>Data</strong> – Technical Report<br />
10<br />
− Aggregate tasks build aggregations (P’ in Figure 13) for storing existing values <strong>of</strong> a specific attribute, i.e.<br />
master tables (e.g. movie genres). Tasks are implemented as SQL operations <strong>of</strong> the form:<br />
INSERT INTO P’ (<br />
SELECT column1,... columnK<br />
FROM P-ref<br />
GROUP BY column1,... columnK;<br />
The following sub-sections describe the specific tasks <strong>of</strong> each data set.<br />
2.3.1. ETL process for the large data set<br />
The large data set was quite consistent: no duplicates were found, no violations to referential constraints were<br />
detected. However, we detected <strong>and</strong> solved some minor anomalies:<br />
− Genres were represented as multi-valued attributes (pipe-separated lists) in source files. Normalization<br />
routines were performed.<br />
− ZIP codes contained letters <strong>and</strong> ZIP intervals. We ignored such errors <strong>and</strong> stored values as Strings.<br />
− Master tables for ages <strong>and</strong> occupations were created from scratch in order to associate descriptions to age<br />
ranges <strong>and</strong> occupation ids.<br />
Figure 14 shows the extraction workflow. Note that not all the tasks presented in previous section were<br />
necessary for extracting data from all text files. For ease <strong>of</strong> underst<strong>and</strong>ing, we omitted representing some data<br />
flows <strong>and</strong> temporary tables; copy tasks were also omitted.<br />
movies.dat<br />
ratings.dat<br />
users.dat<br />
bulk copy<br />
bulk copy<br />
bulk copy<br />
duplicate<br />
chech<br />
duplicate<br />
check<br />
duplicate<br />
check<br />
referential<br />
check<br />
genre<br />
denormalisation<br />
Figure 14 – ETL workflow for the large data set<br />
referential<br />
check<br />
manual<br />
insert<br />
manual<br />
insert<br />
ML_MovieGenres<br />
ML_Movies<br />
ML_Ratings<br />
ML_Users<br />
ML_Ages<br />
ML_Occupations<br />
The semantics <strong>of</strong> the bulk copy, duplicate check <strong>and</strong> reference check tasks was described previously; the<br />
semantics <strong>of</strong> specific validation <strong>and</strong> aggregation tasks is the following:<br />
− Genre denormalisation task builds a normalized table expressing the relationship among movies <strong>and</strong><br />
genres. Each tuple, which has a list <strong>of</strong> genres, is decomposed in several tuples, one per genre. For<br />
example, from the tuple (see Figure 3) we create<br />
3 tuples , <strong>and</strong> . The task is implemented as a sequence <strong>of</strong><br />
SQL operations that split the genre list according to the token ‘|’.
Verónika Peralta<br />
− Manual insert tasks create new tables by executing a list <strong>of</strong> manually-entered insert operations, for<br />
example:<br />
INSERT INTO ML_Ages (AgeId, MinAge, MaxAge)<br />
VALUES (18, 18, 24);<br />
2.3.2. ETL process for the small data set<br />
We detected <strong>and</strong> solved several anomalies in the small data set:<br />
− We found 18 duplicate titles (with different MovieId). These tuples would cause duplicates when<br />
matching (by title) <strong>IMDb</strong> movies. We kept, arbitrarily, the smaller MovieId <strong>of</strong> each duplicate title.<br />
− There was a movie with title='Unknown' <strong>and</strong> null values for ReleaseYear <strong>and</strong> <strong>IMDb</strong>URL. The movie was<br />
eliminated.<br />
− Ratings <strong>of</strong> eliminated movies were also eliminated. Substituting their MovieId for those <strong>of</strong> the kept<br />
movies would carry to duplicates with contradictory ratings.<br />
− Genres were denormalized in source files (one Boolean attribute for each genre). Normalization routines<br />
were performed.<br />
Figure 15 shows the extraction workflow.<br />
u.genre<br />
u.item<br />
u.data<br />
u.user<br />
u.occupation<br />
bulk copy<br />
bulk copy<br />
bulk copy<br />
bulk copy<br />
bulk copy<br />
null-values<br />
elimination<br />
duplicate<br />
check<br />
duplicate<br />
check<br />
duplicate<br />
check<br />
year<br />
calculation<br />
movie id<br />
reconciliation<br />
referential<br />
check<br />
referential<br />
check<br />
duplicate<br />
cleaning<br />
referential<br />
check<br />
Figure 15 – ETL workflow for the small data set<br />
genre<br />
denormalization<br />
grouping<br />
orphan<br />
elimination<br />
SML_Genres<br />
SML_MovieGenres<br />
SML_Movies<br />
SML_Ratings<br />
SML_Users<br />
The semantics <strong>of</strong> the bulk copy, duplicate check, duplicate cleaning, grouping, reference check <strong>and</strong> orphan<br />
elimination tasks was described previously; the semantics <strong>of</strong> specific cleaning, reconciliation <strong>and</strong> aggregation<br />
tasks is the following:<br />
− Null-values elimination task deletes tuples having null or dummy values. Specifically, we deleted tuples<br />
having ‘Unknown’ as title. Task is implemented as follows:<br />
INSERT INTO movies-cleaning (MovieId, MovieTitle, ReleaseDate, <strong>IMDb</strong>)<br />
SELECT * FROM movies-source<br />
WHERE MovieTitle ‘Unknown’;<br />
11
<strong>Extraction</strong> <strong>and</strong> <strong>Integration</strong> <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> <strong>Data</strong> – Technical Report<br />
12<br />
− Year calculation task calculates movie release year from movie release date. A typical truncate function is<br />
used. The task is implemented as follows:<br />
UPDATE movies-cleaning<br />
SET ReleaseYear = Mid(ReleaseDate,8,4);<br />
− MovieId reconciliation task keeps the smaller movie id when there are duplicated movid titles (supposed<br />
to be unique). The task is implemented as follows:<br />
UPDATE movie-duplicates<br />
SET MovieId = MinMovieId;<br />
− Genre denormalisation task builds a normalized table expressing the relationship among movies <strong>and</strong><br />
genres. Each tuple, which has several Boolean values each one corresponding to a genre, is decomposed<br />
in several tuples, one for each True value. For example, from the tuple (see Figure 6) we create 3 tuples , <strong>and</strong> <br />
indicating that genres 3, 4 <strong>and</strong> 5 correspond to the movie. The task is implemented as a sequence <strong>of</strong> SQL<br />
operations, one for each genre; for example, the query for Adventure movies (genre 2) is:<br />
INSERT INTO SML_MovieGenres (MovieId, GenreId)<br />
SELECT MovieId, 2<br />
FROM Movies-ref<br />
WHERE Adventure=1;<br />
2.3.3. Summary <strong>and</strong> statistics<br />
As summary <strong>of</strong> ETL tasks, the following tables show the number <strong>of</strong> extracted tuples, the number <strong>of</strong> anomalies,<br />
duplicates <strong>and</strong> orphans found, <strong>and</strong> the number <strong>of</strong> exported tuples for both data sets.<br />
Source file<br />
Extracted<br />
Movies.dat<br />
tuples<br />
3883<br />
Cleaning<br />
anomalies<br />
0<br />
Duplicates<br />
0<br />
Orphans<br />
0<br />
Exported<br />
Resulting table<br />
tuples<br />
3883 ML_Movies<br />
6407 ML_MovieGenres<br />
Ratings.dat 1000209 0 0 0 1000209 ML_Ratings<br />
6040 ML_Users<br />
Users.dat 6040 0 0 0<br />
7 ML_Ages<br />
21 ML_Occupations<br />
Table 3 – Statistics <strong>of</strong> data extraction, cleaning <strong>and</strong> transformation <strong>of</strong> the large data set<br />
Source file<br />
Extracted<br />
u.genre<br />
tuples<br />
19<br />
Cleaning<br />
anomalies<br />
0<br />
Duplicates<br />
0<br />
Orphans<br />
0<br />
Exported<br />
Resulting table<br />
tuples<br />
19 SML_Genres<br />
u.item 1682 1 18 0<br />
1663 SML_Movies<br />
2906 SML_MovieGenres<br />
u.data 100000 0 0 617 99383 SML_Ratings<br />
u.user 943 0 0 0 943 SML_Users<br />
u.occupation 21<br />
3. <strong>IMDb</strong> data extraction<br />
Table 4 – Statistics <strong>of</strong> data extraction, cleaning <strong>and</strong> transformation <strong>of</strong> the small data set<br />
<strong>IMDb</strong> data set consists in ad-hoc text files, having different formats, including tabular lists, tagged text <strong>and</strong><br />
hierarchical-organized text. <strong>Data</strong> extraction consisted in the loading <strong>of</strong> text data to a relational database,<br />
normalization <strong>of</strong> data <strong>and</strong> duplicates elimination. The following sub-sections describe source files, target<br />
schemas, extraction processes <strong>and</strong> cleaning processes.<br />
3.1. <strong>IMDb</strong> source schemas<br />
<strong>IMDb</strong> data set [5] consists <strong>of</strong> 49 lists that catalog different details about movies. At October 5 th 2006, list sizes<br />
varied from 25000 to 5000000 tuples. Each list has a list manager, who is responsible for updating, maintaining<br />
<strong>and</strong> publishing it (http://www.imdb.com/helpdesk/contact). List managers rely on users <strong>of</strong> <strong>IMDb</strong> to submit
Verónika Peralta<br />
corrections <strong>and</strong> additions to keep lists accurate <strong>and</strong> complete as possible. List updates are mostly submitted via<br />
the movie mail-server's central collection service, following specific submission protocols<br />
(http://www.imdb.com/mailing_lists).<br />
We extracted data from 23 lists, representing the most relevant tabular features about movies. We discarded lists<br />
providing textual data (10 lists), such as movie plots or trivia, because it was useless for our database-oriented<br />
goals. The remaining 16 lists will be possibly extracted in near future.<br />
We treated four file formats:<br />
Fixed-length columned files present data in several columns, each one starting at a fixed position. They are very<br />
simple to treat with a database loading interface like the one used for extracting <strong>MovieLens</strong> data sets. Figure 16<br />
shows an extract <strong>of</strong> the ratings.list file. It is the unique file having this format:<br />
− ratings.list has 4 columns (rating distribution, number <strong>of</strong> votes, average rating, movie title).<br />
0000000015 178183 9.1 Godfather, The (1972)<br />
0000000115 214539 9.1 Shawshank Redemption, The (1994)<br />
0000000124 101253 9.0 Godfather: Part II, The (1974)<br />
0000000015 161827 8.9 Lord <strong>of</strong> the Rings: The Return <strong>of</strong> the King, The (2003)<br />
0000000124 87270 8.8 Casablanca (1942)<br />
0000000124 129696 8.8 Schindler's List (1993)<br />
Figure 16 – Extract <strong>of</strong> the ratings.list file<br />
Tab-separated columned files also present data in several columns, but columns are separated by tabulations.<br />
The number <strong>of</strong> tabulations between two consecutive columns is variable (responding to visual presentation)<br />
which unables the direct use <strong>of</strong> database loading interfaces. Files need to be pre-processed in order to substitute<br />
several tabulations by a unique one. In addition, as some values can be omitted (null values), in some cases, a<br />
double tabulation is expected. In particular, most lists contain optional comments. As anomalies, we found<br />
occurrences <strong>of</strong> double spaces instead <strong>of</strong> a tabulation. Figure 17 shows an extract <strong>of</strong> the genres.list file, which<br />
presents movie titles <strong>and</strong> their genres separated by several tabulations. The files having this format are:<br />
− color-info.list has 3 columns (movie title, color, comments).<br />
− countries.list has 2 columns (movie title, country).<br />
− genres.list has 2 columns (movie title, genre).<br />
− keywords.list has 2 parts. The former contains a list <strong>of</strong> keywords (with the number <strong>of</strong> films including such<br />
keywords between brackets), organized in several tab-separated columns. The latter contains two columns<br />
(movie title, keyword).<br />
− language.list has 3 columns (movie title, language, comments).<br />
− locations.list has 2 columns (movie title, location, comments), where location is a comma-separated list<br />
<strong>of</strong> geographical names, for example: “Sevilla, Andalucía, Spain”.<br />
− movies.list has 3 columns (movie title, release year, comments).<br />
− production-companies.list has 3 columns (movie title, production company, comments), where<br />
production company may contain the code <strong>of</strong> one or several countries (between square brackets), for<br />
example: “Taxi Films [uy]”<br />
− release-dates.list has 3 columns (movie title, release date, comments), where release date embeds release<br />
country.<br />
− running-times.list has 2 columns (movie title, running time, comments), where running time may embed<br />
the running country.<br />
− sound-mix.list has 2 columns (movie title, sound mix).<br />
Walking with the Dead (1998) Short<br />
Walking with Walken (2001) Short<br />
Walkin' on Sunshine: The Movie (1997) Comedy<br />
Walkin' on Sunshine: The Movie (1997) Sci-Fi<br />
Walk in the Clouds, A (1995) Drama<br />
Walk in the Clouds, A (1995) Romance<br />
Figure 17 – Extract <strong>of</strong> the genres.list file<br />
13
<strong>Extraction</strong> <strong>and</strong> <strong>Integration</strong> <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> <strong>Data</strong> – Technical Report<br />
Tagged files precedes data with tags, which indicates the nature <strong>of</strong> data. Figure 18 shows an extract <strong>of</strong> the<br />
biographies.list file; tags are placed at the start <strong>of</strong> each line, indicating for example, an artist name (NM), his real<br />
name (RN), his date <strong>and</strong> place <strong>of</strong> birth (DB), etc. Tags are prefixed for each list but some tags are not explained.<br />
This type <strong>of</strong> file needs ad-hoc parsing. The files having this format are:<br />
14<br />
− biographies.list includes as tags: artist name, real name, birth (date <strong>and</strong> place), decease (date, place <strong>and</strong><br />
cause) <strong>and</strong> height. Other tags are ignored.<br />
− business.list includes as tags: movie title, budget (currency <strong>and</strong> amount), <strong>and</strong> revenue (currency, amount<br />
<strong>and</strong> country). Other tags are ignored.<br />
NM: Noth, Chris<br />
RN: Noth, Christopher David<br />
DB: 13 November 1954, Madison, Wisconsin, USA<br />
BG: Chris Noth was born in Madison, Wisconsin on 13 November 1954.<br />
In his<br />
BG: youth he travelled <strong>and</strong> lived in Engl<strong>and</strong>, Yugoslavia <strong>and</strong> Spain…<br />
Figure 18 – Extract <strong>of</strong> the biographies.list file<br />
Hierarchical-structured files organize data into two hierarchical levels: the outside level describes a movie<br />
feature (e.g. an actor or a director) <strong>and</strong> the inside level lists all movies corresponding to such feature. Figure 19<br />
shows an extract <strong>of</strong> the actors.list file, presenting the list <strong>of</strong> movies played by an actor. Some files contains other<br />
attributes <strong>and</strong> comments, which follow movie titles (in the same line) separated by one or more tabulations. This<br />
type <strong>of</strong> file is the most difficult to extract; it needs ad-hoc extractors to iterate between hierarchical levels, <strong>and</strong><br />
pre-processing for solving tabulations problems (see tab-separated columned files). The files having this format<br />
are:<br />
− actors.list lists actors <strong>and</strong> 4 columns (movie title, role, casting position, comments). Columns (excepting<br />
movie title) are optional.<br />
− actresses.list lists actresses <strong>and</strong> 4 columns (movie title, role, casting position, comments). Columns<br />
(excepting movie title) are optional.<br />
− costume-designers.list lists costume designers <strong>and</strong> 2 columns (movie title, comments).<br />
− directors.list lists directors <strong>and</strong> 2 columns (movie title, comments).<br />
− distributors.list lists movie titles <strong>and</strong> 2 columns (distribution company, comments), where distribution<br />
company may contain acronyms (between brackets) <strong>and</strong> the code <strong>of</strong> one or several countries (between<br />
square brackets), for example: “France 2 (FR 2) [fr]”<br />
− movie-links.list lists movie titles <strong>and</strong> 1 columns (link to another movie title), which embeds the link type<br />
(e.g. “featured in”, “alternate language version <strong>of</strong>”, “referenced in”, etc).<br />
− producers.list lists producers <strong>and</strong> 2 columns (movie title, comments).<br />
− production-designers.list lists producion designers <strong>and</strong> 2 columns (movie title, comments).<br />
− writers.list lists custome designers <strong>and</strong> 2 columns (movie title, comments).<br />
Outst<strong>and</strong>ing Supporting Actor in a Comedy Series]<br />
Noth, Chris Abducted: A Father's Love (1996) (TV) [Larry Coster] <br />
Acting Class, The (2000) [Martin Ballsac] <br />
Apology (1986) (TV) [Roy Burnette] <br />
At Mother's Request (1987) (TV) [Steve Klein]<br />
Baby Boom (1987) (as Christopher Noth) [Yuppie Husb<strong>and</strong>] <br />
Bad Apple (2004) (TV) [Tozzi] <br />
Figure 19 – Extract <strong>of</strong> the actors.list file<br />
Table 5 summarizes file formats, optional attributes <strong>and</strong> other particularities.
Verónika Peralta<br />
File Format Optional att. Particularities<br />
movies.list Tab-separated columned<br />
actors.list Hierarchical-structured<br />
comments,<br />
role, casting<br />
actresses.list Hierarchical-structured<br />
comments,<br />
role, casting<br />
biographies.list Tagged<br />
Columns birth <strong>and</strong> decease are commaseparated<br />
lists <strong>of</strong> dates <strong>and</strong> places.<br />
business.list Tagged<br />
Columns budget <strong>and</strong> revenue embed<br />
currencies, amounts <strong>and</strong> countries.<br />
color-info.list Tab-separated columned comments<br />
costume-designers.list Hierarchical-structured comments<br />
countries.list Tab-separated columned<br />
directors.list Hierarchical-structured comments<br />
distributors.list Hierarchical-structured comments<br />
Column distribution company embeds<br />
company acronyms <strong>and</strong> country code<br />
genres.list Tab-separated columned<br />
keywords.list Tab-separated columned<br />
Contains two lists: (i) list <strong>of</strong> keywords,<br />
(ii) lists <strong>of</strong> movies <strong>and</strong> their keywords.<br />
language.list Tab-separated columned comments<br />
locations.list Tab-separated columned comments<br />
Column location is a comma-separated<br />
list <strong>of</strong> geographical places.<br />
movie-links.list Hierarchical-structured Column link embeds link type.<br />
producers.list Hierarchical-structured comments<br />
production-companies.list Tab-separated columned comments<br />
Column production company embeds<br />
country code<br />
production-designers.list Hierarchical-structured comments<br />
ratings.list Fixed-length columned<br />
release-dates.list Tab-separated columned comments Column release date embeds country.<br />
running-times.list Tab-separated columned comments Column running time embeds country.<br />
sound-mix.list Tab-separated columned<br />
writers.list Hierarchical-structured comments<br />
Table 5 – List formats <strong>and</strong> particularities<br />
Movies correspond to one <strong>of</strong> 6 types: cinema film, series episode, mini-series, TV series, video <strong>and</strong> video game.<br />
Movie identifiers (MovieTitle columns) are Strings that concatenates different data depending on movie types:<br />
− Cinema film identifiers concatenate movie title <strong>and</strong> release year (between brackets). E.g.: Walk in the<br />
Clouds, A (1995)<br />
− Series episode identifiers concatenate movie title (quoted), release year (between brackets) <strong>and</strong> episode<br />
identification (between braces). The latter may include episode name, episode number (between<br />
brackets), or both. E.g.: "Sex <strong>and</strong> the City" (1998) {All or Nothing (#3.10)}<br />
− Mini-series identifiers concatenate movie title (quoted), release year (between brackets) <strong>and</strong> the word<br />
mini (between brackets). E.g.: "Bon Voyage" (2006) (mini)<br />
− TV series identifiers concatenate movie title, release year (between brackets) <strong>and</strong> the word TV (between<br />
brackets). E.g.: Aladdin on Ice (1995) (TV)<br />
− Video identifiers concatenate movie title, release year (between brackets) <strong>and</strong> the letter V (between<br />
brackets). E.g.: Elton John: Live in Barcelona (1992) (V)<br />
− Video game identifiers concatenate movie title, release year (between brackets) <strong>and</strong> the word VG<br />
(between brackets). E.g.: Harry Potter: Quidditch World Cup (2003) (VG)<br />
3.2. <strong>IMDb</strong> target schema<br />
The target schema for the <strong>IMDb</strong> data set consists in 24 tables describing movies <strong>and</strong> companies or persons<br />
related to movies. Figure 20 shows the tables <strong>of</strong> the target schema, their primary keys (underlined attributes) <strong>and</strong><br />
foreign keys (arrows between tables). Additional dotted lines indicate is-a relationships among some tables <strong>and</strong> a<br />
(not implemented) relation describing persons. Table 6 describes each target table, its attributes <strong>and</strong> constraints.<br />
15
<strong>Extraction</strong> <strong>and</strong> <strong>Integration</strong> <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> <strong>Data</strong> – Technical Report<br />
16<br />
Figure 20 – <strong>IMDb</strong> schema<br />
<strong>IMDb</strong>_Movies<br />
MovieNum<br />
MovieTitle<br />
Type<br />
Year<br />
FurtherInfo<br />
<strong>IMDb</strong>_Ratings<br />
MovieNum<br />
MovieTitle<br />
Distribution<br />
Votes<br />
Rating<br />
<strong>IMDb</strong>_Genres<br />
MovieNum<br />
MovieTitle<br />
Genre<br />
MovieNum<br />
MovieNum<br />
<strong>IMDb</strong>_Countries<br />
MovieNum<br />
MovieTitle<br />
Country<br />
MovieNum<br />
<strong>IMDb</strong>_ReleaseDates<br />
MovieNum<br />
MovieTitle<br />
Country<br />
ReleaseDate<br />
ReleaseYear<br />
ReleaseInfo<br />
MovieNum<br />
<strong>IMDb</strong>_Keywords<br />
Keyword<br />
MovieQuantity<br />
<strong>IMDb</strong>_MovieKeywords<br />
MovieNum<br />
MovieTitle<br />
Keyword<br />
Keyword<br />
MovieNum<br />
MovieNum<br />
<strong>IMDb</strong>_Languages<br />
MovieNum<br />
MovieTitle<br />
Language<br />
LanguageInfo<br />
<strong>IMDb</strong>_ProductionCompanies<br />
MovieNum<br />
MovieTitle<br />
CompanyName<br />
CountryCode<br />
CompanyInfo<br />
MovieNum<br />
MovieNum<br />
<strong>IMDb</strong>_Colors<br />
MovieNum<br />
MovieTitle<br />
Color<br />
ColorInfo<br />
MovieNum<br />
LinkMovieNum=MovieNum<br />
<strong>IMDb</strong>_MovieLinks<br />
MovieNum<br />
LinkMovieNum<br />
MovieTitle<br />
LinkMovieTitle<br />
LinkType<br />
2<br />
MovieNum<br />
<strong>IMDb</strong>_Business<br />
MovieNum<br />
MovieTitle<br />
BudgetCurrency<br />
Budget<br />
RevenueCurrency<br />
Revenue<br />
<strong>IMDb</strong>_Distributors<br />
MovieNum<br />
MovieTitle<br />
DistributionCompany<br />
CountryCode<br />
DistributionInfo<br />
MovieNum<br />
<strong>IMDb</strong>_Locations<br />
MovieNum<br />
MovieTitle<br />
Location<br />
Zone<br />
LocationInfo<br />
MovieNum<br />
<strong>IMDb</strong>_RunningTimes<br />
MovieNum<br />
MovieTitle<br />
Country<br />
Duration<br />
RunningInfo<br />
MovieNum<br />
MovieNum<br />
<strong>IMDb</strong>_Sounds<br />
MovieNum<br />
MovieTitle<br />
SoundMix<br />
<strong>IMDb</strong>_Directors<br />
MovieNum<br />
MovieTitle<br />
Director<br />
DirectorAlias<br />
DirectorInfo<br />
MovieNum<br />
<strong>IMDb</strong>_Writers<br />
MovieNum<br />
MovieTitle<br />
Writer<br />
WriterInfo<br />
MovieNum<br />
<strong>IMDb</strong>_Producers<br />
MovieNum<br />
MovieTitle<br />
Producer<br />
ProducerInfo<br />
MovieNum<br />
<strong>IMDb</strong>_CostumeDesigners<br />
MovieNum<br />
MovieTitle<br />
CostumeDesigner<br />
CostumeInfo<br />
MovieNum<br />
<strong>IMDb</strong>_ProductionDesigners<br />
MovieNum<br />
MovieTitle<br />
ProductionDesigner<br />
DesignerInfo<br />
MovieNum<br />
<strong>IMDb</strong>_Actresses<br />
MovieNum<br />
MovieTitle<br />
Actress<br />
PlayInfo<br />
Role<br />
Casting<br />
MovieNum<br />
<strong>IMDb</strong>_Actors<br />
MovieNum<br />
MovieTitle<br />
Actor<br />
PlayInfo<br />
Role<br />
Casting<br />
MovieNum<br />
<strong>IMDb</strong>_Biographies<br />
Name<br />
RealName<br />
Birth<br />
Decease<br />
Height<br />
<strong>IMDb</strong>_Person<br />
Name<br />
Director<br />
Writer<br />
Producer<br />
Actress<br />
Actor<br />
CostumeDesigner<br />
ProductionDesigner<br />
Name<br />
<strong>IMDb</strong>_Movies<br />
MovieNum<br />
MovieTitle<br />
Type<br />
Year<br />
FurtherInfo<br />
<strong>IMDb</strong>_Ratings<br />
MovieNum<br />
MovieTitle<br />
Distribution<br />
Votes<br />
Rating<br />
<strong>IMDb</strong>_Genres<br />
MovieNum<br />
MovieTitle<br />
Genre<br />
MovieNum<br />
MovieNum<br />
<strong>IMDb</strong>_Countries<br />
MovieNum<br />
MovieTitle<br />
Country<br />
MovieNum<br />
<strong>IMDb</strong>_ReleaseDates<br />
MovieNum<br />
MovieTitle<br />
Country<br />
ReleaseDate<br />
ReleaseYear<br />
ReleaseInfo<br />
MovieNum<br />
<strong>IMDb</strong>_Keywords<br />
Keyword<br />
MovieQuantity<br />
<strong>IMDb</strong>_MovieKeywords<br />
MovieNum<br />
MovieTitle<br />
Keyword<br />
Keyword<br />
MovieNum<br />
MovieNum<br />
<strong>IMDb</strong>_Languages<br />
MovieNum<br />
MovieTitle<br />
Language<br />
LanguageInfo<br />
<strong>IMDb</strong>_ProductionCompanies<br />
MovieNum<br />
MovieTitle<br />
CompanyName<br />
CountryCode<br />
CompanyInfo<br />
MovieNum<br />
MovieNum<br />
<strong>IMDb</strong>_Colors<br />
MovieNum<br />
MovieTitle<br />
Color<br />
ColorInfo<br />
MovieNum<br />
LinkMovieNum=MovieNum<br />
<strong>IMDb</strong>_MovieLinks<br />
MovieNum<br />
LinkMovieNum<br />
MovieTitle<br />
LinkMovieTitle<br />
LinkType<br />
2<br />
MovieNum<br />
<strong>IMDb</strong>_Business<br />
MovieNum<br />
MovieTitle<br />
BudgetCurrency<br />
Budget<br />
RevenueCurrency<br />
Revenue<br />
<strong>IMDb</strong>_Distributors<br />
MovieNum<br />
MovieTitle<br />
DistributionCompany<br />
CountryCode<br />
DistributionInfo<br />
MovieNum<br />
<strong>IMDb</strong>_Locations<br />
MovieNum<br />
MovieTitle<br />
Location<br />
Zone<br />
LocationInfo<br />
MovieNum<br />
<strong>IMDb</strong>_RunningTimes<br />
MovieNum<br />
MovieTitle<br />
Country<br />
Duration<br />
RunningInfo<br />
MovieNum<br />
MovieNum<br />
<strong>IMDb</strong>_Sounds<br />
MovieNum<br />
MovieTitle<br />
SoundMix<br />
<strong>IMDb</strong>_Directors<br />
MovieNum<br />
MovieTitle<br />
Director<br />
DirectorAlias<br />
DirectorInfo<br />
MovieNum<br />
<strong>IMDb</strong>_Writers<br />
MovieNum<br />
MovieTitle<br />
Writer<br />
WriterInfo<br />
MovieNum<br />
<strong>IMDb</strong>_Producers<br />
MovieNum<br />
MovieTitle<br />
Producer<br />
ProducerInfo<br />
MovieNum<br />
<strong>IMDb</strong>_CostumeDesigners<br />
MovieNum<br />
MovieTitle<br />
CostumeDesigner<br />
CostumeInfo<br />
MovieNum<br />
<strong>IMDb</strong>_ProductionDesigners<br />
MovieNum<br />
MovieTitle<br />
ProductionDesigner<br />
DesignerInfo<br />
MovieNum<br />
<strong>IMDb</strong>_Actresses<br />
MovieNum<br />
MovieTitle<br />
Actress<br />
PlayInfo<br />
Role<br />
Casting<br />
MovieNum<br />
<strong>IMDb</strong>_Actors<br />
MovieNum<br />
MovieTitle<br />
Actor<br />
PlayInfo<br />
Role<br />
Casting<br />
MovieNum<br />
<strong>IMDb</strong>_Biographies<br />
Name<br />
RealName<br />
Birth<br />
Decease<br />
Height<br />
<strong>IMDb</strong>_Person<br />
Name<br />
Director<br />
Writer<br />
Producer<br />
Actress<br />
Actor<br />
CostumeDesigner<br />
ProductionDesigner<br />
Name
Table Attributes Constraints<br />
<strong>IMDb</strong>_Movies<br />
Information about<br />
movies<br />
<strong>IMDb</strong>_Actors<br />
Actors <strong>of</strong> movies<br />
<strong>IMDb</strong>_Actresses<br />
Actresses <strong>of</strong> movies<br />
<strong>IMDb</strong>_Biographies<br />
Personal data about<br />
actors, actresses,<br />
directors, producers<br />
<strong>and</strong> other people<br />
involved in movies<br />
<strong>IMDb</strong>_Business<br />
Budget <strong>and</strong> revenue <strong>of</strong><br />
movies<br />
<strong>IMDb</strong>_Colors<br />
Colors <strong>of</strong> movies<br />
<strong>IMDb</strong>_Costume-<br />
Designers<br />
Costume designers <strong>of</strong><br />
movies<br />
<strong>IMDb</strong>_Countries<br />
Countries <strong>of</strong> movies<br />
<strong>IMDb</strong>_Directors<br />
Directors <strong>of</strong> movies<br />
− MovieNum: Numeric(7); autogenerated id<br />
− MovieTitle: String(250)<br />
− Type: String(2); {C=cinema, TV=television,<br />
V=video, VG=video game, S=serie,<br />
M=miniserie}<br />
− Year: String(45); an year, a range or a list <strong>of</strong><br />
years<br />
− FurtherInfo: String(30)<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− Actor: String(75)<br />
− PlayInfo: String(110); further info<br />
(uncredited, alias, special roles…)<br />
− Role: String(256); role name <strong>and</strong>/or<br />
description<br />
− Casting: Numeric(3); casting position<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− Actress: String(75)<br />
− PlayInfo: String(110); further info<br />
(uncredited, alias, special roles…)<br />
− Role: String(256); role name <strong>and</strong>/or<br />
description<br />
− Casting: Numeric(3); casting position<br />
− Name: String(70)<br />
− RealName: String(220)<br />
− Birth: String(130); date <strong>and</strong> place <strong>of</strong> birth<br />
− Decease: String(160); date, place <strong>and</strong> cause<br />
<strong>of</strong> decease<br />
− Height: String(15)<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− BudgetCurrency: String(3)<br />
− Budget: Numeric(15,2)<br />
− RevenueCurrency: String(3)<br />
− Revenue: Numeric(15,2)<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− Color: String(20)<br />
− ColorInfo: String(50); further info (versions)<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− CostumeDesigner: String(50)<br />
− CostumeInfo: String(100); further info<br />
(roles, alias, …)<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− Country: String(30)<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− Director: String(40)<br />
− DirectorAlias: String(80); director’s<br />
appearing name in the movie<br />
− DirectorInfo: String(100); further info e.g.<br />
uncredited movies, co-direction, direction <strong>of</strong><br />
episodes, segments or videos…<br />
Verónika Peralta<br />
Primary key: MovieNum<br />
Alternative key: MovieTitle<br />
Primary key: MovieNum, Actor<br />
Foreign keys:<br />
− MovieNum: ref. <strong>IMDb</strong>_Movies<br />
Primary key: MovieNum, Actress<br />
Foreign keys:<br />
− MovieNum: ref. <strong>IMDb</strong>_Movies<br />
Primary key: Name<br />
Primary key: MovieNum<br />
Foreign keys:<br />
− MovieNum: ref. <strong>IMDb</strong>_Movies<br />
Primary key: MovieNum, Color<br />
Foreign keys:<br />
− MovieNum: ref. <strong>IMDb</strong>_Movies<br />
Primary key: MovieNum,<br />
CostumeDesigner<br />
Foreign keys:<br />
− MovieNum: ref. <strong>IMDb</strong>_Movies<br />
Primary key: MovieNum, Country<br />
Foreign keys:<br />
− MovieNum: ref. <strong>IMDb</strong>_Movies<br />
Primary key: MovieNum, Director<br />
Foreign keys:<br />
− MovieNum: ref. <strong>IMDb</strong>_Movies<br />
17
<strong>Extraction</strong> <strong>and</strong> <strong>Integration</strong> <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> <strong>Data</strong> – Technical Report<br />
<strong>IMDb</strong>_Distributors<br />
Companies that<br />
distributed movies<br />
<strong>IMDb</strong>_Genres<br />
Genres <strong>of</strong> movies<br />
<strong>IMDb</strong>_Keywords<br />
Master <strong>of</strong> keywords<br />
<strong>IMDb</strong>_Languages<br />
Languages <strong>of</strong> movies<br />
<strong>IMDb</strong>_Locations<br />
Running locations <strong>of</strong><br />
movies<br />
<strong>IMDb</strong>_MovieKeywords<br />
Keywords <strong>of</strong> movies<br />
<strong>IMDb</strong>_MovieLinks<br />
Links between movies<br />
<strong>IMDb</strong>_Producers<br />
Producers <strong>of</strong> movies<br />
<strong>IMDb</strong>_Production-<br />
Companies<br />
Companies that<br />
produced movies<br />
<strong>IMDb</strong>_Production-<br />
Designers<br />
Production designers <strong>of</strong><br />
movies<br />
<strong>IMDb</strong>_Ratings<br />
Global ratings on<br />
movies (not<br />
differentiated par user)<br />
18<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− DistributionCompany: String(110)<br />
− CountryCode: String(15)<br />
− DistributorInfo: String(140); further info<br />
(years, countries, media, …)<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− Genre: String(12)<br />
− Keyword: String(60)<br />
− MovieQuantity: Numeric(5); the number <strong>of</strong><br />
movies having the keyword<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− Language: String(30)<br />
− LanguageInfo: String(70); further info<br />
(versions)<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− Location: String(170)<br />
− Zone: String(50)<br />
− LocationInfo: String(256); further info<br />
(years, countries, media, …)<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− Keyword: String(60)<br />
− MovieNum: Numeric(7)<br />
− LinkMovieNum: Numeric(7); linked movie<br />
− MovieTitle: String(250)<br />
− LinkMovieTitle: String(250); linked movie<br />
− LinkType: String(30); references, remakes,<br />
etc.<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− Producer: String(60)<br />
− ProducerInfo: String(90); further info (role,<br />
awards, …)<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− CompanyName: String(170)<br />
− CountryCode: String(15)<br />
− CompanyInfo: String(140); further info<br />
(years, places, participations, …)<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− ProductionDesginer: String(35)<br />
− DesignerInfo: String(140); further info (co-<br />
productions, roles, …)<br />
− Distribution: String(10); unknown meaning<br />
− Votes: Numeric(8,2); quantity <strong>of</strong> votes<br />
(decimals are always zero)<br />
− Rating: Number(4,2), value between 0 <strong>and</strong><br />
10<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
Primary key: MovieNum,<br />
DistributionCompany,<br />
CountryCode<br />
Foreign keys:<br />
− MovieNum: ref. <strong>IMDb</strong>_Movies<br />
Primary key: MovieNum, Genre<br />
Foreign keys:<br />
− MovieNum: ref. <strong>IMDb</strong>_Movies<br />
Primary key: Keyword<br />
Primary key: MovieNum,<br />
Language<br />
Foreign keys:<br />
− MovieNum: ref. <strong>IMDb</strong>_Movies<br />
Primary key: MovieNum, Location<br />
Foreign keys:<br />
− MovieNum: ref. <strong>IMDb</strong>_Movies<br />
Primary key: MovieNum, Keyword<br />
Foreign keys:<br />
− MovieNum: ref. <strong>IMDb</strong>_Movies<br />
− Keyword: ref. <strong>IMDb</strong>_Keywords<br />
Primary key: MovieNum,<br />
LinkMovieNum, LinkType<br />
Foreign keys:<br />
− MovieNum: ref. <strong>IMDb</strong>_Movies<br />
− LinkMovieNum: ref.<br />
<strong>IMDb</strong>_Movies<br />
Primary key: MovieNum, Producer<br />
Foreign keys:<br />
− MovieNum: ref. <strong>IMDb</strong>_Movies<br />
Primary key: MovieNum,<br />
CompanyName, CountryCode<br />
Foreign keys:<br />
− MovieNum: ref. <strong>IMDb</strong>_Movies<br />
Primary key: MovieNum,<br />
ProductionDesginer<br />
Foreign keys:<br />
− MovieNum: ref. <strong>IMDb</strong>_Movies<br />
Primary key: MovieNum<br />
Foreign keys:<br />
− MovieNum: ref. <strong>IMDb</strong>_Movies
<strong>IMDb</strong>_ReleaseDates<br />
Dates <strong>of</strong> movie releases<br />
<strong>IMDb</strong>_RunningTimes<br />
Running times <strong>of</strong><br />
movies<br />
<strong>IMDb</strong>_Sounds<br />
Sound mix <strong>of</strong> movies<br />
<strong>IMDb</strong>_Writers<br />
Writers <strong>of</strong> movies<br />
3.3. <strong>IMDb</strong> extraction processes<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− Country: String(40)<br />
− ReleaseDate: String(20)<br />
− ReleaseYear: Numeric(4)<br />
− ReleaseInfo: String(90); further info (release<br />
city/festival)<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− Country: String(40)<br />
− Duration: Numeric(5)<br />
− RunningInfo: String(75); further info<br />
(version, episodes, media, …)<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− SoundMix: String(40)<br />
− MovieNum: Numeric(7)<br />
− MovieTitle: String(250)<br />
− Writer: String(45)<br />
− WriterInfo: String(110); further info (role,<br />
awards, …)<br />
Table 6 – <strong>IMDb</strong> schema<br />
Verónika Peralta<br />
Primary key: MovieNum, Country,<br />
ReleaseDate<br />
Foreign keys:<br />
− MovieNum: ref. <strong>IMDb</strong>_Movies<br />
Primary key: MovieNum, Country<br />
Foreign keys:<br />
− MovieNum: ref. <strong>IMDb</strong>_Movies<br />
Primary key: MovieNum,<br />
SoundMix<br />
Foreign keys:<br />
− MovieNum: ref. <strong>IMDb</strong>_Movies<br />
Primary key: MovieNum, Writer<br />
Foreign keys:<br />
− MovieNum: ref. <strong>IMDb</strong>_Movies<br />
Contrarily to <strong>MovieLens</strong> data sets, the <strong>IMDb</strong> data set was very hard to extract <strong>and</strong> load in a relational database.<br />
The reason is that the data set is not conceived to be manipulated by DBMSs but by research engines. As a<br />
consequence, data is not presented in a tabular way <strong>and</strong> it is necessary to develop ad-hoc data extractors for<br />
loading tagged <strong>and</strong> hierarchical data <strong>and</strong> pre-processors for formatting data when it is presented in a pseudotabular<br />
format. This sub-section describes data extraction processes <strong>and</strong> next sub-section presents data cleaning<br />
<strong>and</strong> data transformation processes.<br />
<strong>Data</strong> extraction includes pre-processing tasks for: (i) normalizing hierarchical structures, (ii) substituting<br />
multiple tabulations <strong>and</strong> blanks, <strong>and</strong> (iii) normalizing tagged structures. Then, pre-processed files are loaded into<br />
a relational database. We describe each task as follows:<br />
3.3.1. Normalization <strong>of</strong> hierarchical structures<br />
Many files organize data into two hierarchical levels. The outside level describes a movie feature (e.g. an actor)<br />
<strong>and</strong> the inside level lists all movies corresponding to such feature (see Figure 19). Consequently, only the first<br />
line corresponding to a movie feature contains it explicitly; the following lines only contain movie titles (<strong>and</strong><br />
possible optional attributes). In order to store data in a tabular way, we need to iterate among the list <strong>of</strong> movies<br />
corresponding to each feature <strong>and</strong> insert it at each line. We do so in a pre-processing stage.<br />
Pre-processing program is written in Java (jdk 1.4). It takes as input a list file, processes the file line per line <strong>and</strong><br />
stores resulting lines in an output text file. We recognize lines containing a new feature from those containing<br />
movies <strong>of</strong> a previous feature because the formers start by a String (the feature value) <strong>and</strong> the latter start by a<br />
tabulation mark. A variable, storing the current feature value, is updated for each new feature found. White lines<br />
are ignored.<br />
Table 7 (column hierarchies) indicates which source files need pre-processing for normalizing hierarchical<br />
structures.<br />
3.3.2. Substitution <strong>of</strong> multiple tabulations <strong>and</strong> blanks<br />
An important problem, present in most source files, is the use <strong>of</strong> a variable number <strong>of</strong> tabulations for separating<br />
columns. This disconcerts database loading programs <strong>and</strong> causes the storage <strong>of</strong> data at inappropriate attributes.<br />
19
<strong>Extraction</strong> <strong>and</strong> <strong>Integration</strong> <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> <strong>Data</strong> – Technical Report<br />
In order to prepare source files for being read by database loading programs, we pre-processed files, substituting<br />
multiple tabulations <strong>and</strong> blanks by a unique tabulation.<br />
Pre-processing programs are written in Java (jdk 1.4). They take as input a list file, process the file line per line<br />
<strong>and</strong> store resulting lines in an output text file. The substitution function is quite simple; it searches for multiple<br />
occurrences <strong>of</strong> tabulations <strong>and</strong>/or blanks <strong>and</strong> substitutes them by a single tabulation.<br />
Optional attributes need more attention. When there is a unique optional attribute <strong>and</strong> it is placed at the end <strong>of</strong><br />
the line, the database loading interface fills null values automatically. However, when there are several optional<br />
attributes, we need to analyze to which attribute corresponds each value <strong>and</strong> (possibly) insert additional<br />
tabulations in order to signal the presence <strong>of</strong> a null value. Most files only contained comments as optional<br />
attributes, which were placed at the end <strong>of</strong> the line, so no further processing was needed. Conversely, actors.list<br />
<strong>and</strong> actresses.list files present three optional attributes: comments, role <strong>and</strong> casting position. We recognized their<br />
values because <strong>of</strong> special characters enclosing them. Specifically, comments, roles <strong>and</strong> casting positions are<br />
enclosed by brackets, square brackets <strong>and</strong> angle brackets respectively (see Figure 19). In all cases (including<br />
files having only comments), we did not eliminate enclosing characters; we deferred it to data cleaning steps.<br />
Table 7 indicates which source files needed pre-processing for substituting multiple tabulations or blanks<br />
(column tabulations) <strong>and</strong> which ones needed the treatment <strong>of</strong> enclosing characters (we indicate attribute names<br />
enclosed by the corresponding special characters).<br />
3.3.3. Normalization <strong>of</strong> tagged structures<br />
Two files utilize tags for organizing data: biographies.list <strong>and</strong> business.list. Both files start each line with a twoletters<br />
tag that indicates the meaning <strong>of</strong> the line (the attribute to which it corresponds).<br />
Biographies.list starts describing a new person by presenting a line with person name (tag NM). Subsequent lines<br />
describe different characteristics <strong>of</strong> a person. We extract real names (tag RN), date <strong>and</strong> place <strong>of</strong> birth (tag DB),<br />
date, place <strong>and</strong> cause <strong>of</strong> decease (tag DD) <strong>and</strong> height (tag HT). We ignore other tags <strong>and</strong> white lines.<br />
Analogously, business.list starts describing a new movie by presenting a line with its title (tag MV). Subsequent<br />
lines describe different characteristics <strong>of</strong> a movie. We extract budget currency <strong>and</strong> amount (tag BT), revenue<br />
currency <strong>and</strong> amount (tag RT). We ignore other tags <strong>and</strong> white lines.<br />
20<br />
File Hierarchies Tabulations Enclosing characters Tags<br />
movies.list no yes no<br />
actors.list yes yes (comments), [role], no<br />
actresses.list yes yes (comments), [role], no<br />
biographies.list no no yes<br />
business.list no no yes<br />
color-info.list no yes (comments) no<br />
costume-designers.list yes yes (comments) no<br />
countries.list no yes no<br />
directors.list yes yes (comments) no<br />
distributors.list yes yes (comments) no<br />
genres.list no yes no<br />
keywords.list no yes no<br />
language.list no yes (comments) no<br />
locations.list no yes (comments) no<br />
movie-links.list yes yes no<br />
producers.list yes yes (comments) no<br />
Production-companies.list no yes (comments) no<br />
Production-designers.list yes yes (comments) no<br />
ratings.list no no no<br />
release-dates.list no yes (comments) no<br />
running-times.list no yes (comments) no<br />
sound-mix.list No yes no<br />
writers.list Yes yes (comments) no<br />
Table 7 – Pre-processing <strong>of</strong> source files
Verónika Peralta<br />
In order to store data in a tabular way, we need to iterate among lines corresponding to each person/movie <strong>and</strong><br />
write a unique line with all available characteristics. We do so in a pre-processing stage.<br />
Pre-processing program was written in Java (jdk 1.4). It takes as input a list file, processes the file line per line<br />
<strong>and</strong> stores resulting lines in an output text file. We store extracted data (corresponding to desired tags) in local<br />
variables. In both files, the desired tags are contained in a single line (other ignored tags are split in several<br />
lines). When we recognize a new person/movie (because <strong>of</strong> reading the corresponding tags), we store the line<br />
corresponding to the previous one.<br />
Table 7 (column tags) indicates which source files needed pre-processing for normalizing tagged structures.<br />
3.3.4. Loading into a relational database<br />
After pre-processing tasks, files have a tabular form <strong>and</strong> can be treated by a database loading interface. We used<br />
Micros<strong>of</strong>t Access ® loading interface, indicating file format <strong>and</strong> column format.<br />
Note that the only file that did not need pre-processing is ratings.list, which is structured as fixed-length<br />
columns.<br />
3.4. <strong>IMDb</strong> cleaning <strong>and</strong> transformation processes<br />
Contrarily to <strong>MovieLens</strong> data sets, the <strong>IMDb</strong> data set presented a great number <strong>of</strong> anomalies <strong>and</strong> consequently<br />
needed the execution <strong>of</strong> several data cleaning <strong>and</strong> data transformation processes. The types <strong>of</strong> anomalies that we<br />
detected <strong>and</strong> solved are:<br />
− We found 6 duplicate titles in movies.list. We keep only one occurrence.<br />
− There were a great number <strong>of</strong> duplicate pairs <strong>of</strong> the form for most features (specific<br />
quantities for each source file are shown in Table 11). In most cases, duplicates tuples had inconsistent<br />
values for remaining attributes. Our police was to substitute inconsistent values by null values (especially<br />
for comments), i.e. we considered inconsistent data as unknown. We did special treatment to some<br />
feature: in the case <strong>of</strong> budget <strong>and</strong> revenues (business.list) we summed inconsistent values (they were<br />
consider as partial data) <strong>and</strong> in the case <strong>of</strong> personal data (biographies.list) we kept the maximum value (in<br />
all inconsistencies, one <strong>of</strong> the values was null).<br />
− There were a great number <strong>of</strong> violations to referential constraints (specific quantities for each source file<br />
are shown in Table 11). We eliminated such tuples.<br />
− Some columns embedded multiple attributes. For example, the value “Taxi Films [uy]” <strong>of</strong> the productioncompanies.list<br />
embeds a production company (“Taxi Films”) <strong>and</strong> the code <strong>of</strong> its country (“Uruguay”).<br />
Eight files contained this type <strong>of</strong> anomalies (see Table 5). We solved them by splitting columns.<br />
− Optional columns are delimited by special characters, e.g. square brackets or braces (specific enclosing<br />
characters are listed in Table 7). We eliminated enclosing characters in most cases. Exceptions were the<br />
occurrences <strong>of</strong> multiple comments, for example “(1963-1964) (USA) (TV) (original airing)”; we kept<br />
brackets because they delimitate each comment.<br />
− Country names <strong>and</strong> codes, appearing in different source files (distributors.list, production-companies.list,<br />
running-times.list, locations.list, countries.list <strong>and</strong> release-dates.list) were not consistent (for example, we<br />
found the names ‘U.K.’ <strong>and</strong> ‘United Kindom’ for referring to the same country). We deferred the problem<br />
to data integration stage.<br />
− As we did not extract all tags in business.list <strong>and</strong> biographies.list, we found an important number <strong>of</strong> tuples<br />
having null values for all columns excepting key (specific quantities for each source file are shown in<br />
Table 11). We filtered such tuples.<br />
− The type <strong>of</strong> a movie (film, series episode, etc.) was not explicitly provided, however, <strong>IMDb</strong> site explains<br />
how titles are build (see Section 3.1). We calculated movie type following such explanations.<br />
− Release year was calculated from release date (release-dates.list) by using typical date manipulation<br />
functions.<br />
− The list <strong>of</strong> keyword included at the start <strong>of</strong> the keywords.list file, which is split in several columns, was<br />
hard to extract. We calculated the same information by grouping keywords <strong>of</strong> all movies <strong>and</strong> counting the<br />
number <strong>of</strong> movies that references each one.<br />
21
<strong>Extraction</strong> <strong>and</strong> <strong>Integration</strong> <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> <strong>Data</strong> – Technical Report<br />
ETL workflows consisted in the same tasks presented for the <strong>MovieLens</strong> data set but providing different<br />
implementations for specific cleaning <strong>and</strong> reconciliation tasks. The general workflow is illustrated in Figure 21<br />
(P.txt represents either one <strong>of</strong> the text files obtained after pre-processing source files or ratings.list, which does<br />
not need pre-processing). However, for certain files some tasks were not necessary, specifically, if there were no<br />
transformations to do, or no duplications nor orphans were detected. Table 8 shows the tasks that are omitted for<br />
each file; bulk copy <strong>and</strong> duplicate check tasks are executed for all files.<br />
In addition, as indexes by movie titles consumed much space <strong>and</strong> were less efficient, we auto generated new<br />
integer identifiers for movies. We included a new column (MovieNum), <strong>of</strong> auto-numeric type to the moviesgroup<br />
table (which stores the results <strong>of</strong> the grouping task). Each tuple inserted by the grouping task is set with a<br />
new sequential identifier generated automatically. We also maintained movie titles in tables referencing movies,<br />
letting users the possibility <strong>of</strong> choosing the identifier that is best adapted to their applications.<br />
22<br />
P.txt<br />
<strong>IMDb</strong>_Movies<br />
bulk copy<br />
cleaning<br />
duplicate<br />
check<br />
Figure 21 – ETL workflow for <strong>IMDb</strong> files<br />
duplicate<br />
reconciliation grouping<br />
cleaning<br />
referential<br />
check<br />
orphan<br />
elimination<br />
aggregation<br />
<strong>IMDb</strong>_P<br />
<strong>IMDb</strong>_P’<br />
Source file Cleaning<br />
Reconc., dupl. cleaning<br />
<strong>and</strong> grouping<br />
Ref. check<br />
Orphan<br />
elimination Aggregation<br />
movies.txt X X X<br />
actors.txt X X<br />
actresses.txt X X<br />
biographies.txt X X X<br />
business.txt X<br />
color-info.txt X X<br />
costume-designers.txt X X<br />
countries.txt X X<br />
directors.txt X X<br />
distributors.txt X<br />
genres.txt X X<br />
keywords.txt X<br />
language.txt X X<br />
locations.txt X<br />
movie-links.txt X X<br />
producers.txt X X<br />
production-companies.txt X<br />
production-designers.txt X X<br />
ratings.list X X X<br />
release-dates.txt X<br />
running-times.txt X<br />
sound-mix.txt X X<br />
writers.txt X X<br />
Table 8 – ETL tasks that where not necessary to execute<br />
In the following we describe such specific cleaning <strong>and</strong> reconciliation tasks, as well as special remarks:
Verónika Peralta<br />
− Movie type calculation task (cleaning for movies.txt) calculates movie types as derived attributes by<br />
looking for some special characters in the movie title (see Table 9). The task is implemented as a<br />
sequence <strong>of</strong> SQL operations (one for each movie type) that update the Type attribute according to Table<br />
9, for example:<br />
UPDATE movies-cleaning<br />
SET Type=M<br />
WHERE Mid(MovieTitle, 1, 2) = ’”’<br />
AND Mid(MovieTitle, Len(MovieTitle)-5, 6) = ’(mini)’<br />
Movie type Title is quoted Special ending word Generated attribute<br />
Cinema film No none C<br />
Series episode Yes (mini) S<br />
Mini-series Yes none M<br />
TV series No (TV) TV<br />
Video No (V) V<br />
Video game No (VG) VG<br />
Table 9 – Movie type identification<br />
− Null values elimination tasks (initial cleaning for biographies.txt <strong>and</strong> business.txt) eliminate tuples having<br />
null values for all attributes (excepting keys). Tasks are implemented as SQL operations like:<br />
INSERT INTO business-cleaning (MovieTitle, Budget, Revenue)<br />
SELECT *<br />
FROM business-source<br />
WHERE Bugnet IS NOT NULL OR Revenue IS NOT NULL;<br />
The implementation for biographies.txt is analogous.<br />
− Splitting tasks (cleaning for several tasks) update several attributes from one concatenating several<br />
concepts. For example, the ProductionCompany column <strong>of</strong> the productioncompanies.txt file embeds<br />
company names <strong>and</strong> country codes. The tasks search for special characters (e.g. brackets, square brackets,<br />
braces) <strong>and</strong> split attribute values according to such characters. Tasks are implemented as a sequence <strong>of</strong><br />
SQL operations, for example:<br />
INSERT INTO productioncompanies-cleaning<br />
(MovieTitle, ProductionCompany, CompanyInfo, Aux1)<br />
SELECT MovieTitle, ProductionCompany, Comments,<br />
InStr(1,ProductionCompany,’[‘)<br />
FROM productioncompanies-source;<br />
UPDATE productioncompanies-cleaning<br />
SET CompanyName = Mid(ProductionCompany, 1, Aux1-2),<br />
CountryCode = Mid(ProductionCompany,Aux1+1,Len(ProductionCompany)-Aux1-1)<br />
WHERE Aux1 > 0;<br />
UPDATE productioncompanies-cleaning<br />
SET CompanyName = ProductionCompany<br />
WHERE Aux1 = 0;<br />
The first operation calculates the position <strong>of</strong> the special character <strong>and</strong> stores it in the Aux1 attribute. The<br />
second operation performs the split. The third operation is necessary when some values are optional (in<br />
the example, if country code can be omitted). It sets the first attribute <strong>and</strong> less the second as Null.<br />
The implementation <strong>of</strong> splitting tasks for business.txt, distributors.txt, releasedates.txt <strong>and</strong><br />
runningtimes.txt is similar. Table 10 lists attributes to split <strong>and</strong> special characters for those files. Columns<br />
<strong>of</strong> biographies.txt were not split because <strong>of</strong> their format diversity.<br />
As a special case, the splitting <strong>of</strong> movielinks.txt is performed by table look-up, i.e. the first attribute<br />
(movie type) is looked up in a table; the remaining <strong>of</strong> the text corresponds to the second attribute. The<br />
look-up table contains the fifteen link types, namely: “alternate language version <strong>of</strong>”, “edited from”,<br />
“edited into”, “featured in”, “features”, “followed by”, “follows”, “referenced in”, “references”, “remade<br />
as”, “remake <strong>of</strong>”, “spin <strong>of</strong>f”, “spo<strong>of</strong>ed in”, “spo<strong>of</strong>s”, “version <strong>of</strong>”. The tasks is implemented as follows:<br />
23
<strong>Extraction</strong> <strong>and</strong> <strong>Integration</strong> <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> <strong>Data</strong> – Technical Report<br />
24<br />
INSERT INTO movielinks-cleaning<br />
(MovieTitle, LinkMovieTitle, LinkType)<br />
SELECT MovieTitle, Mid(Link, Len(LinkType), Len(Link)), LinkType<br />
FROM movielinks-source X, MovieTypes Y<br />
WHERE Mid(Link, 1, Len(LinkType)) = LinkType;<br />
File<br />
Multi-purpose<br />
column<br />
First attribute Second attribute Separator<br />
business.txt Budget BudgetCurrency Budget first blank<br />
Revenue RevenueCurrency Revenue first blank<br />
distributors.txt DistributionCompany DistributionCompany CountryCode [<br />
movielinks.txt Link LinkType LinkMovieTitle look up<br />
productioncompanies.txt ProductionCompany CompanyName CountryCode(opt.) [<br />
releasedates.txt ReleaseDate Country ReleaseDate colon<br />
runningtimes.txt RunningTime Country (optional) Duration colon<br />
Table 10 – Splitting attributes<br />
− Enclosing character elimination tasks (cleaning for several tables, listed in Table 7) deletes initial <strong>and</strong><br />
ending character <strong>of</strong> a column, which enclose the value. The task is implemented as follows:<br />
INSERT INTO languages-cleaning<br />
(MovieTitle, Language, LanguageInfo)<br />
SELECT MovieTitle, Language, Mid(LanguageInfo,2,Len(LanguageInfo)-2)<br />
FROM languages-source;<br />
The implementation for other files is analogous.<br />
− Zone calculation task (cleaning for location.txt) extracts zone from location attribute. Specifically,<br />
location values are comma-separated lists <strong>of</strong> geographic places, from most precise to largest (e.g.<br />
“Toronto, Ontario, Canada” <strong>and</strong> “Biltmore Hotel - 506 S. Gr<strong>and</strong> Ave., Downtown, Los Angeles,<br />
California, USA”). The number <strong>of</strong> items in each location value is variable. The Zone attribute will store<br />
the last item <strong>of</strong> a location, which is generally a country name. In order to calculate it, we need to iterate in<br />
the location value, searching the position <strong>of</strong> a comma <strong>and</strong> splitting the location value as described for<br />
previous task.<br />
The task is implemented as a sequence <strong>of</strong> SQL operations, as follows:<br />
INSERT INTO locations-cleaning<br />
(MovieTitle, Location, Zone, LocationInfo)<br />
SELECT MovieTitle, Location, Location, Comments<br />
FROM locations-source;<br />
UPDATE locations-cleaning<br />
SET Zone = Mid(Zone, InStr(1,Zone,’,’)+2, Len(Zone))<br />
WHERE InStr(1,Zone,’,’) > 0;<br />
The first operation sets zone as the complete location. The second operation splits zone according to the<br />
position <strong>of</strong> the first comma, storing the second part. Note that only tuples containing a comma are<br />
updated. The second operation is repeated until no tuple is updated.<br />
− Release year calculation task (further cleaning for releasedates.txt) calculates release year from release<br />
date. A typical truncate function is used. The implementation <strong>of</strong> the task was presented in Section 2.3.2.<br />
− Reconciliation tasks set conflictive values to Null. The task is implementation was discussed in Section<br />
2.3.<br />
− Orphan elimination tasks keep tuples joining the <strong>IMDb</strong>_Movies table, as described in Section 2.3. At this<br />
moment, we copy the auto-generated movie identifier (MovieNum) obtained from the <strong>IMDb</strong>_Movies<br />
table. The implementation for the genres.txt file is:<br />
INSERT INTO genres-ref (MovieNum, MovieTitle, Genre)<br />
SELECT Y.MovieNum, X.MovieTitle, X.Genre<br />
FROM genres-group X, <strong>IMDb</strong>_Movies Y<br />
WHERE X.MovieTitle = Y.MovieTitle;
Verónika Peralta<br />
The implementation for other files is analogous. Note that we execute the task even if no orphans were<br />
found, in order to obtain movie identifiers.<br />
− Referential check task (for movielinks.txt) is executed twice in order to check that both, movie titles <strong>and</strong><br />
link movie titles, are included in the <strong>IMDb</strong>_Movies table, generating two orphan tables. Implementation<br />
was described in Section 2.3.<br />
− Orphan elimination task (for movielinks.txt) also considers that both, movie titles <strong>and</strong> link movie titles,<br />
are included in the <strong>IMDb</strong>_Movies table, i.e. it joins twice the <strong>IMDb</strong>_Movies table. The task is<br />
implemented as follows:<br />
INSERT INTO movielinks-ref (MovieNum, LinkMovieNum, MovieTitle,<br />
LinkMovieTitle, LinkType)<br />
SELECT X.MovieNum, Y.MovieNum, M.MovieTitle, M.LinkMovieTitle, M.LinkType<br />
FROM movielinks-group M, <strong>IMDb</strong>_Movies X, <strong>IMDb</strong>_Movies Y<br />
WHERE M.MovieTitle = X.MovieTitle<br />
AND M.LinkMovieTitle = Y.MovieTitle;<br />
− Aggregation task (for keywords.txt) stores the list <strong>of</strong> existing keywords <strong>and</strong> the number <strong>of</strong> movies related<br />
to each keyword. It is implemented as follows:<br />
INSERT INTO <strong>IMDb</strong>_Keywords (Keyword, MovieQuantity)<br />
SELECT Keyword, count(*)<br />
FROM <strong>IMDb</strong>_MovieKeywords<br />
GROUP BY Keyword;<br />
As a summary <strong>of</strong> ETL tasks, Table 11 shows the number <strong>of</strong> extracted tuples, the number <strong>of</strong> anomalies, duplicates<br />
<strong>and</strong> orphans found, <strong>and</strong> the number <strong>of</strong> exported tuples.<br />
Source file<br />
movies.list<br />
Extracted Cleaning<br />
Exported<br />
Duplicates Orphans Resulting table<br />
tuples anomalies tuples<br />
858.967 0 6 858.961 <strong>IMDb</strong>_Movies<br />
actors.list 5.014.897 0 40 486.875 4.527.982 <strong>IMDb</strong>_Actors<br />
actresses.list 2.743.802 0 35 327.935 2415832 <strong>IMDb</strong>_Actresses<br />
biographies.list 312.837 54.011 8 258818 <strong>IMDb</strong>_Biographies<br />
business.list 57.364 31.847 1 0 25516 <strong>IMDb</strong>_Business<br />
color-info.list 433.117 0 846 44.541 387730 <strong>IMDb</strong>_Colors<br />
costume-designers.list 99.192 0 35 7.644 91513 <strong>IMDb</strong>_CostumeDesigners<br />
countries.list 530.295 0 681 0 529614 <strong>IMDb</strong>_Countries<br />
directors.list 513.344 0 295 34 513015 <strong>IMDb</strong>_Directors<br />
distributors.list 403.271 0 15.023 0 388248 <strong>IMDb</strong>_Distributors<br />
genres.list 637.976 0 606 57.519 579851 <strong>IMDb</strong>_Genres<br />
keywords.list 1.124.778 0 507 330<br />
32704 <strong>IMDb</strong>_Keywords<br />
1123941 <strong>IMDb</strong>_MovieKeywords<br />
language.list 445.840 0 1.355 0 444485 <strong>IMDb</strong>_Languages<br />
locations.list 216.951 0 45 0 216906 <strong>IMDb</strong>_Locations<br />
movie-links.list 533.428 0 0 36.572 496856 <strong>IMDb</strong>_MovieLinks<br />
producers.list 1.038.128 0 422.088 172.775 443265 <strong>IMDb</strong>_Producers<br />
production-companies.list 478.346 0 831 0 477515 <strong>IMDb</strong>_ProductionCompanies<br />
production-designers.list 103.347 0 12 7509 95826 <strong>IMDb</strong>_ProductionDesigners<br />
ratings.list 159.957 0 0 66 159891 <strong>IMDb</strong>_Ratings<br />
release-dates.list 830.416 0 2.007 53.556 774853 <strong>IMDb</strong>_ReleaseDates<br />
running-times.list 327.312 0 4.405 49.901 273006 <strong>IMDb</strong>_RunningTimes<br />
sound-mix.list 231.318 0 496 8.512 222310 <strong>IMDb</strong>_Sounds<br />
writers.list 775.660 0 25.260 130.812 619588 <strong>IMDb</strong>_Writers<br />
Table 11 – Statistics <strong>of</strong> data extraction, cleaning <strong>and</strong> transformation<br />
25
<strong>Extraction</strong> <strong>and</strong> <strong>Integration</strong> <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> <strong>Data</strong> – Technical Report<br />
4. <strong>Data</strong> integration<br />
This section describes the construction <strong>of</strong> an integrated database from the extracted data. We start describing the<br />
difficulty <strong>of</strong> matching movie identifiers <strong>and</strong> describing the matching process. We then describe the integrated<br />
schema <strong>and</strong> its construction.<br />
4.1. Matching processes<br />
We extracted 858.961 movies from <strong>IMDb</strong> <strong>and</strong> 3.883 movies from <strong>MovieLens</strong>. After integration, we constated<br />
that all <strong>MovieLens</strong> movies were included in the <strong>IMDb</strong> data set. However, matching titles was not trivial.<br />
Furthermore, only 79% <strong>of</strong> matches had identical movie titles in both data sets. Other matching strategies were<br />
implemented, including manual look-up, in order to match the remaining titles. This section presents an<br />
overview <strong>of</strong> the matching process; details can be found in [Peralta 2007a].<br />
Both, <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> identify movies by ad-hoc identifiers (called movie titles) that concatenate title <strong>and</strong><br />
running year (see Sections 2.1.1<strong>and</strong> 3.1). Note that we do not consider series episodes, mini-series, TV series,<br />
videos or video games (which have different identifiers) because <strong>MovieLens</strong> data set only includes cinema<br />
movies. But even identifiers are built in the same manner, we found discrepancies in titles. Most typical causes<br />
are inclusion <strong>of</strong> special characters, typing errors, transposition or omission <strong>of</strong> articles, use <strong>of</strong> different running<br />
years, translation <strong>of</strong> foreign titles <strong>and</strong> use <strong>of</strong> alternative titles. Table 12 illustrates differences in movie titles.<br />
<strong>IMDb</strong> titles <strong>MovieLens</strong> titles Causes<br />
¡Three Amigos! (1986) Three Amigos! (1986) Special characters<br />
Shall We Dance (1937) Shall We Dance? (1937)<br />
8 ½ Women (1999) 8 1/2 Women (1999)<br />
One Hundred <strong>and</strong> One Dalmatians (1961) 101 Dalmatians (1961)<br />
Two Moon Junction (1988) Two Moon Juction (1988) Typing errors<br />
La Bamba (1987) Bamba, La (1987) Transposed articles<br />
El Dorado (1966) Dorado, El (1967)<br />
Three Ages (1923) Three Ages, The (1923) Omitted articles<br />
Story <strong>of</strong> G.I. Joe (1945) Story <strong>of</strong> G.I. Joe, The (1945)<br />
Tarantella (1996) Tarantella (1995) Different years<br />
Supernova (2000/I) Supernova (2000)<br />
Abre los ojos (1997) Open Your Eyes (Abre los ojos) (1997) English titles<br />
Caro diario (1994) Dear Diary (Caro Diario) (1994)<br />
Huitième jour, Le (1996) Eighth Day, The (Le Huitième jour ) (1996)<br />
Historia <strong>of</strong>icial, La (1985) Official Story, The (La Historia Oficial) (1985)<br />
Cité des enfants perdus, La (1995) City <strong>of</strong> Lost Children, The (1995)<br />
Star Wars (1977) Star Wars: Episode IV - A New Hope (1977) Alternative titles<br />
Santa Claus (1985) Santa Claus: The Movie (1985)<br />
Sunset Blvd. (1950) Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)<br />
Sugar Hill (1994) Harlem (1993)<br />
26<br />
Table 12 – Examples <strong>of</strong> movies having different titles<br />
In order to match <strong>MovieLens</strong> titles with <strong>IMDb</strong> titles (included in ML_Movies <strong>and</strong> <strong>IMDb</strong>_Movies tables), we<br />
followed several matching strategies. Each strategy tries to match <strong>MovieLens</strong> titles not matched by previous<br />
strategies as shown in Figure 22. To this end, each strategy considers a different heuristic for building c<strong>and</strong>idate<br />
(alternative) titles for unmatched movies <strong>and</strong> compares them with <strong>IMDb</strong> titles. We implemented the following<br />
strategies:<br />
− Join by movie title: This strategy computes the exact match. We joined <strong>IMDb</strong>_Movies <strong>and</strong> ML_Movies<br />
tables by the MovieTitle attribute. We obtained 3086 matches (79%).<br />
− Match using the <strong>MovieLens</strong> small data set: This strategy uses the SML_Movies table (small data set) as a<br />
mapping table between unmatched <strong>MovieLens</strong> titles <strong>and</strong> <strong>IMDb</strong> titles. Specifically, the SML_Movies table<br />
contains the URL <strong>of</strong> the <strong>IMDb</strong> web page corresponding to each movie. We formatted the URL <strong>and</strong><br />
replaced special characters (e.g. %20 represents a blank) obtaining a c<strong>and</strong>idate title. We joined this table<br />
with unmatched <strong>MovieLens</strong> movies (by the MovieTitle attribute) <strong>and</strong> with <strong>IMDb</strong>_Movies (by the just
ML_Movies<br />
SML_Movies<br />
<strong>IMDb</strong>_Movies<br />
match1<br />
ok (insert)<br />
unmatched<br />
match2<br />
unmatched<br />
match3<br />
unmatched<br />
match4<br />
unmatched<br />
match5<br />
unmatched<br />
match6<br />
unmatched<br />
match7<br />
unmatched<br />
match8<br />
ok<br />
ok<br />
ok<br />
ok<br />
ok<br />
ok<br />
ok<br />
Figure 22 – Matching process<br />
I_Movies<br />
Verónika Peralta<br />
built c<strong>and</strong>idate title). We had several difficulties: First, the intersection <strong>of</strong> both <strong>MovieLens</strong> collections is<br />
not very large. Second, we cannot replace all special characters. As a result, we obtained 83 new matches,<br />
totalizing 3169 matches (82%).<br />
− Match extracting foreign title: Some <strong>MovieLens</strong> titles are translated to English but include, between<br />
brackets, the original title (see examples in Table 12). We extracted original titles (text between brackets)<br />
<strong>of</strong> unmatched <strong>MovieLens</strong> movies obtaining a c<strong>and</strong>idate title. Then, we joined them with <strong>IMDb</strong>_Movies.<br />
We obtained 92 new matches, totalizing 3261 matches (84%).<br />
− Match ignoring running year: Sometimes, movie titles embeds running years, sometimes they embed<br />
diffusion years <strong>and</strong> sometimes they include additional characters (e.g. ‘1999/I’). This strategy consists in<br />
ignoring years for joining movie titles <strong>and</strong> manually verifying the obtained matches. We removed years<br />
from movie titles <strong>of</strong> unmatched <strong>MovieLens</strong> movies <strong>and</strong> <strong>IMDb</strong>_Movies. Then, we joined these auxiliary<br />
titles storing matches in a temporal table. A human validated matches (e.g. those differentiating in only<br />
one year) <strong>and</strong> eliminated erroneous ones. As a result, we obtained 295 new matches, totalizing 3556<br />
matches (92%).<br />
− Matching <strong>of</strong> 20 first characters: This strategy consists in truncating titles to 20 characters, joining them<br />
<strong>and</strong> manually verifying the obtained matches. We tried to solve some kinds <strong>of</strong> alternative titles (e.g.<br />
“Friday the 13th Part III (1982)” <strong>and</strong> “Friday the 13th Part 3: 3D (1982)”) or special characters or<br />
truncations (e.g. “Why Do Fools Fall In Love? (1998)” <strong>and</strong> “Why Do Fools Fall In Love (1998)”). We<br />
truncated titles <strong>of</strong> unmatched <strong>MovieLens</strong> movies <strong>and</strong> <strong>IMDb</strong>_Movies. Then, we joined these auxiliary<br />
titles storing matches in a temporal table. A human validated matches by comparing whole titles <strong>and</strong><br />
eliminated erroneous matches. As a result, we obtained 34 new matches, totalizing 3590 matches (92%).<br />
− Matching <strong>of</strong> 10 first characters: This strategy repeats the previous one, but truncating titles to 10<br />
characters. We obtained 46 new matches, totalizing 3636 matches (94%).<br />
− Manual look-up: This strategy consists in manually examining unmatched <strong>MovieLens</strong> titles looking for<br />
possible matches in the <strong>IMDb</strong>_Movies table. We used intuition for finding titles in the huge <strong>IMDb</strong><br />
collection (e.g. transposing words for “La Bamba (1987)” <strong>and</strong> “Bamba, La (1987)”) <strong>and</strong> knowledge <strong>of</strong><br />
foreign languages (e.g. “Cité des enfants perdus, La (1995)” <strong>and</strong> “City <strong>of</strong> Lost Children, The (1995)”).<br />
We obtained 116 new matches, totalizing 3752 matches (97%).<br />
− Web look-up: The last strategy uses the search engine <strong>of</strong> <strong>MovieLens</strong> web site to find unmatched movies.<br />
Then, we used the link provided by <strong>MovieLens</strong> site to access the movie page at <strong>IMDb</strong> <strong>and</strong> we copied its<br />
title. We matched the remaining 131 movie titles.<br />
After matching titles, we found that 2 movies had the same corresponding movie at <strong>IMDb</strong>, i.e. we detected 2<br />
additional duplicates. Tuples with original title were kept; those with foreign title were removed.<br />
27
<strong>Extraction</strong> <strong>and</strong> <strong>Integration</strong> <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> <strong>Data</strong> – Technical Report<br />
28<br />
Figure 23 – Integrated schema<br />
I_Movies<br />
MovieId<br />
Title<strong>MovieLens</strong><br />
TitleIMDB<br />
I_MovieRatings<br />
MovieId<br />
Distribution<br />
Votes<br />
Rating<br />
I_Genres<br />
Genre<br />
I_MovieGenres<br />
MovieId<br />
Genre<br />
MovieId<br />
MovieId<br />
Genre<br />
I_Countries<br />
Country<br />
LongName<br />
DomainCode<br />
ISO2Code<br />
ISO3Code<br />
UNnumericalCode<br />
IsCurrent<br />
IsSovereign<br />
Sovereign<br />
Continent<br />
SecondaryContinent<br />
Area<br />
Inhabitants<br />
I_MovieCountries<br />
MovieId<br />
Country<br />
MovieId<br />
Country<br />
I_Years<br />
Year<br />
Decade<br />
I_MovieYears<br />
MovieId<br />
Year<br />
YearInfo<br />
Year<br />
MovieId<br />
I_MovieReleaseDates<br />
MovieId<br />
Country<br />
ReleaseDate<br />
ReleaseYear<br />
ReleaseInfo<br />
Year=<br />
ReleaseYear<br />
Country<br />
MovieId<br />
I_Keywords<br />
Keyword<br />
MovieQuantities<br />
I_MovieKeywords<br />
MovieId<br />
Keyword<br />
Keyword<br />
MovieId<br />
Language<br />
MovieId<br />
I_Languages<br />
Language<br />
I_MovieLanguages<br />
MovieId<br />
Language<br />
LanguageInfo<br />
I_CountryCodes<br />
CountryCode<br />
Country<br />
Description<br />
I_MovieProductionCompanies<br />
MovieId<br />
CompanyName<br />
CountryCode<br />
CompanyInfo<br />
CountryCode<br />
MovieId<br />
Type<br />
MovieId<br />
I_Types<br />
Type<br />
TypeDescription<br />
I_MovieTypes<br />
MovieId<br />
Type<br />
Color<br />
MovieId<br />
I_Colors<br />
Color<br />
I_MovieColors<br />
MovieId<br />
Color<br />
ColorInfo<br />
LinkType<br />
MovieId<br />
LinkMovieId=MovieId<br />
I_LinkTypes<br />
LinkType<br />
I_MovieLinks<br />
MovieId<br />
LinkMovieId<br />
LinkType<br />
2<br />
BudgetCurrency=Currency /<br />
RevenueCurrency=Currency<br />
MovieId<br />
I_Currencies<br />
Currency<br />
ConversionUSD<br />
I_MovieBusiness<br />
MovieId<br />
BudgetCurrency<br />
Budget<br />
BudgetUSD<br />
RevenueCurrency<br />
Revenue<br />
RevenueUSD<br />
2<br />
I_MovieDistributors<br />
MovieId<br />
DistributionCompany<br />
CountryCode<br />
DistributionInfo<br />
CountryCode<br />
MovieId<br />
I_Distributors<br />
DistributionCompany<br />
DistributionCompany<br />
I_Zones<br />
Zone<br />
Country<br />
Country<br />
I_MovieLocations<br />
MovieId<br />
Location<br />
Zone<br />
LocationInfo<br />
Zone<br />
MovieId<br />
I_MovieRunningTimes<br />
MovieId<br />
Country<br />
Duration<br />
RunningInfo<br />
Country<br />
MovieId<br />
SoundMix<br />
MovieId<br />
I_Sounds<br />
SoundMix<br />
I_MovieSounds<br />
MovieId<br />
SoundMix<br />
I_Directors<br />
Director<br />
MovieQuantity<br />
I_MovieDirectors<br />
MovieId<br />
Director<br />
DirectorAlias<br />
DirectorInfo<br />
MovieId<br />
Director<br />
I_Writers<br />
Writer<br />
MovieQuantity<br />
I_MovieWriters<br />
MovieId<br />
Writer<br />
WriterInfo<br />
MovieId<br />
Writer<br />
I_Producers<br />
Producer<br />
MovieQuantity<br />
I_MovieProducers<br />
MovieId<br />
Producer<br />
ProducerInfo<br />
MovieId<br />
Producer<br />
I_MovieCostumeDesigners<br />
MovieId<br />
CostumeDesigner<br />
CostumeInfo<br />
MovieId<br />
I_MovieProductionDesigners<br />
MovieId<br />
ProductionDesigner<br />
DesignerInfo<br />
MovieId<br />
I_Actresses<br />
Actress<br />
MovieQuantity<br />
I_MovieActresses<br />
MovieId<br />
Actress<br />
PlayInfo<br />
Role<br />
Casting<br />
MovieId<br />
Actress<br />
I_Actors<br />
Actor<br />
MovieQuantity<br />
I_MovieActors<br />
MovieId<br />
Actor<br />
PlayInfo<br />
Role<br />
Casting<br />
MovieId<br />
Actor<br />
I_Biographies<br />
Name<br />
RealName<br />
Birth<br />
Decease<br />
Height<br />
I_Ages<br />
AgeId<br />
MinAge<br />
MaxAge<br />
I_Durations<br />
DurationMin<br />
DurationMax<br />
DurationInterval<br />
DurationMin ≤ Duration<br />
≤ DurationMax<br />
Country<br />
I_Occupations<br />
OccupationId<br />
Occupation<br />
I_UserRatings<br />
UserId<br />
MovieId<br />
Rating<br />
Timestamp<br />
I_Users<br />
UserId<br />
Gender<br />
AgeId<br />
OccupationId<br />
ZipCode<br />
UserId<br />
MovieId<br />
AgeId OccupationId<br />
I_Person<br />
Name<br />
Director<br />
Writer<br />
Producer<br />
Actress<br />
Actor<br />
CostumeDesigner<br />
ProductionDesigner<br />
Name<br />
I_Votes<br />
Votes<br />
I_Ratings<br />
Ratings<br />
Votes<br />
Rating<br />
I_Revenues<br />
RevenueUSD<br />
I_Budgets<br />
BudgetUSD<br />
RevenueUSD<br />
BudgetUSD<br />
I_Movies<br />
MovieId<br />
Title<strong>MovieLens</strong><br />
TitleIMDB<br />
I_MovieRatings<br />
MovieId<br />
Distribution<br />
Votes<br />
Rating<br />
I_Genres<br />
Genre<br />
I_MovieGenres<br />
MovieId<br />
Genre<br />
MovieId<br />
MovieId<br />
Genre<br />
I_Countries<br />
Country<br />
LongName<br />
DomainCode<br />
ISO2Code<br />
ISO3Code<br />
UNnumericalCode<br />
IsCurrent<br />
IsSovereign<br />
Sovereign<br />
Continent<br />
SecondaryContinent<br />
Area<br />
Inhabitants<br />
I_MovieCountries<br />
MovieId<br />
Country<br />
MovieId<br />
Country<br />
I_Years<br />
Year<br />
Decade<br />
I_MovieYears<br />
MovieId<br />
Year<br />
YearInfo<br />
Year<br />
MovieId<br />
I_MovieReleaseDates<br />
MovieId<br />
Country<br />
ReleaseDate<br />
ReleaseYear<br />
ReleaseInfo<br />
Year=<br />
ReleaseYear<br />
Country<br />
MovieId<br />
I_Keywords<br />
Keyword<br />
MovieQuantities<br />
I_MovieKeywords<br />
MovieId<br />
Keyword<br />
Keyword<br />
MovieId<br />
Language<br />
MovieId<br />
I_Languages<br />
Language<br />
I_MovieLanguages<br />
MovieId<br />
Language<br />
LanguageInfo<br />
I_CountryCodes<br />
CountryCode<br />
Country<br />
Description<br />
I_MovieProductionCompanies<br />
MovieId<br />
CompanyName<br />
CountryCode<br />
CompanyInfo<br />
CountryCode<br />
MovieId<br />
Type<br />
MovieId<br />
I_Types<br />
Type<br />
TypeDescription<br />
I_MovieTypes<br />
MovieId<br />
Type<br />
Color<br />
MovieId<br />
I_Colors<br />
Color<br />
I_MovieColors<br />
MovieId<br />
Color<br />
ColorInfo<br />
LinkType<br />
MovieId<br />
LinkMovieId=MovieId<br />
I_LinkTypes<br />
LinkType<br />
I_MovieLinks<br />
MovieId<br />
LinkMovieId<br />
LinkType<br />
2<br />
BudgetCurrency=Currency /<br />
RevenueCurrency=Currency<br />
MovieId<br />
I_Currencies<br />
Currency<br />
ConversionUSD<br />
I_MovieBusiness<br />
MovieId<br />
BudgetCurrency<br />
Budget<br />
BudgetUSD<br />
RevenueCurrency<br />
Revenue<br />
RevenueUSD<br />
2<br />
I_MovieDistributors<br />
MovieId<br />
DistributionCompany<br />
CountryCode<br />
DistributionInfo<br />
CountryCode<br />
MovieId<br />
I_Distributors<br />
DistributionCompany<br />
DistributionCompany<br />
I_Zones<br />
Zone<br />
Country<br />
Country<br />
I_MovieLocations<br />
MovieId<br />
Location<br />
Zone<br />
LocationInfo<br />
Zone<br />
MovieId<br />
I_MovieRunningTimes<br />
MovieId<br />
Country<br />
Duration<br />
RunningInfo<br />
Country<br />
MovieId<br />
SoundMix<br />
MovieId<br />
I_Sounds<br />
SoundMix<br />
I_MovieSounds<br />
MovieId<br />
SoundMix<br />
I_Directors<br />
Director<br />
MovieQuantity<br />
I_MovieDirectors<br />
MovieId<br />
Director<br />
DirectorAlias<br />
DirectorInfo<br />
MovieId<br />
Director<br />
I_Writers<br />
Writer<br />
MovieQuantity<br />
I_MovieWriters<br />
MovieId<br />
Writer<br />
WriterInfo<br />
MovieId<br />
Writer<br />
I_Producers<br />
Producer<br />
MovieQuantity<br />
I_MovieProducers<br />
MovieId<br />
Producer<br />
ProducerInfo<br />
MovieId<br />
Producer<br />
I_MovieCostumeDesigners<br />
MovieId<br />
CostumeDesigner<br />
CostumeInfo<br />
MovieId<br />
I_MovieProductionDesigners<br />
MovieId<br />
ProductionDesigner<br />
DesignerInfo<br />
MovieId<br />
I_Actresses<br />
Actress<br />
MovieQuantity<br />
I_MovieActresses<br />
MovieId<br />
Actress<br />
PlayInfo<br />
Role<br />
Casting<br />
MovieId<br />
Actress<br />
I_Actors<br />
Actor<br />
MovieQuantity<br />
I_MovieActors<br />
MovieId<br />
Actor<br />
PlayInfo<br />
Role<br />
Casting<br />
MovieId<br />
Actor<br />
I_Biographies<br />
Name<br />
RealName<br />
Birth<br />
Decease<br />
Height<br />
I_Ages<br />
AgeId<br />
MinAge<br />
MaxAge<br />
I_Durations<br />
DurationMin<br />
DurationMax<br />
DurationInterval<br />
DurationMin ≤ Duration<br />
≤ DurationMax<br />
Country<br />
I_Occupations<br />
OccupationId<br />
Occupation<br />
I_UserRatings<br />
UserId<br />
MovieId<br />
Rating<br />
Timestamp<br />
I_Users<br />
UserId<br />
Gender<br />
AgeId<br />
OccupationId<br />
ZipCode<br />
UserId<br />
MovieId<br />
AgeId OccupationId<br />
I_Person<br />
Name<br />
Director<br />
Writer<br />
Producer<br />
Actress<br />
Actor<br />
CostumeDesigner<br />
ProductionDesigner<br />
Name<br />
I_Votes<br />
Votes<br />
I_Ratings<br />
Ratings<br />
Votes<br />
Rating<br />
I_Revenues<br />
RevenueUSD<br />
I_Budgets<br />
BudgetUSD<br />
RevenueUSD<br />
BudgetUSD
4.2. Integrated schema<br />
Verónika Peralta<br />
The integrated schema consists in 52 tables describing movies, companies <strong>and</strong> persons related to movies <strong>and</strong> the<br />
users that evaluated movies. Figure 23 shows the tables <strong>of</strong> the integrated schema, their primary keys (underlined<br />
attributes) <strong>and</strong> foreign keys (arrows between tables). Additional dotted lines relate some tables to a fictitious (not<br />
implemented) relation, describing persons. Shadow tables were used for the construction <strong>of</strong> the referential but<br />
are not visible for making queries. Table 13 describes each table, its attributes <strong>and</strong> constraints.<br />
Table Attributes Constraints<br />
I_Movies<br />
Join between <strong>IMDb</strong> <strong>and</strong><br />
<strong>MovieLens</strong><br />
I_Actors<br />
Master <strong>of</strong> actors<br />
I_Actresses<br />
Master <strong>of</strong> actresses<br />
I_Ages<br />
Age intervals<br />
I_Biographies<br />
Biographies <strong>of</strong> actors,<br />
actresses, directors,<br />
producers <strong>and</strong> other<br />
people involved in movies<br />
I_Budgets<br />
Master <strong>of</strong> budget<br />
intervals<br />
I_Colors<br />
Master <strong>of</strong> colors<br />
I_Countries<br />
Master <strong>of</strong> countries,<br />
providing several<br />
aggregation criteria <strong>and</strong><br />
international country<br />
codes<br />
− MovieId: Numeric(4); <strong>MovieLens</strong>’ id<br />
− Title<strong>MovieLens</strong>: String(100)<br />
− TitleImdb: String(250)<br />
− Actor: String(75)<br />
− MovieQuantity: Numeric(3); the number<br />
<strong>of</strong> played movies<br />
− Actress: String(75)<br />
− MovieQuantity: Numeric(3); the number<br />
<strong>of</strong> played movies<br />
− AgeId: Numeric(2)<br />
− MinAge: Numeric(2)<br />
− MaxAge: Numeric(2)<br />
− Name: String(70)<br />
− RealName: String(220)<br />
− Birth: String(130); date <strong>and</strong> place <strong>of</strong> birth<br />
− Decease: String(160); date, place <strong>and</strong><br />
cause <strong>of</strong> decease<br />
− Height: String(15)<br />
− BudgetUSD: Numeric(15,2); start <strong>of</strong> an<br />
interval (internal use)<br />
Primary key: MovieId<br />
Unique: TitleImdb<br />
Primary key: Actor<br />
Primary key: Actress<br />
Primary key: AgeId<br />
Primary key: Name<br />
Primary key: BudgetUSD<br />
− Color: String(20) Primary key: Color<br />
− Country: String(40)<br />
− LongName: String(110)<br />
− DomainCode: String(2); internet domain<br />
− ISO2Code: String(2); ISO3166-1-alpha2<br />
code<br />
− ISO3Code: String(3); ISO3166-1-alpha2<br />
code<br />
− UNnumericalCode: Numeric(3); united<br />
nations country code<br />
− IsCurrent: Numeric(1); 1 for current<br />
countries, 0 for old ones<br />
− IsSovereign: Numeric(1); 1 for sovereign<br />
UN nations, 2 for sovereign non-UN nations,<br />
3 for sovereign non-recognized nations, 4 for<br />
dependent territories <strong>and</strong> 5 for areas <strong>of</strong><br />
special sovereignty<br />
− Sovereign: String(40); name <strong>of</strong> sovereign<br />
nation (current country in the case <strong>of</strong> old<br />
countries)<br />
− Continent: String(20)<br />
− SecondaryContinent: String(20); for<br />
countries having territories in two continents<br />
(the one with the highest are is taken as main<br />
continent)<br />
− Area: Numeric(8)<br />
− Inhabitants: Numeric(10)<br />
Primary key: Country<br />
29
<strong>Extraction</strong> <strong>and</strong> <strong>Integration</strong> <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> <strong>Data</strong> – Technical Report<br />
I_CountryCodes<br />
Codes <strong>of</strong> countries<br />
I_Currencies<br />
Master <strong>of</strong> currencies<br />
I_Directors<br />
Master <strong>of</strong> directors<br />
I_Distributors<br />
Master <strong>of</strong> distribution<br />
companies<br />
I_Durations<br />
Master <strong>of</strong> duration<br />
intervales<br />
I_Genres<br />
Master <strong>of</strong> genres<br />
I_Keywords<br />
Master <strong>of</strong> keywords<br />
I_Languages<br />
Master <strong>of</strong> languages<br />
I_LinkTypes<br />
Types <strong>of</strong> links between<br />
movies<br />
I_MovieActors<br />
Actors <strong>of</strong> movies<br />
I_MovieActresses<br />
Actresses <strong>of</strong> movies<br />
I_MovieBusiness<br />
Budget <strong>and</strong> revenue <strong>of</strong><br />
movies<br />
I_MovieColors<br />
Colors <strong>of</strong> movies<br />
I_MovieCostume-<br />
Designers<br />
Costume designers <strong>of</strong><br />
movies<br />
30<br />
− CountryCode: String(15)<br />
− Country: String(40)<br />
− Description: String(50); further info<br />
(regions, deprecations...)<br />
− Currency: String(3)<br />
− ConversionUSD: Numeric(12,10); at<br />
October 2006<br />
− Director: String(40)<br />
− MovieQuantity: Numeric(3); the number<br />
<strong>of</strong> directed movies<br />
− DistributionCompany: String(70)<br />
− DurationMin: Numeric(5)<br />
− DurationMin: Numeric(5)<br />
− DurationInterval: Numeric(5)<br />
Primary key: CountryCode<br />
Foreign keys:<br />
− Country: ref. I_Countries<br />
Primary key: Currency<br />
Primary key: Director<br />
Primary key:<br />
DistributionCompany<br />
Primary key: DurationMin<br />
− Genre: String(12) Primary key: Genre<br />
− Keyword: String(60)<br />
− MovieQuantity: Numeric(5); the number<br />
<strong>of</strong> movies having the keyword<br />
Primary key: Keyword<br />
− Language: String(30) Primary key: Language<br />
− LinkType: String(30); references,<br />
remakes, etc.<br />
− MovieId: Numeric(4)<br />
− Actor: String(75)<br />
− PlayInfo: String(110); further info<br />
(uncredited, alias, special roles…)<br />
− Role: String(256); role name <strong>and</strong>/or<br />
description<br />
− Casting: Numeric(3); casting position<br />
− MovieId: Numeric(4)<br />
− Actress: String(75)<br />
− PlayInfo: String(110); further info<br />
(uncredited, alias, special roles…)<br />
− Role: String(256); role name <strong>and</strong>/or<br />
description<br />
− Casting: Numeric(3); casting position<br />
− MovieId: Numeric(4)<br />
− BudgetCurrency: String(3)<br />
− Budget: Numeric(15,2)<br />
− BudgetUSD: Numeric(15,2)<br />
− RevenueCurrency: String(3)<br />
− Revenue: Numeric(15,2)<br />
− RevenueUSD: Numeric(15,2)<br />
− MovieId: Numeric(4)<br />
− Color: String(20)<br />
− ColorInfo: String(50); further info<br />
(versions)<br />
− MovieId: Numeric(4)<br />
− CostumeDesigner: String(50)<br />
− CostumeInfo: String(100); further info<br />
(roles, alias, …)<br />
Primary key: LinkType<br />
Primary key: MovieId, Actor<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− Actor: ref. I_Actors<br />
Primary key: MovieId, Actress<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− Actress: ref. I_Actresses<br />
Primary key: MovieId<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− BudgetCurrency: ref.<br />
I_Currencies<br />
− RevenueCurrency: ref.<br />
I_Currencies<br />
Primary key: MovieId, Color<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− Color: ref. I_Colors<br />
Primary key: MovieId,<br />
CostumeDesigner<br />
Foreign keys:<br />
− MovieId: ref. I_Movies
I_MovieCountries<br />
Countries <strong>of</strong> movies<br />
I_MovieDirectors<br />
Countries <strong>of</strong> movies<br />
I_MovieDistributors<br />
Companies that<br />
distributed movies<br />
I_MovieGenres<br />
Genres <strong>of</strong> movies<br />
I_MovieKeywords<br />
Keywords <strong>of</strong> movies<br />
I_MovieLanguages<br />
Languages <strong>of</strong> movies<br />
I_MovieLinks<br />
Links between movies<br />
I_MovieLocations<br />
Running locations <strong>of</strong><br />
movies<br />
I_MovieProducers<br />
Producers <strong>of</strong> movies<br />
I_MovieProduction-<br />
Companies<br />
Companies that produced<br />
movies<br />
I_MovieProduction-<br />
Designers<br />
Production designers <strong>of</strong><br />
movies<br />
− MovieId: Numeric(4)<br />
− Country: String(40)<br />
− MovieId: Numeric(4)<br />
− Director: String(40)<br />
− DirectorAlias: String(40); director’s<br />
appearing name in the movie<br />
− DirectorInfo: String(70); further info e.g.<br />
uncredited movies, co-direction, direction <strong>of</strong><br />
episodes, segments or videos…<br />
− MovieId: Numeric(4)<br />
− DistributionCompany: String(70)<br />
− CountryCode: String(15)<br />
− DistributorInfo: String(100); further info<br />
(years, countries, media, …)<br />
− MovieId: Numeric(4)<br />
− Genre: String(12)<br />
− MovieId: Numeric(4)<br />
− Keyword: String(60)<br />
− MovieId: Numeric(4)<br />
− Language: String(30)<br />
− LanguageInfo: String(70); further info<br />
(versions)<br />
− MovieId: Numeric(4)<br />
− LinkMovieId: Numeric(4)<br />
− LinkType: String(30); references,<br />
remakes, etc.<br />
− MovieId: Numeric(4)<br />
− Location: String(140)<br />
− Zone: String(30)<br />
− LocationInfo: String(256); further info<br />
(years, countries, media, …)<br />
− MovieId: Numeric(4)<br />
− Producer: String(60)<br />
− ProducerInfo: String(90); further info<br />
(role, awards, …)<br />
− MovieId: Numeric(4)<br />
− CompanyName: String(120)<br />
− CountryCode: String(15)<br />
− CompanyInfo: String(140); further info<br />
(years, places, participations, …)<br />
− MovieId: Numeric(4)<br />
− ProductionDesginer: String(35)<br />
− DesignerInfo: String(140); further info<br />
(co-productions, roles, …)<br />
Verónika Peralta<br />
Primary key: MovieId, Country<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− Country: ref. I_Countries<br />
Primary key: MovieId, Director<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− Director: ref. I_Directors<br />
Primary key: MovieId,<br />
DistributionCompany,<br />
CountryCode<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− DistributionCompany: ref.<br />
I_Distributors<br />
− CountryCode: ref.<br />
I_CountryCodes<br />
Primary key: MovieId, Genre<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− Genre: ref. I_Genres<br />
Primary key: MovieId, Keyword<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− Keyword: ref. I_Keywords<br />
Primary key: MovieId, Language<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− Language: ref. I_Languages<br />
Primary key: MovieId,<br />
LinkMovieId, LinkType<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− Link MovieId: ref. I_Movies<br />
− LinkType: ref. I_LinkTypes<br />
Primary key: MovieId, Location<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− Zone: ref. I_Zones<br />
Primary key: MovieId, Producer<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− Producer: ref. I_Producers<br />
Primary key: MovieId,<br />
CompanyName, CountryCode<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− CountryCode: ref.<br />
I_CountryCodes<br />
Primary key: MovieId,<br />
ProductionDesginer<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
31
<strong>Extraction</strong> <strong>and</strong> <strong>Integration</strong> <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> <strong>Data</strong> – Technical Report<br />
I_MovieRatings<br />
Global ratings on movies<br />
(not differentiated par<br />
user)<br />
I_MovieReleaseDates<br />
Dates <strong>of</strong> different releases<br />
I_MovieRunningTimes<br />
Running times <strong>of</strong> movies<br />
I_MovieSounds<br />
Sound mix <strong>of</strong> movies<br />
I_MovieTypes<br />
Types <strong>of</strong> movies<br />
I_MovieWriters<br />
Writers <strong>of</strong> movies<br />
I_MovieYears<br />
Years <strong>of</strong> movies<br />
I_Occupations<br />
User occupations<br />
I_Producers<br />
Master <strong>of</strong> producers<br />
I_Ratings<br />
Master <strong>of</strong> ratings<br />
I_Revenues<br />
Master <strong>of</strong> revenue<br />
intervals<br />
I_Sounds<br />
Master <strong>of</strong> sounds<br />
I_Types<br />
Types <strong>of</strong> movies<br />
I_UserRatings<br />
User ratings on movies<br />
32<br />
− MovieId: Numeric(4)<br />
− Distribution: String(10); unknown<br />
meaning<br />
− Votes: Numeric(8,2); quantity <strong>of</strong> votes<br />
(decimals are always zero)<br />
− Rating: Number(4,2), value between 0<br />
<strong>and</strong> 10<br />
− MovieId: Numeric(4)<br />
− Country: String(40)<br />
− ReleaseDate: String(20)<br />
− ReleaseYear: Numeric(4)<br />
− ReleaseInfo: String(90); further info<br />
(release city/festival)<br />
− MovieId: Numeric(4)<br />
− Country: String(40)<br />
− Duration: Numeric(5)<br />
− RunningInfo: String(75); further info<br />
(version, episodes, media, …)<br />
− MovieId: Numeric(4)<br />
− SoundMix: String(40)<br />
− MovieId: Numeric(4)<br />
− Type: String(2); {C=cinema,<br />
TV=television, V=video, VG=video game,<br />
S=serie, M=miniserie}<br />
− MovieId: Numeric(4)<br />
− Writer: String(45)<br />
− WriterInfo: String(110); further info<br />
(role, awards, …)<br />
− MovieId: Numeric(4)<br />
− Year: Numeric(4)<br />
− YearInfo: String(30); further info (shot<br />
years)<br />
− OccupationId: Numeric(2)<br />
− Occupation: String (25)<br />
− Producer: String(60)<br />
− MovieQuantity: Numeric(3); the number<br />
<strong>of</strong> produced movies<br />
− Rating: Number(4,2), value between 0<br />
<strong>and</strong> 10 (internal use, for classifying ratings)<br />
− RevenueUSD: Numeric(15,2); start <strong>of</strong> an<br />
interval (internal use)<br />
Primary key: MovieId<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
Primary key: MovieId, Country,<br />
ReleaseDate<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− ReleaseYear: ref. I_Years<br />
− Country: ref. I_Countries<br />
Primary key: MovieId, Country<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− Country: ref. I_Countries<br />
Primary key: MovieId, SoundMix<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− SoundMix: ref. I_Sounds<br />
Primary key: MovieId<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− Type: ref. I_Types<br />
Primary key: MovieId, Writer<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− Writer: ref.: I_Writers<br />
Primary key: MovieId, Year<br />
Foreign keys:<br />
− MovieId: ref. I_Movies<br />
− Year: ref. I_Years<br />
Primary key: OccupationId<br />
Primary key: Producer<br />
Primary key: Rating<br />
Primary key: RevenueUSD<br />
− SoundMix: String(40) Primary key: SoundMix<br />
− Type: String(2); {C=cinema,<br />
TV=television, V=video, VG=video game,<br />
S=serie, M=miniserie}<br />
− TypeDescription: String(10)<br />
− UserId: Numeric(4)<br />
− MovieId: Numeric(4)<br />
− Rating: Number(1), value in {1,2,3,4,5}<br />
− Timestamp: Numeric(9); represents<br />
milliseconds from an initial time<br />
Primary key: Type<br />
Primary key: UserId, MovieId<br />
Foreign keys:<br />
− UserId: ref. I_Users<br />
− MovieId: ref. I_Movies
I_Users<br />
Descriptive information<br />
about users<br />
I_Votes<br />
Master <strong>of</strong> votes<br />
I_Writers<br />
Master <strong>of</strong> writers<br />
I_Years<br />
Master <strong>of</strong> years<br />
I_Zones<br />
Master <strong>of</strong> geographical<br />
zones<br />
4.3. Construction <strong>of</strong> the integrated database<br />
− UserId: Numeric(4)<br />
− Gender: Char(1); value in {M,F}<br />
− Age: Numeric(2)<br />
− Occupation: Numeric(2)<br />
− ZipCode: String(10)<br />
− Votes: Numeric(8,2); indicates the start<br />
<strong>of</strong> a voting interval (internal use, for<br />
classifying votes)<br />
− Writer: String(45)<br />
− MovieQuantity: Numeric(3); the number<br />
<strong>of</strong> written movies<br />
− Year: Numeric(4)<br />
− Decade: Numeric(4)<br />
− Zone: String(30); a country, region, sea,<br />
…<br />
− Country: String(40)<br />
Table 13 – Integrated schema<br />
Verónika Peralta<br />
Primary key: UserId<br />
Foreign keys:<br />
− Age: ref. I_Ages<br />
− Occupation: ref. I_Occupation<br />
Primary key: Votes<br />
Primary key: Writer<br />
Primary key: Year<br />
Primary key: Zone<br />
Foreign keys:<br />
− Country: ref. I_Countries<br />
Matching <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> movie titles was the most difficult task in the construction <strong>of</strong> the integrating<br />
database. The remaining tasks basically consist in joining the I_Movies table (the master <strong>of</strong> movies resulting<br />
from the matching process) in order to keep movie features <strong>of</strong> movies included in the master (i.e. assuring<br />
referential integrity). This allowed the creation <strong>of</strong> the appropriate foreign keys. In addition, we created master<br />
tables for most movie features. This section describes these tasks.<br />
Figure 24 shows an overview <strong>of</strong> the process for building tables <strong>of</strong> the integrated database. In order to filter<br />
features <strong>of</strong> movies non-included in the master <strong>of</strong> movies, we use the orphan elimination task described in Subsection<br />
2.3, which simply joins the <strong>IMDb</strong> feature table (<strong>IMDb</strong>_P in Figure 24) with the I_Movies table. Then,<br />
some master tables were aggregated using the aggregation task, also described in Sub-section 2.3, which group<br />
tuples according to a feature attribute. The aggregation task was not executed for some feature tables<br />
(I_MovieCostumeDesigners, I_MovieProductionDesigners) because they have too few repetitive values (for<br />
example, each entry <strong>of</strong> the I_MovieCustomeDesigners table describes a different costume designer). The master<br />
tables generated for each table can be deduced from Figure 23, because they coincide with referential constraints<br />
(excepting those referencing the I_Movies table).<br />
<strong>IMDb</strong>_P<br />
I_Movies<br />
orphan<br />
elimination<br />
aggregation<br />
I_MovieP<br />
I_P<br />
Figure 24 – Construction <strong>of</strong> tables <strong>of</strong> the integrated database<br />
Three special data reconciliation tasks were performed for years, currencies <strong>and</strong> countries. For years, we<br />
performed the union <strong>of</strong> years appearing in I_MovieYears <strong>and</strong> I_MovieReleaseDates tables <strong>and</strong> then, we<br />
calculated decade as a derived attribute, using typical date manipulation functions (similar to the year<br />
calculation task described in Sub-section 2.3.2). For currencies, we obtained budget <strong>and</strong> revenue currencies from<br />
the I_MovieBusiness table. We obtained dollar exchange (at October 2006) for such currencies from different<br />
Internet sites, e.g. [1] (a better exchange should be obtained by considering currencies at movie running time; we<br />
refer this to future work). For countries, we had country names in the I_MovieCountries <strong>and</strong><br />
I_MovieReleaseDates tables, in addition to some zones <strong>of</strong> the I_MovieLocations table <strong>and</strong> we had country codes<br />
in the I_MovieDistributors <strong>and</strong> I_MovieProductionCompanies tables. It was a big mess! Country names were<br />
written in different manners (e.g. ‘UK’ <strong>and</strong> ‘United Kingdom’) <strong>and</strong> sometimes referenced non-sovereign<br />
territories (e.g. French Polynesia) or non-current countries (e.g. ‘URSS’), country codes were also ambiguous.<br />
33
<strong>Extraction</strong> <strong>and</strong> <strong>Integration</strong> <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> <strong>Data</strong> – Technical Report<br />
In order to build the master <strong>of</strong> countries (I_Countries) we downloaded United Nations <strong>of</strong>ficial country list [10],<br />
which includes sovereignty status <strong>of</strong> countries (UN member sovereign states, non-UN member sovereign states,<br />
no general international recognition sovereign states, dependent territories <strong>and</strong> areas <strong>of</strong> special sovereignty).<br />
Sovereign countries <strong>of</strong> non-sovereign territories were consulted at Wikipedia [11]. We obtained different country<br />
codes (ISO3166-1 alpha 2 code, ISO3166-1 alpha 3 code, UN numerical code <strong>and</strong> internet domain code) <strong>and</strong><br />
further descriptive data (continent, area <strong>and</strong> inhabitants) from the Schmidt lists <strong>of</strong> countries [8] <strong>and</strong> continents<br />
[9]. As a country can have territory in two continents, we defined as primary continent, the one in which the<br />
country has the biggest area. There were inconsistencies between UN <strong>of</strong>ficial country names <strong>and</strong> those obtained<br />
from Wikipedia <strong>and</strong> Schmidt lists, but they were manually solved without major problems.<br />
We created mapping tables in order to join the master with I_MovieCountries <strong>and</strong> I_MovieReleaseDates tables.<br />
Country names included in such tables but not in the master were manually disambiguated <strong>and</strong> related to one <strong>of</strong><br />
the existing countries (when it was an alternative name <strong>of</strong> such country) or added as non-current country (when<br />
it was an old country name). I_MovieCountries <strong>and</strong> I_MovieReleaseDates tables were updated setting country<br />
names according to the mapping tables. Two special country names were created: (i) ‘_unknown’ for Null<br />
values, <strong>and</strong> (ii) <strong>and</strong> _multiple for special cases, such as ‘Korea’, ‘Serbia <strong>and</strong> Montenegro’ or ‘Soviet Union’ that<br />
correspond to multiple current countries.<br />
The master <strong>of</strong> country codes was created from the master <strong>of</strong> countries, selecting attributes country name <strong>and</strong><br />
domain code. Mapping tables were created analogously in order to join the master with I_MovieDistributors <strong>and</strong><br />
I_MovieProductionCompanies. We manually disambiguated country codes in an analogous way. We found<br />
several cases <strong>of</strong> far territories (e.g. ‘French Guiana’) which were associated to the corresponding sovereign<br />
country, multiple countries (e.g. ‘us/ca’ referencing United States <strong>and</strong> Canada) which were associated to the<br />
‘_multiple’ special country name <strong>and</strong> unknown countries codes (e.g. ‘cshh’ <strong>and</strong> ‘ddde’) which were associated to<br />
the ‘_unknown’ special country name.<br />
The master <strong>of</strong> zones was also created from the I_MovieLocations table, adding an attribute for storing a country<br />
name when the zone coincides with a country. The table was joined with the master <strong>of</strong> countries, <strong>and</strong> nonreferenced<br />
zones were manually examined in order to determine if they correspond to a country (setting the<br />
country attribute with the appropriate value) or other types <strong>of</strong> zones, e.g. ‘Persian Gulf’ (setting the country<br />
attribute with the ‘_multiple’ value).<br />
As summary, the construction <strong>of</strong> the integrated database basically consisted in intersection integrated movie<br />
titles with <strong>IMDb</strong> tables describing movie features. In addition, we built master tables (sometimes integrating<br />
several <strong>IMDb</strong> tables) <strong>and</strong> integrating external data sources (country lists <strong>and</strong> currencies). For recapitulating,<br />
Table 14 shows the tables <strong>of</strong> the integrated database <strong>and</strong> their number <strong>of</strong> tuples.<br />
5. Conclusion<br />
In this report we described the procedure followed for extracting <strong>and</strong> integrating <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> data.<br />
Specifically, we described source <strong>and</strong> target schemas <strong>and</strong> the algorithms that perform data extraction, data<br />
cleaning <strong>and</strong> data transformation <strong>and</strong> data integration.<br />
Major difficulties were found at:<br />
34<br />
− <strong>IMDb</strong> data extraction. We needed to develop ad-hoc pre-processing routines in order to normalize<br />
hierarchical <strong>and</strong> tagged structures <strong>and</strong> substitute multiple tabulations <strong>and</strong> blanks.<br />
− <strong>IMDb</strong> data transformation <strong>and</strong> cleaning. We found a lot <strong>of</strong> anomalies in extracted data, ranging from<br />
duplicate <strong>and</strong> orphan tuples to multiple attributes embedded in single columns. We needed to develop<br />
several cleaning <strong>and</strong> transformation tasks.<br />
− Matching <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> titles. We implemented 8 different matching strategies, some <strong>of</strong> them<br />
including human interaction. This was the most time-consuming task.<br />
− Aggregation <strong>of</strong> some master tables, especially, the master <strong>of</strong> countries. We extracted external data in<br />
order to solve conflictive country names <strong>and</strong> codes <strong>and</strong> lead with non-sovereign <strong>and</strong> non-current<br />
countries.<br />
<strong>Data</strong> extracted from <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> sites, as well as the integrated database are available at the <strong>APMD</strong><br />
project web site [7].<br />
In this report we do not dealt with maintenance issues but we would like to treat some issues as future work.<br />
Specifically, as <strong>IMDb</strong> lists are continually updated, it may be interesting to adapt extraction, cleaning <strong>and</strong>
Verónika Peralta<br />
integration processes to incorporate new data. The problem seams to be more difficult for updates <strong>of</strong> <strong>MovieLens</strong><br />
data (which are not announced by the moment) because manual reconciliation tasks will probably be necessary.<br />
Master tables Tuples Relationship tables Tuples Auxiliary tables Tuples<br />
I_Movies 3881<br />
I_Actors 63207 I_MovieActors 110548 I_Ages 7<br />
I_Actresses 31720 I_MovieActresses 48423 I_Budgets 6<br />
I_Biographies 258818 I_Occupations 21<br />
I_MovieBusiness 1740 I_Ratings 10<br />
I_Colors 2 I_MovieColors 3898 I_Revenues 6<br />
I_MovieCostumeDesigners 3373 I_UserRatings 1000194<br />
I_Countries 256 I_MovieCountries 4993 I_Users 6040<br />
I_CountryCodes 265 I_Votes 8<br />
I_Currencies 95<br />
I_Directors 2170 I_MovieDirectors 4141<br />
I_Distributors 2391 I_MovieDistributors 23060<br />
I_Durations 6<br />
I_Genres 25 I_MovieGenres 9288<br />
I_Keywords 14974 I_MovieKeywords 105128<br />
I_Languages 85 I_MovieLanguages 4796<br />
I_LinkTypes 15 I_MovieLinks 18160<br />
I_MovieLocations 14345<br />
I_Producers 8142 I_MovieProducers 16749<br />
I_MovieProductionCompanies 9517<br />
I_MovieProductionDesigners 3152<br />
I_MovieRatings 3861<br />
I_MovieReleaseDates 46554<br />
I_MovieRunningTimes 4839<br />
I_Sounds 28 I_MovieSounds 4928<br />
I_Types 6 I_MovieTypes 3881<br />
I_Writers 5458 I_MovieWriters 8766<br />
I_Years 90 I_MovieYears 3881<br />
I_Zones 133<br />
6. References<br />
Table 14 – Statistics <strong>of</strong> data integration<br />
[1] European Commision: “List <strong>of</strong> countries, territories <strong>and</strong> currencies”, Web Report, March 2007. URL:<br />
http://publications.europa.eu/code/en/en-5000500.htm, last accessed on April 5 th , 2007.<br />
[2] GroupLens Research: “movielens: helping you to find the right movies”. Web site, ULR:<br />
http://movielens.umn.edu, last accessed on July 9 th , 2006.<br />
[3] GroupLens Research: “<strong>MovieLens</strong> <strong>Data</strong> Sets”. Web site, ULR: http://grouplens.org/node/12#attachments,<br />
last accessed on October 5 th , 2006.<br />
[4] Intenet Movie <strong>Data</strong>base, Inc.: “The Intenet Movie <strong>Data</strong>base”, Web site, URL: http://www.imdb.com/, last<br />
accessed on July 9 th , 2007.<br />
[5] Intenet Movie <strong>Data</strong>base, Inc.: “Alternate interfaces”, Web Page, URL: http://www.imdb.com/interfaces,<br />
last accessed on October 5 th , 2006.<br />
[6] Peralta, V.: “Matching <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> Movie Titles”. Technical Report, Laboratoire PRiSM,<br />
Université de Versailles, Versailles, France, March 2007.<br />
[7] Peralta, V.: “<strong>APMD</strong> Test Platform”. URL: http://apmd.prism.uvsq.fr/TestPlatform/, last accessed on May<br />
30 th , 2007.<br />
[8] Schmidt M.: “List <strong>of</strong> countries <strong>and</strong> territories”. Web page, URL:<br />
http://schmidt.devlib.org/data/countries.html, last accessed on April, 10 th , 2007.<br />
35
<strong>Extraction</strong> <strong>and</strong> <strong>Integration</strong> <strong>of</strong> <strong>MovieLens</strong> <strong>and</strong> <strong>IMDb</strong> <strong>Data</strong> – Technical Report<br />
[9] Schmidt M.: “Continents”. Web page, URL: http://schmidt.devlib.org/data/continents.html, last accessed on<br />
April, 10 th , 2007.<br />
[10] United Nations Cartographic Section: “List <strong>of</strong> Territories”. PDF file, 2004. URL:<br />
http://www.un.org/Depts/Cartographic/english/geoname.pdf, last accessed on April 10 th , 2007.<br />
[11] Wikipedia: “List <strong>of</strong> countries”. Web page, URL: http://en.wikipedia.org/wiki/List_<strong>of</strong>_countries, last<br />
accessed on April 10 th , 2007.<br />
36