07.03.2014 Views

Star Schema מחסני נתונים Star Schema – Example 1 Star Schema

Star Schema מחסני נתונים Star Schema – Example 1 Star Schema

Star Schema מחסני נתונים Star Schema – Example 1 Star Schema

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

<strong>Star</strong> <strong>Schema</strong><br />

• In a star schema, each dimension table has a<br />

single-part primary key that links to one<br />

part of the multipart primary key in the fact<br />

table.<br />

מחסני נתונים<br />

תכנון לוגי של מסד נתונים רב מימדי<br />

באמצעות סכימה טבלאית<br />

4<br />

<strong>Star</strong> <strong>Schema</strong> – <strong>Example</strong> 1<br />

Time Dimensions<br />

Time_key<br />

Day of week<br />

Day_number_of<br />

month<br />

Week_number_in_yea<br />

r<br />

Month<br />

Quarter<br />

Year<br />

Holliday_flag<br />

Weekday_flag<br />

Sales<br />

Time_key<br />

Product_key<br />

Store_key<br />

Dollars_sold<br />

Units_sold<br />

Product Dimension<br />

Product_key<br />

Description<br />

Brand<br />

Category<br />

Store Dimension<br />

Store_key<br />

Store_name<br />

Address<br />

Floor_plan_type<br />

Mainly<br />

descriptiv<br />

e textual<br />

Dimension 1 Fact Table<br />

Dimension 2<br />

3<br />

d1_key1<br />

Att1<br />

Att2<br />

d2_key1<br />

fact1<br />

Dimension 3<br />

fact2<br />

Dimension 4<br />

d3_key1<br />

<strong>Star</strong> <strong>Schema</strong><br />

d1_key1<br />

d2_key2<br />

d3_key1<br />

d4_key1<br />

Mainly numeric and<br />

additive<br />

d4_key1<br />

ז'/תמוז/תש"ע<br />

ז'/תמוז/תש"ע<br />

1


<strong>Star</strong> <strong>Schema</strong> – <strong>Example</strong> 3<br />

<strong>Star</strong> <strong>Schema</strong> – <strong>Example</strong> 2<br />

Reminder: Normal Forms<br />

Seeks to eliminate data redundancy: transaction that<br />

changes any data only need to touch the database in<br />

one place (optimized for updates)<br />

The Standard Template Query<br />

Select p.brand, sum(f.dollars),sum(f.units)<br />

From sales f, product p, time t<br />

Where f.product_key=p.product_key<br />

And f.time_key = t.time_key<br />

And t.quarter=“1 Q 1995”<br />

Group by p.brand<br />

2


On the other hand …<br />

1. Complexity of query specification is high.<br />

Without normalization it will be much clearer to<br />

user. (Simple queries structures)<br />

2. Poor access efficiency – Normalized design is<br />

the worst, by far, for most query access. A<br />

normalized design is optimized for key- based,<br />

record-at-a-time inquiry or table-level query that<br />

efficiently uses the provided indexes.<br />

Resisting Normalization<br />

1. Eliminate redundancy? – Generally eliminating duplicate rows<br />

is good. However eliminating "redundant" attributes in a star<br />

schema dimension table will actually destroy its high- access<br />

efficiency. Time saving (browsing performance) is much more<br />

critical in data warehouse.<br />

2. Save space? – This corollary to eliminating redundancy is a<br />

holdover from another era. The relative impact of storage on cost<br />

is way down. The loss of access efficiency has far greater cost<br />

impact. Furthermore The Fact table in a dimensional schema is<br />

naturally highly normalized. Disk space saving due to<br />

normalization is typically less than 1%.<br />

3. Support efficient update? Does not apply at all - Data<br />

Warehouse is Nonvolatile: no updates of data (only data<br />

loading). The load methods for relational tables in a star schema<br />

design can actually be more efficient than a load of normalized<br />

transaction and snow- flaked reference data.<br />

Division<br />

Division_id<br />

Division_desc<br />

ER - BCNF<br />

Region<br />

Region_id<br />

Region_desc<br />

Why Normalization of Dimension does<br />

not save space?<br />

– A typical <strong>Example</strong><br />

• Fact Table data size:<br />

• Fact Table index size:<br />

• Largest dim’ table size:<br />

• Savings by normalization:<br />

• Total size before:<br />

• Total size after:<br />

30GB<br />

20GB<br />

0.1GB<br />

0.05GB<br />

51GB<br />

50.5GB.<br />

Dept<br />

Dept_id<br />

Dept_desc<br />

Division_id<br />

Sales Facts<br />

Dept_id<br />

Market_id<br />

Week_id<br />

Sales<br />

Market<br />

Market_id<br />

Market_desc<br />

Region_id<br />

3


Snowflake <strong>Schema</strong><br />

Dimensional (Denormalization)<br />

• In a snowflake schema, one or more dimension<br />

tables are decomposed into multiple tables with<br />

the subordinate dimension tables joined to a<br />

primary dimension table instead of to the fact<br />

table.<br />

• i.e.:A refinement of star schema where some<br />

dimensional hierarchy is normalized into a set of<br />

smaller dimension tables, forming a shape similar<br />

to snowflake<br />

Dept. Lookup<br />

Dept_id<br />

Dept_desc<br />

Division_desc<br />

Sales Facts<br />

Dept_id<br />

Market_id<br />

Week_id<br />

Sales<br />

Market Lookup<br />

Market_id<br />

Market_desc<br />

Region_desc<br />

Snowflake <strong>Schema</strong><br />

Sales<br />

Snowflake <strong>Schema</strong><br />

Large Hierarchy<br />

Customer<br />

Time_key<br />

Customer_Key<br />

15<br />

amount<br />

Customer_Key<br />

Demo_Key<br />

Name<br />

…<br />

Demographic<br />

Demo_Key<br />

Income_Level<br />

Age_Level<br />

Sex<br />

ז'/תמוז/תש"ע<br />

4


Sales<br />

Time_key<br />

Customer_Key<br />

Demo_Key<br />

18<br />

amount<br />

Mini-Dimension<br />

Customer<br />

Customer_Key<br />

Name<br />

…<br />

Demographic<br />

Demo_Key<br />

Income_Level<br />

Age_Level<br />

Sex<br />

ז'/תמוז/תש"ע<br />

<strong>Star</strong> schemas or Snowflake schemas?<br />

• Both star and snowflake schemas can represents the<br />

same dimensional models; the difference is in their<br />

RDBMS implementations.<br />

• Snowflake schemas support ease of dimension<br />

maintenance because they are more normalized.<br />

• <strong>Star</strong> schemas are easier for direct user access and<br />

often support simpler and more efficient queries.<br />

• The decision to model a dimension as a star or<br />

snowflake depends on the nature of the dimension<br />

itself, such as how frequently it changes and which<br />

of its elements change, and often involves<br />

evaluating tradeoffs between ease of use and ease<br />

of maintenance.<br />

• In most designs, star schemas are preferable to<br />

snowflake schemas because they involve fewer joins<br />

for information retrieval.<br />

• Surrogate keys<br />

– A surrogate key is the primary key for a dimension table and is<br />

independent of any keys provided by source data systems.<br />

– Surrogate keys are created and maintained in the data warehouse and<br />

should not encode any information about the contents of records;<br />

– automatically increasing integers make good surrogate keys.<br />

– The original key for each record may be carried in the dimension<br />

table but is not used as the primary key.<br />

– Benefits:<br />

• a layer of isolation between DW and the source system;<br />

• Simple: numeric keys<br />

• Can handle ambiguous ID’s.<br />

– Drawback: increased ETL processing<br />

Dimensions Keys<br />

• Using Original Operational keys<br />

– Benefit: reduced transformation effort<br />

– Drawbacks:<br />

• Compound and textual keys;<br />

• Dependency on the source systems (OLTP); for instance what<br />

happen if the operational system create new key when customer<br />

change address, while we don’t want to create a “new” customer.<br />

• Ambiguous ID’s coming from different sources;<br />

– Multiple application systems<br />

– World wide companies with many branches: each branch uses its<br />

own customer’s counting.<br />

– companies that have done mergers or acquisitions.<br />

5


Time/Date Dimension<br />

• For hourly time granularity, the hour<br />

breakdown can be incorporated into the date<br />

dimension or placed in a separate dimension.<br />

• Business needs influence this design decision.<br />

• If the main use is to extract contiguous<br />

chunks of time that cross day boundaries (for<br />

example 11/24/2000 10 p.m. to 11/25/2000 6<br />

a.m.), then it is easier if the hour and day are<br />

in the same dimension.<br />

• However, it is easier to analyze cyclical and<br />

recurring daily events if they are in separate<br />

dimensions.<br />

• Unless there is a clear reason to combine date<br />

and hour in a single dimension, it is generally<br />

better to keep them in separate dimensions!<br />

Time/Date Dimension<br />

• A date dimension with one record per day will suffice if users do<br />

not need time granularity finer than a single day. A date by day<br />

dimension table will contain 365 records per year (366 in leap<br />

years).<br />

• A separate time dimension table should be constructed if a fine<br />

time granularity, such as minute or second, is needed. A time<br />

dimension table of one-minute granularity will contain 1,440 rows<br />

for a day, and a table of seconds will contain 86,400 rows for a<br />

day. If exact event time is needed, it should be stored in the fact<br />

table.<br />

• When a separate time dimension is used, the fact table contains<br />

one foreign key for the date dimension and another for the time<br />

dimension. Separate date and time dimensions simplify many<br />

filtering operations. For example, summarizing data for a range of<br />

days requires joining only the date dimension table to the fact<br />

table. Analyzing cyclical data by time period within a day requires<br />

joining just the time dimension table. The date and time dimension<br />

tables can both be joined to the fact table when a specific time<br />

range is needed.<br />

6

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!