04.02.2013 Views

Usage Patterns for Apache HBase

Usage Patterns for Apache HBase

Usage Patterns for Apache HBase

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

From Shared-All to Shared-Nothing<br />

Successfully used <strong>Patterns</strong> in<br />

application and table design<br />

with Hbase<br />

Bob Schulze, eCircle AG<br />

March 2010 @ Berlin <strong>Apache</strong> Hadoop Get Together<br />

-<strong>Patterns</strong>


➲ You have Big Data<br />

Audience<br />

➲ Your Organization needs predictable scaling<br />

options<br />

➲ You need to be flexible with your Data<br />

➲ You are a Techie Person<br />

-<strong>Patterns</strong>


➲ What is shared?<br />

Content<br />

➲ Recap RDBMS vs <strong>HBase</strong>/BigTable<br />

➲ Example: Credid-Card Processing<br />

➲ Storage <strong>Patterns</strong><br />

➲ Application Design Proposal<br />

➲ Missing: Call <strong>for</strong> Features<br />

➲ HbaseExplorer<br />

-<strong>Patterns</strong>


Shared ..somewhat<br />

➲ Representation and<br />

Business Layers have<br />

well known scaling<br />

patterns<br />

➲ Many of these patterns<br />

rely on a transactional<br />

database underneath<br />

➲ Distributed transactions<br />

are expensive<br />

-<strong>Patterns</strong>


Shared ..somewhat<br />

➲ Representation and<br />

Business Layers have<br />

well known scaling<br />

patterns<br />

➲ Many of these patterns<br />

rely on a transactional<br />

database underneath<br />

➲ Distributed transactions<br />

are expensive<br />

-<strong>Patterns</strong>


Shared ..almost nothing<br />

➲ All Business Threads<br />

can run independently a<br />

long way<br />

➲ Contention only within<br />

one Shard → More<br />

Shards, Less Contention<br />

➲ Highly Available and<br />

Consistent<br />

➲ Give away<br />

perfection:Transactions<br />

must be hand-made<br />

-<strong>Patterns</strong>


Shared ..almost nothing<br />

➲ All Business Threads<br />

can run independently a<br />

long way<br />

➲ Contention only within<br />

one Shard → More<br />

Shards, Less Contention<br />

➲ Highly Available and<br />

Consistent<br />

➲ Give away<br />

perfection:Transactions<br />

must be hand-made<br />

-<strong>Patterns</strong>


➲ What is shared?<br />

Content<br />

➲ Recap RDBMS vs <strong>HBase</strong>/BigTable<br />

➲ Example: Credid-Card Processing<br />

➲ Storage <strong>Patterns</strong><br />

➲ Application Design Proposal<br />

➲ Missing: Call <strong>for</strong> Features<br />

➲ HbaseExplorer<br />

-<strong>Patterns</strong>


RDBMs Solutions<br />

Card Dealer Amount Location Currency<br />

Customer Card Name Address Email<br />

Card Model Valid From<br />

Customer Date Supporter Req-ID Status<br />

➲ Transactions<br />

➲ Access Rules<br />

➲ Types<br />

➲ FK's<br />

➲ Fixed<br />

structure<br />

➲ History?<br />

➲ Rows?<br />

-<strong>Patterns</strong>


Alternative: <strong>HBase</strong><br />

➲ Key/Value with structure in the „fat“ values<br />

● Arbitrary keys, but retain sortability<br />

➲ Data stored in Shards (regionserver) by<br />

splitting up the key range<br />

➲ Values are organized in Families<br />

➲ All Data has history<br />

-<strong>Patterns</strong>


Card Number<br />

1233.45.33-23<br />

Card Number<br />

1233.45.33-24<br />

Hbase: Key-Value<br />

-<strong>Patterns</strong>


Hbase: Regionserver hold Shards<br />

a...k l...m n...z<br />

-<strong>Patterns</strong>


Recap: Hbase/BigTable Data Model<br />

➲ family+Column=Column Qualifier<br />

-<strong>Patterns</strong>


➲ Stored in own Files<br />

● Important <strong>for</strong> retrieval<br />

➲ Have own Settings<br />

● Compression<br />

● Versions<br />

● Timed Deletions (TTL)<br />

● Size Constraints<br />

● Counters<br />

Column Families<br />

-<strong>Patterns</strong>


➲ What is shared?<br />

Content<br />

➲ Recap RDBMS vs <strong>HBase</strong>/BigTable<br />

➲ Example: Credid-Card Processing<br />

➲ Storage <strong>Patterns</strong><br />

➲ Application Design Proposal<br />

➲ Missing: Call <strong>for</strong> Features<br />

➲ HbaseExplorer<br />

-<strong>Patterns</strong>


Example: Credit Card Processing<br />

Card Dealer Amount Location Currency<br />

Customer Card Name Address Email<br />

Card Model Valid From<br />

Customer Date Supporter Req-Id Status<br />

Transaction<br />

Card Owner<br />

Card Properties<br />

Support Req's<br />

-<strong>Patterns</strong>


RowKey: Card Number<br />

Hbase main table Layout<br />

TS Card Transaction Owner Support Notes<br />

t5 registerCode=<br />

model=<br />

pinHash=<br />

ValidFrom=<br />

ValidTo=<br />

t4 =<br />

<br />

Location=<br />

A new card was<br />

issued to the<br />

customer<br />

pinAttempts= A Transaction was<br />

made<br />

t3 Reason=<br />

=<br />

<br />

Supporter=<br />

t2 City=<br />

Addressid=<br />

src=byTel<br />

operator=<br />

t1 Email=<br />

<br />

src=web<br />

Reason=<br />

=<br />

Solved<br />

Supporter=<br />

Support Request from<br />

Customer<br />

Address-Change by<br />

support<br />

Email-Change from<br />

WebSite<br />

-<strong>Patterns</strong>


Supplementary (index-) Tables<br />

➲ Email to Card Mapping<br />

RowKey: md5(email)<br />

Card<br />

<br />

<br />

Allows to do lookups by a given email<br />

Index can be maintained without transaction!<br />

➲ Address references<br />

RowKey: addressid Address<br />

Sample <strong>for</strong> a simple relation<br />

Can be further encrypted<br />

Name=<br />

City=<br />

Street=<br />

-<strong>Patterns</strong>


➲ What is shared?<br />

Content<br />

➲ Recap RDBMS vs <strong>HBase</strong>/BigTable<br />

➲ Example: Credid-Card Processing<br />

➲ Storage <strong>Patterns</strong><br />

➲ Application Design Proposal<br />

➲ Missing: Call <strong>for</strong> Features<br />

➲ HbaseExplorer<br />

-<strong>Patterns</strong>


<strong>Patterns</strong> to Store Data, why?<br />

➲ We can talk about it<br />

➲ Based on Space efficiency<br />

● Even if space is cheap, data has to be searched through and has<br />

to be moved<br />

➲ Based on Lookup efficiency<br />

● Most Data is stored sorted<br />

➲ Used <strong>for</strong> direct lookups as well as in MR<br />

aggregations<br />

-<strong>Patterns</strong>


Pattern: swim-above<br />

➲ Most recent value at top (in API)<br />

● Where was the last Transaction?<br />

get(Cardid: 123.22.34-24, Family: Transaction, Column: Location)<br />

TS Card Transaction Owner Support Notes<br />

t5 registerCode=123<br />

model=superFlash<br />

pinHash=aw3224hhds<br />

ValidFrom=se344qq1<br />

ValidTo=12esdrf43q.q<br />

t4 D123376=123E<br />

Location=Berlin<br />

t3 D2231=82.22E<br />

Location=Munich<br />

● What is the current model?<br />

A new card was<br />

issued to the<br />

customer<br />

pinAttempts=1 A Transaction was<br />

made<br />

pinAttempts=2 A Transaction was<br />

made<br />

get(Cardid: 123.22.34-24, Family: Card, Column: model)<br />

-<strong>Patterns</strong>


Pattern: swim-above<br />

➲ Most recent value at top (in API)<br />

● Where was the last Transaction?<br />

get(Cardid: 123.22.34-24, Family: Transaction, Column: Location)<br />

TS Card Transaction Owner Support Notes<br />

t5 registerCode=123<br />

model=superFlash<br />

pinHash=aw3224hhds<br />

ValidFrom=se344qq1<br />

ValidTo=12esdrf43q.q<br />

t4 D123376=123E<br />

Location=Berlin<br />

t3 D2231=82.22E<br />

Location=Munich<br />

● What is the current model?<br />

A new card was<br />

issued to the<br />

customer<br />

pinAttempts=1 A Transaction was<br />

made<br />

pinAttempts=2 A Transaction was<br />

made<br />

get(Cardid: 123.22.34-24, Family: Card, Column: model)<br />

-<strong>Patterns</strong>


Pattern: Data grouped by Timestamp<br />

● Who changed the Address to „Munich“ ?<br />

1. Figure out the Timestamp(s)<br />

get(Cardid: 123.22.34.24, City=Munich) → ts2<br />

2. get the fields <strong>for</strong> this TS<br />

get(Cardid: 123.22.34-24, timestamp: t2)<br />

TS Card Transaction Owner Support Notes<br />

t3 Reason=<br />

=<br />

<br />

Supporter=<br />

t2 City=Munich<br />

Addressid=<br />

src=byTel<br />

operator=<br />

t1 Email=<br />

<br />

src=web<br />

Reason=Call-In<br />

R213=Solved<br />

Supporter=PaulG<br />

Support Request<br />

from Customer<br />

Address-Change<br />

by support<br />

Email-Change<br />

from WebSite<br />

-<strong>Patterns</strong>


Pattern: Data grouped by Timestamp<br />

● Who changed the Address to „Munich“ ?<br />

1. Figure out the Timestamp(s)<br />

get(Cardid: 123.22.34.24, City=Munich) → ts2<br />

2. get the fields <strong>for</strong> this TS<br />

get(Cardid: 123.22.34-24, timestamp: t2)<br />

TS Card Transaction Owner Support Notes<br />

t3 Reason=<br />

=<br />

<br />

Supporter=<br />

t2 City=Munich<br />

Addressid=<br />

src=byTel<br />

operator=<br />

t1 Email=<br />

<br />

src=web<br />

Reason=Call-In<br />

R213=Solved<br />

Supporter=PaulG<br />

Support Request<br />

from Customer<br />

Address-Change<br />

by support<br />

Email-Change<br />

from WebSite<br />

-<strong>Patterns</strong>


Pattern: ColumnName-Is-Value<br />

➲ No value, Often useful <strong>for</strong> indexes<br />

RowKey: 123.23452-24<br />

RowKey: md5(email)<br />

Card<br />

123.23662-21<br />

123.23452-24<br />

TS Card Transaction Owner Support Notes<br />

t3 Reason=<br />

=<br />

<br />

Supporter=<br />

t2 City=Munich<br />

Addressid=<br />

src=byTel<br />

operator=<br />

Reason=Call-In<br />

R213=Solved<br />

Supporter=PaulG<br />

Support Request<br />

from Customer<br />

Address-Change by<br />

support<br />

-<strong>Patterns</strong>


Pattern: ColumnName-Is-Value<br />

➲ No value, Often useful <strong>for</strong> indexes<br />

RowKey: 123.23452-24<br />

RowKey: md5(email)<br />

TS Card Transaction Owner Support Notes<br />

t3 Reason=<br />

=<br />

<br />

Supporter=<br />

t2 City=Munich<br />

Addressid=<br />

src=byTel<br />

operator=<br />

Card<br />

123.23662-21<br />

123.23452-24<br />

Reason=Call-In<br />

R213=Solved<br />

Supporter=PaulG<br />

● Where did al.lias@gmx.com use his cards?<br />

index.get(9fc81d4292e6a404c2d64c9eaa66e43a) → Cardids<br />

cards.get(123.23452-24,...)<br />

Support Request<br />

from Customer<br />

Address-Change by<br />

support<br />

-<strong>Patterns</strong>


Pattern: Column-Enum<br />

● What is the status of Support-Request R213<br />

(status is one of: Opened, Reviewed, Assigned, Pending, Solved)<br />

get(Cardid: 123.22.34.24, Family: Support, Column: R213)<br />

TS Card Transaction Owner Support Notes<br />

t3 Reason=<br />

=<br />

<br />

Supporter=<br />

t2 City=Munich<br />

Addressid=<br />

src=byTel<br />

operator=<br />

t1 Email=<br />

<br />

src=web<br />

Reason=Call-In<br />

R213=Solved<br />

Supporter=PaulG<br />

Support Request<br />

from Customer<br />

Address-Change<br />

by support<br />

Email-Change<br />

from WebSite<br />

-<strong>Patterns</strong>


Pattern: Atomic Counters<br />

● Solves the common Problem when many clients try to update one/few<br />

rows in a RDBMS table<br />

● Use a separate table/family/column, use the key or family to partition the<br />

load<br />

● Example: Write a Record and add up some stats<br />

1. cards.insert(key: 123.22.34.24, Column: R213, Value=Solved,...)<br />

2. stats.increment(key: 123.22.34.24, Column: SREGSPERMONTH,+1)<br />

● Small overhead even on excessive use:<br />

:year-month-date=<br />

:year-month=<br />

● timestamps!<br />

-<strong>Patterns</strong>


➲ Constant Costs<br />

Pattern: index table<br />

● Always one more insert to (another sharded) index table<br />

➲ Lock Free<br />

● Versions=1<br />

➲ Only „eventually“ consistant<br />

● But on our control!<br />

-<strong>Patterns</strong>


➲ Swim-above<br />

Pattern Summary<br />

➲ Data grouped by timestamp<br />

➲ ColumnName-is-Value<br />

➲ Column-Enum<br />

➲ Atomic Counter<br />

➲ Index table<br />

-<strong>Patterns</strong>


➲ What is shared?<br />

Content<br />

➲ Recap RDBMS vs <strong>HBase</strong>/BigTable<br />

➲ Example: Credid-Card Processing<br />

➲ Storage <strong>Patterns</strong><br />

➲ Application Design Proposal<br />

➲ Missing: Call <strong>for</strong> Features<br />

➲ HbaseExplorer<br />

-<strong>Patterns</strong>


Application Design<br />

➲ Persistance Code<br />

moves out of App<br />

● Gets reusable!<br />

● Easy to test<br />

➲ Fix what's<br />

missing<br />

● Security / Access Control<br />

● Firewall<br />

● Index Handling<br />

● Transactions<br />

● ORM mappings<br />

-<strong>Patterns</strong>


➲ What is shared?<br />

Content<br />

➲ Recap RDBMS vs <strong>HBase</strong>/BigTable<br />

➲ Example: Credid-Card Processing<br />

➲ Storage <strong>Patterns</strong><br />

➲ Application Design Proposal<br />

➲ Missing: Call <strong>for</strong> Features<br />

➲ HbaseExplorer<br />

-<strong>Patterns</strong>


Hbase: missing pieces<br />

➲ Multi-Get/Scan would<br />

allow to read data<br />

from multiple Region<br />

Servers in parallel to<br />

one client<br />

➲ Same with Multi-Put<br />

➲ Patches avail., 0.21<br />

-<strong>Patterns</strong>


Hbase: missing pieces<br />

➲ Server Side Processing reduces data<br />

transfer and distributes computing<br />

● Scan transfers only matches<br />

scan rowkey=, dealer=<br />

● Allows aggregations on server side (1 st map already on region<br />

server)<br />

● Some server-side Scan-Filters help already today<br />

● Java Expression Language?<br />

-<strong>Patterns</strong>


Hbase: missing pieces<br />

➲ Bloomfilter<br />

● Reduce key-lookup time<br />

● Disappeared in 0.20.x, planned to be reanimated in 0.21<br />

➲ Hfile persistant internal value index<br />

● Value indexes / Value compression<br />

● Timestamp Index<br />

-<strong>Patterns</strong>


➲ What is shared?<br />

Content<br />

➲ Recap RDBMS vs <strong>HBase</strong>/BigTable<br />

➲ Example: Credid-Card Processing<br />

➲ Storage <strong>Patterns</strong><br />

➲ Application Design Proposal<br />

➲ Missing: Call <strong>for</strong> Features<br />

➲ HbaseExplorer<br />

-<strong>Patterns</strong>


Hbaseexplorer: scan<br />

-<strong>Patterns</strong>


Hbaseexplorer: cluster setup<br />

-<strong>Patterns</strong>


Hbaseexplorer; Table Definition<br />

-<strong>Patterns</strong>


Hbaseexplorer: statistics<br />

-<strong>Patterns</strong>


hbaseexplorer<br />

➲ Complements Ruby-Shell<br />

● Visual Data Representations<br />

● UI Tools <strong>for</strong> Table Creation<br />

● Embedded M/R Jobs <strong>for</strong> Table Copy or Statistics<br />

collection<br />

➲ Open Source @ SourceForge<br />

● Java, WebApp, Grails<br />

● Coders needed!<br />

➲ More info<br />

● http://althelies.wordpress.com/hbaseexplorer/<br />

-<strong>Patterns</strong>


eCircle AG<br />

➲ Biggest Direct Email-Marketing Company in<br />

Europe<br />

➲ 10 yrs, 200 Employees, now in 6 Countries<br />

➲ Lots of data<br />

● 100Mio permission Emails / Day<br />

● Individualized Emails stored, trackings, hosting<br />

● Privacy challanges → even more data<br />

● We went through the classic RDBMS scaling story<br />

➲ We hire!<br />

● Java, UI (JSP, Ajax)<br />

-<strong>Patterns</strong>


Thank you!<br />

b.schulze@ecircle.com<br />

al.lias@gmx.de<br />

-<strong>Patterns</strong>

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!