Usage Patterns for Apache HBase
From Shared-All to Shared-Nothing<br />
Successfully used patterns in<br />
application and table design<br />
with HBase<br />
Bob Schulze, eCircle AG<br />
March 2010 @ Berlin <strong>Apache</strong> Hadoop Get Together<br />
-<strong>Patterns</strong>
Audience<br />
➲ You have Big Data<br />
➲ Your Organization needs predictable scaling options<br />
➲ You need to be flexible with your Data<br />
➲ You are a Techie Person<br />
-<strong>Patterns</strong>
Content<br />
➲ What is shared?<br />
➲ Recap RDBMS vs HBase/BigTable<br />
➲ Example: Credit-Card Processing<br />
➲ Storage Patterns<br />
➲ Application Design Proposal<br />
➲ Missing: Call for Features<br />
➲ HbaseExplorer<br />
-<strong>Patterns</strong>
Shared ..somewhat<br />
➲ Representation and<br />
Business Layers have<br />
well known scaling<br />
patterns<br />
➲ Many of these patterns<br />
rely on a transactional<br />
database underneath<br />
➲ Distributed transactions<br />
are expensive<br />
-<strong>Patterns</strong>
Shared ..almost nothing<br />
➲ All Business Threads<br />
can run independently a<br />
long way<br />
➲ Contention only within<br />
one Shard → More<br />
Shards, Less Contention<br />
➲ Highly Available and<br />
Consistent<br />
➲ Give up perfection: transactions<br />
must be hand-made<br />
-<strong>Patterns</strong>
RDBMS Solutions<br />
Four tables (columns per table):<br />
● Card, Dealer, Amount, Location, Currency<br />
● Customer, Card, Name, Address, Email<br />
● Card, Model, Valid From<br />
● Customer, Date, Supporter, Req-ID, Status<br />
➲ Transactions<br />
➲ Access Rules<br />
➲ Types<br />
➲ FK's<br />
➲ Fixed structure<br />
➲ History?<br />
➲ Rows?<br />
-<strong>Patterns</strong>
Alternative: HBase<br />
➲ Key/Value with structure in the „fat“ values<br />
● Arbitrary keys, but retain sortability<br />
➲ Data stored in Shards (held by region servers) by splitting up the key range<br />
➲ Values are organized in Families<br />
➲ All Data has history<br />
-<strong>Patterns</strong>
HBase: Key-Value<br />
Card Number 1233.45.33-23<br />
Card Number 1233.45.33-24<br />
-<strong>Patterns</strong>
HBase: Region servers hold shards<br />
Key ranges: a...k | l...m | n...z<br />
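The key-range picture above can be sketched in a few lines. This is a minimal illustration, not HBase's actual region lookup: the split points and the server names `rs-0` to `rs-2` are assumptions chosen to mirror the three ranges on the slide.

```python
from bisect import bisect_right

# Hypothetical split points: every shard holds one contiguous slice of the
# sorted key space. With these two boundaries there are three shards:
#   shard 0: keys < "l"      shard 1: "l" <= keys < "n"      shard 2: keys >= "n"
SPLIT_KEYS = ["l", "n"]
REGION_SERVERS = ["rs-0", "rs-1", "rs-2"]  # made-up server names


def region_for(row_key: str) -> str:
    """Return the (hypothetical) server responsible for row_key."""
    # bisect_right counts how many split points are <= row_key,
    # which is exactly the index of the shard that owns the key.
    return REGION_SERVERS[bisect_right(SPLIT_KEYS, row_key)]
```

Because keys stay sorted, adding a shard only means adding one more split point; no existing key has to be rehashed.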
-<strong>Patterns</strong>
Recap: HBase/BigTable Data Model<br />
➲ family+Column=Column Qualifier<br />
-<strong>Patterns</strong>
Column Families<br />
➲ Stored in own Files<br />
● Important for retrieval<br />
➲ Have own Settings<br />
● Compression<br />
● Versions<br />
● Timed Deletions (TTL)<br />
● Size Constraints<br />
● Counters<br />
-<strong>Patterns</strong>
Example: Credit Card Processing<br />
Transaction: Card, Dealer, Amount, Location, Currency<br />
Card Owner: Customer, Card, Name, Address, Email<br />
Card Properties: Card, Model, Valid From<br />
Support Req's: Customer, Date, Supporter, Req-Id, Status<br />
-<strong>Patterns</strong>
HBase main table layout<br />
RowKey: Card Number<br />
Families: Card, Transaction, Owner, Support (plus Notes)<br />
TS | Family: columns | Notes<br />
t5 | Card: registerCode=, model=, pinHash=, ValidFrom=, ValidTo= | A new card was issued to the customer<br />
t4 | Card: pinAttempts=; Transaction: =, Location= | A Transaction was made<br />
t3 | Support: Reason=, =Solved, Supporter= | Support Request from Customer<br />
t2 | Owner: City=, Addressid=, src=byTel, operator= | Address-Change by support<br />
t1 | Owner: Email=, src=web | Email-Change from WebSite<br />
-<strong>Patterns</strong>
Supplementary (index) tables<br />
➲ Email to Card Mapping<br />
RowKey: md5(email), Family: Card<br />
● Allows lookups by a given email<br />
● Index can be maintained without a transaction!<br />
➲ Address references<br />
RowKey: addressid, Family: Address (Name=, City=, Street=)<br />
● Sample for a simple relation<br />
● Can be further encrypted<br />
-<strong>Patterns</strong>
Patterns to Store Data, why?<br />
➲ We can talk about it<br />
➲ Based on Space efficiency<br />
● Even if space is cheap, data has to be searched through and has to be moved<br />
➲ Based on Lookup efficiency<br />
● Most Data is stored sorted<br />
➲ Used for direct lookups as well as in MR aggregations<br />
-<strong>Patterns</strong>
Pattern: swim-above<br />
➲ Most recent value at top (in API)<br />
● Where was the last Transaction?<br />
get(Cardid: 123.22.34-24, Family: Transaction, Column: Location)<br />
● What is the current model?<br />
get(Cardid: 123.22.34-24, Family: Card, Column: model)<br />
TS | Family: columns | Notes<br />
t5 | Card: registerCode=123, model=superFlash, pinHash=aw3224hhds, ValidFrom=se344qq1, ValidTo=12esdrf43q.q | A new card was issued to the customer<br />
t4 | Card: pinAttempts=1; Transaction: D123376=123E, Location=Berlin | A Transaction was made<br />
t3 | Card: pinAttempts=2; Transaction: D2231=82.22E, Location=Munich | A Transaction was made<br />
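The swim-above idea can be illustrated with a toy in-memory table. This is only a sketch of the behaviour, not the HBase client API: `VersionedTable` and its `put`/`get` methods are hypothetical, and the integer timestamps stand in for t3, t4 and t5 from the slide.

```python
from collections import defaultdict


class VersionedTable:
    """Toy table: row -> "family:column" -> list of (timestamp, value)."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, family, column, value, ts):
        cells = self._rows[row][f"{family}:{column}"]
        cells.append((ts, value))
        # Keep the newest version first, so the latest value "swims above".
        cells.sort(key=lambda cell: cell[0], reverse=True)

    def get(self, row, family, column, versions=1):
        """Return up to `versions` values, most recent first."""
        cells = self._rows[row][f"{family}:{column}"]
        return [value for _, value in cells[:versions]]


cards = VersionedTable()
cards.put("123.22.34-24", "Transaction", "Location", "Munich", ts=3)
cards.put("123.22.34-24", "Transaction", "Location", "Berlin", ts=4)
cards.put("123.22.34-24", "Card", "model", "superFlash", ts=5)

last_location = cards.get("123.22.34-24", "Transaction", "Location")
current_model = cards.get("123.22.34-24", "Card", "model")
```

Asking for `versions=2` on the same column would return the history as well, newest entry first, without any extra bookkeeping in the application.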
-<strong>Patterns</strong>
Pattern: Data grouped by Timestamp<br />
● Who changed the Address to „Munich“?<br />
1. Figure out the Timestamp(s)<br />
get(Cardid: 123.22.34-24, City=Munich) → t2<br />
2. Get the fields for this TS<br />
get(Cardid: 123.22.34-24, timestamp: t2)<br />
TS | Family: columns | Notes<br />
t3 | Support: Reason=Call-In, R213=Solved, Supporter=PaulG | Support Request from Customer<br />
t2 | Owner: City=Munich, Addressid=, src=byTel, operator= | Address-Change by support<br />
t1 | Owner: Email=, src=web | Email-Change from WebSite<br />
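The two steps can be sketched over a plain list of cells. This is an illustration under assumptions: the tuple layout and both helper functions are hypothetical, and the operator value `op-17` is invented because the slide leaves it blank.

```python
# Cells of one card row, written as (timestamp, family, column, value).
ROW = [
    (3, "Support", "Reason", "Call-In"),
    (3, "Support", "R213", "Solved"),
    (3, "Support", "Supporter", "PaulG"),
    (2, "Owner", "City", "Munich"),
    (2, "Owner", "src", "byTel"),
    (2, "Owner", "operator", "op-17"),  # hypothetical value, elided on the slide
    (1, "Owner", "src", "web"),
]


def timestamps_where(cells, family, column, value):
    """Step 1: all timestamps at which family:column was set to value."""
    return sorted({ts for ts, fam, col, val in cells
                   if (fam, col, val) == (family, column, value)})


def cells_at(cells, ts):
    """Step 2: every field written with exactly this timestamp."""
    return {f"{fam}:{col}": val for t, fam, col, val in cells if t == ts}


stamps = timestamps_where(ROW, "Owner", "City", "Munich")
change = cells_at(ROW, stamps[0])
```

Writing all fields of one business event with the same timestamp is what makes step 2 possible: the timestamp acts as a group id for the event.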
-<strong>Patterns</strong>
Pattern: ColumnName-Is-Value<br />
➲ No value; often useful for indexes<br />
● Where did al.lias@gmx.com use his cards?<br />
index.get(9fc81d4292e6a404c2d64c9eaa66e43a) → Cardids<br />
cards.get(123.23452-24,...)<br />
Index table, RowKey: md5(email), Family: Card, column names (no values): 123.23662-21, 123.23452-24<br />
Main table, RowKey: 123.23452-24<br />
TS | Family: columns | Notes<br />
t3 | Support: Reason=Call-In, R213=Solved, Supporter=PaulG | Support Request from Customer<br />
t2 | Owner: City=Munich, Addressid=, src=byTel, operator= | Address-Change by support<br />
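A small sketch of the two-step lookup, with plain dictionaries standing in for the two tables. Everything here is an assumption for illustration: the helper names, the lower-casing of the email before hashing, and the transaction locations attached to each card.

```python
import hashlib


def email_key(email: str) -> str:
    """Row key of the index table: md5 of the (normalised) email address."""
    return hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()


# Index table: row key -> {column name: empty value}.
# The card id *is* the column name; the cell value carries no information.
index = {}

# Main table, reduced to card id -> transaction locations (hypothetical data).
cards = {"123.23662-21": ["Berlin"], "123.23452-24": ["Munich"]}


def index_card(email: str, card_id: str) -> None:
    index.setdefault(email_key(email), {})[card_id] = b""


def locations_for(email: str) -> dict:
    """Step 1: read the column names from the index. Step 2: fetch each card."""
    card_ids = sorted(index.get(email_key(email), {}))
    return {card_id: cards.get(card_id, []) for card_id in card_ids}


index_card("al.lias@gmx.com", "123.23662-21")
index_card("al.lias@gmx.com", "123.23452-24")
```

Since column names within a row are unique and sorted, adding the same card twice is harmless and the card list comes back ordered for free.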
-<strong>Patterns</strong>
Pattern: Column-Enum<br />
● What is the status of Support-Request R213?<br />
(status is one of: Opened, Reviewed, Assigned, Pending, Solved)<br />
get(Cardid: 123.22.34-24, Family: Support, Column: R213)<br />
TS | Family: columns | Notes<br />
t3 | Support: Reason=Call-In, R213=Solved, Supporter=PaulG | Support Request from Customer<br />
t2 | Owner: City=Munich, Addressid=, src=byTel, operator= | Address-Change by support<br />
t1 | Owner: Email=, src=web | Email-Change from WebSite<br />
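As a sketch of Column-Enum: the request id is the column name and the cell value is restricted to the listed states. The `Status` enum and the two helpers are hypothetical; HBase itself stores plain bytes, so a check like this would live in the application or persistence layer.

```python
from enum import Enum


class Status(Enum):
    OPENED = "Opened"
    REVIEWED = "Reviewed"
    ASSIGNED = "Assigned"
    PENDING = "Pending"
    SOLVED = "Solved"


# Support family of one card row: column name (request id) ->
# list of (timestamp, Status), newest first.
support = {}


def set_status(request_id: str, status: str, ts: int) -> None:
    checked = Status(status)  # raises ValueError for anything outside the enum
    versions = support.setdefault(request_id, [])
    versions.append((ts, checked))
    versions.sort(key=lambda version: version[0], reverse=True)


def current_status(request_id: str) -> str:
    """Latest version of the column holds the current state."""
    return support[request_id][0][1].value


set_status("R213", "Opened", ts=1)
set_status("R213", "Assigned", ts=2)
set_status("R213", "Solved", ts=3)
```

The older versions of the same column double as a status history, so no separate audit table is needed for that.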
-<strong>Patterns</strong>
Pattern: Atomic Counters<br />
● Solves the common problem of many clients trying to update one or a few rows in an RDBMS table<br />
● Use a separate table/family/column; use the key or family to partition the load<br />
● Example: Write a Record and add up some stats<br />
1. cards.insert(key: 123.22.34-24, Column: R213, Value=Solved,...)<br />
2. stats.increment(key: 123.22.34-24, Column: SREGSPERMONTH,+1)<br />
● Small overhead even on excessive use:<br />
:year-month-date=<br />
:year-month=<br />
● timestamps!<br />
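A minimal in-memory sketch of the counter pattern. It assumes a hypothetical `CounterTable` whose `increment` is atomic (a lock plays the role of the server-side atomic increment), and the `sreqs:` column prefix is made up to fill in for the per-month and per-day columns on the slide.

```python
import threading
from collections import defaultdict
from datetime import date


class CounterTable:
    """Toy stats table with an atomic increment per (row, column) cell."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cells = defaultdict(int)

    def increment(self, row, column, amount=1):
        with self._lock:  # read-modify-write happens as one step
            self._cells[(row, column)] += amount
            return self._cells[(row, column)]

    def get(self, row, column):
        with self._lock:
            return self._cells[(row, column)]


stats = CounterTable()


def record_support_request(card_id: str, day: date) -> None:
    """Bump one counter per month and one per day for this card."""
    stats.increment(card_id, f"sreqs:{day:%Y-%m}")
    stats.increment(card_id, f"sreqs:{day:%Y-%m-%d}")
```

Because each card has its own counter row, concurrent writers for different cards never touch the same cell, which is how the key partitions the load.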
-<strong>Patterns</strong>
Pattern: index table<br />
➲ Constant Costs<br />
● Always one more insert to (another sharded) index table<br />
➲ Lock Free<br />
● Versions=1<br />
➲ Only „eventually“ consistent<br />
● But under our control!<br />
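One way to read "eventually consistent, but under our control" is sketched below. This is an assumption about how such an index could be maintained, not the speaker's implementation: the write order, the `set_email` and `cards_for` helpers, and the read-time check against the main table are all hypothetical.

```python
cards = {}  # main table: card id -> {"Owner:Email": email}
index = {}  # index table: email -> set of card ids


def set_email(card_id: str, email: str):
    """Write the main row, then do the one extra insert into the index."""
    old = cards.get(card_id, {}).get("Owner:Email")
    cards.setdefault(card_id, {})["Owner:Email"] = email  # 1. main table
    index.setdefault(email, set()).add(card_id)           # 2. index table
    # No lock and no transaction: the old index entry is simply left behind
    # and can be cleaned up later, for example by a background job.
    return old


def cards_for(email: str) -> list:
    """Look up by email and drop index hits the main table no longer confirms."""
    hits = index.get(email, set())
    return sorted(c for c in hits if cards.get(c, {}).get("Owner:Email") == email)
```

The cost per write stays constant (one more insert), and stale index entries are harmless because every hit is verified against the main row before it is returned.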
-<strong>Patterns</strong>
Pattern Summary<br />
➲ Swim-above<br />
➲ Data grouped by timestamp<br />
➲ ColumnName-is-Value<br />
➲ Column-Enum<br />
➲ Atomic Counter<br />
➲ Index table<br />
-<strong>Patterns</strong>
Application Design<br />
➲ Persistence Code moves out of App<br />
● Gets reusable!<br />
● Easy to test<br />
➲ Fix what's missing<br />
● Security / Access Control<br />
● Firewall<br />
● Index Handling<br />
● Transactions<br />
● ORM mappings<br />
-<strong>Patterns</strong>
HBase: missing pieces<br />
➲ Multi-Get/Scan would allow one client to read data from multiple Region Servers in parallel<br />
➲ Same with Multi-Put<br />
➲ Patches avail., 0.21<br />
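What a client-side multi-get could look like is sketched here with a thread pool. It is an illustration only: the `REGIONS` dictionary fakes three region servers, and `locate`, `fetch_batch` and `multi_get` are hypothetical names, not HBase API calls.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical region servers, each holding part of the key space.
REGIONS = {
    "rs-0": {"a1": "v-a1", "k9": "v-k9"},
    "rs-1": {"l3": "v-l3"},
    "rs-2": {"n7": "v-n7", "z2": "v-z2"},
}


def locate(key):
    """Find the server holding key (a linear search stands in for a real lookup)."""
    for server, rows in REGIONS.items():
        if key in rows:
            return server
    return None


def fetch_batch(server, keys):
    """One round trip: fetch all requested keys from a single server."""
    return {key: REGIONS[server][key] for key in keys}


def multi_get(keys):
    # Group the keys by server, then query all servers in parallel.
    by_server = {}
    for key in keys:
        server = locate(key)
        if server is not None:
            by_server.setdefault(server, []).append(key)
    result = {}
    with ThreadPoolExecutor(max_workers=max(1, len(by_server))) as pool:
        for part in pool.map(lambda item: fetch_batch(*item), by_server.items()):
            result.update(part)
    return result
```

The client pays roughly one round trip (the slowest server) instead of one per key.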
-<strong>Patterns</strong>
HBase: missing pieces<br />
➲ Server Side Processing reduces data transfer and distributes computing<br />
● Scan transfers only matches<br />
scan rowkey=, dealer=<br />
● Allows aggregations on server side (1st map already on region server)<br />
● Some server-side Scan-Filters help already today<br />
● Java Expression Language?<br />
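The effect of pushing a filter or an aggregation to the server can be shown with a simulated scan. This is a sketch under assumptions: `ROWS` fakes the data held by one region server, the third row is invented, and `scan` and `server_side_sum` are hypothetical functions.

```python
# Transactions stored on one (simulated) region server.
ROWS = [
    {"rowkey": "123.22.34-24", "dealer": "D123376", "amount": 123.0},
    {"rowkey": "123.22.34-24", "dealer": "D2231", "amount": 82.22},
    {"rowkey": "123.23452-24", "dealer": "D2231", "amount": 10.0},  # hypothetical row
]


def scan(predicate=None):
    """Return (rows sent to the client, rows read on the server)."""
    sent = [row for row in ROWS if predicate is None or predicate(row)]
    return sent, len(ROWS)


def server_side_sum(predicate):
    """Aggregate on the server: only a single number crosses the network."""
    return sum(row["amount"] for row in ROWS if predicate(row))


def by_dealer(dealer):
    return lambda row: row["dealer"] == dealer


unfiltered, _ = scan()                       # client would filter all rows itself
filtered, read = scan(by_dealer("D2231"))    # server sends matches only
total = server_side_sum(by_dealer("D2231"))  # server sends one value
```

The server reads the same rows in every case; what changes is how much data is shipped to the client.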
-<strong>Patterns</strong>
HBase: missing pieces<br />
➲ Bloomfilter<br />
● Reduce key-lookup time<br />
● Disappeared in 0.20.x, planned to be reanimated in 0.21<br />
➲ HFile persistent internal value index<br />
● Value indexes / Value compression<br />
● Timestamp Index<br />
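Why a Bloom filter reduces key-lookup time can be seen in a tiny generic implementation. This is a textbook sketch, not HBase's own filter: the size, the number of hash functions and the use of SHA-256 are arbitrary choices for illustration.

```python
import hashlib


class BloomFilter:
    """Answers "definitely not present" or "maybe present" for a key."""

    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive `hashes` bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means the key was never added, so the file can be skipped.
        # True may be a false positive and still requires a real lookup.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))
```

A store file would keep one such filter for its row keys; a lookup only opens the file when `might_contain` returns `True`.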
-<strong>Patterns</strong>
Hbaseexplorer: scan<br />
-<strong>Patterns</strong>
Hbaseexplorer: cluster setup<br />
-<strong>Patterns</strong>
Hbaseexplorer: Table Definition<br />
-<strong>Patterns</strong>
Hbaseexplorer: statistics<br />
-<strong>Patterns</strong>
hbaseexplorer<br />
➲ Complements Ruby-Shell<br />
● Visual Data Representations<br />
● UI Tools for Table Creation<br />
● Embedded M/R Jobs for Table Copy or Statistics collection<br />
➲ Open Source @ SourceForge<br />
● Java, WebApp, Grails<br />
● Coders needed!<br />
➲ More info<br />
● http://althelies.wordpress.com/hbaseexplorer/<br />
-<strong>Patterns</strong>
eCircle AG<br />
➲ Biggest Direct Email-Marketing Company in Europe<br />
➲ 10 yrs, 200 Employees, now in 6 Countries<br />
➲ Lots of data<br />
● 100 million permission emails / day<br />
● Individualized Emails stored, trackings, hosting<br />
● Privacy challenges → even more data<br />
● We went through the classic RDBMS scaling story<br />
➲ We hire!<br />
● Java, UI (JSP, Ajax)<br />
-<strong>Patterns</strong>
Thank you!<br />
b.schulze@ecircle.com<br />
al.lias@gmx.de<br />
-<strong>Patterns</strong>