Java - HBase: how to choose pre-split strategies and how they affect your row keys
I am trying to pre-split an HBase table. One HBaseAdmin Java API creates an HBase table as a function of start key, end key, and number of regions. Here is the Java API I use from HBaseAdmin:
void createTable(HTableDescriptor desc, byte[] startKey, byte[] endKey, int numRegions)
Is there any recommendation on choosing the startKey and endKey based on the dataset?
My approach: let's say we have 100 records in the dataset. I want the data divided into approximately 10 regions, so that each has approximately 10 records. To find the startKey, I run scan '/mytable', {LIMIT => 10} and pick the last row key as the startKey. Then I run scan '/mytable', {LIMIT => 90} and pick the last row key as the endKey.
Does this approach to finding the startKey and endKey look OK, or is there a better practice?
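The sampling idea above can be generalised: instead of picking only a first and a last boundary, take evenly spaced keys from a sorted sample and use all of them as explicit split keys. The following is a minimal plain-Java sketch of that idea, not HBase API. The names `SampleSplits` and `splitKeysFromSample` are hypothetical, and it assumes the sample fits in memory and that the keys are ASCII, so that Java string order matches HBase's byte order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SampleSplits {
    /**
     * Returns up to numRegions - 1 boundary keys taken at evenly spaced
     * positions of the sorted sample. Each boundary is the first key of
     * the region that follows it (region start keys are inclusive).
     */
    static List<String> splitKeysFromSample(List<String> sample, int numRegions) {
        if (numRegions < 2 || sample.size() < numRegions) {
            throw new IllegalArgumentException("need at least numRegions sample keys");
        }
        List<String> sorted = new ArrayList<>(sample);
        Collections.sort(sorted); // assumption: ASCII keys, so this matches byte order

        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            int idx = (int) ((long) i * sorted.size() / numRegions);
            String key = sorted.get(idx);
            // Skip duplicates so that the split keys stay strictly increasing.
            if (splits.isEmpty() || !splits.get(splits.size() - 1).equals(key)) {
                splits.add(key);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Hypothetical sample of 100 composite keys, deptId 00000..00099.
        List<String> sample = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            sample.add(String.format("%05d|201501|805|1999", i));
        }
        System.out.println(splitKeysFromSample(sample, 10));
    }
}
```

The resulting keys, converted to bytes, could be passed to the createTable overload that accepts a byte[][] of split keys. This only balances regions as long as future data follows the same key distribution as the sample.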
Edit: I tried the following approaches to pre-split an empty table. None of the 3 worked the way I used them. I think I will need to salt the key to get an equal distribution.
PS: I am displaying the region info for each approach.
1)
byte[][] splits = new RegionSplitter.HexStringSplit().split(10);
hBaseAdmin.createTable(tableDescriptor, splits);
This gives region boundaries like:
{ "startkey":"-infinity", "endkey":"11111111", "numberofrows":3628951, },
{ "startkey":"11111111", "endkey":"22222222", },
{ "startkey":"22222222", "endkey":"33333333", },
{ "startkey":"33333333", "endkey":"44444444", },
{ "startkey":"88888888", "endkey":"99999999", },
{ "startkey":"99999999", "endkey":"aaaaaaaa", },
{ "startkey":"aaaaaaaa", "endkey":"bbbbbbbb", },
{ "startkey":"eeeeeeee", "endkey":"infinity", }
This is useless, because my row keys have the composite form 'deptId|month|roleId|regionId' and do not fit into the above boundaries.
2)
byte[][] splits = new RegionSplitter.UniformSplit().split(10);
hBaseAdmin.createTable(tableDescriptor, splits);
This has the same issue:
{ "startkey":"-infinity", "endkey":"\\x19\\x99\\x99\\x99\\x99\\x99\\x99\\x99", }
{ "startkey":"\\x19\\x99\\x99\\x99\\x99\\x99\\x99\\ "endkey":"33333332", }
{ "startkey":"33333332", "endkey":"l\\xcc\\xcc\\xcc\\xcc\\xcc\\xcc\\xcb", }
{ "startkey":"\\xe6ffffffa", "endkey":"infinity", }
3) I tried supplying a start key and an end key, and got the following useless regions.
hBaseAdmin.createTable(tableDescriptor, Bytes.toBytes("04120|200808|805|1999"), Bytes.toBytes("01253|201501|805|1999"), 10);
{ "startkey":"-infinity", "endkey":"04120|200808|805|1999", }
{ "startkey":"04120|200808|805|1999", "endkey":"000ptp\\xdc200w\\xd07\\x9c805|1999", }
{ "startkey":"000ptp\\xdc200w\\xd07\\x9c805|1999", "endkey":"000ptq<200wp6\\xbc805|1999", }
{ "startkey":"001\\x11\\x15\\x13\\x1c201\\x15\\x902\\x5c805|1999", "endkey":"01253|201501|805|1999", }
{ "startkey":"01253|201501|805|1999", "endkey":"infinity", }
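To see why the hex boundaries from approach 1 leave one region holding nearly all rows, it can help to check which region a given row key would land in. Below is a minimal plain-Java sketch; `RegionLookup` and `regionIndex` are hypothetical names, and the split list contains only the boundaries visible in the listing for approach 1, so it is illustrative rather than the table's full layout.

```java
import java.util.Arrays;
import java.util.List;

final class RegionLookup {
    /**
     * Returns the index of the region whose [startKey, endKey) range
     * contains rowKey, given the sorted split keys. Region 0 is the one
     * with no start key. Assumes ASCII keys, so that String order
     * matches byte order.
     */
    static int regionIndex(List<String> sortedSplitKeys, String rowKey) {
        int index = 0;
        for (String split : sortedSplitKeys) {
            if (rowKey.compareTo(split) >= 0) {
                index++;   // rowKey sorts at or after this boundary
            } else {
                break;     // all remaining boundaries are larger
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Only the boundaries shown in the region listing for approach 1.
        List<String> splits = Arrays.asList(
                "11111111", "22222222", "33333333", "44444444",
                "88888888", "99999999", "aaaaaaaa", "bbbbbbbb", "eeeeeeee");
        System.out.println(regionIndex(splits, "04120|200808|805|1999"));
        System.out.println(regionIndex(splits, "01253|201501|805|1999"));
    }
}
```

Both example keys start with '0', which sorts before "11111111", so they fall into the first region. Any data set whose keys all start with '0' would do the same.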
First question: From my experience with HBase, I am not aware of any hard rule for choosing the number of regions, the start key, and the end key.
But the underlying principle is that, with your row key design, data should be distributed across all regions and not hotspotted (see 36.1. Hotspotting).
However, if you define a fixed number of regions, 10 as you mentioned, there may no longer be 10 after a heavy data load. Once a region reaches the size limit it is split again, so the number of regions grows.
For your way of creating the table with HBaseAdmin, the documentation says: "Creates a new table with the specified number of regions. The start key specified will become the end key of the first region of the table, and the end key specified will become the start key of the last region of the table (the first region has a null start key and the last region has a null end key)."
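This also explains the binary-looking boundaries in approach 3: the keys between the start key and the end key are computed by treating the two byte arrays as numbers and interpolating evenly, not by picking real row keys. The sketch below imitates that in plain Java with BigInteger. It is a simplified stand-in for what HBase does internally, so the exact bytes may differ. `InterpolatedSplits` and its methods are hypothetical names, and it assumes the start key sorts before the end key.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class InterpolatedSplits {
    /** Returns the numRegions - 1 region boundaries, from startKey to endKey. */
    static List<byte[]> boundaries(byte[] startKey, byte[] endKey, int numRegions) {
        if (numRegions < 3) {
            throw new IllegalArgumentException("need at least 3 regions");
        }
        // Right-pad with zero bytes so both keys have the same length.
        int len = Math.max(startKey.length, endKey.length);
        BigInteger lo = new BigInteger(1, Arrays.copyOf(startKey, len));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(endKey, len));
        if (lo.compareTo(hi) >= 0) {
            throw new IllegalArgumentException("start key must sort before end key");
        }
        // The first and last regions are open-ended, so numRegions - 2
        // regions lie between startKey and endKey.
        int intervals = numRegions - 2;
        BigInteger range = hi.subtract(lo);

        List<byte[]> result = new ArrayList<>();
        result.add(startKey);
        for (int i = 1; i < intervals; i++) {
            BigInteger point = lo.add(
                    range.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(intervals)));
            result.add(toFixedLength(point, len));
        }
        result.add(endKey);
        return result;
    }

    /** Converts a non-negative value back to exactly len bytes, big-endian. */
    static byte[] toFixedLength(BigInteger value, int len) {
        byte[] raw = value.toByteArray(); // may have a leading sign byte or be shorter
        byte[] out = new byte[len];
        int copy = Math.min(raw.length, len);
        System.arraycopy(raw, raw.length - copy, out, len - copy, copy);
        return out;
    }

    /** Renders non-printable bytes as \xNN, similar to the region listings. */
    static String printable(byte[] key) {
        StringBuilder sb = new StringBuilder();
        for (byte b : key) {
            int v = b & 0xFF;
            if (v >= 32 && v < 127) sb.append((char) v);
            else sb.append(String.format("\\x%02X", v));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two keys from approach 3, passed in sorted order.
        byte[] start = "01253|201501|805|1999".getBytes(StandardCharsets.US_ASCII);
        byte[] end = "04120|200808|805|1999".getBytes(StandardCharsets.US_ASCII);
        for (byte[] b : boundaries(start, end, 10)) {
            System.out.println(printable(b));
        }
    }
}
```

Because the interpolation works on raw bytes, the intermediate boundaries ignore the '|' separators and the digit-only fields of the composite key, which is why they do not look like valid row keys.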
Moreover, I prefer creating the table through a script with pre-splits, say 0-10, and designing the row key so that it is salted and lands on one of the region servers, which avoids hotspotting.
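A salted row key along these lines could look roughly like the following plain-Java sketch. The bucket count of 10, the use of String.hashCode as the hash, the '|' separator after the salt, and the names `SaltedKeys`, `saltedKey`, and `splitKeys` are all assumptions made for illustration, not part of HBase.

```java
import java.util.ArrayList;
import java.util.List;

final class SaltedKeys {
    /** Assumed number of salt buckets; one pre-split region per bucket. */
    static final int BUCKETS = 10;

    /** Prefixes the row key with a single-digit bucket derived from its hash. */
    static String saltedKey(String rowKey) {
        // floorMod keeps the bucket non-negative even for negative hash codes.
        int bucket = Math.floorMod(rowKey.hashCode(), BUCKETS);
        return bucket + "|" + rowKey;
    }

    /** Split keys "1".."9": region 0 holds bucket 0, region 1 holds bucket 1, and so on. */
    static List<String> splitKeys() {
        List<String> splits = new ArrayList<>();
        for (int b = 1; b < BUCKETS; b++) {
            splits.add(String.valueOf(b));
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(saltedKey("04120|200808|805|1999"));
        System.out.println(saltedKey("01253|201501|805|1999"));
        System.out.println(splitKeys());
    }
}
```

The trade-off is on the read side: because the salt is derived from the whole key, a range scan over the original key order has to be issued once per bucket and the results merged.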
Edit: If you want to implement your own region split, you can provide your own implementation of org.apache.hadoop.hbase.util.RegionSplitter.SplitAlgorithm and override
public byte[][] split(int numberOfSplits)
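As an illustration of what such a split method could compute for the composite keys in the question, here is a standalone sketch of the logic only. It assumes every row key starts with a five-digit, zero-padded decimal deptId (as in 04120|200808|805|1999) and that deptIds are spread over 00000..99999. `DecimalPrefixSplit` is a hypothetical name. The class deliberately does not implement the SplitAlgorithm interface, which declares further methods that a real implementation would also have to provide.

```java
import java.nio.charset.StandardCharsets;

final class DecimalPrefixSplit {
    /** Assumed width of the zero-padded deptId prefix. */
    static final int PREFIX_DIGITS = 5;
    /** Number of distinct prefixes: 00000..99999. */
    static final int PREFIX_SPACE = 100_000;

    /** Returns numRegions - 1 evenly spaced decimal prefixes as split keys. */
    static byte[][] split(int numRegions) {
        if (numRegions < 2 || numRegions > PREFIX_SPACE) {
            throw new IllegalArgumentException("numRegions out of range: " + numRegions);
        }
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            int boundary = (int) ((long) i * PREFIX_SPACE / numRegions);
            String prefix = String.format("%0" + PREFIX_DIGITS + "d", boundary);
            splits[i - 1] = prefix.getBytes(StandardCharsets.US_ASCII);
        }
        return splits;
    }

    public static void main(String[] args) {
        for (byte[] key : split(10)) {
            System.out.println(new String(key, StandardCharsets.US_ASCII));
        }
    }
}
```

Evenly spaced prefixes only balance the regions if the deptIds themselves are evenly spread. If they cluster in a narrow range, the bounds would need to be narrowed to match, or the boundaries derived from a sample of real keys.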
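Second question: As I understand it, you want to find the start row key and end row key of the data inserted into a specific table. Below are the ways to do that.
If you want to find the start and end row keys of each region, run
scan '.META.'
(the table is named 'hbase:meta' in HBase 0.96 and later) to see what the start row key and end row key of each region are. You can also open the UI at http://hbasemaster:60010, where you can see how the row keys are spread across the regions. The start and end row keys are listed there for each region.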
To see how the keys are organized after pre-splitting the table and inserting data into HBase, use FirstKeyOnlyFilter.
For example: scan 'yourtablename', FILTER => 'FirstKeyOnlyFilter()'
This displays all 100 row keys.
If you have a huge amount of data (not the 100 rows mentioned) and want to take a dump of all row keys, you can run the following from outside the shell:
echo "scan 'yourtablename', FILTER => 'FirstKeyOnlyFilter()'" | hbase shell > rowkeys.txt