Introducing leveldb-server


We just released leveldb-server. Start forking: leveldb-server on GitHub

leveldb-server

  • Async leveldb server and client, based on zeromq
  • Storage engine: leveldb
  • Networking library: zeromq
  • We use leveldb-server at Safebox

License

New BSD license. Please see license.txt for more details.

Features

  • Very simple key-value storage
  • Data is sorted by key – allows range queries
  • Data is automatically compressed
  • Can act as persistent cache
  • For our use at Safebox it replaced memcached+mysql
  • Simple backups: cp -rf level.db backup.db
  • Networking/wiring from zeromq messaging library – allows many topologies
  • Async server for scalability and capacity
  • Sync client for easy coding
  • Easy polyglot client bindings. See zmq bindings
>>> db.put("k3", "v3")
'True'
>>> db.get("k3")
'v3'
>>> db.range()
'[{"k1": "v1"}, {"k2": "v2"}, {"k3": "v3"}]'
>>> db.range("k1", "k2")
'[{"k1": "v1"}, {"k2": "v2"}]'
>>> db.delete('k1')
>>>
We will be adding high availability, replication, and autosharding using the same zeromq framework.

Dependencies

  • python 2.6+ (older versions need simplejson)
  • zmq
  • pyzmq
  • leveldb
  • pyleveldb

Getting Started

Instructions for an EC2 Ubuntu box.

Installing zeromq

wget http://download.zeromq.org/zeromq-2.1.10.tar.gz
tar xvfz zeromq-2.1.10.tar.gz
cd zeromq-2.1.10
./configure
make
sudo make install

Installing pyzmq

wget https://github.com/zeromq/pyzmq/downloads/pyzmq-2.1.10.tar.gz
tar xvfz pyzmq-2.1.10.tar.gz
cd pyzmq-2.1.10/
python setup.py configure --zmq=/usr/local/lib/
sudo python setup.py install
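
A quick sanity check that pyzmq can see the installed libzmq (the expected version string below is an assumption based on the 2.1.10 tarballs used in this walkthrough):

import zmq
print(zmq.zmq_version())   # should print 2.1.10 for the tarballs above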

Installing leveldb and pyleveldb

svn checkout http://py-leveldb.googlecode.com/svn/trunk/ py-leveldb-read-only
cd py-leveldb-read-only
sh compile_leveldb.sh
sudo python setup.py install
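
And a similar sanity check for the leveldb bindings (Put/Get are the py-leveldb calls; the database file name here is arbitrary):

import leveldb
db = leveldb.LevelDB('./test.db')   # creates ./test.db if it doesn't exist
db.Put('k', 'v')
print(db.Get('k'))                  # -> 'v'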

Starting the leveldb-server

> python leveldb-server.py -h
Usage: leveldb-server.py 
 -p [port and host settings] Default: tcp://127.0.0.1:5147
 -d [database file name] Default: level.db

leveldb-server

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -p HOST, --host=HOST  
  -d DBFILE, --dbfile=DBFILE
> python leveldb-server.py
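
Both options can be overridden; for example, to listen on all interfaces with a different database file (hypothetical endpoint and file name):

> python leveldb-server.py -p tcp://0.0.0.0:5147 -d mydata.db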

Using the leveldb-client-py

> cd clients/py/
> sudo python setup.py install
> python 
>>> from leveldbClient import database
>>> db = database.leveldb()
>>> db.get("Key")
>>> db.put("K", "V")
>>> db.range()
>>> db.range("start_key", "end_key")
>>> db.delete("K")
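
Putting it together, a minimal client session as a script might look like this (a sketch: it assumes the server is running and that the client connects to the default tcp://127.0.0.1:5147; the key names are made up):

# minimal sketch: talk to a local leveldb-server via the bundled py client
from leveldbClient import database

db = database.leveldb()              # assumes the default endpoint

db.put("user:1", "alice")            # store a couple of values
db.put("user:2", "bob")

print(db.get("user:1"))              # -> 'alice'
print(db.range("user:1", "user:2"))  # keys are sorted, so ranges work
print(db.range())                    # the full range

db.delete("user:2")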

Backups
cp -rpf level.db backup.db

Known issues and work in progress

Would love your pull requests on:
  • Benchmarking and performance analysis
  • client libraries for other languages
  • [issue] zeromq performance issues with 1M+ inserts at a time
  • [feature] timeouts in client library
  • [feature] support for counters
  • [feature] limit support in range queries
  • Serializing and separate threads for get/put/range in leveldb-server
  • HA/replication/autosharding and possibly pub-sub for replication

Thanks

Thanks to all the folks who have contributed to all the dependencies. Special thanks to the author of pyzmq/examples/mongo* for the inspiration.

Launching Safebox


The site just went live a few hours ago. Go ahead and get Safebox for your Dropbox.

-Srini

Benchmarking JavaScript Engines - V8, SFX-Nitro, Carakan, Tracemonkey


We were evaluating JavaScript engines to integrate into a prototype product. Chrome's V8, Safari's SquirrelFish Extreme (aka Nitro), and Firefox's TraceMonkey were the three options we had to consider (just because all three of them are open sourced). There are a lot of online resources with comparison studies, but all of them are either old or no longer relevant. So I thought I would share what we are seeing with respect to speed. There are other metrics that influence adoption, but this article is solely about speed. I have included Opera results for completeness. IE is not included because of the time constraints we had, and also because it is dog slow.

Setup:

  • Used Dromaeo for the comparison. It is super easy to run and compare.
  • Every benchmark was run multiple times and averaged - though the variation is very minimal.
  • During every run, the machine had only 3 apps open: one putty session, one process explorer, and the relevant browser. So machine resources were the same for all.
  • Ran all the benchmarks on a Windows box - Dell PowerEdge SC 420 - XP, Pentium 4 HT, 2 CPUs @ 2.8GHz, 2GB RAM.
  • Used a MacBook Pro 15" to make sure the trend is the same.
  • All the latest release versions: Chrome - 4.1.249.1045 (Build 42898), Safari - 4.0.3 (Build 531.9.1), Opera - 10.51 (Build 3315), Firefox - 3.6.3

Opera has a save-as-text-file feature. So after all the runs, the results were extracted by loading the HTML page in Opera » save as text file » then running the following one-liner to get every run into a ':'-separated text file that is easy to load into Excel.

more * | egrep ":|runs\/s|txt" | egrep -v "URL:|http:|Origin, Source, Tests:|run on:|::::::|rv:" | perl -ne 'chomp; s/runs/ :runs/g; s/txt/txt: /g; if(/runs/){print "$_ \n";}else{print}' > ../foo.txt

"more *" is used above assuming all the saved text files are in the present working directory.
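
In case perl isn't your thing, here is a rough Python sketch of the same extraction (assumptions: the saved Opera exports are the only files in the current directory, and the output goes to ../foo.txt as above):

import glob
import re

# lines worth keeping (benchmark names, "runs/s" results, file markers)...
keep = re.compile(r':|runs/s|txt')
# ...minus headers and other noise
skip = re.compile(r'URL:|http:|Origin, Source, Tests:|run on:|::::::|rv:')

out = open('../foo.txt', 'w')
for path in sorted(glob.glob('*')):
    for line in open(path):
        line = line.rstrip('\n')
        if not keep.search(line) or skip.search(line):
            continue
        # mirror the perl substitutions: s/runs/ :runs/g; s/txt/txt: /g
        line = line.replace('runs', ' :runs').replace('txt', 'txt: ')
        if 'runs' in line:
            out.write(line + ' \n')   # a result line ends the record
        else:
            out.write(line)           # otherwise keep accumulating
out.close()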

SunSpider:

A well-balanced benchmark suite covering various areas including math problems, string operations, crypto, raytracers, etc.
As seen above, Carakan is the clear winner, followed by V8, SFX, and TraceMonkey, on the SunSpider suite.

DOM Core:

These tests include setting and getting DOM attributes, DOM traversal, DOM element querying, etc.
SFX is the clear winner, followed by V8, TraceMonkey, and Carakan, on the DOM Core suite.

V8 Test:

These include tests like looping, function calls, object manipulation, etc.

As seen above, V8 is the clear winner, followed by Carakan, SFX, and TraceMonkey, on the V8 benchmark suite.

JS Lib:

From jQuery, the JS Library tests include lots of DOM modifications, events, prototypes, jQuery DOM traversal, styling, etc.

As seen above, V8 is the clear winner, followed by SFX, Carakan, and TraceMonkey, on the JS Lib benchmark suite.

I haven't included the Dromaeo and CSS Selector benchmarks. Dromaeo, because Carakan's results were something like 100x on the regexp cases, which I wanted to double-check before publishing; CSS Selector, because each run takes ~20 minutes, which is quite a bit considering multiple runs.

Conclusion:

If you follow the technical documentation on V8, SFX, Carakan, and TraceMonkey, you can clearly see similarities in the approach each takes to the problem. Some of the common techniques are virtual machines, JIT compilation, inline caches, and GC optimizations. Yet there are a few differences between the engines. V8 doesn't generate bytecode; instead it generates native code directly - similar to compilation for static languages - though it still follows virtual-machine techniques under the hood without having two separate steps. SFX generates bytecode and then native code for the platform it is running on. This approach is similar to LuaJIT, where JIT techniques are applied to the bytecode. Carakan seems to take the best of both - native code compilation for loops and a bytecode interpreter for the remaining parts of the code. TraceMonkey is slightly different from the others in that it is the only engine which records/traces as the interpreter runs - more details on this here. JägerMonkey is the next version (still in the works) of TraceMonkey, which seems to take the SFX assembler and combine it with the existing SpiderMonkey interpreter and tracer.

We have decided to start off with V8 and keep SFX/Nitro as the backup JavaScript engine. Feel free to drop in a comment/question.



US population distribution for marketing and planning purposes


We had to do a quick study on where to focus geographically/state-wise for one of our products - ideally it shouldn't matter where the user is, but we had a constraint to get the stats and focus on 10 states.

The obvious start was to look at population stats; more population translates to a bigger user/customer base, which translates to focus. For the data source, we used the US Census Bureau population projections for each state from year 2000 to year 2009 - available here.

As you can see in the picture below, we added two extra columns to the sorted list - percentage and cumulative percentage sum.
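
The two extra columns are simple to compute. Here is a sketch of the arithmetic (only three states shown, with illustrative figures; with the full 50-state list, total becomes the whole US population):

# percentage and cumulative-percentage columns over a sorted population list
states = [("California", 36961664), ("Texas", 24782302), ("New York", 19541453)]

total = float(sum(pop for _, pop in states))
cumulative = 0.0
for name, pop in sorted(states, key=lambda s: s[1], reverse=True):
    pct = 100.0 * pop / total
    cumulative += pct
    print("%-12s %6.2f%% %8.2f%%" % (name, pct, cumulative))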


To our surprise, the first 9 states constitute more than 50% of the whole US population. I had looked at the population spread before, but it is definitely a surprise to see 50% of the whole US population living in just 9 states.

Books on Start Up and entrepreneurship


Here is a write-up on a couple of books that I liked for reading and learning the internals of start-ups and how it all works. I would recommend both of them equally. Both have their own strong points on various topics. For example, 'High Tech Start Up' shines in the area of market analysis and product planning; 'Engineering Your Start-up' shines in the area of accounting and finance, with lots of templates and ready-made Excel sheets to get started with planning.
Again, just like with any good book and education, learn as much as possible and see what makes sense and is relevant for your situation.

        
 
'High Tech Start Up' has 14 chapters in total, spanning over 270 pages. The thing that most impressed me in this book is the market and product planning. According to the author, if we take the x-axis to be the product and the y-axis to be the market, the following quadrants can be made:
Quadrant 1: New Product and New Market - missionary sales and tech push. This is where consumers don't know they need the product and there is no existing product. Typically high risk and costly.
Quadrant 2: Existing Product and New Market - marketing driven; essentially you are selling a new use of an existing product.
Quadrant 3: Existing Product and Existing Market - you face a lot of competition. Can be used for income substitution.
Quadrant 4: New Product and Existing Market - tech push, market pull. Delivering value to an existing problem.
 
It is good if you are in Q4 - where you already have the market and you are innovating with your product. If you are in Q1, you would need to do lots of fund-raising.

I found chapter 5, which deals with business planning, interesting in that you come to realize how many details go into good planning.

'Engineering Your Start-up' has a total of 21 chapters spanning over 400 pages. Accounting, finances, and term sheets are dealt with really well in this book. The 'Rule of X Competitors' and the material on financial statements really impressed me.
Rule of X Competitors: take any market; at any point in time, there is room for only X viable competitors. As the definition implies, the smaller the X, the better. According to the author, if X is more than 7, you might want to reconsider the plan.

On the financial statements, it is tough to cover the details in a post like this; however, the following 3 statements are something you want to keep track of all the time:
1. Balance Sheet
2. Income Statement
3. Cash Flow Statement

Final Note: rules are there to break and laws are there to follow. So read and learn about how it is typically done, but adapt the rules to your situation.

Don't miss news and trends when you are on vacation


When you are on vacation, or don't have time to read all your favorite news sources on a given day, you might miss the news and trends that are happening and being updated every day.

There are days on which I just don't want to get online to read news and browse. At the same time, I don't want to miss the trends and highlights that are relevant on a given day. With a few lines of perl, wget, and a cronjob, the problem is solved. Hope this helps others too.

Some of this is very straightforward. However, I am including a detailed description so that everyone gets it without questions.

Installation steps:

  1. Download this file > vbr.tar.gz
  2. tar xvfz vbr.tar.gz
  3. cd vbr; ls - you will see four files:
     vbr.pl - worker perl script which does the wget fetching and saves the html files to the relevant directories
     vbr.cron - crontab input file for scheduling
     list - list of websites that you want to archive
     vbr.cmd - command file that ties the perl script and the input file together
  4. pwd - this is your present working directory, e.g. /home/srini/Projects/vbr
  5. Change the paths in "vbr.cmd" and "vbr.cron" - you can use your favorite text editor for this. Change /home/srini/Projects/vbr/ to your pwd from step 4.
  6. Add or modify the sites that you want to archive in "list". The first column is the web location; the second column is the directory name that you want the archive files saved to (see the sample below).
  7. View and modify the "vbr.cron" file to schedule events. Edit the mins/hours for when you want to take a snapshot of the web sites (see the sample below).
  8. crontab -l - if this returns nothing, jump to step 9. If you have other events already scheduled, do crontab -e instead, append the contents of "vbr.cron" to the list, and you are done.
  9. crontab vbr.cron - this will register the jobs with crontab.
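
For illustration, here is a hypothetical "list" file (first column: web location, second column: archive directory; the sites are just examples):

http://news.ycombinator.com hn
http://slashdot.org slashdot

And a hypothetical "vbr.cron" entry that takes a snapshot every day at 8am, using the pwd from step 4:

0 8 * * * /home/srini/Projects/vbr/vbr.cmd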

All your html pages will be saved under <your pwd from step 4>/pages/<sitename that you assigned>

Sample pages:

Getting Started


Getting Started.

Hello World