added tetrahedral mesh test scene
expose b3Config as member variable for demos.
move a 'glFlush' out of the innerloop (render performance)
SSE -> SSE2 in premake
fix crash in broadphase (when no aabb's exist)
implement CPU version of narrowphase convex collision, for comparison/debug purposes
start towards cpu/gpu sync, for adding/removing bodies (work in progress)
added alternative batching kernel (slow)
tweaked controls a bit
added command-line options --selected_demo=<int> and --new_batching
started looking into parallel 3d sap