Currently, the output buffer is a std::vector<uint8_t>.
When the buffer grows, resizing will cause unnecessary memcpy().
This change uses a list of bytes object to represent output buffer, can avoid the extra overhead of resizing.
In addition, C++ code can be removed, it's a pure C extension.