3be21fb12f
liblzma's API.
220 lines
9.7 KiB
Plaintext
220 lines
9.7 KiB
Plaintext
|
|
Using liblzma securely
|
|
----------------------
|
|
|
|
0. Introduction
|
|
|
|
This document discusses how to use liblzma securely. There are issues
|
|
that don't apply to zlib or libbzip2, so reading this document is
|
|
strongly recommended even for those who are very familiar with zlib
|
|
or libbzip2.
|
|
|
|
While making liblzma itself as secure as possible is essential, it's
|
|
out of scope of this document.
|
|
|
|
|
|
1. Memory usage
|
|
|
|
The memory usage of liblzma varies a lot.
|
|
|
|
|
|
1.1. Problem sources
|
|
|
|
1.1.1. Block coder
|
|
|
|
The memory requirements of Block encoder depend on the used filters
|
|
and their settings. The memory requirements of the Block decoder
|
|
depend on the which filters and with which filter settings the Block
|
|
was encoded. Usually the memory requirements of a decoder are equal
|
|
or less than the requirements of the encoder with the same settings.
|
|
|
|
While the typical memory requirements to decode a Block is from a few
|
|
hundred kilobytes to tens of megabytes, a maliciously constructed
|
|
files can require a lot more RAM to decode. With the current filters,
|
|
the maximum amount is about 7 GiB. If you use multi-threaded decoding,
|
|
every Block can require this amount of RAM, thus a four-threaded
|
|
decoder could suddenly try to allocate 28 GiB of RAM.
|
|
|
|
If you don't limit the maximum memory usage in any way, and there are
|
|
no resource limits set on the operating system side, one malicious
|
|
input file can run the system out of memory, or at least make it swap
|
|
badly for a long time. This is exceptionally bad on servers e.g.
|
|
email server doing virus scanning on incoming messages.
|
|
|
|
|
|
1.1.2. Metadata decoder
|
|
|
|
Multi-Block .lzma files contain at least one Metadata Block.
|
|
Externally the Metadata Blocks are similar to Data Blocks, so all
|
|
the issues mentioned about memory usage of Data Blocks applies to
|
|
Metadata Blocks too.
|
|
|
|
The uncompressed content of Metadata Blocks contain information about
|
|
the Stream as a whole, and optionally some Extra Records. The
|
|
information about the Stream is kept in liblzma's internal data
|
|
structures in RAM. Extra Records can contain arbitrary data. They are
|
|
not interpreted by liblzma, but liblzma will provide them to the
|
|
application in uninterpreted form if the application wishes so.
|
|
|
|
Usually the Uncompressed Size of a Metadata Block is small. Even on
|
|
extreme cases, it shouldn't be much bigger than a few megabytes. Once
|
|
the Metadata has been parsed into native data structures in liblzma,
|
|
it usually takes a little more memory than in the encoded form. For
|
|
all normal files, this is no problem, since the resulting memory usage
|
|
won't be too much.
|
|
|
|
The problem is that a maliciously constructed Metadata Block can
|
|
contain huge amount of "information", which liblzma will try to store
|
|
in its internal data structures. This may cause liblzma to allocate
|
|
all the available RAM unless some kind of resource usage limits are
|
|
applied.
|
|
|
|
Note that the Extra Records in Metadata are always parsed but, but
|
|
memory is allocated for them only if the application has requested
|
|
liblzma to provide the Extra Records to the application.
|
|
|
|
|
|
1.2. Solutions
|
|
|
|
If you need to decode files from untrusted sources (most people do),
|
|
you must limit the memory usage to avoid denial of service (DoS)
|
|
conditions caused by malicious input files.
|
|
|
|
The first step is to find out how much memory you are allowed consume
|
|
at maximum. This may be a hardcoded constant or derived from the
|
|
available RAM; whatever is appropriate in the application.
|
|
|
|
The simplest solution is to use setrlimit() if the kernel supports
|
|
RLIMIT_AS, which limits the memory usage of the whole process.
|
|
For more portable and fine-grained limitting, you can use
|
|
memory limiter functions found from <lzma/memlimit.h>.
|
|
|
|
|
|
1.2.1. Encoder
|
|
|
|
lzma_memory_usage() will give you a rough estimate about the memory
|
|
usage of the given filter chain. To dramatically simplify the internal
|
|
implementation, this function doesn't take into account all the small
|
|
helper data structures needed in various places; only the structures
|
|
with significant memory usage are taken into account. Still, the
|
|
accuracy of this function should be well within a mebibyte.
|
|
|
|
The Subblock filter is a special case. If a Subfilter has been
|
|
specified, it isn't taken into account when lzma_memory_usage()
|
|
calculates the memory usage. You need to calculate the memory usage
|
|
of the Subfilter separately.
|
|
|
|
Keeping track of Blocks in a Multi-Block Stream takes a few dozen
|
|
bytes of RAM per Block (size of the lzma_index structure plus overhead
|
|
of malloc()). It isn't a good idea to put tens of thousands of Blocks
|
|
into a Stream unless you have a very good reason to do so (compressed
|
|
dictionary could be an example of such situation).
|
|
|
|
Also keep the number and sizes of Extra Records sane. If you produce
|
|
the list of Extra Records automatically from some untrusted source,
|
|
you should not only validate the content of these Records, but also
|
|
their memory usage.
|
|
|
|
|
|
1.2.2. Decoder
|
|
|
|
A single-threaded decoder should simply use a memory limiter and
|
|
indicate an error if it runs out of memory.
|
|
|
|
Memory-limitting with multi-threaded decoding is tricky. The simple
|
|
solution is to divide the maximum allowed memory usage with the
|
|
maximum allowed threads, and give each Block decoder their own
|
|
independent lzma_memory_limiter. The drawback is that if one Block
|
|
needs notably more RAM than any other Block, the decoder will run out
|
|
of memory when in reality there would be plenty of free RAM.
|
|
|
|
An attractive alternative would be using shared lzma_memory_limiter.
|
|
Depending on the application and the expected type of input, this may
|
|
either be the best solution or a source of hard-to-repeat problems.
|
|
Consider the following requirements:
|
|
- You use at maximum of n threads.
|
|
- x(i) is the decoder memory requirements of the Block number i
|
|
in an expected input Stream.
|
|
- The memory limiter is set to higher value than the sum of n
|
|
highest values x(i).
|
|
|
|
(If you are better at explaining the above conditions, please
|
|
contribute your improved version.)
|
|
|
|
If the above conditions aren't met, it is possible that the decoding
|
|
will fail unpredictably. That is, on the same machine using the same
|
|
settings, the decoding may sometimes succeed and sometimes fail. This
|
|
is because sometimes threads may run so that the Blocks with highest
|
|
memory usage are tried to be decoded at the same time.
|
|
|
|
Most .lzma files have all the Blocks encoded with identical settings,
|
|
or at least the memory usage won't vary dramatically. That's why most
|
|
multi-threaded decoders probably want to use the simple "separate
|
|
lzma_memory_limiter for each thread" solution, possibly fallbacking
|
|
to single-threaded mode in case the per-thread memory limits aren't
|
|
enough in multi-threaded mode.
|
|
|
|
FIXME: Memory usage of Stream info.
|
|
|
|
[
|
|
|
|
]
|
|
|
|
|
|
2. Huge uncompressed output
|
|
|
|
2.1. Data Blocks
|
|
|
|
Decoding a tiny .lzma file can produce huge amount of uncompressed
|
|
output. There is an example file of 45 bytes, which decodes to 64 PiB
|
|
(that's 2^56 bytes). Uncompressing such a file to disk is likely to
|
|
fill even a bigger disk array. If the data is written to a pipe, it
|
|
may not fill the disk, but would still take very long time to finish.
|
|
|
|
To avoid denial of service conditions caused by huge amount of
|
|
uncompressed output, applications using liblzma should use some method
|
|
to limit the amount of output produced. The exact method depends on
|
|
the application.
|
|
|
|
All valid .lzma Streams make it possible to find out the uncompressed
|
|
size of the Stream without actually uncompressing the data. This
|
|
information is available in at least one of the Metadata Blocks.
|
|
Once the uncompressed size is parsed, the decoder can verify that
|
|
it doesn't exceed certain limits (e.g. available disk space).
|
|
|
|
When the uncompressed size is known, the decoder can actively keep
|
|
track of the amount of output produced so far, and that it doesn't
|
|
exceed the known uncompressed size. If it does exceed, the file is
|
|
known to be corrupt and an error should be indicated without
|
|
continuing to decode the rest of the file.
|
|
|
|
Unfortunately, finding the uncompressed size beforehand is often
|
|
possible only in non-streamed mode, because the needed information
|
|
could be in the Footer Metdata Block, which (obviously) is at the
|
|
end of the Stream. In purely streamed mode decoding, one may need to
|
|
use some rough arbitrary limits to prevent the problems described in
|
|
the beginning of this section.
|
|
|
|
|
|
2.2. Metadata
|
|
|
|
Metadata is stored in Metadata Blocks, which are very similar to
|
|
Data Blocks. Thus, the uncompressed size can be huge just like with
|
|
Data Blocks. The difference is, that the contents of Metadata Blocks
|
|
aren't given to the application as is, but parsed by liblzma. Still,
|
|
reading through a huge Metadata can take very long time, effectively
|
|
creating a denial of service like piping decoded a Data Block to
|
|
another process would do.
|
|
|
|
At first it would seem that using a memory limiter would prevent
|
|
this issue as a side effect. But it does so only if the application
|
|
requests liblzma to allocate the Extra Records and provide them to
|
|
the application. If Extra Records aren't requested, they aren't
|
|
allocated either. Still, the Extra Records are being read through
|
|
to validate that the Metadata is in proper format.
|
|
|
|
The solution is to limit the Uncompressed Size of a Metadata Block
|
|
to some relatively large value. This will make liblzma to give an
|
|
error when the given limit is reached.
|
|
|